foverlaps.Rd
A fast binarysearch based overlap join of two data.table
s.
This is very much inspired by findOverlaps
function from the Bioconductor
package IRanges
(see link below under See Also
).
Usually, x
is a very large data.table with small interval ranges, and
y
is much smaller keyed data.table
with relatively larger
interval spans. For a usage in genomics
, see the examples section.
NOTE: This is still under development, meaning it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.y = key(y), maxgap = 0L, minoverlap = 1L, type = c("any", "within", "start", "end", "equal"), mult = c("all", "first", "last"), nomatch = NA, which = FALSE, verbose = getOption("datatable.verbose"))
x, y 


by.x, by.y  A vector of column names (or numbers) to compute the overlap
joins. The last two columns in both 
maxgap  It should be a nonnegative integer value, >= 0. Default is 0 (no
gap). For intervals 
minoverlap  It should be a positive integer value, > 0. Default is 1. For
intervals 
type  Default value is The types shown here are identical in functionality to the function
NB: 
mult  When multiple rows in 
nomatch  When a row (with interval say, 
which  When 
verbose 

Very briefly, foverlaps()
collapses the twocolumn interval in y
to onecolumn of unique values to generate a lookup
table, and
then performs the join depending on the type of overlap
, using the
already available binary search
feature of data.table
. The time
(and space) required to generate the lookup
is therefore proportional
to the number of unique values present in the interval columns of y
when combined together.
Overlap joins takes advantage of the fact that y
is sorted to speedup
finding overlaps. Therefore y
has to be keyed (see ?setkey
)
prior to running foverlaps()
. A key on x
is not necessary,
although it might speed things further. The columns in by.x
argument should correspond to the columns specified in by.y
. The last
two columns should be the interval columns in both by.x
and
by.y
. The first interval column in by.x
should always be <= the
second interval column in by.x
, and likewise for by.y
. The
storage.mode
of the interval columns must be either double
or integer
. It therefore works with bit64::integer64
type as well.
The lookup
generation step could be quite time consuming if the number
of unique values in y
are too large (ex: in the order of tens of millions).
There might be improvements possible by constructing lookup using RLE, which is
a pending feature request. However most scenarios will not have too many unique
values for y
.
A new data.table
by joining over the interval columns (along with other
additional identifier columns) specified in by.x
and by.y
.
NB: When which=TRUE
: a)
mult="first" or "last"
returns a
vector
of matching row numbers in y
, and b)
when
mult="all"
returns a data.table with two columns with the first
containing row numbers of x
and the second column with corresponding
row numbers of y
.
nomatch=NANULL
also influences whether nonmatching rows are returned
or not, as explained above.
data.table
,
https://www.bioconductor.org/packages/release/bioc/html/IRanges.html,
setNumericRounding
require(data.table) ## simple example: x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10) y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3) setkey(y, start, end) foverlaps(x, y, type="any", which=TRUE) ## return overlap indices#> xid yid #> <int> <int> #> 1: 1 NA #> 2: 2 2 #> 3: 2 3 #> 4: 3 2 #> 5: 4 NAfoverlaps(x, y, type="any") ## return overlap join#> start end val1 i.start i.end val2 #> <num> <num> <int> <num> <num> <int> #> 1: NA NA NA 5 8 7 #> 2: 20 35 2 31 50 8 #> 3: 30 45 3 31 50 8 #> 4: 20 35 2 22 25 9 #> 5: NA NA NA 16 18 10foverlaps(x, y, type="any", mult="first") ## returns only first match#> start end val1 i.start i.end val2 #> <num> <num> <int> <num> <num> <int> #> 1: NA NA NA 5 8 7 #> 2: 20 35 2 31 50 8 #> 3: 20 35 2 22 25 9 #> 4: NA NA NA 16 18 10foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'#> start end val1 i.start i.end val2 #> <num> <num> <int> <num> <num> <int> #> 1: NA NA NA 5 8 7 #> 2: NA NA NA 31 50 8 #> 3: 20 35 2 22 25 9 #> 4: NA NA NA 16 18 10## with extra identifiers (ex: in genomics) x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"), start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60)) y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1), end=c(4, 18, 55), geneid=letters[1:3]) setkey(y, chr, start, end) foverlaps(x, y, type="any", which=TRUE)#> xid yid #> <int> <int> #> 1: 1 NA #> 2: 2 2 #> 3: 3 3 #> 4: 4 3 #> 5: 5 3foverlaps(x, y, type="any")#> chr start end geneid i.start i.end #> <char> <num> <num> <char> <num> <num> #> 1: Chr1 NA NA <NA> 5 11 #> 2: Chr1 15 18 b 10 20 #> 3: Chr2 1 55 c 1 4 #> 4: Chr2 1 55 c 25 52 #> 5: Chr2 1 55 c 50 60foverlaps(x, y, type="any", nomatch=NULL)#> chr start end geneid i.start i.end #> <char> <num> <num> <char> <num> <num> #> 1: Chr1 15 18 b 10 20 #> 2: Chr2 1 55 c 1 4 #> 3: Chr2 1 55 c 25 52 #> 4: Chr2 1 55 c 50 60foverlaps(x, y, type="within", which=TRUE)#> xid yid #> <int> <int> #> 1: 1 NA #> 2: 2 NA #> 3: 3 3 #> 4: 4 3 #> 5: 5 NAfoverlaps(x, y, type="within")#> chr start end geneid i.start i.end #> <char> <num> <num> <char> <num> <num> #> 1: Chr1 NA NA <NA> 5 11 #> 2: Chr1 NA NA <NA> 10 20 #> 3: Chr2 1 55 c 1 4 #> 4: Chr2 1 55 c 25 52 #> 5: Chr2 NA NA <NA> 50 60foverlaps(x, y, type="start")#> chr start end geneid i.start i.end #> <char> <num> <num> <char> <num> <num> #> 1: Chr1 NA NA <NA> 5 11 #> 2: Chr1 NA NA <NA> 10 20 #> 3: Chr2 1 55 c 1 4 #> 4: Chr2 NA NA <NA> 25 52 #> 5: Chr2 NA NA <NA> 50 60## x and y have different column names  specify by.x x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"), start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60)) y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1), end=c(4, 18, 55), geneid=letters[1:3]) setkey(y, chr, start, end) foverlaps(x, y, by.x=c("seq", "start", "end"), type="any", which=TRUE)#> xid yid #> <int> <int> #> 1: 1 NA #> 2: 2 2 #> 3: 3 3 #> 4: 4 3 #> 5: 5 3