Fast overlap joins
foverlaps.Rd
A fast binary-search based overlap join of two data.table
s.
This is very much inspired by findOverlaps
function from the Bioconductor
package IRanges
(see link below under See Also
).
Usually, x
is a very large data.table with small interval ranges, and
y
is much smaller keyed data.table
with relatively larger
interval spans. For a usage in genomics
, see the examples section.
NOTE: This is still under development, meaning it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.
Arguments
- x, y
data.table
s.y
needs to be keyed, but not necessarilyx
. See examples.- by.x, by.y
A vector of column names (or numbers) to compute the overlap joins. The last two columns in both
by.x
andby.y
should each correspond to thestart
andend
interval columns inx
andy
respectively. And thestart
column should always be <=end
column. Ifx
is keyed,by.x
is equal tokey(x)
, elsekey(y)
.by.y
defaults tokey(y)
.- maxgap
It should be a non-negative integer value, >= 0. Default is 0 (no gap). For intervals
[a,b]
and[c,d]
, wherea<=b
andc<=d
, whenc > b
ord < a
, the two intervals don't overlap. If the gap between these two intervals is<= maxgap
, these two intervals are considered as overlapping. Note: This is not yet implemented.- minoverlap
It should be a positive integer value, > 0. Default is 1. For intervals
[a,b]
and[c,d]
, wherea<=b
andc<=d
, whenc<=b
andd>=a
, the two intervals overlap. If the length of overlap between these two intervals is>= minoverlap
, then these two intervals are considered to be overlapping. Note: This is not yet implemented.- type
Default value is
any
. Allowed values areany
,within
,start
,end
andequal
.The types shown here are identical in functionality to the function
findOverlaps
in the bioconductor packageIRanges
. Let[a,b]
and[c,d]
be intervals inx
andy
witha<=b
andc<=d
. Fortype="start"
, the intervals overlap iffa == c
. Fortype="end"
, the intervals overlap iffb == d
. Fortype="within"
, the intervals overlap iffa>=c and b<=d
. Fortype="equal"
, the intervals overlap iffa==c and b==d
. Fortype="any"
, as long asc<=b and d>=a
, they overlap. In addition to these requirements, they also have to satisfy theminoverlap
argument as explained above.NB:
maxgap
argument, when > 0, is to be interpreted according to the type of the overlap. This will be updated oncemaxgap
is implemented.- mult
When multiple rows in
y
match to the row inx
,mult=.
controls which values are returned -"all"
(default),"first"
or"last"
.- nomatch
When a row (with interval say,
[a,b]
) inx
has no match iny
,nomatch=NA
(default) meansNA
is returned fory
's non-by.y
columns for that row ofx
.nomatch=NULL
(or0
for backward compatibility) means no rows will be returned for that row ofx
.- which
When
TRUE
, ifmult="all"
returns a two columndata.table
with the first column corresponding tox
's row number and the second corresponding toy
's. Whennomatch=NA
, no matches returnNA
fory
, and ifnomatch=NULL
, those rows where no match is found will be skipped; ifmult="first" or "last"
, a vector of length equal to the number of rows inx
is returned, with no-match entries filled withNA
or0
corresponding to thenomatch
argument. Default isFALSE
, which returns a join with the rows iny
.- verbose
TRUE
turns on status and information messages to the console. Turn this on by default usingoptions(datatable.verbose=TRUE)
. The quantity and types of verbosity may be expanded in future.
Details
Very briefly, foverlaps()
collapses the two-column interval in y
to one-column of unique values to generate a lookup
table, and
then performs the join depending on the type of overlap
, using the
already available binary search
feature of data.table
. The time
(and space) required to generate the lookup
is therefore proportional
to the number of unique values present in the interval columns of y
when combined together.
Overlap joins takes advantage of the fact that y
is sorted to speed-up
finding overlaps. Therefore y
has to be keyed (see ?setkey
)
prior to running foverlaps()
. A key on x
is not necessary,
although it might speed things further. The columns in by.x
argument should correspond to the columns specified in by.y
. The last
two columns should be the interval columns in both by.x
and
by.y
. The first interval column in by.x
should always be <= the
second interval column in by.x
, and likewise for by.y
. The
storage.mode
of the interval columns must be either double
or integer
. It therefore works with bit64::integer64
type as well.
The lookup
generation step could be quite time consuming if the number
of unique values in y
are too large (ex: in the order of tens of millions).
There might be improvements possible by constructing lookup using RLE, which is
a pending feature request. However most scenarios will not have too many unique
values for y
.
Value
A new data.table
by joining over the interval columns (along with other
additional identifier columns) specified in by.x
and by.y
.
NB: When which=TRUE
: a)
mult="first" or "last"
returns a
vector
of matching row numbers in y
, and b)
when
mult="all"
returns a data.table with two columns with the first
containing row numbers of x
and the second column with corresponding
row numbers of y
.
nomatch=NA|NULL
also influences whether non-matching rows are returned
or not, as explained above.
Examples
require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 2
#> 3: 2 3
#> 4: 3 2
#> 5: 4 NA
foverlaps(x, y, type="any") ## return overlap join
#> start end val1 i.start i.end val2
#> <num> <num> <int> <num> <num> <int>
#> 1: NA NA NA 5 8 7
#> 2: 20 35 2 31 50 8
#> 3: 30 45 3 31 50 8
#> 4: 20 35 2 22 25 9
#> 5: NA NA NA 16 18 10
foverlaps(x, y, type="any", mult="first") ## returns only first match
#> start end val1 i.start i.end val2
#> <num> <num> <int> <num> <num> <int>
#> 1: NA NA NA 5 8 7
#> 2: 20 35 2 31 50 8
#> 3: 20 35 2 22 25 9
#> 4: NA NA NA 16 18 10
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
#> start end val1 i.start i.end val2
#> <num> <num> <int> <num> <num> <int>
#> 1: NA NA NA 5 8 7
#> 2: NA NA NA 31 50 8
#> 3: 20 35 2 22 25 9
#> 4: NA NA NA 16 18 10
## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 2
#> 3: 3 3
#> 4: 4 3
#> 5: 5 3
foverlaps(x, y, type="any")
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 NA NA <NA> 5 11
#> 2: Chr1 15 18 b 10 20
#> 3: Chr2 1 55 c 1 4
#> 4: Chr2 1 55 c 25 52
#> 5: Chr2 1 55 c 50 60
foverlaps(x, y, type="any", nomatch=NULL)
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 15 18 b 10 20
#> 2: Chr2 1 55 c 1 4
#> 3: Chr2 1 55 c 25 52
#> 4: Chr2 1 55 c 50 60
foverlaps(x, y, type="within", which=TRUE)
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 NA
#> 3: 3 3
#> 4: 4 3
#> 5: 5 NA
foverlaps(x, y, type="within")
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 NA NA <NA> 5 11
#> 2: Chr1 NA NA <NA> 10 20
#> 3: Chr2 1 55 c 1 4
#> 4: Chr2 1 55 c 25 52
#> 5: Chr2 NA NA <NA> 50 60
foverlaps(x, y, type="start")
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 NA NA <NA> 5 11
#> 2: Chr1 NA NA <NA> 10 20
#> 3: Chr2 1 55 c 1 4
#> 4: Chr2 NA NA <NA> 25 52
#> 5: Chr2 NA NA <NA> 50 60
## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
type="any", which=TRUE)
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 2
#> 3: 3 3
#> 4: 4 3
#> 5: 5 3