R: Fast overlap joins

foverlaps {data.table}

R Documentation

Fast overlap joins

Description

A fast binary-search based overlap join of two data.tables. This is very much inspired by findOverlaps function from the Bioconductor package IRanges (see link below under See Also).

Usually, x is a very large data.table with small interval ranges, and y is much smaller keyed data.table with relatively larger interval spans. For a usage in genomics, see the examples section.

NOTE: This is still under development, meaning it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.

Usage

foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
    by.y = key(y), maxgap = 0L, minoverlap = 1L,
    type = c("any", "within", "start", "end", "equal"),
    mult = c("all", "first", "last"),
    nomatch = NA,
    which = FALSE, verbose = getOption("datatable.verbose"))

Arguments

`x`, `y`	`data.table`s. `y` needs to be keyed, but not necessarily `x`. See examples.
`by.x`, `by.y`	A vector of column names (or numbers) to compute the overlap joins. The last two columns in both `by.x` and `by.y` should each correspond to the `start` and `end` interval columns in `x` and `y` respectively. And the `start` column should always be <= `end` column. If `x` is keyed, `by.x` is equal to `key(x)`, else `key(y)`. `by.y` defaults to `key(y)`.
`maxgap`	It should be a non-negative integer value, >= 0. Default is 0 (no gap). For intervals `[a,b]` and `[c,d]`, where `a<=b` and `c<=d`, when `c > b` or `d < a`, the two intervals don't overlap. If the gap between these two intervals is `<= maxgap`, these two intervals are considered as overlapping. Note: This is not yet implemented.
`minoverlap`	It should be a positive integer value, > 0. Default is 1. For intervals `[a,b]` and `[c,d]`, where `a<=b` and `c<=d`, when `c<=b` and `d>=a`, the two intervals overlap. If the length of overlap between these two intervals is `>= minoverlap`, then these two intervals are considered to be overlapping. Note: This is not yet implemented.
`type`	Default value is `any`. Allowed values are `any`, `within`, `start`, `end` and `equal`. The types shown here are identical in functionality to the function `findOverlaps` in the bioconductor package `IRanges`. Let `[a,b]` and `[c,d]` be intervals in `x` and `y` with `a<=b` and `c<=d`. For `type="start"`, the intervals overlap iff `a == c`. For `type="end"`, the intervals overlap iff `b == d`. For `type="within"`, the intervals overlap iff `a>=c and b<=d`. For `type="equal"`, the intervals overlap iff `a==c and b==d`. For `type="any"`, as long as `c<=b and d>=a`, they overlap. In addition to these requirements, they also have to satisfy the `minoverlap` argument as explained above. NB: `maxgap` argument, when > 0, is to be interpreted according to the type of the overlap. This will be updated once `maxgap` is implemented.
`mult`	When multiple rows in `y` match to the row in `x`, `mult=.` controls which values are returned - `"all"` (default), `"first"` or `"last"`.
`nomatch`	When a row (with interval say, `[a,b]`) in `x` has no match in `y`, `nomatch=NA` (default) means `NA` is returned for `y`'s non-`by.y` columns for that row of `x`. `nomatch=NULL` (or `0` for backward compatibility) means no rows will be returned for that row of `x`.
`which`	When `TRUE`, if `mult="all"` returns a two column `data.table` with the first column corresponding to `x`'s row number and the second corresponding to `y`'s. When `nomatch=NA`, no matches return `NA` for `y`, and if `nomatch=NULL`, those rows where no match is found will be skipped; if `mult="first" or "last"`, a vector of length equal to the number of rows in `x` is returned, with no-match entries filled with `NA` or `0` corresponding to the `nomatch` argument. Default is `FALSE`, which returns a join with the rows in `y`.
`verbose`	`TRUE` turns on status and information messages to the console. Turn this on by default using `options(datatable.verbose=TRUE)`. The quantity and types of verbosity may be expanded in future.

Details

Very briefly, foverlaps() collapses the two-column interval in y to one-column of unique values to generate a lookup table, and then performs the join depending on the type of overlap, using the already available binary search feature of data.table. The time (and space) required to generate the lookup is therefore proportional to the number of unique values present in the interval columns of y when combined together.

Overlap joins takes advantage of the fact that y is sorted to speed-up finding overlaps. Therefore y has to be keyed (see ?setkey) prior to running foverlaps(). A key on x is not necessary, although it might speed things further. The columns in by.x argument should correspond to the columns specified in by.y. The last two columns should be the interval columns in both by.x and by.y. The first interval column in by.x should always be <= the second interval column in by.x, and likewise for by.y. The storage.mode of the interval columns must be either double or integer. It therefore works with bit64::integer64 type as well.

The lookup generation step could be quite time consuming if the number of unique values in y are too large (ex: in the order of tens of millions). There might be improvements possible by constructing lookup using RLE, which is a pending feature request. However most scenarios will not have too many unique values for y.

Value

A new data.table by joining over the interval columns (along with other additional identifier columns) specified in by.x and by.y.

NB: When which=TRUE: a) mult="first" or "last" returns a vector of matching row numbers in y, and b) when mult="all" returns a data.table with two columns with the first containing row numbers of x and the second column with corresponding row numbers of y.

nomatch=NA|NULL also influences whether non-matching rows are returned or not, as explained above.

Examples

require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'

## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=NULL)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")

## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
            type="any", which=TRUE)

[Package data.table version 1.15.99 Index]