Optimisations in data.table
datatable-optimize.Rd
data.table
internally optimises certain expressions in order to improve
performance. This section briefly summarises those optimisations.
Note that there's no additional input needed from the user to take advantage of these optimisations. They happen automatically.
Run the code under the example section to get a feel for the performance benefits from these optimisations.
Note that for all optimizations involving efficient sorts, the caveat mentioned
in setorder
applies -- whenever data.table does the sorting,
it does so in "C-locale". This has some subtle implications; see Examples.
Details
data.table
reads the global option datatable.optimize
to figure
out what level of optimisation is required. The default value Inf
activates all available optimisations.
At optimisation level >= 1
, i.e., getOption("datatable.optimize")
>= 1, these are the optimisations:
The base function
order
is internally replaced withdata.table
's fast ordering. That is,DT[order(...)]
gets internally optimised toDT[forder(...)]
.The expression
DT[, lapply(.SD, fun), by=.]
gets optimised toDT[, list(fun(a), fun(b), ...), by=.]
wherea,b, ...
are columns in.SD
. This improves performance tremendously.Similarly, the expression
DT[, c(.N, lapply(.SD, fun)), by=.]
gets optimised toDT[, list(.N, fun(a), fun(b), ...)]
..N
is just for example here.base::mean
function is internally optimised to usedata.table
'sfastmean
function.mean()
frombase
is an S3 generic and gets slow with many groups.
At optimisation level >= 2
, i.e., getOption("datatable.optimize")
>= 2, additional optimisations are implemented on top of the optimisations already shown above.
Expressions in
j
which contain only the functionsmin, max, mean, median, var, sd, sum, prod, first, last, head, tail
(for example,DT[, list(mean(x), median(x), min(y), max(y)), by=z]
), they are very effectively optimised using what we call GForce. These functions are automatically replaced with a corresponding GForce version with patterng*
, e.g.,prod
becomesgprod
.Normally, once the rows belonging to each group are identified, the values corresponding to the group are gathered and the
j
-expression is evaluated. This can be improved by computing the result directly without having to gather the values or evaluating the expression for each group (which can get costly with large number of groups) by implementing it specifically for a particular function. As a result, it is extremely fast.In addition to all the functions above, `.N` is also optimised to use GForce, when used separately or when combined with the functions mentioned above. Note further that GForce-optimized functions must be used separately, i.e., code like
DT[ , max(x) - min(x), by=z]
will not currently be optimized to usegmax, gmin
.Expressions of the form
DT[i, j, by]
are also optimised wheni
is a subset operation andj
is any/all of the functions discussed above.
At optimisation level >= 3
, i.e., getOption("datatable.optimize")
>= 3, additional optimisations for subsets in i are implemented on top of the optimisations already shown above. Subsetting operations are - if possible - translated into joins to make use of blazing fast binary search using indices and keys. The following queries are optimized:
Supported operators:
==
,%in%
. Non-equi operators(>, <, etc.) are not supported yet because non-equi joins are slower than vector based subsets.Queries on multiple columns are supported, if the connector is '
&
', e.g.DT[x == 2 & y == 3]
is supported, butDT[x == 2 | y == 3]
is not.Optimization will currently be turned off when doing subset when cross product of elements provided to filter on exceeds > 1e4. This most likely happens if multiple
%in%
, or%chin%
queries are combined, e.g.DT[x %in% 1:100 & y %in% 1:200]
will not be optimized since100 * 200 = 2e4 > 1e4
.Queries with multiple criteria on one column are not supported, e.g.
DT[x == 2 & x %in% c(2,5)]
is not supported.Queries with non-missing j are supported, e.g.
DT[x == 3 & y == 5, .(new = x-y)]
orDT[x == 3 & y == 5, new := x-y]
are supported. Also extends to queries usingwith = FALSE
."notjoin" queries, i.e. queries that start with
!
, are only supported if there are no&
connections, e.g.DT[!x==3]
is supported, butDT[!x==3 & y == 4]
is not.
If in doubt, whether your query benefits from optimization, call it with the verbose = TRUE
argument. You should see "Optimized subsetting...".
Auto indexing: In case a query is optimized, but no appropriate key or index is found, data.table
automatically creates an index on the first run. Any successive subsets on the same
column then reuse this index to binary search (instead of
vector scan) and is therefore fast.
Auto indexing can be switched off with the global option
options(datatable.auto.index = FALSE)
. To switch off using existing
indices set global option options(datatable.use.index = FALSE)
.
Examples
if (FALSE) {
old = options(datatable.optimize = Inf)
# Generate a big data.table with a relatively many columns
set.seed(1L)
DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(DT)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(DT), units="Mb") # 400MB, not huge, but will do
# 'order' optimisation
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[order(id)])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[order(id)])
identical(ans1, ans2)
# optimisation of 'lapply(.SD, fun)'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, min), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, lapply(.SD, min), by=id])
identical(ans1, ans2)
# optimisation of 'mean'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, mean), by=id])
system.time(ans2 <- DT[, lapply(.SD, base::mean), by=id])
identical(ans1, ans2)
# optimisation of 'c(.N, lapply(.SD, ))'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, c(.N, lapply(.SD, min)), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, c(N=.N, lapply(.SD, min)), by=id])
identical(ans1, ans2)
# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, median), by=id])
system.time(ans2 <- DT[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)
# optimized subsets
options(datatable.optimize = 2L)
system.time(ans1 <- DT[id == 100L]) # vector scan
system.time(ans2 <- DT[id == 100L]) # vector scan
system.time(DT[id %in% 100:500]) # vector scan
options(datatable.optimize = 3L)
system.time(ans1 <- DT[id == 100L]) # index + binary search subset
system.time(ans2 <- DT[id == 100L]) # only binary search subset
system.time(DT[id %in% 100:500]) # only binary search subset again
# sensitivity to collate order
old_lc_collate = Sys.getlocale("LC_COLLATE")
if (old_lc_collate == "C") {
Sys.setlocale("LC_COLLATE", "")
}
DT = data.table(
grp = rep(1:2, each = 4L),
var = c("A", "a", "0", "1", "B", "b", "0", "1")
)
options(datatable.optimize = Inf)
DT[, .(max(var), min(var)), by=grp]
# GForce is deactivated because of the ad-hoc column 'tolower(var)',
# through which the result for 'max(var)' may also change
DT[, .(max(var), min(tolower(var))), by=grp]
Sys.setlocale("LC_COLLATE", old_lc_collate)
options(old)
}