# Optimisations in data.table

`datatable-optimize.Rd`

`data.table`

internally optimises certain expressions in order to improve
performance. This section briefly summarises those optimisations.

Note that there's no additional input needed from the user to take advantage of these optimisations. They happen automatically.

Run the code under the *example* section to get a feel for the performance
benefits from these optimisations.

Note that for all optimizations involving efficient sorts, the caveat mentioned
in `setorder`

applies -- whenever data.table does the sorting,
it does so in "C-locale". This has some subtle implications; see Examples.

## Details

`data.table`

reads the global option `datatable.optimize`

to figure
out what level of optimisation is required. The default value `Inf`

activates *all* available optimisations.

At optimisation level `>= 1`

, i.e., `getOption("datatable.optimize")`

>= 1, these are the optimisations:

The base function

`order`

is internally replaced with`data.table`

's*fast ordering*. That is,`DT[order(...)]`

gets internally optimised to`DT[forder(...)]`

.The expression

`DT[, lapply(.SD, fun), by=.]`

gets optimised to`DT[, list(fun(a), fun(b), ...), by=.]`

where`a,b, ...`

are columns in`.SD`

. This improves performance tremendously.Similarly, the expression

`DT[, c(.N, lapply(.SD, fun)), by=.]`

gets optimised to`DT[, list(.N, fun(a), fun(b), ...)]`

.`.N`

is just for example here.`base::mean`

function is internally optimised to use`data.table`

's`fastmean`

function.`mean()`

from`base`

is an S3 generic and gets slow with many groups.

At optimisation level `>= 2`

, i.e., `getOption("datatable.optimize")`

>= 2, additional optimisations are implemented on top of the optimisations already shown above.

Expressions in

`j`

which contain only the functions`min, max, mean, median, var, sd, sum, prod, first, last, head, tail`

(for example,`DT[, list(mean(x), median(x), min(y), max(y)), by=z]`

), they are very effectively optimised using what we call*GForce*. These functions are automatically replaced with a corresponding GForce version with pattern`g*`

, e.g.,`prod`

becomes`gprod`

.Normally, once the rows belonging to each group are identified, the values corresponding to the group are gathered and the

`j`

-expression is evaluated. This can be improved by computing the result directly without having to gather the values or evaluating the expression for each group (which can get costly with large number of groups) by implementing it specifically for a particular function. As a result, it is extremely fast.In addition to all the functions above, `.N` is also optimised to use GForce, when used separately or when combined with the functions mentioned above. Note further that GForce-optimized functions must be used separately, i.e., code like

`DT[ , max(x) - min(x), by=z]`

will*not*currently be optimized to use`gmax, gmin`

.Expressions of the form

`DT[i, j, by]`

are also optimised when`i`

is a*subset*operation and`j`

is any/all of the functions discussed above.

At optimisation level `>= 3`

, i.e., `getOption("datatable.optimize")`

>= 3, additional optimisations for subsets in i are implemented on top of the optimisations already shown above. Subsetting operations are - if possible - translated into joins to make use of blazing fast binary search using indices and keys. The following queries are optimized:

Supported operators:

`==`

,`%in%`

. Non-equi operators(>, <, etc.) are not supported yet because non-equi joins are slower than vector based subsets.Queries on multiple columns are supported, if the connector is '

`&`

', e.g.`DT[x == 2 & y == 3]`

is supported, but`DT[x == 2 | y == 3]`

is not.Optimization will currently be turned off when doing subset when cross product of elements provided to filter on exceeds > 1e4. This most likely happens if multiple

`%in%`

, or`%chin%`

queries are combined, e.g.`DT[x %in% 1:100 & y %in% 1:200]`

will not be optimized since`100 * 200 = 2e4 > 1e4`

.Queries with multiple criteria on one column are

*not*supported, e.g.`DT[x == 2 & x %in% c(2,5)]`

is not supported.Queries with non-missing j are supported, e.g.

`DT[x == 3 & y == 5, .(new = x-y)]`

or`DT[x == 3 & y == 5, new := x-y]`

are supported. Also extends to queries using`with = FALSE`

."notjoin" queries, i.e. queries that start with

`!`

, are only supported if there are no`&`

connections, e.g.`DT[!x==3]`

is supported, but`DT[!x==3 & y == 4]`

is not.

If in doubt, whether your query benefits from optimization, call it with the `verbose = TRUE`

argument. You should see "Optimized subsetting...".

**Auto indexing:** In case a query is optimized, but no appropriate key or index is found, `data.table`

automatically creates an *index* on the first run. Any successive subsets on the same
column then reuse this index to *binary search* (instead of
*vector scan*) and is therefore fast.
Auto indexing can be switched off with the global option
`options(datatable.auto.index = FALSE)`

. To switch off using existing
indices set global option `options(datatable.use.index = FALSE)`

.

## Examples

```
if (FALSE) {
old = options(datatable.optimize = Inf)
# Generate a big data.table with a relatively many columns
set.seed(1L)
DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(DT)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(DT), units="Mb") # 400MB, not huge, but will do
# 'order' optimisation
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[order(id)])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[order(id)])
identical(ans1, ans2)
# optimisation of 'lapply(.SD, fun)'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, min), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, lapply(.SD, min), by=id])
identical(ans1, ans2)
# optimisation of 'mean'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, mean), by=id])
system.time(ans2 <- DT[, lapply(.SD, base::mean), by=id])
identical(ans1, ans2)
# optimisation of 'c(.N, lapply(.SD, ))'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, c(.N, lapply(.SD, min)), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, c(N=.N, lapply(.SD, min)), by=id])
identical(ans1, ans2)
# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, median), by=id])
system.time(ans2 <- DT[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)
# optimized subsets
options(datatable.optimize = 2L)
system.time(ans1 <- DT[id == 100L]) # vector scan
system.time(ans2 <- DT[id == 100L]) # vector scan
system.time(DT[id %in% 100:500]) # vector scan
options(datatable.optimize = 3L)
system.time(ans1 <- DT[id == 100L]) # index + binary search subset
system.time(ans2 <- DT[id == 100L]) # only binary search subset
system.time(DT[id %in% 100:500]) # only binary search subset again
# sensitivity to collate order
old_lc_collate = Sys.getlocale("LC_COLLATE")
if (old_lc_collate == "C") {
Sys.setlocale("LC_COLLATE", "")
}
DT = data.table(
grp = rep(1:2, each = 4L),
var = c("A", "a", "0", "1", "B", "b", "0", "1")
)
options(datatable.optimize = Inf)
DT[, .(max(var), min(var)), by=grp]
# GForce is deactivated because of the ad-hoc column 'tolower(var)',
# through which the result for 'max(var)' may also change
DT[, .(max(var), min(tolower(var))), by=grp]
Sys.setlocale("LC_COLLATE", old_lc_collate)
options(old)
}
```