datatable.optimize {data.table}R Documentation

Optimisations in data.table

Description

data.table internally optimises certain expressions in order to improve performance. This section briefly summarises those optimisations.

Note that there's no additional input needed from the user to take advantage of these optimisations. They happen automatically.

Run the code under the example section to get a feel for the performance benefits from these optimisations.

Note that for all optimizations involving efficient sorts, the caveat mentioned in setorder applies – whenever data.table does the sorting, it does so in "C-locale". This has some subtle implications; see Examples.

Details

data.table reads the global option datatable.optimize to figure out what level of optimisation is required. The default value Inf activates all available optimisations.

At optimisation level >= 1, i.e., getOption("datatable.optimize") >= 1, these are the optimisations:

At optimisation level >= 2, i.e., getOption("datatable.optimize") >= 2, additional optimisations are implemented on top of the optimisations already shown above.

At optimisation level >= 3, i.e., getOption("datatable.optimize") >= 3, additional optimisations for subsets in i are implemented on top of the optimisations already shown above. Subsetting operations are - if possible - translated into joins to make use of blazing fast binary search using indices and keys. The following queries are optimized:

If in doubt, whether your query benefits from optimization, call it with the verbose = TRUE argument. You should see "Optimized subsetting...".

Auto indexing: In case a query is optimized, but no appropriate key or index is found, data.table automatically creates an index on the first run. Any successive subsets on the same column then reuse this index to binary search (instead of vector scan) and is therefore fast. Auto indexing can be switched off with the global option options(datatable.auto.index = FALSE). To switch off using existing indices set global option options(datatable.use.index = FALSE).

See Also

setNumericRounding, getNumericRounding

Examples

## Not run: 
old = options(datatable.optimize = Inf)

# Generate a big data.table with a relatively many columns
set.seed(1L)
DT = lapply(1:20, function(x) sample(c(-100:100), 5e6L, TRUE))
setDT(DT)[, id := sample(1e5, 5e6, TRUE)]
print(object.size(DT), units="Mb") # 400MB, not huge, but will do

# 'order' optimisation
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[order(id)])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[order(id)])
identical(ans1, ans2)

# optimisation of 'lapply(.SD, fun)'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, min), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, lapply(.SD, min), by=id])
identical(ans1, ans2)

# optimisation of 'mean'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, mean), by=id])
system.time(ans2 <- DT[, lapply(.SD, base::mean), by=id])
identical(ans1, ans2)

# optimisation of 'c(.N, lapply(.SD, ))'
options(datatable.optimize = 1L) # optimisation 'on'
system.time(ans1 <- DT[, c(.N, lapply(.SD, min)), by=id])
options(datatable.optimize = 0L) # optimisation 'off'
system.time(ans2 <- DT[, c(N=.N, lapply(.SD, min)), by=id])
identical(ans1, ans2)

# GForce
options(datatable.optimize = 2L) # optimisation 'on'
system.time(ans1 <- DT[, lapply(.SD, median), by=id])
system.time(ans2 <- DT[, lapply(.SD, function(x) as.numeric(stats::median(x))), by=id])
identical(ans1, ans2)

# optimized subsets
options(datatable.optimize = 2L)
system.time(ans1 <- DT[id == 100L]) # vector scan
system.time(ans2 <- DT[id == 100L]) # vector scan
system.time(DT[id %in% 100:500])    # vector scan

options(datatable.optimize = 3L)
system.time(ans1 <- DT[id == 100L]) # index + binary search subset
system.time(ans2 <- DT[id == 100L]) # only binary search subset
system.time(DT[id %in% 100:500])    # only binary search subset again

# sensitivity to collate order
old_lc_collate = Sys.getlocale("LC_COLLATE")

if (old_lc_collate == "C") {
  Sys.setlocale("LC_COLLATE", "")
}
DT = data.table(
  grp = rep(1:2, each = 4L),
  var = c("A", "a", "0", "1", "B", "b", "0", "1")
)
options(datatable.optimize = Inf)
DT[, .(max(var), min(var)), by=grp]
# GForce is deactivated because of the ad-hoc column 'tolower(var)',
#   through which the result for 'max(var)' may also change
DT[, .(max(var), min(tolower(var))), by=grp]

Sys.setlocale("LC_COLLATE", old_lc_collate)
options(old)

## End(Not run)

[Package data.table version 1.15.99 Index]