Keys and fast binary search based subset
2025-02-13
This vignette is aimed at those who are already familiar with data.table syntax, its general form, how to subset rows in i, select and compute on columns, add/modify/delete columns by reference in j and group by using by. If you’re not familiar with these concepts, please read the vignette("datatable-intro", package="data.table") and vignette("datatable-reference-semantics", package="data.table") vignettes first.
Data
We will use the same flights data as in the vignette("datatable-intro", package="data.table") vignette.
flights <- fread("flights14.csv")
head(flights)
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 -8 -26 AA LGA PBI 157 1035 7
# 5: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 6: 2014 1 1 4 0 AA EWR LAX 339 2454 18
dim(flights)
# [1] 253316 11
Introduction
In this vignette, we will
- first introduce the concept of key in data.table, and set and use keys to perform fast binary search based subsets in i,
- see that we can combine key based subsets along with j and by in the exact same way as before,
- look at other additional useful arguments - mult and nomatch,
- and finally conclude by looking at the advantage of setting keys - perform fast binary search based subsets and compare with the traditional vector scan approach.
1. Keys
a) What is a key?
In the vignette("datatable-intro", package="data.table") vignette, we saw how to subset rows in i using logical expressions, row numbers and using order(). In this section, we will look at another way of subsetting incredibly fast - using keys.
But first, let’s start by looking at data.frames. All data.frames have a row names attribute. Consider the data.frame DF below.
set.seed(1L)
DF = data.frame(ID1 = sample(letters[1:2], 10, TRUE),
ID2 = sample(1:3, 10, TRUE),
val = sample(10),
stringsAsFactors = FALSE,
row.names = sample(LETTERS[1:10]))
DF
# ID1 ID2 val
# I a 1 10
# D a 3 9
# G a 1 4
# A a 1 7
# B a 1 1
# E b 1 8
# C b 2 3
# J b 1 2
# F b 1 5
# H a 2 6
rownames(DF)
# [1] "I" "D" "G" "A" "B" "E" "C" "J" "F" "H"
We can subset a particular row using its row name as shown below:
DF["C", ]
# ID1 ID2 val
# C b 2 3
i.e., row names are more or less an index to the rows of a data.frame. However,
- Each row is limited to exactly one row name. But a person (for example) has at least two names - a first and a second name. It is useful to organise a telephone directory by surname then first name.
- And row names should be unique.
Now let’s convert it to a data.table.
DT = as.data.table(DF)
DT
# ID1 ID2 val
# <char> <int> <int>
# 1: a 1 10
# 2: a 3 9
# 3: a 1 4
# 4: a 1 7
# 5: a 1 1
# 6: b 1 8
# 7: b 2 3
# 8: b 1 2
# 9: b 1 5
# 10: a 2 6
rownames(DT)
# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
Note that the row names have been reset.
- data.tables never use row names. Since data.tables inherit from data.frames, they still have the row names attribute. But they never use it. We’ll see in a moment why.
- If you would like to preserve the row names, use keep.rownames = TRUE in as.data.table() - this will create a new column called rn and assign the row names to this column, as sketched below.
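For example, a minimal sketch (DT2 is a hypothetical name used only here):
DT2 <- as.data.table(DF, keep.rownames = TRUE)
head(DT2$rn)   # the former row names of DF, now an ordinary column
# [1] "I" "D" "G" "A" "B" "E"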
Instead, in data.tables we set and use keys. Think of a key as supercharged row names.
Keys and their properties
- We can set keys on multiple columns and the columns can be of different types - integer, numeric, character, factor, integer64 etc. list and complex types are not supported yet.
- Uniqueness is not enforced, i.e., duplicate key values are allowed. Since rows are sorted by key, any duplicates in the key columns will appear consecutively.
- Setting a key does two things:
  1. It physically reorders the rows of the data.table by the column(s) provided, by reference, always in increasing order.
  2. It marks those columns as key columns by setting an attribute called sorted on the data.table (see the quick peek below).
- Since the rows are reordered, a data.table can have at most one key because it can not be sorted in more than one way.
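Setting keys is done with setkey(), which we cover in the next section; as a quick peek at the sorted attribute mentioned above, here is an illustrative sketch using the small DT created earlier:
setkey(DT, ID1, ID2)   # physically reorders DT by ID1, then ID2, by reference
attr(DT, "sorted")     # the key columns are recorded in the 'sorted' attribute
# [1] "ID1" "ID2"
key(DT)                # key() simply reads this attribute
# [1] "ID1" "ID2"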
For the rest of the vignette, we will work with the flights data set.
b) Set, get and use keys on a data.table
– How can we set the column origin as key in the data.table flights?
setkey(flights, origin)
head(flights)
# Key: <origin>
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 1 4 0 AA EWR LAX 339 2454 18
# 2: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16
# 3: 2014 1 1 191 185 AA EWR DFW 214 1372 16
# 4: 2014 1 1 -1 -2 AA EWR DFW 214 1372 14
# 5: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6
# 6: 2014 1 1 4 -17 AA EWR DFW 215 1372 9
## alternatively we can provide character vectors to the function 'setkeyv()'
# setkeyv(flights, "origin") # useful to program with
- You can use the function setkey() and provide the column names (without quoting them). This is helpful during interactive use.
- Alternatively you can pass a character vector of column names to the function setkeyv(). This is particularly useful while designing functions to pass the columns to set the key on as function arguments.
- Note that we did not have to assign the result back to a variable. This is because, like the := operator we saw in the vignette("datatable-reference-semantics", package="data.table") vignette, setkey() and setkeyv() modify the input data.table by reference. They return the result invisibly.
- The data.table is now reordered (or sorted) by the column we provided - origin. Since we reorder by reference, we only require additional memory of one column of length equal to the number of rows in the data.table, and it is therefore very memory efficient.
- You can also set keys directly when creating data.tables with the data.table() function, using the key argument. It takes a character vector of column names (see the short sketch below).
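For example, a minimal sketch with a small throwaway table (DT_keyed is a hypothetical name used only for this illustration):
DT_keyed <- data.table(id = c("b", "a", "c"), val = 1:3, key = "id")
key(DT_keyed)
# [1] "id"
DT_keyed$id   # rows were already reordered by 'id' at creation
# [1] "a" "b" "c"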
set* and := :
In data.table, the := operator and all the set* functions (e.g., setkey, setorder, setnames etc.) are the only ones which modify the input object by reference.
Once you key a data.table by certain columns, you can subset by querying those key columns using the .() notation in i. Recall that .() is an alias to list().
– Use the key column origin to subset all rows where the origin airport matches “JFK”
flights[.("JFK")]
# Key: <origin>
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 1 14 13 AA JFK LAX 359 2475 9
# 2: 2014 1 1 -3 13 AA JFK LAX 363 2475 11
# 3: 2014 1 1 2 9 AA JFK LAX 351 2475 19
# 4: 2014 1 1 2 1 AA JFK LAX 350 2475 13
# 5: 2014 1 1 -2 -18 AA JFK LAX 338 2475 21
# ---
# 81479: 2014 10 31 -4 -21 UA JFK SFO 337 2586 17
# 81480: 2014 10 31 -2 -37 UA JFK SFO 344 2586 18
# 81481: 2014 10 31 0 -33 UA JFK LAX 320 2475 17
# 81482: 2014 10 31 -6 -38 UA JFK SFO 343 2586 9
# 81483: 2014 10 31 -6 -38 UA JFK LAX 323 2475 11
## alternatively
# flights[J("JFK")] (or)
# flights[list("JFK")]
- The key column has already been set to origin. So it is sufficient to provide the value, here “JFK”, directly. The .() syntax helps identify that the task requires looking up the value “JFK” in the key column of the data.table (here, column origin of the flights data.table).
- The row indices corresponding to the value “JFK” in origin are obtained first. And since there is no expression in j, all columns corresponding to those row indices are returned.
- On a single column key of character type, you can drop the .() notation and use the values directly when subsetting, like subsetting using row names on data.frames.
flights["JFK"] ## same as flights[.("JFK")]
- We can subset any number of values as required
flights[c("JFK", "LGA")] ## same as flights[.(c("JFK", "LGA"))]
This returns all columns corresponding to those rows where the origin column matches either “JFK” or “LGA”.
c) Keys and multiple columns
To refresh, keys are like supercharged row names. We can set keys on multiple columns and they can be of multiple types.
– How can I set keys on both origin and dest columns?
setkey(flights, origin, dest)
head(flights)
# Key: <origin, dest>
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 2 -2 -25 EV EWR ALB 30 143 7
# 2: 2014 1 3 88 79 EV EWR ALB 29 143 23
# 3: 2014 1 4 220 211 EV EWR ALB 32 143 15
# 4: 2014 1 4 35 19 EV EWR ALB 32 143 7
# 5: 2014 1 5 47 42 EV EWR ALB 26 143 8
# 6: 2014 1 5 66 62 EV EWR ALB 31 143 23
## or alternatively
# setkeyv(flights, c("origin", "dest")) # provide a character vector of column names
key(flights)
# [1] "origin" "dest"
- It sorts the data.table first by the column origin and then by dest, by reference.
– Subset all rows using key columns where the first key column origin matches “JFK” and the second key column dest matches “MIA”
flights[.("JFK", "MIA")]
# Key: <origin, dest>
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 1 -1 -17 AA JFK MIA 161 1089 15
# 2: 2014 1 1 7 -8 AA JFK MIA 166 1089 9
# 3: 2014 1 1 2 -1 AA JFK MIA 164 1089 12
# 4: 2014 1 1 6 3 AA JFK MIA 157 1089 5
# 5: 2014 1 1 6 -12 AA JFK MIA 154 1089 17
# ---
# 2746: 2014 10 31 -1 -22 AA JFK MIA 148 1089 16
# 2747: 2014 10 31 -3 -20 AA JFK MIA 146 1089 8
# 2748: 2014 10 31 2 -17 AA JFK MIA 150 1089 6
# 2749: 2014 10 31 -3 -12 AA JFK MIA 150 1089 5
# 2750: 2014 10 31 29 4 AA JFK MIA 146 1089 19
How does the subset work here?
- It is important to understand how this works internally. “JFK” is first matched against the first key column origin. And within those matching rows, “MIA” is matched against the second key column dest to obtain the row indices where both origin and dest match the given values.
- Since no j is provided, we simply return all columns corresponding to those row indices.
– Subset all rows where just the first key column origin matches “JFK”
key(flights)
# [1] "origin" "dest"
flights[.("JFK")] ## or in this case simply flights["JFK"], for convenience
# Key: <origin, dest>
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 1 10 4 B6 JFK ABQ 280 1826 20
# 2: 2014 1 2 134 161 B6 JFK ABQ 252 1826 22
# 3: 2014 1 7 6 6 B6 JFK ABQ 269 1826 20
# 4: 2014 1 8 15 -15 B6 JFK ABQ 259 1826 20
# 5: 2014 1 9 45 32 B6 JFK ABQ 267 1826 20
# ---
# 81479: 2014 10 31 0 -18 DL JFK TPA 142 1005 8
# 81480: 2014 10 31 1 -8 B6 JFK TPA 149 1005 19
# 81481: 2014 10 31 -2 -22 B6 JFK TPA 145 1005 14
# 81482: 2014 10 31 -8 -5 B6 JFK TPA 149 1005 9
# 81483: 2014 10 31 -4 -18 B6 JFK TPA 145 1005 8
- Since we did not provide any values for the second key column dest, it just matches “JFK” against the first key column origin and returns all the matched rows.
– Subset all rows where just the second key column dest matches “MIA”
flights[.(unique(origin), "MIA")]
# Key: <origin, dest>
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 1 -5 -17 AA EWR MIA 161 1085 16
# 2: 2014 1 1 -3 -10 AA EWR MIA 154 1085 6
# 3: 2014 1 1 -5 -8 AA EWR MIA 157 1085 11
# 4: 2014 1 1 43 42 UA EWR MIA 155 1085 15
# 5: 2014 1 1 60 49 UA EWR MIA 162 1085 21
# ---
# 9924: 2014 10 31 -11 -8 AA LGA MIA 157 1096 13
# 9925: 2014 10 31 -5 -11 AA LGA MIA 150 1096 9
# 9926: 2014 10 31 -2 10 AA LGA MIA 156 1096 6
# 9927: 2014 10 31 -2 -16 AA LGA MIA 156 1096 19
# 9928: 2014 10 31 1 -11 US LGA MIA 164 1096 15
What’s happening here?
- Read this again. The value provided for the second key column, “MIA”, has to find the matching values in the dest key column on the matching rows provided by the first key column origin. We can not skip the values of key columns that come before it. Therefore, we provide all unique values from the key column origin (see the quick check below).
- “MIA” is automatically recycled to fit the length of unique(origin), which is 3.
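A quick check of those unique values (since flights is currently keyed by origin, dest, they come back in sorted order):
flights[, unique(origin)]
# [1] "EWR" "JFK" "LGA"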
2. Combining keys with j and by
All we have seen so far is the same concept - obtaining row indices in i, but just using a different method - using keys. It shouldn’t be surprising that we can do exactly the same things in j and by as seen in the previous vignettes. We will highlight this with a few examples.
a) Select in j
– Return the arr_delay column as a data.table corresponding to origin = "LGA" and dest = "TPA".
key(flights)
# [1] "origin" "dest"
flights[.("LGA", "TPA"), .(arr_delay)]
# arr_delay
# <int>
# 1: 1
# 2: 14
# 3: -17
# 4: -4
# 5: -12
# ---
# 1848: 39
# 1849: -24
# 1850: -12
# 1851: 21
# 1852: -11
- The row indices corresponding to origin == "LGA" and dest == "TPA" are obtained using a key based subset.
- Once we have the row indices, we look at j, which requires only the arr_delay column. So we simply select the column arr_delay for those row indices in the exact same way as we have seen in the vignette("datatable-intro", package="data.table") vignette.
- We could have returned the result by using with = FALSE as well.
flights[.("LGA", "TPA"), "arr_delay", with = FALSE]
d) Sub-assign by reference using := in j
We have seen this example already in the vignette("datatable-reference-semantics", package="data.table") vignette. Let’s take a look at all the hours available in the flights data.table:
# get all 'hours' in flights
flights[, sort(unique(hour))]
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
We see that there are 25 unique values in total in the data. Both 0 and 24 hours seem to be present. Let’s go ahead and replace 24 with 0, but this time using the key.
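A minimal sketch of these steps in code (24L is written as an integer so the replacement matches the column’s type):
setkey(flights, hour)
key(flights)
# [1] "hour"
flights[.(24L), hour := 0L]
key(flights)
# NULL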
- We first set the key to hour. This reorders flights by the column hour and marks that column as the key column.
- Now we can subset on hour using the .() notation. We subset for the value 24 and obtain the corresponding row indices.
- And on those row indices, we replace the key column with the value 0.
- Since we have replaced values on the key column, the data.table flights isn’t sorted by hour anymore. Therefore, the key has been automatically removed by setting it to NULL.
Now, there shouldn’t be any 24 in the hour column.
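We can verify this with the same check as before:
flights[, sort(unique(hour))]
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23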
e) Aggregation using by
Let’s set the key back to origin, dest first.
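In code, that is simply:
setkey(flights, origin, dest)
key(flights)
# [1] "origin" "dest"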
– Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month.
ans <- flights["JFK", max(dep_delay), keyby = month]
head(ans)
# Key: <month>
# month V1
# <int> <int>
# 1: 1 881
# 2: 2 1014
# 3: 3 920
# 4: 4 1241
# 5: 5 853
# 6: 6 798
key(ans)
# [1] "month"
- We subset on the key column origin to obtain the row indices corresponding to “JFK”.
- Once we obtain the row indices, we only need two columns - month to group by and dep_delay to obtain max() for each group. data.table’s query optimisation therefore subsets just those two columns corresponding to the row indices obtained in i, for speed and memory efficiency.
- And on that subset, we group by month and compute max(dep_delay).
- We use keyby to automatically key that result by month. Now we understand what that means. In addition to ordering, it also sets month as the key column.
3. Additional arguments - mult and nomatch
a) The mult argument
We can choose, for each query, if “all” the matching rows should be returned, or just the “first” or “last”, using the mult argument. The default value is “all” - what we’ve seen so far.
– Subset only the first matching row from all rows where origin matches “JFK” and dest matches “MIA”
flights[.("JFK", "MIA"), mult = "first"]
# Key: <origin, dest>
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 1 1 6 3 AA JFK MIA 157 1089 5
– Subset only the last matching row of all the rows where origin matches “LGA”, “JFK”, “EWR” and dest matches “XNA”
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last"]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18
# 2: NA NA NA NA NA <NA> JFK XNA NA NA NA
# 3: 2014 2 3 231 268 EV EWR XNA 184 1131 12
- The query “JFK”, “XNA” doesn’t match any rows in flights and therefore returns NA.
- Once again, the query for the second key column dest, “XNA”, is recycled to fit the length of the query for the first key column origin, which is of length 3.
b) The nomatch argument
We can choose if queries that do not match should return NA or be skipped altogether, using the nomatch argument.
– From the previous example, subset all rows only if there’s a match
flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", nomatch = NULL]
# year month day dep_delay arr_delay carrier origin dest air_time distance hour
# <int> <int> <int> <int> <int> <char> <char> <char> <int> <int> <int>
# 1: 2014 5 23 163 148 MQ LGA XNA 158 1147 18
# 2: 2014 2 3 231 268 EV EWR XNA 184 1131 12
- The default value for nomatch is NA. Setting nomatch = NULL skips queries with no matches.
- The query “JFK”, “XNA” doesn’t match any rows in flights and is therefore skipped.
4. Binary search vs vector scans
We have seen so far how we can set and use keys to subset. But what’s the advantage? For example, instead of doing:
# key by origin,dest columns
flights[.("JFK", "MIA")]
we could have done:
flights[origin == "JFK" & dest == "MIA"]
One advantage very likely is shorter syntax. But even more than that, binary search based subsets are incredibly fast.
Note that data.table gains new optimisations over time, and the latter call is nowadays automatically optimised to use binary search as well. To demonstrate the slow vector scan approach below, the key needs to be removed.
setkey(flights, NULL)
flights[origin == "JFK" & dest == "MIA"]
a) Performance of binary search approach
To illustrate, let’s create a sample data.table with 20 million rows and three columns and key it by columns x and y.
set.seed(2L)
N = 2e7L
DT = data.table(x = sample(letters, N, TRUE),
y = sample(1000L, N, TRUE),
val = runif(N))
print(object.size(DT), units = "Mb")
# 381.5 Mb
DT is ~380MB. It is not really huge, but this will do to illustrate the point.
From what we have seen in the Introduction to data.table section, we can subset those rows where columns x = "g" and y = 877 as follows:
key(DT)
# NULL
## (1) Usual way of subsetting - vector scan approach
t1 <- system.time(ans1 <- DT[x == "g" & y == 877L])
t1
# user system elapsed
# 0.388 0.134 0.522
head(ans1)
# x y val
# <char> <int> <num>
# 1: g 877 0.57059767
# 2: g 877 0.74859806
# 3: g 877 0.03616756
# 4: g 877 0.28087868
# 5: g 877 0.83727299
# 6: g 877 0.43867189
dim(ans1)
# [1] 762 3
Now let’s try to subset by using keys.
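For the keyed subset to work, DT first has to be keyed by x and y. A minimal sketch of that one-time step (its sorting cost is not included in the timing below):
setkeyv(DT, c("x", "y"))   # or equivalently: setkey(DT, x, y)
key(DT)
# [1] "x" "y"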
## (2) Subsetting using keys
t2 <- system.time(ans2 <- DT[.("g", 877L)])
t2
# user system elapsed
# 0.001 0.000 0.001
head(ans2)
# Key: <x, y>
# x y val
# <char> <int> <num>
# 1: g 877 0.57059767
# 2: g 877 0.74859806
# 3: g 877 0.03616756
# 4: g 877 0.28087868
# 5: g 877 0.83727299
# 6: g 877 0.43867189
dim(ans2)
# [1] 762 3
identical(ans1$val, ans2$val)
# [1] TRUE
- The speed-up is ~522x!
b) Why does keying a data.table result in blazing fast subsets?
To understand that, let’s first look at what the vector scan approach (method 1) does.
Vector scan approach
- The column x is searched for the value “g” row by row, on all 20 million of them. This results in a logical vector of size 20 million, with values TRUE, FALSE or NA corresponding to x’s value.
- Similarly, the column y is searched for 877 on all 20 million rows one by one, and stored in another logical vector.
- Element wise & operations are performed on the intermediate logical vectors and all the rows where the expression evaluates to TRUE are returned.
This is what we call a vector scan approach. And this is quite inefficient, especially on larger tables and when one needs repeated subsetting, because it has to scan through all the rows each time.
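Conceptually, the vector scan is roughly equivalent to the sketch below (illustration only, not data.table’s actual internal code; the intermediate names are hypothetical):
match_x <- DT$x == "g"            # logical vector of length 20 million
match_y <- DT$y == 877L           # another logical vector of length 20 million
ans     <- DT[match_x & match_y]  # keep the rows where both are TRUE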
Now let us look at the binary search approach (method 2). Recall from Properties of key - setting keys reorders the data.table by the key columns. Since the data is sorted, we don’t have to scan through the entire length of the column! We can instead use binary search to search for a value in O(log n) as opposed to O(n) in the case of the vector scan approach, where n is the number of rows in the data.table.
Binary search approach
Here’s a very simple illustration. Let’s consider the (sorted) numbers shown below:
1, 5, 10, 19, 22, 23, 30
Suppose we’d like to find the matching position of the value 1 using binary search. This is how we would proceed - because we know that the data is sorted:
- Start with the middle value = 19. Is 1 == 19? No. 1 < 19.
- Since the value we’re looking for is smaller than 19, it should be somewhere before 19. So we can discard the rest of the half that is >= 19.
- Our set is now reduced to 1, 5, 10. Grab the middle value once again = 5. Is 1 == 5? No. 1 < 5.
- Our set is reduced to 1. Is 1 == 1? Yes. The corresponding index is also 1. And that’s the only match.
A vector scan approach, on the other hand, would have to scan through all the values (here, 7).
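To make the idea concrete, here is a toy R implementation of binary search over a sorted numeric vector (purely illustrative; data.table’s real binary search is implemented in C and handles multiple key columns and types):
binsearch <- function(sorted_vec, value) {
  lo <- 1L
  hi <- length(sorted_vec)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L                   # index of the middle element
    if (sorted_vec[mid] == value) return(mid) # found the matching position
    if (sorted_vec[mid] < value) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_                                 # no match
}
binsearch(c(1, 5, 10, 19, 22, 23, 30), 1)
# [1] 1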
It can be seen that with every comparison we reduce the search space by half. This is why binary search based subsets are incredibly fast. Since rows of each column of data.tables have contiguous locations in memory, the operations are performed in a very cache efficient manner (which also contributes to speed).
In addition, since we obtain the matching row indices directly without having to create those huge logical vectors (equal to the number of rows in a data.table), it is quite memory efficient as well.
Summary
In this vignette, we have learnt another method to subset rows in i by keying a data.table. Setting keys allows us to perform blazing fast subsets by using binary search. In particular, we have seen how to
- set a key and subset using the key on a data.table.
- subset using keys, which fetches row indices in i, but much faster.
- combine key based subsets with j and by. Note that the j and by operations are exactly the same as before.
Key based subsets are incredibly fast and are particularly useful when the task involves repeated subsetting. But it may not always be desirable to set a key and physically reorder the data.table. In the next vignette (vignette("datatable-secondary-indices-and-auto-indexing", package="data.table")), we will address this using a new feature - secondary indices.