Create key on a data.table
setkey sorts a data.table and marks it as sorted with an attribute "sorted". The sorted columns are the key. The key can be any number of columns. The data is always sorted in ascending order with NAs (if any) always first. The table is changed by reference and there is no memory used for the key (other than marking which columns the data is sorted by).
There are three reasons setkey is desirable:
- binary search and joins are faster when they detect they can use an existing key
- grouping by a leading subset of the key columns is faster because the groups are already gathered contiguously in RAM
- simpler shorter syntax; e.g. DT["id",] finds the group "id" in the first column of DT's key using binary search. It may be helpful to think of a key as super-charged rownames: multi-column and multi-type.
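The points above can be sketched with a small keyed table. This is only an illustrative example: the table and its column names (id, x) are made up here and are not part of the package.

```r
library(data.table)

# Hypothetical toy data; the column names are illustrative only.
DT = data.table(id = c("b", "a", "c", "a"), x = 1:4)
setkey(DT, id)                  # sorts DT by id and marks id as the key

sub = DT["a", ]                 # binary search on the first key column
sub$x                           # rows where id == "a", in their original relative order

tot = DT[, sum(x), by = id]     # grouping by a leading key column
```

After setkey, the rows for each id sit next to each other, which is what makes both the subset and the grouping cheap.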
NAs are always first because:
- NA is internally INT_MIN (a large negative number) in R. Keys and indexes are always in increasing order so if NAs are first, no special treatment or branch is needed in many data.table internals involving binary search. There is no option to place NAs last; this is for speed, simplicity and robustness of internals at C level.
- if any NAs are present then we believe it is better to display them up front (rather than hiding them at the end) to reduce the risk of not realizing NAs are present.
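A minimal sketch of the NA placement, using a made-up single-column table (the column name v is an assumption for illustration):

```r
library(data.table)

# Hypothetical integer column containing one missing value.
DT = data.table(v = c(3L, NA, 1L))
setkey(DT, v)

DT$v    # the NA is placed first, followed by the values in ascending order
```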
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all other than for temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions data.table provides.
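One way to see the by-reference behaviour is to compare the object's memory address before and after the call, using data.table's address() function. The table below is a made-up example, and this is a sketch rather than a formal check.

```r
library(data.table)

# Hypothetical toy data.
DT = data.table(A = 3:1, B = c("c", "b", "a"))

before = address(DT)            # address of the data.table object
setkey(DT, A)                   # reorders the rows in place
same = identical(before, address(DT))

DT[, C := A * 2L]               # := also adds the column by reference
```

Here same is expected to be TRUE because setkey reorders the existing object instead of creating a new one.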
setindex creates an index for the provided columns. This index is simply an ordering vector of the dataset's rows according to the provided columns. This order vector is stored as an attribute of the data.table and the dataset retains the original order of rows in memory. See vignette("datatable-secondary-indices-and-auto-indexing") for more details.
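The difference from setkey can be sketched as follows, with a made-up table: the index is recorded, but the rows stay where they are and no key is set.

```r
library(data.table)

# Hypothetical toy data, deliberately unsorted on A.
DT = data.table(A = c(2L, 3L, 1L), B = c("y", "z", "x"))
setindex(DT, A)

DT$A          # row order in memory is unchanged
indices(DT)   # the index on A is recorded
haskey(DT)    # an index is not a key
```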
key returns the data.table's key if it exists; NULL if none exists.
haskey returns TRUE if the data.table has a key and FALSE otherwise.
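A short sketch of key and haskey before and after setting a key, and after removing it again with NULL. The table is a made-up example.

```r
library(data.table)

# Hypothetical toy data.
DT = data.table(A = 3:1)

k0 = key(DT)        # NULL: no key yet
h0 = haskey(DT)     # FALSE

setkey(DT, A)
k1 = key(DT)        # "A"
h1 = haskey(DT)     # TRUE

setkey(DT, NULL)    # NULL removes the key
k2 = key(DT)        # NULL again
```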
Arguments
- x
A data.table.
- ...
The columns to sort by. Do not quote the column names. If ... is missing (i.e. setkey(DT)), all the columns are used. NULL removes the key.
- cols
A character vector of column names. For setindexv, this can be a list of character vectors, in which case each element will be applied as an index in turn.
- verbose
Output status and information.
- physical
TRUE changes the order of the data in RAM. FALSE adds an index.
- vectors
logical scalar, default FALSE; when set to TRUE, a list of character vectors is returned, each referring to one index.
Details
setkey reorders (i.e. sorts) the rows of a data.table by the columns provided. The sort method used has developed over the years and we have contributed to base R too; see sort. Generally speaking we avoid any type of comparison sort (other than insert sort for very small input) preferring instead counting sort and forwards radix. We also avoid hash tables.
Note that setkey always uses "C-locale"; see the Details in the help for setorder for more on why.
The sort is stable; i.e., the order of ties (if any) is preserved.
For character vectors, data.table takes advantage of R's internal global string cache, also exported as chorder.
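Both the C-locale ordering and the stability of the sort can be seen in a small made-up example. The column i simply records the original row position so that the treatment of ties is visible.

```r
library(data.table)

# Hypothetical toy data: mixed-case strings with one tie ("a" appears twice).
DT = data.table(g = c("b", "A", "a", "B", "a"), i = 1:5)
setkey(DT, g)

DT$g    # C-locale: upper-case letters sort before lower-case ones
DT$i    # the two "a" rows keep their original relative order (3 before 5)
```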
Good practice
In general, it's good practice to use column names rather than numbers. This is why setkey and setkeyv only accept column names.
If you use column numbers then bugs (possibly silent) can more easily creep into your code as time progresses if changes are made elsewhere in your code; e.g., if you add, remove or reorder columns in a few months' time, a setkey by column number will then refer to a different column, possibly returning incorrect results with no warning. (A similar concept exists in SQL, where "select * from ..." is considered poor programming style when a robust, maintainable system is required.)
If you really wish to use column numbers, it is possible but deliberately a little harder; e.g., setkeyv(DT,names(DT)[1:2]).
If you want to subset rows based on values of an integer key column, it should be done with the dot (.) syntax, because integers are otherwise interpreted as row numbers (see example).
If you want to use grep to select key columns according to a pattern, note that you can just set value = TRUE to return a character vector instead of the default integer indices.
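As a sketch of the grep approach, assuming a made-up table whose key columns share the prefix "id_":

```r
library(data.table)

# Hypothetical toy data; the "id_" prefix is an illustrative naming convention.
DT = data.table(id_a = c(2L, 1L), id_b = c("y", "x"), val = c(0.5, 1.5))

# value = TRUE returns the matching names, not their positions
keycols = grep("^id_", names(DT), value = TRUE)
setkeyv(DT, keycols)

key(DT)
```

Because keycols holds names, the call keeps working if columns are later added or reordered.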
Value
The input is modified by reference and returned (invisibly) so it can be used in compound statements; e.g., setkey(DT,a)[.("foo")]. If you require a copy, take a copy first (using DT2=copy(DT)). copy may also sometimes be useful before := is used to subassign to a column by reference.
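The compound-statement form can be sketched with a made-up table (the column names a and v are illustrative):

```r
library(data.table)

# Hypothetical toy data.
DT = data.table(a = c("foo", "bar", "foo"), v = 1:3)

# setkey returns DT invisibly, so the keyed subset can follow immediately
res = setkey(DT, a)[.("foo")]

res$v      # values from the rows where a == "foo"
key(DT)    # DT itself is now keyed by a
```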
References
https://en.wikipedia.org/wiki/Radix_sort
https://en.wikipedia.org/wiki/Counting_sort
http://stereopsis.com/radix.html
https://codercorner.com/RadixSortRevisited.htm
https://cran.r-project.org/package=bit64
https://github.com/Rdatatable/data.table/wiki/Presentations
See also
data.table, tables, J, sort.list, copy, setDT, setDF, set, :=, setorder, setcolorder, setattr, setnames, chorder, setNumericRounding
Examples
# Type 'example(setkey)' to run these at the prompt and browse output
DT = data.table(A=5:1,B=letters[5:1])
DT # before
#> A B
#> <int> <char>
#> 1: 5 e
#> 2: 4 d
#> 3: 3 c
#> 4: 2 b
#> 5: 1 a
setkey(DT,B) # re-orders table and marks it sorted.
DT # after
#> Key: <B>
#> A B
#> <int> <char>
#> 1: 1 a
#> 2: 2 b
#> 3: 3 c
#> 4: 4 d
#> 5: 5 e
tables() # KEY column reports the key'd columns
#> NAME NROW NCOL MB COLS KEY
#> 1: DT 5 2 0 A,B B
#> Total: 0MB using type_size
key(DT)
#> [1] "B"
keycols = c("A","B")
setkeyv(DT,keycols)
DT = data.table(A=5:1,B=letters[5:1])
DT2 = DT # does not copy
setkey(DT2,B) # does not copy-on-write to DT2
identical(DT,DT2) # TRUE. DT and DT2 are two names for the same keyed table
#> [1] TRUE
DT = data.table(A=5:1,B=letters[5:1])
DT2 = copy(DT) # explicit copy() needed to copy a data.table
setkey(DT2,B) # now just changes DT2
identical(DT,DT2) # FALSE. DT and DT2 are now different tables
#> [1] FALSE
DT = data.table(A=5:1,B=letters[5:1])
setindex(DT) # set indices
setindex(DT, A)
setindex(DT, B)
indices(DT) # get indices single vector
#> [1] "A__B" "A" "B"
indices(DT, vectors = TRUE) # get indices list
#> [[1]]
#> [1] "A" "B"
#>
#> [[2]]
#> [1] "A"
#>
#> [[3]]
#> [1] "B"
#>
# Use the dot .(subset_value) syntax with integer keys:
DT = data.table(id = 2:1)
setkey(DT, id)
subset_value <- 1
DT[subset_value] # treats subset_value as a row number
#> Key: <id>
#> id
#> <int>
#> 1: 1
DT[.(subset_value)] # matches subset_value against key column (id)
#> Key: <id>
#> id
#> <int>
#> 1: 1