Fast row reordering of a data.table by reference
setorder.Rd
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column. The only other data.table operator that modifies input by reference is :=. Check out the See Also section below for other set* functions that data.table provides.
setorder (and setorderv) reorders the rows of a data.table based on the columns (and column order) provided. It reorders the table by reference and is therefore very memory efficient.
Note that queries like x[order(.)] are optimised internally to use data.table's fast order.
Also note that data.table always reorders in "C-locale" (see Details). To sort by session locale, use x[base::order(.)].
The bit64::integer64 type is also supported for reordering rows of a data.table.
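As a rough illustration of reordering by reference, the sketch below sorts a small made-up table and checks that the object's memory address is unchanged. The table DT and its columns a, b and v are hypothetical and chosen only for this example.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(a = c(2L, 1L, 2L, 1L),
                 b = c("x", "y", "z", "w"),
                 v = 1:4)

addr_before <- address(DT)

# Sort by a ascending, then b descending; DT itself is modified
setorder(DT, a, -b)

DT$v                                  # 2 4 3 1
identical(address(DT), addr_before)   # TRUE: same object, no copy of DT
```

No assignment is needed: after the call, DT holds the reordered rows.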
Usage
setorder(x, ..., na.last=FALSE)
setorderv(x, cols = colnames(x), order=1L, na.last=FALSE)
# optimised to use data.table's internal fast order
# x[order(., na.last=TRUE)]
# x[order(., decreasing=TRUE)]
Arguments
- x
A data.table.
- ...
The columns to sort by. Do not quote column names. If ... is missing (e.g., setorder(x)), x is rearranged based on all columns in ascending order by default. To sort by a column in descending order, prefix the column name with "-", which means "descending" (not "negative", in this context), i.e., setorder(x, a, -b, c). The -b works when b is of type character as well.
- cols
A character vector of column names of x by which to order. By default, sorts over all columns; cols = NULL will return x untouched. Do not add "-" here. Use the order argument instead.
- order
An integer vector with only possible values of 1 and -1, corresponding to ascending and descending order. The length of order must be either 1 or equal to that of cols. If length(order) == 1, it is recycled to length(cols).
- na.last
logical. If TRUE, missing values in the data are placed last; if FALSE, they are placed first; if NA they are removed. na.last=NA is valid only for x[order(., na.last)] and its default is TRUE. setorder and setorderv only accept TRUE/FALSE with default FALSE.
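The cols and order arguments of setorderv can be sketched as follows. The table and its columns g and n are invented for illustration; the calls use only the arguments listed above.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(g = c("b", "a", "b", "a"), n = c(1L, 2L, 3L, 4L))

# One order value per column: g ascending, n descending within g
setorderv(DT, cols = c("g", "n"), order = c(1L, -1L))
n_mixed <- DT$n        # 4 2 3 1

# A single order value is recycled to all columns: both descending
setorderv(DT, cols = c("g", "n"), order = -1L)
n_desc <- DT$n         # 3 1 4 2
```

Because cols is an ordinary character vector, this form is convenient when the sort columns are only known at run time.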
Details
data.table implements its own fast radix-based ordering. See the references for some exposition on the concept of radix sort.
setorder accepts unquoted column names (with names preceded by a - sign for descending order) and reorders data.table rows by reference, e.g., setorder(x, a, -b, c). We emphasize that this means "descending" and not "negative" because the implementation simply reverses the sort order, as opposed to sorting the opposite of the input (which would be inefficient).
Note that -b also works with columns of type character, unlike base::order, which requires -xtfrm(y) instead (which is slow).
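A small sketch of the descending-character case, next to the base R workaround mentioned above. The data are made up and use upper-case letters only, so the result should not depend on the session locale.

```r
library(data.table)

# Hypothetical example data: a character column
DT <- data.table(b = c("B", "A", "C"))

# "-" means descending, even for character columns
setorder(DT, -b)
DT$b                           # "C" "B" "A"

# Base R equivalent: negate the numeric ranks returned by xtfrm()
y <- c("B", "A", "C")
y_desc <- y[order(-xtfrm(y))]  # "C" "B" "A"
```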
setorderv in turn accepts a character vector of column names and an integer vector of column order separately.
Note that setkey still sorts only in ascending order, and differs from setorder in that it additionally sets the sorted attribute.
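The difference between setkey and setorder can be seen in a minimal sketch like the one below, where DT1 and DT2 are hypothetical tables holding the same data.

```r
library(data.table)

# Two hypothetical tables with identical content
DT1 <- data.table(a = c(3L, 1L, 2L))
DT2 <- copy(DT1)

setkey(DT1, a)     # sorts ascending and sets the "sorted" attribute
setorder(DT2, a)   # sorts ascending only

key(DT1)                  # "a"
key(DT2)                  # NULL
identical(DT1$a, DT2$a)   # TRUE: same row order either way
```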
The na.last argument is FALSE by default for setorder and setorderv, to be consistent with data.table's setkey, and TRUE for x[order(.)], to be consistent with base::order. Only x[order(.)] can have na.last = NA, as it is a subset operation, as opposed to setorder or setorderv, which reorder the data.table by reference.
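The na.last behaviours described above can be compared side by side. This is only a sketch on an invented one-column table.

```r
library(data.table)

# Hypothetical example data with one missing value
DT <- data.table(a = c(2L, NA, 1L))

setorder(DT, a)                   # default na.last = FALSE
na_first <- DT$a                  # NA 1 2

setorder(DT, a, na.last = TRUE)
na_end <- DT$a                    # 1 2 NA

# x[order(.)] is a subset operation, so it can drop the NA rows
dropped <- DT[order(a, na.last = NA)]
dropped$a                         # 1 2
nrow(DT)                          # 3: DT itself still has all rows
```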
data.table always reorders in "C-locale". As a consequence, the ordering may be different to that obtained by base::order. For example, sorting in the C locale is case-sensitive and places upper-case letters before lower-case ones, which English locales typically do not. Thus, sorting c("c", "a", "B") returns c("B", "a", "c") in data.table but c("a", "B", "c") with base::order in an English locale. Note that this makes no difference for most data; both return identical results on ids where only upper-case or only lower-case letters are present ("AB123" < "AC234" is true in both), or on country names and other proper nouns which are consistently capitalized. For example, neither "America" < "Brazil" nor "america" < "brazil" is affected, since the case of the first letter is consistent.
Using C-locale makes the behaviour of sorting in data.table more consistent across sessions and locales. The behaviour of base::order depends on assumptions about the locale of the R session. In English locales, "america" < "BRAZIL" is true by default but false if you either call Sys.setlocale(locale="C") or the R session has been started in a C locale for you, which can happen on servers/services since the locale comes from the environment the R session was started in. By contrast, "america" < "BRAZIL" is always FALSE in data.table regardless of the way your R session was started.
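The mixed-case example from the text can be reproduced with a short sketch. The column name x is arbitrary, and the base::order result is deliberately left without an expected value because it depends on the session locale.

```r
library(data.table)

# Hypothetical example data mixing upper and lower case
DT <- data.table(x = c("c", "a", "B"))

# Session-locale ordering via base::order (returns a new table)
by_locale <- DT[base::order(x)]

# data.table ordering is always C-locale: upper case sorts first
setorder(DT, x)
DT$x    # "B" "a" "c"
```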
If setorder results in reordering of the rows of a keyed data.table, then its key will be set to NULL.
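A minimal sketch of the key being dropped, assuming a made-up table keyed on a and then reordered by a different column:

```r
library(data.table)

# Hypothetical example data
DT <- data.table(a = c(1L, 2L, 3L), b = c("z", "y", "x"))

setkey(DT, a)
key_before <- key(DT)   # "a"

# Reordering by b changes the row order, so the key no longer holds
setorder(DT, b)
key_after <- key(DT)    # NULL
```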
Value
The input is modified by reference, and returned (invisibly) so it can be used in compound statements; e.g., setorder(DT,a,-b)[, cumsum(c), by=list(a,b)]. If you require a copy, take a copy first (using DT2 = copy(DT)). See copy.
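The two points above, chaining on the returned table and copying first to protect the original, might be combined as in this sketch. The data and the names DT2 and res are hypothetical.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(a = c(2L, 1L, 1L), b = c(1L, 2L, 1L), c = 1:3)

# Take a copy first so that DT keeps its original row order
DT2 <- copy(DT)

# setorder() returns its input, so a query can be chained directly
res <- setorder(DT2, a, -b)[, cumsum(c), by = list(a, b)]

res$V1   # 2 3 1
DT$c     # 1 2 3: the original is untouched
```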
References
https://en.wikipedia.org/wiki/Radix_sort
https://en.wikipedia.org/wiki/Counting_sort
http://stereopsis.com/radix.html
https://codercorner.com/RadixSortRevisited.htm
https://medium.com/basecs/getting-to-the-root-of-sorting-with-radix-sort-f8e9240d4224
See also
setkey, setcolorder, setattr, setnames, set, :=, setDT, setDF, copy, setNumericRounding