Split data.table into chunks in a list
Split method for data.table. Faster and more flexible. Be aware that processing a list of data.tables will generally be much slower than manipulating a single data.table by group using the by argument; read more on data.table.
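As a rough illustration of the trade-off described above, the sketch below computes the same per-group totals twice: once with by inside a single data.table, and once by splitting, processing each chunk, and recombining with rbindlist. The table and its column names (g, v) are made up for the example.

```r
library(data.table)

# Hypothetical toy data: grouping column g, value column v
DT <- data.table(g = c("a", "b", "a", "b"), v = 1:4)

# Grouped computation inside a single data.table (usually the faster route)
by_group <- DT[, .(total = sum(v)), by = g]

# Same totals via split(): build a list of data.tables, process each, recombine
chunks <- split(DT, by = "g")
via_split <- rbindlist(lapply(chunks, function(d) d[, .(total = sum(v)), by = g]))

# Both routes should agree
all.equal(by_group, via_split)
```

Splitting is still useful when each chunk has to be handed to code that expects a separate table, for example writing one file per group.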
Usage
# S3 method for data.table
split(x, f, drop = FALSE,
by, sorted = FALSE, keep.by = TRUE, flatten = TRUE,
..., verbose = getOption("datatable.verbose"))
Arguments
- x
data.table
- f
Same as split.data.frame. Use the by argument instead; f exists only for consistency with the data.frame method.
- drop
logical. The default FALSE will not drop empty list elements caused by factor levels not referred to by those factors. Also works with the new arguments of the data.table split method.
- by
character vector. Column names on which the split should be made. For length(by) > 1L and flatten FALSE, the result is nested lists with data.tables as leaves.
- sorted
With the default FALSE, the order of the groups being split on is retained. When TRUE, sorted list(s) are returned. Has no effect when the f argument is used.
- keep.by
logical, default TRUE. Keep the columns provided to the by argument.
- flatten
logical, default TRUE, will unlist nested lists of data.tables. When using f, results are always flattened to a list of data.tables.
- ...
When using f, passed to split.data.frame. When using by, sep is recognized as with the default method.
- verbose
logical, default getOption("datatable.verbose") (FALSE unless changed). When TRUE, the data.table split query used to split the data is printed to the console.
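The keep.by argument and the sep value passed through ... are not covered elsewhere on this page, so here is a minimal sketch of both. The table and its column names (g1, g2, v) are invented for the example.

```r
library(data.table)

# Hypothetical toy data with two grouping columns
DT <- data.table(g1 = c("b", "a", "b"), g2 = c("y", "x", "x"), v = 1:3)

# keep.by = FALSE drops the grouping column from every chunk
no_key <- split(DT, by = "g1", keep.by = FALSE)
names(no_key[["b"]])   # only the non-grouping columns remain

# sep controls how multi-column group names are pasted together
names(split(DT, by = c("g1", "g2"), sep = "_"))
```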
Details
Argument f exists only for consistency with the data.frame method. Using the by argument instead is recommended: it is faster, more flexible, and by default preserves the order in which groups appear in the data.
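The ordering difference can be seen on a small table. This sketch assumes a made-up table whose first group in row order ("b") sorts after the second ("a"). With f, the grouping vector is turned into a factor as in the base method, so the list names follow the sorted factor levels.

```r
library(data.table)

# Hypothetical toy data: "b" appears before "a" in row order
DT <- data.table(g = c("b", "a", "b"), v = 1:3)

by_default <- names(split(DT, by = "g"))                 # order of first appearance
by_sorted  <- names(split(DT, by = "g", sorted = TRUE))  # sorted group order
via_f      <- names(split(DT, f = DT$g))                 # follows sorted factor levels

by_default
by_sorted
via_f
```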
Value
List of data.tables. If flatten is FALSE and length(by) > 1L, recursively nested lists are returned, with data.tables as leaves, grouped according to the by argument.
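To make the two return shapes concrete, the sketch below reaches the same group in both: by its pasted name in the flat list, and by one name per by column in the nested list. The table and its column names (g1, g2, v) are invented for the example.

```r
library(data.table)

# Hypothetical toy data with two grouping columns
DT <- data.table(g1 = c("b", "a", "b"), g2 = c("y", "x", "x"), v = 1:3)

flat   <- split(DT, by = c("g1", "g2"))                   # one list level, names like "b.x"
nested <- split(DT, by = c("g1", "g2"), flatten = FALSE)  # one list level per by column

# The same group, reached through each structure
leaf_flat   <- flat[["b.x"]]
leaf_nested <- nested[["b"]][["x"]]

names(nested)         # values of g1
names(nested[["b"]])  # values of g2 found within g1 == "b"
```

The nested form suits code that walks the grouping hierarchy level by level, while the flat form is easier to pass to lapply or rbindlist.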
Examples
library(data.table)
set.seed(123)
DT = data.table(x1 = rep(letters[1:2], 6),
                x2 = rep(letters[3:5], 4),
                x3 = rep(letters[5:8], 3),
                y = rnorm(12))
DT = DT[sample(.N)]
DF = as.data.frame(DT)
# split consistency with data.frame: `x, f, drop`
all.equal(
  split(DT, list(DT$x1, DT$x2)),
  lapply(split(DF, list(DF$x1, DF$x2)), setDT)
)
#> [1] TRUE
# flat list (default) and nested list using the `flatten` argument
split(DT, by=c("x1", "x2"))
#> $a.e
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: a e g 1.5587083
#> 2: a e e -0.6868529
#>
#> $b.d
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: b d h -1.2650612
#> 2: b d f -0.2301775
#>
#> $b.c
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: b c f -0.44566197
#> 2: b c h 0.07050839
#>
#> $a.c
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: a c g 0.4609162
#> 2: a c e -0.5604756
#>
#> $b.e
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: b e f 1.7150650
#> 2: b e h 0.3598138
#>
#> $a.d
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: a d g 1.2240818
#> 2: a d e 0.1292877
#>
split(DT, by=c("x1", "x2"), flatten=FALSE)
#> $a
#> $a$e
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: a e g 1.5587083
#> 2: a e e -0.6868529
#>
#> $a$c
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: a c g 0.4609162
#> 2: a c e -0.5604756
#>
#> $a$d
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: a d g 1.2240818
#> 2: a d e 0.1292877
#>
#>
#> $b
#> $b$d
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: b d h -1.2650612
#> 2: b d f -0.2301775
#>
#> $b$c
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: b c f -0.44566197
#> 2: b c h 0.07050839
#>
#> $b$e
#> x1 x2 x3 y
#> <char> <char> <char> <num>
#> 1: b e f 1.7150650
#> 2: b e h 0.3598138
#>
#>
# dealing with factors
fdt = DT[, c(lapply(.SD, as.factor), list(y=y)), .SDcols=x1:x3]
fdf = as.data.frame(fdt)
sdf = split(fdf, list(fdf$x1, fdf$x2))
all.equal(
  split(fdt, by=c("x1", "x2"), sorted=TRUE),
  lapply(sdf[sort(names(sdf))], setDT)
)
#> [1] TRUE
# factors having unused levels, drop FALSE, TRUE
fdt = DT[, .(x1 = as.factor(c(as.character(x1), "c"))[-13L],
             x2 = as.factor(c("a", as.character(x2)))[-1L],
             x3 = as.factor(c("a", as.character(x3), "z"))[c(-1L,-14L)],
             y = y)]
fdf = as.data.frame(fdt)
sdf = split(fdf, list(fdf$x1, fdf$x2))
all.equal(
  split(fdt, by=c("x1", "x2"), sorted=TRUE),
  lapply(sdf[sort(names(sdf))], setDT)
)
#> [1] TRUE
sdf = split(fdf, list(fdf$x1, fdf$x2), drop=TRUE)
all.equal(
  split(fdt, by=c("x1", "x2"), sorted=TRUE, drop=TRUE),
  lapply(sdf[sort(names(sdf))], setDT)
)
#> [1] TRUE