Skip to contents

dcast.data.table is data.table's long-to-wide reshaping tool. In the spirit of data.table, it is very fast and memory efficient, making it well-suited to handling large data sets in RAM. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage. dcast.data.table can also cast multiple value.var columns and accepts multiple functions to fun.aggregate. See Examples for more.

Usage

# S3 method for data.table
dcast(data, formula, fun.aggregate = NULL, sep = "_",
    ..., margins = NULL, subset = NULL, fill = NULL,
    drop = TRUE, value.var = guess(data),
    verbose = getOption("datatable.verbose"),
    value.var.in.dots = FALSE, value.var.in.LHSdots = value.var.in.dots, 
    value.var.in.RHSdots = value.var.in.dots)

Arguments

data

A data.table.

formula

A formula of the form LHS ~ RHS to cast, see Details.

fun.aggregate

Should the data be aggregated before casting? If the formula doesn't identify a single observation for each cell, then aggregation defaults to length with a warning of class 'dt_missing_fun_aggregate_warning'.

To use multiple aggregation functions, pass a list; see Examples.

sep

Character vector of length 1, indicating the separating character in variable names generated during casting. Default is _ for backwards compatibility.

...

Any other arguments that may be passed to the aggregating function.

margins

Not implemented yet. Should take variable names to compute margins on. A value of TRUE would compute all margins.

subset

Specified if casting should be done on a subset of the data. Ex: subset = .(col1 <= 5) or subset = .(variable != "January").

fill

Value with which to fill missing cells. If fill=NULL and missing cells are present, then fun.aggregate is used on a 0-length vector to obtain a fill value.

drop

FALSE will cast by including all missing combinations.

c(FALSE, TRUE) will only include all missing combinations of formula LHS; c(TRUE, FALSE) will only include all missing combinations of formula RHS. See Examples.

value.var

Name of the column whose values will be filled to cast. Function guess() tries to, well, guess this column automatically, if none is provided.

Cast multiple value.var columns simultaneously by passing their names as a character vector. See Examples.

verbose

Not used yet. May be dropped in the future or used to provide informative messages through the console.

value.var.in.dots

logical; value.var.in.dots = TRUE is shorthand to save setting both value.var.in.LHSdots = TRUE and value.var.in.RHSdots = TRUE.

value.var.in.LHSdots

logical; if TRUE, ... in LHS of the formula includes value.var variables. The default is FALSE, so that ... represents all variables not otherwise mentioned in formula or value.var (including default/guessed value.var).

value.var.in.RHSdots

logical; analogous to value.var.in.LHSdots above, but with respect to RHS of the formula.

Details

The cast formula takes the form LHS ~ RHS, ex: var1 + var2 ~ var3. The order of entries in the formula is essential. There are two special variables: . represents no variable, while ... represents all variables not otherwise mentioned in formula, and value.var depending on value.var.in.LHSdots and value.var.in.RHSdots arguments; see Examples.

When not all combinations of LHS & RHS values are present in the data, some or all (in accordance with drop) missing combinations will replaced with the value specified by fill. Note that fill will be converted to the class of value.var; see Examples.

dcast also allows value.var columns of type list.

When variable combinations in formula don't identify a unique value, fun.aggregate will have to be specified, which defaults to length. For the formula var1 ~ var2, this means there are some (var1, var2) combinations in the data corresponding to multiple rows (i.e. x is not unique by (var1, var2).

The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where value.var is a list, the function should be able to handle a list input and provide a single value or list of length one as output.

If the formula's LHS contains the same column more than once, ex: dcast(DT, x+x~ y), then the answer will have duplicate names. In those cases, the duplicate names are renamed using make.unique so that key can be set without issues.

Names for columns that are being cast are generated in the same order (separated by an underscore, _) from the (unique) values in each column mentioned in the formula RHS.

From v1.9.4, dcast tries to preserve attributes wherever possible.

From v1.9.6, it is possible to cast multiple value.var columns and also cast by providing multiple fun.aggregate functions. Multiple fun.aggregate functions should be provided as a list, for e.g., list(mean, sum, function(x) paste(x, collapse=""). value.var can be either a character vector or list of length one, or a list of length equal to length(fun.aggregate). When value.var is a character vector or a list of length one, each function mentioned under fun.aggregate is applied to every column specified under value.var column. When value.var is a list of length equal to length(fun.aggregate) each element of fun.aggregate is applied to each element of value.var column.

Historical note: dcast.data.table was originally designed as an enhancement to reshape2::dcast in terms of computing and memory efficiency. reshape2 has since been superseded in favour of tidyr, and dcast has had a generic defined within data.table since v1.9.6 in 2015, at which point the dependency between the packages became more etymological than programmatic. We thank the reshape2 authors for the inspiration.

Value

A keyed data.table that has been cast. The key columns are equal to the variables in the formula LHS in the same order.

Examples

ChickWeight = as.data.table(ChickWeight)
setnames(ChickWeight, tolower(names(ChickWeight)))
DT <- melt(as.data.table(ChickWeight), id.vars=2:4) # calls melt.data.table

# dcast is an S3 method in data.table from v1.9.6
dcast(DT, time ~ variable, fun.aggregate=mean)
#> Key: <time>
#>      time    weight
#>     <num>     <num>
#>  1:     0  41.06000
#>  2:     2  49.22000
#>  3:     4  59.95918
#>  4:     6  74.30612
#>  5:     8  91.24490
#>  6:    10 107.83673
#>  7:    12 129.24490
#>  8:    14 143.81250
#>  9:    16 168.08511
#> 10:    18 190.19149
#> 11:    20 209.71739
#> 12:    21 218.68889
dcast(DT, diet ~ variable, fun.aggregate=mean)
#> Key: <diet>
#>      diet   weight
#>    <fctr>    <num>
#> 1:      1 102.6455
#> 2:      2 122.6167
#> 3:      3 142.9500
#> 4:      4 135.2627
dcast(DT, diet+chick ~ time, drop=FALSE)
#> Key: <diet, chick>
#>        diet chick     0     2     4     6     8    10    12    14    16    18
#>      <fctr> <ord> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>   1:      1    18    39    35    NA    NA    NA    NA    NA    NA    NA    NA
#>   2:      1    16    41    45    49    51    57    51    54    NA    NA    NA
#>   3:      1    15    41    49    56    64    68    68    67    68    NA    NA
#>   4:      1    13    41    48    53    60    65    67    71    70    71    81
#>   5:      1     9    42    51    59    68    85    96    90    92    93   100
#>  ---                                                                         
#> 196:      4    49    40    53    64    85   108   128   152   166   184   203
#> 197:      4    46    40    52    62    82   101   120   144   156   173   210
#> 198:      4    50    41    54    67    84   105   122   155   175   205   234
#> 199:      4    42    42    49    63    84   103   126   160   174   204   234
#> 200:      4    48    39    50    62    80   104   125   154   170   222   261
#>         20    21
#>      <num> <num>
#>   1:    NA    NA
#>   2:    NA    NA
#>   3:    NA    NA
#>   4:    91    96
#>   5:   100    98
#>  ---            
#> 196:   233   237
#> 197:   231   238
#> 198:   264   264
#> 199:   269   281
#> 200:   303   322
dcast(DT, diet+chick ~ time, drop=FALSE, fill=0)
#> Key: <diet, chick>
#>        diet chick     0     2     4     6     8    10    12    14    16    18
#>      <fctr> <ord> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>   1:      1    18    39    35     0     0     0     0     0     0     0     0
#>   2:      1    16    41    45    49    51    57    51    54     0     0     0
#>   3:      1    15    41    49    56    64    68    68    67    68     0     0
#>   4:      1    13    41    48    53    60    65    67    71    70    71    81
#>   5:      1     9    42    51    59    68    85    96    90    92    93   100
#>  ---                                                                         
#> 196:      4    49    40    53    64    85   108   128   152   166   184   203
#> 197:      4    46    40    52    62    82   101   120   144   156   173   210
#> 198:      4    50    41    54    67    84   105   122   155   175   205   234
#> 199:      4    42    42    49    63    84   103   126   160   174   204   234
#> 200:      4    48    39    50    62    80   104   125   154   170   222   261
#>         20    21
#>      <num> <num>
#>   1:     0     0
#>   2:     0     0
#>   3:     0     0
#>   4:    91    96
#>   5:   100    98
#>  ---            
#> 196:   233   237
#> 197:   231   238
#> 198:   264   264
#> 199:   269   281
#> 200:   303   322

# using subset
dcast(DT, chick ~ time, fun.aggregate=mean, subset=.(time < 10 & chick < 20))
#> Key: <chick>
#>    chick     0     2     4     6     8
#>    <ord> <num> <num> <num> <num> <num>
#> 1:    18    39    35   NaN   NaN   NaN
#> 2:    16    41    45    49    51    57
#> 3:    15    41    49    56    64    68
#> 4:    13    41    48    53    60    65
#> 5:     9    42    51    59    68    85

# drop argument, #1512
DT <- data.table(v1 = c(1.1, 1.1, 1.1, 2.2, 2.2, 2.2),
                 v2 = factor(c(1L, 1L, 1L, 3L, 3L, 3L), levels=1:3),
                 v3 = factor(c(2L, 3L, 5L, 1L, 2L, 6L), levels=1:6),
                 v4 = c(3L, 2L, 2L, 5L, 4L, 3L))
# drop=TRUE
dcast(DT, v1+v2~v3, value.var='v4')                      # default is drop=TRUE
#> Key: <v1, v2>
#>       v1     v2     1     2     3     5     6
#>    <num> <fctr> <int> <int> <int> <int> <int>
#> 1:   1.1      1    NA     3     2     2    NA
#> 2:   2.2      3     5     4    NA    NA     3
dcast(DT, v1+v2~v3, value.var='v4', drop=FALSE)          # all missing combinations of LHS and RHS
#> Key: <v1, v2>
#>       v1     v2     1     2     3     4     5     6
#>    <num> <fctr> <int> <int> <int> <int> <int> <int>
#> 1:   1.1      1    NA     3     2    NA     2    NA
#> 2:   1.1      2    NA    NA    NA    NA    NA    NA
#> 3:   1.1      3    NA    NA    NA    NA    NA    NA
#> 4:   2.2      1    NA    NA    NA    NA    NA    NA
#> 5:   2.2      2    NA    NA    NA    NA    NA    NA
#> 6:   2.2      3     5     4    NA    NA    NA     3
dcast(DT, v1+v2~v3, value.var='v4', drop=c(FALSE, TRUE)) # all missing combinations of LHS only
#> Key: <v1, v2>
#>       v1     v2     1     2     3     5     6
#>    <num> <fctr> <int> <int> <int> <int> <int>
#> 1:   1.1      1    NA     3     2     2    NA
#> 2:   1.1      2    NA    NA    NA    NA    NA
#> 3:   1.1      3    NA    NA    NA    NA    NA
#> 4:   2.2      1    NA    NA    NA    NA    NA
#> 5:   2.2      2    NA    NA    NA    NA    NA
#> 6:   2.2      3     5     4    NA    NA     3
dcast(DT, v1+v2~v3, value.var='v4', drop=c(TRUE, FALSE)) # all missing combinations of RHS only
#> Key: <v1, v2>
#>       v1     v2     1     2     3     4     5     6
#>    <num> <fctr> <int> <int> <int> <int> <int> <int>
#> 1:   1.1      1    NA     3     2    NA     2    NA
#> 2:   2.2      3     5     4    NA    NA    NA     3

# using . and ...
DT <- data.table(v1 = rep(1:2, each = 6),
                 v2 = rep(rep(1:3, 2), each = 2),
                 v3 = rep(1:2, 6),
                 v4 = rnorm(6))
dcast(DT, ... ~ v3, value.var="v4") # same as v1+v2 ~ v3, value.var="v4"
#> Key: <v1, v2>
#>       v1    v2          1          2
#>    <int> <int>      <num>      <num>
#> 1:     1     1  0.4722138 -1.7874243
#> 2:     1     2 -0.5141861  2.0179949
#> 3:     1     3 -1.3474280  0.8697871
#> 4:     2     1  0.4722138 -1.7874243
#> 5:     2     2 -0.5141861  2.0179949
#> 6:     2     3 -1.3474280  0.8697871
dcast(DT, ... ~ v3, value.var="v4", value.var.in.dots=TRUE) # same as v1+v2+v4~v3, value.var="v4"
#> Key: <v1, v2, v4>
#>        v1    v2         v4          1          2
#>     <int> <int>      <num>      <num>      <num>
#>  1:     1     1 -1.7874243         NA -1.7874243
#>  2:     1     1  0.4722138  0.4722138         NA
#>  3:     1     2 -0.5141861 -0.5141861         NA
#>  4:     1     2  2.0179949         NA  2.0179949
#>  5:     1     3 -1.3474280 -1.3474280         NA
#>  6:     1     3  0.8697871         NA  0.8697871
#>  7:     2     1 -1.7874243         NA -1.7874243
#>  8:     2     1  0.4722138  0.4722138         NA
#>  9:     2     2 -0.5141861 -0.5141861         NA
#> 10:     2     2  2.0179949         NA  2.0179949
#> 11:     2     3 -1.3474280 -1.3474280         NA
#> 12:     2     3  0.8697871         NA  0.8697871
dcast(DT, v1+v2+v3 ~ ., value.var="v4")
#> Key: <v1, v2, v3>
#>        v1    v2    v3          .
#>     <int> <int> <int>      <num>
#>  1:     1     1     1  0.4722138
#>  2:     1     1     2 -1.7874243
#>  3:     1     2     1 -0.5141861
#>  4:     1     2     2  2.0179949
#>  5:     1     3     1 -1.3474280
#>  6:     1     3     2  0.8697871
#>  7:     2     1     1  0.4722138
#>  8:     2     1     2 -1.7874243
#>  9:     2     2     1 -0.5141861
#> 10:     2     2     2  2.0179949
#> 11:     2     3     1 -1.3474280
#> 12:     2     3     2  0.8697871

## for each combination of (v1, v2), add up all values of v4
dcast(DT, v1+v2 ~ ., value.var="v4", fun.aggregate=sum)
#> Key: <v1, v2>
#>       v1    v2          .
#>    <int> <int>      <num>
#> 1:     1     1 -1.3152104
#> 2:     1     2  1.5038087
#> 3:     1     3 -0.4776409
#> 4:     2     1 -1.3152104
#> 5:     2     2  1.5038087
#> 6:     2     3 -0.4776409

# fill and types
dcast(DT, v2~v3, value.var='v1', fun.aggregate=length, fill=0L)  #  0L --> 0
#> Key: <v2>
#>       v2     1     2
#>    <int> <int> <int>
#> 1:     1     2     2
#> 2:     2     2     2
#> 3:     3     2     2
dcast(DT, v2~v3, value.var='v4', fun.aggregate=length, fill=1.1) # 1.1 --> 1L
#> Key: <v2>
#>       v2     1     2
#>    <int> <int> <int>
#> 1:     1     2     2
#> 2:     2     2     2
#> 3:     3     2     2

# multiple value.var and multiple fun.aggregate
DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE),
                z=sample(letters[1:2], 20,TRUE), d1=runif(20), d2=1L)
# multiple value.var
dcast(DT, x+y ~ z, fun.aggregate=sum, value.var=c("d1","d2"))
#> Key: <x, y>
#>        x     y      d1_a      d1_b  d2_a  d2_b
#>    <int> <int>     <num>     <num> <int> <int>
#> 1:     1     2 0.1979177 0.8200896     1     1
#> 2:     2     1 1.1856316 0.9392879     2     2
#> 3:     2     2 0.9029800 0.7041539     2     1
#> 4:     3     1 0.6117537 0.2140010     1     2
#> 5:     3     2 0.0000000 1.5993223     0     3
#> 6:     4     1 0.5987232 0.0000000     1     0
#> 7:     5     1 0.2473545 0.0000000     1     0
#> 8:     5     2 0.2225236 0.1775296     2     1
# multiple fun.aggregate
dcast(DT, x+y ~ z, fun.aggregate=list(sum, mean), value.var="d1")
#> Key: <x, y>
#>        x     y  d1_sum_a  d1_sum_b d1_mean_a d1_mean_b
#>    <int> <int>     <num>     <num>     <num>     <num>
#> 1:     1     2 0.1979177 0.8200896 0.1979177 0.8200896
#> 2:     2     1 1.1856316 0.9392879 0.5928158 0.4696440
#> 3:     2     2 0.9029800 0.7041539 0.4514900 0.7041539
#> 4:     3     1 0.6117537 0.2140010 0.6117537 0.1070005
#> 5:     3     2 0.0000000 1.5993223       NaN 0.5331074
#> 6:     4     1 0.5987232 0.0000000 0.5987232       NaN
#> 7:     5     1 0.2473545 0.0000000 0.2473545       NaN
#> 8:     5     2 0.2225236 0.1775296 0.1112618 0.1775296
# multiple fun.agg and value.var (all combinations)
dcast(DT, x+y ~ z, fun.aggregate=list(sum, mean), value.var=c("d1", "d2"))
#> Key: <x, y>
#>        x     y  d1_sum_a  d1_sum_b d2_sum_a d2_sum_b d1_mean_a d1_mean_b
#>    <int> <int>     <num>     <num>    <int>    <int>     <num>     <num>
#> 1:     1     2 0.1979177 0.8200896        1        1 0.1979177 0.8200896
#> 2:     2     1 1.1856316 0.9392879        2        2 0.5928158 0.4696440
#> 3:     2     2 0.9029800 0.7041539        2        1 0.4514900 0.7041539
#> 4:     3     1 0.6117537 0.2140010        1        2 0.6117537 0.1070005
#> 5:     3     2 0.0000000 1.5993223        0        3       NaN 0.5331074
#> 6:     4     1 0.5987232 0.0000000        1        0 0.5987232       NaN
#> 7:     5     1 0.2473545 0.0000000        1        0 0.2473545       NaN
#> 8:     5     2 0.2225236 0.1775296        2        1 0.1112618 0.1775296
#>    d2_mean_a d2_mean_b
#>        <num>     <num>
#> 1:         1         1
#> 2:         1         1
#> 3:         1         1
#> 4:         1         1
#> 5:       NaN         1
#> 6:         1       NaN
#> 7:         1       NaN
#> 8:         1         1
# multiple fun.agg and value.var (one-to-one)
dcast(DT, x+y ~ z, fun.aggregate=list(sum, mean), value.var=list("d1", "d2"))
#> Key: <x, y>
#>        x     y  d1_sum_a  d1_sum_b d2_mean_a d2_mean_b
#>    <int> <int>     <num>     <num>     <num>     <num>
#> 1:     1     2 0.1979177 0.8200896         1         1
#> 2:     2     1 1.1856316 0.9392879         1         1
#> 3:     2     2 0.9029800 0.7041539         1         1
#> 4:     3     1 0.6117537 0.2140010         1         1
#> 5:     3     2 0.0000000 1.5993223       NaN         1
#> 6:     4     1 0.5987232 0.0000000         1       NaN
#> 7:     5     1 0.2473545 0.0000000         1       NaN
#> 8:     5     2 0.2225236 0.1775296         1         1