This vignette discusses the default usage of reshaping functions
melt (wide to long) and
dcast (long to wide) for data.tables as well as the new extended functionalities of melting and casting on multiple columns available from
We will load the data sets directly within sections.
dcast functions for
data.tables are for reshaping wide-to-long and long-to-wide, respectively; the implementations are specifically designed with large in-memory data (e.g. 10Gb) in mind.
In this vignette, we will
First briefly look at the default
data.tables to convert them from wide to long format and vice versa
Look at scenarios where the current functionalities become cumbersome and inefficient
Finally look at the new improvements to both
dcast methods for
data.tables to handle multiple columns simultaneously.
The extended functionalities are in line with
data.table’s philosophy of performing operations efficiently and in a straightforward manner.
data.tables (wide to long)
Suppose we have a
data.table (artificial data) as shown below:
s1 <- "family_id age_mother dob_child1 dob_child2 dob_child3 1 30 1998-11-26 2000-01-29 NA 2 27 1996-06-22 NA NA 3 26 2002-07-11 2004-04-05 2007-09-02 4 32 2004-10-10 2009-08-27 2012-07-21 5 29 2000-12-05 2005-02-28 NA" DT <- fread(s1) DT # family_id age_mother dob_child1 dob_child2 dob_child3 # [ reached getOption("max.print") -- omitted 6 rows ] ## dob stands for date of birth. str(DT) # Classes 'data.table' and 'data.frame': 5 obs. of 5 variables: # $ family_id : int 1 2 3 4 5 # $ age_mother: int 30 27 26 32 29 # $ dob_child1: IDate, format: "1998-11-26" "1996-06-22" "2002-07-11" ... # $ dob_child2: IDate, format: "2000-01-29" NA "2004-04-05" ... # $ dob_child3: IDate, format: NA NA "2007-09-02" ... # - attr(*, ".internal.selfref")=<externalptr>
DTto long form where each
dobis a separate observation.
We could accomplish this using
melt() by specifying
measure.vars arguments as follows:
DT.m1 = melt(DT, id.vars = c("family_id", "age_mother"), measure.vars = c("dob_child1", "dob_child2", "dob_child3")) DT.m1 # family_id age_mother variable value # <int> <int> <fctr> <IDat> # [ reached getOption("max.print") -- omitted 15 rows ] str(DT.m1) # Classes 'data.table' and 'data.frame': 15 obs. of 4 variables: # $ family_id : int 1 2 3 4 5 1 2 3 4 5 ... # $ age_mother: int 30 27 26 32 29 30 27 26 32 29 ... # $ variable : Factor w/ 3 levels "dob_child1","dob_child2",..: 1 1 1 1 1 2 2 2 2 2 ... # $ value : IDate, format: "1998-11-26" "1996-06-22" "2002-07-11" ... # - attr(*, ".internal.selfref")=<externalptr>
measure.vars specify the set of columns we would like to collapse (or combine) together.
We can also specify column indices instead of names.
variable column is of type
variable.factor argument to
FALSE if you’d like to return a
character vector instead.
By default, the molten columns are automatically named
melt preserves column attributes in result.
By default, when one of
measure.vars is missing, the rest of the columns are automatically assigned to the missing argument.
measure.vars are specified, as mentioned under
?melt, all non-
logical columns will be assigned to
In addition, a warning message is issued highlighting the columns that are automatically considered to be
data.tables (long to wide)
In the previous section, we saw how to get from wide form to long form. Let’s see the reverse operation in this section.
That is, we’d like to collect all child observations corresponding to each
family_id, age_mother together under the same row. We can accomplish it using
dcast as follows:
dcast uses formula interface. The variables on the LHS of formula represents the id vars and RHS the measure vars.
value.var denotes the column to be filled in with while casting to wide format.
dcast also tries to preserve attributes in result wherever possible.
DT.m1, how can we get the number of children in each family?
You can also pass a function to aggregate by in
dcast with the argument
fun.aggregate. This is particularly essential when the formula provided does not identify single observation for each cell.
?dcast for other useful arguments and additional examples.
So far we’ve seen features of
dcast that are implemented efficiently for
data.tables, using internal
data.table machinery (fast radix ordering, binary search etc..).
However, there are situations we might run into where the desired operation is not expressed in a straightforward manner. For example, consider the
data.table shown below:
s2 <- "family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3 1 30 1998-11-26 2000-01-29 NA 1 2 NA 2 27 1996-06-22 NA NA 2 NA NA 3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1 4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1 5 29 2000-12-05 2005-02-28 NA 2 1 NA" DT <- fread(s2) DT # family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3 # [ reached getOption("max.print") -- omitted 6 rows ] ## 1 = female, 2 = male
And you’d like to combine (
melt) all the
dob columns together, and
gender columns together. Using the current functionality, we can do something like this:
DT.m1 = melt(DT, id = c("family_id", "age_mother")) DT.m1[, c("variable", "child") := tstrsplit(variable, "_", fixed = TRUE)] DT.c1 = dcast(DT.m1, family_id + age_mother + child ~ variable, value.var = "value") DT.c1 # Key: <family_id, age_mother, child> # family_id age_mother child dob gender # [ reached getOption("max.print") -- omitted 16 rows ] str(DT.c1) ## gender column is character type now! # Classes 'data.table' and 'data.frame': 15 obs. of 5 variables: # $ family_id : int 1 1 1 2 2 2 3 3 3 4 ... # $ age_mother: int 30 30 30 27 27 27 26 26 26 32 ... # $ child : chr "child1" "child2" "child3" "child1" ... # $ dob : IDate, format: "1998-11-26" "2000-01-29" NA ... # $ gender : IDate, format: "1970-01-02" "1970-01-03" NA ... # - attr(*, ".internal.selfref")=<externalptr> # - attr(*, "sorted")= chr [1:3] "family_id" "age_mother" "child"
What we wanted to do was to combine all the
gender type columns together respectively. Instead we are combining everything together, and then splitting them again. I think it’s easy to see that it’s quite roundabout (and inefficient).
As an analogy, imagine you’ve a closet with four shelves of clothes and you’d like to put together the clothes from shelves 1 and 2 together (in 1), and 3 and 4 together (in 3). What we are doing is more or less to combine all the clothes together, and then split them back on to shelves 1 and 3!
The columns to
melt may be of different types, as in this case (
integer types). By
melting them all together, the columns will be coerced in result, as explained by the warning message above and shown from output of
gender has been converted to
We are generating an additional column by splitting the
variable column into two columns, whose purpose is quite cryptic. We do it because we need it for casting in the next step.
Finally, we cast the data set. But the issue is it’s a much more computationally involved operation than melt. Specifically, it requires computing the order of the variables in formula, and that’s costly.
stats::reshape is capable of performing this operation in a very straightforward manner. It is an extremely useful and often underrated function. You should definitely give it a try!
Since we’d like for
data.tables to perform this operation straightforward and efficient using the same interface, we went ahead and implemented an additional functionality, where we can
melt to multiple columns simultaneously.
meltmultiple columns simultaneously
The idea is quite simple. We pass a list of columns to
measure.vars, where each element of the list contains the columns that should be combined together.
colA = paste0("dob_child", 1:3) colB = paste0("gender_child", 1:3) DT.m2 = melt(DT, measure = list(colA, colB), value.name = c("dob", "gender")) DT.m2 # family_id age_mother variable dob gender # [ reached getOption("max.print") -- omitted 16 rows ] str(DT.m2) ## col type is preserved # Classes 'data.table' and 'data.frame': 15 obs. of 5 variables: # $ family_id : int 1 2 3 4 5 1 2 3 4 5 ... # $ age_mother: int 30 27 26 32 29 30 27 26 32 29 ... # $ variable : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ... # $ dob : IDate, format: "1998-11-26" "1996-06-22" "2002-07-11" ... # $ gender : int 1 2 2 1 2 2 NA 2 1 1 ... # - attr(*, ".internal.selfref")=<externalptr>
We can remove the
variable column if necessary.
The functionality is implemented entirely in C, and is therefore both fast and memory efficient in addition to being straightforward.
Usually in these problems, the columns we’d like to melt can be distinguished by a common pattern. We can use the function
patterns(), implemented for convenience, to provide regular expressions for the columns to be combined together. The above operation can be rewritten as:
measure.varsvia separator or pattern
If, as in the data above, the input columns to melt have regular names, then we can use
measure, which allows specifying the columns to melt via a separator or a regex. For example consider the iris data,
The iris data has four numeric columns with a regular structure: first the flower part, then a period, then the measurement dimension. To specify that we want to melt those four columns, we can use
sep="." which means to use
strsplit on all column names; the columns which result in the maximum number of groups after splitting will be used as
The first two arguments to
measure in the code above (
dim) are used to name the output columns; the number of arguments must equal the max number of groups after splitting with
If we want two value columns, one for each part, we can use the special
value.name keyword, which means to output a value column for each unique name found in that group:
Using the code above we get one value column per flower part. If we instead want a value column for each measurement dimension, we can do
Going back to the example of the data with families and children, we can see a more complex usage of
measure, involving a function which is used to convert the
child string values to integers:
In the code above we used
sep="_child" which results in melting only the columns which contain that string (six column names split into two groups each). The
child=as.integer argument means the second group will result in an output column named
child with values defined by plugging the character strings from that group into the function
Finally we consider an example (borrowed from tidyr package) where we need to define the groups using a regular expression rather than a separator.
(who <- data.table(id=1, new_sp_m5564=2, newrel_f65=3)) # id new_sp_m5564 newrel_f65 # <num> <num> <num> # [ reached getOption("max.print") -- omitted 1 row ] melt(who, measure.vars = measure( diagnosis, gender, ages, pattern="new_?(.*)_(.)(.*)")) # id diagnosis gender ages value # [ reached getOption("max.print") -- omitted 3 rows ]
When using the
pattern argument, it must be a Perl-compatible regular expression containing the same number of capture groups (parenthesized sub-expressions) as the number other arguments (group names). The code below shows how to use a more complex regex with five groups, two numeric output columns, and an anonymous type conversion function,
Okay great! We can now melt into multiple columns simultaneously. Now given the data set
DT.m2 as shown above, how can we get back to the same format as the original data we started with?
If we use the current functionality of
dcast, then we’d have to cast twice and bind the results together. But that’s once again verbose, not straightforward and is also inefficient.
We can now provide multiple
value.var columns to
data.tables directly so that the operations are taken care of internally and efficiently.
Attributes are preserved in result wherever possible.
Everything is taken care of internally, and efficiently. In addition to being fast, it is also very memory efficient.
You can also provide multiple functions to
dcast for data.tables. Check the examples in
?dcast which illustrates this functionality.