This vignette will explain the most common ways to use the .SD
variable in your data.table
analyses. It is an adaptation of this answer given on StackOverflow.
.SD
?In the broadest sense, .SD
is just shorthand for capturing a variable that comes up frequently in the context of data analysis. It can be understood to stand for Subset, Selfsame, or Self-reference of the Data. That is, .SD
is in its most basic guise a reflexive reference to the data.table
itself – as we’ll see in examples below, this is particularly helpful for chaining together “queries” (extractions/subsets/etc using [
). In particular, this also means that .SD
is itself a data.table
(with the caveat that it does not allow assignment with :=
).
The simpler usage of .SD
is for column subsetting (i.e., when .SDcols
is specified); as this version is much more straightforward to understand, we’ll cover that first below. The interpretation of .SD
in its second usage, grouping scenarios (i.e., when by =
or keyby =
is specified), is slightly different, conceptually (though at core it’s the same, since, after all, a non-grouped operation is an edge case of grouping with just one group).
To give this a more real-world feel, rather than making up data, let’s load some data sets about baseball from the Lahman database. In typical R usage, we’d simply load these data sets from the Lahman
R package; in this vignette, we’ve pre-downloaded them directly from the package’s GitHub page instead.
load('Teams.RData')
setDT(Teams)
Teams
# yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
# WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER
# ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP
# name park attendance BPF PPF teamIDBR
# teamIDlahman45 teamIDretro
# [ reached getOption("max.print") -- omitted 12 rows ]
load('Pitching.RData')
setDT(Pitching)
Pitching
# playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H
# ER HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF
# GIDP
# [ reached getOption("max.print") -- omitted 12 rows ]
Readers up on baseball lingo should find the tables’ contents familiar; Teams
records some statistics for a given team in a given year, while Pitching
records statistics for a given pitcher in a given year. Please do check out the documentation and explore the data yourself a bit before proceeding to familiarize yourself with their structure.
.SD
on Ungrouped DataTo illustrate what I mean about the reflexive nature of .SD
, consider its most banal usage:
Pitching[ , .SD]
# playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H
# ER HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF
# GIDP
# [ reached getOption("max.print") -- omitted 12 rows ]
That is, Pitching[ , .SD]
has simply returned the whole table, i.e., this was an overly verbose way of writing Pitching
or Pitching[]
:
In terms of subsetting, .SD
is still a subset of the data, it’s just a trivial one (the set itself).
.SDcols
The first way to impact what .SD
is is to limit the columns contained in .SD
using the .SDcols
argument to [
:
# W: Wins; L: Losses; G: Games
Pitching[ , .SD, .SDcols = c('W', 'L', 'G')]
# W L G
# <int> <int> <int>
# [ reached getOption("max.print") -- omitted 11 rows ]
This is just for illustration and was pretty boring. But even this simply usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:
Column type conversion is a fact of life for data munging. Though fwrite
recently gained the ability to declare the class of each column up front, not all data sets come from fread
(e.g. in this vignette) and conversions back and forth among character
/factor
/numeric
types are common. We can use .SD
and .SDcols
to batch-convert groups of columns to a common type.
We notice that the following columns are stored as character
in the Teams
data set, but might more logically be stored as factor
s:
# teamIDBR: Team ID used by Baseball Reference website
# teamIDlahman45: Team ID used in Lahman database version 4.5
# teamIDretro: Team ID used by Retrosheet
fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro')
# confirm that they're stored as `character`
Teams[ , sapply(.SD, is.character), .SDcols = fkt]
# teamIDBR teamIDlahman45 teamIDretro
# TRUE TRUE TRUE
If you’re confused by the use of sapply
here, note that it’s quite similar for base R data.frames
:
setDF(Teams) # convert to data.frame for illustration
sapply(Teams[ , fkt], is.character)
# teamIDBR teamIDlahman45 teamIDretro
# TRUE TRUE TRUE
setDT(Teams) # convert back to data.table
The key to understanding this syntax is to recall that a data.table
(as well as a data.frame
) can be considered as a list
where each element is a column – thus, sapply
/lapply
applies the FUN
argument (in this case, is.character
) to each column and returns the result as sapply
/lapply
usually would.
The syntax to now convert these columns to factor
is very similar – simply add the :=
assignment operator:
Teams[ , (fkt) := lapply(.SD, factor), .SDcols = fkt]
# print out the first column to demonstrate success
head(unique(Teams[[fkt[1L]]]))
# [1] BOS CHI CLE KEK
# [ reached getOption("max.print") -- omitted 2 entries ]
# 101 Levels: ALT ANA ARI ATH ATL BAL BLA BLN BLU BOS BRA BRG BRO BSN BTT BUF BWW CAL CEN CHC ... WSN
Note that we must wrap fkt
in parentheses ()
to force data.table
to interpret this as column names, instead of trying to assign a column named 'fkt'
.
Actually, the .SDcols
argument is quite flexible; above, we supplied a character
vector of column names. In other situations, it is more convenient to supply an integer
vector of column positions or a logical
vector dictating include/exclude for each column. .SDcols
even accepts regular expression-based pattern matching.
For example, we could do the following to convert all factor
columns to character
:
# while .SDcols accepts a logical vector,
# := does not, so we need to convert to column
# positions with which()
fkt_idx = which(sapply(Teams, is.factor))
Teams[ , (fkt_idx) := lapply(.SD, as.character), .SDcols = fkt_idx]
head(unique(Teams[[fkt_idx[1L]]]))
# [1] "NA" "NL" "AA" "UA"
# [ reached getOption("max.print") -- omitted 2 entries ]
Lastly, we can do pattern-based matching of columns in .SDcols
to select all columns which contain team
back to factor
:
Teams[ , .SD, .SDcols = patterns('team')]
# teamID teamIDBR teamIDlahman45 teamIDretro
# <char> <char> <char> <char>
# [ reached getOption("max.print") -- omitted 11 rows ]
# now convert these columns to factor;
# value = TRUE in grep() is for the LHS of := to
# get column names instead of positions
team_idx = grep('team', names(Teams), value = TRUE)
Teams[ , (team_idx) := lapply(.SD, factor), .SDcols = team_idx]
** A proviso to the above: explicitly using column numbers (like DT[ , (1) := rnorm(.N)]
) is bad practice and can lead to silently corrupted code over time if column positions change. Even implicitly using numbers can be dangerous if we don’t keep smart/strict control over the ordering of when we create the numbered index and when we use it.
Varying model specification is a core feature of robust statistical analysis. Let’s try and predict a pitcher’s ERA (Earned Runs Average, a measure of performance) using the small set of covariates available in the Pitching
table. How does the (linear) relationship between W
(wins) and ERA
vary depending on which other covariates are included in the specification?
Here’s a short script leveraging the power of .SD
which explores this question:
# this generates a list of the 2^k possible extra variables
# for models of the form ERA ~ G + (...)
extra_var = c('yearID', 'teamID', 'G', 'L')
models = unlist(
lapply(0L:length(extra_var), combn, x = extra_var, simplify = FALSE),
recursive = FALSE
)
# here are 16 visually distinct colors, taken from the list of 20 here:
# https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
col16 = c('#e6194b', '#3cb44b', '#ffe119', '#0082c8',
'#f58231', '#911eb4', '#46f0f0', '#f032e6',
'#d2f53c', '#fabebe', '#008080', '#e6beff',
'#aa6e28', '#fffac8', '#800000', '#aaffc3')
par(oma = c(2, 0, 0, 0))
lm_coef = sapply(models, function(rhs) {
# using ERA ~ . and data = .SD, then varying which
# columns are included in .SD allows us to perform this
# iteration over 16 models succinctly.
# coef(.)['W'] extracts the W coefficient from each model fit
Pitching[ , coef(lm(ERA ~ ., data = .SD))['W'], .SDcols = c('W', rhs)]
})
barplot(lm_coef, names.arg = sapply(models, paste, collapse = '/'),
main = 'Wins Coefficient\nWith Various Covariates',
col = col16, las = 2L, cex.names = .8)
Fit OLS coefficient on W, various specifications, depicted as bars with distinct colors.
The coefficient always has the expected sign (better pitchers tend to have more wins and fewer runs allowed), but the magnitude can vary substantially depending on what else we control for.
data.table
syntax is beautiful for its simplicity and robustness. The syntax x[i]
flexibly handles three common approaches to subsetting – when i
is a logical
vector, x[i]
will return those rows of x
corresponding to where i
is TRUE
; when i
is another data.table
(or a list
), a (right) join
is performed (in the plain form, using the key
s of x
and i
, otherwise, when on =
is specified, using matches of those columns); and when i
is a character, it is interpreted as shorthand for x[list(i)]
, i.e., as a join.
This is great in general, but falls short when we wish to perform a conditional join, wherein the exact nature of the relationship among tables depends on some characteristics of the rows in one or more columns.
This example is admittedly a tad contrived, but illustrates the idea; see here (1, 2) for more.
The goal is to add a column team_performance
to the Pitching
table that records the team’s performance (rank) of the best pitcher on each team (as measured by the lowest ERA, among pitchers with at least 6 recorded games).
# to exclude pitchers with exceptional performance in a few games,
# subset first; then define rank of pitchers within their team each year
# (in general, we should put more care into the 'ties.method' of frank)
Pitching[G > 5, rank_in_team := frank(ERA), by = .(teamID, yearID)]
Pitching[rank_in_team == 1, team_performance :=
Teams[.SD, Rank, on = c('teamID', 'yearID')]]
Note that the x[y]
syntax returns nrow(y)
values (i.e., it’s a right join), which is why .SD
is on the right in Teams[.SD]
(since the RHS of :=
in this case requires nrow(Pitching[rank_in_team == 1])
values.
.SD
operationsOften, we’d like to perform some operation on our data at the group level. When we specify by =
(or keyby =
), the mental model for what happens when data.table
processes j
is to think of your data.table
as being split into many component sub-data.table
s, each of which corresponds to a single value of your by
variable(s):
[Grouping, Illustrated](plots/grouping_illustration.png ‘A visual depiction of how grouping works. On the left is a grid. The first column is titled “ID COLUMN” with values the capital letters A through G, and the rest of the data is unlabelled, but is in a darker color and simply has “Data” written to indicate that’s arbitrary. A right arrow shows how this data is split into groups. Each capital letter A through G has a grid on the right-hand side; the grid on the left has been subdivided to create that on the right.’)
In the case of grouping, .SD
is multiple in nature – it refers to each of these sub-data.table
s, one-at-a-time (slightly more accurately, the scope of .SD
is a single sub-data.table
). This allows us to concisely express an operation that we’d like to perform on each sub-data.table
before the re-assembled result is returned to us.
This is useful in a variety of settings, the most common of which are presented here:
Let’s get the most recent season of data for each team in the Lahman data. This can be done quite simply with:
# the data is already sorted by year; if it weren't
# we could do Teams[order(yearID), .SD[.N], by = teamID]
Teams[ , .SD[.N], by = teamID]
# teamID yearID lgID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin
# R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA
# CG SHO SV IPouts HA HRA BBA SOA E DP FP name
# park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
# [ reached getOption("max.print") -- omitted 12 rows ]
Recall that .SD
is itself a data.table
, and that .N
refers to the total number of rows in a group (it’s equal to nrow(.SD)
within each group), so .SD[.N]
returns the entirety of .SD
for the final row associated with each teamID
.
Another common version of this is to use .SD[1L]
instead to get the first observation for each group, or .SD[sample(.N, 1L)]
to return a random row for each group.
Suppose we wanted to return the best year for each team, as measured by their total number of runs scored (R
; we could easily adjust this to refer to other metrics, of course). Instead of taking a fixed element from each sub-data.table
, we now define the desired index dynamically as follows:
Teams[ , .SD[which.max(R)], by = teamID]
# teamID yearID lgID franchID divID Rank G Ghome W L DivWin WCWin LgWin WSWin
# R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA
# CG SHO SV IPouts HA HRA BBA SOA E DP FP name
# park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
# [ reached getOption("max.print") -- omitted 12 rows ]
Note that this approach can of course be combined with .SDcols
to return only portions of the data.table
for each .SD
(with the caveat that .SDcols
should be fixed across the various subsets)
NB: .SD[1L]
is currently optimized by GForce
(see also), data.table
internals which massively speed up the most common grouped operations like sum
or mean
– see ?GForce
for more details and keep an eye on/voice support for feature improvement requests for updates on this front: 1, 2, 3, 4, 5, 6
Returning to the inquiry above regarding the relationship between ERA
and W
, suppose we expect this relationship to differ by team (i.e., there’s a different slope for each team). We can easily re-run this regression to explore the heterogeneity in this relationship as follows (noting that the standard errors from this approach are generally incorrect – the specification ERA ~ W*teamID
will be better – this approach is easier to read and the coefficients are OK):
# Overall coefficient for comparison
overall_coef = Pitching[ , coef(lm(ERA ~ W))['W']]
# use the .N > 20 filter to exclude teams with few observations
Pitching[ , if (.N > 20L) .(w_coef = coef(lm(ERA ~ W))['W']), by = teamID
][ , hist(w_coef, 20L, las = 1L,
xlab = 'Fitted Coefficient on W',
ylab = 'Number of Teams', col = 'darkgreen',
main = 'Team-Level Distribution\nWin Coefficients on ERA')]
abline(v = overall_coef, lty = 2L, col = 'red')
A histogram depicting the distribution of fitted coefficients. It is vaguely bell-shaped and concentrated around -.2
While there is indeed a fair amount of heterogeneity, there’s a distinct concentration around the observed overall value.
The above is just a short introduction of the power of .SD
in facilitating beautiful, efficient code in data.table
!