`assign.Rd`

Fast add, remove and update subsets of columns, by reference. `:=`

operator can be used in two ways: `LHS := RHS`

form, and `Functional form`

. See `Usage`

.

`set`

is a low-overhead loop-able version of `:=`

. It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop). See `Examples`

. It can not perform grouping operations.

# 1. LHS := RHS form # DT[i, LHS := RHS, by = ...] # DT[i, c("LHS1", "LHS2") := list(RHS1, RHS2), by = ...] # 2. Functional form # DT[i, `:=`(LHS1 = RHS1, # LHS2 = RHS2, # ...), by = ...] set(x, i = NULL, j, value)

LHS | A character vector of column names (or numeric positions) or a variable that evaluates as such. If the column doesn't exist, it is added, |
---|---|

RHS | A list of replacement values. It is recycled in the usual way to fill the number of rows satisfying |

x | A |

i | Optional. Indicates the rows on which the values must be updated with. If not provided, implies In |

j | Column name(s) (character) or number(s) (integer) to be assigned |

value | A list of replacement values to assign by reference to |

`:=`

is defined for use in `j`

only. It *adds* or *updates* or *removes* column(s) by reference. It makes no copies of any part of memory at all. Please read `vignette("datatable-reference-semantics")`

and follow with examples. Some typical usages are:

DT[, col := val] # update (or add at the end if doesn't exist) a column called "col" with value "val" (recycled if necessary). DT[i, col := val] # same as above, but only for those rows specified in i and (for new columns) NA elsewhere. DT[i, "col a" := val] # same. column is called "col a" DT[i, (3:6) := val] # update existing columns 3:6 with value. Aside: parens are not required here since : already makes LHS a call rather than a symbol. DT[i, colvector := val, with = FALSE] # OLD syntax. The contents of "colvector" in calling scope determine the column(s). DT[i, (colvector) := val] # same (NOW PREFERRED) shorthand syntax. The parens are enough to stop the LHS being a symbol; same as c(colvector). DT[i, colC := mean(colB), by = colA] # update (or add) column called "colC" by reference by group. A major feature of `:=`. DT[,`:=`(new1 = sum(colB), new2 = sum(colC))] # Functional form

The `.Last.updated`

variable contains the number of rows updated by the most recent `:=`

or `set`

calls, which may be useful, for example, in production settings for testing assumptions about the number of rows affected by a statement; see `.Last.updated`

for details.

Note that for efficiency no check is performed for duplicate assignments, i.e. if multiple values are passed for assignment to the same index, assignment to this index will occur repeatedly and sequentially; for a given use case, consider whether it makes sense to create your own test for duplicates, e.g. in production code.

All of the following result in a friendly error (by design) :

x := 1L DT[i, col] := val DT[i]$col := val DT[, {col1 := 1L; col2 := 2L}] # Use the functional form, `:=`(), instead (see above).

For additional resources, please read `vignette("datatable-faq")`

. Also have a look at StackOverflow's data.table tag.

`:=`

in `j`

can be combined with all types of `i`

(such as binary search), and all types of `by`

. This a one reason why `:=`

has been implemented in `j`

. Please see `vignette("datatable-reference-semantics")`

and also `FAQ 2.16`

for analogies to SQL.

When `LHS`

is a factor column and `RHS`

is a character vector with items missing from the factor levels, the new level(s) are automatically added (by reference, efficiently), unlike base methods.

Unlike `<-`

for `data.frame`

, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to get the column types correct up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then *plonked* into that column slot and we call this *plonk syntax*, or *replace column syntax* if you prefer. By needing to construct a full length vector of a new type, you as the user are more aware of what is happening, and it is clearer to readers of your code that you really do intend to change the column type.

`data.table`

s are *not* copied-on-change by `:=`

, `setkey`

or any of the other `set*`

functions. See `copy`

.

It is easy to see how *sub-assigning* to existing columns is done internally. Removing columns by reference is also straightforward by modifying the vector of column pointers only (using memmove in C). However adding (new) columns is more tricky as to how the `data.table`

can be grown *by reference*: the list vector of column pointers is *over-allocated*, see `truelength`

. By defining `:=`

in `j`

we believe update syntax is natural, and scales, but it also bypasses `[<-`

dispatch and allows `:=`

to update by reference with no copies of any part of memory at all.

Since `[.data.table`

incurs overhead to check the existence and type of arguments (for example), `set()`

provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a `for`

loop. See examples. `:=`

is more powerful and flexible than `set()`

because `:=`

is intended to be combined with `i`

and `by`

in single queries on large datasets.

`DT[a > 4, b := c]`

is different from `DT[a > 4][, b := c]`

. The first expression updates (or adds) column `b`

with the value `c`

on those rows where `a > 4`

evaluates to `TRUE`

. `X`

is updated *by reference*, therefore no assignment needed.

The second expression on the other hand updates a *new* `data.table`

that's returned by the subset operation. Since the subsetted data.table is ephemeral (it is not assigned to a symbol), the result would be lost; unless the result is assigned, for example, as follows: `ans <- DT[a > 4][, b := c]`

.

`DT`

is modified by reference and returned invisibly. If you require a copy, take a `copy`

first (using `DT2 = copy(DT)`

).

`data.table`

, `copy`

, `setalloccol`

, `truelength`

, `set`

, `.Last.updated`

DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7) DT[, c := 8] # add a numeric column, 8 for all rows#> a b c #> <char> <int> <num> #> 1: C 4 8 #> 2: A 5 8 #> 3: B 6 8 #> 4: C 7 8DT[, d := 9L] # add an integer column, 9L for all rows#> a b c d #> <char> <int> <num> <int> #> 1: C 4 8 9 #> 2: A 5 8 9 #> 3: B 6 8 9 #> 4: C 7 8 9DT[, c := NULL] # remove column c#> a b d #> <char> <int> <int> #> 1: C 4 9 #> 2: A 5 9 #> 3: B 6 9 #> 4: C 7 9DT[2, d := -8L] # subassign by reference to d; 2nd row is -8L now#> a b d #> <char> <int> <int> #> 1: C 4 9 #> 2: A 5 -8 #> 3: B 6 9 #> 4: C 7 9DT # DT changed by reference#> a b d #> <char> <int> <int> #> 1: C 4 9 #> 2: A 5 -8 #> 3: B 6 9 #> 4: C 7 9DT[2, d := 10L][] # shorthand for update and print#> a b d #> <char> <int> <int> #> 1: C 4 9 #> 2: A 5 10 #> 3: B 6 9 #> 4: C 7 9DT[b > 4, b := d * 2L] # subassign to b with d*2L on those rows where b > 4 is TRUE#> a b d #> <char> <int> <int> #> 1: C 4 9 #> 2: A 20 10 #> 3: B 18 9 #> 4: C 18 9DT[b > 4][, b := d * 2L] # different from above. [, := ] is performed on the subset#> a b d #> <char> <int> <int> #> 1: A 20 10 #> 2: B 18 9 #> 3: C 18 9# which is an new (ephemeral) data.table. Result needs to be # assigned to a variable (using `<-`). DT[, e := mean(d), by = a] # add new column by group by reference#> a b d e #> <char> <int> <int> <num> #> 1: C 4 9 9 #> 2: A 20 10 10 #> 3: B 18 9 9 #> 4: C 18 9 9DT["A", b := 0L, on = "a"] # ad-hoc update of column b for group "A" using#> a b d e #> <char> <int> <int> <num> #> 1: C 4 9 9 #> 2: A 0 10 10 #> 3: B 18 9 9 #> 4: C 18 9 9# joins-as-subsets with binary search and 'on=' # same as above but using keys setkey(DT, a) DT["A", b := 0L] # binary search for group "A" and set column b using keys#> a b d e #> <char> <int> <int> <num> #> 1: A 0 10 10 #> 2: B 18 9 9 #> 3: C 4 9 9 #> 4: C 18 9 9#> a b d e f #> <char> <int> <int> <num> <num> #> 1: A 0 10 10 NA #> 2: B 18 9 9 9 #> 3: C 4 9 9 NA #> 4: C 18 9 9 NA# Adding multiple columns ## by name DT[ , c('sin_d', 'log_e', 'cos_d') := .(sin(d), log(e), cos(d))]#> a b d e f sin_d log_e cos_d #> <char> <int> <int> <num> <num> <num> <num> <num> #> 1: A 0 10 10 NA -0.5440211 2.302585 -0.8390715 #> 2: B 18 9 9 9 0.4121185 2.197225 -0.9111303 #> 3: C 4 9 9 NA 0.4121185 2.197225 -0.9111303 #> 4: C 18 9 9 NA 0.4121185 2.197225 -0.9111303#> a b d e f sin_d log_e cos_d sin_b #> <char> <int> <int> <num> <num> <num> <num> <num> <num> #> 1: A 0 10 10 NA -0.5440211 2.302585 -0.8390715 0.0000000 #> 2: B 18 9 9 9 0.4121185 2.197225 -0.9111303 -0.7509872 #> 3: C 4 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7568025 #> 4: C 18 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7509872 #> cos_b #> <num> #> 1: 1.0000000 #> 2: 0.6603167 #> 3: -0.6536436 #> 4: 0.6603167## using lapply & .SD DT[ , paste0('tan_', c('b', 'd', 'e')) := lapply(.SD, tan), .SDcols = c('b', 'd', 'e')]#> a b d e f sin_d log_e cos_d sin_b #> <char> <int> <int> <num> <num> <num> <num> <num> <num> #> 1: A 0 10 10 NA -0.5440211 2.302585 -0.8390715 0.0000000 #> 2: B 18 9 9 9 0.4121185 2.197225 -0.9111303 -0.7509872 #> 3: C 4 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7568025 #> 4: C 18 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7509872 #> cos_b tan_b tan_d tan_e #> <num> <num> <num> <num> #> 1: 1.0000000 0.000000 0.6483608 0.6483608 #> 2: 0.6603167 -1.137314 -0.4523157 -0.4523157 #> 3: -0.6536436 1.157821 -0.4523157 -0.4523157 #> 4: 0.6603167 -1.137314 -0.4523157 -0.4523157## using forced evaluation to disambiguate a vector of names ## and overwrite existing columns with their squares sq_cols = c('b', 'd', 'e') DT[ , (sq_cols) := lapply(.SD, `^`, 2L), .SDcols = sq_cols]#> a b d e f sin_d log_e cos_d sin_b #> <char> <num> <num> <num> <num> <num> <num> <num> <num> #> 1: A 0 100 100 NA -0.5440211 2.302585 -0.8390715 0.0000000 #> 2: B 324 81 81 9 0.4121185 2.197225 -0.9111303 -0.7509872 #> 3: C 16 81 81 NA 0.4121185 2.197225 -0.9111303 -0.7568025 #> 4: C 324 81 81 NA 0.4121185 2.197225 -0.9111303 -0.7509872 #> cos_b tan_b tan_d tan_e #> <num> <num> <num> <num> #> 1: 1.0000000 0.000000 0.6483608 0.6483608 #> 2: 0.6603167 -1.137314 -0.4523157 -0.4523157 #> 3: -0.6536436 1.157821 -0.4523157 -0.4523157 #> 4: 0.6603167 -1.137314 -0.4523157 -0.4523157## by integer (NB: for robustness, it is not recommended ## to use explicit integers to update/define columns) DT[ , c(2L, 3L, 4L) := .(sqrt(b), sqrt(d), sqrt(e))]#> a b d e f sin_d log_e cos_d sin_b #> <char> <num> <num> <num> <num> <num> <num> <num> <num> #> 1: A 0 10 10 NA -0.5440211 2.302585 -0.8390715 0.0000000 #> 2: B 18 9 9 9 0.4121185 2.197225 -0.9111303 -0.7509872 #> 3: C 4 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7568025 #> 4: C 18 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7509872 #> cos_b tan_b tan_d tan_e #> <num> <num> <num> <num> #> 1: 1.0000000 0.000000 0.6483608 0.6483608 #> 2: 0.6603167 -1.137314 -0.4523157 -0.4523157 #> 3: -0.6536436 1.157821 -0.4523157 -0.4523157 #> 4: 0.6603167 -1.137314 -0.4523157 -0.4523157#> a b d e f sin_d log_e cos_d sin_b #> <char> <num> <num> <num> <num> <num> <num> <num> <num> #> 1: a 0 10 10 NA -0.5440211 2.302585 -0.8390715 0.0000000 #> 2: b 18 9 9 9 0.4121185 2.197225 -0.9111303 -0.7509872 #> 3: c 4 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7568025 #> 4: c 18 9 9 NA 0.4121185 2.197225 -0.9111303 -0.7509872 #> cos_b tan_b tan_d tan_e #> <num> <num> <num> <num> #> 1: 1.0000000 0.000000 0.6483608 0.6483608 #> 2: 0.6603167 -1.137314 -0.4523157 -0.4523157 #> 3: -0.6536436 1.157821 -0.4523157 -0.4523157 #> 4: 0.6603167 -1.137314 -0.4523157 -0.4523157## by implicit integer, using forced evaluation sq_col_idx = grep('d$', names(DT)) DT[ , (sq_col_idx) := lapply(.SD, dnorm), .SDcols = sq_col_idx]#> a b d e f sin_d log_e cos_d #> <char> <num> <num> <num> <num> <num> <num> <num> #> 1: a 0 7.694599e-23 10 NA 0.3440673 2.302585 0.2805624 #> 2: b 18 1.027977e-18 9 9 0.3664624 2.197225 0.2634168 #> 3: c 4 1.027977e-18 9 NA 0.3664624 2.197225 0.2634168 #> 4: c 18 1.027977e-18 9 NA 0.3664624 2.197225 0.2634168 #> sin_b cos_b tan_b tan_d tan_e #> <num> <num> <num> <num> <num> #> 1: 0.0000000 1.0000000 0.000000 0.3233162 0.6483608 #> 2: -0.7509872 0.6603167 -1.137314 0.3601505 -0.4523157 #> 3: -0.7568025 -0.6536436 1.157821 0.3601505 -0.4523157 #> 4: -0.7509872 0.6603167 -1.137314 0.3601505 -0.4523157if (FALSE) { # Speed example: m = matrix(1, nrow = 2e6L, ncol = 100L) DF = as.data.frame(m) DT = as.data.table(m) system.time(for (i in 1:1000) DF[i, 1] = i) # 15.856 seconds system.time(for (i in 1:1000) DT[i, V1 := i]) # 0.279 seconds (57 times faster) system.time(for (i in 1:1000) set(DT, i, 1L, i)) # 0.002 seconds (7930 times faster, overhead of [.data.table is avoided) # However, normally, we call [.data.table *once* on *large* data, not many times on small data. # The above is to demonstrate overhead, not to recommend looping in this way. But the option # of set() is there if you need it. }