Over-allocation access
truelength.Rd
These functions are experimental and somewhat advanced. By experimental we mean that their names, and perhaps their syntax, argument names and types, might change. So if you write a lot of code using them, you have been warned! They should work and be stable, though, so please report problems with them. alloc.col is just an alias for setalloccol. We recommend using setalloccol (though alloc.col will continue to be supported) because the set* prefix in setalloccol makes it clear that its input argument is modified in place.
Arguments
- x
Any type of vector, including data.table which is a list vector of column pointers.
- DT
A data.table.
- n
The number of spare column pointer slots to ensure are available. If DT is a 1,000 column data.table with 24 spare slots remaining, n=1024L means grow the 24 spare slots to be 1024. truelength(DT) will then be 2024 in this example.
- verbose
Output status and information.
Details
When adding columns by reference using :=, we could simply create a new column list vector (one longer) and memcpy over the old vector, with no copy of the column vectors themselves. That requires negligible space and time, and is what v1.7.2 did. However, that copy of only the list vector of column pointers (not the columns themselves), known as a shallow copy, resulted in inconsistent behaviour in some circumstances. So, as of v1.7.3, data.table over-allocates the list vector of column pointers so that columns can be added fully by reference, consistently.
When the allocated column pointer slots are used up, data.table must reallocate that vector to add a new column. If two or more variables are bound to the same data.table, this shallow copy may or may not be desirable, but we don't think this will be a problem very often (more discussion may be required on the data.table issue tracker). Setting options(datatable.verbose=TRUE) includes messages if and when a shallow copy is taken. To avoid shallow copies there are several options: use copy to make a deep copy first, use setalloccol to reallocate in advance, or change the default allocation rule (perhaps in your .Rprofile); e.g., options(datatable.alloccol=10000L).
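The three options listed above can be sketched roughly as follows. This is a minimal illustration rather than a recommended recipe; the object names DT2, DT3, spare and old are made up for the example, and the value 10000L is simply the one quoted in the text.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# Option 1: take a deep copy first, so adding a column to the copy
# cannot affect DT.
DT2 <- copy(DT)
DT2[, c := 7L]
names(DT)                      # DT still has only columns a and b

# Option 2: reallocate in advance so that spare slots are available.
setalloccol(DT, 2048L)
spare <- truelength(DT) - length(DT)

# Option 3: change the default allocation rule for tables created later
# (this could instead go in .Rprofile). The previous setting is restored
# afterwards so the example has no lasting side effect.
old <- options(datatable.alloccol = 10000L)
DT3 <- data.table(x = 1:3)
options(old)
```

Which option fits best depends on whether other variables are bound to the same data.table and on how many columns are expected to be added.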
Please note: over-allocation of the column pointer vector is not for efficiency per se; it is so that := can add columns by reference without a shallow copy.
Value
truelength(x) returns the length of the vector allocated in memory. length(x) of those items are in use. Currently, it is just the list vector of column pointers that is over-allocated (i.e. truelength(DT)), not the column vectors themselves; over-allocating those would in future allow fast row insert(). For tables loaded from disk, however, truelength is 0 in R 2.14.0+ (and random in R <= 2.13.2), which is perhaps unexpected. data.table detects this state and over-allocates the loaded data.table when the next column addition occurs. All other operations on data.table (such as fast grouping and joins) do not need truelength.
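As a rough illustration of the loaded-from-disk case, the sketch below round-trips a table through saveRDS and readRDS and then adds a column. The temporary file and the names DT2, tl_loaded and tl_after are assumptions made for the example. No expected values are shown for truelength, since they depend on the R and data.table versions in use.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

f <- tempfile(fileext = ".rds")   # hypothetical temporary file
saveRDS(DT, f)
DT2 <- readRDS(f)

tl_loaded <- truelength(DT2)      # allocation as seen right after loading

# The next column addition is where data.table is described as detecting
# the loaded state and over-allocating.
DT2[, c := 7L]
tl_after <- truelength(DT2)

names(DT2)                        # the new column is present
unlink(f)
```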
setalloccol reallocates DT by reference. This may be useful for efficiency if you know you are about to add a lot of columns in a loop. It also returns the new DT, for convenience in compound queries.
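A minimal sketch of the loop scenario mentioned above, assuming the number of columns to add is known in advance. n_new and the V1, V2, ... column names are hypothetical, and set() is used here as one way to add a column by reference inside a loop; := would serve equally well.

```r
library(data.table)

DT <- data.table(id = 1:3)
n_new <- 2000L                    # hypothetical number of columns to add

# Ensure n_new spare column pointer slots up front, so the loop below
# does not run out of slots part-way through.
setalloccol(DT, n_new)

for (j in seq_len(n_new)) {
  # Add column V<j> by reference; the scalar value is recycled over rows.
  set(DT, j = paste0("V", j), value = j)
}

length(DT)                        # id plus the n_new added columns
```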
Examples
DT = data.table(a=1:3,b=4:6)
length(DT) # 2 column pointer slots used
#> [1] 2
truelength(DT) # 1026 column pointer slots allocated
#> [1] 1026
setalloccol(DT, 2048)
#> a b
#> <int> <int>
#> 1: 1 4
#> 2: 2 5
#> 3: 3 6
length(DT) # 2 used
#> [1] 2
truelength(DT) # 2050 allocated, 2048 free
#> [1] 2050
DT[,c:=7L] # add new column by assigning to spare slot
#> a b c
#> <int> <int> <int>
#> 1: 1 4 7
#> 2: 2 5 7
#> 3: 3 6 7
truelength(DT)-length(DT) # 2047 slots spare
#> [1] 2047