Determine Duplicate Rows
duplicated returns a logical vector indicating which rows of a data.table are duplicates of a row with a smaller subscript.
unique returns a data.table with duplicated rows removed, considering only the columns specified in the by argument. When by is not supplied, rows duplicated across all columns are removed.
anyDuplicated returns the index i of the first duplicated entry if there is one, and 0 otherwise.
uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows is computed directly, without materialising the intermediate unique data.table, and is therefore faster and more memory efficient.
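The equivalence stated for uniqueN can be checked with a minimal sketch. The inputs x and DT below are hypothetical values chosen only for illustration; they do not come from the package documentation.

```r
library(data.table)

# Hypothetical inputs, used only to illustrate the stated equivalence
x  <- c(3L, 1L, 3L, 2L, 1L)
DT <- data.table(a = c(1L, 1L, 2L), b = c("u", "u", "v"))

# Atomic vector: uniqueN(x) should match length(unique(x))
n_vec <- uniqueN(x)
n_vec == length(unique(x))

# data.table: uniqueN(DT) should match nrow(unique(DT))
n_tab <- uniqueN(DT)
n_tab == nrow(unique(DT))
```

Both comparisons should hold; the difference is that uniqueN does not build the de-duplicated object first.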
Usage
# S3 method for data.table
duplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)
# S3 method for data.table
unique(x, incomparables=FALSE, fromLast=FALSE,
by=seq_along(x), cols=NULL, ...)
# S3 method for data.table
anyDuplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)
uniqueN(x, by=if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE)
Arguments
- x
A data.table. uniqueN accepts atomic vectors and data.frames as well.
- ...
Not used at this time.
- incomparables
Not used. Here for S3 method consistency.
- fromLast
Logical indicating if duplication should be considered from the reverse side. For duplicated, this means the last (or rightmost) of identical elements will correspond to duplicated = FALSE. For unique, this means the last (or rightmost) of identical elements will be kept. See examples.
- by
character or integer vector indicating which combinations of columns from x to use for uniqueness checks. By default all columns are used. This default was changed in v1.9.8 for consistency with the data.frame methods; in versions < 1.9.8 the default was key(x).
- cols
Columns (in addition to by) from x to include in the resulting data.table.
- na.rm
Logical (default is FALSE). Should missing values (including NaN) be removed?
Details
Because data.tables are usually sorted by key, tests for duplication are especially quick when only the keyed columns are considered. Unlike unique.data.frame, paste is not used to ensure equality of floating point data. It is instead accomplished directly and is therefore quite fast. data.table provides setNumericRounding to handle cases where limitations in floating point representation are undesirable.
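A rough sketch of how setNumericRounding might be used around a uniqueness check follows. The vector x is a hypothetical example of two doubles that differ only in the lowest bits, and the sketch assumes that uniqueN honours the rounding setting in the installed version. No output is shown for the rounded case because it depends on that assumption.

```r
library(data.table)

# Hypothetical values: two doubles that differ only in the lowest
# bits of the significand (2^-50 is four times machine epsilon).
x <- c(1, 1 + 2^-50)

old <- getNumericRounding()   # remember the current setting

setNumericRounding(0)         # compare doubles exactly
n_exact <- uniqueN(x)         # the two values are distinct doubles

setNumericRounding(2)         # round off the last 2 bytes of the significand
n_rounded <- uniqueN(x)       # assumption: uniqueN respects this setting

setNumericRounding(old)       # restore the previous setting

c(exact = n_exact, rounded = n_rounded)
```

Saving and restoring the previous value keeps the change local, since the rounding setting is global to the session.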
v1.9.4 introduced an anyDuplicated method for data.tables, similar in functionality to the base method. It also implemented the logical argument fromLast for all three functions, with default value FALSE.
Note: When cols is specified, the resulting table will have columns c(by, cols), in that order.
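As a rough illustration of how duplicated, unique and anyDuplicated relate to one another, the sketch below derives the results of the latter two from the logical vector returned by duplicated. The variable names (dup, same_rows, first_dup, fallback) are introduced here for illustration only.

```r
library(data.table)

DT <- data.table(A = c(1, 1, 2, 2, 3), B = c(1, 2, 1, 1, 2))

# Logical vector with one entry per row of DT
dup <- duplicated(DT, by = "B")

# unique() keeps the rows that duplicated() does not flag
same_rows <- all.equal(unique(DT, by = "B"), DT[!dup])

# anyDuplicated() gives the position of the first flagged row, or 0 if none
first_dup <- anyDuplicated(DT, by = "B")
fallback  <- if (any(dup)) which(dup)[1L] else 0L

isTRUE(same_rows)
first_dup == fallback
```

Passing fromLast=TRUE to all three calls flips which member of each duplicate group is treated as the original.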
Value
duplicated returns a logical vector of length nrow(x) indicating which rows are duplicates.
unique returns a data.table with duplicated rows removed.
anyDuplicated returns an integer value with the index of the first duplicate. If none exists, 0L is returned.
uniqueN returns the number of unique elements in the vector, data.frame or data.table.
See also
setNumericRounding, data.table, duplicated, unique, all.equal, fsetdiff, funion, fintersect, fsetequal
Examples
DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3),
C = rep(1:2, 6), key = c("A", "B"))
duplicated(DT)
#> [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
unique(DT)
#> Key: <A, B>
#> A B C
#> <int> <int> <int>
#> 1: 1 1 1
#> 2: 1 1 2
#> 3: 1 2 2
#> 4: 2 2 1
#> 5: 2 2 2
#> 6: 2 3 1
#> 7: 2 3 2
#> 8: 3 3 1
#> 9: 3 4 2
#> 10: 3 4 1
duplicated(DT, by="B")
#> [1] FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
unique(DT, by="B")
#> Key: <A, B>
#> A B C
#> <int> <int> <int>
#> 1: 1 1 1
#> 2: 1 2 2
#> 3: 2 3 1
#> 4: 3 4 2
duplicated(DT, by=c("A", "C"))
#> [1] FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
unique(DT, by=c("A", "C"))
#> Key: <A, B>
#> A B C
#> <int> <int> <int>
#> 1: 1 1 1
#> 2: 1 1 2
#> 3: 2 2 1
#> 4: 2 2 2
#> 5: 3 3 1
#> 6: 3 4 2
DT = data.table(a=c(2L,1L,2L), b=c(1L,2L,1L)) # no key
unique(DT) # rows 1 and 2 (row 3 is a duplicate of row 1)
#> a b
#> <int> <int>
#> 1: 2 1
#> 2: 1 2
DT = data.table(a=c(3.142, 4.2, 4.2, 3.142, 1.223, 1.223), b=rep(1,6))
unique(DT) # rows 1,2 and 5
#> a b
#> <num> <num>
#> 1: 3.142 1
#> 2: 4.200 1
#> 3: 1.223 1
DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10)) # example from ?all.equal
length(unique(DT$a)) # 10 strictly unique floating point values
#> [1] 10
all.equal(DT$a,rep(1,10)) # TRUE, all within tolerance of 1.0
#> [1] TRUE
DT[,which.min(a)] # row 10, the strictly smallest floating point value
#> [1] 10
identical(unique(DT),DT[1]) # FALSE: with exact comparison all 10 rows are kept
#> [1] FALSE
identical(unique(DT),DT[10]) # FALSE
#> [1] FALSE
# fromLast = TRUE vs. FALSE
DT <- data.table(A = c(1, 1, 2, 2, 3), B = c(1, 2, 1, 1, 2), C = c("a", "b", "a", "b", "a"))
duplicated(DT, by="B", fromLast=FALSE) # rows 3,4,5 are duplicates
#> [1] FALSE FALSE TRUE TRUE TRUE
unique(DT, by="B", fromLast=FALSE) # equivalent: DT[!duplicated(DT, by="B", fromLast=FALSE)]
#> A B C
#> <num> <num> <char>
#> 1: 1 1 a
#> 2: 1 2 b
duplicated(DT, by="B", fromLast=TRUE) # rows 1,2,3 are duplicates
#> [1] TRUE TRUE TRUE FALSE FALSE
unique(DT, by="B", fromLast=TRUE) # equivalent: DT[!duplicated(DT, by="B", fromLast=TRUE)]
#> A B C
#> <num> <num> <char>
#> 1: 2 1 b
#> 2: 3 2 a
# anyDuplicated
anyDuplicated(DT, by=c("A", "B")) # 4L
#> [1] 4
any(duplicated(DT, by=c("A", "B"))) # TRUE
#> [1] TRUE
# uniqueN, unique rows on key columns (this DT has no key, so key(DT) is NULL)
uniqueN(DT, by = key(DT))
#> [1] 5
# uniqueN, unique rows on all columns
uniqueN(DT)
#> [1] 5
# uniqueN while grouped by "A"
DT[, .(uN=uniqueN(.SD)), by=A]
#> A uN
#> <num> <int>
#> 1: 1 2
#> 2: 2 2
#> 3: 3 1
# uniqueN's na.rm=TRUE
x = sample(c(NA, NaN, runif(3)), 10, TRUE)
uniqueN(x, na.rm = FALSE) # default; 5 when NA, NaN and all three numbers were sampled
#> [1] 5
uniqueN(x, na.rm=TRUE) # 3 when all three numbers were sampled
#> [1] 3