Fri Apr 5 10:17:37 CEST 2024

Hello Mark,

> I found what looks to me like an odd edge case for duplicated(),
> unique() etc. on data frames with zero columns, due to duplicated()
> returning a zero-length vector for them, regardless of the number of
> rows:

> df <- data.frame(a = 1:5)
> df$a <- NULL
> nrow(df)
> # 5 (row count preserved by row.names)
> duplicated(df)
> # logical(0), should be c(FALSE, TRUE, TRUE, TRUE, TRUE)
> anyDuplicated(df)
> # 0, should be 2

> This behaviour isn't mentioned in the documentation; is there a
> reason for it to work like this?


> I admit this is a case we rarely care about. However, for an example
> of this being an issue, I've been running into it when treating data
> frames as database relations, where they have one or more candidate
> keys (irreducible subsets of the columns for which every row must
> have a unique value set).

Part of the problem is that it's not obvious what a zero-column but
non-zero-row data.frame should mean.

On the one hand, your database relation use case is entirely valid. On
the other hand, if data.frames are considered to be tables of data with
row.names as their identifiers, then duplicated(d) should be returning
logical(nrow(d)) for zero-column data.frames, since row.names are
required to be unique. I'm sure that more interpretations can be
devised, requiring some other behaviour for duplicated() and friends.

Thankfully, duplicated() and anyDuplicated() are generic functions, and
you can subclass your data frames to change their behaviour:

# with no columns, every row after the first one is a duplicate
duplicated.database_relation <- function(x, incomparables = FALSE, ...)
 if (length(x)) return(NextMethod()) else c(
  FALSE, rep(TRUE, nrow(x) - 1)
 )
.S3method('duplicated', 'database_relation')

anyDuplicated.database_relation <- function(
 x, incomparables = FALSE, ...
) if (length(x)) NextMethod() else if (nrow(x) > 1) 2 else 0
.S3method('anyDuplicated', 'database_relation')

x <- data.frame(row.names = 1:5)
class(x) <- c('database_relation', class(x))

anyDuplicated(x)
# [1] 2
unique(x)
# data frame with 0 columns and 1 row
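
For comparison, the other reading mentioned above (rows identified by
their unique row.names) could be sketched in the same way. This is
only an illustration: the class name 'keyed_table' is made up for this
example and is not part of R.

```r
# Hypothetical class: rows are identified by their (unique) row.names,
# so no row of a zero-column data frame counts as a duplicate.
duplicated.keyed_table <- function(x, incomparables = FALSE, ...)
 if (length(x)) NextMethod() else logical(nrow(x))
.S3method('duplicated', 'keyed_table')

d <- data.frame(row.names = 1:5)
class(d) <- c('keyed_table', class(d))
duplicated(d)
# [1] FALSE FALSE FALSE FALSE FALSE
nrow(unique(d))
# [1] 5
```

No anyDuplicated() method is needed under this reading, since the
data.frame method already returns 0 for a zero-column data frame.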

Best regards,

