[R] dataframe subsetting behaviour

Douglas Grove dgrove at fhcrc.org
Thu Jan 23 00:29:03 CET 2003


> Douglas Grove <dgrove at fhcrc.org> writes:
> 
> > Hi,
> > 
> > I'm trying to understand a behaviour that I have encountered
> > and can't fathom.
> > 
> > 
> > Here's some code I will use to illustrate the behaviour:
> > 
> > # start with some data frame "a" having some named columns
> > a <- data.frame(a=rep(1,3),c=rep(2,3),d=rep(3,3),e=rep(4,3))
> > 
> > # create a subset of the original data frame, but include a
> > # name "b" that is not present in my original data frame
> > b <- a[,c("a","b","c")]
> > 
> > 
> > ## Up until now no errors are issued, but the following commands
> > ## will give the error shown:
> > 
> > b[1,]     ## "Error in x[[j]] : subscript out of bounds"
> > b[1,2]    ## "Error in "names<-.default"(*tmp*, value = cols) : 
> >           ##  names attribute must be the same length as the vector"
> > 
> > 
> > Can anyone explain to me the meaning of these error messages in terms
> > of what R is actually doing?  These error messages had me baffled and 
> > it took me hours to track down that the source of the error was an 
> > incorrect column name in my data frame subsetting.
> 
> Looks like a (semi-)bug. Indexing outside of the data frame creates a
> "column" which is really the single value NULL, e.g. 
> 
> > dput(a[,4:5])
> structure(list(e = c(4, 4, 4), "NA" = NULL), .Names = c("e",
> NA), row.names = c("1", "2", "3"), class = "data.frame")
> 
> This will print because the format.data.frame called inside
> print.data.frame will recycle the NULL and give you
> 
> > a[,4:5]
>   e   NA
> 1 4 NULL
> 2 4 NULL
> 3 4 NULL
> 
> However, it confuses the h*ck out of "[.data.frame"
> 
> > (a[,4:5])[2]
> Error in "[.data.frame"((a[, 4:5]), 2) : undefined columns selected
> > (a[,4:5])[,2]
> NULL
> > (a[,4:5])[,1]
> [1] 4 4 4
> 
> and also the examples you found. However, the main issue is that you
> have managed to construct a corrupt data frame. So indexing outside
> the array should probably either give an error or return a column of
> NA.


Yes, it would be nice if trying to index outside the data frame generated
an error. That is what happens in Splus (at least in the version I have
access to: 6.0 Release 1 for Linux 2.2.12).
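
In the meantime, a defensive check before subsetting would have caught my
typo straight away. Here is a minimal sketch, assuming the columns are
selected by name; select_cols is a hypothetical helper name of my own, not
something provided by R:

```r
# Hypothetical helper (not part of R): select columns by name and fail
# early, with a readable message, if any requested name is absent.
select_cols <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  # drop = FALSE keeps the result a data frame even for a single column
  df[, cols, drop = FALSE]
}

a <- data.frame(a = rep(1, 3), c = rep(2, 3), d = rep(3, 3), e = rep(4, 3))

# Valid names: returns the requested columns
ok <- select_cols(a, c("a", "c"))

# Misspelled name "b": the error is raised at the point of subsetting;
# tryCatch captures its message here so the script keeps running
bad <- tryCatch(select_cols(a, c("a", "b", "c")),
                error = function(e) conditionMessage(e))
```

The point of the check is only to move the failure to the line where the
wrong name is used, rather than to some later indexing of the result.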


> 
> -- 
>    O__  ---- Peter Dalgaard             Blegdamsvej 3  
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
>  (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
>



