[R] Data Extraction

Berend Hasselman bhh at xs4all.nl
Thu Nov 22 15:49:56 CET 2012


On 22-11-2012, at 15:11, Muhuri, Pradip (SAMHSA/CBHSQ) wrote:

> Hello,
> 
> I would appreciate it if someone could help me resolve the following:
> 
> 1. df1[!is.na( X1 | X2 | X3 | X4 | X5),][,1:5] # This does not work
> 
> 2. Is this message harmful?  The following object(s) are masked from 'df1 (position 3)':
>    X1, X2, X3, X4, X5
> 
> Thanks,
> 
> Pradip Muhuri
> 
> 
> #Reproducible Example
> set.seed(5)
> df1<-data.frame(matrix(sample(c(1:10,NA),100,replace=TRUE),ncol=5))
> attach (df1)
> #delete rows where X1 is NA
> df1[!is.na( X1),][,1:5] # This works
> 
> #delete rows where any of X1, X2, X3, X4 or X5 is NA
> df1[!is.na( X1 | X2 | X3 | X4 | X5),][,1:5] # This does not work

Yet another way of doing this is:

df1[!is.na(rowSums(df1)),][1:5]
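
For what it is worth, here is a small sketch of why this works. The data frame `toy` is made up for illustration and is not from the thread. rowSums() has na.rm=FALSE by default, so the sum of any row containing an NA is itself NA, and is.na() on the row sums then flags exactly those rows. Note that this assumes all columns are numeric.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA), c = c(7, 8, 9))

# With the default na.rm = FALSE, a row sum is NA as soon as
# the row contains at least one NA
rs <- rowSums(toy)
print(rs)

# Keep only the rows whose sum is not NA, i.e. rows without any NA
kept <- toy[!is.na(rs), ]
print(kept)
```

Only the first row of `toy` survives, since rows 2 and 3 each contain one NA.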

But Petr's solution (f3 below, using complete.cases) appears to be the quickest.
See this benchmark:

> N <- 100000
> set.seed(13)
> df <- data.frame(matrix(sample(c(1:10,NA),N,replace=TRUE),ncol=50))  
> library(rbenchmark)
>
> f1 <- function(df) {df[apply(df, 1, function(x)all(!is.na(x))),][,1:ncol(df)]}
> f2 <- function(df) {df[!is.na(rowSums(df)),][1:ncol(df)]}
> f3 <- function(df) {df[complete.cases(df),][1:ncol(df)]}
>
> benchmark(d1 <- f1(df), d2 <- f2(df), d3 <- f3(df), columns=c("test","elapsed", "relative", "replications"))
          test elapsed relative replications
1 d1 <- f1(df)   3.675   13.172          100
2 d2 <- f2(df)   0.401    1.437          100
3 d3 <- f3(df)   0.279    1.000          100

> identical(d1,d2)
[1] TRUE
> identical(d1,d3)
[1] TRUE
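
As to why the original expression with `|` does not drop the rows, the following sketch may help. Again `toy` is a made-up data frame, not the poster's data. In R, `NA | TRUE` evaluates to TRUE, and non-zero numbers are treated as TRUE, so a single non-missing, non-zero value in a row is enough to hide the NAs in that row. As far as I can see, the `|` expression is therefore NA only when every value in the row is NA (or the rest are zero).

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(X1 = c(1, NA, NA), X2 = c(2, 3, NA))

# Element-wise OR: NA | TRUE is TRUE, NA | NA is NA
any_or <- toy$X1 | toy$X2
print(any_or)   # TRUE TRUE NA

# The original approach only drops the row where all values are NA
loose <- toy[!is.na(any_or), ]

# complete.cases() drops every row that has at least one NA
strict <- toy[complete.cases(toy), ]

print(nrow(loose))   # 2
print(nrow(strict))  # 1
```

Referring to the columns as `toy$X1` rather than via attach() also avoids the "masked" message from the original post.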


Berend



