[R] problem with "APPLY"

Wed May 20 21:19:09 CEST 2009

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Peter Dalgaard
> Sent: Wednesday, May 20, 2009 8:16 AM
> To: De France Henri
> Cc: r-help at r-project.org
> Subject: Re: [R] problem with "APPLY"
> 
> De France Henri wrote:
> > Hello,
> >  
> > The "apply" function seems to behave oddly with my code below
> >  
> > NB : H1 is a data frame. (data in the attached file.)
> > # the first lines are:
> > 1 02/01/2008 0.000000  0  0 0.000000   0
> > 2 03/01/2008 0.000000  0  0 0.000000   0
> > 3 04/01/2008 0.000000  0  0 0.000000   0
> > 4 07/01/2008 0.000000  0  0 0.000000   0
> > 5 08/01/2008 0.000000  0  0 0.000000   0
> > 6 09/01/2008 0.000000  0  0 0.000000   0
> > 7 10/01/2008 0.000000  0  0 0.000000   0
> > 8 11/01/2008 1.010391  0  0 1.102169   0
> > ...
> > The aim of the code is to extract those lines for which
> > there is a strictly positive value in the second column AND
> > in one of the others:
> >  
> > reper=function(x){as.numeric(x[2]>1 & any(x[3:length(x)]>1))}
> >  
> > TAB1= H1[which(apply(H1,1,reper)>0),]
> >  
> > Strangely, this is OK for all the lines, except for the
> > last one. In fact, in H1, the last 2 lines are:
> > 258 29/12/2008 1.476535 1.187615  0 0.000000   0
> > 259 30/12/2008 0.000000 1.147888  0 0.000000   0
> > Obviously, line 258 should be the last line of TAB1, but it
> > is not the case (it does not appear at all) and I really
> > don't understand why. This is all the more strange since
> > applying the function "reper" only to this line 258 gives a
> > "1" as expected...
> > Can someone help ?
> >  
> 
> Works for me...
> 
>         do...1.       V3       V5 V7      V13 V31
> 213 24/10/2008 2.038218 2.820196  0 0.000000   0
> 214 27/10/2008 3.356057 2.588509  0 2.101651   0
> 219 03/11/2008 2.122751 1.648410  0 2.180908   0
> 233 21/11/2008 1.439861 1.883605  0 1.359372   0
> 234 24/11/2008 1.216548 1.480797  0 1.049390   0
> 258 29/12/2008 1.476535 1.187615  0 0.000000   0
> 
> You are crossing the creek to fetch water, though:
> 
> reper <- function(x) x[2]>1 & any(x[3:length(x)]>1)
> TAB1 <-  H1[apply(H1,1,reper),]
> 
> or even
> 
> TAB1 <-  H1[ H1[2] > 1  & apply(H1[3:6] > 1, 1, any),]
> 
> 
> -- 
>     O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>    c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>   (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907

I couldn't reproduce the bad result either.  However, it was
more or less by chance that the results were as good as
they were.  The call
    apply(myDataFrame, 1, FUN)
does essentially the equivalent of
    myMatrix <- as.matrix(myDataFrame)
    for(i in seq_len(nrow(myMatrix)))
          rowResult[i] <- FUN(myMatrix[i,,drop=TRUE])
If myDataFrame contains any factor, character, POSIXt, or
other non-numeric columns then myMatrix will be a matrix
of character strings.  Each column of myDataFrame is passed
through format() to make those strings, so the precise formatting
of the strings depends on all the other elements of the column
(e.g., one big or small number might cause the whole column to
be formatted in "scientific" notation).
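
As an illustration, the coercion can be seen on a small made-up data
frame (the name df and its values are invented here; this is not the
poster's H1).  The sketch only shows that the presence of one character
column turns the whole matrix, and hence every row passed to FUN, into
character strings:

```r
# Toy data frame, made up for illustration (not the poster's H1).
df <- data.frame(date = c("02/01/2008", "11/01/2008"),
                 V3   = c(0, 1.010391),
                 V5   = c(0, 1e-10),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)           # what apply(df, 1, FUN) works on internally
is_char    <- is.character(m)  # the date column forces a character matrix
v5_strings <- m[, "V5"]        # numeric column turned into formatted strings

# Class of the vector that FUN receives for each row
seen <- apply(df, 1, function(x) class(x))

print(is_char)
print(v5_strings)
print(seen)
```

Printing v5_strings shows how one very small value changes the way the
whole column is written out.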

Your reper() function happened to work because
    "2.3" > 0
is interpreted as (I think)
    "2.3" > "0"
which is TRUE (at least in ASCII).  However, if your cutoff were 0.000002
then you might be surprised:
    > "2.3">0.000002
    [1] FALSE
because as.character(0.000002) is "2e-06".
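
A short sketch of the same comparison, with the fix of converting the
string back to a number before comparing.  The variable names x and
cutoff are made up for the illustration:

```r
x      <- "2.3"        # what reper() sees after as.matrix() on a mixed data frame
cutoff <- 0.000002

cutoff_string <- as.character(cutoff)   # the number is turned into this string
as_string <- x > cutoff                 # compared as character strings
as_number <- as.numeric(x) > cutoff     # compared as numbers

print(cutoff_string)
print(as_string)
print(as_number)
```

Calling as.numeric() inside the row function avoids the surprise, but it
still relies on format() not having mangled the values on the way in.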

I think that using apply(MARGIN=1,...) on data.frames is generally
a bad idea and it only really works if all the columns are the same
simple type.  Avoiding it altogether makes for tedious coding like
     H1[ H1[2] > 1  & (H1[,3]>1 | H1[,4]>1 | H1[,5]>1 | H1[,6]>1) ,]
You can also use pmax (parallel max), as in,
     H1[H1[2]>1 & do.call("pmax", unname(as.list(H1[,3:6])))>1, ]
Peter's 2nd solution calls apply(MARGIN=1,...) only on the numeric
part of the data.frame so it works as expected.
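
A minimal sketch comparing the three approaches on made-up data (the
data frame H and its values are invented for illustration and only
mimic the layout of H1).  All three work on the numeric columns only,
so no character coercion takes place:

```r
# Toy data with one character column and three numeric columns.
H <- data.frame(date = c("29/12/2008", "30/12/2008", "31/12/2008"),
                V3   = c(1.476535, 0, 2.1),
                V5   = c(1.187615, 1.147888, 0),
                V7   = c(0, 0, 0),
                stringsAsFactors = FALSE)

others <- H[, 3:4]   # the columns tested with "any"

keep_or    <- H[[2]] > 1 & (H[[3]] > 1 | H[[4]] > 1)
keep_pmax  <- H[[2]] > 1 & do.call("pmax", unname(as.list(others))) > 1
keep_apply <- H[[2]] > 1 & apply(others > 1, 1, any)

print(H[keep_pmax, ])
```

On data without missing values the three selections agree.  With NAs
they can differ, since pmax() propagates NA by default while | and
any() return TRUE as soon as one operand is TRUE.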

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com