[R] slow computation of mixed ANOVA using aov
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Sat Mar 19 11:19:19 CET 2005
"Steven Lacey" <slacey at umich.edu> writes:
> Dear R-help list,
>
> I am trying to do a mixed ANOVA on an 8960 x 5 data frame. I have 3 factors
> for which I want to test all main effects and interactions: f1 (40 levels),
> f2 (7 levels), and f3 (4 levels). I also have a subject factor, subject, and
> a dependent variable, dv.
>
> Some more information about the factors:
> f2 is a between-subject factor. That is, for each level of f2 there are 8
> nested levels of the subject factor. For example, levels 1-8 of subject are
> nested in level 1 of f2. Levels 9-16 of subject are nested in level 2 of f2.
> In other words, the subjects that participated in any level of f2 are
> different from the subjects that participated in any other level of f2.
>
> In contrast, f1 and f3 are within-subject factors. That is, for any one of
> the 56 subjects, we have 160 medians, one for each condition in the
> complete crossing of factors f1 and f3. While it is true that we do have
> replicate observations for any subject in each of these conditions, we take
> the median of those values and operate as if there is only a single
> observation for each subject in each of the 160 within-subject conditions.
>
> Below is code that will generate dataframe d, which is just like the one I
> am working with:
>
> f1<-gl(40,1,8960,ordered=T)
> f2<-gl(7,1280,8960,ordered=T)
> f3<-gl(4,40,8960,ordered=T)
> subject<-gl(56,160,8960,ordered=T)
> dv<-rnorm(8960,mean=500,sd=50)
> d <- data.frame(f1,f2,f3,subject,dv)
>
> To run the mixed ANOVA I use the following call (modeled after J. Baron):
> aov(dv~f1*f2*f3+Error(subject/(f1*f2)),data=d)
[You mean subject/(f1*f3), right? "f2" is a coarsening of "subject"]
> WARNING: Exercise caution when running the aov command. I have run the exact
> same command on Windows and Unix machines (each with 1GB of RAM; allocated
> up to 3 or 4GB of memory for the analysis) and it has taken many, many
> hours to finish. That said, this is not a new problem posted on the R-help
> list. There are several posts where analysts have run into similar problems.
> My general impression of these posts, and correct me if I am wrong, is that
> because aov is a wrapper around lm, the extra time is spent building and
> manipulating a design matrix (via QR decomposition) that is 8960 rows by
> several thousand columns! Is that so? That seems plausible, because if I
> call aov with only a single factor, it returns in a few seconds.
Yes, this is basically correct. The main issue is the calculation of
the projection onto the terms in the Error model, which is done using
the generic lm algorithm, even though the design is typically orthogonal
so that there are much faster ways to get to the result.
To paraphrase: if you have double determinations and an error term at
the level of each pair, the algorithm fits a model with N/2 parameters
in order to calculate the mean and difference within each pair. For
large designs, this becomes an issue...
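To put some numbers on it for your design (using the simulated data frame d
from your post): the lowest error stratum, subject:f1:f3, has one cell per
observation, so the projection onto the Error terms effectively involves a
design matrix with as many columns as you have rows:

## one column per cell of the full within-subject error term
nlevels(d$subject) * nlevels(d$f1) * nlevels(d$f3)   # 56 * 40 * 4 = 8960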
This is in some sense a tradeoff of generality for speed, but the
results for non-orthogonal designs are typically very hard to
interpret.
The topic does come up off and on, and we have been considering the
option of adding an algorithm where it is assumed that the Error model
consists of mutually orthogonal and balanced terms (in the sense that
all factor levels appear equally frequently). Just a "simple matter of
programming"...
For the near term, there are a couple of things that you can do:
- avoid having an error term that is equivalent to the full data set.
In your case, subject:f1:f3 is such a term, and subject/(f1+f3) is
actually equivalent (the second order interaction term becomes the
residual stratum). This at least saves you from inverting an NxN
matrix. (A sketch of the reformulated call follows this list.)
- use a version of R compiled with a fast BLAS, on a fast computer
with a lot of RAM... (A ~2K square matrix inversion will take a
while, but "hours" sounds a bit excessive).
- (not too sure of this, but R 2.1.0 will have some new code for
multivariate lm, with intra-individual designs, and
tests under the sphericity assumptions; it is possible that
your models can be reformulated in that framework. You'd have to
set it up as a model with 160 responses on 56 subjects, though, and
the code wasn't really designed with that sort of intrasubject
dimensionality in mind. A rough sketch follows below as well.)
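To make the first suggestion concrete, here is a sketch of the reformulated
call on the simulated data above (the subject:f1:f3 term then simply becomes
the residual/Within stratum):

fit <- aov(dv ~ f1*f2*f3 + Error(subject/(f1 + f3)), data = d)
summary(fit)

And a very rough sketch of the multivariate reformulation from the last
point, using the new anova.mlm code; the variable names here are only
illustrative, and the reservation above applies: the high-dimensional
within-subject terms (f1, f1:f3) push against the 56 - 7 = 49 residual
degrees of freedom, so only low-dimensional terms such as f3 are likely to
be testable this way.

## one row per subject, one column per within-subject cell; given the gl()
## calls above, f1 varies fastest within each subject
Y     <- matrix(d$dv, nrow = 56, byrow = TRUE)        # 56 x 160 response matrix
f2s   <- gl(7, 8)                                     # between-subject factor, one value per subject
idata <- expand.grid(f1 = gl(40, 1), f3 = gl(4, 1))   # within-subject design, matching the column order
mlm   <- lm(Y ~ f2s)
## within-subject f3 main effect under sphericity; the (Intercept) line
## tests f3, the f2s line the f2:f3 interaction
anova(mlm, M = ~ f3, X = ~ 1, idata = idata, test = "Spherical")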
> In order to test if the computational slowness was something unique to R, I
> ran the same analysis, including all 3 factors, in SPSS. To my surprise SPSS
> returned almost instantaneously. I do not know much about the algorithm in
> SPSS, but I suspect it may be calculating condition means and sums of
> squares rather than generating a design matrix. Does that sound plausible?
>
> At this point I am a dedicated R user. However, I do the kind of analysis
> described above quite often. It is important that my statistical package be
> able to handle it more efficiently than what I have been able to get R to do
> at this point. Am I doing anything obviously wrong? Is there a method in R
> that more closely resembles the algorithm used in SPSS? If not, are there
> any other methods R has to do these kind of analyses? Could I split up the
> analysis in some principled way to ease the processing demand on R?
>
> Thanks in advance for any further insight into this issue,
> Steve Lacey
>
--
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907