[R] Is there a fast way to do several hundred thousand ANOVA tests?
Charles C. Berry
cberry at tajo.ucsd.edu
Mon Aug 24 05:03:25 CEST 2009
On Mon, 24 Aug 2009, big permie wrote:
> Dear R users,
>
> I have a matrix a and a classification vector b such that
>
>> str(a)
> num [1:50, 1:800000]
> and
>> str(b)
> Factor w/ 3 levels "cond1","cond2","cond3"
>
> I'd like to do an anova on all 800000 columns and record the F statistic for
> each test; I currently do this using
>
> f.stat.vec <- numeric(length(a[1,]))
>
> for (i in 1:length(a[1,])) {
> f.test.frame <- data.frame(nums = a[,i], cond = b)
> aov.vox <- aov(nums ~ cond, data = f.test.frame)
> f.stat <- summary(aov.vox)[[1]][1,4]
> f.stat.vec[i] <- f.stat
> }
>
> The problem is that this code takes about 70 minutes to run.
Using lsfit(), my five-year-old Windows XP PC does 100k columns in about
40 seconds, so I reckon it could do all 800000 in about 5 minutes:
> x <- factor(sample(1:3,50,replace=TRUE))
> x.mat <- model.matrix( ~x )[, 2:3 ] # drop intercept here
> y <- matrix(rnorm(50*100000),ncol=100000)
> system.time({fit <- lsfit( x.mat, y );ls.pr <- ls.print( fit, print.it=FALSE )})
   user  system elapsed
  39.16    0.58   39.79
>
> # check F-statistic:
> summary(as.numeric( ls.pr[[ 'summary' ]][, "F-value" ]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.2932  0.7095  1.0460  1.4270 17.1100
>
> # theoretical 1st Qu., Median, and 3rd Qu. match
>
> qf(c(.25,0.5, 0.75 ), 2, 47 )
[1] 0.2894502 0.7034708 1.4280000
>
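
As a further spot check, the F-values from ls.print() can be compared
directly with aov() on a handful of columns. This is only a sketch on
freshly simulated data, not output from the session above; the seed, the
5-column y, and the names f.lsfit and f.aov are assumptions chosen for
illustration.

```r
set.seed(42)
x <- factor(sample(1:3, 50, replace = TRUE))
x.mat <- model.matrix(~ x)[, 2:3]        # two dummy columns; lsfit() adds the intercept
y <- matrix(rnorm(50 * 5), ncol = 5)     # small y so the aov() loop is cheap

# F-values as reported by ls.print(); they come back as formatted text
ls.pr <- ls.print(lsfit(x.mat, y), print.it = FALSE)
f.lsfit <- as.numeric(ls.pr[["summary"]][, "F-value"])

# Reference values from aov(), one column at a time
f.aov <- sapply(seq_len(ncol(y)),
                function(i) summary(aov(y[, i] ~ x))[[1]][1, 4])

# Largest discrepancy; expect only rounding noise from the formatted output
max(abs(f.lsfit - f.aov))
```

The two agree because regressing on the dummy columns plus an intercept
is the same model as the one-way layout fitted by aov().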
If it needed to be faster, I would take the part out of ls.print() that
does just the F-statistic and work on it.
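
A minimal sketch of what such a stripped-down version might look like,
assuming x.mat has full column rank and y has no missing values. The
helper name f.from.lsfit is hypothetical and is not taken from
ls.print(); it just computes the usual regression F-statistic from the
lsfit() residuals and skips the per-column coefficient tables.

```r
# Hypothetical helper: F-statistic for every column of y, given the
# dummy-coded design x.mat (without the intercept column).
f.from.lsfit <- function(x.mat, y) {
  x.mat <- as.matrix(x.mat)
  y <- as.matrix(y)
  fit <- lsfit(x.mat, y)                 # intercept is added by default
  res <- as.matrix(fit$residuals)
  df1 <- ncol(x.mat)                     # numerator df (number of groups - 1)
  df2 <- nrow(y) - df1 - 1               # denominator df
  rss <- colSums(res^2)                  # residual sum of squares per column
  tss <- colSums(sweep(y, 2, colMeans(y))^2)   # total sum of squares per column
  ((tss - rss) / df1) / (rss / df2)
}

# Usage on simulated data shaped like the example above, but smaller
set.seed(1)
x <- factor(sample(1:3, 50, replace = TRUE))
x.mat <- model.matrix(~ x)[, 2:3]
y <- matrix(rnorm(50 * 1000), ncol = 1000)
f.stat.vec <- f.from.lsfit(x.mat, y)
summary(f.stat.vec)
```

Note that sweep() makes a centered copy of y, so for a 50 x 800000
matrix this needs room for a second matrix of the same size.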
Of course, a newer computer would be a lot faster, too.
HTH,
Chuck
>
> Is there a faster way to do an anova & record the F stat for each column?
>
> Any help would be appreciated.
>
> Thanks
> Heath
>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901