[R] Large number of dummy variables
Douglas Bates
bates at stat.wisc.edu
Tue Jul 22 01:07:26 CEST 2008
On Mon, Jul 21, 2008 at 5:45 PM, Bert Gunter <gunter.berton at gene.com> wrote:
> Unless I'm way off base, dummy variables are never needed (nor desirable)
> in R; they should be modelled as factors instead. AN INTRO TO R might, and
> certainly V&R's MASS and others will, explain this in more detail.
But Alan wants to have those factors in a linear regression model. If
you use lm then it will create a dense model matrix from those factors
and that's when you run out of memory.
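A quick back-of-the-envelope check of the sizes Alan quotes below (100,000 observations, 1378 + 1390 factor levels) shows why the dense matrix fails:

```r
## Rough size of the dense model matrix lm() would build:
## 100,000 rows x (1378 + 1390 - 1) dummy columns of 8-byte doubles
## (one level dropped for the intercept/contrasts).
n_obs  <- 1e5
n_cols <- 1378 + 1390 - 1
size_gib <- n_obs * n_cols * 8 / 2^30
size_gib   # roughly 2.1 GiB, before lm() makes any working copies
```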
Alan: I haven't read the whole discussion yet but if you really,
really want to use a fixed-effects model with factors that have that
many levels then you can form (the transpose of) the sparse model
matrix for just those factors using code like
library(Matrix)
## rows = factor levels, columns = observations (the transpose of the model matrix);
## rbind() has replaced the older rBind() in current versions of Matrix
MMt <- rbind(as(fac1, "sparseMatrix"), as(fac2, "sparseMatrix")[-1, ])
At that point you may be able to use
solve(tcrossprod(MMt), MMt %*% y)
to solve for the coefficients. Notice that I have dropped the indicator
row for the first level of the second factor but kept all the
indicator rows for the first factor. Thus the coefficients
correspond to an lm specification of
lm(y ~ 0 + fac1 + fac2, ...)
under the default contrasts.
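A self-contained sketch of that approach on small simulated data (the factor sizes and names here are made up for illustration; `fac2sparse()` is the Matrix helper that produces the same levels-by-observations sparse matrix as the `as(., "sparseMatrix")` coercion):

```r
library(Matrix)

set.seed(1)
n <- 100
fac1 <- factor(sample(letters[1:5], n, replace = TRUE))
fac2 <- factor(sample(LETTERS[1:4], n, replace = TRUE))
beta1 <- rnorm(5)
beta2 <- c(0, rnorm(3))            # first level of fac2 is the baseline
y <- beta1[fac1] + beta2[fac2] + rnorm(n, sd = 0.1)

## Transpose of the sparse model matrix: all levels of fac1,
## first level of fac2 dropped.
MMt <- rbind(fac2sparse(fac1), fac2sparse(fac2)[-1, ])

## Solve the normal equations  (X'X) b = X'y  with X = t(MMt).
coef_sparse <- drop(as.matrix(solve(tcrossprod(MMt), MMt %*% y)))

## Same specification fit densely, for comparison.
coef_dense <- unname(coef(lm(y ~ 0 + fac1 + fac2)))
all.equal(coef_sparse, coef_dense, tolerance = 1e-6)
```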
I'm not sure that is the best way of solving for the coefficients. I
would need to look at the code for that solve method to see what form
of factorization it uses. Also, I agree with Harold that you
really should consider using random effects for those factors. It is
almost never a good idea to try to estimate fixed effects for
thousands of levels of a factor.
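For comparison, a hypothetical sketch of the random-effects alternative using lme4 (the variable names and simulated sizes are illustrative, not from the thread):

```r
## Crossed random effects for the two grouping factors via lme4,
## instead of thousands of fixed-effect dummies.
library(lme4)

set.seed(2)
n <- 500
fac1 <- factor(sample(50, n, replace = TRUE))
fac2 <- factor(sample(40, n, replace = TRUE))
y <- rnorm(50)[fac1] + rnorm(40)[fac2] + rnorm(n)

fit <- lmer(y ~ 1 + (1 | fac1) + (1 | fac2))
VarCorr(fit)   # estimated variances of the two sets of effects
```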
> -- Bert Gunter
> Genentech, Inc.
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Doran, Harold
> Sent: Monday, July 21, 2008 3:16 PM
> To: aspearot at ucsc.edu; r-help at r-project.org
> Cc: Douglas Bates
> Subject: Re: [R] Large number of dummy variables
>
> Well, at the risk of entering a debate I really don't have time for (I'm
> doing it anyway) why not consider a random coefficient model? If your
> response has anything like, "well, random effects and fixed effects are
> correlated and so the estimates are biased but OLS is consistent and
> unbiased via an appeal to Gauss-Markov" then I will probably make time
> for this discussion :)
>
> I have experienced this problem, though. In what you're doing, you are
> first creating the model matrix and then doing the demeaning, correct? I
> do recall Doug Bates was, at one point, doing some work where the model
> matrix for the fixed effects was immediately created as a sparse matrix
> for OLS models. I think doing the work on the sparse matrix is a better
> analytical method than time-demeaning. I don't remember where that work
> is, though.
>
> There is a package called SparseM which has functions for doing OLS with
> sparse matrices. I don't know its status, but I vaguely recall the author
> of SparseM at one point noting that the Matrix package of Bates and
> Maechler would be the go-to package for work with large, sparse model
> matrices.
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Alan Spearot
>> Sent: Monday, July 21, 2008 5:59 PM
>> To: r-help at r-project.org
>> Subject: [R] Large number of dummy variables
>>
>> Hello,
>>
>> I'm trying to run a regression predicting trade flows between
>> importers and exporters. I wish to include both
>> year-importer dummies and year-exporter dummies. The former
>> includes 1378 levels, and the latter includes 1390 levels. I
>> have roughly 100,000 total observations.
>>
>> When I use lm() to run a simple regression, it gives me a
>> "cannot allocate ___" error. I've been able to get around this by
>> time-demeaning over one large group, but since I have two such
>> groups, that approach doesn't work correctly. Is there a more
>> efficient way to handle a model matrix this large in R?
>>
>> Thanks for your help.
>>
>> Alan Spearot
>>
>> --
>> Alan Spearot
>> Assistant Professor - International Economics University of
>> California - Santa Cruz
>> 1156 High Street
>> 453 Engineering 2
>> Santa Cruz, CA 95064
>> Office: (831) 459-1530
>> acspearot at gmail.com
>> http://people.ucsc.edu/~aspearot
>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>