[R] lean and mean lm/glm?
Thomas Lumley
tlumley at u.washington.edu
Wed Aug 23 17:25:54 CEST 2006
On Wed, 23 Aug 2006, Damien Moore wrote:
>
> Thomas Lumley wrote:
>
>> No, it is quite straightforward if you are willing to make multiple passes
>> through the data. It is hard with a single pass and may not be possible
>> unless the data are in random order.
>>
>> Fisher scoring for glms is just an iterative weighted least squares
>> calculation using a set of 'working' weights and 'working' response. These
>> can be defined chunk by chunk and fed to biglm. Three iterations should
>> be sufficient.
>
> (NB: Although not stated clearly I was referring to single pass when I
> wrote "impossible"). Doing as you suggest with multiple passes would
> entail either sticking the database input calls into the main iterative
> loop of a lookalike glm.fit or lumping the user with a very unattractive
> sequence of calls:

I have written most of a bigglm() function where the data= argument is a
function with a single argument 'reset'. When called with reset=FALSE the
function should return another chunk of data, or NULL if no data are
available, and when called with reset=TRUE it should go back to the
beginning of the data. I don't think this is too inelegant.
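
To make the interface concrete, here is a minimal sketch in base R. It is not the bigglm() code: make_chunk_source() and chunked_logit() are hypothetical names, the data source reads from an in-memory data frame rather than a database, and the fit is restricted to a logistic model with a 0/1 response and numeric predictors. Each pass through the data accumulates X'WX and X'Wz from the working weights and working response, one chunk at a time. Because the sketch starts from all-zero coefficients rather than a good initial estimate, it uses more passes than the three suggested above.

```r
# Hypothetical data source: a closure that hands out successive chunks of a
# data frame. Called with reset = TRUE it rewinds; with reset = FALSE it
# returns the next chunk, or NULL when no data are left.
make_chunk_source <- function(df, chunk_size) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)
    idx <- seq(pos + 1, min(pos + chunk_size, nrow(df)))
    pos <<- max(idx)
    df[idx, , drop = FALSE]
  }
}

# Multi-pass Fisher scoring (iteratively reweighted least squares) for a
# logistic regression. Assumes a numeric 0/1 response and numeric predictors,
# so every chunk yields the same model-matrix columns.
chunked_logit <- function(formula, data, passes = 8) {
  beta <- NULL
  for (pass in seq_len(passes)) {
    data(reset = TRUE)
    xtwx <- 0
    xtwz <- 0
    while (!is.null(chunk <- data(reset = FALSE))) {
      mf <- model.frame(formula, chunk)
      X <- model.matrix(formula, mf)
      y <- model.response(mf)
      if (is.null(beta)) beta <- numeric(ncol(X))  # start from zero
      eta <- drop(X %*% beta)
      mu <- plogis(eta)
      w <- mu * (1 - mu)        # working weights
      z <- eta + (y - mu) / w   # working response
      xtwx <- xtwx + crossprod(X, w * X)
      xtwz <- xtwz + crossprod(X, w * z)
    }
    beta <- drop(solve(xtwx, xtwz))  # weighted least squares update
  }
  beta
}

# Usage on simulated data (illustrative values only)
set.seed(1)
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + dat$x1 - 0.7 * dat$x2))
src <- make_chunk_source(dat, chunk_size = 250)
fit <- chunked_logit(y ~ x1 + x2, src)
```

On data like this the result should agree closely with glm(y ~ x1 + x2, family = binomial, data = dat), since both solve the same score equations. Only one chunk plus a p x p matrix is held in memory at any time.
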

In general I don't think a one-pass algorithm is possible. If the data are
in random order then you could read one chunk, fit a glm, and set up a
grid of coefficient values around the estimate. You then read the rest of
the data, computing the loglikelihood and score function at each point in
the grid. After reading all the data you can then fit a suitable smooth
surface to the loglikelihood. I don't know whether this will give
sufficient accuracy, though.
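
A rough sketch of the grid idea, again in base R and again only an illustration under assumptions: the model is a two-coefficient logistic regression, the grid is 5 x 5 and spans two first-chunk standard errors in each direction, only the loglikelihood (not the score) is accumulated, and the "smooth surface" is a plain quadratic fitted by lm(). None of these choices comes from tested code, and the accuracy question raised above still stands.

```r
# Simulated data in random order (illustrative values only)
set.seed(42)
n <- 20000
chunk_size <- 2000
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.3 + 0.8 * dat$x))

# Step 1: fit a glm to the first chunk only
first <- dat[seq_len(chunk_size), ]
fit0 <- glm(y ~ x, family = binomial, data = first)
b0 <- coef(fit0)
se0 <- sqrt(diag(vcov(fit0)))

# Step 2: grid of coefficient values around that estimate, expressed in
# units of the first-chunk standard errors (u for intercept, v for slope)
steps <- seq(-2, 2, length.out = 5)
grid <- expand.grid(u = steps, v = steps)
beta_grid <- cbind(b0[1] + grid$u * se0[1], b0[2] + grid$v * se0[2])

# Step 3: one pass over the data, accumulating the Bernoulli
# loglikelihood at every grid point
loglik <- numeric(nrow(grid))
for (start in seq(1, n, by = chunk_size)) {
  chunk <- dat[start:min(start + chunk_size - 1, n), ]
  X <- cbind(1, chunk$x)
  eta <- X %*% t(beta_grid)  # rows: observations, columns: grid points
  loglik <- loglik + colSums(chunk$y * eta - log1p(exp(eta)))
}
grid$loglik <- loglik

# Step 4: fit a quadratic surface and take its stationary point
surf <- lm(loglik ~ u + v + I(u^2) + I(v^2) + I(u * v), data = grid)
cf <- unname(coef(surf))
H <- matrix(c(2 * cf[4], cf[6], cf[6], 2 * cf[5]), 2, 2)  # Hessian
uv <- solve(H, -cf[2:3])
est <- b0 + uv * se0  # back to the coefficient scale
```

Here est can be compared with coef(glm(y ~ x, family = binomial, data = dat)) to see how far the one-pass approximation is from the full fit. Working in standardized steps (u, v) keeps the quadratic regression well conditioned; the number of grid points grows quickly with the number of coefficients, which limits the idea to small models.
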

For really big data sets you are probably better off with the approach
that Brian Ripley and Fei Chen used -- they have shown that it works, and
there is unlikely to be anything much simpler that also works that they
missed.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle