[R] building a formula for glm() with 30,000 independent variables

Mon Nov 11 02:29:28 CET 2002

Murray Jorgensen wrote:

> You have not really given enough background to enable much help to be 
> given.

In a way, that was intentional.  I was hoping that my problem was merely 
a matter of proper R usage.  But several of you have politely pointed 
out that my underlying thinking about the statistics itself is flawed. 
With 30,000 predictors and an order of magnitude *fewer* observations, I 
should expect to find a bogus but perfectly predictive model even if 
everything were random noise.

> Knowledge of any structure on the predictors may suggest strategies 
> for choosing representative predictors.

Understood.  Since my statistics background is so weak, perhaps it would 
be wise at this point to explain what exactly I am trying to accomplish, 
and thereby better leverage this list's expertise.

The "predictors" here are randomly sampled observations of the behavior 
of a running program.  We decide in advance what things to observe.  For 
example, we might decide to check whether a particular pointer on a 
particular line is null.  So that would give us two counters: one 
telling us how many times it was seen to be null, and one telling us how 
many times it was seen to be not null.

A similar sort of instrumentation would be to guess that a pair of 
program variables might be related in some way.  At random intervals we 
check their values and increment one of three counters depending on 
whether the first is less than, equal to, or greater than the second. 
So that would give us a trio of related predictors.

We don't update these counters every time a given line of code executes, 
though:  we randomly sample perhaps 1 in 100 or 1 in 1000 of those 
opportunities.  The samples are fair in the sense that each sampling 
opportunity is taken or skipped randomly and independently of every 
other opportunity.
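
To make the sampling scheme concrete, here is a minimal sketch in Python 
of one such instrumentation site.  It is only an illustration, not the 
actual instrumentation:  the class name SampledComparison, the observe() 
method, and the "index vs. limit" loop are all hypothetical, and the 
fixed seed is there only to make the sketch repeatable.

```python
import random


class SampledComparison:
    """Hypothetical instrumentation site comparing two program variables."""

    def __init__(self, rate, rng):
        self.rate = rate      # sampling probability, e.g. 1/100 or 1/1000
        self.rng = rng
        self.less = 0
        self.equal = 0
        self.greater = 0

    def observe(self, a, b):
        # Each opportunity is taken or skipped independently of all others.
        if self.rng.random() >= self.rate:
            return
        if a < b:
            self.less += 1
        elif a == b:
            self.equal += 1
        else:
            self.greater += 1

    def counters(self):
        # The trio of related predictors for this pair of variables.
        return (self.less, self.equal, self.greater)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
site = SampledComparison(rate=1 / 100, rng=rng)
for index in range(10_000):
    site.observe(index, 5_000)  # stand-in for comparing "index" with "limit"
print(site.counters())
```

With a rate of 1/100, roughly one in a hundred of the 10,000 
opportunities ends up in one of the three counters.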

The "dependent outcome" is whether the program ultimately crashes or 
exits successfully.  The goal is to identify those program behaviors 
which are strongly predictive of an eventual crash.  For example, if the 
program has a single buffer overrun bug, we might discover that the 
"(index > limit) on line 196" counter is nonzero every time we crash, 
but is zero for most runs that do not crash.

(Most, but not all.  Sometimes you can overrun a buffer but not crash. 
"Getting lucky" is part of what we're trying to express in our model.)
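
The pattern described for the buffer overrun counter can be tabulated 
for a single counter as in the Python sketch below.  This is just a 
restatement of the example, not a proposed answer to the modelling 
question.  The function name nonzero_rates and the six runs are made up 
for illustration.

```python
def nonzero_rates(runs):
    """Summarize one counter across many runs.

    runs: list of (counter_value, crashed) pairs.
    Returns (share of crashing runs with a nonzero counter,
             share of successful runs with a nonzero counter).
    """
    crash_total = crash_nonzero = 0
    ok_total = ok_nonzero = 0
    for count, crashed in runs:
        if crashed:
            crash_total += 1
            crash_nonzero += 1 if count > 0 else 0
        else:
            ok_total += 1
            ok_nonzero += 1 if count > 0 else 0
    crash_rate = crash_nonzero / crash_total if crash_total else float("nan")
    ok_rate = ok_nonzero / ok_total if ok_total else float("nan")
    return crash_rate, ok_rate


# Made-up data: the counter is nonzero in every crashing run, and in one
# successful run that overran the buffer but "got lucky".
example_runs = [(3, True), (1, True), (0, False),
                (0, False), (2, False), (0, False)]
print(nonzero_rates(example_runs))
```

A counter of interest would show a first value near 1 and a second 
value well below it.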

In my current experiment, I have about 10,000 pairs of program variables 
being compared with a "less", "equal", and "greater" counter for each. 
Thus, 30,000 predictors.  Almost all of these should be irrelevant.  And 
they certainly are not independent of each other.  Looking along the 
other axis, I've got about 3300 distinct program runs, of which roughly 
one fifth crash.  I have complete and perfect counter information for 
all of these runs, which I can easily postprocess to simulate sampled 
counters with any desired sampling density.
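
One way the postprocessing step could look, sketched in Python under 
the assumption that each counted event is kept independently with the 
chosen sampling probability (so each sampled counter is a binomial draw 
from the complete count).  The function names thin_count and thin_run, 
and the example counts, are hypothetical.

```python
import random


def thin_count(full_count, rate, rng):
    """Simulate a sampled counter from a complete one.

    Each of the full_count events is kept independently with
    probability rate.
    """
    return sum(1 for _ in range(full_count) if rng.random() < rate)


def thin_run(full_counts, rate, rng):
    """Apply the same sampling density to every counter of one run."""
    return [thin_count(count, rate, rng) for count in full_counts]


rng = random.Random(0)      # fixed seed only so the sketch is repeatable
complete = [1200, 0, 37]    # made-up less/equal/greater counts for one pair
print(thin_run(complete, 1 / 100, rng))
```

Rerunning this with a different rate gives the same run as it would 
have looked under a sparser or denser sampling scheme.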

I'm getting the distinct impression that a standard logistic regression 
with 30,000 predictors is *not* a practical approach.  What should I be 
using instead?  I'm frustrated by the fact that while the problem seems 
conceptually simple enough, I just don't have the statistics background 
required to know how to solve it correctly.  If any of you have any 
suggestions, I certainly welcome them.

Thank you, one and all.  You've been quite generous with your advice 
already, and I certainly do appreciate it.
