[R] R's Data Dredging Philosophy for Distribution Fitting

Frank E Harrell Jr f.harrell at Vanderbilt.Edu
Thu Jul 15 03:31:41 CEST 2010


On 07/14/2010 06:22 PM, emorway wrote:
>
> Forum,
>
> I'm a grad student in Civil Eng, took some Stats classes that required
> students learn R, and I have since taken to R and use it for as much as I
> can.  Back in my lab/office, many of my fellow grad students still use
> proprietary software at the behest of advisers who are familiar with the
> recommended software (Statistica, @Risk (Excel Add-on), etc).  I have spent
> a lot of time learning R and am confident it can generally out-process,
> out-graph, or more simply stated, out-perform most of these other software
> packages.  However, one area my view has been humbled in is distribution
> fitting.
>
> I started by reading through
> http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf  After that
> I started digging around on this forum and found posts like this one
> http://r.789695.n4.nabble.com/Fitting-usual-distributions-td800000.html#a800000
> that are close to what I'm after.  That is, given an observation dataset, I
> would like to call a function that cycles through numerous distributions
> (common or not) and then ranks them for me based on Chi-Square,
> Kolmogorov-Smirnov and/or Anderson-Darling, for example.
>
> This question was asked back in 2004:
> http://finzi.psych.upenn.edu/R/Rhelp02a/archive/37053.html but the response
> was that this kind of thing wasn't in R nor in proprietary software to the
> best of the responding author's memory.  In 2010, however, this is no longer
> true as @Risk's
> (http://www.palisade.com/risk/?gclid=CKvblPSM7KICFZQz5wodDRI2fg)
> "Distribution Fitting" function does this very thing.  And it is here that
> my R pride has taken a hit.  Based on the first response to the question
> posed here
> http://r.789695.n4.nabble.com/Which-distribution-best-fits-the-data-td859448.html#a859448
> is it fair to say that the R community (I realize this is only 1 view) would
> take exception to this kind of "data mining"?
>
> Unless I've missed a discussion of a package that does this very thing, it
> seems as though I would need to code something up using fitdistr() and do
> all the ranking myself.  Undoubtedly that would be a good exercise for me,
> but it's hard for me to believe R would be a runner-up to something like
> distribution fitting in @Risk.
>
> Eric

Eric,

I didn't read the links you provided but the approach you have advocated 
(and you are not alone) is futile.  If you entertain more than about 2 
distributions, the variance of the final fits is no better than the 
variance of the empirical cumulative distribution function (once you 
properly adjust variances for model uncertainty).  So just go empirical.
In general, if your touchstone is the observed data (as in checking 
goodness of fit of various parametric distributions), your final 
estimators will have the variance of empirical estimators.

Frank
-- 
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                      Department of Biostatistics   Vanderbilt University
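
[Editorial sketch, not part of the original message.]  One way to read
"just go empirical" is to estimate probabilities and quantiles directly
from the data with base R's ecdf() and quantile(), and to get their
uncertainty from a nonparametric bootstrap.  The data vector x, the
cutoff 5, the chosen percentiles, and the number of resamples B below
are all assumptions made up for illustration; nothing here comes from
the thread itself.

```r
# Illustrative stand-in data (assumption): 200 simulated positive values.
set.seed(1)
x <- rgamma(200, shape = 2, rate = 0.5)

# Empirical CDF: a step function built from the data alone,
# with no parametric family chosen or compared.
Fn <- ecdf(x)

# Estimated P(X <= 5), read straight off the empirical CDF.
p5 <- Fn(5)

# Empirical quantiles: median, 90th and 95th percentiles.
q <- quantile(x, probs = c(0.5, 0.9, 0.95))

# Nonparametric bootstrap for the 90th percentile:
# resample the data with replacement and recompute the quantile.
B <- 1000
boot_q90 <- replicate(
  B,
  quantile(sample(x, replace = TRUE), probs = 0.9, names = FALSE)
)

# Percentile-method 95% interval for the 90th percentile.
ci_q90 <- quantile(boot_q90, probs = c(0.025, 0.975))

print(p5)
print(q)
print(ci_q90)
```

Because no distribution is selected by looking at the data, there is no
model-selection step whose uncertainty has to be accounted for
afterwards; the bootstrap interval reflects sampling variability only.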


