[R] sciplot question
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Mon May 25 21:06:18 CEST 2009
spencerg wrote:
> Frank E Harrell Jr wrote:
>> spencerg wrote:
>>> Dear Frank, et al.:
>>>
>>> Frank E Harrell Jr wrote:
>>>> <snip>
>>>> Yes; I do see a normal distribution about once every 10 years.
>>>
>>> To what do you attribute the nonnormality you see in most cases?
>>> (1) Unmodeled components of variance that can generate
>>> errors in interpretation if ignored, even with bootstrapping?
>>>
>>> (2) Honest outliers that do not relate to the phenomena of
>>> interest and would better be removed through improved checks on data
>>> quality, but where bootstrapping is appropriate (provided the data
>>> are not also contaminated with (1))?
>>>
>>> (3) Situations where the physical application dictates a
>>> different distribution such as binomial, lognormal, gamma, etc.,
>>> possibly also contaminated with (1) and (2)?
>>>
>>> I've fit mixtures of normals to data before, but one needs to be
>>> careful about not carrying that to extremes, as the mixture may be a
>>> result of (1) and therefore not replicable.
>>>
>>> George Box once remarked that he thought most designed
>>> experiments included split plotting that had been ignored in the
>>> analysis. That is only a special case of (1).
>>>
>>> Thanks,
>>> Spencer Graves
>>
>> Spencer,
>>
>> Those are all important reasons for non-normality of marginal
>> distributions. But the biggest reason of all is that the underlying
>> process did not know about the normal distribution. Normality in raw
>> data is usually an accident.
>
> Frank:
>
> Might there be a difference between the physical and social
> sciences on this issue?
Hi Spencer,
I doubt that the difference is large, but biological measurements seem
to be more of a problem.
>
> The central limit effect works pretty well with many kinds of
> manufacturing data, except that it is often masked by between-lot
> components of variance. The first differences in log(prices) are often
> long-tailed and negatively skewed. Standard GARCH and similar models
> handle the long tails well but miss the skewness, at least in what I've
> seen. I think that can be fixed, but I have not yet seen it done.
The central limit theorem in and of itself doesn't help because it
doesn't tell you how large N must be before normality holds well enough.
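
A small simulation can make this concrete. What follows is only a sketch under stated assumptions: the raw data are taken to be lognormal(0, 1), the sample sizes and replication count are arbitrary, and the helper names (skewness, skew_of_sample_means) are hypothetical. It estimates how skewed the sampling distribution of the mean still is at a few values of N; a value near zero would be consistent with approximate normality.

```python
import random
import statistics


def skewness(xs):
    """Moment-based sample skewness: mean of cubed standardized values."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def skew_of_sample_means(n, reps=2000, seed=1):
    """Skewness of the simulated sampling distribution of the mean
    of n draws from an (assumed) lognormal(0, 1) population."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = [
        statistics.fmean([rng.lognormvariate(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return skewness(means)


# Arbitrary sample sizes chosen for illustration only.
results = {n: skew_of_sample_means(n) for n in (5, 30, 200)}
for n, g1 in results.items():
    print(f"N = {n:3d}  skewness of sample means = {g1:.2f}")
```

For a skewed parent distribution, the skewness of the mean shrinks only in proportion to 1/sqrt(N), so how large N must be depends on how skewed the raw data are, which the theorem alone does not reveal.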
>
> Social science data, however, often involve discrete scales where
> the raters' interpretations of the scales rarely match any standard
> distribution. Transforming to latent variables, e.g., via factor
> analysis, may help but does not eliminate the problem.
Good example. Many of the scales I've seen are non-normal or even
multi-modal.
>
> Thanks for your comments.
Thanks for yours.
Frank
> Spencer
>>
>> Frank
>>
>
>
--
Frank E Harrell Jr, Professor and Chair
Department of Biostatistics, School of Medicine
Vanderbilt University