[R] Displaying a distribution -- was: Combining two histograms

Wed Feb 2 19:03:58 CET 2005

      There are PP plots and QQ plots for any distribution, and I've 
experimented a little (though not much) with PP plots and with QQ plots 
for uniform, Student's t, chi-square and F distributions.  I've found 
qqnorm plots very useful, but I don't recall learning much from any 
other probability plots. 
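
A minimal sketch of what a QQ plot and a PP plot against a non-normal reference distribution can look like in R.  The data are simulated and the chi-square reference with df = 3 is an assumption made purely for illustration; nothing here comes from data discussed in this thread.

```r
## Sketch only: simulated data; df = 3 is an arbitrary choice.
set.seed(1)
x  <- rchisq(200, df = 3)
pp <- ppoints(length(x))          # plotting positions in (0, 1)

op <- par(mfrow = c(1, 2))
## QQ plot: theoretical quantiles against the sorted data
plot(qchisq(pp, df = 3), sort(x),
     xlab = "Chi-square(3) quantiles", ylab = "Sorted data",
     main = "QQ plot")
abline(0, 1)
## PP plot: plotting positions against fitted cumulative probabilities
plot(pp, pchisq(sort(x), df = 3),
     xlab = "Plotting position", ylab = "Fitted probability",
     main = "PP plot")
abline(0, 1)
par(op)
```

Swapping qchisq/pchisq for qt/pt, qf/pf or qunif/punif gives the corresponding plots for the other distributions mentioned above.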

      In data mining situations, I've computed hundreds of p-values, 
possibly associated with Student's t or F or log(likelihood ratio) 
approximate chi-squares.  I've found it useful to convert them to normal 
scores via qnorm(p), then make a normal plot of those scores.  Points 
off the line in the lower tail are statistically significant.  Points on 
the line are consistent with chance alone.  If the line has a slope 
different from one or is not centered at zero, there may be hidden 
components of variance or serial dependence of various kinds that I'm 
not modeling properly in computing the p-values.  This "p-value plot" 
seems to provide a subtle check and a first-order correction for a 
variety of violations of assumptions like these. 
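
The "p-value plot" described above might be sketched in R roughly as follows.  The p-values are simulated: the mixture of 480 null tests and 20 genuine effects is an assumption for illustration, not data from any real analysis.

```r
## Sketch only: simulated p-values, not real data.
set.seed(2)
p <- c(runif(480),                 # tests where the null hypothesis holds
       rbeta(20, 1, 200))          # a few genuine effects: very small p-values
z <- qnorm(p)                      # normal scores; small p gives large negative z

qqnorm(z, main = "p-value plot")
abline(0, 1)                       # reference line: slope 1, centered at 0
```

If the p-values are computed correctly and every null hypothesis holds, the scores should lie along the reference line; the simulated effects show up as points dropping below it in the lower tail.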

      Comments?
      Best Wishes,
      spencer
p.s.  Regarding normal plots with millions of points:  I find them still 
useful.  However, we need some kind of heuristic to decimate excess 
points so we still get the same visual image without a plot object that 
consumes gigabytes on the hard drive, hours to plot, and can't be 
exported to PowerPoint, for example. 
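
One possible decimation heuristic, sketched in R: keep every point in the two tails, where the detail matters, and only an evenly spaced subset of the middle.  The function name thin_qq() and its arguments n_tail and n_mid are hypothetical, and the default values are arbitrary; this is only one of many ways to thin the points.

```r
## Sketch only: thin_qq(), n_tail and n_mid are hypothetical names.
thin_qq <- function(x, n_tail = 500, n_mid = 2000) {
  x <- sort(x)
  n <- length(x)
  q <- qnorm(ppoints(n))           # same plotting positions qqnorm() uses
  ## Small samples: nothing to thin, return every point
  if (n <= 2 * n_tail + n_mid) return(list(x = q, y = x))
  ## Evenly spaced indices through the middle of the sorted data
  mid  <- round(seq(n_tail + 1, n - n_tail, length.out = n_mid))
  ## All points in both tails plus the thinned middle
  keep <- unique(c(seq_len(n_tail), mid, (n - n_tail + 1):n))
  list(x = q[keep], y = x[keep])
}

set.seed(3)
big  <- rnorm(1e6)                 # simulated stand-in for a large data set
thin <- thin_qq(big)               # 3000 points instead of a million
plot(thin, xlab = "Theoretical quantiles", ylab = "Sample quantiles")
abline(0, 1)
```

Because the retained points keep their original plotting positions, the thinned plot traces the same curve as the full one, just with far fewer symbols.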

Berton Gunter wrote:

>May I take this off topic a little to seek collective wisdom (and so feel
>free to reply privately).
>
>The catalyst is Deepayan's remark:
>
>>Histograms were appropriate for drawing density estimates by 
>>hand in the  good old days, but I can imagine very few situations where I 
>>would not prefer to use smoother density estimates when I have the 
>>computational power to do so.
>>
>>Deepayan
>
>Generally, I agree; but the appearance and thus one's perception and
>interpretation of both histograms and density plots depend upon the
>parameters chosen for the display (bin boundaries for histograms; bandwidth
>and kernel for density plots). Important data peculiarities like arbitrary
>rounding, favoring of certain values, resolution limitations, and so forth
>are therefore often lost. I would instead advocate that simple quantile
>plots -- plot(ppoints(x),sort(x)) -- or perhaps normal qqplots always be the
>first plot used to explore univariate data distributions. I believe this
>conforms to Bill Cleveland's recommendations, who says in the first sentence
>on p. 17 of VISUALIZING DATA on visualizing univariate data: "Quantiles are
>essential to visualizing distributions."
>
>While it is true that many people may be unfamiliar with quantile plots, I
>think we need to improve modern statistical practice not only by abandoning
>histograms in favor of density plots, but also by always first using
>quantile plots and explaining why this is necessary.
>
>Difficult issue: What should one do when there are, say, a million
>values?
>
>Alternative views?
>
> 
>-- Bert Gunter
>Genentech Non-Clinical Statistics
>South San Francisco, CA
> 
>"The business of the statistician is to catalyze the scientific learning
>process."  - George E. P. Box