[R] Re-binning histogram data

Thu Jun 8 19:25:00 CEST 2006

Hi, Bert, Ted, et al.:

	  Do you use normal probability plots?  They are the best tool I know 
for identifying all kinds of nonnormality, including normal mixtures 
with either outliers or multimodality as well as skewness.  I've 
experimented with PP and nonnormal QQ plots, and not found them that 
useful.  I prefer to transform the data to apparent normality, because 
that seems to produce about the right amount of visual separation in the 
tails:  A PP plot provides very poor resolution of tail behavior, and 
the image in a QQ plot with longer than normal tails becomes for me so 
overwhelmed by random tail behavior that I'm unable to make sense of it.
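
	  A minimal sketch of the kind of comparison described above, in base R.
The data are simulated for illustration only (they are not from this thread), 
and the log transform is just one assumed choice of transformation to 
apparent normality:

```r
# Illustrative, simulated data; names such as x_mix and x_skew are made up here.
set.seed(1)
x_mix  <- c(rnorm(95), rnorm(5, mean = 6))  # normal bulk plus a small second mode
x_skew <- rlnorm(100)                       # right-skewed (lognormal) sample

op <- par(mfrow = c(1, 3))
qq_mix <- qqnorm(x_mix, main = "Normal mixture")
qqline(x_mix)
qq_raw <- qqnorm(x_skew, main = "Skewed, raw scale")
qqline(x_skew)
qq_log <- qqnorm(log(x_skew), main = "Skewed, log scale")
qqline(log(x_skew))
par(op)
```

The mixture should show up as a break in the line, the raw skewed sample as 
curvature, and the log-transformed sample as roughly straight.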

	  Also, could someone explain the rationale behind the "datax=FALSE" 
default?  I presume this default was established before research showed 
that humans have better judgment about vertical and horizontal lines 
than lines at other angles, and that 45 degree lines are more easily 
judged than lines at other angles.  This research led to "the 45 degree 
banking rule (see _Visualizing Data_ by William S. Cleveland for 
details)", mentioned on the help page for xyplot{lattice}.

	  In my experience, most normal probability plots come closer to 
meeting this "45 degree banking rule" when datax=TRUE than FALSE.  With 
a typical aspect ratio, normally distributed data will appear with an 
angle less than 45 degrees.  An outlier with the default datax=FALSE 
will reduce that angle further, making it harder to process visually.  By 
contrast, with datax=TRUE, an outlier increases the banking, moving it 
closer to (or even beyond) the 45 degree line that seems to facilitate 
the best human visual processing.
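
	  The effect of an outlier on the banking can be sketched numerically.  
The helper visual_angle() below is hypothetical (it is not part of R); it 
assumes the plot region has height/width ratio "aspect" and that the axes 
span exactly the range of the plotted points.  The data are simulated:

```r
# Hypothetical helper: approximate on-screen angle (in degrees) of the
# line qqline() would draw through the quartiles.
visual_angle <- function(x, datax, aspect = 1) {
  q <- qqnorm(x, datax = datax, plot.it = FALSE)  # q$x horizontal, q$y vertical
  # slope of data quartiles against normal quartiles
  slope_data <- diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75)))
  # with datax = TRUE the axes are swapped, so the slope is inverted
  slope <- if (datax) 1 / slope_data else slope_data
  # rescale from data units to fractions of the plot region
  screen_slope <- slope * diff(range(q$x)) / diff(range(q$y)) * aspect
  unname(atan(screen_slope) * 180 / pi)
}

set.seed(2)
x <- c(rnorm(50), 8)   # made-up sample with one gross outlier

angle_false <- visual_angle(x, datax = FALSE)  # outlier stretches the vertical axis
angle_true  <- visual_angle(x, datax = TRUE)   # outlier stretches the horizontal axis

op <- par(mfrow = c(1, 2))
qqnorm(x, datax = FALSE, main = "datax = FALSE"); qqline(x, datax = FALSE)
qqnorm(x, datax = TRUE,  main = "datax = TRUE");  qqline(x, datax = TRUE)
par(op)
```

In a square plot region the two angles are complementary, so the outlier 
that flattens the line under datax=FALSE steepens it under datax=TRUE.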

	  Beyond this, what do you think about combining a normal plot with 
either a histogram or a density estimate on the bottom?  With multiple 
lines on the normal probability plot, I've seen stacked-bar histograms 
on the bottom that seemed intelligible.  Would you suggest replacing 
stacked-bar histograms with overlapping plots of density estimates?  And 
how many observations would you need in each group for that to make sense?
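
	  One possible layout for that combination, as a rough sketch only: normal 
plots with datax=TRUE on top and overlapping density estimates underneath, 
sharing the data axis.  The two groups g1 and g2 are simulated, and the 
default density() bandwidth is assumed rather than recommended:

```r
set.seed(3)
g1 <- rnorm(60, mean = 0)   # two made-up groups
g2 <- rnorm(60, mean = 2)
xlim <- range(g1, g2)       # common data axis for both panels

op <- par(mfrow = c(2, 1), mar = c(4, 4, 1, 1))

# Top panel: normal probability plots with the data on the x-axis
q1 <- qqnorm(g1, datax = TRUE, plot.it = FALSE)
q2 <- qqnorm(g2, datax = TRUE, plot.it = FALSE)
plot(q1, xlim = xlim, ylim = range(q1$y, q2$y),
     xlab = "data", ylab = "normal scores")
points(q2, pch = 2)

# Bottom panel: overlapping density estimates on the same data axis
d1 <- density(g1)
d2 <- density(g2)
plot(d1, xlim = xlim, ylim = c(0, max(d1$y, d2$y)), main = "", xlab = "data")
lines(d2, lty = 2)

par(op)
```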

	  What do you think?
	  Best Wishes,
	  Spencer Graves

(Ted Harding) wrote:
> On 08-Jun-06 Berton Gunter wrote:
>> I would argue that histograms are outdated relics and that density
>> plots (whatever your favorite flavor is) should **always** be used
>> instead these days.
>>
>> In this vein, I would appreciate critical rejoinders (public or
>> private) to the following proposition: Given modern computer power
>> and software like R on multi-GHz machines, statistical and graphical
>> relics of the pre-computer era (like histograms, low resolution
>> printer-type plots, and perhaps even method of moments EMS
>> calculations) should be abandoned in favor of superior but perhaps
>> computation-intensive alternatives (like density plots, high
>> resolution plots, and likelihood or resampling or Bayes based methods).
> 
> While your head is above the parapet, Bert ...
> 
> Your general question could go in many directions, but there's a
> lot to be said for that point of view (as well as some against).
> 
> However, my short answer is that it's a matter of horses for courses.
> 
> In particular, where the histogram is concerned, it has a straightforward
> property that it exactly represents the information about the counts
> within the bin-ranges. While usually the bars are not labelled with
> count values, you can (and I quite often have, when it was the only
> way) recover the counts using a ruler graduated in millimetres. And
> at the same time it usually (if judiciously constructed) presents a
> good blockwise representation of the implied underlying continuous
> distribution.
> 
> A continuous density estimation may be a better and smoother (or
> at least more appealing) representation of the distribution (though
> you would need to be careful about local humps), but to recover the
> data from it would take a combination of optical scanning, image
> analysis software, and (if you don't know what smoothing method
> was used) heuristic algorithm-inference software. Well within
> your technological utopia, of course, but ...
> 
>> NB: Please -- no pleadings that new methods would be mystifying
>> to the non-cognoscenti. Following that to its logical conclusion
>> would mean that we'd all have to give up our TV remotes and cell
>> phones, and what kind of world would that be?! :-)
> 
> One day, let me show you how to use my wooden plough-share.
> 
> Best wishes,
> Ted.
> 
> PS Please bring your own horse.
> 
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 08-Jun-06                                       Time: 17:16:53
> ------------------------------ XFMail ------------------------------