[R] Drawing a histogram from a massive dataset

Tue Jul 19 15:43:04 CEST 2011

On Tue, Jul 19, 2011 at 12:30 AM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>>> [snip] I guess that I must have a data frame to plot a histogram.
>>>
>>> Not at all!
>>>
>>> ## a *vector* of 100 million observations
>>> x <- rnorm(10^8)
>>> ## a histogram for it (see attached for the result from my system)
>>> hist(x)
>>>
>>> No data frame required.  I would not try this straight in anything but
>>> traditional graphics for a 100 million observation vector, but if you
>>> wanted it made in ggplot2 or something, you could prebin the data and
>>> THEN plot bars corresponding to the bins.
>>
>> Thanks, Joshua, for your answer.
>>
>> True: A vector is enough to supply data for hist(). But my point is:
>> Can a histogram be drawn without having all data on the computer
>> memory? You partially answer this question by suggesting prebinning
>> the data. Can this prebinning be done transparently, chunk by chunk,
>> underneath?
>
> Sure, as long as you can figure out some basic details about the full
> dataset.  Just define your breaks, and then for chunks of the data at
> a time, count how many fall into any particular bin.  Once you are
> done, add up all the counts for each bin, and voila.
>
> ## Get these values from the full data (using SQL)
> x <- rnorm(1000)
> n <- length(x)
> minx <- min(x)
> maxx <- max(x)
>
> ## Sturges style breaks
> breaks <- pretty(c(minx, maxx), n = ceiling(log2(n) + 1))
> nB <- length(breaks)
>
> fuzz <- rep(1e-07 * median(diff(breaks)), nB)
> fuzz[1] <- fuzz[1] * -1
> fuzzybreaks <- breaks + fuzz
>
> chunks <- 10
>
> counts <- matrix(NA, nrow = chunks, ncol = nB - 1,
>  dimnames = list(paste("Sec", 1:chunks, sep = ''),
>    as.character(fuzzybreaks[-1])))
>
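> for(i in 1:chunks) {
>  ## positions of chunk i in x (assumes n is a multiple of chunks)
>  index <- seq(1, n/chunks) + (n/chunks * (i - 1))
>  ## plot = FALSE: only the counts are needed, so skip drawing each chunk
>  counts[i, ] <- hist(x[index], breaks = fuzzybreaks, plot = FALSE)$counts
> }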
>
> ## The heights of your bars
> colSums(counts)
> ## results using hist() on x all at once
> hist(x)$counts
>
> You would not even need to know the number of chunks you were going to
> split your data into beforehand; I just did it for convenience and to
> instantiate a full-sized matrix to hold the results.  If you are
> selecting subsets of your data using SQL rather than R, it becomes
> even simpler.  Once you have your fuzzybreaks, you just keep calling
> hist on your new data using the predefined breaks and saving the
> results.  Still, I use no more than about 4.5 GB of memory just to
> plot a histogram of a 100 million observation vector, and it is
> difficult to imagine the shape of the distribution changing
> appreciably if you plotted only a random sample of the 100 million
> observations.  It also takes less than 10 seconds to calculate and
> draw the histogram on my computer.  The point being, I suspect you
> will spend more time getting everything set up and working than seems
> worth it, because you can already create a histogram of such large
> vectors easily and quickly, and the distribution is unlikely to vary
> anyway.  Whatever floats your boat, though.

Thanks again, Joshua. Your approach is quite interesting.
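
For the archive, the variant without a fixed number of chunks might look
roughly like the sketch below. It keeps a single running vector of counts
instead of a chunks-by-bins matrix. The names make_reader() and
read_next_chunk() are hypothetical stand-ins for whatever fetches the next
block of rows (an SQL query, a file connection); here they just walk over a
simulated in-memory vector. It also assumes the overall minimum, maximum and
number of observations are known before the first chunk is read, and it
passes the plain breaks to hist() rather than the fuzzybreaks.

```r
## Sketch only: accumulate bin counts chunk by chunk into one running vector.
## Assumption: range and length of the full data are known up front
## (e.g. from an SQL query); here they come from a simulated vector.
set.seed(1)
x <- rnorm(1000)
breaks <- pretty(range(x), n = ceiling(log2(length(x)) + 1))

## Hypothetical chunk reader: returns the next `size` values, or NULL once
## the data are exhausted.
make_reader <- function(data, size) {
  pos <- 0
  function() {
    if (pos >= length(data)) return(NULL)
    idx <- seq(pos + 1, min(pos + size, length(data)))
    pos <<- max(idx)
    data[idx]
  }
}
read_next_chunk <- make_reader(x, size = 130)  # 130 does not divide 1000

## Running totals, one per bin; no per-chunk matrix is needed.
counts <- numeric(length(breaks) - 1)
while (!is.null(chunk <- read_next_chunk())) {
  counts <- counts + hist(chunk, breaks = breaks, plot = FALSE)$counts
}

## Same bar heights as hist() on the full vector with the same breaks
stopifnot(all(counts == hist(x, breaks = breaks, plot = FALSE)$counts))

## Draw the bars from the accumulated counts
barplot(counts, names.arg = breaks[-1], space = 0)
```

Because the breaks are fixed before the loop, the chunk size does not need
to divide the number of observations, and only one chunk is in memory at a
time.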

Paul