[R] Approximating discrete distribution by continuous distribution

Tue Jan 22 15:20:38 CET 2013

On Jan 22, 2013, at 13:45 , Prof Brian Ripley wrote:

> On 22/01/2013 11:49, Michael Haenlein wrote:
>> Dear all,
>> 
>> I have a discrete distribution showing how age is distributed across a
>> population using a certain set of bands:
>> 
>> Age <- matrix(c(74045062, 71978405, 122718362, 40489415), ncol=1,
>> dimnames=list(c("<18", "18-34", "35-64", "65+"),c()))
>> Age_dist <- Age/sum(Age)
>> 
>> For example I know that 23.94% of all people are between 0-18 years, 23.28%
>> between 18-34 years and so forth.
>> 
>> I would like to find a continuous approximation of this discrete
>> distribution in order to estimate the probability that a person is for
>> example 16 years old.
>> 
>> Is there some automatic way in R through which this can be done? I tried a
>> Kernel density estimation of the histogram but this does not seem to
>> provide what I'm looking for.
> 
> This is not really an R question, but a statistics one.  It is almost guesswork: if for example these were drivers in the UK, the answer is 0.  So you need to supply some information about the shape of the distribution of <18 year olds.
> 
> You have estimates of the cumulative distribution function at c(0, 18, 35, 65, Inf) (or some better upper limit).  You want to interpolate it.  You could use linear interpolation (approx[fun]) or a monotone spline interpolation (spline[fun]) or any other interpolation method which meets your needs.  But whatever you use, you will be supplying a lot of information not actually in your data.

Agreed. The linear interpolation method is sometimes described as the "sum polygon", and sort of assumes that there is a uniform distribution of ages in each range. I.e., the number of 16 year olds would be 1/18 of the 0-17 y.o. However, I'd feel somewhat uneasy about doing this with such wide age-bands.
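
A minimal sketch of the interpolation approach, for illustration only. The upper limit of 100 standing in for Inf is an assumption, as are the names breaks, cdf, F_lin and F_mono; method = "monoH.FC" is just one of the monotone options that splinefun() offers.

```r
# Band counts from the original post
Age <- c(74045062, 71978405, 122718362, 40489415)

# Cumulative distribution at the band edges.
# ASSUMPTION: 100 is used as a finite upper age limit in place of Inf.
breaks <- c(0, 18, 35, 65, 100)
cdf <- c(0, cumsum(Age) / sum(Age))

# "Sum polygon": linear interpolation of the CDF,
# i.e. ages are uniformly distributed within each band
F_lin <- approxfun(breaks, cdf)

# Monotone cubic interpolation (Fritsch-Carlson) as one alternative
F_mono <- splinefun(breaks, cdf, method = "monoH.FC")

# Estimated share of people aged 16, i.e. with age in [16, 17)
p16_lin <- F_lin(17) - F_lin(16)
p16_mono <- F_mono(17) - F_mono(16)
```

With linear interpolation, p16_lin is exactly 1/18 of the first band, about 1.33%. The spline gives a different number from the same four counts, which shows how much of the answer comes from the interpolation method rather than from the data.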

There is also the option of fitting a standard distribution like the Weibull to the data and using that. The mle() function should do this if you write out the log-likelihood using something like 

dmultinom(Age, prob=diff(pweibull(c(0,18,35,65,Inf), shape, scale)), log=TRUE)

With a quarter of a billion observations, the fit might be less than perfect, but on the other hand, extracting more than two parameters from four data points sounds a bit ominous.
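
One way the Weibull fit could be set up with mle() from the stats4 package is sketched below. Several choices here are assumptions rather than part of the suggestion above: the names nll, lshape and lscale, the log-scale parameterization (used to keep shape and scale positive), the starting values, and the Nelder-Mead optimizer (chosen to avoid finite-difference gradients on a likelihood this large).

```r
library(stats4)

# Band counts from the original post and the band edges
Age <- c(74045062, 71978405, 122718362, 40489415)
breaks <- c(0, 18, 35, 65, Inf)

# Negative multinomial log-likelihood of the band counts under a Weibull.
# ASSUMPTION: parameters are optimized on the log scale so that
# shape and scale stay positive.
nll <- function(lshape = 0, lscale = log(40)) {
  p <- diff(pweibull(breaks, shape = exp(lshape), scale = exp(lscale)))
  # Guard against degenerate band probabilities at extreme parameter values
  if (any(!is.finite(p)) || any(p <= 0)) return(Inf)
  -dmultinom(Age, prob = p, log = TRUE)
}

# ASSUMPTION: starting values (shape 1, scale 40) are a rough guess
fit <- mle(nll, start = list(lshape = 0, lscale = log(40)),
           method = "Nelder-Mead")

# Back-transform the estimates to the natural scale
est <- exp(coef(fit))
shape_hat <- est[["lshape"]]
scale_hat <- est[["lscale"]]

# Fitted band probabilities next to the observed proportions
fitted_p <- diff(pweibull(breaks, shape_hat, scale_hat))
observed_p <- Age / sum(Age)

# Estimated share of people aged 16, i.e. with age in [16, 17)
p16 <- diff(pweibull(c(16, 17), shape_hat, scale_hat))
```

Comparing fitted_p with observed_p gives a quick impression of how well a two-parameter Weibull can follow the four bands.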

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com