[R] Creating a vector of variable bin widths
d.s.robinson at dur.ac.uk
Thu Mar 1 17:16:58 CET 2007
Dear R users,
I am having a little trouble with grouping data.
-----------Detailed explanation (summary below)------------
A small sample of my data is below (which has already been rounded and
grouped a little from the raw data for clarity).
I am sampling data from an unknown game which, according to my null
hypothesis, follows a binomial distribution. The game can supposedly
be played with a range of probabilities of success (the independent
variable); 0.0-0.3 are shown below, although my full data set goes all
the way up to 0.99. The number of observations for each probability of
success, and the actual proportion of wins in the sample (the dependent
variable), are also shown.
By the CLT, the sample winning proportion (the dependent variable) should
be an unbiased estimator of the population proportion (the independent
variable). I want to perform a significance test at each probability
level to see if the null hypothesis can be rejected.
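As a minimal sketch of the test at a single level, an exact binomial test
with binom.test() from base R could look like the following. The win count
is an assumption: it is reconstructed from the rounded proportion in the
last row of the table (0.278 of 97 observations), so it may differ from the
raw data.

```r
# Single probability level: nominal p = 0.30, n = 97 observations.
# The number of wins is reconstructed from the rounded proportion (assumption).
n_obs  <- 97
p_null <- 0.30
wins   <- round(0.278 * n_obs)

# Exact two-sided binomial test of H0: true win probability equals p_null
res <- binom.test(wins, n_obs, p = p_null)
print(res$p.value)
print(res$conf.int)
```

A large p-value here would mean the null hypothesis is not rejected at this
level; the same call could be repeated for each level or each bin.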
But, the problem is in defining those probability levels. At the moment,
some probabilities of success have a very low number of observations,
whilst others have very many. Leaving the data as it is gives
statistically meaningless results at the low and high levels of success.
Further grouping the data using fixed group widths results in very few
data points at high and low probabilities, and a few data points in the
middle with a very high number of observations.
The way around this (I think) is to use variable bin widths. The width
of each bin should be wide enough so that (again, I think this is a
reasonable idea) the variance of the sample estimate (using the normal
approximation to the binomial), [p(1-p)]/n, is less than a certain
value, say 2% squared (0.02^2). I presume I also need to make sure that
for each group np>=5 and n(1-p)>=5, or can this simply replace the
variance test?
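The two conditions above can be sketched as a small check. The function name
bin_ok and its arguments are made up for illustration; the 0.02 bound and the
threshold of 5 are the values mentioned in the text.

```r
# Hypothetical helper: does a bin with success probability p and n
# observations satisfy (a) the variance bound p(1-p)/n <= max_se^2 and
# (b) the rule of thumb np >= 5 and n(1-p) >= 5?
bin_ok <- function(p, n, max_se = 0.02, min_expected = 5) {
  var_ok  <- p * (1 - p) / n <= max_se^2
  norm_ok <- n * p >= min_expected && n * (1 - p) >= min_expected
  c(variance = var_ok, normal_approx = norm_ok)
}

bin_ok(p = 0.30, n = 97)   # last row of the table
bin_ok(p = 0.30, n = 600)  # a larger, pooled bin
bin_ok(p = 0.01, n = 30)   # a very small probability
```

For small p the variance bound can hold while np is still well below 5, so
under these assumptions neither condition appears to replace the other.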
IndependentVar  Observations  DependentVar
------------------------------------------
0.01                       1         0.000
0.03                       5         0.000
0.04                      11         0.000
0.05                       9         0.000
0.06                      19         0.000
0.07                      12         0.000
0.08                      18         0.056
0.09                      10         0.200
0.10                      13         0.077
0.11                      17         0.118
0.12                      17         0.059
0.13                      18         0.056
0.14                      21         0.000
0.15                      25         0.160
0.16                      23         0.000
0.17                      35         0.314
0.18                      26         0.231
0.19                      31         0.226
0.20                      27         0.148
0.21                      26         0.462
0.22                      21         0.286
0.23                      29         0.207
0.24                      38         0.289
0.25                      38         0.132
0.26                      27         0.259
0.27                      52         0.308
0.28                      62         0.194
0.29                      82         0.232
0.30                      97         0.278
------------------Summary---------------------------
So, how can I write a function that creates a vector of variable break
values for, say, cut()? It should iteratively make the bins wider
until a condition based on the value to be binned (the probability of
success) and a second value (the number of observations) is met
(assuming you agree with my method of restricting the variance, the
rationale for which is outlined above).
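One possible shape for such a function is sketched below: a greedy pass that
keeps pooling adjacent probability levels until the pooled bin meets the
variance bound. The name variable_breaks, its arguments, the use of the
observation-weighted mean probability for a pooled bin, and the folding of
leftover levels into the last bin are all assumptions made for illustration,
not an established method.

```r
# Hypothetical sketch: greedy variable-width breaks for cut().
# p: strictly increasing probability levels; n: observations at each level.
# lower: left edge of the first bin (must lie below every p).
variable_breaks <- function(p, n, max_se = 0.02, lower = 0) {
  stopifnot(length(p) == length(n), all(diff(p) > 0), all(p > lower))
  breaks <- lower
  n_bin  <- 0   # observations pooled in the current bin
  np_bin <- 0   # running sum of n * p in the current bin
  for (i in seq_along(p)) {
    n_bin  <- n_bin + n[i]
    np_bin <- np_bin + n[i] * p[i]
    p_bar  <- np_bin / n_bin                 # weighted mean probability
    if (p_bar * (1 - p_bar) / n_bin <= max_se^2) {
      breaks <- c(breaks, p[i])              # close the bin at this level
      n_bin  <- 0
      np_bin <- 0
    }
  }
  # Levels left over at the top never met the bound: fold them into the
  # last bin (or make a single bin if none was ever closed).
  if (n_bin > 0) {
    if (length(breaks) > 1) {
      breaks[length(breaks)] <- p[length(p)]
    } else {
      breaks <- c(breaks, p[length(p)])
    }
  }
  breaks
}

# Last four rows of the table, with a looser bound so that bins can close
p <- c(0.27, 0.28, 0.29, 0.30)
n <- c(52, 62, 82, 97)
b <- variable_breaks(p, n, max_se = 0.05)
groups <- cut(p, breaks = b)   # intervals are (left, right]
print(b)
print(table(groups))
```

Because cut() uses right-closed intervals by default, each break is the
highest probability level included in its bin. The np>=5 condition could be
added to the if() test in the same way.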
I would appreciate any comments on either the reasoning (I am fairly new
to this sort of statistics) or how I can write the R code to achieve the
proposed goal. I hope I have explained this clearly enough to merit a
response.
Regards,
DR