[R] Binning numbers into integer-valued intervals (or: a version of cut or cut2 that makes sense)
Karl Ove Hufthammer
karl at huftis.org
Mon Jul 25 16:00:46 CEST 2011
Dear list members,
I’m looking for a way to divide numbers into simple (i.e., integer-valued)
intervals, and thought the ‘cut’ function in ‘base’ or the ‘cut2’ function
in ‘Hmisc’ would, er, cut it. However, they seem to give rather surprising
results.
Since I want the endpoints of the intervals to be integers, I used the
‘dig.lab’ and ‘digits’ arguments. One assumption I made: If the number x
gets the label (a, b], then x lies in the interval (a, b]. It turns out that
this assumption was incorrect. Example:
$ cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=1)
[1] (21,23] (21,23] (21,23] (23,25] (23,25]
Levels: (21,23] (23,25]
So the first number, 20.8, gets put in the interval (21,23], which seems
strange. I can see why this could happen, though, as perhaps the 20.8 is
rounded to 21 before binning. But it’s even stranger that the *integer* 23
is put in the interval (23,25] instead of in the interval (21,23]. Can
anyone explain why?
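One way to probe this is to repeat the call with more digits in the labels and to look at the interval index of each number directly. This is only a diagnostic sketch. It rests on my assumption that ‘dig.lab’ affects how the break points are printed and not where they actually lie; the variable names ‘x’ and ‘f’ are just for illustration.

```r
x <- c(20.8, 21.3, 21.7, 23, 25)

# Same binning as above, but print the break points with up to 4 digits
f <- cut(x, 2, dig.lab = 4)

# The labels now show the break points with more precision
print(levels(f))

# Interval index of each number, independent of how the labels are printed
print(as.integer(f))

# The grouping from the original call (dig.lab = 1), for comparison
print(as.integer(cut(x, 2, dig.lab = 1)))
```

If both calls give the same indices, the numbers are binned identically and only the labels differ.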
I then turned to ‘cut2’ in ‘Hmisc’. But again I was surprised by the result:
$ cut2(c(20.8, 21.3, 21.7, 23), g=2, digits=1)
[1] [21,22) [21,22) [22,23] [22,23]
Levels: [21,22) [22,23]
Again 20.8 is placed in an interval that doesn’t mathematically contain it.
And 21.3 and 21.7 are placed in *different* intervals, instead of both being
placed in the interval [21,22). Strictly speaking, this may not be a bug,
but it’s certainly surprising behaviour!
Since neither of the two functions does what I need, is there a different
function that does, hidden deep inside some R package?
This function should take as input a vector of numbers, and output a vector
of non-overlapping (but ‘touching’) intervals with integer end-points so
that each number is in exactly one interval. It should of course also
include information on which interval each number belongs to.
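To make the requirement concrete, here is a minimal sketch of the behaviour I have in mind. The function name ‘cut_int’ and its ‘width’ argument are hypothetical (they are not from ‘base’ or ‘Hmisc’), and the choice of left-closed intervals [a, b) starting at floor(min(x)) is just one possible convention.

```r
# Hypothetical helper (not part of base R or Hmisc): bin numbers into
# touching, non-overlapping intervals [a, b) with integer end-points.
cut_int <- function(x, width = 1) {
  stopifnot(is.numeric(x), !anyNA(x), width >= 1, width == round(width))
  lo <- floor(min(x))
  # Number of bins needed so that the last break lies strictly above max(x),
  # since the intervals are open on the right
  n <- floor((max(x) - lo) / width) + 1
  breaks <- lo + width * seq(0, n)
  # Explicit integer breaks; dig.lab is large so the labels are printed in full
  cut(x, breaks = breaks, right = FALSE, dig.lab = 12)
}

res <- cut_int(c(20.8, 21.3, 21.7, 23, 25), width = 2)
print(res)
```

With these inputs the breaks are 20, 22, 24 and 26, so 20.8, 21.3 and 21.7 all fall in [20,22), 23 falls in [22,24), and 25 falls in [24,26). Every number lies in the interval its label names.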
Version information (though I also observe this on R 2.13.1 on Windows):
$ sessionInfo()
R version 2.13.1 Patched (2011-07-25 r56494)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=nn_NO.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nn_NO.UTF-8 LC_COLLATE=nn_NO.UTF-8
[5] LC_MONETARY=C LC_MESSAGES=nn_NO.UTF-8
[7] LC_PAPER=nn_NO.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] splines stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Hmisc_3.8-3 survival_2.36-9
loaded via a namespace (and not attached):
[1] cluster_1.14.0 grid_2.13.1 lattice_0.19-30
--
Karl Ove Hufthammer