[R] Binning numbers into integer-valued intervals (or: a version of cut or cut2 that makes sense)

William Dunlap wdunlap at tibco.com
Mon Jul 25 17:29:42 CEST 2011


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Karl Ove
> Hufthammer
> Sent: Monday, July 25, 2011 7:01 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Binning numbers into integer-valued intervals (or: a version of cut or cut2 that makes
> sense)
> 
> Dear list members,
> 
> I’m looking for a way to divide numbers into simple (i.e., integer-valued)
> intervals, and thought the ‘cut’ function in ‘base’ or the ‘cut2’ function
> in ‘Hmisc’ would, er, cut it. However, they seem to give rather surprising
> results.
> 
> Since I want the endpoints of the intervals to be integers, I used the
> ‘dig.lab’ and ‘digits’ arguments. One assumption I made: If the number x
> gets the label (a, b], then x lies in the interval (a, b]. It turns out that
> this assumption was incorrect. Example:
> 
> $ cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=1)
> [1] (21,23] (21,23] (21,23] (23,25] (23,25]
> Levels: (21,23] (23,25]
> 
> So the first number, 20.8, gets put in the interval (21,23], which seems
> strange. I can see why this could happen, though, as perhaps the 20.8 is
> rounded to 21 before binning. But it’s even stranger that the *integer* 23
> is put in the interval (23,25] instead of in the interval (21,23]. Can
> anyone explain why?

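dig.lab does not affect the choice of breakpoints; it only
affects how they are converted to character form for the labels.
With breaks=2, cut() splits the range of x (20.8 to 25) into two
equal-width bins, so the middle breakpoint is 22.9, not 23. That
is why 23 lands in the upper bin, whose label is merely rounded
to "(23,25]". Unfortunately, cut() does not return the actual
breakpoints, but if you supply them yourself you know what they are.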
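
One way to check that dig.lab changes only the labels and not the
binning is to compare the integer codes of the factors that cut()
returns for two different dig.lab values. This is just a small
sketch using the data from the question; f1 and f4 are names made
up for the illustration.

```r
# Same data and number of bins as in the question.
x <- c(20.8, 21.3, 21.7, 23, 25)

f1 <- cut(x, 2, dig.lab = 1)   # coarse labels, as in the question
f4 <- cut(x, 2, dig.lab = 4)   # more significant digits in the labels

# The bin each value falls into is the same in both cases ...
identical(as.integer(f1), as.integer(f4))

# ... only the text of the labels differs.
levels(f1)
levels(f4)
```

Printing levels(f4) should show endpoints much closer to the
breakpoints that cut() really used.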
 
You need to find or make a function akin to pretty() that returns
a "nice" set of breakpoints.  pretty() itself may do:
  > x <- c(20.8, 21.3, 21.7, 23, 25)
  > pretty(x, n=2)
  [1] 20 22 24 26
  > cut(x, breaks=pretty(x, n=2))
  [1] (20,22] (20,22] (20,22] (22,24] (24,26]
  Levels: (20,22] (22,24] (24,26]
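
If pretty() does not pick the breakpoints you want, you could
also build integer breakpoints yourself with floor(), ceiling()
and seq(). The following is only a sketch: int_cut() and its
width argument are hypothetical names (not in base R or Hmisc),
and it assumes x has at least one non-missing value and that
width is a positive whole number. It uses right=FALSE so the
intervals come out as [a,b), which is an arbitrary choice.

```r
# Hypothetical helper: bin x into intervals with integer endpoints.
int_cut <- function(x, width = 1) {
  lo <- floor(min(x, na.rm = TRUE))
  hi <- ceiling(max(x, na.rm = TRUE))
  # Stretch the upper end so that (hi - lo) is a multiple of width.
  hi <- lo + width * ceiling((hi - lo) / width)
  # Guard against a zero-length range (all values equal to one integer).
  if (hi == lo) hi <- lo + width
  breaks <- seq(lo, hi, by = width)
  # include.lowest = TRUE closes the last interval, so a maximum that
  # sits exactly on the top breakpoint is still binned.
  # dig.lab = 10 keeps large integer endpoints out of scientific notation.
  cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE,
      dig.lab = 10)
}

x <- c(20.8, 21.3, 21.7, 23, 25)
f <- int_cut(x, width = 2)
f           # the interval each value belongs to
levels(f)   # all intervals, in order
```

Because the breakpoints are integers chosen before calling cut(),
each label describes exactly the interval that was used for binning.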

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> 
> I then turned to ‘cut2’ in ‘Hmisc’. But again I was surprised by the result:
> 
> $ cut2(c(20.8, 21.3, 21.7, 23), g=2, digits=1)
> [1] [21,22) [21,22) [22,23] [22,23]
> Levels: [21,22) [22,23]
> 
> Again 20.8 is placed in an interval that doesn’t mathematically contain it.
> And 21.3 and 21.7 are placed in *different* intervals, instead of both being
> placed in the interval [21,22). Strictly speaking, this may not be a bug, but
> it’s certainly surprising behaviour!
> 
> Since neither of the two functions does what I require, is
> there a different function that does, hidden deep inside some R package?
> This function should take as input a vector of numbers, and output a vector
> of non-overlapping (but ‘touching’) intervals with integer end-points so
> that each number is in exactly one interval. It should of course also
> include information on which interval each number belongs to.
> 
> Version information (though I also observe this on R 2.13.1 on Windows):
> 
> $ sessionInfo()
> R version 2.13.1 Patched (2011-07-25 r56494)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
>  [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=nn_NO.UTF-8        LC_COLLATE=nn_NO.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
>  [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
> 
> other attached packages:
> [1] Hmisc_3.8-3     survival_2.36-9
> 
> loaded via a namespace (and not attached):
> [1] cluster_1.14.0  grid_2.13.1     lattice_0.19-30
> 
> --
> Karl Ove Hufthammer
> 
