[R] Binning question (binning rows of a data.frame according to a variable)

Adaikalavan Ramasamy ramasamy at cancer.org.uk
Mon Mar 20 19:01:03 CET 2006


Lets say there are 10 students in the first group and denote x1 as (say)
the number of red balls for student 1 and s1 the total balls. Then I was
calculating the average the proportion ( x1/s1 + x2/s2 + ... + x10/s10 )
and you were calculating the average number of events (x1+x2
+...+x10)/(s1+s2+...+s10).

On second thoughts I think it is much better to calculate the a weighted
average of the proportions. The weights should reflect the variance of
the estimate of the proportions.



( w1*x1/s1 + w2*x2/s2 + ... + w10*x10/s10 )



On Mon, 2006-03-20 at 15:11 +0000, Dan Bolser wrote:
> Adaikalavan Ramasamy wrote:
> > Are you saying that your data might look like this ?
> > 
> >  set.seed(1)  # For reproducibility only - remove this
> >  mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1),
> >                      nred=rpois(100, lambda=10), 
> >                      nblue=rpois(100, lambda=5), 
> >                      ngreen=rpois(100, lambda=15) )
> >  mydf$total <- rowSums( mydf[ , c("nred", "nblue", "ngreen")] )
> > 
> >  head(mydf)
> >     age nred nblue ngreen total
> >  1 20.9   11     7     15    33
> >  2 27.3    8     2     18    28
> >  3 39.4   11     4      8    23
> >  4 59.5    6     5      8    19
> >  5 17.1   10     3     16    29
> >  6 58.9   11     5     14    30
> > 
> > 
> > If so, then try this :
> > 
> >  mydf          <- mydf[order(mydf$age), ]  ## re-order by age
> >  mydf$cumtotal <- cumsum(mydf$total)       ## cummulative total
> > 
> >  brk.pts       <- seq(from=0, to=sum(mydf$total), len=9)
> >  mydf$grp      <- cut( mydf$cumtotal , brk.pts, labels=F )
> > 
> >      age nred nblue ngreen total cumtotal grp
> >  27  5.8    9     5      8    22       22   1
> >  47  6.4    6     5     13    24       46   1
> >  92  8.5    8     4     18    30       76   1
> >  10  8.7   12     5      8    25      101   1
> >  55  9.2   10     7     13    30      131   1
> >  69 10.1    9     3     18    30      161   1
> > 
> > 
> > So here your 'grp' column is what you really want. Just to check 
> > 
> >  tapply( mydf$total, mydf$grp, sum )
> >    1   2   3   4   5   6   7   8 
> >  352 363 372 387 358 377 377 370 
> > 
> >  sapply( tapply( mydf$age, mydf$grp, range ), c )
> >          1    2    3    4    5    6    7    8
> >  [1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
> >  [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5
> > 
> > The last command says that your youngest student in group 1 is aged 5.8
> > and oldest is aged 16.2.
> > 
> > 
> > Taking this one step further, you can calculate the proportion of the
> > red, green and blue for each of the 8 groups.
> > 
> >  props <- mydf[ , c("nred", "nblue", "ngreen")]/mydf$total # proportions
> >  apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
> >         nred     nblue    ngreen
> >  1 0.3459898 0.1776441 0.4763661
> >  2 0.3280712 0.1730796 0.4988492
> >  3 0.3061429 0.1748149 0.5190422
> >  4 0.3759380 0.2084694 0.4155926
> >  5 0.3548805 0.1587353 0.4863842
> >  6 0.3106835 0.1829349 0.5063816
> >  7 0.3525933 0.1599737 0.4874330
> >  8 0.3133796 0.1795567 0.5070637
> > 
> > Hope this of some use.
> 
> Yes, this is very useful! I have just one remaining question, above you 
> take the mean of the group proportion...
> 
> apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
> 
> 
> instead of explicitly recalculating the proportion for the group (what I 
> couldn't script real good) ...
> 
> rbind(
>        colSums(mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")])/
>           sum (mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")]),
>        ...
>        colSums(mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")])/
>           sum (mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")])
>       )
> 
> 
> Giving (from the same seed)...
> 
>            nred     nblue    ngreen
> [1,] 0.3465909 0.1704545 0.4829545
> [2,] 0.3250689 0.1735537 0.5013774
> [3,] 0.3064516 0.1774194 0.5161290
> [4,] 0.3746770 0.2067183 0.4186047
> [5,] 0.3519553 0.1564246 0.4916201
> [6,] 0.3103448 0.1830239 0.5066313
> [7,] 0.3501326 0.1644562 0.4854111
> [8,] 0.3081081 0.1837838 0.5081081
> 
> 
> Which is *slightly* different from the 'mean' approach.
> 
>  > round(former-latter,4)
>       nred   nblue  ngreen
> 1 -0.0006  0.0072 -0.0066
> 2  0.0030 -0.0005 -0.0025
> 3 -0.0003 -0.0026  0.0029
> 4  0.0013  0.0018 -0.0030
> 5  0.0029  0.0023 -0.0052
> 6  0.0003 -0.0001 -0.0002
> 7  0.0025 -0.0045  0.0020
> 8  0.0053 -0.0042 -0.0010
> 
> 
> I know this less a question about R, and more a question about general 
> stats, but why did you choose the former and not the latter method? Is 
> one wrong and one right? Or did the former better fit the situation as 
> described?
> 
> Thanks for any insight into your decision, as this is something that has 
> always puzzled me.
> 
> Thanks for the beautifully clear examples!
> 
> 
> Dan.
> 
> > 
> > Regards, Adai
> > 
> > 
> > 
> > On Sun, 2006-03-19 at 18:58 +0000, Dan Bolser wrote:
> > 
> >>Adaikalavan Ramasamy wrote:
> >>
> >>>Do you by any chance want to sample from each group equally to get an
> >>>equal representation matrix ? 
> >>
> >>No.
> >>
> >>I want to make groups of equal sizes, where size isn't simply number of 
> >>rows (allowing a simple 'gl'), but a sum of the variable.
> >>
> >>Thanks for the code though, it looks useful.
> >>
> >>
> >>
> >>Here is an analogy for what I want to do (in case it helps).
> >>
> >>A group of students have some bags of marbles - The marbles have 
> >>different colours. Each student has one bag, but can have between 5 and 
> >>50 marbles per bag with any given strange distribution you like. I line 
> >>the students up by age, and want to see if there is any systematic 
> >>difference between the number of each color of marble by age (older 
> >>students may find primary colours less 'cool').
> >>
> >>Because the statistics of each individual student are bad (like the 
> >>proportion of each color per student -- has a high variance) I first put 
> >>all the students into 8 groups (for example).
> >>
> >>Thing is, for one reason or another, the number of marbles per bag may 
> >>systematically vary with age too. However, I am not interested in the 
> >>number of marbles per bag, so I would like to group the students into 8 
> >>groups such that each group has the same total number of marbles. (Each 
> >>group having a different sized age range, none the less ordered by age).
> >>
> >>Then I can look at the proportion (or count) of colours in each group, 
> >>and I can compare the groups or any trend accross the groups.
> >>
> >>Does that make sense?
> >>
> >>Cheers,
> >>Dan.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>>Here is an example of the input :
> >>>
> >>> mydf <- data.frame( value=1:100, value2=rnorm(100),
> >>>                     grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
> >>>
> >>>which has 35 observations from A, 15 from B, 30 from C and 20 from D.
> >>>
> >>>
> >>>And here is a function that I wrote:
> >>>
> >>> sample.by.group <- function(df, grp, k, replace=FALSE){
> >>>
> >>>   if(length(k)==1){ k <- rep(k, length(unique(grp))) }
> >>>    
> >>>   if(!replace && any(k > table(grp)))
> >>>     stop( paste("Cannot take a sample larger than the population when
> >>>     'replace = FALSE'.\n", "Please specify a value greater than",
> >>>     min(table(grp)), "or use 'replace = TRUE'.\n") )
> >>>
> >>>  
> >>>   ind   <- model.matrix( ~ -1 + grp )
> >>>   w.mat <- list(NULL)
> >>>   
> >>>   for(i in 1:ncol(ind)){
> >>>     w.mat[[i]] <- sample( which( ind[,i]==1 ), k[i], replace=replace )
> >>>   }
> >>>  
> >>>   out <- df[ unlist(w.mat), ]
> >>>   return(out)
> >>> }
> >>>
> >>>
> >>>And here are some examples of how to use it :
> >>> 
> >>>mydf <- mydf[ sample(1:nrow(mydf)), ]   # scramble it for fun
> >>>
> >>>
> >>>out1 <- sample.by.group(mydf, mydf$grp, k=10 )
> >>>table( out1$grp )
> >>>
> >>> out2 <- sample.by.group(mydf, mydf$grp, k=50, replace=T) # ie bootstrap
> >>> table( out2$grp )
> >>>
> >>>and you can even do bootstrapping or sampling with weights via:
> >>>
> >>> out3 <- sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=T)
> >>> table( out3$grp )
> >>>
> >>>
> >>>Regards, Adai
> >>>
> >>>
> >>>
> >>>On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote:
> >>>
> >>>
> >>>>Hi,
> >>>>
> >>>>I have tuples of data in rows of a data.frame, each column is a variable 
> >>>>for the 'items' (one per row).
> >>>>
> >>>>One of the variables is the 'size' of the item (row).
> >>>>
> >>>>I would like to cut my data.frame into groups such that each group has 
> >>>>the same *total size*. So, assuming that we order by size, some groups 
> >>>>should have several small items while other groups have a few large 
> >>>>items. All the groups should have approximately the same total size.
> >>>>
> >>>>I have tried various combinations of cut, quantile, and ecdf, and I just 
> >>>>can't work out how to do this!
> >>>>
> >>>>Any help is greatly appreciated!
> >>>>
> >>>>All the best,
> >>>>Dan.
> >>>>
> >>>>______________________________________________
> >>>>R-help at stat.math.ethz.ch mailing list
> >>>>https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> >>>>
> >>>
> >>>
> >>
> > 
> 
>




More information about the R-help mailing list