[R] Binning question (binning rows of a data.frame according to a variable)
Adaikalavan Ramasamy
ramasamy at cancer.org.uk
Mon Mar 20 19:01:03 CET 2006
Let's say there are 10 students in the first group, and denote by x1 the
number of red balls for student 1 and by s1 that student's total number
of balls. Then I was calculating the average of the proportions,
( x1/s1 + x2/s2 + ... + x10/s10 )/10, and you were calculating the pooled
proportion, (x1+x2+...+x10)/(s1+s2+...+s10).

On second thoughts, I think it is much better to calculate a weighted
average of the proportions. The weights (which sum to one) should reflect
the variance of the estimate of each proportion, with less variable
estimates getting more weight:

( w1*x1/s1 + w2*x2/s2 + ... + w10*x10/s10 )
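
A minimal sketch in R of the three estimators for a single group. The
counts x and s below are made up for illustration and are not from the
data in this thread. It assumes each x_i is binomial with a common
proportion p, so that the variance of x_i/s_i is p(1-p)/s_i and the
inverse-variance weights are proportional to s_i. Under that assumption
the weighted average reduces to the pooled proportion.

```r
# Hypothetical counts for 10 students: x = red balls, s = total balls
x <- c(3, 8, 5, 12, 7, 2, 9, 4, 6, 10)
s <- c(10, 25, 12, 40, 20, 5, 30, 15, 18, 28)

mean_of_props <- mean(x / s)       # unweighted average of the proportions
pooled_prop   <- sum(x) / sum(s)   # total events over total balls

# Assumed weights: inverse variance under a common binomial proportion,
# i.e. proportional to s_i and normalised to sum to one
w <- s / sum(s)
weighted_prop <- sum(w * x / s)

c(mean = mean_of_props, pooled = pooled_prop, weighted = weighted_prop)
all.equal(weighted_prop, pooled_prop)  # the s_i-weights give the pooled estimate
```

If the proportions are believed to differ between students, other weights
may be more appropriate, and the two estimates will no longer coincide.
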
On Mon, 2006-03-20 at 15:11 +0000, Dan Bolser wrote:
> Adaikalavan Ramasamy wrote:
> > Are you saying that your data might look like this ?
> >
> > set.seed(1) # For reproducibility only - remove this
> > mydf <- data.frame( age=round(runif(100, min=5, max=65), digits=1),
> > nred=rpois(100, lambda=10),
> > nblue=rpois(100, lambda=5),
> > ngreen=rpois(100, lambda=15) )
> > mydf$total <- rowSums( mydf[ , c("nred", "nblue", "ngreen")] )
> >
> > head(mydf)
> >    age nred nblue ngreen total
> > 1 20.9   11     7     15    33
> > 2 27.3    8     2     18    28
> > 3 39.4   11     4      8    23
> > 4 59.5    6     5      8    19
> > 5 17.1   10     3     16    29
> > 6 58.9   11     5     14    30
> >
> >
> > If so, then try this :
> >
> > mydf <- mydf[order(mydf$age), ] ## re-order by age
> > mydf$cumtotal <- cumsum(mydf$total) ## cumulative total
> >
> > ## 9 equally spaced break points define 8 groups of equal total
> > brk.pts <- seq(from=0, to=sum(mydf$total), length.out=9)
> > mydf$grp <- cut( mydf$cumtotal , brk.pts, labels=FALSE )
> >
> > head(mydf)
> >     age nred nblue ngreen total cumtotal grp
> > 27  5.8    9     5      8    22       22   1
> > 47  6.4    6     5     13    24       46   1
> > 92  8.5    8     4     18    30       76   1
> > 10  8.7   12     5      8    25      101   1
> > 55  9.2   10     7     13    30      131   1
> > 69 10.1    9     3     18    30      161   1
> >
> >
> > So here your 'grp' column is what you really want. Just to check
> >
> > tapply( mydf$total, mydf$grp, sum )
> >   1   2   3   4   5   6   7   8
> > 352 363 372 387 358 377 377 370
> >
> > sapply( tapply( mydf$age, mydf$grp, range ), c )
> >         1    2    3    4    5    6    7    8
> > [1,]  5.8 17.1 24.5 29.0 34.6 44.6 51.2 56.7
> > [2,] 16.2 24.0 28.4 33.9 44.1 51.0 55.4 64.5
> >
> > The last command says that your youngest student in group 1 is aged 5.8
> > and oldest is aged 16.2.
> >
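
The cumsum/cut steps above can be wrapped into a small helper. This is
only a sketch: equal.sum.groups and its arguments are hypothetical names,
not from any package, and it assumes every size is strictly positive (a
leading zero would fall outside the first break and give NA). The data
are simulated stand-ins.

```r
# Hypothetical helper: assign rows to n.grp groups of roughly equal total
# 'size', with groups formed along the ordering of 'key'.
equal.sum.groups <- function(size, key, n.grp = 8) {
  ord <- order(key)                      # positions sorted by key
  cum <- cumsum(size[ord])               # running total in key order
  brk <- seq(from = 0, to = cum[length(cum)], length.out = n.grp + 1)
  grp <- integer(length(size))
  grp[ord] <- cut(cum, brk, labels = FALSE)  # map back to original row order
  grp
}

set.seed(1)  # for reproducibility only
age   <- round(runif(100, min = 5, max = 65), digits = 1)
total <- rpois(100, lambda = 30)
grp   <- equal.sum.groups(total, age, n.grp = 8)
tapply(total, grp, sum)   # group totals should be close to sum(total) / 8
```

Because cut() uses right-closed intervals, a row goes to the group in
which its cumulative total lands, so the group totals are only
approximately equal.
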
> >
> > Taking this one step further, you can calculate the proportion of the
> > red, green and blue for each of the 8 groups.
> >
> > props <- mydf[ , c("nred", "nblue", "ngreen")]/mydf$total # proportions
> > apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
> >        nred     nblue    ngreen
> > 1 0.3459898 0.1776441 0.4763661
> > 2 0.3280712 0.1730796 0.4988492
> > 3 0.3061429 0.1748149 0.5190422
> > 4 0.3759380 0.2084694 0.4155926
> > 5 0.3548805 0.1587353 0.4863842
> > 6 0.3106835 0.1829349 0.5063816
> > 7 0.3525933 0.1599737 0.4874330
> > 8 0.3133796 0.1795567 0.5070637
> >
> > Hope this is of some use.
>
> Yes, this is very useful! I have just one remaining question, above you
> take the mean of the group proportion...
>
> apply( props, 2, function(v) tapply( v, mydf$grp, mean ) )
>
>
> instead of explicitly recalculating the proportion for the group (what I
> couldn't script real good) ...
>
> rbind(
> colSums(mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")])/
> sum (mydf[ mydf$grp==1, c("nred", "nblue", "ngreen")]),
> ...
> colSums(mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")])/
> sum (mydf[ mydf$grp==8, c("nred", "nblue", "ngreen")])
> )
>
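
The explicit per-group recalculation above can be scripted without
writing out each group. A possible sketch using rowsum() from base R; the
data frame below is a small stand-in with the same column names, and its
grp values are made up.

```r
# Stand-in data with the same column names as in the thread
set.seed(1)  # for reproducibility only
mydf <- data.frame(nred = rpois(40, 10), nblue = rpois(40, 5),
                   ngreen = rpois(40, 15), grp = rep(1:8, each = 5))

cols   <- c("nred", "nblue", "ngreen")
counts <- rowsum(mydf[, cols], group = mydf$grp)  # per-group colour totals
pooled <- counts / rowSums(counts)                # each row now sums to 1
pooled
```

Each row of pooled is colSums() over one group divided by that group's
grand total, which is the same quantity as in the rbind() construction.
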
>
> Giving (from the same seed)...
>
>           nred     nblue    ngreen
> [1,] 0.3465909 0.1704545 0.4829545
> [2,] 0.3250689 0.1735537 0.5013774
> [3,] 0.3064516 0.1774194 0.5161290
> [4,] 0.3746770 0.2067183 0.4186047
> [5,] 0.3519553 0.1564246 0.4916201
> [6,] 0.3103448 0.1830239 0.5066313
> [7,] 0.3501326 0.1644562 0.4854111
> [8,] 0.3081081 0.1837838 0.5081081
>
>
> Which is *slightly* different from the 'mean' approach.
>
> > round(former-latter,4)
>      nred   nblue  ngreen
> 1 -0.0006  0.0072 -0.0066
> 2  0.0030 -0.0005 -0.0025
> 3 -0.0003 -0.0026  0.0029
> 4  0.0013  0.0018 -0.0030
> 5  0.0029  0.0023 -0.0052
> 6  0.0003 -0.0001 -0.0002
> 7  0.0025 -0.0045  0.0020
> 8  0.0053 -0.0042 -0.0010
>
>
> I know this is less a question about R and more a question about general
> stats, but why did you choose the former and not the latter method? Is
> one wrong and one right? Or did the former better fit the situation as
> described?
>
> Thanks for any insight into your decision, as this is something that has
> always puzzled me.
>
> Thanks for the beautifully clear examples!
>
>
> Dan.
>
> >
> > Regards, Adai
> >
> >
> >
> > On Sun, 2006-03-19 at 18:58 +0000, Dan Bolser wrote:
> >
> >>Adaikalavan Ramasamy wrote:
> >>
> >>>Do you by any chance want to sample from each group equally to get an
> >>>equal representation matrix ?
> >>
> >>No.
> >>
> >>I want to make groups of equal sizes, where size isn't simply number of
> >>rows (allowing a simple 'gl'), but a sum of the variable.
> >>
> >>Thanks for the code though, it looks useful.
> >>
> >>
> >>
> >>Here is an analogy for what I want to do (in case it helps).
> >>
> >>A group of students have some bags of marbles - The marbles have
> >>different colours. Each student has one bag, but can have between 5 and
> >>50 marbles per bag with any given strange distribution you like. I line
> >>the students up by age, and want to see if there is any systematic
> >>difference between the number of each color of marble by age (older
> >>students may find primary colours less 'cool').
> >>
> >>Because the statistics of each individual student are bad (like the
> >>proportion of each color per student -- has a high variance) I first put
> >>all the students into 8 groups (for example).
> >>
> >>Thing is, for one reason or another, the number of marbles per bag may
> >>systematically vary with age too. However, I am not interested in the
> >>number of marbles per bag, so I would like to group the students into 8
> >>groups such that each group has the same total number of marbles. (Each
> >>group having a different sized age range, none the less ordered by age).
> >>
> >>Then I can look at the proportion (or count) of colours in each group,
> >>and I can compare the groups or any trend across the groups.
> >>
> >>Does that make sense?
> >>
> >>Cheers,
> >>Dan.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>>Here is an example of the input :
> >>>
> >>> mydf <- data.frame( value=1:100, value2=rnorm(100),
> >>> grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )
> >>>
> >>>which has 35 observations from A, 15 from B, 30 from C and 20 from D.
> >>>
> >>>
> >>>And here is a function that I wrote:
> >>>
> >>> sample.by.group <- function(df, grp, k, replace=FALSE){
> >>>
> >>>   # recycle a single k so that every group gets the same sample size
> >>>   if(length(k)==1){ k <- rep(k, length(unique(grp))) }
> >>>
> >>>   if(!replace && any(k > table(grp)))
> >>>     stop("Cannot take a sample larger than the population when ",
> >>>          "'replace = FALSE'.\n", "Please specify a value no greater than ",
> >>>          min(table(grp)), " or use 'replace = TRUE'.\n")
> >>>
> >>>
> >>>   ind <- model.matrix( ~ -1 + grp ) # one indicator column per group
> >>>   w.mat <- vector("list", ncol(ind))
> >>>
> >>>   for(i in seq_len(ncol(ind))){
> >>>     w.mat[[i]] <- sample( which( ind[,i]==1 ), k[i], replace=replace )
> >>>   }
> >>>
> >>>   out <- df[ unlist(w.mat), ]
> >>>   return(out)
> >>> }
> >>>
> >>>
> >>>And here are some examples of how to use it :
> >>>
> >>>mydf <- mydf[ sample(1:nrow(mydf)), ] # scramble it for fun
> >>>
> >>>
> >>>out1 <- sample.by.group(mydf, mydf$grp, k=10 )
> >>>table( out1$grp )
> >>>
> >>> out2 <- sample.by.group(mydf, mydf$grp, k=50, replace=TRUE) # ie bootstrap
> >>> table( out2$grp )
> >>>
> >>>and you can even do bootstrapping or sampling with weights via:
> >>>
> >>> out3 <- sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=TRUE)
> >>> table( out3$grp )
> >>>
> >>>
> >>>Regards, Adai
> >>>
> >>>
> >>>
> >>>On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote:
> >>>
> >>>
> >>>>Hi,
> >>>>
> >>>>I have tuples of data in rows of a data.frame, each column is a variable
> >>>>for the 'items' (one per row).
> >>>>
> >>>>One of the variables is the 'size' of the item (row).
> >>>>
> >>>>I would like to cut my data.frame into groups such that each group has
> >>>>the same *total size*. So, assuming that we order by size, some groups
> >>>>should have several small items while other groups have a few large
> >>>>items. All the groups should have approximately the same total size.
> >>>>
> >>>>I have tried various combinations of cut, quantile, and ecdf, and I just
> >>>>can't work out how to do this!
> >>>>
> >>>>Any help is greatly appreciated!
> >>>>
> >>>>All the best,
> >>>>Dan.
> >>>>
> >>>>______________________________________________
> >>>>R-help at stat.math.ethz.ch mailing list
> >>>>https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> >>>>
> >>>
> >>>
> >>
> >
>
>