[R] Sampling a dataframe based on the length of a subset of observations within

Thu Jul 9 03:19:23 CEST 2009

Thank you in advance for your consideration.

I have a dataframe of 2000+ observations with repeated measures across 
approximately 300 unique individuals.  An event either does or does not 
happen (1,0) and there is a suite of independent variables associated 
with the event.  A simplified representation follows:

my.df<-data.frame("id"=c("A","A","A","B","B","C","C","C", "C", "C"), 
event=c(0,0,1,0,1,0,0,1,1, 0))

_id_  _event_
A     0
A     0
A     1
B     0
B     1
C     0
C     0
C     1
C     1
C     0

I need to sample my.df to select the same number of observations with 
event = 0 as event = 1 for each unique id.
I can reshape or tapply my.df to group by id and determine what sample 
size I need, which gives my.df.cast:

library(reshape)
my.df.melt<-melt(my.df, id="id")
my.df.cast<-cast(my.df.melt, id~value, length, fill=0)
my.df.cast

       Event
_id_      _0_   _1_
A     2     *1*
B     1     *1*
C     3     *2*
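
The same per-id counts can probably be obtained in base R without the 
reshape package. The following is only a minimal sketch on the toy 
my.df from above; the names counts and n.events are made up for 
illustration and are not part of my real data.

```r
# Toy data from the post
my.df <- data.frame(id = c("A","A","A","B","B","C","C","C","C","C"),
                    event = c(0,0,1,0,1,0,0,1,1,0))

# Cross-tabulate id against event: one row per id, one column per event value
counts <- table(my.df$id, my.df$event)

# Number of event == 1 rows per id, i.e. how many event == 0 rows to draw
n.events <- counts[, "1"]
n.events
```

Here n.events is a vector named by id, so n.events["C"] gives the 
sample size needed for individual C.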

Given the above dataframe I need to randomly select (sample) from my.df 
*one* observation from my.df[my.df$id=="A" & my.df$event==0, ], *one* 
from my.df[my.df$id=="B" & my.df$event==0, ], and *two* from 
my.df[my.df$id=="C" & my.df$event==0, ] and then rbind them to 
my.df[my.df$event == 1, ].  
However, it is impractical to individually code each case.

Alternatively, if A in my.df matches A in my.df.cast, then something like 
my.df[sample(which(my.df$id == "A" & my.df$event == 0), 
size=my.df.cast[1,3], replace=FALSE), ].  I think I am close to a 
solution but I'm not sure how to code it to run through the entire 
dataframe.

This is how my.new.df would look:

_id event_
A     0
A     1
B     0
B     1
C     0
C     0
C     1
C     1
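
One possible way to run this over every id is to split the dataframe by 
id, balance each piece, and rbind the pieces back together. This is a 
hedged sketch on the toy data only, not a tested solution for the full 
dataset. The helper name balance.one is hypothetical, and the handling 
of an id with fewer 0s than 1s (all of its 0s are simply kept) is an 
assumption, since that case does not occur in the example.

```r
# Toy data from the post
my.df <- data.frame(id = c("A","A","A","B","B","C","C","C","C","C"),
                    event = c(0,0,1,0,1,0,0,1,1,0))

set.seed(1)  # only so that the sketch is reproducible

# Hypothetical helper: for the rows of one id, keep every event == 1 row
# plus a random sample of equally many event == 0 rows.
balance.one <- function(d) {
  ones  <- which(d$event == 1)
  zeros <- which(d$event == 0)
  # Assumption: never ask for more 0s than this id actually has
  n <- min(length(ones), length(zeros))
  # Sample positions within 'zeros' rather than its values, because
  # sample(x, n) treats a single number x as the range 1:x
  keep.zeros <- zeros[sample.int(length(zeros), n)]
  d[c(sort(keep.zeros), ones), ]
}

# Split by id, balance each piece, and stack the pieces again
my.new.df <- do.call(rbind, lapply(split(my.df, my.df$id), balance.one))
rownames(my.new.df) <- NULL
my.new.df
```

Which of the event == 0 rows are drawn depends on the seed, but each id 
should end up with as many 0s as 1s, as in my.new.df above. Any other 
columns in the dataframe are carried along because whole rows are 
selected.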

Thank you kindly for your help,

Eric

-- 
Eric Vander Wal
Ph.D. Candidate
University of Saskatchewan, 
Department of Biology,
112 Science Place, 
Saskatoon, SK., S7N 5E2

"Pluralitas non est ponenda sine necessitate"