[R] Sampling a dataframe based on the length of a subset of observations within
Eric Vander Wal
eric.vanderwal at usask.ca
Thu Jul 9 03:19:23 CEST 2009
Thank you in advance for your consideration.
I have a dataframe of 2000+ observations with repeated measures across
approximately 300 unique individuals. An event either does or does not
happen (1, 0), and there is a suite of independent variables associated
with the event. A simplified representation follows:
my.df<-data.frame("id"=c("A","A","A","B","B","C","C","C", "C", "C"),
event=c(0,0,1,0,1,0,0,1,1, 0))
_id_ _event_
A 0
A 0
A 1
B 0
B 1
C 0
C 0
C 1
C 1
C 0
I need to sample my.df to select the same number of observations with
event = 0 as event = 1 for each unique id.
I can reshape or tapply my.df to group by id and determine what sample
size I need, for example as my.df.cast:
library(reshape)
my.df.melt<-melt(my.df, id="id")
my.df.cast<-cast(my.df.melt, id~value, length, fill=0)
my.df.cast
Event
_id_ _0_ _1_
A 2 *1*
B 1 *1*
C 3 *2*
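The same per-id counts can probably also be obtained in base R without the reshape package. The following is only a minimal sketch on the toy data above; the names counts and n.ones are illustrative and not part of my real code.

```r
# Toy data from the example above.
my.df <- data.frame(id = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
                    event = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 0))

# Cross-tabulate id against event: rows are ids, columns are event values.
counts <- table(my.df$id, my.df$event)
counts

# Number of event == 1 rows per id, i.e. how many event == 0 rows
# would need to be sampled for that id.
n.ones <- tapply(my.df$event == 1, my.df$id, sum)
n.ones
```

Here n.ones is a vector named by id, so the required sample size for an individual can be looked up by name rather than by row position.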
Given the above dataframe I need to randomly select (sample) from my.df
*one* observation from my.df[my.df$id=="A" & my.df$event==0, ], *one* from
my.df[my.df$id=="B" & my.df$event==0, ], and *two* from my.df[my.df$id=="C" &
my.df$event==0, ], and then rbind them to my.df[my.df$event==1, ].
However, it is impractical to individually code each case.
Alternatively, if A in my.df matches A in my.df.cast, then something like
sample(my.df[my.df$id == "A" & my.df$event == 0, ], size=my.df.cast[1,3],
replace=FALSE). I think I am close to a solution, but I'm not sure how
to code it to run through the entire dataframe.
This is how my.new.df would look:
_id event_
A 0
A 1
B 0
B 1
C 0
C 0
C 1
C 1
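One way a result of this shape might be produced is to split the dataframe by id, sample within each piece, and rbind the pieces back together. This is a hedged sketch rather than a tested solution for the real data: balance.one is a hypothetical helper name, set.seed is there only to make the draw repeatable, and I am assuming that an id with fewer event = 0 rows than event = 1 rows should simply keep all of its rows.

```r
# Toy data from the example above.
my.df <- data.frame(id = c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"),
                    event = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 0))

set.seed(1)  # only so that the random draw is reproducible

# Hypothetical helper: for the rows of a single id, keep every event == 1 row
# and draw the same number of event == 0 rows without replacement.
balance.one <- function(d) {
  ones  <- which(d$event == 1)
  zeros <- which(d$event == 0)
  # Assumption: never ask for more zeros than are available.
  n <- min(length(ones), length(zeros))
  # Sample positions with sample.int() so that a single candidate row
  # is not misread by sample() as "1:row_number".
  d[c(zeros[sample.int(length(zeros), n)], ones), ]
}

# Apply the helper to each id and stack the pieces back together.
my.new.df <- do.call(rbind, lapply(split(my.df, my.df$id), balance.one))
rownames(my.new.df) <- NULL
my.new.df
```

Which event = 0 rows are chosen for A and C depends on the random draw, but the number of rows per id and event should match the table above.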
Thank you kindly for your help,
Eric
--
Eric Vander Wal
Ph.D. Candidate
University of Saskatchewan,
Department of Biology,
112 Science Place,
Saskatoon, SK., S7N 5E2
"Pluralitas non est ponenda sine necessitate"