[R] Duplicate rows when I combine two data.frames with merge!
Petr Savicky
savicky at cs.cas.cz
Mon Feb 6 22:26:47 CET 2012
On Mon, Feb 06, 2012 at 12:29:53PM -0800, RKinzer wrote:
> Hello all,
>
> First I have done extensive searches on this forum and others and nothing
> seems to work. So I decided to post thinking someone could point me to the
> write post or give me some help.
>
> I have drawn a 100 samples from a fictitious population (N=1000), and then
> randomly selected 25% of the 100 samples. I would like to now merge the
> data.frame from the 100 samples with the data.frame for the 25 individuals
> from the sample. When I do this with the following code I get duplicate
> rows, when I should have at most is 100.
>
> x<-mapply(rnorm,1000,c(54,78,89),c(3.5,5.5,5.9)) #sets up 1000 random
> numbers for age 3,4,5
> x.3<-sample(x[,1],60) #randomly selects 60 lengths from age 3
> x.4<-sample(x[,2],740)
> x.5<-sample(x[,3],200)
> length<-c(x.3,x.4,x.5)
> length<-round(length,digits=0) #rounds lengths to whole number
> age3<-rep(3,60)
> age4<-rep(4,740)
> age5<-rep(5,200)
> age<-c(age3,age4,age5) #combines ages into one vector
> unique<-1:1000 #gives each fish a unique id
> pop<-data.frame(unique,length,age)
> pop<-pop[sample(1:1000,size=1000,replace=FALSE),] #randomized the order of
> pop
> c.one<-pop[sample(1:1000,size=100,replace=TRUE),]
> a.one.qtr<-c.one[sample(1:100,size=25,replace=TRUE),]
> merge<-merge(c.one,a.one.qtr,by="unique",all=TRUE)
>
> What I would ultimately like to have is one row for all 100 in the sample
> and three columns (unique, length, age). And then some way to identify the
> 25 individual selected rows.
The function merge() here includes additional columns, which
contain in the rows from a.one.qtr copies of the columns length
and age. So, the same values appear twice in the row. I am
not sure, whether this is intended.
Another representation of the subsample a.one.qtr may be done
by adding a column to c.one, which specifies, how many times
was the row selected to a.one.qtr. For example as follows.
a.one.qtr2 <- sample(1:100,size=25,replace=TRUE)
c.one2 <- cbind(c.one, selected=tabulate(a.one.qtr2, nbins=100))
# a random result may look like
c.one2
unique length age selected
657 657 81 4 0
488 488 78 4 1
886 886 85 5 0
448 448 82 4 0
292 292 80 4 0
431 431 78 4 0
683 683 82 4 0
32 32 56 3 2
740 740 80 4 0
519 519 81 4 1
986 986 88 5 0
437 437 84 4 0
247 247 88 4 0
122 122 73 4 0
...
The sum of the column "selected" is 25.
Hope this helps.
Petr Savicky.
More information about the R-help
mailing list