[R] Duplicate rows when I combine two data.frames with merge!

Mon Feb 6 22:26:47 CET 2012

On Mon, Feb 06, 2012 at 12:29:53PM -0800, RKinzer wrote:
> Hello all,
> 
> First I have done extensive searches on this forum and others and nothing
> seems to work.  So I decided to post thinking someone could point me to the
> right post or give me some help.
> 
> I have drawn 100 samples from a fictitious population (N=1000), and then
> randomly selected 25% of the 100 samples.  I would now like to merge the
> data.frame for the 100 samples with the data.frame for the 25 individuals
> from the sample.  When I do this with the following code I get duplicate
> rows, when I should have at most 100 rows.
> 
> x<-mapply(rnorm,1000,c(54,78,89),c(3.5,5.5,5.9))  #sets up 1000 random numbers for age 3,4,5
> x.3<-sample(x[,1],60)  #randomly selects 60 lengths from age 3
> x.4<-sample(x[,2],740)
> x.5<-sample(x[,3],200)
> length<-c(x.3,x.4,x.5)  
> length<-round(length,digits=0)  #rounds lengths to whole number
> age3<-rep(3,60) 
> age4<-rep(4,740)
> age5<-rep(5,200)
> age<-c(age3,age4,age5)  #combines ages into one vector
> unique<-1:1000  #gives each fish a unique id
> pop<-data.frame(unique,length,age) 
> pop<-pop[sample(1:1000,size=1000,replace=FALSE),]  #randomizes the order of pop
> c.one<-pop[sample(1:1000,size=100,replace=TRUE),] 
> a.one.qtr<-c.one[sample(1:100,size=25,replace=TRUE),] 
> merge<-merge(c.one,a.one.qtr,by="unique",all=TRUE)
> 
> What I would ultimately like to have is one row for each of the 100 in the
> sample and three columns (unique, length, age), plus some way to identify
> the 25 individually selected rows.

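Here merge() adds extra columns: because length and age occur in
both data frames but are not used as keys, the result contains
length.x, age.x, length.y and age.y. In the rows that come from
a.one.qtr, the .y columns are copies of length and age, so the
same values appear twice in the row. I am not sure whether this
is intended.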
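
To make the effect of merge() concrete, here is a minimal sketch on toy
data. The data frames samp and sub and their values are hypothetical
stand-ins for c.one and a.one.qtr, not the poster's data. It assumes, as
in the original code, that sampling with replace=TRUE lets the same id
occur more than once in both data frames; merge() then pairs every
matching row of one with every matching row of the other.

```r
# Hypothetical toy sample: id 2 was drawn twice
samp <- data.frame(unique = c(1, 2, 2, 3),
                   length = c(55, 80, 80, 88),
                   age    = c(3, 4, 4, 5))

# Hypothetical subsample: rows 2 and 3 of samp (both have id 2)
sub <- samp[c(2, 3), ]

m <- merge(samp, sub, by = "unique", all = TRUE)

names(m)  # "unique" "length.x" "age.x" "length.y" "age.y"
nrow(m)   # 6: ids 1 and 3 give one row each, id 2 gives 2 x 2 = 4 rows
```

So the merged result can have more rows than either input whenever an id
is repeated on both sides.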

Another way to represent the subsample a.one.qtr is to add a
column to c.one that records how many times each row was
selected into a.one.qtr. For example:

  a.one.qtr2 <- sample(1:100,size=25,replace=TRUE)
  c.one2 <- cbind(c.one, selected=tabulate(a.one.qtr2, nbins=100))

  # a random result may look like

  c.one2

        unique length age selected
  657      657     81   4        0
  488      488     78   4        1
  886      886     85   5        0
  448      448     82   4        0
  292      292     80   4        0
  431      431     78   4        0
  683      683     82   4        0
  32        32     56   3        2
  740      740     80   4        0
  519      519     81   4        1
  986      986     88   5        0
  437      437     84   4        0
  247      247     88   4        0
  122      122     73   4        0
  ...

The sum of the column "selected" is 25.
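
As a quick check of this representation, the following sketch verifies
the sum and rebuilds the subsample from the counts. The seed and the
stand-in c.one (with made-up lengths and ages) are assumptions added only
so that the block runs on its own; with the real c.one, only the last
three statements are needed.

```r
set.seed(1)  # assumed seed, only for reproducibility

# Hypothetical stand-in for c.one: 100 rows with made-up values
c.one <- data.frame(unique = 1:100,
                    length = round(rnorm(100, 78, 5.5)),
                    age    = 4)

a.one.qtr2 <- sample(1:100, size = 25, replace = TRUE)
c.one2 <- cbind(c.one, selected = tabulate(a.one.qtr2, nbins = 100))

sum(c.one2$selected)  # 25, one count per draw
nrow(c.one2)          # 100, one row per sampled position

# Rebuild the subsample: repeat each row as often as it was selected
a.one.qtr <- c.one2[rep(seq_len(nrow(c.one2)), c.one2$selected), ]
nrow(a.one.qtr)       # 25
```

Rows with selected == 0 were not chosen, and c.one2[c.one2$selected > 0, ]
lists each chosen row once.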

Hope this helps.

Petr Savicky.