[R] merging data frames with matrix objects when missing cases

Sat Sep 19 07:52:31 CEST 2009

Yes, that was the original question: when a variable in a data frame is
a matrix rather than an ordinary vector, merge() handles the missing
cases so that only the first column of the matrix gets NA and the
remaining columns are filled with recycled values. If the matrix is split
into separate variables, everything works fine.

Why have a matrix as a variable in a data frame at all? In chemometrics,
for example, it is common to store NIR spectra in the data frame this
way. This makes it easy to use a whole spectrum as a single predictor in
the model formula (it may contain hundreds of "variables", depending on
the wavelength binning used). It also helps in grouping the variables of
a data frame into different predictor sets. See the examples in the "pls"
package.
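
A minimal sketch of that idea, using simulated numbers rather than real
spectra. The names spectra, d and NIR are made up for illustration, and
lm() merely stands in for a chemometric method such as those in "pls":

```r
set.seed(1)
n <- 20

# Hypothetical "spectra": 20 samples measured at 5 wavelength bins
spectra <- matrix(rnorm(n * 5), nrow = n)

# I() keeps the matrix as one variable instead of splitting it into columns
d <- data.frame(y = rnorm(n), NIR = I(spectra))
ncol(d)            # 2: the whole matrix counts as one variable

# The matrix enters the model formula as a single term
fit <- lm(y ~ NIR, data = d)
length(coef(fit))  # 6: intercept plus one coefficient per matrix column
```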

There is a workaround: search the first column of the matrix for NA and
set all the other columns in that row to NA as well. But my message was
meant more as a caution about unexpected behaviour that some might
consider an unwanted feature.
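
The workaround can be sketched roughly as follows, using the toy data
from the original question. The helper names X2 and unmatched are only
for illustration, and the sketch assumes that an NA in the first column
always marks an unmatched row; a genuine NA there in a matched row would
be blanked as well:

```r
df1 <- data.frame(a = 1:3, X1 = I(matrix(1:6, ncol = 2)))
df2 <- data.frame(a = 1:2, X2 = I(matrix(11:14, ncol = 2)))

m <- merge(df1, df2, all = TRUE)

# Work on a plain matrix copy of the merged matrix variable
X2 <- unclass(m$X2)

# Rows with no match in df2 have NA in (at least) the first column
unmatched <- is.na(X2[, 1])

# Set every column of those rows to NA
X2[unmatched, ] <- NA

# Put the repaired matrix back as a single protected variable
m$X2 <- I(X2)
```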

Kari

On Fri, 2009-09-18 at 20:41 +0300, johannes rara wrote:
> This has something to do with your data.frame structure
> 
> see
> 
> > str(df1)
> 'data.frame':	3 obs. of  2 variables:
>  $ a : int  1 2 3
>  $ X1: 'AsIs' int [1:3, 1:2] 1 2 3 4 5 6
> > str(df2)
> 'data.frame':	2 obs. of  2 variables:
>  $ a : int  1 2
>  $ X2: 'AsIs' int [1:2, 1:2] 11 12 13 14
> 
> This seems to work
> 
> > df1<-data.frame(a=1:3, b = 1:3, c = 4:6)
> > str(df1)
> 'data.frame':	3 obs. of  3 variables:
>  $ a: int  1 2 3
>  $ b: int  1 2 3
>  $ c: int  4 5 6
> > df2<-data.frame(a=1:2, d = 11:12, e = 13:14)
> > str(df2)
> 'data.frame':	2 obs. of  3 variables:
>  $ a: int  1 2
>  $ d: int  11 12
>  $ e: int  13 14
> > merge(df1,df2)
>   a b c  d  e
> 1 1 1 4 11 13
> 2 2 2 5 12 14
> > merge(df1, df2, all=T)
>   a b c  d  e
> 1 1 1 4 11 13
> 2 2 2 5 12 14
> 3 3 3 6 NA NA
> >
> 
> 2009/9/18 Kari Ruohonen <kari.ruohonen at utu.fi>:
> > Hi,
> > I have faced a problem with the merge() function when trying to merge
> > two data frames that have a common index but the second one does not
> > have cases for all indexes in the first one. With usual variables R
> > fills in the missing cases with NA if all=T is requested. But if the
> > variable is a matrix R seems to insert NA only to the first column of
> > the matrix and fill in the rest of the columns by recycling the values.
> > Here is a toy example:
> >
> >> df1<-data.frame(a=1:3,X1=I(matrix(1:6,ncol=2)))
> >> df2<-data.frame(a=1:2,X2=I(matrix(11:14,ncol=2)))
> >> merge(df1,df2)
> >  a X1.1 X1.2 X2.1 X2.2
> > 1 1    1    4   11   13
> > 2 2    2    5   12   14
> > # no all=T, missing cases are dropped
> >
> >> merge(df1,df2,all=T)
> >  a X1.1 X1.2 X2.1 X2.2
> > 1 1    1    4   11   13
> > 2 2    2    5   12   14
> > 3 3    3    6   NA   13
> > # X2.1 set to NA correctly but X2.2 set to 13 by recycling.
> >
> > Can I somehow get the behaviour where the third row of the second matrix
> > X2 in the above example is filled with NA in all columns? None of
> > the merge() options seems to provide a solution.
> >
> > regards, Kari
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >