[R] negative vector length when merging data frames
Ana Marija
@okov|c@@n@m@r|j@ @end|ng |rom gm@||@com
Tue Oct 29 03:28:57 CET 2019
HI Rui,
thank you so much for this. I tried with the sqldf but it didn't help.
Next I tried your 2nd method and I was following your steps until:
> res2 <- asign[i2, setdiff(names(asign), names(l4))]
> m=merge(res, res2, by.x = c("chr", "pos"), by.y = c("chr", "pos"))
Error in merge.data.table(res, res2, by.x = c("chr", "pos"), by.y = c("chr", :
Elements listed in `by.y` must be valid column names in y.
> head(res)
chr pos a1 a2 a3 variant_id pval_nominal gene_id
1: chr1 54490 G A b38 chr1_54490_G_A_b38 0.608495 ENSG00000227232.5
2: chr1 58814 G A b38 chr1_58814_G_A_b38 0.295211 ENSG00000227232.5
3: chr1 60351 A G b38 chr1_60351_A_G_b38 0.439788 ENSG00000227232.5
4: chr1 61920 G A b38 chr1_61920_G_A_b38 0.319528 ENSG00000227232.5
5: chr1 63671 G A b38 chr1_63671_G_A_b38 0.237739 ENSG00000227232.5
6: chr1 64931 G A b38 chr1_64931_G_A_b38 0.276679 ENSG00000227232.5
> head(res2)
[1] "gene" "chr_pos" "p.val.Retina"
> dim(res)
[1] 111478253 8
> head(l4)
chr pos a1 a2 a3 variant_id pval_nominal gene_id
1: chr1 13550 G A b38 chr1_13550_G_A_b38 0.375614 ENSG00000227232.5
2: chr1 14671 G C b38 chr1_14671_G_C_b38 0.474708 ENSG00000227232.5
3: chr1 14677 G A b38 chr1_14677_G_A_b38 0.699887 ENSG00000227232.5
4: chr1 16841 G T b38 chr1_16841_G_T_b38 0.127895 ENSG00000227232.5
5: chr1 16856 A G b38 chr1_16856_A_G_b38 0.627822 ENSG00000227232.5
6: chr1 17005 A G b38 chr1_17005_A_G_b38 0.802803 ENSG00000227232.5
> head(asign)
gene chr chr_pos pos p.val.Retina
1: ENSG00000227232 chr1 1:10177:A:AC 10177 0.381708
2: ENSG00000227232 chr1 rs145072688:10352:T:TA 10352 0.959523
3: ENSG00000227232 chr1 1:11008:C:G 11008 0.218132
4: ENSG00000227232 chr1 1:11012:C:G 11012 0.218132
5: ENSG00000227232 chr1 1:13110:G:A 13110 0.998262
6: ENSG00000227232 chr1 rs201725126:13116:T:G 13116 0.438572
> length(i2)
[1] 107371528
Everything is the same as I stated initially in the problem, except
that as you can see in l4 I renamed columns so now
instead of X1 and X2 I have "chr", "pos"
Do you know why this command didn't return anything?
res2 <- asign[i2, setdiff(names(asign), names(l4))]
On Thu, Oct 24, 2019 at 2:17 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>
> Hello,
>
> Sometimes sqldf::sqldf tends to save memory. Maybe if you try
>
> library(sqldf)
>
> sqldf('select l4.*, asign.gene, asign.chr_pos, asign.`p.val.Retina`
> from l4
> inner join asign
> on X1 = asign.chr and X2 = asign.pos')
>
> Or you can filter the rows that match first, then merge the results.
> Something along the lines of
>
> # read in only the columns needed with fread, it's fast
> l4join <- data.table::fread(l4_file, select = c("X1", "X2"))
> ajoin <- data.table::fread(asign_file, select = c("chr", "pos"))
>
> # create indices with the matches on both sides
> i1 <- (l4join$X1 %in% ajoin$chr) & (l4join$X2 %in% ajoin$pos)
> i2 <- (ajoin$chr %in% l4join$X1) & (ajoin$pos %in% l4join$X2)
>
> rm(l4join, ajoin) # don't need this any more, remove them
>
> # now the real fread's
> l4 <- data.table::fread(l4_file)
> asign <- data.table::fread(asign_file)
>
> # extract the relevant rows and merge
> res <- l4[i1, ]
> res2 <- asign[i2, setdiff(names(asign), names(l4))]
> merge(res, res2, by.x = c("X1", "X2"), by.y = c("chr", "pos"))
>
>
> Hope this helps,
>
> Rui Barradas
>
>
>
>
>
>
> Às 00:08 de 24/10/19, Ana Marija escreveu:
> > Hi Jim,
> >
> > I think one of the issue is that data frames are so big,
> >> dim(l4)
> > [1] 166941635 8
> >> dim(asign)
> > [1] 107371528 5
> >
> > so my example would not reproduce the error
> >
> > On Wed, Oct 23, 2019 at 6:05 PM Jim Lemon <drjimlemon using gmail.com> wrote:
> >>
> >> Hi Ana,
> >> When I run this example taken from your email:
> >>
> >> l4<-read.table(text="X1 X2 X3 X4 X5 variant_id pval_nominal gene_id.LCL
> >> chr1 13550 G A b38 1:13550:G:A 0.375614 ENSG00000227232
> >> chr1 14671 G C b38 1:14671:G:C 0.474708 ENSG00000227232
> >> chr1 14677 G A b38 1:14677:G:A 0.699887 ENSG00000227232
> >> chr1 16841 G T b38 1:16841:G:T 0.127895 ENSG00000227232
> >> chr1 16856 A G b38 1:16856:A:G 0.627822 ENSG00000227232
> >> chr1 17005 A G b38 1:17005:A:G 0.802803 ENSG00000227232",
> >> header=TRUE,stringsAsFactors=FALSE)
> >> asign<-read.table(text="gene chr chr_pos pos p.val.Retina
> >> ENSG00000227232 chr1 1:10177:A:AC 10177 0.381708
> >> ENSG00000227232 chr1 rs145072688:10352:T:TA 10352 0.959523
> >> ENSG00000227232 chr1 1:11008:C:G 11008 0.218132
> >> ENSG00000227232 chr1 1:11012:C:G 11012 0.218132
> >> ENSG00000227232 chr1 1:13110:G:A 13110 0.998262
> >> ENSG00000227232 chr1 rs201725126:13116:T:G 13116 0.438572",
> >> header=TRUE,stringsAsFactors=FALSE)
> >> merge(l4, asign, by.x=c("X1", "X2"), by.y=c("chr", "pos"))
> >> [1] X1 X2 X3 X4 X5
> >> [6] variant_id pval_nominal gene_id.LCL gene chr_pos
> >> [11] p.val.Retina
> >> <0 rows> (or 0-length row.names)
> >>
> >> It works okay, but there are no matches in the join. So I can't even
> >> guess what the problem is.
> >>
> >> Jim
> >>
> >> On Thu, Oct 24, 2019 at 9:33 AM Ana Marija <sokovic.anamarija using gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I have two data frames like this:
> >>>
> >>>> head(l4)
> >>> X1 X2 X3 X4 X5 variant_id pval_nominal gene_id.LCL
> >>> 1 chr1 13550 G A b38 1:13550:G:A 0.375614 ENSG00000227232
> >>> 2 chr1 14671 G C b38 1:14671:G:C 0.474708 ENSG00000227232
> >>> 3 chr1 14677 G A b38 1:14677:G:A 0.699887 ENSG00000227232
> >>> 4 chr1 16841 G T b38 1:16841:G:T 0.127895 ENSG00000227232
> >>> 5 chr1 16856 A G b38 1:16856:A:G 0.627822 ENSG00000227232
> >>> 6 chr1 17005 A G b38 1:17005:A:G 0.802803 ENSG00000227232
> >>>> head(asign)
> >>> gene chr chr_pos pos p.val.Retina
> >>> 1: ENSG00000227232 chr1 1:10177:A:AC 10177 0.381708
> >>> 2: ENSG00000227232 chr1 rs145072688:10352:T:TA 10352 0.959523
> >>> 3: ENSG00000227232 chr1 1:11008:C:G 11008 0.218132
> >>> 4: ENSG00000227232 chr1 1:11012:C:G 11012 0.218132
> >>> 5: ENSG00000227232 chr1 1:13110:G:A 13110 0.998262
> >>> 6: ENSG00000227232 chr1 rs201725126:13116:T:G 13116 0.438572
> >>>> m = merge(l4, asign, by.x=c("X1", "X2"), by.y=c("chr", "pos"))
> >>> Error in merge.data.frame(l4, asign, by.x = c("X1", "X2"), by.y = c("chr", :
> >>> negative length vectors are not allowed
> >>>> sapply(l4,class)
> >>> X1 X2 X3 X4 X5 variant_id
> >>> "character" "character" "character" "character" "character" "character"
> >>> pval_nominal gene_id.LCL
> >>> "numeric" "character"
> >>>> sapply(asign,class)
> >>> gene chr chr_pos pos p.val.Retina
> >>> "character" "character" "character" "character" "character"
> >>>
> >>> Please advise as to why I am getting this error when merging?
> >>>
> >>> Thanks
> >>> Ana
> >>>
> >>> ______________________________________________
> >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
More information about the R-help
mailing list