[R] Efficient way to create new column based on comparison with another dataframe
Hervé Pagès
hpages at fredhutch.org
Mon Feb 1 23:06:48 CET 2016
Hi Gaius,
On 01/29/2016 10:52 AM, Gaius Augustus wrote:
> I have two dataframes. One has chromosome arm information, and the other
> has SNP position information. I am trying to assign each SNP an arm
> identity. I'd like to create this new column based on comparing it to the
> reference file.
>
> *1) Mapfile (has millions of rows)*
>
> Name Chr Position
> S1 1 3000
> S2 1 6000
> S3 1 1000
>
> *2) Chr.Arms file (has 39 rows)*
>
> Chr Arm Start End
> 1 p 0 5000
> 1 q 5001 10000
>
>
> *R Script that works, but slow:*
> Arms <- c()
> for (line in 1:nrow(Mapfile)){
>   Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>     Mapfile$Position[line] > Chr.Arms$Start &
>     Mapfile$Position[line] < Chr.Arms$End]
> }
> Mapfile$Arm <- Arms
>
>
> *Output Table:*
>
> Name Chr Position Arm
> S1 1 3000 p
> S2 1 6000 q
> S3 1 1000 p
>
>
> In words: I want each line to look up the location ( 1) find the right Chr,
> 2) find the line where the START < POSITION < END), then get the ARM
> information and place it in a new column.
>
> This R script works, but surely there is a more time/processing efficient
> way to do it.
You could use the GenomicRanges package for this:
1) Turn 'Mapfile' and 'Chr.Arms' into GRanges objects:
     library(GenomicRanges)
     query <- makeGRangesFromDataFrame(Mapfile, start.field="Position",
                                       end.field="Position")
     subject <- makeGRangesFromDataFrame(Chr.Arms)
2) Call findOverlaps() on them:
     Mapfile2Chr.Arms <- findOverlaps(query, subject, select="arbitrary")
3) Use the result of findOverlaps() to create the column to add to
   'Mapfile':
     Mapfile$Arm <- Chr.Arms$Arm[Mapfile2Chr.Arms]
     Mapfile
     #   Name Chr Position Arm
     # 1   S1   1     3000   p
     # 2   S2   1     6000   q
     # 3   S3   1     1000   p
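This should be very fast, since findOverlaps() handles all the SNPs in a
single vectorized call rather than one row at a time.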
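If installing a Bioconductor package is not an option, the same lookup can be sketched in base R with findInterval(). This is only an illustration, not a replacement for the GenomicRanges steps above: assign_arms() is a hypothetical helper name, the data frames are rebuilt from the tables in the question, and the sketch assumes that the arms within a chromosome do not overlap. Like findOverlaps(), it treats Start and End as inclusive, whereas the loop in the question used strict inequalities.

```r
# Example data rebuilt from the tables in the question.
Mapfile <- data.frame(Name = c("S1", "S2", "S3"),
                      Chr = c(1, 1, 1),
                      Position = c(3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1),
                       Arm = c("p", "q"),
                       Start = c(0, 5001),
                       End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# Hypothetical helper (not part of any package): returns one arm label
# per row of 'snps', or NA when a position falls in no arm.
# Assumes the arms of a given chromosome do not overlap.
assign_arms <- function(snps, arms) {
    out <- rep(NA_character_, nrow(snps))
    for (chr in unique(arms$Chr)) {
        a <- arms[arms$Chr == chr, ]
        a <- a[order(a$Start), ]          # findInterval() needs sorted breaks
        idx <- which(snps$Chr == chr)
        pos <- snps$Position[idx]
        # Index of the last arm whose Start is <= pos (0 if there is none).
        k <- findInterval(pos, a$Start)
        k[which(k == 0)] <- NA            # position lies before the first arm
        # Keep the candidate arm only if the position is also <= its End.
        inside <- !is.na(k) & pos <= a$End[k]
        out[idx[inside]] <- as.character(a$Arm)[k[inside]]
    }
    out
}

Mapfile$Arm <- assign_arms(Mapfile, Chr.Arms)
Mapfile
```

The loop here runs once per chromosome, not once per SNP, so the per-row work is still done by vectorized calls.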
Note that GenomicRanges is a Bioconductor package:
http://bioconductor.org/packages/GenomicRanges
Make sure you follow the Installation instructions on that page.
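For convenience, the snippet below shows the installation route that recent Bioconductor releases document, via the BiocManager package from CRAN. It postdates this message and may not match older R versions, so treat it as a sketch and defer to the page linked above if the two differ.

```r
# Install the BiocManager helper from CRAN if it is not already present,
# then use it to install GenomicRanges from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")
```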
Cheers,
H.
>
> Thanks in advance for any help,
> Gaius
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319