[R] Efficient way to create new column based on comparison with another dataframe

Gaius Augustus gaiusjaugustus at gmail.com
Sat Jan 30 18:50:05 CET 2016


I'll look into the Intervals idea.  The data.table code posted might not
work for me, because I don't believe it would keep the rows in their
original order if the chromosomes are interspersed.  It did, however, make
me think about assigning the arm directly to the rows that match...

Something like:
library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

# For each arm, tag the matching mapfile rows in place (row order is kept)
for(i in seq_len(nrow(Chr.Arms))){
  cur.row <- Chr.Arms[i, ]
  mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
            Position <= cur.row$End, Arm := cur.row$Arm]
}

This might take out the need for the intermediate table/vector.  Not sure
yet if it'll work, but we'll see.  I'm interested to know if anyone else
has any ideas, too.
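
One loop-free possibility, sketched here under assumptions rather than
tested on the real data: a data.table update join with range conditions in
`on=`.  This assumes a data.table version that supports non-equi joins
(1.9.8 or later, as far as I know) and that no two arms on the same
chromosome overlap.  The toy tables are the ones from the example above,
without the key.

```r
library(data.table)

# Toy data from the example above
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# Update join: for every row of Chr.Arms, find the mapfile rows on the same
# Chr with Start <= Position <= End and write that arm into a new column.
# mapfile is modified by reference, so its row order does not change;
# rows that fall in no interval are left as NA.
mapfile[Chr.Arms, on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]

print(mapfile)
```

Because the assignment happens inside mapfile itself, there is no
intermediate table to rbind and re-sort afterwards.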

Thanks,
Gaius

On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com>
wrote:

> Hi Gaius,
>
> Could you use data.table and loop over the small Chr.arms?
>
> library(data.table)
> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
>                       Position = c(3000, 6000, 1000), key = "Chr")
> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
>                        End = c(5000, 10000), key = "Chr")
>
> Arms <- data.table()
> for(i in 1:nrow(Chr.Arms)){
>   cur.row <- Chr.Arms[i, ]
>   Arm <- mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start &
>                   Position <= cur.row$End]
>   Arm <- Arm[ , Arm:=cur.row$Arm][]
>   Arms <- rbind(Arms, Arm)
> }
>
> # Or use plyr to loop over each possible arm
> library(plyr)
> Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
>   mapfile <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
>   mapfile <- mapfile[ , Arm:=cur.row$Arm][]
>   return(mapfile)
> }, mapfile = mapfile)
>
> I have just started to use data.table and I have the feeling the code
> above can be greatly improved - maybe the loop can be dropped entirely?
>
> Hope this helps
> Ulrik
>
> On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
> wrote:
>
>> I have two dataframes. One has chromosome arm information, and the other
>> has SNP position information. I am trying to assign each SNP an arm
>> identity.  I'd like to create this new column based on comparing it to the
>> reference file.
>>
>> *1) Mapfile (has millions of rows)*
>>
>> Name    Chr   Position
>> S1      1      3000
>> S2      1      6000
>> S3      1      1000
>>
>> *2) Chr.Arms   file (has 39 rows)*
>>
>> Chr    Arm    Start   End
>> 1      p      0       5000
>> 1      q      5001    10000
>>
>>
>> *R Script that works, but slow:*
>> Arms  <- c()
>> for (line in 1:nrow(Mapfile)){
>>   Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>>                               Mapfile$Position[line] > Chr.Arms$Start &
>>                               Mapfile$Position[line] < Chr.Arms$End]
>> }
>> Mapfile$Arm <- Arms
>>
>>
>> *Output Table:*
>>
>> Name   Chr   Position   Arm
>> S1      1     3000      p
>> S2      1     6000      q
>> S3      1     1000      p
>>
>>
>> In words: I want each line to look up the location (1. find the right
>> Chr, 2. find the line where START < POSITION < END), then get the ARM
>> information and place it in a new column.
>>
>> This R script works, but surely there is a more time/processing efficient
>> way to do it.
>>
>> Thanks in advance for any help,
>> Gaius
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
