[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Peter Cowan
cowan.pd at gmail.com
Wed Aug 13 04:31:33 CEST 2008
Emmanuel,
On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
> Dear All,
>
> I have a large data frame ( 2700000 lines and 14 columns), and I would like to
> extract the information in a particular way illustrated below:
>
>
> Given a data frame "df":
>
>> col1=sample(c(0,1),10, rep=T)
>> names = factor(c(rep("A",5),rep("B",5)))
>> df = data.frame(names,col1)
>> df
>    names col1
> 1      A    1
> 2      A    0
> 3      A    1
> 4      A    0
> 5      A    1
> 6      B    0
> 7      B    0
> 8      B    1
> 9      B    0
> 10     B    0
>
> I would like to transform it into the form:
>
>> index = c("A","B")
>> col1[[1]]=df$col1[which(df$name=="A")]
>> col1[[2]]=df$col1[which(df$name=="B")]
I'm not sure I fully understand your problem; your example would not run for me.
You could get a small speedup by omitting which(): you can subset directly
by the logical vector that the comparison returns.
> n <- 2700000
> foo <- data.frame(
+ one = sample(c(0,1), n, rep = T),
+ two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+ )
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588
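To make the point about dropping which() concrete, here is a minimal sketch on a small stand-in data frame (the seed and the object names by_which and by_logical are illustrative, not from the original post). It only checks that both forms of subsetting select the same values; it says nothing about timings on the real data.

```r
# Small stand-in for the large data frame (illustrative only).
set.seed(1)
df <- data.frame(
  names = factor(rep(c("A", "B"), each = 5)),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# Subsetting via the integer positions returned by which() ...
by_which <- df$col1[which(df$names == "A")]

# ... and directly via the logical vector from the comparison.
by_logical <- df$col1[df$names == "A"]

# The two agree as long as df$names contains no NA values.
# With NAs present, logical subsetting returns NA entries
# that which() would have dropped.
identical(by_which, by_logical)
```

The one caveat is missing values: if the factor column can contain NA, the which() form may still be the safer choice.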
You might also find use for unstack(), though I didn't see a speedup.
> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004
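If the goal is one vector per factor level, base R's split() may be worth trying as well; it was not timed in this thread, so treat the following as a sketch rather than a measured improvement. The small data frame and the name by_level are assumptions made for illustration.

```r
# Small stand-in for the large data frame (illustrative only).
set.seed(1)
df <- data.frame(
  names = factor(rep(c("A", "B"), each = 5)),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# split() groups col1 by the levels of names in a single call and
# returns a list with one element per level.
by_level <- split(df$col1, df$names)

# The level labels become the names of the list elements.
index <- names(by_level)

# Each element holds the same values as the corresponding
# comparison-based subset.
by_level[["A"]]
df$col1[df$names == "A"]
```

This avoids repeating the comparison once per level, which might matter if the factor has many levels.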
HTH
Peter
> My problem is that the command: *** which(df$name=="A") ***
> takes about 1 second because df is so big.
>
> I was thinking that a "level" could maybe be accessed instantly but I am not
> sure about how to do it.
>
> I would be very grateful for any advice that would allow me to speed this up.
>
> Best wishes,
>
> Emmanuel