[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Henrik Bengtsson
hb at stat.berkeley.edu
Wed Aug 13 04:56:05 CEST 2008
To simplify:
n <- 2.7e6;
x <- factor(c(rep("A", n/2), rep("B", n/2)));
# Identify 'A':s
t1 <- system.time(res <- which(x == "A"));
# To compare a factor to a string, the factor is in practice
# coerced to a character vector.
t2 <- system.time(res <- which(as.character(x) == "A"));
# Interestingly enough, this seems to be faster (repeated many times)
# Don't know why.
print(t2/t1);
    user   system  elapsed
0.632653 1.600000 0.754717
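# Avoid coercing the factor; instead compare against the integer code of the level
t3 <- system.time(res <- which(x == match("A", levels(x))));
# ...but this gives no speedup: '==' on a factor still compares level labels
# (so the labels are matched against "1" here, and 'res' comes back empty)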
print(t3/t1);
    user   system  elapsed
1.041667 1.000000 1.018182
# But coercing the factor to integers does
t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
print(t4/t1);
     user    system   elapsed
0.4166667 0.0000000 0.3636364
So, the latter seems to be the fastest way to identify those elements.
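As a rough check, assuming the factor contains no NA values, the integer-code comparison should pick out the same elements as the comparison against the level label, whereas comparing the factor itself against the code should match nothing. If the end goal is one vector per level, as in the original question, base R's split() does the grouping in a single call. The reduced n and the small data frame below are made up for illustration; this is a sketch, not a benchmark.

```r
# Small stand-in for the large factor above (n reduced so this runs quickly).
n <- 1e4
x <- factor(c(rep("A", n/2), rep("B", n/2)))

# Reference result: compare the factor to the level label.
idx_label <- which(x == "A")

# Integer-code comparison, as in the t4 timing above.
code_A <- match("A", levels(x))
idx_code <- which(as.integer(x) == code_A)

# Both approaches select the same elements.
stopifnot(identical(idx_label, idx_code))

# Comparing the factor itself to the code compares the labels ("A", "B")
# to "1", so nothing matches.
idx_factor_vs_code <- which(x == code_A)
stopifnot(length(idx_factor_vs_code) == 0L)

# Hypothetical data frame mirroring the one in the original question.
df <- data.frame(names = x, col1 = rep(c(0, 1), length.out = n))

# One numeric vector per level, in a single call.
by_level <- split(df$col1, df$names)
stopifnot(identical(by_level[["A"]], df$col1[idx_code]))
```

Whether split() beats repeated which() calls on a data frame with 2.7 million rows would need to be timed on the actual data.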
My $.02
/Henrik
On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <cowan.pd at gmail.com> wrote:
> Emmanuel,
>
> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy at gmail.com> wrote:
>> Dear All,
>>
>> I have a large data frame ( 2700000 lines and 14 columns), and I would like to
>> extract the information in a particular way illustrated below:
>>
>>
>> Given a data frame "df":
>>
>>> col1=sample(c(0,1),10, rep=T)
>>> names = factor(c(rep("A",5),rep("B",5)))
>>> df = data.frame(names,col1)
>>> df
>>    names col1
>> 1      A    1
>> 2      A    0
>> 3      A    1
>> 4      A    0
>> 5      A    1
>> 6      B    0
>> 7      B    0
>> 8      B    1
>> 9      B    0
>> 10     B    0
>>
>> I would like to transform it into the form:
>>
>>> index = c("A","B")
>>> col1[[1]]=df$col1[which(df$name=="A")]
>>> col1[[2]]=df$col1[which(df$name=="B")]
>
> I'm not sure I fully understand your problem; your example would not run for me.
>
> You could get a small speedup by omitting which(), since you can also
> subset directly by a logical vector.
>
>> n <- 2700000
>> foo <- data.frame(
> + one = sample(c(0,1), n, rep = T),
> + two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
> + )
>> system.time(out <- which(foo$two=="A"))
>    user  system elapsed
>   0.566   0.146   0.761
>> system.time(out <- foo$two=="A")
>    user  system elapsed
>   0.429   0.075   0.588
>
> You might also find use for unstack(), though I didn't see a speedup.
>> system.time(out <- unstack(foo))
>    user  system elapsed
>   1.068   0.697   2.004
>
> HTH
>
> Peter
>
>> My problem is that the command: *** which(df$name=="A") ***
>> takes about 1 second because df is so big.
>>
>> I was thinking that a "level" could maybe be accessed instantly but I am not
>> sure about how to do it.
>>
>> I would be very grateful for any advice that would allow me to speed this up.
>>
>> Best wishes,
>>
>> Emmanuel
>