[R] A manipulation problem for a large data set in R
Charles C. Berry
cberry at tajo.ucsd.edu
Wed Aug 27 17:58:59 CEST 2008
On Wed, 27 Aug 2008, Giuseppe Paleologo wrote:
> I have two questions for the group. One is very concrete, and is dangerously
> close to a "please do my homework" posting. The second follows from the
> first one but is more general. I would welcome the advice of experienced R
> users.
>
> As for the first one: I have a data frame with two variables
>
> X Y
> A, chris
> D, chris
> B, chris
> B, chris
> C, andrew
> E, andrew
> C, andrew
> B, beth
> D, chris
> D, beth
> C, beth
> D, beth
> D, beth
> A, andrew
> A, andrew
> A, andrew
> C, chris
> B, beth
> D, chris
> E, andrew
> D, chris
> D, beth
> D, chris
> A, andrew
> A, chris
> C, chris
> A, chris
> B, chris
> C, beth
> A, chris
>
> I would like to produce a table, with one row for every level of the factor
> X, and multiple columns, filled with the observed levels of the factor Y
> that are observed jointly with X. Hence:
>
> X Z1 Z2 Z3
> A, andrew, chris
> B, beth, chris
> C, andrew, beth, chris
> D, chris, beth
> E, andrew
>
> A solution would be to do something like
>
> temp = tapply(Y, X, function(a) levels(a[, drop=TRUE]))
lapply( split(Y,X), unique )
or
lapply( split(Y,X), function(x) as.character(unique(x)))
HTH,
Chuck
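
A minimal sketch of how the list returned by the second form could be padded into the wide table asked for in the question. The toy data frame `df`, the alphabetical ordering via `sort()`, and the helper names `by_x`, `width`, `padded`, and `wide` are assumptions made for illustration, not part of the original thread:

```r
# Toy data with the same shape as the example in the question
df <- data.frame(
  X = c("A", "D", "B", "B", "C", "E", "C", "B", "D", "A"),
  Y = c("chris", "chris", "chris", "chris", "andrew", "andrew",
        "andrew", "beth", "beth", "andrew"),
  stringsAsFactors = TRUE
)

# One sorted character vector of distinct Y values per level of X
by_x <- lapply(split(df$Y, df$X), function(y) sort(as.character(unique(y))))

# Indexing past the end of a vector yields NA, which pads every
# vector to the length of the longest one
width <- max(lengths(by_x))
padded <- do.call(rbind, lapply(by_x, function(v) v[seq_len(width)]))

# Assemble the wide table; the Z1, Z2, ... names follow the question
wide <- data.frame(X = names(by_x), padded, stringsAsFactors = FALSE)
names(wide)[-1] <- paste0("Z", seq_len(width))
rownames(wide) <- NULL

print(wide)
```

With many repeated (X, Y) pairs, calling `unique(df)` before `split()` may reduce the work done per group, though whether that helps on a given data set would need to be timed.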
>
> and then putting the output in an appropriately sized data frame. The issue
> I have with this is that it is inelegant and rather slow for my typical data
> set (~200K rows). So I was wondering if a more efficient, nicer solution
> exists.
>
> This leads me to a second question. Maybe out of laziness, maybe because R
> is good enough, I tend to do all my local data manipulations in R. This
> includes de-duping records, joining tables, and grouping observations. I do
> this also for larger data sets (say, dense tables with 100M+ elements). Is
> this current practice among R users? If so, is there a tutorial, or an R
> view on it? If not, what do you use?
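
For reference, the three operations named above (de-duping, joining, grouping) map onto base R functions roughly as sketched here. The `orders` and `owners` tables are hypothetical and invented for this illustration, and nothing here speaks to how these calls scale to 100M+ elements:

```r
# Hypothetical tables, invented for this illustration
orders <- data.frame(id = c(1, 1, 2, 3, 3),
                     item = c("a", "a", "b", "c", "d"),
                     stringsAsFactors = FALSE)
owners <- data.frame(id = c(1, 2, 3),
                     name = c("andrew", "beth", "chris"),
                     stringsAsFactors = FALSE)

# De-dupe: keep the first occurrence of each identical row
orders_unique <- orders[!duplicated(orders), ]

# Join: inner join on the shared column "id"
joined <- merge(orders_unique, owners, by = "id")

# Group: count items per owner
counts <- aggregate(item ~ name, data = joined, FUN = length)

print(counts)
```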
>
> Thanks in advance,
>
> -gappy
>
Charles C. Berry                            (858) 534-2098
Dept of Family/Preventive Medicine
UC San Diego, La Jolla, San Diego 92093-0901
E mailto:cberry at tajo.ucsd.edu
http://famprevmed.ucsd.edu/faculty/cberry/