[R] Large Data Set Help
Charles C. Berry
cberry at tajo.ucsd.edu
Mon Aug 25 23:29:34 CEST 2008
On Mon, 25 Aug 2008, Roland Rau wrote:
> Hi,
>
> Jason Thibodeau wrote:
>> I am attempting to perform some simple data manipulation on a large data
>> set. I have a snippet of the whole data set, and my small snippet is 2GB
>> in CSV.
>>
>> Is there a way I can read my csv, select a few columns, and write it to an
>> output file in real time? This is what I do right now to a small test
>> file:
>>
>> data <- read.csv('data.csv', header = FALSE)
>>
>> data_filter <- data[c(1,3,4)]
>>
>> write.table(data_filter, file = "filter_data.csv", sep = ",",
>>             row.names = FALSE, col.names = FALSE)
>
> In this case, I think R is not the best tool for the job. I would rather
> suggest using an implementation of the awk language (e.g. gawk).
> I just tried the following on WinXP (zipped file (87MB zipped, 1.2GB
> unzipped), piped into gawk)
> unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt
Or
unzip -p myzipfile.zip | cut -d, -f1,3,4 > myfiltereddata.txt
But beware that both this and Roland's solution split on every comma, even
one inside a quoted field. They will return
a,c",d
rather than the intended a,d,e for an input line consisting of
a,"b,c",d,e,f
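The quoting caveat above can be checked on a one-line sample. The sketch below first runs cut on that line, then shows one possible workaround: a small awk script that tracks whether it is inside double quotes before it splits on a comma. The function name csv_pick is made up for illustration, and the script assumes that no quoted field contains a line break. It walks each line character by character, so expect it to be much slower than cut on a multi-gigabyte file.

```shell
#!/bin/sh
# Naive split: every comma counts, including the one inside "b,c".
printf '%s\n' 'a,"b,c",d,e,f' | cut -d, -f1,3,4
# prints: a,c",d

# csv_pick (hypothetical name): print fields 1, 3 and 4 of each CSV line,
# ignoring commas that sit inside double quotes.
# Assumption: a quoted field never spans more than one line.
csv_pick() {
  awk '
  {
    split("", f)            # clear fields left over from the previous line
    n = 0; field = ""; inq = 0
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (c == "\"") {      # a quote toggles the in-quotes state; keep it in the output
        inq = !inq
        field = field c
      } else if (c == "," && !inq) {
        f[++n] = field      # unquoted comma ends the current field
        field = ""
      } else {
        field = field c
      }
    }
    f[++n] = field          # last field has no trailing comma
    print f[1] "," f[3] "," f[4]
  }'
}

printf '%s\n' 'a,"b,c",d,e,f' | csv_pick
# prints: a,d,e
```

With the zipped input from Roland's example this could be used as
unzip -p myzipfile.zip | csv_pick > myfiltereddata.txt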
HTH,
Chuck
> and it took about 90 seconds.
>
> Please note that you might need to specify your delimiter (field separator
> (FS) and output field separator (OFS)) =>
> gawk 'BEGIN {FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv
>
> I hope this helps (despite not encouraging the usage of R),
> Roland
>
Charles C. Berry (858) 534-2098
Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901