[R] reading VERY large binary files
Duncan Murdoch
murdoch at stats.uwo.ca
Wed Nov 8 01:45:42 CET 2006
On 11/7/2006 4:54 PM, Matt Anthony wrote:
> Hello,
>
> I am trying to read in elements out of a very large binary file ... the
> total file is 4 gigs. I want to select rows out of the file, and the
> current procedure I run works but is prohibitively slow (takes more than
> a day to run and still won't complete). Is there any faster way to
> accomplish this?
You are doing several things that are likely to be slow.
>
> My current procedure looks like this:
>
> readHH <- function(file_name, hhid_list) {
>     incon = file(file_name, open="rb")
>     result = data.frame()
>     tran = list()
>     byte_mark = 0
>     last_1M_mod = 0
>     file_size = file.info(file_name)$size
>     write.table(paste("Data pulled from", file_name, sep=" "),
>                 file="readHH_output.txt", sep=",", row.names=FALSE,
>                 col.names=FALSE, append=TRUE)
>     while (TRUE) {
>         tran$hh_id <- readBin(incon, integer(), 1, size=4)
Why use a function call integer() here, rather than just the character
string "integer"?
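For example, this is equivalent:

    tran$hh_id <- readBin(incon, "integer", 1, size=4)

It won't make a measurable difference on its own (it just avoids
constructing an empty integer vector on every call), but it reads more
clearly.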
>
>         if (is.element(tran$hh_id, hhid_list)) {
is.element() is just base R's %in%, so it compares against the whole of
hhid_list on every call; since it runs once per record, it's a natural
place for an optimization.
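For example, one possibility (an untested sketch, assuming hhid_list is a
vector of integer ids): build a hashed environment once before the loop,
so each test is a single hash lookup instead of a scan of the vector:

    hh_hash <- new.env(hash=TRUE, size=length(hhid_list))
    for (id in hhid_list) assign(as.character(id), TRUE, envir=hh_hash)

    # then, per record:
    if (exists(as.character(tran$hh_id), envir=hh_hash, inherits=FALSE)) {
        # ... read and keep the record as before
    }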
>
>             tran$prov_id <- readBin(incon, integer(), 1, size=2)
>             tran$txn_dn <- readBin(incon, integer(), 1, size=2)
>             tran$total_dollars <- readBin(incon, integer(), 1, size=4)
>             tran$total_items <- readBin(incon, integer(), 1, size=4)
>             tran$order_id <- readBin(incon, integer(), 1, size=4)
>             tran$txn_type <- readChar(incon, 1)
>             tran$gender <- readChar(incon, 1)
>             tran$zip_code <- readChar(incon, 5)
>             tran$region_code <- readChar(incon, 1)
>             tran$county_code <- readChar(incon, 1)
>             tran$state_abbrev <- readChar(incon, 2)
>             tran$channel_code <- readChar(incon, 1)
>             tran$source_code <- readChar(incon, 20)
>             tran$payment_type <- readChar(incon, 1)
>             tran$credit_card <- readChar(incon, 1)
>             tran$promo_type <- readChar(incon, 1)
>             tran$flags <- readChar(incon, 1)
You could probably make all of this a lot faster by combining it into
three calls (hh_id has already been read at this point, which leaves two
2-byte integers, three 4-byte integers, and 36 bytes of character data):

    ints2 <- readBin(incon, "integer", 2, size=2)
    ints4 <- readBin(incon, "integer", 3, size=4)
    chars <- readChar(incon, 36)

and then extracting the elements after reading. The extraction will
probably be pretty fast, especially if you put the results into matrices
rather than data frames; data frames are hugely slower than matrices.
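For instance, along those lines (an untested sketch; the offsets just
follow the field widths in your code above):

    prov_id       <- ints2[1]
    txn_dn        <- ints2[2]
    total_dollars <- ints4[1]
    total_items   <- ints4[2]
    order_id      <- ints4[3]
    txn_type      <- substr(chars, 1, 1)
    gender        <- substr(chars, 2, 2)
    zip_code      <- substr(chars, 3, 7)
    region_code   <- substr(chars, 8, 8)
    # ... and so on through the remaining widths (1, 2, 1, 20, 1, 1, 1, 1)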
>
>             write.table(data.frame(tran), file="readHH_output", sep=",",
>                         row.names=FALSE, col.names=FALSE, append=TRUE)
This is going to reopen, seek, and close the file each time. Do you
really need to do that? Can't you open the output file once, then just
write the data to it?
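For example (untested; write.table accepts a connection that is already
open for writing):

    outcon <- file("readHH_output", open="at")  # open once, before the loop

    # ... inside the loop, for each matching record:
    write.table(data.frame(tran), file=outcon, sep=",",
                row.names=FALSE, col.names=FALSE)

    # ... after the loop:
    close(outcon)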
>
>             result <- rbind(result, data.frame(tran))
This is also very slow. rbind() has to grow the big list of vectors that
result is stored as every time you keep a record. It would be faster to
pre-allocate the result and just assign values into it, especially if you
assign into a matrix rather than a data frame.
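For example, a rough, untested sketch for the numeric part (42 is the
record size from your code; the character fields could go into a second,
character matrix the same way):

    n_max <- file_size %/% 42                # upper bound on match count
    nums <- matrix(NA_integer_, nrow=n_max, ncol=6)
    n <- 0

    # ... inside the loop, for each matching record:
    n <- n + 1
    nums[n, ] <- c(tran$hh_id, tran$prov_id, tran$txn_dn,
                   tran$total_dollars, tran$total_items, tran$order_id)

    # ... after the loop, drop the unused rows:
    nums <- nums[seq_len(n), , drop=FALSE]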
I don't know which of these suggestions will have the biggest effect.
I'd suggest trying them one by one until things are fast enough, and
then going on to something else.
Duncan Murdoch
>
>         }
>         else {
>             byte_mark = byte_mark + 42
>             if (byte_mark >= file_size) { break }
>             else { seek(incon, where=byte_mark) }
>         }
>     }
>
>     return(result)
> }
>
> Thanks
>
> Matt
>
> Matt Anthony | Senior Statistician | 303.327.1761 |
> matt.anthony at NextAction.Net
> 10155 Westmoor Drive | Westminster, CO 80021 | FAX 303.327.1650