[R] Reading recurring data in a text file

Rui Barradas
Wed Jul 24 21:33:04 CEST 2019


Hello,

Instead of read.table use

data.table::fread

It is typically an order of magnitude faster, and all you have to do is 
change the function name; the arguments are the same (in this case).
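
As a minimal sketch of that swap, assuming the data.table package is installed: the toy string `txt`, the offset of 4 lines and the 3 rows are made up for illustration and are not taken from the real file. `header = FALSE` is added here only as a safeguard so that the first numeric row is never treated as column names.

```r
# Hypothetical miniature of one block from the model output.
txt <- "      MOISTURE CONTENT
   Z, IN
   m                       X OR R DISTANCE, IN m
                 0.500
      0.075     0.1475
      0.225     0.1475
      0.375     0.1475
blah"

# Base R version, same arguments as in the original script.
mct_base <- read.table(text = txt, skip = 4, nrows = 3,
                       col.names = c("depth", "wc"))
print(mct_base)

# Same call with fread; only run when data.table is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  mct_dt <- data.table::fread(text = txt, skip = 4, nrows = 3,
                              header = FALSE,
                              col.names = c("depth", "wc"))
  print(mct_dt)
}
```

Note that fread returns a data.table rather than a plain data.frame, which matches the stated goal of a list of data.tables.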


Hope this helps,

Rui Barradas

At 20:18 on 24/07/19, Rui Barradas wrote:
> Hello,
> 
> This is far from a complete answer.
> 
> A quick one: no loops.
> 
> mc_list2 <- grep(srchStr1, lines)
> tmp_list2 <- grep(srchStr2, lines)
> 
> identical(mc_list, mc_list2)    # [1] TRUE
> identical(tmp_list, tmp_list2)  # [1] TRUE
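
A self-contained sketch of the vectorised search follows. The vector `lines` is a made-up stand-in for the file contents, and `fixed = TRUE` is an addition that was not in the code above; it seems reasonable here because both search strings are literal text rather than regular expressions.

```r
# Hypothetical stand-in for the result of readLines() on the real file.
lines <- c("header",
           "      MOISTURE CONTENT",
           "      0.075     0.1475",
           "      TEMPERATURE, IN DECREES C",
           "      0.075     1.1475",
           "      MOISTURE CONTENT",
           "      0.075     0.1875")

# grep() is vectorised over `lines` and returns the matching line numbers.
mc_idx  <- grep("MOISTURE CONTENT", lines, fixed = TRUE)
tmp_idx <- grep("TEMPERATURE, IN DECREES C", lines, fixed = TRUE)

print(mc_idx)
print(tmp_idx)
```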
> 
> 
> Another one: don't extend lists or vectors inside loops, reserve memory 
> beforehand.
> 
> wc <- vector("list", length = length(mc_list))
> tmp <- vector("list", length = length(tmp_list))
> 
> 
> are much better than your
> 
> wc <- list()
> tmp <- list()
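
A sketch of how the preallocated list might be filled, assuming the lines are already in memory so that each block is parsed from `lines` directly instead of re-reading the whole text on every iteration. The toy `lines` vector and the block size `n_rows` are hypothetical; the real file has its own block length.

```r
# Hypothetical stand-in for readLines() output: two small blocks.
lines <- c("      MOISTURE CONTENT",
           "   Z, IN",
           "   m                       X OR R DISTANCE, IN m",
           "                 0.500",
           "      0.075     0.1475",
           "      0.225     0.1475",
           "blah",
           "      MOISTURE CONTENT",
           "   Z, IN",
           "   m                       X OR R DISTANCE, IN m",
           "                 0.500",
           "      0.075     0.1875",
           "      0.225     0.1775",
           "blah")

starts <- grep("MOISTURE CONTENT", lines, fixed = TRUE)
n_rows <- 2L  # rows per block in this toy example (assumption)

# Reserve the list once, then fill it by index.
wc <- vector("list", length = length(starts))
for (i in seq_along(starts)) {
  # Data rows begin after the 3 header lines that follow the title line.
  idx <- starts[i] + 3L + seq_len(n_rows)
  wc[[i]] <- read.table(text = lines[idx], col.names = c("depth", "wc"))
}

print(wc[[2]])
```

Passing only the needed lines to `read.table` means each call parses a handful of lines rather than skipping through the entire text.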
> 
> 
> Maybe I will find ways to save time with the really slow instructions.
> 
> Hope this helps,
> 
> Rui Barradas
> 
> 
> At 19:54 on 24/07/19, Morway, Eric via R-help wrote:
>> The small reproducible example below works, but is way too slow on the real
>> problem.  The real problem is attempting to extract ~2920 repeated arrays
>> from a 60 MB file and takes ~80 minutes.  I'm wondering how I might
>> re-engineer the script to avoid opening and closing the file 2920 times, as
>> is the case now.  That is, is there a way to keep the file open and peel
>> out the arrays and stuff them into a list of data.tables, as is done in the
>> small reproducible example below, but in a significantly faster way?
>>
>> wha <- "     INITIAL PRESSURE HEAD
>>       INITIAL TEMPERATURE SET TO 4.000E+00 DEGREES C
>>       VS2DH - MedSand for TL test
>>
>>       TOTAL ELAPSED TIME =  0.000000E+00 sec
>>       TIME STEP         0
>>
>>       MOISTURE CONTENT
>>    Z, IN
>>    m                       X OR R DISTANCE, IN m
>>                  0.500
>>       0.075     0.1475
>>       0.225     0.1475
>>       0.375     0.1475
>>       0.525     0.1475
>>       0.675     0.1475
>> blah
>> blah
>> blah
>>       TEMPERATURE, IN DECREES C
>>    Z, IN
>>    m                       X OR R DISTANCE, IN m
>>                  0.500
>>       0.075     1.1475
>>       0.225     2.1475
>>       0.375     3.1475
>>       0.525     4.1475
>>       0.675     5.1475
>> blah
>> blah
>> blah
>>
>>       TOTAL ELAPSED TIME =  8.6400E+04 sec
>>       TIME STEP         0
>>
>>       MOISTURE CONTENT
>>    Z, IN
>>    m                       X OR R DISTANCE, IN m
>>                  0.500
>>       0.075     0.1875
>>       0.225     0.1775
>>       0.375     0.1575
>>       0.525     0.1675
>>       0.675     0.1475
>> blah
>> blah
>> blah
>>       TEMPERATURE, IN DECREES C
>>    Z, IN
>>    m                       X OR R DISTANCE, IN m
>>                  0.500
>>       0.075     1.1475
>>       0.225     2.1475
>>       0.375     3.1475
>>       0.525     4.1475
>>       0.675     5.1475
>> blah
>> blah
>> blah"
>>
>> example_content <- textConnection(wha)
>>
>> srchStr1 <- '     MOISTURE CONTENT'
>> srchStr2 <- 'TEMPERATURE, IN DECREES C'
>>
>> lines   <- readLines(example_content)
>> mc_list <- NULL
>> for (i in 1:length(lines)){
>>    # Look for start of water content
>>    if(grepl(srchStr1, lines[i])){
>>      mc_list <- c(mc_list, i)
>>    }
>> }
>>
>> tmp_list <- NULL
>> for (i in 1:length(lines)){
>>    # Look for start of temperature data
>>    if(grepl(srchStr2, lines[i])){
>>      tmp_list <- c(tmp_list, i)
>>    }
>> }
>>
>> # Store the water content arrays
>> wc <- list()
>> # Read all the moisture content profiles
>> for(i in 1:length(mc_list)){
>>    lineNum <- mc_list[i] + 3
>>    mct <- read.table(text = wha, skip=lineNum, nrows=5,
>>                      col.names=c('depth','wc'))
>>    wc[[i]] <- mct
>> }
>>
>> # Store the water temperature arrays
>> tmp <- list()
>> # Read all the temperature profiles
>> for(i in 1:length(tmp_list)){
>>    lineNum <- tmp_list[i] + 3
>>    tmpt <- read.table(text = wha, skip=lineNum, nrows=5,
>>                      col.names=c('depth','tmp'))
>>    tmp[[i]] <- tmpt
>> }
>>
>> # quick inspection
>> length(wc)
>> wc[[1]]
>> # Looks like what I'm after, but too slow in real world problem
>>
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
> 


