[R] How to read in this data format?

Mon Mar 5 11:31:08 CET 2007

Hi,

Although the solution worked, I've run into trouble with some data files.
These data files are very large (600-700 MB), so my computer starts swapping.

If I use the code below, I get the following error after about 15 minutes of
loading the data with the Lines. <- readLines("myfile.dat") command:

Error in .Call("R_lazyLoadDBfetch", key, file, compressed, hook, PACKAGE = "base") :
        recursive default argument reference

In the help for readLines I saw that the argument n sets a maximum number of
lines to read, but is there a way to set a starting row number? Splitting each
data file into 4-8 smaller datasets would be fine for me, but I couldn't
figure out how to do it.

Thanks

Bart
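
One possible approach to the question above, offered as a sketch under
assumptions rather than a solution verified on the 600-700 MB files: readLines
has no starting-row argument, but a connection opened with file(path, "r")
keeps its position between calls, so repeated readLines(con, n = ...) calls
walk through the file chunk by chunk. Filtering each chunk with the grep
pattern from the quoted reply means only the wanted lines stay in memory. The
helper name read_filtered and the default chunk_size are hypothetical.

```r
# Hypothetical helper (not from the thread): read a large text file in chunks
# and keep only the lines matching `pattern`.
read_filtered <- function(path,
                          pattern = "^[1-9][0-9. ]*$|Time",
                          chunk_size = 100000L) {
  con <- file(path, open = "r")   # an open connection remembers its position
  on.exit(close(con))             # close the connection even if an error occurs
  kept <- list()
  repeat {
    # Each call continues where the previous one stopped.
    chunk <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break            # end of file reached
    kept[[length(kept) + 1L]] <- grep(pattern, chunk, value = TRUE)
  }
  # Return a character vector (empty if nothing matched).
  as.character(unlist(kept, use.names = FALSE))
}
```

With this, Lines. <- read_filtered("myfile.dat") would stand in for the
readLines and grep lines of the quoted code. The gsub and read.table steps
would follow unchanged.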

>From: "Gabor Grothendieck" <ggrothendieck at gmail.com>
>To: "Bart Joosen" <Bartjoosen at hotmail.com>
>CC: r-help at stat.math.ethz.ch
>Subject: Re: [R] How to read in this data format?
>Date: Thu, 1 Mar 2007 16:46:21 -0500
>
>On 3/1/07, Bart Joosen <Bartjoosen at hotmail.com> wrote:
>>Dear All,
>>
>>Thanks for the replies. Jim Holtman has given a solution which fits my
>>needs, and Gabor Grothendieck's does the same thing, but it looks like his
>>code will allow faster processing (I will check this tomorrow on a big
>>data file).
>>
>>@gabor: I don't understand the use of the grep command:
>>        grep("^[1-9][0-9. ]*$|Time", Lines., value = TRUE)
>>What is this expression  ("^[1-9][0-9. ]*$|Time") actually doing?
>>I looked in the help page, but couldn't find a suitable answer.
>
>I briefly discussed it in the first paragraph of my response.  It
>matches and returns only those lines that either start (^ matches start
>of line) with a nonzero digit, i.e. [1-9], and contain only digits, dots
>and spaces, i.e. [0-9. ]*, up to the end of the line ($ matches end of
>line), or (| means or) contain the word Time.
>If you don't have lines like ... (which you did in your example) then
>the regexp could be simplified to "^[0-9. ]+$|Time".  You may need to
>match tabs too if your input contains those.
>
>>
>>
>>Thanks to All
>>
>>
>>Bart
>>
>>----- Original Message -----
>>From: "Gabor Grothendieck" <ggrothendieck at gmail.com>
>>To: "Bart Joosen" <bartjoosen at hotmail.com>
>>Cc: <r-help at stat.math.ethz.ch>
>>Sent: Thursday, March 01, 2007 6:35 PM
>>Subject: Re: [R] How to read in this data format?
>>
>>
>> > Read in the data using readLines, extract out
>> > all desired lines (namely those containing only
>> > numbers, dots and spaces or those with the
>> > word Time) and remove Retention from all
>> > lines so that all remaining lines have two
>> > fields.  Now that we have desired lines
>> > and all lines have two fields read them in
>> > using read.table.
>> >
>> > Finally, split them into groups and restructure
>> > them using "by" and in the last line we
>> > convert the "by" output to a data frame.
>> >
>> > At the end we display an alternate function f
>> > for use with by should we wish to generate long
>> > rather than wide output (using the terminology
>> > of the reshape command).
>> >
>> >
>> > Lines <- "$$ Experiment Number:
>> > $$ Associated Data:
>> >
>> > FUNCTION 1
>> >
>> > Scan            1
>> > Retention Time  0.017
>> >
>> > 399.8112        184
>> > 399.8742        0
>> > 399.9372        152
>> > ....
>> >
>> > Scan            2
>> > Retention Time  0.021
>> >
>> > 399.8112        181
>> > 399.8742        1
>> > 399.9372        153
>> > "
>> >
>> > # replace next line with: Lines. <- readLines("myfile.dat")
>> > Lines. <- readLines(textConnection(Lines))
>> > Lines. <- grep("^[1-9][0-9. ]*$|Time", Lines., value = TRUE)
>> > Lines. <- gsub("Retention", "", Lines.)
>> >
>> > DF <- read.table(textConnection(Lines.), as.is = TRUE)
>> > closeAllConnections()
>> >
>> > f <- function(x) c(id = x[1,2], structure(x[-1,2], .Names = x[-1,1]))
>> > out.by <- by(DF, cumsum(DF[,1] == "Time"), f)
>> > as.data.frame(do.call("rbind", out.by))
>> >
>> >
>> > We could alternately consider producing long
>> > format by replacing the function f with:
>> >
>> > f <- function(x) data.frame(x[-1,], id = x[1,2])
>> >
>> >
>> > On 3/1/07, Bart Joosen <bartjoosen at hotmail.com> wrote:
>> >> Hi,
>> >>
>> >> I received an ASCII file containing the following information:
>> >>
>> >> $$ Experiment Number:
>> >> $$ Associated Data:
>> >>
>> >> FUNCTION 1
>> >>
>> >> Scan            1
>> >> Retention Time  0.017
>> >>
>> >> 399.8112        184
>> >> 399.8742        0
>> >> 399.9372        152
>> >> ....
>> >>
>> >> Scan            2
>> >> Retention Time  0.021
>> >>
>> >> 399.8112        181
>> >> 399.8742        1
>> >> 399.9372        153
>> >> .....
>> >>
>> >>
>> >> I would like to import this data into R as a data frame with a
>> >> column Time, the first numbers as column names, and the second numbers
>> >> as the data in the data frame:
>> >>
>> >> Time    399.8112        399.8742        399.9372
>> >> 0.017   184     0       152
>> >> 0.021   181     1       153
>> >>
>> >> I took a look at read.table, read.delim, scan, ... but I have no
>> >> idea how to solve this problem.
>> >>
>> >> Anyone?
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Bart
>> >>
>> >> ______________________________________________
>> >> R-help at stat.math.ethz.ch mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >
>>
>>
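
The two regular expressions from the quoted explanation can be compared on a
few lines taken from the sample data. This is only an illustrative sketch; the
variable names are made up for the example. It shows why the leading [1-9] is
needed when the file contains filler lines made of dots: the simplified
pattern also accepts a line such as "....".

```r
# Patterns as given in the quoted reply.
pattern    <- "^[1-9][0-9. ]*$|Time"
simplified <- "^[0-9. ]+$|Time"

# A few lines copied from the sample data in the thread.
sample_lines <- c("$$ Experiment Number:",
                  "FUNCTION 1",
                  "Scan            1",
                  "Retention Time  0.017",
                  "",
                  "399.8112        184",
                  "....")

# Keeps the Retention Time line and the numeric data line only.
kept_original <- grep(pattern, sample_lines, value = TRUE)

# Also keeps "....", because that line consists solely of dots.
kept_simplified <- grep(simplified, sample_lines, value = TRUE)

print(kept_original)
print(kept_simplified)
```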