[R] How to read in this data format?
Bart Joosen
bartjoosen at hotmail.com
Mon Mar 5 11:31:08 CET 2007
Hi,
Although the solution worked, I've run into trouble with some data files.
These data files are very large (600-700 MB), so my computer starts swapping.
If I use the code quoted below, I get:
Error in .Call("R_lazyLoadDBfetch", key, file, compressed, hook, PACKAGE =
"base") :
recursive default argument reference
This happens after about 15 minutes of loading the data with the
Lines. <- readLines("myfile.dat") command.
When I looked in the help for readLines, I saw that there is an argument n
to set a maximum number of lines, but is there a way to set a starting row
number? If I can split my data files into 4-8 smaller datasets, that's OK
for me. But I couldn't figure it out.
Thanks
Bart
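
One possible way to get the effect of a "starting row" is to open the file as a connection and call readLines repeatedly: each call on an open connection continues where the previous one stopped. The sketch below is only an illustration under assumptions; the helper name process_in_chunks, the chunk size, and the temporary demo file are hypothetical and not part of the original thread.

```r
# Hypothetical helper (not from the thread): read a file in pieces of
# `chunk_size` lines and apply FUN to each piece, so the whole file
# never has to be held in memory at once.
process_in_chunks <- function(path, chunk_size, FUN) {
  con <- file(path, open = "r")   # an open connection keeps its position
  on.exit(close(con))
  results <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)  # continues after the last read
    if (length(chunk) == 0) break            # end of file reached
    results[[length(results) + 1]] <- FUN(chunk)
  }
  results
}

# Small self-contained demonstration on a temporary 10-line file
tmp <- tempfile()
writeLines(as.character(1:10), tmp)
chunk_lengths <- process_in_chunks(tmp, 4, length)
unlink(tmp)
```

For the real data, FUN could be the grep/gsub filtering step from the quoted code, so that only the wanted lines of each chunk are kept. Note that a chunk boundary may fall in the middle of a scan, so the filtered lines would need to be concatenated before grouping by "Time".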
>From: "Gabor Grothendieck" <ggrothendieck at gmail.com>
>To: "Bart Joosen" <Bartjoosen at hotmail.com>
>CC: r-help at stat.math.ethz.ch
>Subject: Re: [R] How to read in this data format?
>Date: Thu, 1 Mar 2007 16:46:21 -0500
>
>On 3/1/07, Bart Joosen <Bartjoosen at hotmail.com> wrote:
>>Dear All,
>>
>>Thanks for the replies. Jim Holtman has given a solution which fits my
>>needs, and Gabor Grothendieck's does the same thing, but it looks like
>>his code will allow faster processing (I should check this out tomorrow
>>on a big data file).
>>
>>@gabor: I don't understand the use of the grep command:
>> grep("^[1-9][0-9. ]*$|Time", Lines., value = TRUE)
>>What is this expression ("^[1-9][0-9. ]*$|Time") actually doing?
>>I looked in the help page, but couldn't find a suitable answer.
>
>I briefly discussed it in the first paragraph of my response. It
>matches and returns only those lines that start (^ matches start of line)
>with a nonzero digit, i.e. [1-9], and contain only digits, dots and spaces,
>i.e. [0-9. ]*, to end of line, i.e. $ matches end of line, or (| means
>or) contain the word Time.
>If you don't have lines like ... (which you did in your example) then
>the regexp
>could be simplified to "^[0-9. ]+$|Time". You may need to match tabs too
>if your input contains those.
>
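
To make the difference between the two patterns concrete, here is a small sketch. The sample vector x is made up for illustration (the "0.5 12" entry in particular does not come from the original data); the two patterns are the ones discussed above.

```r
# Made-up sample lines, loosely modelled on the file in the thread
x <- c("Scan 1", "Retention Time 0.017", "", "399.8112 184", "....", "0.5 12")

# Original pattern: a line must start with a digit 1-9 and then contain
# only digits, dots and spaces, or contain the word Time
strict <- grep("^[1-9][0-9. ]*$|Time", x, value = TRUE)
# keeps "Retention Time 0.017" and "399.8112 184"

# Simplified pattern: also accepts lines starting with 0, a dot or a space
loose <- grep("^[0-9. ]+$|Time", x, value = TRUE)
# additionally keeps "...." and "0.5 12"
```

The simplified pattern lets the "...." filler line through, which is why it is only suitable when the input has no such lines.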
>>
>>
>>Thanks to All
>>
>>
>>Bart
>>
>>----- Original Message -----
>>From: "Gabor Grothendieck" <ggrothendieck at gmail.com>
>>To: "Bart Joosen" <bartjoosen at hotmail.com>
>>Cc: <r-help at stat.math.ethz.ch>
>>Sent: Thursday, March 01, 2007 6:35 PM
>>Subject: Re: [R] How to read in this data format?
>>
>>
>> > Read in the data using readLines, extract out
>> > all desired lines (namely those containing only
>> > numbers, dots and spaces or those with the
>> > word Time) and remove Retention from all
>> > lines so that all remaining lines have two
>> > fields. Now that we have desired lines
>> > and all lines have two fields read them in
>> > using read.table.
>> >
>> > Finally, split them into groups and restructure
>> > them using "by" and in the last line we
>> > convert the "by" output to a data frame.
>> >
>> > At the end we display an alternate function f
>> > for use with by should we wish to generate long
>> > rather than wide output (using the terminology
>> > of the reshape command).
>> >
>> >
>> > Lines <- "$$ Experiment Number:
>> > $$ Associated Data:
>> >
>> > FUNCTION 1
>> >
>> > Scan 1
>> > Retention Time 0.017
>> >
>> > 399.8112 184
>> > 399.8742 0
>> > 399.9372 152
>> > ....
>> >
>> > Scan 2
>> > Retention Time 0.021
>> >
>> > 399.8112 181
>> > 399.8742 1
>> > 399.9372 153
>> > "
>> >
>> > # replace next line with: Lines. <- readLines("myfile.dat")
>> > Lines. <- readLines(textConnection(Lines))
>> > # keep only the numeric data lines and the "Retention Time" lines
>> > Lines. <- grep("^[1-9][0-9. ]*$|Time", Lines., value = TRUE)
>> > # drop "Retention" so that every remaining line has exactly two fields
>> > Lines. <- gsub("Retention", "", Lines.)
>> >
>> > DF <- read.table(textConnection(Lines.), as.is = TRUE)
>> > closeAllConnections()
>> >
>> > # each "Time" line starts a new group; f turns one group into one wide row
>> > f <- function(x) c(id = x[1,2], structure(x[-1,2], .Names = x[-1,1]))
>> > out.by <- by(DF, cumsum(DF[,1] == "Time"), f)
>> > as.data.frame(do.call("rbind", out.by))
>> >
>> >
>> > We could alternately consider producing long
>> > format by replacing the function f with:
>> >
>> > f <- function(x) data.frame(x[-1,], id = x[1,2])
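
As a rough illustration of the long-format alternative, the sketch below applies that function to a few already filtered lines. The input vector, the name f_long, and the use of read.table(text = ...) instead of a textConnection are assumptions made for the example, not part of the original reply.

```r
# Already filtered lines: a "Time" line followed by its data lines
# (values taken from the sample file, shortened to two rows per scan)
Lines. <- c("Time 0.017", "399.8112 184", "399.8742 0",
            "Time 0.021", "399.8112 181", "399.8742 1")

DF <- read.table(text = Lines., as.is = TRUE)

# Long format: one row per (time, mass, intensity) combination
f_long <- function(x) data.frame(x[-1, ], id = x[1, 2])
out.by <- by(DF, cumsum(DF[, 1] == "Time"), f_long)
long <- do.call("rbind", out.by)
```

Here long has one row per data line, with the columns V1 and V2 from read.table plus an id column holding the retention time of the scan each row belongs to. V1 stays character because the column also contained the word Time, so as.numeric(long$V1) may be wanted afterwards.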
>> >
>> >
>> > On 3/1/07, Bart Joosen <bartjoosen at hotmail.com> wrote:
>> >> Hi,
>> >>
>> >> I received an ASCII file containing the following information:
>> >>
>> >> $$ Experiment Number:
>> >> $$ Associated Data:
>> >>
>> >> FUNCTION 1
>> >>
>> >> Scan 1
>> >> Retention Time 0.017
>> >>
>> >> 399.8112 184
>> >> 399.8742 0
>> >> 399.9372 152
>> >> ....
>> >>
>> >> Scan 2
>> >> Retention Time 0.021
>> >>
>> >> 399.8112 181
>> >> 399.8742 1
>> >> 399.9372 153
>> >> .....
>> >>
>> >>
>> >> I would like to import this data in R into a dataframe, where there is a
>> >> column time, the first numbers as column names, and the second numbers as
>> >> data in the dataframe:
>> >>
>> >> Time 399.8112 399.8742 399.9372
>> >> 0.017 184 0 152
>> >> 0.021 181 1 153
>> >>
>> >> I did take a look at read.table, read.delim, scan, ... but I have no
>> >> idea how to solve this problem.
>> >>
>> >> Anyone?
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Bart
>> >>
>> >>
>> >
>>
>>