[R] reading data from web data sources
Tim Coote
tim+r-project.org at coote.org
Sat Feb 27 21:28:36 CET 2010
Thanks, Gabor. My takeaway from this and Phil's post is that I'm
going to have to write some code to do the parsing, rather than
use a standard function. I'm afraid that neither approach works yet.
Gabor's has an off-by-one error (days start on the 2nd, not the
1st), and the years get messed up around the 29th day. I think that
the na.omit(DF) line is throwing out the baby with the bathwater. It's
interesting that this approach is based on read.table; I'd assumed
that I'd need read.ftable, whose documentation I couldn't
understand. What is it that's removing the -999 and -888
values in this code? They seem to be gone, but I cannot see why.
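
One likely explanation for the vanishing -999 and -888 values can be checked on made-up lines rather than the real file. The minus sign is not in the bracket set "[^ 0-9.]", so grepl flags every line that contains a negative value, and the whole row is dropped before read.table ever sees it. The sample lines below are invented for illustration only.

```r
# Made-up lines in the shape of the file: a header, a year line, day lines.
# The readings are invented, not taken from the real data.
sample.lines <- c(
  "Daily soil temperatures",   # header text: contains letters
  "1910",                      # year line
  "  1   4.1   3.9",           # ordinary day line
  " 29   4.0 -999",            # line holding a -999 value
  " 30   -888   5.2"           # line holding a -888 value
)

# TRUE where a line holds any character other than space, digit or dot.
# "-" is not in the bracket set, so lines with negative values match too.
dropped <- grepl("[^ 0-9.]", sample.lines)
kept <- sample.lines[!dropped]
print(kept)
```

On this reading, the special values are not filtered out individually: every day line that contains one is removed in full.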
Phil's code reads in the data, but interleaves rows that contain just a
year, with all other values NA.
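
A hand-written parser along these lines might look like the sketch below. It is only a sketch under stated assumptions: parse_soil() is a hypothetical helper, not part of any package; it assumes a day line has exactly 13 whitespace-separated fields (day number plus 12 monthly readings) and a year line has exactly one; and it treats both -999 and -888 as "no reading" without deciding which is which. Days that do not exist in a month are dropped by checking the calendar date rather than by trusting a special value.

```r
# Hypothetical helper, not from any package.
# Assumes: year line = 1 field, day line = 13 fields (day + 12 months).
parse_soil <- function(lines, sentinels = c(-999, -888)) {
  # Keep lines made only of digits, dots, minus signs and whitespace;
  # this drops the comment header and blank lines but keeps negatives.
  numeric.line <- grepl("^[-0-9.[:space:]]+$", lines) & grepl("[0-9]", lines)
  fields <- strsplit(trimws(lines[numeric.line]), "[[:space:]]+")
  n <- lengths(fields)

  # Carry each year forward over the day lines that follow it.
  is.year <- n == 1
  years <- as.integer(vapply(fields[is.year], function(f) f[1], character(1)))
  year <- c(NA, years)[cumsum(is.year) + 1]

  # One matrix row per day line: column 1 is the day, columns 2..13 the months.
  is.day <- n == 13
  m <- do.call(rbind, lapply(fields[is.day], as.numeric))

  # Reshape to one row per (year, month, day) reading.
  long <- data.frame(
    year  = rep(year[is.day], times = 12),
    month = rep(1:12, each = nrow(m)),
    day   = rep(as.integer(m[, 1]), times = 12),
    value = as.vector(m[, -1])
  )

  # Impossible dates such as 30 February parse to NA and are dropped.
  long$date <- as.Date(sprintf("%d-%02d-%02d", long$year, long$month, long$day),
                       format = "%Y-%m-%d")
  long <- long[!is.na(long$date), ]

  # Remaining special values become NA instead of deleting the row.
  long$value[long$value %in% sentinels] <- NA
  long[order(long$date), c("date", "value")]
}

# Usage on made-up lines (invented readings, not the real file).
demo <- c(
  "Made-up header text",
  "",
  "1910",
  paste(" 1", paste(rep("4.5", 12), collapse = " ")),
  paste("30", "3.0 -999 -888", paste(rep("5.0", 9), collapse = " "))
)
res <- parse_soil(demo)
print(head(res))
```

For the real data, the same function could be given the output of readLines() on each file and the results combined with rbind. The other file names are not listed here, so that step is left out.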
Tim
On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
> Mark Leeds pointed out to me that the code wrapped around in the post
> so it may not be obvious that the regular expression in the grep is
> (i.e. it contains a space):
> "[^ 0-9.]"
>
>
> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>> Try this. First we read the raw lines into R using grep to remove any
>> lines containing a character that is not a number or space. Then we
>> look for the year lines and repeat them down V1 using cumsum. Finally
>> we omit the year lines.
>>
>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>> raw.lines <- readLines(myURL)
>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]), fill = TRUE)
>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>> DF <- na.omit(DF)
>> head(DF)
>>
>>
>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> wrote:
>>> Hullo
>>> I'm trying to read some time series data of meteorological records that
>>> are available on the web (eg
>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat). I'd
>>> like to be able to read in the digital data directly into R. However, I
>>> cannot work out the right function and set of parameters to use. It could
>>> be that the only practical route is to write a parser, possibly in some
>>> other language, reformat the files and then read these into R. As far as I
>>> can tell, the informal grammar of the file is:
>>>
>>> <comments terminated by a blank line>
>>> [<year number on a line on its own>
>>> <daily readings lines> ]+
>>>
>>> and the <daily readings> are of the form:
>>> <whitespace> <day number> [<whitespace> <reading on day of month>]12
>>>
>>> Readings for days in months where a day does not exist have special values.
>>> Missing values have a different special value.
>>>
>>> And then I've got the problem of iterating over all relevant files to get a
>>> whole timeseries.
>>>
>>> Is there a way to read in this type of file into R? I've read all of the
>>> examples that I can find, but cannot work out how to do it. I don't think
>>> that read.table can handle the separate sections of data representing each
>>> year. read.ftable maybe can be coerced to parse the data, but I cannot see
>>> how after reading the documentation and experimenting with the parameters.
>>>
>>> I'm using R 2.10.1 on osx 10.5.8 and 2.10.0 on Fedora 10.
>>>
>>> Any help/suggestions would be greatly appreciated. I can see that this type
>>> of issue is likely to grow in importance, and I'd also like to give the data
>>> owners suggestions on how to reformat their data so that it is easier to
>>> consume by machines, while being easy to read for humans.
>>>
>>> The early records are a serious machine parsing challenge as they are tiff
>>> images of old notebooks ;-)
>>>
>>> tia
>>>
>>> Tim
>>> Tim Coote
>>> tim at coote.org
>>> vincit veritas
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
Tim Coote
tim at coote.org
vincit veritas