[R] reading data from web data sources

Phil Spector spector at stat.berkeley.edu
Sun Feb 28 00:17:55 CET 2010


Tim -
    I don't understand what you mean about interleaving rows.  I'm guessing
that you want a single large data frame with all the data, and not a 
list with each year separately.  If that's the case:

x = read.table('http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat',
               header=FALSE,fill=TRUE,skip=13)
# year-header lines hold only the year, so fill=TRUE pads them with 12 NAs;
# rows with exactly 12 NAs therefore mark the start of each year's block
cts = apply(x,1,function(row)sum(is.na(row)))
wh = which(cts == 12)
# each year's data runs from the line after its header to the line before
# the next header (or to the end of the file)
start = wh + 1
end = c(wh[-1] - 1,nrow(x))
ans = mapply(function(i,j)x[i:j,],start,end,SIMPLIFY=FALSE)
names(ans) = x[wh,1]
# stack the per-year blocks and record which year each row came from
alldat = do.call(rbind,ans)
alldat$year = rep(names(ans),sapply(ans,nrow))
names(alldat) = c('day',month.name,'year')
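
By the way, the files apparently use sentinel codes such as the -999 and
-888 values mentioned below for nonexistent days and missing readings.
Recoding those to NA at this point lets the na.omit() step further down
drop them as well; a minimal sketch, assuming -999 and -888 are the only
such codes in these files:

# turn the sentinel codes into real NAs so na.omit() can drop them later
# (assumes -999 and -888 are the only such codes)
alldat[alldat == -999 | alldat == -888] <- NA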

On the other hand, if you want a long data frame with month, day, year 
and value:

# pivot the twelve month columns into a single value column per day
longdat = reshape(alldat,idvar=c('day','year'),
                   varying=list(month.name),direction='long',times=month.name)
names(longdat)[c(3,4)] = c('Month','value')

Next, if you want to create a Date variable:

# build a real Date from the month name, day and year, drop rows with
# missing values, and sort chronologically
longdat = transform(longdat,date=as.Date(paste(Month,day,year),'%B %d %Y'))
longdat = na.omit(longdat)
longdat = longdat[order(longdat$date),]
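
Note that %B matches month names in the current locale, while month.name is
always English, so the as.Date() call can return NA in a non-English locale;
forcing a C time locale before the transform() should avoid that:

# %B is locale-dependent; month.name is English, so switch to a C time
# locale if the dates above come back as NA
Sys.setlocale("LC_TIME", "C")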

and finally:

library(zoo)   # zoo() comes from the zoo package
zoodat = zoo(longdat$value,longdat$date)

which should be suitable for time series analysis.
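
For example, to look at the series or collapse it to monthly means (just a
sketch of the sort of thing zoo makes easy; as.yearmon() also comes from the
zoo package):

plot(zoodat)                                   # daily series
monthly = aggregate(zoodat, as.yearmon, mean)  # monthly means
head(monthly)

The same steps can be wrapped in a function and applied to each file's URL
if you want to build the whole series across decades.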

Hope this helps.
                                                     - Phil

On Sat, 27 Feb 2010, Tim Coote wrote:

> Thanks, Gabor. My takeaway from this and Phil's post is that I'm going to
> have to construct some code to do the parsing, rather than use a standard
> function. I'm afraid that neither approach works yet:
>
> Gabor's has an off-by-one error (days start on the 2nd, not the 1st), and
> the years get messed up around the 29th day.  I think that the na.omit(DF)
> line is throwing out the baby with the bathwater.  It's interesting that
> this approach is based on read.table; I'd assumed that I'd need
> read.ftable, whose documentation I couldn't understand.  What is it that's
> removing the -999 and -888 values in this code? They seem to be gone, but
> I cannot see why.
>
> Phil's reads in the data, but interleaves rows that contain just a year,
> with all other values in those rows set to NA.
>
> Tim
> On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
>
>> Mark Leeds pointed out to me that the code wrapped around in the post,
>> so it may not be obvious that the regular expression in the grep
>> contains a space, i.e. it is:
>> "[^ 0-9.]"
>> 
>> 
>> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>> Try this.  First we read the raw lines into R, using grep to remove any
>>> lines containing a character that is not a digit, space or decimal
>>> point.  Then we look for the year lines and repeat them down V1 using
>>> cumsum.  Finally we omit the year lines.
>>> 
>>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>>> raw.lines <- readLines(myURL)
>>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]), fill = TRUE)
>>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>>> DF <- na.omit(DF)
>>> head(DF)
>>> 
>>> 
>>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org> 
>>> wrote:
>>>> Hullo
>>>> I'm trying to read some time series data of meteorological records
>>>> that are available on the web (e.g.
>>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat).
>>>> I'd like to be able to read the digital data directly into R. However,
>>>> I cannot work out the right function and set of parameters to use.  It
>>>> could be that the only practical route is to write a parser, possibly
>>>> in some other language, reformat the files and then read these into R.
>>>> As far as I can tell, the informal grammar of the file is:
>>>> 
>>>> <comments terminated by a blank line>
>>>> [<year number on a line on its own>
>>>> <daily readings lines> ]+
>>>> 
>>>> and the <daily readings> are of the form:
>>>> <whitespace> <day number> [<whitespace> <reading on day of month>]{12}
>>>> 
>>>> Readings for days that do not exist in the month have special values.
>>>> Missing values have a different special value.
>>>> 
>>>> And then I've got the problem of iterating over all relevant files to
>>>> get a whole time series.
>>>> 
>>>> Is there a way to read this type of file into R? I've read all of the
>>>> examples that I can find, but cannot work out how to do it. I don't
>>>> think that read.table can handle the separate sections of data
>>>> representing each year. read.ftable can maybe be coerced to parse the
>>>> data, but I cannot see how after reading the documentation and
>>>> experimenting with the parameters.
>>>> 
>>>> I'm using R 2.10.1 on osx 10.5.8 and 2.10.0 on Fedora 10.
>>>> 
>>>> Any help/suggestions would be greatly appreciated. I can see that this
>>>> type of issue is likely to grow in importance, and I'd also like to
>>>> give the data owners suggestions on how to reformat their data so that
>>>> it is easier to consume by machines, while being easy to read for
>>>> humans.
>>>> 
>>>> The early records are a serious machine-parsing challenge, as they are
>>>> TIFF images of old notebooks ;-)
>>>> 
>>>> tia
>>>> 
>>>> Tim
>>>> Tim Coote
>>>> tim at coote.org
>>>> vincit veritas
>>>> 
>>> 
>
> Tim Coote
> tim at coote.org
> vincit veritas
>


