[R] reading data from web data sources
Phil Spector
spector at stat.berkeley.edu
Sun Feb 28 00:17:55 CET 2010
Tim -
I don't understand what you mean about interleaving rows. I'm guessing
that you want a single large data frame with all the data, and not a
list with each year separately. If that's the case:
x = read.table('http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat',
               header=FALSE, fill=TRUE, skip=13)
# A year header line holds a single value, so after fill=TRUE pads every
# row to 13 columns, year rows contain exactly 12 NAs
cts = apply(x, 1, function(row) sum(is.na(row)))
wh = which(cts == 12)            # row indices of the year lines
start = wh + 1                   # first data row of each year's block
end = c(wh[-1] - 1, nrow(x))     # last data row of each year's block
ans = mapply(function(i, j) x[i:j, ], start, end, SIMPLIFY=FALSE)
names(ans) = x[wh, 1]            # label each block with its year
alldat = do.call(rbind, ans)     # stack the blocks into one data frame
alldat$year = rep(names(ans), sapply(ans, nrow))
names(alldat) = c('day', month.name, 'year')
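As a sanity check, the same block-splitting idea can be exercised on a tiny made-up table (3 columns instead of 13, so year rows carry 2 NAs rather than 12; all values here are invented purely for illustration):

```r
# Toy table: two year rows (1910, 1911), each followed by two data rows
x <- data.frame(V1 = c(1910, 1, 2, 1911, 1, 2),
                V2 = c(NA, 5.0, 5.1, NA, 4.8, 4.9),
                V3 = c(NA, 5.2, 5.3, NA, 5.0, 5.1))

cts   <- apply(x, 1, function(row) sum(is.na(row)))
wh    <- which(cts == 2)                 # the two year rows
start <- wh + 1                          # first data row of each block
end   <- c(wh[-1] - 1, nrow(x))          # last data row of each block

# Cut out each year's block and label it with its year
ans <- mapply(function(i, j) x[i:j, ], start, end, SIMPLIFY = FALSE)
names(ans) <- x[wh, 1]

# Stack the blocks and carry the year along as a column
alldat <- do.call(rbind, ans)
alldat$year <- rep(names(ans), sapply(ans, nrow))
```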
On the other hand, if you want a long data frame with month, day, year
and value:
longdat = reshape(alldat, idvar=c('day','year'),
                  varying=list(month.name), direction='long', times=month.name)
names(longdat)[c(3,4)] = c('Month','value')   # rename the time and value columns
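A toy version of the same wide-to-long reshape (two months only, invented values; v.names is spelled out here so the value column gets a fixed name rather than being renamed afterwards):

```r
# Wide table: one column per month, one row per day
wide <- data.frame(day = 1:2, year = "1910",
                   January = c(5.0, 5.1), February = c(4.8, 4.9))

# Stack the month columns into a single 'value' column, with the month
# name recorded in the 'time' column
long <- reshape(wide, idvar = c("day", "year"),
                varying = list(c("January", "February")),
                v.names = "value", direction = "long",
                times = c("January", "February"))
```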
Next, if you want to create a Date variable:
longdat = transform(longdat,date=as.Date(paste(Month,day,year),'%B %d %Y'))
longdat = na.omit(longdat)                 # drops the non-existent days
longdat = longdat[order(longdat$date),]    # sort chronologically
and finally:
library(zoo)   # zoo is a contributed package; install.packages('zoo') if needed
zoodat = zoo(longdat$value,longdat$date)
which should be suitable for time series analysis.
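For instance, a sketch on invented values (as.yearmon and aggregate.zoo both come with the zoo package):

```r
library(zoo)

# 59 invented daily values spanning January and February 1910
d <- as.Date("1910-01-01") + 0:58
z <- zoo(seq_len(59), d)

# Collapse the daily series to one mean value per month
monthly <- aggregate(z, as.yearmon, mean)
```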
Hope this helps.
- Phil
On Sat, 27 Feb 2010, Tim Coote wrote:
> Thanks, Gabor. My take away from this and Phil's post is that I'm going to
> have to construct some code to do the parsing, rather than use a standard
> function. I'm afraid that neither approach works, yet:
>
> Gabor's has an off-by-one error (days start on the 2nd, not the 1st),
> and the years get messed up around the 29th day. I think that the
> na.omit(DF) line is throwing out the baby with the bathwater. It's
> interesting that this approach is based on read.table; I'd assumed that
> I'd need read.ftable, which I couldn't understand the documentation for.
> What is it that's removing the -999 and -888 values in this code? They
> seem to be gone, but I cannot see why.
>
> Phil's reads in the data, but interleaves rows with just a year and all other
> values as NA.
>
> Tim
> On 27 Feb 2010, at 17:33, Gabor Grothendieck wrote:
>
>> Mark Leeds pointed out to me that the code wrapped around in the post,
>> so it may not be obvious that the regular expression in the grep
>> contains a space:
>> "[^ 0-9.]"
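To illustrate the filter on a few invented lines: it keeps only lines made up entirely of digits, spaces, and dots, which also explains why rows carrying -999/-888 sentinels disappear - the minus sign matches the pattern, so the whole line is dropped:

```r
# Invented sample lines: a clean data line, a year line, a line with a
# -999 sentinel, and a comment line
lines <- c(" 1 5.0 5.2", "1910", " 2 -999 5.3", "Soil temperature (C)")

# Keep only lines containing nothing but digits, spaces, and dots
keep <- !grepl("[^ 0-9.]", lines)
lines[keep]
```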
>>
>>
>> On Sat, Feb 27, 2010 at 7:15 AM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>> Try this. First we read the raw lines into R using grep to remove any
>>> lines containing a character that is not a number or space. Then we
>>> look for the year lines and repeat them down V1 using cumsum. Finally
>>> we omit the year lines.
>>>
>>> myURL <- "http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat"
>>> raw.lines <- readLines(myURL)
>>> DF <- read.table(textConnection(raw.lines[!grepl("[^ 0-9.]", raw.lines)]),
>>>                  fill = TRUE)
>>> DF$V1 <- DF[cumsum(is.na(DF[[2]])), 1]
>>> DF <- na.omit(DF)
>>> head(DF)
>>>
>>>
>>> On Sat, Feb 27, 2010 at 6:32 AM, Tim Coote <tim+r-project.org at coote.org>
>>> wrote:
>>>> Hullo
>>>> I'm trying to read some time series data of meteorological records that
>>>> are available on the web (e.g.
>>>> http://climate.arm.ac.uk/calibrated/soil/dsoil100_cal_1910-1919.dat).
>>>> I'd like to read the digital data directly into R. However, I cannot
>>>> work out the right function and set of parameters to use. It could be
>>>> that the only practical route is to write a parser, possibly in some
>>>> other language, reformat the files, and then read these into R. As far
>>>> as I can tell, the informal grammar of the file is:
>>>>
>>>> <comments terminated by a blank line>
>>>> [<year number on a line on its own>
>>>> <daily readings lines> ]+
>>>>
>>>> and the <daily readings> are of the form:
>>>> <whitespace> <day number> [<whitespace> <reading on day of month>]{12}
>>>>
>>>> Readings for days in months where a day does not exist have special
>>>> values.
>>>> Missing values have a different special value.
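The grammar above can be turned into a short hand-rolled parser. This is only a sketch: the function name is made up, the comment block is assumed to end at the first blank line, and treating -999/-888 as the special values is an assumption taken from later in the thread:

```r
# Sketch of a parser for the informal grammar above (assumptions noted
# in the comments; not the poster's actual code)
parse_soil <- function(lines) {
  # Comments are assumed to end at the first blank line
  blank <- which(lines == "")[1]
  if (!is.na(blank)) lines <- lines[-seq_len(blank)]
  lines <- lines[lines != ""]

  blocks <- list()
  year <- NA
  for (ln in lines) {
    flds <- as.numeric(strsplit(sub("^\\s+", "", ln), "\\s+")[[1]])
    if (length(flds) == 1) {
      year <- flds                          # a year on a line of its own
    } else {
      vals <- flds[-1]                      # 12 monthly readings for this day
      # Assumed sentinel values for missing / non-existent days
      vals[vals %in% c(-999, -888)] <- NA
      blocks[[length(blocks) + 1]] <-
        data.frame(year = year, day = flds[1],
                   month = month.name, value = vals)
    }
  }
  do.call(rbind, blocks)
}
```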
>>>>
>>>> And then I've got the problem of iterating over all relevant files to
>>>> get a whole timeseries.
>>>>
>>>> Is there a way to read this type of file into R? I've read all of the
>>>> examples that I can find, but cannot work out how to do it. I don't
>>>> think that read.table can handle the separate sections of data
>>>> representing each year. read.ftable can perhaps be coerced to parse
>>>> the data, but I cannot see how, even after reading the documentation
>>>> and experimenting with the parameters.
>>>>
>>>> I'm using R 2.10.1 on OS X 10.5.8 and 2.10.0 on Fedora 10.
>>>>
>>>> Any help/suggestions would be greatly appreciated. I can see that this
>>>> type of issue is likely to grow in importance, and I'd also like to
>>>> give the data owners suggestions on how to reformat their data so that
>>>> it is easier for machines to consume, while remaining easy for humans
>>>> to read.
>>>>
>>>> The early records are a serious machine-parsing challenge, as they are
>>>> TIFF images of old notebooks ;-)
>>>>
>>>> tia
>>>>
>>>> Tim
>>>> Tim Coote
>>>> tim at coote.org
>>>> vincit veritas
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>
> Tim Coote
> tim at coote.org
> vincit veritas
>