[R] troubles reading a text file
David Winsemius
dwinsemius at comcast.net
Sun Dec 16 05:45:34 CET 2012
On Dec 15, 2012, at 2:23 PM, <Igor.Drobyshev2 at uqat.ca> wrote:
> Dear R experts,
>
> For quite some time I have been trying to solve the mystery of reading a seemingly trouble-free text file. The data is a temperature reconstruction arranged as a huge grid, preceded by seven "header lines" (which you can see better if the file is opened in Firefox or Chrome).
>
> This is the data (gridded temperature reconstruction)
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt
>
> And this is original data description:
> ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/readme-casty2007.txt
> Basically, it says "space-delimited ASCII format" there ...
>
> I tried this:
> Temperature<-read.table(FileName,skip = 7, header = TRUE, na.strings="NA",sep="")
>
> But ..
>
>
>> Temperature <- read.table(FileName, skip = 7, header = FALSE, sep="")
> Error in read.table(FileName, skip = 7, header = FALSE, sep = "") :
> empty beginning of file
>
After inspecting a small fragment (8 MB, downloaded with an ftp client) with both Firefox and TextEdit.app, and seeing that both reported it to be UTF-16 encoded, I saved it from TextEdit as UTF-8 and could then view it with R's readLines. These are the first 7 lines and the beginning of the eighth:
> readLines("~/Downloads/temp-mon2.txt", n=10)
[1] "NAME \"Monthly European Temperatures 1766-2000 [T=2m, Celsius]\""
[2] "LONGITUDES\t180\t50.00W\t40.00E\t"
[3] "LATITUDES\t100\t80.00N\t30.00N\t"
[4] "NODATA_STRING\tNA"
[5] "NUMBER_OF_ROWS\t2820"
[6] "NUMBER_OF_COLUMNS\t18001\t"
[7] ""
[8] "YYYYMM\t79.75N/49.75W\t79.75N/49.25W\t79.75N/48.75W\t79.75N/48.25W\t79.75N/47.75W\t79.75N/47.25W\t79.75N/46.75W\t79.75N/46.25W\t79.75N/45.75W\t79.75N/45.25W\t79.75N/44.75W\t79.75N/44.25W\t79.7
As you can readily see, it is a tab-separated file. I was able to get partial success (reading the first three data lines, anyway) with:
> inp <- read.table("~/Downloads/temp-mon.txt", nrow=3, skip =7, header=TRUE, fill=TRUE, fileEncoding ="UTF-16")
> inp[1 , 1:10]
YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512 -32.61 -32.92 -33.34 -33.65 -34.09 -34.21 -34.65 -34.98 -35.43
> inp[ , 1:10]
YYYYMM X79.75N.49.75W X79.75N.49.25W X79.75N.48.75W X79.75N.48.25W X79.75N.47.75W X79.75N.47.25W X79.75N.46.75W X79.75N.46.25W X79.75N.45.75W
1 176512 -32.61 -32.92 -33.34 -33.65 -34.09 -34.21 -34.65 -34.98 -35.43
2 176601 -31.89 -31.96 -32.26 -32.48 -32.71 -33.03 -33.29 -33.41 -33.76
3 176602 -34.31 -34.40 -34.60 -34.79 -35.01 -35.13 -35.46 -35.57 -35.91
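
The same approach can be tried on a small stand-in file before touching the real one. What follows is only a sketch under stated assumptions: the ten-line text is a made-up miniature of the layout shown above, the temporary file is written as little-endian UTF-16 with a byte-order mark (the byte order of the real file was not checked here), and the names `txt`, `tmp`, `bom` and `demo` are illustrative. It checks the first two bytes for a BOM and then reads the table with an explicit tab separator.

```r
# Made-up miniature of the layout: six metadata lines, one blank line,
# then a tab-separated header row and two data rows.
txt <- paste0(paste(c(
  "NAME \"demo\"",
  "LONGITUDES\t2\t50.00W\t49.00W",
  "LATITUDES\t1\t80.00N\t79.50N",
  "NODATA_STRING\tNA",
  "NUMBER_OF_ROWS\t2",
  "NUMBER_OF_COLUMNS\t3",
  "",
  "YYYYMM\t79.75N/49.75W\t79.75N/49.25W",
  "176512\t-32.61\t-32.92",
  "176601\t-31.89\tNA"
), collapse = "\n"), "\n")

# Write the text as UTF-16LE, prefixed with the byte-order mark FF FE.
tmp <- tempfile(fileext = ".txt")
bytes <- c(as.raw(c(0xff, 0xfe)),
           iconv(txt, "UTF-8", "UTF-16LE", toRaw = TRUE)[[1]])
writeBin(bytes, tmp)

# Peek at the first two bytes: FF FE suggests UTF-16LE, FE FF suggests UTF-16BE.
bom <- readBin(tmp, what = "raw", n = 2)
is_utf16le <- identical(bom, as.raw(c(0xff, 0xfe)))

# Skip the seven leading lines, take line eight as the header, split on tabs.
# check.names = FALSE keeps names such as "79.75N/49.75W" unmangled.
demo <- read.table(tmp, skip = 7, header = TRUE, sep = "\t",
                   na.strings = "NA", check.names = FALSE,
                   fileEncoding = "UTF-16LE")
str(demo)
```

If the BOM check on the real file shows FE FF instead, "UTF-16BE" (or plain "UTF-16", which many iconv implementations resolve from the BOM) would be the value to try for fileEncoding.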
>
> Trying read.csv gives this:
>
>
> Error: cannot allocate vector of size 370.5 Mb
That, on the other hand, suggests you have inadequate machine resources for this job. Perhaps you should be thinking of using tools other than R for this project ... or buying more RAM. You should probably have 32 GB for a job this size.
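
For a rough sense of scale, the dimensions declared in the file's own header lines (2820 rows, 18001 columns) can be turned into a lower bound on memory. This is a back-of-the-envelope sketch, assuming every cell is stored as an 8-byte double; read.table needs additional working copies while parsing, so the real peak will be higher. The `col_classes` vector at the end is one possible way to cut this down, by keeping only the first ten columns (a number chosen here purely for illustration) and marking the rest as "NULL" so that read.table drops them.

```r
# Dimensions taken from the NUMBER_OF_ROWS / NUMBER_OF_COLUMNS header lines.
n_rows <- 2820
n_cols <- 18001
bytes_per_double <- 8

# Approximate size of the finished data frame in MiB (roughly 387),
# not counting temporary copies made during parsing.
full_mib <- n_rows * n_cols * bytes_per_double / 2^20

# Illustrative choice: keep the date column plus the first nine grid cells.
keep <- 10
col_classes <- c("integer",
                 rep("numeric", keep - 1),
                 rep("NULL", n_cols - keep))   # "NULL" means: skip this column

# Approximate size of the reduced result in MiB.
subset_mib <- n_rows * keep * bytes_per_double / 2^20

cat(sprintf("full grid: about %.0f MiB; first %d columns: about %.2f MiB\n",
            full_mib, keep, subset_mib))
```

Passing `colClasses = col_classes` to read.table, together with the skip, header and fileEncoding arguments, should then avoid building all 18001 columns at once; this has not been tried against the full file here.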
>
> I attempted to handle this by opening and resaving the file in other software, but even though I can still see the first lines of the file in the import dialog, the full reading of the file always ends with an error, possibly because of the huge number of columns ..
>
> I believe the problem is with some special encoding but I cannot figure out how to get around it.
Partially correct but perhaps your problems are multifactorial.
I was able to get this to "work" directly from that website:
> inp <- read.table(file=url("ftp://ftp.ncdc.noaa.gov/pub/data/paleo/historical/europe/casty2007/temp-mon.txt", encoding="UTF-16"), nrow=3 , skip =7, header=TRUE, fill=TRUE, fileEncoding ="UTF-16")
> str(inp[ , 1:10])
'data.frame': 3 obs. of 10 variables:
$ YYYYMM : int 176512 176601 176602
$ X79.75N.49.75W: num -32.6 -31.9 -34.3
$ X79.75N.49.25W: num -32.9 -32 -34.4
$ X79.75N.48.75W: num -33.3 -32.3 -34.6
$ X79.75N.48.25W: num -33.6 -32.5 -34.8
$ X79.75N.47.75W: num -34.1 -32.7 -35
$ X79.75N.47.25W: num -34.2 -33 -35.1
$ X79.75N.46.75W: num -34.6 -33.3 -35.5
$ X79.75N.46.25W: num -35 -33.4 -35.6
$ X79.75N.45.75W: num -35.4 -33.8 -35.9
--
David Winsemius
Alameda, CA, USA