[R] read.delim problem with trailing spaces
John Fox
jfox at mcmaster.ca
Wed Oct 6 15:59:19 CEST 2004
Dear Mike,
This is a trap, but it's not a bug, and I don't think it would be
appropriate to "correct" it. The field that was actually read is ". "
(a period followed by a space), and that string wasn't declared as NA. One
could do the following to avoid the problem:
> read.csv("c:/temp/test.txt", na.strings=".", strip.white=TRUE)
            income   imr region oilexprt imr80 gnp80 life
Afghanistan     75 400.0      4        0 185.0    NA 37.5
Algeria        400  86.3      2        1  20.5  1920 50.7
Argentina     1191  59.6      1        0  40.8  2390 67.1
Australia     3426  26.7      4        0  12.5  9820 71.0
Austria       3350  23.7      3        0  14.8 10230 70.4
Bangladesh     100 124.3      4        0 139.0   120   NA
Belgium       3346  17.0      3        0  11.2 12180 70.6
Benin           81 109.6      2        0 109.6   300   NA
Bolivia        200  60.4      1        0  77.3   570 49.7
Brazil         425 170.0      1        0  84.0  2020 60.7
Britain       2503  17.5      3        0  12.6  7920 72.0
Burma           73 200.0      4        0 195.0   180 42.3
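
For anyone who wants to try this without the file, here is a minimal sketch. It assumes a four-row excerpt of the data pasted inline through the text= argument rather than read from disk, and the object names (csv_text, trap, fixed, declared) are only for illustration. It shows the trap, the strip.white fix, and the alternative of declaring ". " itself as an NA string.

```r
# Four rows of the data, inline; Bangladesh and Benin end in ". "
# (a period followed by a trailing space), as in the original file.
csv_text <- paste(
  "income,imr,region,oilexprt,imr80,gnp80,life",
  "Afghanistan,75,400.0,4,0,185.0,.,37.5",
  "Algeria,400,86.3,2,1,20.5,1920,50.7",
  "Bangladesh,100,124.3,4,0,139.0,120,. ",
  "Benin,81,109.6,2,0,109.6,300,. ",
  sep = "\n"
)

# The trap: ". " is not equal to ".", so it is kept as text and
# the whole life column fails to convert to numeric.
trap <- read.csv(text = csv_text, na.strings = ".")
is.numeric(trap$life)      # FALSE

# Fix 1: strip white space from fields before the NA comparison.
fixed <- read.csv(text = csv_text, na.strings = ".", strip.white = TRUE)
is.numeric(fixed$life)     # TRUE
sum(is.na(fixed$life))     # 2

# Fix 2: declare the padded string as an NA string as well.
declared <- read.csv(text = csv_text, na.strings = c(".", ". "))
is.numeric(declared$life)  # TRUE
```

The strip.white approach is probably the safer of the two, since it does not depend on knowing how many spaces follow the period.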
Regards,
John
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
> Michael Friendly
> Sent: Wednesday, October 06, 2004 8:18 AM
> To: R-help
> Subject: [R] read.delim problem with trailing spaces
>
> I'm trying to read a comma delimited dataset that uses '.'
> for NA. I found that if the last field on a line was a missing '.'
> it was not read as NA, but just a '.', and the life variable
> was made a factor. The data looks like this,
>
> income,imr,region,oilexprt,imr80,gnp80,life
> Afghanistan,75,400.0,4,0,185.0,.,37.5
> Algeria,400,86.3,2,1,20.5,1920,50.7
> Argentina,1191,59.6,1,0,40.8,2390,67.1
> Australia,3426,26.7,4,0,12.5,9820,71.0
> Austria,3350,23.7,3,0,14.8,10230,70.4
> Bangladesh,100,124.3,4,0,139.0,120,.
> Belgium,3346,17.0,3,0,11.2,12180,70.6
> Benin,81,109.6,2,0,109.6,300,.
> Bolivia,200,60.4,1,0,77.3,570,49.7
> Brazil,425,170.0,1,0,84.0,2020,60.7
> Britain,2503,17.5,3,0,12.6,7920,72.0
> Burma,73,200.0,4,0,195.0,180,42.3
> ...
>
> and I used
> > nations <- read.delim("~/sasuser/data/nations2.dat", na.strings=".",
> +                       row.name=1, sep=",", header=TRUE)
>
> > nations[1:10,]
>             income   imr region oilexprt imr80 gnp80 life
> Afghanistan     75 400.0      4        0 185.0    NA 37.5
> Algeria        400  86.3      2        1  20.5  1920 50.7
> Argentina     1191  59.6      1        0  40.8  2390 67.1
> Australia     3426  26.7      4        0  12.5  9820 71.0
> Austria       3350  23.7      3        0  14.8 10230 70.4
> Bangladesh     100 124.3      4        0 139.0   120    .
> Belgium       3346  17.0      3        0  11.2 12180 70.6
> Benin           81 109.6      2        0 109.6   300    .
> Bolivia        200  60.4      1        0  77.3   570 49.7
> Brazil         425 170.0      1        0  84.0  2020 60.7
> > summary(nations$life)
>    . 27.0 31.6 32.0 32.6 34.5 35.0 36.0 36.7 36.9 37.1 37.2 37.5 38.5 38.8 40.5
>    2    1    1    1    1    1    2    1    1    1    1    1    1    3    1    1
> 40.6 41.0 41.2 42.3 43.5 43.7 44.9 45.1 46.8 47.5 47.6 49.0 49.7 49.9 50.0 50.5
>    1    6    1    4    1    1    1    1    1    3    1    3    1    1    2    1
>
>
> After much hair-pulling, I discovered that the data lines for
> Bangladesh and Benin contained a trailing space after the
> '.'. Removing those made the problem go away, but that
> shouldn't happen and I wonder if this is
> still a potential problem for others. I'm using R 1.8.1.
>
> -Michael
>
> --
> Michael Friendly Email: friendly at yorku.ca
> Professor, Psychology Dept.
> York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
> 4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html
> Toronto, ONT M3J 1P3 CANADA
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html