[R] Possible bug in gzcon() (6161:src/main/connections.c)
Ivan Krylov
|kry|ov @end|ng |rom d|@root@org
Sat Apr 26 21:55:00 CEST 2025
On Fri, 25 Apr 2025 13:41:35 +0000,
André Wildberg <andre.wildberg using outlook.com> wrote:
> Reproducible example:
>
> addr <-
> "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz"
>
> # online/stream
> nrow(read.csv(gzcon(url(addr), text=T), header=F))
> # [1] 1798
>
> # local
> download.file(addr, destfile=basename(addr))
> nrow(read.csv(gzcon(file(basename(addr), "r"), text=T), header=F))
> # [1] 429498

I can reproduce the problem (with slightly different numbers):

length(readLines(gzcon(file("USW00014839.csv.gz", "rb"), text = TRUE)))
# [1] 28002
length(readLines("USW00014839.csv.gz"))
# [1] 429535

The underlying reason for the problem you're having with gzcon() is
most likely that the gzip archive has been concatenated from multiple
separate archives:

perl -MIO::Uncompress::Gunzip -E'
my $z = IO::Uncompress::Gunzip::->new(shift);
say "end of stream" while $z->nextStream() == 1;
' -- USW00014839.csv.gz
# end of stream
# end of stream
# end of stream
# end of stream
# end of stream

readLines("USW00014839.csv.gz") opens the file through file(), which
detects the gzip header and transparently switches to a gzfile()
connection. gzfile() supports concatenated archives, but gzcon()
currently doesn't, so it returns only part of the data. Feature request
submitted at <https://bugs.r-project.org/show_bug.cgi?id=18887>.
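
The difference can be tried without the NOAA file. What follows is a minimal sketch, not a definitive test: it builds a small synthetic gzip file from two concatenated members and reads it through both gzfile() and gzcon(). The helper write_gz() and all file names are made up for illustration, and the gzcon() line count is left unstated because it depends on whether the R version in use handles concatenated members.

```r
# Temporary, purely illustrative file names.
part1 <- tempfile(fileext = ".gz")
part2 <- tempfile(fileext = ".gz")
combined <- tempfile(fileext = ".csv.gz")

# Hypothetical helper: write lines as a single gzip member.
write_gz <- function(lines, path) {
  con <- gzfile(path, "wb")
  on.exit(close(con))
  writeLines(lines, con)
}
write_gz(sprintf("a,%d", 1:3), part1)
write_gz(sprintf("b,%d", 1:4), part2)

# Concatenating the raw bytes of two gzip files gives a
# multi-member gzip file, like the one discussed above.
writeBin(
  c(readBin(part1, "raw", file.size(part1)),
    readBin(part2, "raw", file.size(part2))),
  combined
)

# gzfile() reads through every member.
con <- gzfile(combined, "rt")
all_lines <- readLines(con)
close(con)
length(all_lines)

# gzcon() over a binary connection; as discussed above, it may
# return fewer lines than gzfile() does.
con <- gzcon(file(combined, "rb"), text = TRUE)
first_lines <- readLines(con)
close(con)
length(first_lines)
```

For a remote file, one option consistent with the above is to download.file() it to a temporary path first and read it from there with gzfile(), or simply by path, instead of wrapping url() in gzcon().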
--
Best regards,
Ivan