[R] reading tables from url
stubben
stubben at lanl.gov
Wed Nov 14 19:49:47 CET 2007
I'm trying to read some web tables directly into R. These are both
genome sequencing projects (eukaryotes and metagenomes) from NCBI and
look very similar; however, only the first one works.
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
I added ?dump=selected to the end of the URL to get a tab-delimited
file (which is what you get if you click the Save button on either
page).
> options(internet.info=0)
## this one works
> x1<-url("http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?dump=selected")
> read.delim(x1, skip=1, nrows=5)[,1:3]
  X...Columns.                     ProjectID Organism.Name
1        20303 Acanthamoeba castellanii Neff      Protists
2        13657      Acyrthosiphon pisum LSR1       Animals
3        12434       Aedes aegypti Liverpool       Animals
4        12635 Ajellomyces capsulatus G186AR         Fungi
5        12653  Ajellomyces capsulatus G217B         Fungi
Warning messages:
1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
2: -> GET /genomes/leuks.cgi?dump=selected HTTP/1.0
Host: www.ncbi.nlm.nih.gov
Pragma: no-cache
in: open.connection(file, "r")
3: <- HTTP/1.1 200 OK in: open.connection(file, "r")
4: <- Date: Wed, 14 Nov 2007 18:03:29 GMT in: open.connection(file, "r")
5: <- Server: Apache in: open.connection(file, "r")
6: <- Content-Disposition: attachment; filename="untitle.txt" in: open.connection(file, "r")
7: <- Content-Type: application/force-download in: open.connection(file, "r")
8: <- Vary: Accept-Encoding in: open.connection(file, "r")
9: <- Connection: close in: open.connection(file, "r")
10: Code 200, content-type 'application/force-download' in: open.connection(file, "r")
## this one fails to open a connection
> x2<-url("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected")
> read.delim(x2, skip=1, nrows=5)[,1:3]
Error in open.connection(file, "r") : unable to open connection
In addition: Warning messages:
1: connected to 'www.ncbi.nlm.nih.gov' on port 80. in: open.connection(file, "r")
2: -> GET /genomes/lenvs.cgi?dump=selected HTTP/1.0
Host: www.ncbi.nlm.nih.gov
Pragma: no-cache
in: open.connection(file, "r")
3: <- HTTP/1.1 500 Internal Server Error in: open.connection(file, "r")
4: <- Date: Wed, 14 Nov 2007 18:04:26 GMT in: open.connection(file, "r")
5: <- Server: Apache in: open.connection(file, "r")
6: <- Content-Type: text/html; charset=ISO-8859-1 in: open.connection(file, "r")
7: <- Vary: Accept-Encoding in: open.connection(file, "r")
8: <- Connection: close in: open.connection(file, "r")
9: Code 500, content-type 'text/html; charset=ISO-8859-1' in: open.connection(file, "r")
10: cannot open: HTTP status was '500 Internal Server Error' in: open.connection(file, "r")
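The failure above is raised as an ordinary R error once the server answers with status 500, so it can be trapped instead of stopping a script. A minimal sketch, assuming a hypothetical helper name try_read_delim (not part of R or of this post); it only reports the failure and does not explain why the server rejects the request:

```r
# Hypothetical helper: read a delimited table from a URL, returning NULL
# (with a message) instead of stopping when the connection cannot be opened.
try_read_delim <- function(u, ...) {
  con <- NULL
  tryCatch({
    con <- url(u)
    # read.delim() opens the connection itself and closes it when done
    read.delim(con, ...)
  }, error = function(e) {
    # If opening failed, the connection object still exists; release it
    if (!is.null(con)) try(close(con), silent = TRUE)
    message("could not read ", u, ": ", conditionMessage(e))
    NULL
  })
}

# Usage, mirroring the failing call above (needs network access):
# x2 <- try_read_delim(
#   "http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected",
#   skip = 1, nrows = 5)
# if (is.null(x2)) message("falling back to another method")
```

Returning NULL keeps the caller in control of what to try next; any warnings that R emits while opening the connection are still shown.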
Also, I can't even read lines from the main page.
> readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10)
Error in file(con, "r") : unable to open connection
...
## now I'm just guessing...
> readLines("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi", n=10, encoding="ISO-8859-1")
Error in file(con, "r") : unable to open connection
...
download.file() works fine, but I would like to avoid writing a local file if possible.
> capabilities()[5]
http/ftp
TRUE
> download.file("http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected", "lenvs.tab")
> read.delim("lenvs.tab", skip=1, nrows=5)[,1:3]
  X...Columns. Parent.ProjectID                                         ProjectID
1        19733            13694       Global Ocean Sampling Expedition Metagenome
2        20823            13696  5-Way (CG) Acid Mine Drainage Biofilm Metagenome
3            -            13699                Waseca County Farm Soil Metagenome
4            -            13702 Methane-Oxidizing Archaea from Deep-Sea Sediments
5            -            13729                     Pacific Beach Sand Metagenome
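If the download.file() route has to stay, the intermediate file can at least be made temporary so nothing is left behind in the working directory. A minimal sketch, assuming a hypothetical wrapper name read_delim_via_download (not an R function); it works around the url() failure rather than explaining it:

```r
# Hypothetical wrapper: download to a temporary file, read it, then delete it.
read_delim_via_download <- function(u, ...) {
  tmp <- tempfile(fileext = ".tab")
  on.exit(unlink(tmp), add = TRUE)   # remove the temporary file on exit
  status <- download.file(u, tmp, quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  read.delim(tmp, ...)
}

# Usage, mirroring the call above (needs network access):
# x2 <- read_delim_via_download(
#   "http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi?dump=selected",
#   skip = 1, nrows = 5)
# x2[, 1:3]
```

Extra arguments such as skip and nrows are passed straight through to read.delim(), so the call looks the same as reading from a connection.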
Thanks for your help. Hopefully this is something simple that I
missed in the documentation/help.
Chris
--
-------------------
Chris Stubben
Los Alamos National Lab
BioScience Division
MS M888
Los Alamos, NM 87545