[R] Scraping from different level URLs website
Ilio Fornasero
iliofornasero at hotmail.com
Tue Jan 23 18:31:01 CET 2018
I am doing research on World Bank (WB) projects in developing countries. To do so, I am scraping their website to collect the data I am interested in.
The structure of the webpage I want to scrape is the following:
1. List of countries: the list of all countries in which the WB has developed projects <http://projects.worldbank.org/country?lang=en&page=>
1.1. Clicking on a single country in 1. gives that country's project list, which contains all the projects in that country and spans many pages <http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=3A> . Here I have linked just one page for one country, but every country has several such pages.
1.1.1. Clicking on a single project in 1.1. gives, among other things, the project's overview tab <http://projects.worldbank.org/P155642/?lang=en&tab=overview> , which is what I am interested in.
In other words, my problem is to find a way to create a data frame that includes all the countries, the complete list of projects for each country, and the overview of each project.
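The three levels described above suggest a nested loop: countries, then result pages per country, then one overview per project. What follows is only a minimal sketch of that flow in base R, not a working scraper. The names `collect_projects`, `list_countries`, `list_projects` and `get_overview` are hypothetical, and the three fetchers are passed in as plain functions so that the page-parsing details stay separate. Here they are replaced by stubs returning made-up values.

```r
# Sketch of the three-level crawl. All function names are hypothetical.
# list_countries():           returns a character vector of country codes
# list_projects(code, page):  returns a data frame with columns project_id
#                             and title; zero rows means "no more pages"
# get_overview(project_id):   returns a single string
collect_projects <- function(list_countries, list_projects, get_overview,
                             max_pages = 50, delay = 0) {
  rows <- list()
  for (code in list_countries()) {
    for (page in seq_len(max_pages)) {
      Sys.sleep(delay)                     # pause between requests
      found <- list_projects(code, page)
      if (nrow(found) == 0) break          # ran past the last page
      found$country <- code
      found$overview <- vapply(found$project_id, get_overview,
                               character(1), USE.NAMES = FALSE)
      rows[[length(rows) + 1]] <- found
    }
  }
  if (length(rows) == 0) return(data.frame())
  do.call(rbind, rows)                     # one row per project
}

# Stub fetchers with made-up data, standing in for the real scraping code.
fake_pages <- list(
  AA = list(data.frame(project_id = c("P1", "P2"),
                       title = c("Roads", "Water"),
                       stringsAsFactors = FALSE),
            data.frame(project_id = "P3", title = "Health",
                       stringsAsFactors = FALSE)),
  BB = list(data.frame(project_id = "P4", title = "Schools",
                       stringsAsFactors = FALSE))
)
empty <- data.frame(project_id = character(0), title = character(0),
                    stringsAsFactors = FALSE)
stub_countries <- function() names(fake_pages)
stub_projects <- function(code, page) {
  pages <- fake_pages[[code]]
  if (page > length(pages)) empty else pages[[page]]
}
stub_overview <- function(id) paste("overview of", id)

result <- collect_projects(stub_countries, stub_projects, stub_overview)
print(result)
```

With real fetchers, `delay` would be set to a few seconds, and each stub would be replaced by a function that reads and parses the corresponding page.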
This is the code that I have (unsuccessfully) written so far:
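library(rvest)   # read_html(), html_nodes(), html_node(), html_text(), html_attr()
library(dplyr)   # mutate(), %>%
library(purrr)   # map(), map_df()
library(tibble)  # tibble()

# Country list page (defined here but not used below)
WB_links <- "http://projects.worldbank.org/country?lang=en&page=projects"

# Scrape one search-results page; x is substituted for countrycode_exact
WB_proj <- function(x) {
  Sys.sleep(5)  # pause between requests
  url <- sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", x)
  html <- read_html(url)
  tibble(title = html_nodes(html, ".grid_20") %>% html_text(trim = TRUE),
         project_url = html_nodes(html, ".grid_20") %>% html_attr("href"))
}

# Passes 1:5 as the country code, then pastes each project_url into the search URL again
WB_scrape <- map_df(1:5, WB_proj) %>%
  mutate(study_description =
           map(project_url,
               ~read_html(sprintf("http://projects.worldbank.org/search?lang=en&searchTerm=&countrycode_exact=%s", .x)) %>%
                 html_node() %>%   # no CSS selector or XPath is supplied here
                 html_text()))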
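One possible issue with the code above: `html_attr("href")` only returns a value for elements that actually carry an `href` attribute, normally `<a>`. If `.grid_20` matches a container rather than the links themselves, `project_url` will be `NA`. The sketch below illustrates the idea on a made-up HTML fragment; the markup, the project IDs and the assumption that the links are relative are all hypothetical, and the real page may be structured differently.

```r
library(rvest)  # read_html(), html_nodes(), html_text(), html_attr()
library(xml2)   # url_absolute()

# Made-up fragment standing in for a result page; the real markup may differ.
page <- read_html(paste0(
  '<div class="grid_20">',
  '<a href="/P000001/?lang=en&amp;tab=overview"> Example project one </a>',
  '<a href="/P000002/?lang=en&amp;tab=overview"> Example project two </a>',
  '</div>'
))

# Select the <a> elements inside the container, not the container itself.
links <- html_nodes(page, ".grid_20 a")

projects <- data.frame(
  title = html_text(links, trim = TRUE),
  # Resolve relative links against the site root (assumed base URL).
  project_url = url_absolute(html_attr(links, "href"),
                             "http://projects.worldbank.org/"),
  stringsAsFactors = FALSE
)
print(projects)
```

Each `project_url` obtained this way could then be passed directly to `read_html()` instead of being pasted into the search URL a second time.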
Any suggestions?
Note: I am sorry if this question seems trivial, but I am quite new to R and I haven't found any help on this by looking around (though I could have missed something, of course).