[R] getURL not working in loop
Abhinaba Roy
abhinabaroy09 at gmail.com
Tue Jul 21 18:50:37 CEST 2015
Hi R helpers,
I am trying to extract customer feedback from an e-commerce site and
then use it to create a word cloud. Below is the code I have:
library(RCurl)
library(XML)
library(rvest)
#web-crawling
init="http://www.flipkart.com/moto-g-2nd-generation/product-reviews/ITME7YBANGAWQZZX?pid=MOBDYGZ6SHNB7RFC&type=all"
crawlcandidate="start="          # pattern that marks links to further review pages
base="http://www.flipkart.com"
num=10                           # number of review pages to crawl
doclist=list()
anchorlist=vector()
j=0
while(j<num){
  print(j)
  if(j==0){
    doclist[[j+1]]=getURL(init)
  }else{
    doclist[[j+1]]=getURL(paste(base,anchorlist[j+1],sep=""))
  }
  # parse the page just downloaded and collect the candidate links it contains
  doc=htmlParse(doclist[[j+1]])
  anchor=getNodeSet(doc,"//a")
  anchor=sapply(anchor,function(x)xmlGetAttr(x,"href"))
  anchor=anchor[grep(crawlcandidate,anchor)]
  anchorlist=c(anchorlist,anchor)
  anchorlist=unique(anchorlist)
  j=j+1
}
# extract only the review text and the star ratings from each downloaded page
reviews=c()
ratings=c()
for(i in 1:num){
  doc=htmlParse(doclist[[i]])
  l=getNodeSet(doc,"//div/p/span[@class='review-text']")
  l1=html_text(l)
  rateNodes=getNodeSet(doc,"//div[@class='fk-stars']")
  rates=sapply(rateNodes,function(x)xmlGetAttr(x,'title'))
  ratings=c(ratings,rates)
  reviews=c(reviews,l1)
}
View(reviews)
View(ratings)
#creating wordcloud
library(tm)
library(wordcloud)
library(RColorBrewer)
corpus=Corpus(VectorSource(reviews[1:100]))
corpus=tm_map(corpus,tolower)
corpus=tm_map(corpus,removePunctuation)
corpus=tm_map(corpus,removeNumbers)
corpus=tm_map(corpus,removeWords,stopwords("en"))
# rebuild the corpus from the cleaned text so TermDocumentMatrix() accepts it
corpus=Corpus(VectorSource(corpus))
tdm=TermDocumentMatrix(corpus)
m=as.matrix(tdm)
v=sort(rowSums(m),decreasing=T)
d=data.frame(words=names(v),freq=v)
wordcloud(d$words,d$freq,max.words=300,colors=brewer.pal(8,"Dark2"),scale=c(3,0.5),random.order=F)
But I am getting the following error:
Error in which(value == defs) :
argument "code" is missing, with no default
In addition: Warning message:
XML content does not seem to be XML: ''
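From the warning it looks like htmlParse() is being handed an empty string, so I suspect one of the getURL() calls is returning nothing (possibly because of how I paste base and anchorlist[j+1] together). This is roughly the check I was planning to add inside the while loop to narrow it down - just a sketch, I am not sure it is the right approach:

# sketch: look at what getURL() returns before handing it to htmlParse()
url  = if(j==0) init else paste(base,anchorlist[j+1],sep="")
page = getURL(url)
print(nchar(page))   # 0 here would explain the "does not seem to be XML: ''" warning
doclist[[j+1]] = page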
How can I resolve this error?
Any help will be appreciated.
Regards,
Abhi