[R] Web Scraping

Ista Zahn istazahn at gmail.com
Sat Oct 5 03:58:33 CEST 2013


I have a short demo at https://gist.github.com/izahn/5785265 that
might get you started.
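
As a minimal sketch of the general approach with the XML package mentioned
in the question: parse the HTML, select nodes with XPath, then clean the
text into a data frame. The markup, the class names (listing, price, year,
mileage, make, model), the second listing's values, and the helpers field()
and to_number() are all made up for illustration. The real cars.com pages
will be structured differently, so inspect the page source and adjust the
XPath expressions accordingly.

```r
library(XML)  # provides htmlParse(), xpathSApply() and xmlValue()

# Hypothetical markup standing in for one page of search results;
# this is NOT the actual cars.com HTML.
page <- '
<html><body>
  <div class="listing">
    <span class="price">$32000</span>
    <span class="year">1999</span>
    <span class="mileage">57,987</span>
    <span class="make">Audi</span>
    <span class="model">A4</span>
  </div>
  <div class="listing">
    <span class="price">$18,500</span>
    <span class="year">2005</span>
    <span class="mileage">102,340</span>
    <span class="make">Audi</span>
    <span class="model">A4</span>
  </div>
</body></html>'

# Parse the string as HTML (asText = TRUE means "this is content, not a path").
doc <- htmlParse(page, asText = TRUE)

# Return the text of every <span> with the given class, one value per listing.
field <- function(cls) {
  path <- sprintf("//div[@class='listing']/span[@class='%s']", cls)
  xpathSApply(doc, path, xmlValue)
}

# Strip "$" and "," so the strings can be converted to numbers.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

cars <- data.frame(
  Cost    = to_number(field("price")),
  Year    = as.integer(field("year")),
  Mileage = to_number(field("mileage")),
  Make    = field("make"),
  Model   = field("model"),
  stringsAsFactors = FALSE
)

print(cars)
```

For a live page, htmlParse() can be given a file name or URL instead of a
string, or the page can be downloaded first with RCurl's getURL() and passed
in with asText = TRUE as above.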


On Fri, Oct 4, 2013 at 12:51 PM, Mohamed Anany
<melsayed at students.kennesaw.edu> wrote:
> Hello everybody,
> I just started using R and I'm presenting a poster for R day at Kennesaw
> State University and I really need some help in terms of web scraping.
> I'm trying to extract used cars data from www.cars.com to include the
> mileage, year, model, make, price, CARFAX availability and Technology
> package availability. I've done some research, and everything points to the
> XML package and RCurl package. I also got my hands on a function that would
> capture all the text in the web page and store as a huge character vector.
> I've never done data mining before, so reading the help documents for the
> packages I mentioned earlier is like reading Chinese. I would appreciate it
> if you could guide me through this process of data extraction.
> Here's an example of what the data would look like:
> Cost     Year   Mileage   Tech   CARFAX   Make   Model
> $32000   1999   57,987    1      FREE     Audi   A4
> Here's the link to the search:
> http://www.cars.com/for-sale/searchresults.action?stkTyp=U&tracktype=usedcc&mkId=20049&AmbMkId=20049&AmbMkNm=Audi&make=Audi&AmbMdNm=A4&model=A4&mdId=20596&AmbMdId=20596&rd=100&zc=30062&searchSource=QUICK_FORM&enableSeo=1
> I'm not expecting you to write the whole code for me, but just some
> guidance and where to start and what functions would be useful in my
> situation.
> Thanks a lot anyway.
> Regards,
> M. Samir Anany
