[R] Getting htmlParse to work with Hebrew? (on windows)
Milan Bouchet-Valat
nalimilan at club.fr
Thu Feb 21 15:43:43 CET 2013
On Thursday 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> Hi Milan,
>
> a <- getURL(con, .encoding = "UTF-8")
> Encoding(a)
> > [1] "UTF-8"
> a # Here the UTF-8 text looks fine.
> htmlParse(a, encoding = "UTF-8") # again, the same encoding issue
And what if you try this:
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
or this:
a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
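
In case it helps, here is a rough, self-contained sketch that combines both ideas without needing the network. It is only an illustration under assumptions: the page built below is a made-up stand-in for what getURL() returns, and I am assuming the bytes really are windows-1251 and have not already been converted. The names page and a_utf8 are mine, not anything from RCurl or XML.

```r
# Hypothetical stand-in for what getURL() might return: a tiny HTML page whose
# bytes are windows-1251 and whose <meta> tag says so.
# 0xCF 0xF0 0xE8 0xE2 0xE5 0xF2 is the Russian word "Privet" in windows-1251.
head_open <- paste0('<html><head><meta http-equiv="Content-Type" ',
                    'content="text/html; charset=windows-1251"><title>')
page <- c(charToRaw(head_open),
          as.raw(c(0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2)),
          charToRaw("</title></head><body></body></html>"))
a <- rawToChar(page)
Encoding(a)                      # R has no encoding mark for these bytes

# Step 1: convert the bytes to UTF-8; iconv() also marks the result as UTF-8.
a_utf8 <- iconv(a, from = "windows-1251", to = "UTF-8")

# Step 2: make the charset declared in the page agree with the new encoding.
a_utf8 <- sub("windows-1251", "UTF-8", a_utf8, fixed = TRUE)
Encoding(a_utf8)

# Step 3: parse, if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(a_utf8, asText = TRUE, encoding = "UTF-8")
  print(XML::xpathSApply(doc, "//title", XML::xmlValue))
}
```

The point of doing both steps is that the string's real encoding, its encoding mark, and the charset declared inside the page then all agree, so it should not matter which of them htmlParse() trusts.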
Cheers
> >> why didn't getURL() detect and set a's encoding correctly?
> I think the problem is with this page, because other sites work fine.
>
> 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
> On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > Hi Milan!
> >
> >
> > > Encoding(a)
> > [1] "unknown"
>
> Hm, here I get "UTF-8", which is my locale encoding.
>
> I've tried a little more, and I discovered that using
> a <- getURL(u, .encoding="UTF-8")
> ensures that a is in the correct encoding here. I know this is not your
> problem, but it might help: check whether Encoding(a) is set to "UTF-8"
> or not in that case, and whether this fixes things.
>
> I'm not sure how htmlParse() detects the encoding when you pass it a
> character vector, but it probably uses Encoding(a), since that's the
> only reliable information; if it is missing, maybe it falls back to what
> the contents of the file say (maybe even before what the "encoding"
> argument says), which is windows-1251, and may not be the encoding in
> which getURL() saved the character vector. The question would then be:
> why didn't getURL() detect and set a's encoding correctly?
>
>
> My two cents
>
>
> > 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
> > On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > > Hello dear R-help mailing list.
> > >
> > >
> > > Looks like the same issue in Russian:
> > >
> > >
> > >
> > > library(RCurl)
> > >
> > > library(XML)
> > >
> > > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > >
> > > a = getURL(u)
> > >
> > > a # Here - the Russian is fine.
> > >
> > > a2 <- htmlParse(a)
> > >
> > > a2 # Here it is a mess...
> > >
> > >
> > >
> > > None of these seem to fix it:
> > >
> > >
> > >
> > > htmlParse(a, encoding = "windows-1251")
> > >
> > > htmlParse(a, encoding = "CP1251")
> > >
> > > htmlParse(a, encoding = "cp1251")
> > >
> > > htmlParse(a, encoding = "iso8859-5")
> > >
> > >
> > >
> > > This is my locale:
> > >
> > >
> > >
> > > Sys.getlocale()
> > >
> > > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> > >
> > >
> > >
> > > Any suggestions?
> >
> > What does Encoding(a) say?
> >
> >
> > (FWIW, here on Linux even a is not in the correct encoding:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> > "http://www.w3.org/TR/REC-html40/loose.dtd">
> > <html><head>
> > <title>ГЉГіГЇГЁГІГј îäГîêîìГГ ГІГГіГѕ ГЄГўГ
> ðòèð
> > Гі Гў ГЊГ®Г
> > ±ГЄГўГҐ В— 11430 îáúÿâëåГГЁГ© Г® ïðîäГ
> æå îäГ
> > îêîìГ
> > Г ГІГûõ êâà ðòèð</title>
> > [...])
> >
> >
> > Regards
> >
> >
> > > Thank you very much in advance,
> > >
> > > Lavrentiy Eskin
> >
> > > <http://www.eng.nvg.ru>
> > >