[R] Getting htmlParse to work with Hebrew? (on windows)
Milan Bouchet-Valat
nalimilan at club.fr
Fri Feb 22 17:04:10 CET 2013
On Thursday 21 February 2013 at 18:53 +0400, Lawr Eskin wrote:
> I tried iconv before, in various ways: same issue, and the result has
> encoding = unknown.
> Now I tried sub: same issue.
This procedure works on Linux, but not on Windows:
library(RCurl)
library(XML)
u <- "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
a <- getURL(u, .encoding="UTF-8")
a <- iconv(a, "windows-1251", "UTF-8")
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
a2
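To look at the conversion step in isolation, without the network or the parser, here is a minimal sketch. It assumes that the local iconv knows the name "CP1251" (an alias I picked for windows-1251; names vary between platforms), and the three bytes are a hand-built example, not data from the page above.

```r
# Bytes of the Cyrillic string "Все" as encoded in windows-1251
# (В = 0xC2, с = 0xF1, е = 0xE5). Hand-built illustration only.
raw1251 <- as.raw(c(0xC2, 0xF1, 0xE5))
s <- rawToChar(raw1251)
Encoding(s)    # R has no encoding mark for these bytes yet

# Convert explicitly. "CP1251" is assumed to be a name the local iconv accepts.
s_utf8 <- iconv(s, from = "CP1251", to = "UTF-8")
Encoding(s_utf8)     # the result of iconv(..., to = "UTF-8") carries a UTF-8 mark
charToRaw(s_utf8)    # inspect the bytes rather than trusting the console display
```

Comparing charToRaw() before and after is a way to tell whether a string is really wrong or only displayed wrongly by the console.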
But maybe the problem is more general, and related to conversion between
encodings on Windows. What looks weird to me is that on Windows, I'm not
able to save a character string to a file in UTF-8, despite what ?file
says:
x <- "Все права защищены"
Encoding(x)
# UTF-8
con <- file("foo", "w", encoding="UTF-8"); cat(x, file=con); close(con)
con <- file("foo", "r", encoding="UTF-8"); x2 <- readLines(con); close(con)
Encoding(x2)
# unknown
x2
# [1] "<U+041A><U+0443>..."
I know the problem happens on write because the file cannot be read
correctly on Linux either.
This Windows machine uses Windows Server 2008 with French_France.1252
locale.
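A possible way around the re-encoding on write, as a sketch only: convert the string to UTF-8 yourself and ask R to write the bytes untouched, then declare the encoding when reading back. The temporary file name is hypothetical, and I have not checked that this avoids the problem on the Windows machine described above.

```r
# "Все права", written with \u escapes so the script itself is locale-independent
x <- "\u0412\u0441\u0435 \u043f\u0440\u0430\u0432\u0430"
path <- tempfile(fileext = ".txt")   # hypothetical file name

# Write the UTF-8 bytes as they are: no 'encoding' argument on the connection,
# and useBytes = TRUE so writeLines() does not translate to the native encoding.
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read back and mark the result as UTF-8 instead of re-encoding it.
x2 <- readLines(path, encoding = "UTF-8")
Encoding(x2)
identical(x2, x)
```

The difference from file(..., encoding="UTF-8") is that nothing is converted through the locale encoding here; the strings are only marked.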
> 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
> On Thursday 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> > Hi Milan,
> >
> > a <- getURL(con, .encoding = "UTF-8")
> > Encoding(a)
> > > [1] "UTF-8"
> > a # Here the UTF-8 codes look fine.
> > htmlParse(a, encoding = "UTF-8") ### again the same encoding issue
>
> And what if you try this:
> a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
>
> or this:
> a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
>
>
> Cheers
>
>
> > >> why didn't getURL() detect and set a's encoding correctly?
> > I think it is an issue with this page, because other sites work fine.
> >
> > 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
> > On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > > Hi Milan!
> > >
> > >
> > > > Encoding(a)
> > > [1] "unknown"
> >
> > Hm, here I get "UTF-8", which is my locale encoding.
> >
> > I've tried a little more, and I discovered that using
> > a <- getURL(u, .encoding="UTF-8")
> > ensures that a is in the correct encoding here. I know this is not your
> > problem, but it might help: check whether Encoding(a) is set to "UTF-8"
> > or not in that case, and whether this fixes things.
> >
> > I'm not sure how htmlParse() detects the encoding when you pass it a
> > character vector, but it probably uses Encoding(a), since that's the
> > only reliable information; if it is missing, maybe it falls back to what
> > the contents of the file say (maybe even before what the "encoding"
> > argument says), which is windows-1251, and may not be the encoding in
> > which getURL() saved the character vector. The question would then be:
> > why didn't getURL() detect and set a's encoding correctly?
> >
> >
> > My two cents
> >
> >
> > > 2013/2/21 Milan Bouchet-Valat <nalimilan at club.fr>
> > > On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > > > Hello dear R-help mailing list.
> > > >
> > > >
> > > > Looks like the same issue in Russian:
> > > >
> > > >
> > > >
> > > > library(RCurl)
> > > > library(XML)
> > > > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > > > a = getURL(u)
> > > > a # Here - the Russian is fine.
> > > > a2 <- htmlParse(a)
> > > > a2 # Here it is a mess...
> > > >
> > > >
> > > >
> > > > None of these seem to fix it:
> > > >
> > > >
> > > >
> > > > htmlParse(a, encoding = "windows-1251")
> > > >
> > > > htmlParse(a, encoding = "CP1251")
> > > >
> > > > htmlParse(a, encoding = "cp1251")
> > > >
> > > > htmlParse(a, encoding = "iso8859-5")
> > > >
> > > >
> > > >
> > > > This is my locale:
> > > >
> > > >
> > > >
> > > > Sys.getlocale()
> > > > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> > > >
> > > >
> > > >
> > > > Any suggestions?
> > >
> > > What does Encoding(a) say?
> > >
> > >
> > > (FWIW, here on Linux even a is not in the correct encoding:
> > > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> > > "http://www.w3.org/TR/REC-html40/loose.dtd">
> > > <html><head>
> > > <title>ГЉГіГЇГЁГІГј îäГîêîìГГ ГІГ
> ГіГѕ ГЄГўГ
> > ðòèð
> > > Гі Гў ГЊГ®Г
> > > ±ГЄГўГҐ В— 11430 îáúÿâëåГГЁГ© Г®
> ïðîäГ
> > æå îäГ
> > > îêîìГ
> > > Г ГІГûõ êâà ðòèð</title>
> > > [...])
> > >
> > >
> > > Regards
> > >
> > >
> > > > Thank you very much in advance,
> > > >
> > > > Lavrentiy Eskin
> > >
> > > > <http://www.eng.nvg.ru>
> > > >