[R] read.spss and umlaut
Thomas Lumley
tlumley at u.washington.edu
Thu Aug 3 15:34:04 CEST 2006
On Thu, 3 Aug 2006, Thomas Kuster wrote:
> Hello
>
> On Wednesday, 2 August 2006 17:11, Thomas Lumley wrote:
>> This sounds like a conflict between encodings. For example, if R is
>> assuming UTF-8 and the file is encoded in Latin-1, then the sequence
>> U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
>> U+0072 : LATIN SMALL LETTER R
>> is coded as the bytes FC 72 in the file, which is an illegal byte
>> sequence in UTF-8.
>
> Hex: 74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
> Text: t e f ü r a l l e S E / 1 6
OK, so that looks like Latin-1 encoding in the file: the single byte fc is
the Latin-1 code for ü.
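As a rough sketch of the point above, the first bytes of that hex dump can be
rebuilt in R and checked. The variable names are made up for illustration, and
validUTF8() is only available in newer versions of R, so treat this as an
assumption about your setup rather than a recipe:

```r
# First 11 bytes of the hex dump: "te f\xfcr alle" as stored in the file
bytes <- as.raw(c(0x74, 0x65, 0x20, 0x66, 0xfc, 0x72,
                  0x20, 0x61, 0x6c, 0x6c, 0x65))
s <- rawToChar(bytes)

# The byte fc followed by 72 is not a legal UTF-8 sequence
validUTF8(s)

# Treating the bytes as Latin-1 and converting gives valid UTF-8
s_utf8 <- iconv(s, from = "latin1", to = "UTF-8")
validUTF8(s_utf8)

# After conversion the u-umlaut is stored as the two bytes c3 bc
charToRaw(s_utf8)
```

If the first check is FALSE and the second is TRUE, the data are intact and
only the declared encoding is wrong.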
>> The underlying C code (being written in the US quite a long time ago)
>> doesn't know about encodings, and I don't know what the rules are in SPSS
>> for valid characters. (I suspect that in these old portable file formats it
>> probably just reads and writes bytes, leaving it up to the OS to interpret
>> them.)
>
> But why does the C code stop reading? Is "/" not the end mark of the string?
> What is the problem if I change that in the source?
You haven't shown anything that indicates that the C code stopped reading.
More likely R just stops displaying when it gets to an illegal byte
sequence. You could use nchar() with type = "bytes" to count the bytes in
the string to find out.
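Something like the following would show the difference. Here x is a
hypothetical stand-in for one of the strings returned by read.spss(), built
by hand from the Latin-1 bytes for "für":

```r
# Stand-in for an imported string: Latin-1 bytes 66 fc 72 ("für")
x <- rawToChar(as.raw(c(0x66, 0xfc, 0x72)))

# Counting bytes does not depend on the locale or on the encoding being valid
n_bytes <- nchar(x, type = "bytes")
n_bytes   # 3

# Counting characters does depend on the locale; with allowNA = TRUE an
# invalid multibyte string gives NA instead of an error
n_chars <- nchar(x, type = "chars", allowNA = TRUE)
n_chars
```

If the byte count covers the whole string, the C code read everything and
only the display is being cut short.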
>> You could try running R in a non-UTF-8 locale to see if it helps.
>
> I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that, and
> set another one temporarily?
You can use charToRaw() to see what R thinks the byte sequence is for a
word with a u-umlaut.
Sys.setlocale() will let you change the locale, but your locale does look
non-UTF-8.
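A minimal sketch of those checks follows. The locale name in the
Sys.setlocale() call is an assumption: names differ between systems, and the
call only gives a warning and returns "" if that locale is not installed.

```r
# What locale is R using for character handling?
old <- Sys.getlocale("LC_CTYPE")
old

# l10n_info() reports whether the locale is multibyte, UTF-8 or Latin-1
info <- l10n_info()
info[["UTF-8"]]
info[["Latin-1"]]

# Bytes of a word with a u-umlaut, first as UTF-8, then in the native encoding
utf8_bytes   <- charToRaw("f\u00fcr")              # 66 c3 bc 72
native_bytes <- charToRaw(enc2native("f\u00fcr"))  # 66 fc 72 in a Latin-1 locale

# Temporarily switch locale (hypothetical name), then restore the old one
Sys.setlocale("LC_CTYPE", "de_CH.ISO8859-1")
Sys.setlocale("LC_CTYPE", old)
```

If native_bytes shows a single fc for the umlaut, R is working in Latin-1 and
should agree with the file.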
This is all guesswork since we can't see the file.
-thomas