[R] read.spss and umlaut

Thu Aug 3 15:34:04 CEST 2006

On Thu, 3 Aug 2006, Thomas Kuster wrote:

> Hello
>
> On Wednesday, 2 August 2006 at 17:11, Thomas Lumley wrote:
>> This sounds like a conflict between encodings -- e.g. if R is assuming UTF-8
>> and the file is encoded in Latin-1 then the sequence
>> U+00FC : LATIN SMALL LETTER U WITH DIAERESIS
>> U+0072 : LATIN SMALL LETTER R
>> is coded as FC72 in the file, which is an illegal byte sequence in UTF-8.
>
> Hex:  74 65 20 66 fc 72 20 61 6c 6c 65 53 45 2f 31 36
> Text:  t  e     f  ü  r     a  l  l  e  S  E  /  1  6

Ok, so that looks like Latin-1 encoding in the file.
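
As a rough check of that reading, the bytes from the hex dump can be rebuilt in R and tested both ways. This is only a sketch: the variable names are made up for illustration, and validUTF8() needs a reasonably recent R (3.3.0 or later).

```r
# Bytes copied from the hex dump quoted above
bytes <- as.raw(c(0x74, 0x65, 0x20, 0x66, 0xfc, 0x72, 0x20, 0x61,
                  0x6c, 0x6c, 0x65, 0x53, 0x45, 0x2f, 0x31, 0x36))
x <- rawToChar(bytes)

# FC followed by 72 is not a legal UTF-8 sequence
is_valid_utf8 <- validUTF8(x)

# Declaring the input as Latin-1 lets iconv() re-encode it as UTF-8;
# the single byte FC becomes the two bytes C3 BC
from_latin1 <- iconv(x, from = "latin1", to = "UTF-8")
umlaut_bytes <- charToRaw(from_latin1)[5:6]
```

If is_valid_utf8 is FALSE and the conversion from "latin1" succeeds, the file content is consistent with Latin-1.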

>> The underlying C code (being written in the US quite a long time ago)
>> doesn't know about encodings, and I don't know what the rules are in SPSS
>> for valid characters (I suspect that in these old portable file formats it
>> probably just reads and writes bytes, leaving it up to the OS to interpret
>> them).
>
> But why does the C code stop reading? Is "/" not the end mark of the string?
> What is the problem if I change that in the source?

You haven't shown anything that indicates that the C code stopped reading.
More likely R just stops displaying when it gets to an illegal byte
sequence.  You could use nchar(x, type = "bytes") to count the bytes in the
string to find out.
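
Something like the following would show the difference between counting bytes and counting characters. It is a sketch on a hand-built three-byte string ("für" in Latin-1), not on the actual value returned by read.spss(), which we have not seen.

```r
# "für" as Latin-1 bytes: 66 FC 72
x <- rawToChar(as.raw(c(0x66, 0xfc, 0x72)))

# Byte count does not depend on the locale or on whether the bytes are valid
n_bytes <- nchar(x, type = "bytes")

# Character count depends on the locale; in a UTF-8 locale the string is
# invalid, and allowNA = TRUE makes nchar() return NA instead of an error
n_chars <- nchar(x, type = "chars", allowNA = TRUE)
```

If the byte count of the real string matches the length in the file, the C code read everything and only the display is truncated.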

>> You could try running R in a non-UTF-8 locale to see if it helps.
>
> I think my locale is non-UTF-8 (de_CH, isolatin). How can I check that, and
> set another one temporarily?

You can use charToRaw() to see what R thinks the byte sequence is for a 
word with a u-umlaut.
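
For comparison, here is a sketch of what the two encodings look like for the same word. The variable names are illustrative only; the last line shows whatever the running session uses natively, so its output will differ between machines.

```r
# The word "für", written with a Unicode escape so the script itself is ASCII
u_utf8 <- enc2utf8("f\u00fcr")

# The same word re-encoded as Latin-1
u_latin1 <- iconv(u_utf8, from = "UTF-8", to = "latin1")

raw_utf8   <- charToRaw(u_utf8)    # 66 c3 bc 72
raw_latin1 <- charToRaw(u_latin1)  # 66 fc 72

# Bytes in the session's native encoding (depends on the locale)
raw_native <- charToRaw(enc2native(u_utf8))
```

A single FC byte for the umlaut points to a Latin-1 session, and the pair C3 BC points to UTF-8.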

Sys.setlocale() will let you change the locale, but your locale does look 
non-UTF-8.
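
A minimal sketch of checking and temporarily switching the character-type locale follows. The locale name "de_CH.ISO-8859-1" is an assumption: names vary between systems and the locale may not be installed, in which case Sys.setlocale() returns "" and nothing changes.

```r
# Current character-type locale, and whether R regards it as UTF-8
old_ctype <- Sys.getlocale("LC_CTYPE")
is_utf8   <- l10n_info()[["UTF-8"]]

# Try a Latin-1 locale (hypothetical name, may not exist on this system)
new_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", "de_CH.ISO-8859-1"))
if (identical(new_ctype, "")) {
  message("Requested locale is not available; still using: ", old_ctype)
}

# ... read the file and inspect the strings here ...

# Restore the original setting
restored <- Sys.setlocale("LC_CTYPE", old_ctype)
```

The change only lasts for the current R session.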

This is all guesswork since we can't see the file.

 	-thomas