[R] Potential R bug in identical
Ivan Krylov
kry|ov@r00t @end|ng |rom gm@||@com
Thu Jan 17 22:32:05 CET 2019
On Thu, 17 Jan 2019 21:05:07 +0000
Layik Hama <L.Hama using leeds.ac.uk> wrote:
> Why would `identical(str, "Accident_Index", ignore.case = TRUE)`
> behave differently on Linux/MacOS vs Windows?
Because str is different from "Accident_Index" on Windows: the bytes of
the file were decoded into characters according to different rules when
the file was read.
The default encoding for files being read is specified by the 'encoding'
option. On both Windows and Linux I get:
> options('encoding')
$encoding
[1] "native.enc"
About this value, ?file says (in section "Encoding"):
>> ‘""’ and ‘"native.enc"’ both mean the ‘native’ encoding, that is the
>> internal encoding of the current locale and hence no translation is
>> done.
R on Linux usually runs in a UTF-8 locale (AFAIK, on macOS too), so it
decodes files as UTF-8 by default:
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
locale:
[1] LC_CTYPE=ru_RU.utf8 LC_NUMERIC=C
[3] LC_TIME=ru_RU.utf8 LC_COLLATE=ru_RU.utf8
[5] LC_MONETARY=ru_RU.utf8 LC_MESSAGES=ru_RU.utf8
[7] LC_PAPER=ru_RU.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=ru_RU.utf8 LC_IDENTIFICATION=C
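To check which case a given session falls into without reading the whole sessionInfo() output, something like the following should work. It is only a sketch using base R's l10n_info() and Sys.getlocale(); the variable names are mine.

```r
# Ask R about the character set of the current locale.
info <- l10n_info()

# TRUE in a UTF-8 locale (typical on Linux and macOS), FALSE otherwise.
is_utf8 <- isTRUE(info[["UTF-8"]])

# The LC_CTYPE category is what determines the native encoding.
ctype <- Sys.getlocale("LC_CTYPE")

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", is_utf8, "\n")
```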
On Windows, by contrast, R uses a single-byte encoding that depends on
the locale (here CP1251):
> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251
[3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C
[5] LC_TIME=Russian_Russia.1251
> readLines('test.txt')[1]
[1] "п»їAccident_Index"
> nchar(readLines('test.txt')[1])
[1] 17
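The three extra characters are the UTF-8 byte order mark (bytes EF BB BF) read as if they were CP1251 text. A small sketch that should reproduce this on any platform, assuming the local iconv knows the name "CP1251":

```r
# The UTF-8 encoding of the byte order mark U+FEFF is EF BB BF.
bom_bytes <- as.raw(c(0xEF, 0xBB, 0xBF))

# Pretend these bytes are CP1251 text and convert them to UTF-8.
# (Assumption: the iconv implementation accepts the name "CP1251".)
misread <- iconv(rawToChar(bom_bytes), from = "CP1251", to = "UTF-8")

# Three visible characters instead of one invisible mark.
nchar(misread, type = "chars")   # 3
utf8ToInt(misread)               # 1087 187 1111, i.e. U+043F U+00BB U+0457
```

That accounts for 17 = 3 + 14 characters in the output above.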
R on Windows can be explicitly told to decode the file as UTF-8:
> nchar(readLines(file('test.txt',encoding='UTF-8'))[1])
[1] 15
The first of these 15 characters is the invisible byte order mark (U+FEFF).
Thankfully, there is an easy fix for that, too. ?file additionally
says:
>> As from R 3.0.0 the encoding ‘"UTF-8-BOM"’ is accepted for
>> reading and will remove a Byte Order Mark if present (which it
>> often is for files and webpages generated by Microsoft applications).
So this is how we get the 14-character column name we wanted:
> nchar(readLines(file('test.txt',encoding='UTF-8-BOM'))[1])
[1] 14
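For anyone without the original file, the effect of "UTF-8-BOM" can be checked on a throwaway file. This is a sketch: the temporary file and the helper read_first_line() are made up for the illustration, and the connection is closed explicitly, which the one-liners above skip.

```r
# Create a small UTF-8 file that starts with a byte order mark.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("Accident_Index\nxyz\n")),
         path)

# Hypothetical helper: read the first line using a given encoding
# and make sure the connection is closed afterwards.
read_first_line <- function(path, encoding) {
  con <- file(path, encoding = encoding)
  on.exit(close(con))
  readLines(con, n = 1)
}

first <- read_first_line(path, "UTF-8-BOM")
nchar(first)                          # 14
identical(first, "Accident_Index")    # TRUE

unlink(path)
```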
For our original task, this means:
> names(read.csv('Acc.csv'))[1] # might produce incorrect results
[1] "п.їAccident_Index"
> names(read.csv('Acc.csv', fileEncoding='UTF-8-BOM'))[1] # correct
[1] "Accident_Index"
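If it is not known in advance whether a file carries a BOM, the first three bytes can be inspected before choosing fileEncoding. A rough sketch; has_utf8_bom() is a name I made up, and the CSV contents are invented for the demonstration:

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  head_bytes <- readBin(path, what = "raw", n = 3L)
  identical(head_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Invented two-column CSV with a BOM in front of the header.
csv <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("Accident_Index,Year\nA1,2017\n")),
         csv)

bom_found <- has_utf8_bom(csv)   # TRUE for this file

# "UTF-8-BOM" is safe either way: the mark is removed only if present.
acc <- read.csv(csv, fileEncoding = "UTF-8-BOM")
identical(names(acc)[1], "Accident_Index")   # TRUE

unlink(csv)
```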
--
Best regards,
Ivan