[R] Encoding issue

Sebastien Bihorel
Mon Nov 5 14:36:13 CET 2018


Hi,

I get different output when processing the same markdown files on two Linux systems (a laptop running Linux Mint 18.3 and a production server running CentOS 7). I think this boils down to an encoding issue, but I am not sure whether it is a system-wide issue or an R issue. Here is what I have so far.

I have a very small dummy HTML file (with the same md5sum on both systems) which contains only 3 characters followed by a newline (6 bytes in total). An "od -cx" call gives the same output on both systems:
0000000   r 342 200 231   s  \n
           e272    9980    0a73

The middle character is a typographic (curly) single quote produced when a ' character is converted from markdown to HTML. Reading the same file on both systems and applying a gsub replacement gives very different results.
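
The bytes reported by od can also be checked from within R. The octal values 342 200 231 are E2 80 99 in hex, which is the UTF-8 encoding of U+2019 (RIGHT SINGLE QUOTATION MARK). The following is a minimal sketch, assuming a temporary file as a stand-in for test.html; the variable names are only illustrative.

```r
# Recreate the 6-byte test file from the od output: r, E2 80 99, s, newline.
bytes <- as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73, 0x0a))
path <- tempfile(fileext = ".html")  # hypothetical stand-in for test.html
writeBin(bytes, path)

# Read the raw bytes back; this does not depend on the locale or any encoding setting.
raw_in <- readBin(path, what = "raw", n = file.size(path))
print(raw_in)

# Decode the three middle bytes explicitly as UTF-8 and report the code point.
quote_chr <- rawToChar(raw_in[2:4])
Encoding(quote_chr) <- "UTF-8"
code_point <- utf8ToInt(quote_chr)
print(sprintf("U+%04X", code_point))
```

Running this on both systems should show identical bytes and the same code point, which would confirm that the file itself is not the source of the difference.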

####On my laptop
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "r’s"
> gsub('\\s{2,}', ' ', x)
[1] "r’s"
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.4

####On the server
# environment variable: echo $LANG: en_US.UTF-8
> x <- scan('test.html', what='character', sep='\n')
Read 1 item
> x
[1] "râs"
> gsub('\\s{2,}', ' ', x)
[1] " "
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.3

(The overarching issue is that I have to use the production server for SOP reasons, so I cannot simply ignore the problem and use my laptop).

I would appreciate any suggestions on how to approach this issue.
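
For anyone trying to reproduce or narrow this down, here is a sketch of checks that could be run on both systems. It is not a confirmed fix. It assumes that declaring the input as UTF-8 when reading, and using the PCRE engine in gsub, are reasonable things to compare; the temporary file again stands in for test.html.

```r
# Recreate the test file so the example is self-contained.
path <- tempfile(fileext = ".html")  # hypothetical stand-in for test.html
writeBin(as.raw(c(0x72, 0xe2, 0x80, 0x99, 0x73, 0x0a)), path)

# What R itself believes about the session locale (may differ from $LANG).
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info())

# Read the file while declaring its content to be UTF-8.
x <- readLines(path, encoding = "UTF-8")

# Inspect how the string is marked and whether its bytes are valid UTF-8.
print(Encoding(x))
print(validUTF8(x))
print(charToRaw(x))
n_chars <- nchar(x, type = "chars")
n_bytes <- nchar(x, type = "bytes")
print(c(chars = n_chars, bytes = n_bytes))

# Same substitution as in the post, but with the PCRE engine for comparison.
cleaned <- gsub("\\s{2,}", " ", x, perl = TRUE)
print(charToRaw(cleaned))
```

If the two systems disagree on any of these values, that should help show whether the difference comes from the session locale, from how the string is marked when read, or from the regular expression engine.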



