[R] puzzle using gsub (and encodings maybe)
Adrian Dragulescu
adrian_d at eskimo.com
Wed Oct 14 20:29:07 CEST 2009
Thank you.
If I match the soft hyphen byte explicitly,
> gsub(" \xad", "-", x)
[1] "NEW YORK-NEW ENGLAND"
I get what I want.
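
For the archive, a minimal sketch that reproduces this without the pdf file. It rebuilds x from the charToRaw(x) bytes quoted further down and does the replacement on raw bytes. The use of fixed = TRUE and useBytes = TRUE is my assumption for making it behave the same outside a cp1252 locale; the names pat and fixed_x are only illustrative.

```r
# Rebuild x from the bytes shown by charToRaw(x); byte 0xad is the
# cp1252 soft hyphen that pdftotext left in the string.
x <- rawToChar(as.raw(c(0x4e, 0x45, 0x57, 0x20, 0x59, 0x4f, 0x52, 0x4b,
                        0x20, 0xad,
                        0x4e, 0x45, 0x57, 0x20, 0x45, 0x4e, 0x47, 0x4c,
                        0x41, 0x4e, 0x44)))

# Pattern: a space followed by the soft hyphen byte, built from raw
# values so it does not depend on how the parser treats "\xad".
pat <- rawToChar(as.raw(c(0x20, 0xad)))

# Literal, byte-wise replacement (assumption: avoids locale-dependent
# interpretation of the non-ASCII byte).
fixed_x <- gsub(pat, "-", x, fixed = TRUE, useBytes = TRUE)
print(fixed_x)
```

The typed string y already contains a real hyphen (byte 2d), which is why gsub(" -", "-", y) worked and the same call on x did not.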
Adrian
> sessionInfo()
R version 2.9.2 (2009-08-24)
i386-pc-mingw32
locale:
LC_COLLATE=English_United States.1252
LC_CTYPE=English_United States.1252
LC_MONETARY=English_United States.1252
LC_NUMERIC=C
LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
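
P.S. A small sketch of the charToRaw() check suggested in the thread, in case it helps someone find this kind of invisible character. It assumes the string is held as single bytes, as in my cp1252 session, and the names bytes and pos are only illustrative.

```r
# Same bytes as reported by charToRaw(x) in the quoted messages.
x <- rawToChar(as.raw(c(0x4e, 0x45, 0x57, 0x20, 0x59, 0x4f, 0x52, 0x4b,
                        0x20, 0xad,
                        0x4e, 0x45, 0x57, 0x20, 0x45, 0x4e, 0x47, 0x4c,
                        0x41, 0x4e, 0x44)))

bytes <- charToRaw(x)

# Positions of bytes outside 7-bit ASCII; these are the candidates for
# soft hyphens, non-breaking spaces and similar characters.
pos <- which(as.integer(bytes) > 127L)

print(pos)          # byte positions of the non-ASCII bytes
print(bytes[pos])   # the offending byte values
```

Any position reported here is a byte that a plain "-" or " " in a pattern will not match.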
On Wed, 14 Oct 2009, Prof Brian Ripley wrote:
> On Wed, 14 Oct 2009, Adrian Dragulescu wrote:
>
>>> charToRaw(x)
>> [1] 4e 45 57 20 59 4f 52 4b 20 ad 4e 45 57 20 45 4e 47 4c 41 4e 44
>>> charToRaw(y)
>> [1] 4e 45 57 20 59 4f 52 4b 20 2d 4e 45 57 20 45 4e 47 4c 41 4e 44
>>>
>>
>> So they are different.
>
> We really do need the 'at a minimum' information we asked you for in the
> posting guide. But in cp1252 (a guess as to what you might be using) \xad is
> a 'soft hyphen', and that is not the same thing as a hyphen -- you will get
> the same issues with 'non-breaking space'.
>
> BDR
>
>>
>> Adrian
>>
>> I use R 2.8.1 on WinXP
>>
>>
>> On Wed, 14 Oct 2009, Duncan Murdoch wrote:
>>
>>> On 10/14/2009 1:30 PM, Adrian Dragulescu wrote:
>>>> Hello,
>>>>
>>>> Below is some output that shows my issue.
>>>>
>>>> I have a variable x that I read from a file (more on this below)
>>>>
>>>>> x
>>>> [1] "NEW YORK NEW ENGLAND"
>>>>> gsub(" -", "-", x) # this does not work!
>>>> [1] "NEW YORK NEW ENGLAND"
>
> Well, I see no hyphen at all here, but then I am not on Windows.
>
>>> It looks as though it worked, presumably because something got lost in
>>> your email.
>>>
>>> Could you post charToRaw(x) so we can see what's in x?
>>>
>>> Duncan Murdoch
>>>
>>>>> Encoding(x) # is x in a special encoding? no
>>>> [1] "unknown"
>>>>> y = "NEW YORK -NEW ENGLAND" # I type in variable y
>>>>> gsub(" -", "-", y) # and gsub works as expected
>>>> [1] "NEW YORK-NEW ENGLAND"
>>>>>
>>>>
>>>> I'm sure the problem has to do with the way I read the variable x. But
>>>> even if I change the encoding for x to ASCII, I still cannot do the sub.
>>>> I get x by reading a pdf file with pdftotext so you will not be able to
>>>> replicate my issue.
>>>>
>>>> Thanks for any suggestions,
>>>> Adrian
>
> --
> Brian D. Ripley, ripley at stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>