[R] Failed to convert data to numeric
Richard O'Keefe
r@oknz @end|ng |rom gm@||@com
Mon Mar 3 23:42:30 CET 2025
This is not for the R inferno.
This is for the Microsoft inferno, or perhaps the Unicode inferno.
The Byte Order Mark (U+FEFF) is supposed to appear at the beginning of
UTF-32 or UTF-16 *external* data, like a file or data coming over a
socket.
In the Microsoft world, it also tends to appear at the beginning of
UTF-8 files, where, strictly speaking, it shouldn't.
ONLY at the beginning does ZWNBSP (ZERO WIDTH NO-BREAK SPACE, the same
code point) have this function; anywhere else it is an ordinary character.
I use a lot of programming languages, and I don't know any that
routinely ignores ZWNBSP.
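
To make that concrete, here is a minimal sketch in R. The values are
made up for illustration and are not the poster's data. It shows the
three bytes U+FEFF becomes in UTF-8, and that as.numeric() does not
skip the character whether it sits in front of the digits or between
them.

```r
# U+FEFF (BOM / ZWNBSP), built from its code point so the example
# does not depend on how the source file itself is encoded.
bom <- intToUtf8(0xFEFF)

# Encoded as UTF-8, the character is three bytes.
bom_bytes <- charToRaw(bom)
print(bom_bytes)

# Made-up inputs: the character in front, in the middle, and absent.
x <- c(paste0(bom, "12.5"), paste0("12", bom, ".5"), "12.5")

# as.numeric() warns and yields NA when it cannot parse a string;
# the warning is suppressed here to keep the output short.
res <- suppressWarnings(as.numeric(x))
print(res)
```

Only the last element should come back as a number; the other two
should be NA.
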
Hmm. I wonder if the strings in this example are fields of a data
file but were originally in a different order, with the last string
first?
What *would* make sense would be an option, when opening a connection,
to skip a leading BOM.
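
A rough sketch of what skipping a leading BOM could look like in R.
The helper name strip_leading_bom() and the sample values are
hypothetical, invented for this illustration. As far as I can tell
from ?file, a connection opened for reading with
encoding = "UTF-8-BOM" already drops a BOM if one is present, so the
second half of the sketch relies on that documented behaviour rather
than on anything new.

```r
bom <- intToUtf8(0xFEFF)

# Hypothetical helper: drop U+FEFF only when it is the very first
# character of a string; occurrences elsewhere are left alone.
strip_leading_bom <- function(x) sub(paste0("^", bom), "", x)

first  <- paste0(bom, "12.5")      # BOM in front: removed
middle <- paste0("12", bom, ".5")  # BOM inside: not touched

leading_value <- as.numeric(strip_leading_bom(first))
middle_value  <- suppressWarnings(as.numeric(strip_leading_bom(middle)))

# Connection-level alternative: write a small UTF-8 file that starts
# with a BOM (made-up contents), then read it back through a
# connection that is asked to skip the BOM.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw(bom), charToRaw("12.5\n3e2\n")), tmp)
con <- file(tmp, encoding = "UTF-8-BOM")
file_values <- as.numeric(readLines(con))
close(con)
unlink(tmp)

print(leading_value)
print(middle_value)
print(file_values)
```

The helper deliberately leaves an interior ZWNBSP in place, so a
string like the second one still converts to NA and the problem stays
visible instead of being silently papered over.
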
On Tue, 4 Mar 2025 at 10:45, Rolf Turner <rolfturner using posteo.net> wrote:
>
>
> This issue looks like grist for the R Inferno.
>
> cheers,
>
> Rolf
>
>
> On Mon, 3 Mar 2025 12:19:02 -0500
> <avi.e.gross using gmail.com> wrote:
>
> > The second solution Ivan offers looks good, and a bit more general
> > than his first that simply removes one non-visible character.
> >
> > It raises the question of why the data has that anomaly at all. Did the
> > data come from a text-processing environment where it was going to
> > wrap there and was protected?
> >
> > As Ivan points out, there is a question of what format you expect
> > numbers in and what "as.numeric" should do when it does not see an
> > integer or floating point number.
> >
> > If you test it, you can see that as.numeric ignores leading and/or
> > trailing blanks and tabs and even newlines sometimes and some other
> > irrelevant ASCII characters. In that spirit, the UNICODE character
> > being mentioned should be one that any UNICODE-aware version of
> > as.numeric should ignore.
> >
> > But UNICODE supports a much wider vision of numeric so that there are
> > numeric-equivalent symbols in other languages and groupings and even
> > something like the symbols for numerals in light or dark circles
> > count as numbers. Those can likely safely be excluded in this context
> > but perhaps not in a more general function.
> >
> > But I note as.numeric seems to handle scientific notation as in:
> >
> > > as.numeric("1.23e8")
> > [1] 1.23e+08
> >
> > So a single instance of the letters "e" and "E" must be supported if
> > your numbers in string form may contain them. Further, the E cannot
> > be the first or last letter. It cannot have adjacent whitespace.
> > Still, if you are OK with getting an NA in such situations, it should
> > be OK.
> >
> > It gets worse. Hexadecimal is supported:
> >
> > > as.numeric("0X12")
> > [1] 18
> >
> > You now need to support the letters x and X. But only if preceded by
> > a zero!
> >
> > It gets still worse as any characters from [0-9A-Fa-f] are supported:
> >
> > > as.numeric("0xAE")
> > [1] 174
> >
> > There may be other scenarios it handles. The filter applied might
> > remove valid numbers so you may want to carefully document it if your
> > program only handles a restricted set.
> >
> > A possible idea might be to make two passes and only evaluate any
> > resulting NA from as.numeric() by doing a substitution like Ivan
> > suggests to try to fix any broken ones. But note it may fix too much
> > as "1.2 e 5" might become "1.2e5" as spaces are removed.
> >
> > -----Original Message-----
> > From: R-help <r-help-bounces using r-project.org> On Behalf Of Ivan Krylov via R-help
> > Sent: Monday, March 3, 2025 3:09 AM
> > To: Christofer Bogaso <bogaso.christofer using gmail.com>
> > Cc: r-help <r-help using r-project.org>
> > Subject: Re: [R] Failed to convert data to numeric
> >
> > On Mon, 3 Mar 2025 13:21:31 +0530
> > Christofer Bogaso <bogaso.christofer using gmail.com> wrote:
> >
> > > Is there any way to remove all possible "Unicode characters" that may
> > > be present in the array at once?
> >
> > Define a range of characters you consider acceptable, and you'll be
> > able to use regular expressions to remove everything else. For
> > example, the following expression should remove everything except
> > ASCII digits, dots, and hyphen-minus:
> >
> > gsub('[^0-9.-]+', '', dat2)
> >
> > There is a brief introduction to regular expressions in ?regex and
> > various online resources such as <https://regex101.com/>.
> >
>
>
>
> --
> Honorary Research Fellow
> Department of Statistics
> University of Auckland
> Stats. Dep't. (secretaries) phone:
> +64-9-373-7599 ext. 89622
> Home phone: +64-9-480-4619
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.