[R] Failed to convert data to numeric
Rolf Turner
Mon Mar 3 22:45:14 CET 2025
This issue looks like grist for the R Inferno.
cheers,
Rolf
On Mon, 3 Mar 2025 12:19:02 -0500
<avi.e.gross using gmail.com> wrote:
> The second solution Ivan offers looks good, and a bit more general
> than his first that simply removes one non-visible character.
>
> It raises the question of why the data has that anomaly at all. Did
> the data come from a text-processing environment where the text was
> going to wrap at that point and was protected against it?
>
> As Ivan points out, there is a question of what format you expect
> numbers in and what "as.numeric" should do when it does not see an
> integer or floating point number.
>
> If you test it, you can see that as.numeric ignores leading and/or
> trailing blanks and tabs and even newlines sometimes and some other
> irrelevant ASCII characters. In that spirit, the Unicode character
> being mentioned is one that any Unicode-aware version of as.numeric
> should ignore.
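
A minimal sketch of the whitespace behaviour described above. The
left-to-right mark U+200E is only an assumed stand-in for the invisible
character; this part of the thread does not say which character was
actually in the data.

```r
# Leading and trailing ASCII whitespace is tolerated by as.numeric().
a <- as.numeric(" 42 ")      # 42
b <- as.numeric("\t42\n")    # 42

# An invisible, non-whitespace Unicode character is not tolerated.
# U+200E is an assumption used for illustration only.
x <- "42\u200e"
y <- suppressWarnings(as.numeric(x))   # NA (a warning is raised if not suppressed)
is.na(y)
```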
>
> But Unicode supports a much wider notion of "numeric": there are
> numeric-equivalent symbols in other scripts and groupings, and even
> the symbols for numerals in light or dark circles count as numbers.
> Those can likely be safely excluded in this context, but perhaps not
> in a more general function.
>
> But I note as.numeric seems to handle scientific notation as in:
>
> > as.numeric("1.23e8")
> [1] 1.23e+08
>
> So a single instance of the letter "e" or "E" must be supported if
> your numbers in string form may contain one. Further, the E cannot
> be the first or last character, and it cannot have adjacent
> whitespace. Still, if you are OK with getting an NA in such
> situations, it should be OK.
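
A short sketch of the exponent cases just described, using made-up
input strings. It only covers the cases where the exponent marker comes
first or has whitespace around it.

```r
# Well-formed scientific notation is parsed.
ok <- as.numeric(c("1.23e8", "1.23E-2"))   # 123000000 and 0.0123

# No digits before the exponent marker, or whitespace around it,
# gives NA (plus a coercion warning if not suppressed).
bad <- suppressWarnings(as.numeric(c("e5", "1.2 e 5")))
is.na(bad)
```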
>
> It gets worse. Hexadecimal is supported:
>
> > as.numeric("0X12")
> [1] 18
>
> You now need to support the letters x and X. But only if preceded by
> a zero!
>
> It gets still worse as any characters from [0-9A-Fa-f] are supported:
>
> > as.numeric("0xAE")
> [1] 174
>
> There may be other scenarios it handles. The filter applied might
> remove valid numbers so you may want to carefully document it if your
> program only handles a restricted set.
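
To illustrate the point about a filter removing valid numbers, here is
a sketch that applies the digits-dots-hyphens filter quoted later in
this thread to strings that as.numeric already accepts. The helper name
keep_digits is made up for this example.

```r
# Filter taken from the thread: keep only ASCII digits, dots, hyphen-minus.
keep_digits <- function(x) gsub("[^0-9.-]+", "", x)

x <- c("1.23e8", "0xAE", "0X12")

direct   <- as.numeric(x)                # 123000000, 174, 18
stripped <- keep_digits(x)               # "1.238", "0", "012"
filtered <- as.numeric(stripped)         # 1.238, 0, 12: silently wrong values
```

None of the filtered values is NA, so the damage would not show up in a
check for missing values.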
>
> A possible idea might be to make two passes and only evaluate any
> resulting NA from as.numeric() by doing a substitution like Ivan
> suggests to try to fix any broken ones. But note it may fix too much
> as "1.2 e 5" might become "1.2e5" as spaces are removed.
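
The two-pass idea could be sketched as follows. The function name
two_pass_numeric and its pattern argument are hypothetical, the default
pattern is the one quoted from Ivan below, and U+200E is again an
assumed stand-in for the invisible character.

```r
two_pass_numeric <- function(x, pattern = "[^0-9.-]+") {
  # First pass: let as.numeric() handle everything it already understands.
  out <- suppressWarnings(as.numeric(x))
  # Second pass: only strings that failed (and were not NA to begin with)
  # are cleaned with the substitution and converted again.
  bad <- is.na(out) & !is.na(x)
  out[bad] <- suppressWarnings(as.numeric(gsub(pattern, "", x[bad])))
  out
}

x <- c("1.23e8", "0xAE", "42\u200e", "1.2 e 5")
res <- two_pass_numeric(x)   # 123000000, 174, 42, 1.25
```

With this particular pattern "1.2 e 5" is repaired to 1.25, because the
"e" is stripped along with the spaces; a substitution that removes only
whitespace would give 1.2e5 instead. Either way the repair changes the
value without any warning.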
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Ivan Krylov via R-help
> Sent: Monday, March 3, 2025 3:09 AM
> To: Christofer Bogaso <bogaso.christofer using gmail.com>
> Cc: r-help <r-help using r-project.org>
> Subject: Re: [R] Failed to convert data to numeric
>
> On Mon, 3 Mar 2025 13:21:31 +0530
> Christofer Bogaso <bogaso.christofer using gmail.com> wrote:
>
> > Is there any way to remove all possible "Unicode character" that may
> > be present in the array at once?
>
> Define a range of characters you consider acceptable, and you'll be
> able to use regular expressions to remove everything else. For
> example, the following expression should remove everything except
> ASCII digits, dots, and hyphen-minus:
>
> gsub('[^0-9.-]+', '', dat2)
>
> There is a brief introduction to regular expressions in ?regex and
> various online resources such as <https://regex101.com/>.
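
A small sketch of that filter in use. The contents of dat2 here are
invented for illustration (the real data are not shown in this
message), and U+200E is an assumed stand-in for the invisible
character.

```r
# Hypothetical stand-in for the poster's data: numbers as strings, some
# carrying an invisible Unicode character.
dat2 <- c("\u200e12.5", "-3\u200e", "7")

# Keep only ASCII digits, dots, and hyphen-minus, then convert.
cleaned <- gsub("[^0-9.-]+", "", dat2)   # "12.5", "-3", "7"
nums <- as.numeric(cleaned)              # 12.5, -3, 7
```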
>
--
Honorary Research Fellow
Department of Statistics
University of Auckland