[R-pkg-devel] Re: UTF-8 and raw strings in package code

Sat Dec 6 23:30:37 CET 2025

Thanks, that's helpful. I'll tidy up and submit the patch and que sera, sera.

FWIW my code does already trap parse errors, and in fact it checks all 5 "symboloids" in your snippet

     function(x = y) g(z = w)

The $parseData reports SYMBOL, SYMBOL_SUB, SYMBOL_FUNCTION_CALL, and SYMBOL_FORMALS for the different cases, all of which are caught by grepl().
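To make the five cases concrete, here is a small sketch (not part of the patch) that lists the symbol-like tokens for that snippet. It uses utils::getParseData() purely for brevity, and the variable names ex, pd and syms are just for illustration.

```r
# Parse the snippet, keeping source references so that parse data is recorded
ex <- parse(text = "function(x = y) g(z = w)", keep.source = TRUE)
pd <- utils::getParseData(ex)

# Keep every token whose name contains "SYMBOL" (same test as in the function)
syms <- pd[grepl("SYMBOL", pd$token, fixed = TRUE), c("token", "text")]
print(syms, row.names = FALSE)
```

As far as I can tell from the parser's token names, x comes out as SYMBOL_FORMALS, g as SYMBOL_FUNCTION_CALL, z as SYMBOL_SUB, and y and w as plain SYMBOL, so a single fixed-string match on "SYMBOL" covers all five.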

I initially did use getParseData() but then realized it wasn't adding anything (except time). Anyway, I'll leave both options in there, de-piped, for people to consider.
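For anyone comparing the two options, the following is a rough sketch of what a de-piped getParseData() variant with tryCatch() might look like. It is not the patch itself: the name check_ascii_symbols is hypothetical, and the byte-based non-ASCII test is a stand-in for the tools-internal routine used in the original function.

```r
# Hypothetical helper, not the submitted patch: returns TRUE if every symbol
# in `file` is pure ASCII, FALSE otherwise (including when parsing fails).
check_ascii_symbols <- function(file, check_backticks = FALSE) {
  pp <- tryCatch(
    parse(file = file, keep.source = TRUE, encoding = "UTF-8"),
    error = function(e) NULL
  )
  if (is.null(pp)) return(FALSE)   # unparseable file: report as a failure

  pd <- utils::getParseData(pp)
  if (is.null(pd)) return(TRUE)    # no parse data, so no symbols to check

  symbols <- pd$text[grepl("SYMBOL", pd$token, fixed = TRUE)]
  if (!check_backticks) {
    # Drop backticked names, which keep their backticks in the parse data
    backy <- startsWith(symbols, "`") & endsWith(symbols, "`")
    symbols <- symbols[!backy]
  }

  # Stand-in for the tools-internal check: any byte above 127 is non-ASCII
  non_ascii <- vapply(
    symbols,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )

  ok <- !any(non_ascii)
  if (!ok) attr(ok, "offending_symbols") <- unique(symbols[non_ascii])
  ok
}

# Usage: non-ASCII text only in a string literal and a comment
f <- tempfile(fileext = ".R")
writeLines(enc2utf8(c("lingo <- \"fran\u00e7ais\"", "# fran\u00e7ais")),
           f, useBytes = TRUE)
check_ascii_symbols(f)
```

Returning FALSE from the error handler follows Luke's point about staying consistent with the rest of the checking process; whether a warning should also be raised there is left open.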

cheers
Mark

On Thu, Dec 4, 2025, at 09:20, luke-tierney using uiowa.edu wrote:
> On Mon, 1 Dec 2025, Mark Bravington wrote:
> 
> > On Sun, Nov 30, 2025, at 13:10, luke-tierney using uiowa.edu wrote:
> >> Keeping the ASCII-only restriction for code is important as it makes
> >> the code easier to understand by a wider audience.
> >>
> >> Allowing non-ASCII characters in literal strings, raw or regular, does
> >> seem reasonable to me in principle, but others may see issues I am not
> >> aware of.
> >>
> >> But checking for non-ASCII characters in code while allowing non-ASCII
> >> characters in string literals needs much more sophisticated check code
> >> than we currently have. If you or anyone else want to see this happen
> >> you can explore creating a patch and submit to bugzilla for
> >> consideration.
> >
> > Fair enough. It might be easier than you suspect, though, since the parser already does the heavy lifting--- code below.
> >
> > (i) If the file doesn't even parse, that's a more serious problem!
> 
> You still have to handle it in a way that is consistent with the rest
> of the checking process, which I believe means catching the error and
> returning FALSE. I would use tryCatch for that.
> 
> > (ii) If the file does parse OK, then AFAICS the only places that non-ASCII characters might be lurking are: (a) in comments, where they are somewhat grudgingly allowed IIRC; (b) in string literals, where we would like to allow them; and of course (c) in symbols (variable names; see notes below), where we DON'T want them if it's a package. And this can all be checked easily from $parseData. My specimen function below does it in ~20 lines of "real" code.
> 
> It isn't quite right though: symbols can appear in a few other
> places. Look at
> 
>      function(x = y) g(z = w)
> 
> I believe you are only picking up two of the five symbols you want.
> 
> Also you can simplify your code by using getParseData. I would also
> avoid using the pipe operator since it isn't consistent with the
> coding style in the file you are proposing to change.
> 
> > A couple of notes:
> >
> > #1 I didn't realize that it is even possible to have a "normal" (ie non-backticked) variable name with non-ASCII letters (see ?Quotes, "Names and Identifiers"). And indeed I can run the following in my (Anglo) Windows RGUI:
> >
> > français <- 'bon'
> >
> > Crikey, that's actually scary... Anyway, the intention is clearly to NOT allow that in package code, at least not yet.
> >
> > #2 Should packages nevertheless be allowed to use backticked identifiers containing non-ASCII characters? (IME backticks are often used for funny names with all-ASCII characters but in the wrong places.) Personally I'd vote no, but it's well above my pay grade--- and there's no voting in R. Anyhow, my code below has an option to check/not-check backticked symbols.
> >
> > Is this likely to be acceptable? If so I'll try to submit a formal patch.
> 
> It is worth putting together a clean and well-tested patch that can be
> easily reviewed and tested by others. There are folks who spend much
> more time than I do on the QC code and may see reasons why going down
> this road is a bad idea, or how to do this better, but we'll see.
> 
> Best,
> 
> luke
> 
> >
> > cheers
> > Mark
> >
> >
> > ## My function:
> >
> > check_ASCII_code_MVB <- function(
> >    file, pp= NULL, check_backticks= FALSE
> > ){
> >  # Checks that any non-ASCII UTF-8 characters are confined to
> >  # string-literals & comments
> >
> >  # Can directly supply results of previous parse(), for speed
> >  if( is.null( pp)){ # ... or, if not:
> >    pp <- try( parse( file=file, keep.source=TRUE, encoding='UTF-8'))
> >    if( inherits( pp, 'try-error')){
> >      warning( "Can't even parse, let alone check for non-ASCII")
> >      return( FALSE)
> >    }
> >  }
> >
> >  # Get tokens of "leaf" (terminal) elements, and associated text
> >  # This mimics utils::getParseData()
> >  ppd <- pp |> attr( 'wholeSrcref') |> attr( 'srcfile') |>
> >    _$parseData |> attributes() |> _[ c( 'tokens', 'text')]
> >
> >  symbols <- with( ppd,
> >      text[ grepl( 'SYMBOL', tokens, fixed=TRUE)])
> >
> >  if( !check_backticks){
> >    # Not obvious whether to allow UTF-8 in backticked names
> >
> >    # AFAICS backticks can only occur at both the start and the end of a parsable symbol
> >    backy <- startsWith( symbols, r"{`}") & endsWith( symbols, r"{`}")
> >    symbols <- symbols[ !backy]
> >  }
> >
> >  non_ASCII <- .Call( tools:::C_nonASCII, symbols)
> >
> >  OK <- !any( non_ASCII)
> >  if( !OK){
> >    attr( OK, 'offending_symbols') <- unique( symbols[ non_ASCII])
> >  }
> >  return( OK)
> > }
> >
> > ## A snippet to save into a file, for testing. Note the raw string: irrelevant, but useful.
> >
> > nonASCII_R <- r"--{
> >  français <- 'bon'
> >  `français` <- 'bon'
> >  lingo <- "français"
> >  # Nothing wrong with a bit of français in comments
> > }--" |> strsplit( '\n') |> _[[1]]
> >
> > writeLines( nonASCII_R, <file of your choice>)
> >
> >
> > ## Possible patch of tools::.check_package_ASCII_code :
> >
> > .check_package_ASCII_code_patch <- function (
> >  dir, respect_quotes = FALSE
> > ){
> >    if (!dir.exists(dir))
> >        stop(gettextf("directory '%s' does not exist", dir),
> >            domain = NA)
> >    dir <- file_path_as_absolute(dir)
> >    wrong_things <- character()
> >    for (f in c(file.path(dir, "NAMESPACE"), list_files_with_type(file.path(dir,
> >        "R"), "code", OS_subdirs = c("unix", "windows")))) {
> > ## OLD
> >        #text <- readLines(f, warn = FALSE)
> >        # if (.Call(C_check_nonASCII, text, respect_quotes))
> > ## NEW
> >        if( !check_ASCII_code_MVB( f))
> >            wrong_things <- c(wrong_things, f)
> >    }
> >    if (length(wrong_things)) {
> >        wrong_things <- substring(wrong_things, nchar(dir) +
> >            2L)
> >        cat(wrong_things, sep = "\n")
> >    }
> >    invisible(wrong_things)
> > }
> >
> >
> 
> -- 
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>     Actuarial Science
> 241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
