[R-pkg-devel] [External] Re: UTF-8 and raw strings in package code

Mon Dec 1 15:12:07 CET 2025

> On 1 Dec 2025, at 02.53, Mark Bravington <markinr using summerinsouth.net> wrote:
> 
> On Sun, Nov 30, 2025, at 13:10, luke-tierney using uiowa.edu wrote:
>> Keeping the ASCII-only restriction for code is important as it makes
>> the code easier to understand by a wider audience.
>> 
>> Allowing non-ASCII characters in literal strings, raw or regular, does
>> seem reasonable to me in principle, but others may see issues I am not
>> aware of.
>> 
>> But checking for non-ASCII characters in code while allowing non-ASCII
>> characters in string literals needs much more sophisticated check code
>> than we currently have. If you or anyone else want to see this happen
>> you can explore creating a patch and submit to bugzilla for
>> consideration.
> 
> Fair enough. It might be easier than you suspect, though, since the parser already does the heavy lifting--- code below.
> 
> (i) If the file doesn't even parse, that's a more serious problem! 
> 
> (ii) If the file does parse OK, then AFAICS the only places that non-ASCII characters might be lurking are: (a) in comments, where they are somewhat grudgingly allowed IIRC; (b) in string literals, where we would like to allow them; and of course (c) in symbols (variable names; see notes below), where we DON'T want them if it's a package. And this can all be checked easily from $parseData. My specimen function below does it in ~20 lines of "real" code.
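
For readers who want to see the token classes behind point (ii), here is a minimal sketch using only the exported utils::getParseData(). The sample source lines and the helper has_non_ascii() are made up for illustration and are not part of the proposal; the \u escapes just keep the example itself ASCII-only.

```r
# Illustrative source: non-ASCII appears only in a string and in a comment.
src <- c(
  'lingo <- "fran\u00e7ais"',
  '# un peu de fran\u00e7ais',
  'n <- nchar(lingo)'
)

# Parse with source references kept, then take the terminal (leaf) tokens
pd <- utils::getParseData(parse(text = src, keep.source = TRUE))
leaves <- pd[pd$terminal, c("token", "text")]

# Hypothetical helper: TRUE where a string holds any code point above 127
has_non_ascii <- function(x) {
  vapply(x, function(s) any(utf8ToInt(enc2utf8(s)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

# Only symbol-like tokens are checked; STR_CONST and COMMENT are left alone
is_symbol <- grepl("SYMBOL", leaves$token, fixed = TRUE)
symbol_text <- leaves$text[is_symbol]
bad_symbols <- symbol_text[has_non_ascii(symbol_text)]

print(table(leaves$token))
print(bad_symbols)
print(has_non_ascii(c("bon", "fran\u00e7ais")))
```

The string and the comment show up as STR_CONST and COMMENT tokens, so they never reach the symbol check.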
> 
> A couple of notes:
> 
> #1 I didn't realize that it is even possible to have a "normal" (ie non-backticked) variable name with non-ASCII letters (see ?Quotes, "Names and Identifiers"). And indeed I can run the following in my (Anglo) Windows RGUI:
> 
> français <- 'bon'
> 
> Crikey, that's actually scary... Anyway, the intention is clearly to NOT allow that in package code, at least not yet.

Maybe scary, but part of the R idiom is that plots etc. get auto-labeled with the names of the variables. If I want to do a child-vs-parents' income chart in Danish, the labels become "børn" and "forældre". And such names can be column names in datasets, etc. You can work around it, but why should you?
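
A small sketch of that auto-labeling idiom, for anyone who has not looked at how it works. The function label_of() and the numbers are hypothetical and purely illustrative; the variable names are kept ASCII and the Danish column names are written with \u escapes so that the snippet runs in any locale.

```r
# Roughly how a plotting function can derive a default axis label:
# it deparses the expression that was passed as the argument.
label_of <- function(x) deparse1(substitute(x))

parents  <- c(310, 420, 515)   # made-up values, for illustration only
children <- c(280, 450, 390)

print(label_of(parents))       # the label is the variable name itself

# Column names travel with the data in the same way
income <- data.frame(parents, children)
names(income) <- c("for\u00e6ldre", "b\u00f8rn")
print(names(income))
```

With a variable actually named in Danish, the first label would come out in Danish too, which is the convenience one gives up by restricting names to ASCII.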

So, for local usage, it is quite sensible to allow extended character sets. 

For packages (and other distributed materials) probably not so. It is probably the language and not actually the character set you want to restrict, though. 

-pd

> 
> #2 Should packages nevertheless be allowed to use backticked identifiers containing non-ASCII characters? (IME backticks are often used for funny names with all-ASCII characters but in the wrong places.) Personally I'd vote no, but it's well above my pay grade--- and there's no voting in R. Anyhow, my code below has an option to check/not-check backticked symbols.
> 
> Is this likely to be acceptable? If so I'll try to submit a formal patch.
> 
> cheers
> Mark
> 
> 
> ## My function:
> 
> check_ASCII_code_MVB <- function( 
>    file, pp= NULL, check_backticks= FALSE
> ){
>  # Checks that any non-ASCII UTF-8 characters are confined to 
>  # string-literals & comments
> 
>  # Can directly supply results of previous parse(), for speed
>  if( is.null( pp)){ # ... or, if not:
>    pp <- try( parse( file=file, keep.source=TRUE, encoding='UTF-8'))
>    if( inherits( pp, 'try-error')){
>      warning( "Can't even parse, let alone check for non-ASCII")
>      return( FALSE)
>    }
>  }
> 
>  # Get tokens of "leaf" (terminal) elements, and associated text
>  # This mimics utils::getParseData()
>  ppd <- pp |> attr( 'wholeSrcref') |> attr( 'srcfile') |>
>    _$parseData |> attributes() |> _[ c( 'tokens', 'text')]
> 
>  symbols <- with( ppd, 
>      text[ grepl( 'SYMBOL', tokens, fixed=TRUE)])
> 
>  if( !check_backticks){
>    # Not obvious whether to allow UTF-8 in backticked names
> 
>    # AFAICS backticks can only occur both at start and end of a parsable symbol
>    backy <- startsWith( symbols, r"{`}") & endsWith( symbols, r"{`}")
>    symbols <- symbols[ !backy]
>  }
> 
>  non_ASCII <- .Call( tools:::C_nonASCII, symbols)
> 
>  OK <- !any( non_ASCII)
>  if( !OK){
>    attr( OK, 'offending_symbols') <- unique( symbols[ non_ASCII])
>  }
>  return( OK)
> }
> 
> ## A snippet to save into a file, for testing. Note the raw string: irrelevant, but useful.
> 
> nonASCII_R <- r"--{
>  français <- 'bon'
>  `français` <- 'bon'
>  lingo <- "français"
>  # Nothing wrong with a bit of français in comments
> }--" |> strsplit( '\n') |> _[[1]]
> 
> writeLines( nonASCII_R, <file of your choice>)
> 
> 
> ## Possible patch of tools::.check_package_ASCII_code :
> 
> .check_package_ASCII_code_patch <- function (
>  dir, respect_quotes = FALSE
> ){
>    if (!dir.exists(dir)) 
>        stop(gettextf("directory '%s' does not exist", dir), 
>            domain = NA)
>    dir <- file_path_as_absolute(dir)
>    wrong_things <- character()
>    for (f in c(file.path(dir, "NAMESPACE"), list_files_with_type(file.path(dir, 
>        "R"), "code", OS_subdirs = c("unix", "windows")))) {
> ## OLD        
>        #text <- readLines(f, warn = FALSE)
>        # if (.Call(C_check_nonASCII, text, respect_quotes)) 
> ## NEW        
>        if( !check_ASCII_code_MVB( f))
>            wrong_things <- c(wrong_things, f)
>    }
>    if (length(wrong_things)) {
>        wrong_things <- substring(wrong_things, nchar(dir) + 
>            2L)
>        cat(wrong_things, sep = "\n")
>    }
>    invisible(wrong_things)
> }
> 
> ______________________________________________
> R-package-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com