[R] Intended use-case for data.matrix
Duncan Murdoch
murdoch@dunc@n @end|ng |rom gm@||@com
Wed Nov 4 21:37:40 CET 2020
You can see the change to the help page here:
https://github.com/wch/r-source/commit/d1d3863d72613660727379dd5dffacad32ac9c35#diff-9143902e81e6ad39faace2d926725c4c72b078dd13fbb1223c4a35f833b58ee6
Before the change, it said the input should be

  a data frame whose components are logical vectors, factors or numeric
  vectors

which suggests your input was invalid. But later it says

  Logical and factor columns are converted to integers. Any other
  column which is not numeric (according to \code{\link{is.numeric}}) is
  converted by \code{\link{as.numeric}} or, for S4 objects,
  \code{\link{as}(, "numeric")}.

which suggests what you were doing was supported.
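As a rough illustration of the conversion rule quoted above (the data frame `df` and its column names are made up for this sketch, they are not from the help page), this is how data.matrix() treats logical and factor columns. The character column is the case whose handling changed in R 4.0.0:

```r
# Made-up example data; column names are illustrative only
df <- data.frame(
  flag = c(TRUE, FALSE, TRUE),        # logical column
  grp  = factor(c("b", "a", "b")),    # factor column with levels "a", "b"
  txt  = c("10", "9", "100"),         # character column
  stringsAsFactors = FALSE
)

m <- data.matrix(df)

m[, "flag"]  # logical -> integer: 1 0 1
m[, "grp"]   # factor -> its integer codes: 2 1 2

# The character column depends on the R version: per the R 4.0.0 NEWS entry
# it is converted to a factor and then to integer codes; before that it went
# through as.numeric(). An explicit conversion avoids relying on either rule:
as.numeric(df$txt)  # 10 9 100
```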
It's unfortunate that you didn't know about this change, but it was made in
August 2019, and appeared on the news feed here:
https://developer.r-project.org/blosxom.cgi/R-devel/NEWS/2019/08/08#n2019-08-08
so some of the blame for this goes to you for not paying attention and
not testing unreleased R versions.
To protect yourself against this kind of unpleasant surprise in the future,
I'd suggest this:
- Follow the news feed.
- Put your code in a package, and test it against R-devel now and then. (If
your package is on CRAN the testing will happen automatically; if it's not
on CRAN and not in a package, you could still test against R-devel, but why
make your life more difficult by *not* putting it in a package?)
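A minimal sketch of the kind of check that could live in a package's tests directory and so get run under R-devel. `check_conversion` and the sample values are hypothetical, chosen for illustration, and not part of any existing package:

```r
# Hypothetical regression check: does a conversion function turn a data
# frame of numeric strings into the matrix of values we expect?
check_conversion <- function(convert) {
  input <- data.frame(
    a = c("919948", "1346170", "15870"),
    b = c("1625540", "710272", "83624"),
    stringsAsFactors = FALSE
  )
  expected <- matrix(c(919948, 1346170, 15870,
                       1625540, 710272, 83624), ncol = 2)
  result <- convert(input)
  # Compare values only; dimnames are dropped before the comparison
  is.numeric(result) && isTRUE(all.equal(unname(result), expected))
}

# An explicit per-column conversion should pass on any R version
stopifnot(check_conversion(function(df) sapply(df, as.numeric)))

# The same check applied to data.matrix reports whether the running R
# version still returns the original values for character input
check_conversion(data.matrix)
```

Wrapping the last call in stopifnot() would turn it into a test that fails as soon as the behaviour the scripts rely on changes.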
Duncan Murdoch
On 04/11/2020 6:48 a.m., Philip Charles wrote:
> Hi R gurus,
>
> We do a lot of work with biological -omics datasets (genomics, proteomics
etc). The text file inputs to R typically contain a mixture of (mostly)
character data and numeric data. The number of columns (both character and
numeric data) in the file varies with the number of samples measured (which
makes use of colClasses difficult), so a typical approach might be
>
> 1) read in the whole file as character matrix
>
> #simulated result of read.table (with stringsAsFactors=FALSE)
> raw <- data.frame(
>   Accession = c('P04637', 'P01375', 'P00761'),
>   Description = c('Cellular tumor antigen p53', 'Tumor necrosis factor', 'Trypsin'),
>   Species = c('Homo sapiens', 'Homo sapiens', 'Sus scrofa'),
>   Intensity.SampleA = c('919948', '1346170', '15870'),
>   Intensity.SampleB = c('1625540', '710272', '83624'),
>   Intensity.SampleC = c('1232780', '1481040', '62548'),
>   stringsAsFactors = FALSE
> )
>
> 2) use grep to identify numeric columns based on column names and split
the raw matrix
>
> QUANT_COLS <- grepl('^Intensity\\.',colnames(raw))
> META_COLS <- !QUANT_COLS
> quant.df.char <- raw[,QUANT_COLS]
> meta.df <- raw[, META_COLS]
>
> 3) convert the quantitation data frame to a numeric matrix
>
> Prior to R version 4, my standard method for step 3 was to use
data.matrix(). After recently updating from v3.6.3, I've
found that all my workflows using this function were giving wildly
incorrect results. I figured out that data.matrix now yields a matrix of
factor levels rather than the actual numeric values:
>
>> quant.df.char
>   Intensity.SampleA Intensity.SampleB Intensity.SampleC
> 1            919948           1625540           1232780
> 2           1346170            710272           1481040
> 3             15870             83624             62548
>
>> data.matrix(quant.df.char)
>      Intensity.SampleA Intensity.SampleB Intensity.SampleC
> [1,]                 3                 1                 1
> [2,]                 1                 2                 2
> [3,]                 2                 3                 3
>
> The change in behaviour of this function is documented in the R v4.0.0
changelog, so it is clearly intentional:
>
> "data.matrix() now converts character columns to factors and from this to
integers."
>
> Now, I know there are other ways to achieve the same conversion, e.g.
sapply(quant.df.char, as.numeric). They aren't quite as straightforward to
read in the code as data.matrix (with sapply/lapply in particular I have to
think through whether there will be a need to transpose the result!), but the
fact that this base function has been changed (without a way to replicate
the previous behaviour) leads me to suspect that I have probably not
previously been using data.matrix in the intended manner - and I may
therefore be making similar mistakes elsewhere! I've certainly
distributed/handed out R scripting examples in the past that will now give
incorrect results when run on v4+ R.
>
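On the transposition point: a short sketch, using a cut-down copy of the example data from this message, of two conversions that do not depend on data.matrix(). The object names `m_sapply` and `m_mode` are made up here. Because sapply() iterates over the columns of a data frame and binds the results as columns, no transpose should be needed when there is more than one row:

```r
# Cut-down copy of the quantitation columns from the example above
quant.df.char <- data.frame(
  Intensity.SampleA = c("919948", "1346170", "15870"),
  Intensity.SampleB = c("1625540", "710272", "83624"),
  stringsAsFactors = FALSE
)

# Option 1: convert each column; sapply() binds the results column-wise
m_sapply <- sapply(quant.df.char, as.numeric)

# Option 2: build a character matrix, then change its storage mode in place
# (dim and dimnames are kept)
m_mode <- as.matrix(quant.df.char)
storage.mode(m_mode) <- "double"

dim(m_sapply)            # 3 2: samples stay in columns
colnames(m_mode)         # "Intensity.SampleA" "Intensity.SampleB"
all(m_sapply == m_mode)  # TRUE
```

With a single-row data frame sapply() returns a plain vector rather than a one-row matrix, so the second form is the more predictable of the two.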
> What is even more confusing to me (but possibly related as regards an
answer) is that R v4 broke with long-standing convention to change
default.stringsAsFactors() to FALSE. So on one hand the update took away
what was (at least, from our perspective, with our data - I am sure some
here may disagree!) a perennial source of confusion/bugs for R learners, by
not introducing string factorisation during data import, and then on the
other hand changed a base function to explicitly introduce string
factorisation. I can't see when converting a character dataset, not to
factors but straight to numeric factor levels, might be that useful (but of
course that doesn't mean it isn't!).
>
> I've had a look through r-help and r-devel archives and couldn't spot any
discussion of this, so apologies if this has been asked before. I'm also
pretty sure my misunderstanding is with the intended use-case of
data.matrix and R ethos around strings/factors rather than the rationale
for the change, which is why I'm asking here.
>
> Best wishes,
>
> Phil
>
> Philip Charles
> Target Discovery Institute, Nuffield Department Of Medicine
> University of Oxford
>