[R] watch out for quotes in data files

Douglas Bates bates at stat.wisc.edu
Tue Jul 10 05:53:07 CEST 2001

I have just spent a day trying to determine why I seemed to be unable
to read a file of microarray expression results into R properly.  The
file was produced by the Dchip software developed by Li and Wong at
Harvard's Department of Biostatistics.  It contains rows of
tab-delimited fields in the order

Probe set identifier
Probe set description
Array 1 expression
Array 1 call
Array 2 expression
Array 2 call

plus an extra tab (which I think is due to a programming glitch).

There are 7130 rows, including the column headers, for results from
Affymetrix Hu6800 chips. 

When I read this file using read.table(filename, sep = "\t", head = TRUE)
I got only 3720 rows.  Furthermore count.fields(filename, sep = "\t")
gave a result of length 7130 but several of the rows were reported as
having only two fields instead of 15 like the other rows.

It seemed to me that the important characteristic of these rows was
their having a very long "Probe set description" and I wasted quite a
bit of time looking for possible buffer overflows that might be
triggered by this.

When I finally came to my senses and created a much smaller input file
that only contained a few rows, including one that was giving an
aberrant field count, I could directly examine the results of scan()
applied to it.  I noticed that the second field for the aberrant line
contained all the subsequent lines and then I saw that its description
included "5'" (as in the 5' end of the sequence versus the 3' end).
Other descriptions had this written as "5 prime" but this one used
"5'".  What was happening was that everything from there to the next
"'" character in the file was being included as part of that field.

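The effect can be reproduced on a tiny file.  What follows is only a
sketch: the file contents and probe names are made up (this is not Dchip
output), and it assumes a current version of R in which count.fields()
accepts the sep and quote arguments used here.  It compares field counts
with and without quote handling to find lines affected by a stray quote.

```r
# Made-up miniature of the problem file: three tab-separated fields per line,
# with an apostrophe in the descriptions on lines 2 and 4.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "probe\tdescription\texpression",
  "A01\tkinase, 5' end\t10.5",
  "A02\tkinase, 3 prime end\t11.2",
  "A03\tphosphatase, 5' end\t9.8",
  "A04\tphosphatase\t9.4"
), tmp)

# Default quoting treats both ' and " as quote characters.
n_default <- count.fields(tmp, sep = "\t")

# With quoting disabled, each line is split on tabs only.
n_noquote <- count.fields(tmp, sep = "\t", quote = "")

# Lines whose count differs (or is NA) under default quoting are suspects.
suspect <- which(is.na(n_default) | n_default != n_noquote)
print(suspect)
```

The exact counts reported under default quoting may vary between R
versions, which is why the sketch compares the two results rather than
looking for a particular number.
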
I could read the file properly by adding the optional argument quote =
"" to the call to read.table.

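As a rough illustration of the fix, the sketch below reads a small
made-up file twice, once with the default quote handling and once with
quote = "".  The file contents and the variable names (tmp, broken,
fixed) are assumptions for the example only, not the original data.

```r
# Made-up tab-delimited file with a header and four data rows; two of the
# descriptions contain an apostrophe.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "probe\tdescription\texpression",
  "A01\tkinase, 5' end\t10.5",
  "A02\tkinase, 3 prime end\t11.2",
  "A03\tphosphatase, 5' end\t9.8",
  "A04\tphosphatase\t9.4"
), tmp)

# Default quote handling: text between the two apostrophes is read as
# part of a single quoted field, so rows go missing.
broken <- read.table(tmp, sep = "\t", header = TRUE)

# Quote handling disabled: fields are split on tabs only.
fixed <- read.table(tmp, sep = "\t", header = TRUE, quote = "")

cat("rows with default quoting:", nrow(broken), "\n")
cat("rows with quote = \"\":", nrow(fixed), "\n")
```

In current versions of R, read.delim() defaults to quote = "\"", so it
does not treat the apostrophe as a quote character, although an unpaired
double quote would presumably still cause the same kind of trouble.
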
The moral of the story is to watch out for molecular biologists who
use unpaired quote characters in their descriptions.