[R] retaining characters in a csv file
Daniel Nordlund
djnordlund at frontier.com
Wed Sep 23 21:05:33 CEST 2015
On 9/23/2015 5:57 AM, Therneau, Terry M., Ph.D. wrote:
> Thanks to all for the comments; I hadn't intended to start a war.
>
> My summary:
> 1. Most important: I wasn't missing something obvious. This is
> always my first suspicion when I submit something to R-help, and it's
> true more often than not.
>
> 2. Obviously (at least it is now), the CSV standard does not specify
> that quotes should force a character result. R is not "wrong". With
> respect to using what Excel does as a litmus test, I consider that to be
> totally uninformative about standards: neither pro (like Duncan) nor anti
> (like Rolf), but simply irrelevant. (Like many MS choices.)
>
> 3. I'll have to code in my own solution, either pre-scan the first
> few lines to create a colClasses, or use read_csv from the readr
> library (if there are leading zeros it keeps the string as character,
> which may suffice for my needs), or something else.
>
> 4. The source of the data is a "text/csv" field coming from an http
> POST request. This is an internal service on an internal Mayo server
> and coded by our own IT department; this will not be the first case
> where I have found that their definition of "csv" is not quite standard.
>
> Terry T.
>
>
>
>> On 23/09/15 10:00, Therneau, Terry M., Ph.D. wrote:
>>> I have a csv file from an automatic process (so this will happen
>>> thousands of times), for which the first row is a vector of variable
>>> names and the second row often starts something like this:
>>>
>>> 5724550,"000202075214",2005.02.17,2005.02.17,"F", .....
>>>
>>> Notice the second variable, which is
>>>   - a character string (note the quotation marks)
>>>   - a sequence of numeric digits
>>>   - leading zeros are significant
>>>
>>> The read.csv function insists on turning this into a numeric. Is there
>>> any simple set of options that
>>> will turn this behavior off? I'm looking for a way to tell it to "obey
>>> the bloody quotes" -- I still want the first, third, etc columns to
>>> become numeric. There can be more than one variable like this, and not
>>> always in the second position.
>>>
>>> This happens deep inside the httr library; there is an easy way for me
>>> to add more options to the read.csv call but it is not so easy to
>>> replace it with something else.
>
A fairly simple workaround is to add two lines of code to the process,
and then add the colClasses parameter as you suggested in item 3 above:
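
    # first pass: quoting disabled, so quoted fields are read as character
    want <- read.csv('yourfile', quote='', stringsAsFactors=FALSE, nrows=1)
    classes <- sapply(want, class)
    # second pass: normal quoting, column classes taken from the first pass
    want <- read.csv('yourfile', stringsAsFactors=FALSE, colClasses=classes)
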
I don't know whether you want the final data frame to convert strings to
factors, so modify stringsAsFactors as needed. In addition, if your files
aren't as regular as I inferred, you can increase nrows in the first
read.csv call so that more rows are used to determine the classes.
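
For anyone who wants to try the two-pass idea without a real file, here is a
minimal self-contained sketch. The sample rows, the column names (id, code,
start, end, sex) and the variable names are made up for illustration; only
the read.csv arguments come from the workaround above. The classes vector is
passed through unname() so that it is applied by position, which is an extra
precaution and not part of the original suggestion.

```r
# Made-up sample data modelled on the row quoted in the question; the
# column names are assumptions for illustration only.
csv_lines <- c(
  'id,code,start,end,sex',
  '5724550,"000202075214",2005.02.17,2005.02.17,"F"',
  '5724551,"000000000042",2006.03.01,2006.03.02,"M"'
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Default read: the quotes are stripped and the digit string is then
# converted to a number, so the leading zeros are lost.
naive <- read.csv(path, stringsAsFactors = FALSE)

# Pass 1: with quoting disabled, the quote characters stay in the field,
# so every quoted column is classed as character.
peek <- read.csv(path, quote = "", stringsAsFactors = FALSE, nrows = 2)
classes <- unname(sapply(peek, class))

# Pass 2: normal quoting, with the column classes fixed from pass 1.
fixed <- read.csv(path, stringsAsFactors = FALSE, colClasses = classes)

str(naive$code)   # numeric, leading zeros gone
str(fixed$code)   # character, leading zeros kept
```

One caveat with this sketch: because the first pass disables quoting, a quoted
field that contains an embedded comma would be split there, so the approach
assumes the quoted fields never contain the separator.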
Hope this is helpful,
Dan
--
Daniel Nordlund
Bothell, WA USA