[R] CSV value not being read as it appears

Peter Ehlers ehlers at ucalgary.ca
Fri Jan 14 16:09:18 CET 2011


On 2011-01-14 02:09, bgreen at dyson.brisnet.org.au wrote:
> Brian,
>
> Thanks. My response to David follows. I should add that this problem has
> never occurred previously as far as I know (I have now checked the
> previous report I was sent):
>

I run into this problem frequently. Like Philipp and David,
I too always check imported categorical variables. The worst
cases are trailing spaces (in quoted text).
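
A minimal sketch of the kind of check meant here, using a made-up
`status` vector in place of a real imported column (the data and the
names `status`, `has_ws` and `cleaned` are illustrative assumptions, not
taken from Bob's file). Quoting the values makes stray blanks visible in
the table labels, and a regular expression flags the affected entries.

```r
# Hypothetical values standing in for an imported categorical column
status <- c("Open", " Open", "Open ", "Closed", "open")

# Quote each value so leading/trailing blanks show up in the table labels
print(table(encodeString(status, quote = '"')))

# Flag entries that start or end with whitespace
has_ws <- grepl("^[[:space:]]|[[:space:]]$", status)
print(which(has_ws))

# Normalise by trimming both ends, then tabulate again
cleaned <- gsub("^[[:space:]]+|[[:space:]]+$", "", status)
print(table(cleaned))
```

Trimming does not merge "open" with "Open"; case variants would need a
separate step such as `tolower()`, if that is appropriate for the data.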

It is hardly R's fault that Excel users routinely commit
crimes against data.

Peter Ehlers

> Hello David,
>
> Thanks for your e-mail. The data was a report derived from a statewide
> database, saved in EXCEL format, so the usual vagaries of human data
> entry weren't the issue: the data came from an automated report that is
> run every three months. I would not have even noticed
> this problem if I hadn't been double checking the numbers of people by
> district. Visual inspection didn't reveal this problem - no white space
> was obvious and the spelling was identical. Tabulation via R wouldn't have
> detected this - I was obtaining the EXCEL totals via filter which I then
> compared with R output. I'm hoping I can skip this step, in future, with
> Jim's suggestion.
>
> regards
>
> Bob
>
>> On Fri, 14 Jan 2011, David Scott wrote:
>>
>>> As a further note, this is a reminder that whenever you get data via
>>> a spreadsheet the first thing to do is examine it and clean up any
>>> problems. A basic requirement is to tabulate any categorical
>>> variable. Spreadsheets allow any sort of data to be entered, with no
>>> controls. My experience is that those who enter data into
>>> spreadsheets enter all sorts of variations of what a human would
>>> wish to treat as the same ("Open", "Open ", "open", etc.), even when
>>> told not to.
>>
>> Another common problem is that they enter characters such as
>> non-breaking space or zero-width characters: we added support for
>> known encodings of NBSP to strip.white about five years ago.
>>
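
A short sketch of why such characters are hard to spot, assuming a
made-up vector `x` in which U+00A0 (non-breaking space) pads the value:
it prints like an ordinary blank but does not compare equal. Here the
NBSP is first converted to a regular space and then trimmed. This is
one possible post-import cleanup, not a description of what strip.white
does internally; the names `x`, `first_cp`, `y` and `cleaned` are
illustrative.

```r
# Hypothetical values padded with a non-breaking space (U+00A0)
x <- c("Open", "\u00a0Open", "Open\u00a0")

# Only the unpadded value compares equal to "Open"
print(x == "Open")

# Inspect the code point of the first character of the second value
first_cp <- utf8ToInt(x[2])[1]
print(first_cp)   # 160 is U+00A0

# Replace NBSP with an ordinary space, then trim both ends
y <- gsub("\u00a0", " ", x, fixed = TRUE)
cleaned <- gsub("^[[:space:]]+|[[:space:]]+$", "", y)
print(cleaned == "Open")
```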
>>>
>>> David Scott
>>>
>>> On 14/01/2011 4:03 p.m., Jim Holtman wrote:
>>>> try strip.white=TRUE to strip out white space
>>>>
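
To illustrate Jim's suggestion, a small sketch with an inline CSV made
up for the purpose (the `csv_text` contents and the object names are
assumptions, not Bob's data). Per the read.table help page, strip.white
applies only to unquoted character fields, so blanks inside quoted text
would still need trimming after import.

```r
# Made-up CSV: the second and third Status values carry a stray blank
csv_text <- "Date,Status\n10/02/2010,Open\n22/08/2007, Open\n05/03/2009,Open \n"

# Default import keeps the blanks; strip.white = TRUE removes them
raw   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE, strip.white = TRUE)

n_raw   <- sum(raw$Status == "Open")    # only the exact match is counted
n_clean <- sum(clean$Status == "Open")  # all three rows match after stripping
print(c(raw = n_raw, clean = n_clean))
```
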
>>>> On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
>>>>
>>>>>
>>>>> I have a frustrating issue which I am hoping someone may have a
>>>>> suggestion about.
>>>>>
>>>>> I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
>>>>> as a csv file.
>>>>>
>>>>> The initial code I ran follows.
>>>>>
>>>>> dec <- read.csv("g://FMH/FO30122010.csv", header = TRUE)
>>>>> dec.open <- subset(dec, Status == "Open")
>>>>> table(dec.open$AMHS)
>>>>>
>>>>> I was checking the output and noticed a difference between my manual
>>>>> count and R output. Two subjects' rows were not being detected by the
>>>>> subset command:
>>>>>
>>>>> For the AMHS where there was a discrepancy I then ran:
>>>>> wm <- subset(dec, AMHS == "WM")
>>>>>
>>>>> The problem appears to be that there is a space before the "Open"
>>>>> value for two individuals, as per the example below.
>>>>>
>>>>> 10/02/2010  Open
>>>>> 22/08/2007   Open
>>>>>
>>>>> Checking in EXCEL there does not appear to be a space and the format
>>>>> is the same (e.g. 'general'). I resolved the problem by copying over
>>>>> the values for the two individuals where I identified a problem.
>>>>>
>>>>> Given this problem was not detected by visual scanning I would
>>>>> appreciate advice on how this problem can be detected in future
>>>>> without my having to manually check raw data against R output.
>>>>>
>>>>> Any assistance is appreciated,
>>>>>
>>>>> Bob
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>> --
>>> _________________________________________________________________
>>> David Scott	Department of Statistics
>>> 		The University of Auckland, PB 92019
>>> 		Auckland 1142,    NEW ZEALAND
>>> Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
>>> Email:	d.scott at auckland.ac.nz,  Fax: +64 9 373 7018
>>>
>>> Director of Consulting, Department of Statistics
>>>
>>
>> --
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>
>


