[R] strange behavior when reading csv - line wraps

Sun May 31 22:47:48 CEST 2009

Big thanks to Ted and Jim for all the help.
Martin

(Ted Harding) wrote:
> Ah!!! It was count.fields() which we had overlooked! We discovered
> a workaround which involved using
>
>   Data0 <- readLines(file)
>
> to create a vector of strings, one for each line of the input file,
> and then using
>
>   NF <- unlist(lapply(Data0,function(x)
>         length(unlist(gregexpr(";",x,fixed=TRUE,useBytes=TRUE)))))
>
> to count the number of occurrences of ";" (the separator) in each line.
> (NF+1) produces the same result as count.fields(file,sep=";"). 
>
> Thanks for pointing out the existence of count.fields()!
> Ted.
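
A minimal sketch of the comparison Ted describes, run on a made-up three-line file (the sample data and the variable names are assumptions, not from the thread). It counts separators with regmatches(), so a line without any ";" yields zero matches; gregexpr() on its own returns -1 in that case, which length() would count as one.

```r
# Write a small semicolon-separated sample to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b", "2;a;b;c;d;e", "3;a"), tmp)

lines <- readLines(tmp)

# Number of ";" per line; lengths() is 0 for a line with no match.
n_sep <- lengths(regmatches(lines, gregexpr(";", lines, fixed = TRUE)))

# Fields per line: separators + 1, versus the built-in counter.
n_fields_manual  <- n_sep + 1L
n_fields_builtin <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

print(cbind(manual = n_fields_manual, builtin = n_fields_builtin))
```

Passing quote = "" and comment.char = "" makes count.fields() treat quotes and "#" as ordinary characters, which matches what the manual count does.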
>
> On 31-May-09 15:04:23, jim holtman wrote:
>   
>> You can do something like this: count the number of fields in each line
>> of the file and use the max to determine the number of columns for
>> read.table:
>>
>> file <- '/tempxx.txt'
>> # sep=';' matches the separator used in the data
>> maxFields <- max(count.fields(file, sep=';'))  # max fields on any line
>> # now set up read.table for the max number of columns
>> input <- read.table(file, sep=';', colClasses=rep(NA, maxFields), fill=TRUE,
>>     col.names=paste("V", seq(maxFields), sep=''))
>>
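
To see the effect in isolation, here is a small self-contained sketch, assuming a made-up six-line file in which only the last line is long. The data and the names wrapped and fixed are illustrative only. It reads the file once with the column count left to read.table, and once with col.names sized from count.fields as Jim suggests.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b",
             "2;a;b",
             "3;a;b",
             "4;a;b",
             "5;a;b",
             "6;a;b;c;d;e"), tmp)

# The column count is taken from the first five lines (3 here), so the
# extra fields of the sixth line end up in an additional row.
wrapped <- read.table(tmp, sep = ";", fill = TRUE, header = FALSE,
                      stringsAsFactors = FALSE)

# Supplying col.names for the widest line keeps one row per input line.
maxFields <- max(count.fields(tmp, sep = ";", quote = "", comment.char = ""))
fixed <- read.table(tmp, sep = ";", fill = TRUE, header = FALSE,
                    quote = "", comment.char = "",
                    col.names = paste0("V", seq_len(maxFields)),
                    stringsAsFactors = FALSE)

print(dim(wrapped))
print(dim(fixed))
```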
>>
>> On Sun, May 31, 2009 at 6:06 AM, Martin Tomko
>> <martin.tomko at geo.uzh.ch> wrote:
>>
>>     
>>> Dear Jim,
>>> with the help of Ted, we diagnosed that the cause is the extreme
>>> variability in line length in the input. As the number of table
>>> columns is apparently determined from the first five lines, anything
>>> exceeding this length is automatically wrapped onto the next line.
>>> I am now trying to find a way to read in the data despite this. I have
>>> no control over the table extent; the only thing that would make sense
>>> for my data would be to read in a fixed number of columns and merge all
>>> remaining columns into one long string in the last one. No idea how to
>>> do this, though.
>>>
>>> Thanks
>>> Martin
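
One possible way to do what Martin asks here, sketched under assumptions: read_fixed_plus_rest() is a hypothetical helper, not an existing R function, and the sample file is shortened from the example in the thread. It keeps the first n_fixed fields as separate columns and pastes everything after them back into a single string.

```r
# Hypothetical helper: first n_fixed fields as columns, remainder as one string.
read_fixed_plus_rest <- function(path, n_fixed = 7, sep = ";") {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]                 # drop empty lines
  parts <- strsplit(lines, sep, fixed = TRUE)   # one character vector per line

  # First n_fixed fields; indexing past the end pads short lines with NA.
  head_part <- do.call(rbind, lapply(parts, function(p) p[seq_len(n_fixed)]))

  # Everything after the fixed fields, joined back with the separator.
  rest <- vapply(parts,
                 function(p) paste(p[-seq_len(n_fixed)], collapse = sep),
                 character(1))

  out <- as.data.frame(head_part, stringsAsFactors = FALSE)
  names(out) <- paste0("V", seq_len(n_fixed))
  out$rest <- rest
  out
}

# Usage on a made-up two-line file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;",
             "1;2;3"), tmp)
res <- read_fixed_plus_rest(tmp, n_fixed = 7)
print(res)
```

All columns come back as character; type.convert() can be applied to the fixed columns afterwards if numeric types are wanted. Since strsplit() drops a trailing empty field, the trailing ";" does not appear in the merged column.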
>>>
>>>
>>> jim holtman wrote:
>>>
>>>       
>>>> It is still not clear to me exactly how you want to read the lines in.
>>>> If the lines have a variable number of fields, and some of the lines
>>>> might be wrapped, is there some way to determine where the start of
>>>> each line is?
>>>> If you are reading them in with read.csv, then the system is assuming
>>>> that each line starts a new row.  If this is not the case, then you
>>>> will have to state the rules that determine where the lines start.
>>>> You can always read the data in with 'scan' to separate each line and
>>>> then do whatever processing is required to put together the rows in
>>>> the data frame that you want.
>>>> In one of your examples, you indicated that the line was split
>>>> starting at the word "kempten"; if this is in the middle of the line,
>>>> then you would have to create the break after reading the line in with
>>>> 'scan' and then create the rows in the data frame.  All of this can be
>>>> done in R if you can state what the criteria are.
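
If the file itself did contain wrapped records (it turned out not to in this thread), the approach Jim outlines could look roughly like the sketch below. The rule used, that a new record starts with digits followed by ";", is purely an assumption for illustration, as is the sample data.

```r
tmp <- tempfile()
writeLines(c("37;2175168475;13;sculpture;bird;",
             "eagle;swiss;",
             "39;167757703;12;white;building;"), tmp)

# Read physical lines as plain strings; no quote handling.
raw <- scan(tmp, what = "", sep = "\n", quote = "", quiet = TRUE)

# Assumed rule: a record starts with digits followed by the separator.
starts    <- grepl("^[0-9]+;", raw)
record_id <- cumsum(starts)

# Glue each continuation line onto the record that precedes it.
records <- unname(vapply(split(raw, record_id), paste, character(1),
                         collapse = ""))
print(records)
```

The reassembled strings can then be split into fields with strsplit(records, ";", fixed = TRUE).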
>>>> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
>>>> <martin.tomko at geo.uzh.ch> wrote:
>>>>
>>>>    Jim,
>>>>    the two lines I put in are the actual problematic input lines.
>>>>    In these examples, there are no quotes nor # signs, although I
>>>>    have no means to make sure they do not occur in the inputs (any
>>>>    hints how I could deal with that?).
>>>>    I am trying to avoid as much pre-processing outside R as possible,
>>>>    and I have to process about 500 files with up to 3000 records
>>>>    each, so I need a more or less automated/batch solution, and any
>>>>    string substitution will have to occur in R. But for the moment, I
>>>>    do not see a reason for substitution, and the wrapping still
>>>>    occurs.
>>>>
>>>>    Cheers
>>>>    Martin
>>>>
>>>>
>>>>
>>>>    jim holtman wrote:
>>>>
>>>>        You need to supply the actual input line so we can see what is
>>>>        happening.  Are you sure you do not have unbalanced quotes in
>>>>        your input (try quote='') or do you have comment characters
>>>>        ("#") in your input?
>>>>
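
A small sketch of why Jim asks about quotes and "#", assuming a made-up two-line file. count.fields() and read.table() default to quote = "\"'" and comment.char = "#", while read.csv() defaults to quote = "\"" and comment.char = "". Switching both off makes every character count as ordinary data.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b",
             "2;c#d;e"), tmp)

# Defaults: "#" starts a comment, so the second line is cut short.
with_defaults <- count.fields(tmp, sep = ";")

# Quotes and comment characters disabled: both lines have three fields.
no_special <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

dat <- read.table(tmp, sep = ";", quote = "", comment.char = "",
                  header = FALSE, stringsAsFactors = FALSE)
print(dat)
```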
>>>>        On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>>>        <martin.tomko at geo.uzh.ch> wrote:
>>>>
>>>>           Dear All,
>>>>           I am observing a strange behavior, and searching the
>>>>           archives and help pages didn't help much.
>>>>           I have a csv with a variable number of fields in each line.
>>>>
>>>>           I use
>>>>           dataPoints <- read.csv(inputFile, head=FALSE, sep=";", fill=TRUE);
>>>>
>>>>           to read it in, and it works. But some lines are long and
>>>>           'wrap', or split and continue on the next line. So when I
>>>>           check the dim of the frame, it is not correct, and I can see
>>>>           in a printout that the line is split into two in the frame.
>>>>           I checked the input file and all is good.
>>>>
>>>>           an example of the input is:
>>>>                 37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
>>>>
>>>>           where the last value occurs on the next line in the data
>>>>           frame.
>>>>
>>>>           It does not have to be the last value: in the following
>>>>           example, the word "kempten" starts the next line:
>>>>                 39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
>>>>
>>>>           What could be the reason?
>>>>
>>>>           I was thinking about solving the issue by using a different
>>>>           separator for the first 7 fields and concatenating all of
>>>>           the remaining values into a single string value, but could
>>>>           not figure out how to do such a substitution in R.
>>>>           Unfortunately, on my system I cannot specify a range for
>>>>           sed...
>>>>
>>>>           Thanks for any help/pointers
>>>>           Martin
>>>>
>>>>           ______________________________________________
>>>>           R-help at r-project.org mailing list
>>>>           https://stat.ethz.ch/mailman/listinfo/r-help
>>>>           PLEASE do read the posting guide
>>>>           http://www.R-project.org/posting-guide.html
>>>>           and provide commented, minimal, self-contained,
>>>>           reproducible code.
>>>>
>>>>
>>>>
>>>>
>>>>        --
>>>>        Jim Holtman
>>>>        Cincinnati, OH
>>>>        +1 513 646 9390
>>>>
>>>>        What is the problem that you are trying to solve?
>>>>
>>> --
>>> Martin Tomko
>>> Postdoctoral Research Assistant
>>> Geographic Information Systems Division
>>> Department of Geography
>>> University of Zurich - Irchel
>>> Winterthurerstr. 190
>>> CH-8057 Zurich, Switzerland
>>>
>>> email:  martin.tomko at geo.uzh.ch
>>> site:   http://www.geo.uzh.ch/~mtomko
>>> mob:    +41-788 629 558
>>> tel:    +41-44-6355256
>>> fax:    +41-44-6356848
>>>
>>>
>>>       
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 31-May-09                                       Time: 16:24:27
> ------------------------------ XFMail ------------------------------
>
>
>   
