[R] strange behavior when reading csv - line wraps
Martin Tomko
martin.tomko at geo.uzh.ch
Sun May 31 22:47:48 CEST 2009
Big thanks to Ted and Jim for all the help.
Martin
(Ted Harding) wrote:
> Ah!!! It was count.fields() which we had overlooked! We discovered
> a work-round which involved using
>
> Data0 <- readLines(file)
>
> to create a vector of strings, one for each line of the input file,
> and then using
>
> NF <- unlist(lapply(Data0, function(x)
> length(unlist(gregexpr(";",x,fixed=TRUE,useBytes=TRUE)))))
>
> to count the number of occurrences of ";" (the separator) in each line.
> (NF+1) produces the same result as count.fields(file,sep=";").
>
> Thanks for pointing out the existence of count.fields()!
> Ted.
>
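
For reference, a minimal self-contained sketch of the separator-counting
work-round next to count.fields(). The temporary file and its contents are
made up for illustration. The vectorised gregexpr() call with sum(m > 0L) is
a variation on the code above, used because gregexpr() returns -1 (a vector
of length 1) for a line that contains no ";" at all.

```r
# Made-up ragged, semicolon-separated input (not the poster's data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b", "2;a;b;c;d;e", "3;a"), tmp)

Data0 <- readLines(tmp)

# gregexpr() is vectorised over the lines; count only real match positions,
# since a line without any ";" yields the single value -1.
NF <- vapply(gregexpr(";", Data0, fixed = TRUE),
             function(m) sum(m > 0L), integer(1))

fields_manual <- NF + 1L

# Quotes and comment characters are switched off so that both methods
# see the raw text of each line.
fields_cf <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

print(rbind(fields_manual, fields_cf))
```

count.fields() by default treats ' and " as quote characters and # as a
comment character, which is why both are disabled here before comparing the
two counts.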
> On 31-May-09 15:04:23, jim holtman wrote:
>
>> You can do something like this: count the number of fields in each line
>> of
>> the file and use the max to determine the number of columns for
>> read.table:
>>
>> file <- '/tempxx.txt'
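>> # count the fields on every line; sep=";" matches the separator of the file
>> maxFields <- max(count.fields(file, sep=";"))
>> # now set up read.table for the max number of columns
>> input <- read.table(file, sep=";", colClasses=rep(NA, maxFields), fill=TRUE,
>> col.names=paste("V", seq(maxFields), sep=''))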
>>
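
A runnable sketch of the same idea on a throwaway file, assuming ";" as the
separator as in the original question. The file contents are invented, and
quote = "" and comment.char = "" are extra assumptions added here so that
stray quotes or # characters in tags cannot change the field counts.

```r
# Made-up input: five short lines followed by one long line.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;a;b", "2;a;b", "3;a;b", "4;a;b", "5;a;b",
             "6;a;b;c;d;e;f"), tmp)

# Widest line in the whole file, not just in the first five lines.
maxFields <- max(count.fields(tmp, sep = ";", quote = "", comment.char = ""))

# Supplying col.names of that length tells read.table how many columns to
# allocate; fill = TRUE pads the shorter lines.
input <- read.table(tmp, sep = ";", header = FALSE, fill = TRUE,
                    quote = "", comment.char = "",
                    col.names = paste0("V", seq_len(maxFields)),
                    stringsAsFactors = FALSE)

print(dim(input))
```

With the column names supplied up front, each input line should end up as
exactly one row of the data frame.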
>>
>> On Sun, May 31, 2009 at 6:06 AM, Martin Tomko
>> <martin.tomko at geo.uzh.ch>wrote:
>>
>>
>>> Dear Jim,
>>> with the help of Ted, we diagnosed that the cause is in the extreme
>>> variability in line length during reading in. As the table column
>>> number is
>>> apparently determined from the first five lines, what exceeds this
>>> length
>>> is automatically put on the next line.
>>> I am now trying to find a way to read in the data despite this. I have
>>> no
>>> control over the table extent, the only thing that would make sense
>>> according to my data would be to read in a fixed number of columns and
>>> merge
>>> all remaining columns as a long string in the last one. No idea how to
>>> do
>>> this, though.
>>>
>>> Thanks
>>> Martin
>>>
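
One possible way to do this inside R, as a sketch only: read the raw lines,
split them on the separator, keep a fixed number of leading fields and paste
everything else back into a single string. The function name
read_fixed_plus_rest, the column name "rest" and the sample data are all
hypothetical and not from the thread. All columns come back as character.

```r
# Hypothetical helper: first `n_fixed` fields become columns, the remaining
# fields are collapsed into one string in a final column called "rest".
read_fixed_plus_rest <- function(path, n_fixed, sep = ";") {
  lines <- readLines(path)
  parts <- strsplit(lines, sep, fixed = TRUE)  # drops a trailing empty field
  rows <- lapply(parts, function(p) {
    head_part <- p[seq_len(n_fixed)]           # NA where a line is too short
    rest <- if (length(p) > n_fixed) {
      paste(p[-seq_len(n_fixed)], collapse = sep)
    } else {
      NA_character_
    }
    c(head_part, rest)
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c(paste0("V", seq_len(n_fixed)), "rest")
  out
}

# Made-up records shaped like the ones in the question: 7 fixed fields,
# then a variable number of tags.
tmp <- tempfile(fileext = ".csv")
writeLines(c("37;100;13;8.5;47.1;user1;30;sculpture;bird;tourism;",
             "38;101;12;10.3;47.7;user2;36"), tmp)

dat <- read_fixed_plus_rest(tmp, n_fixed = 7)
print(dat$rest)
```

Since every line is handled on its own, the number of tags per record no
longer affects the shape of the result. Numeric columns would still need
converting afterwards, for example with as.numeric().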
>>>
>>> jim holtman wrote:
>>>
>>>
>>>> It is still not clear to me exactly how you want to read the lines
>>>> in. If
>>>> the lines have a variable number of fields, and some of the lines
>>>> might be
>>>> wrapped, is there some way to determine where the start of each line
>>>> is?
>>>> If you are reading them in with read.csv, then the system is
>>>> assuming
>>>> that each line starts a new row. If this is not the case, then you
>>>> will
>>>> have to state the rules that determine where the lines start. You
>>>> can
>>>> always read the data in with 'scan' to separate each line and then do
>>>> whatever processing is required to put together the rows in a data
>>>> frame
>>>> that you want.
>>>> In one of your examples, you indicated that the line was split
>>>> starting
>>>> at the word "kempten"; if this is in the middle of the line, then you
>>>> would
>>>> have to create the break after reading the line in with 'scan' and
>>>> then
>>>> creating the rows in the dataframe. All of this can be done in R if
>>>> you can
>>>> state what the criteria are.
>>>> On Sat, May 30, 2009 at 4:32 AM, Martin Tomko
>>>> <martin.tomko at geo.uzh.ch> wrote:
>>>>
>>>> Jim,
>>>> the two lines I put in are the actual problematic input lines.
>>>> In these examples, there are no quotes nor # signs, although I
>>>> have no means to make sure they do not occur in the inputs (any
>>>> hints how I could deal with that?).
>>>> I am trying to avoid as much pre-processing outside R as possible,
>>>> and I have to process about 500 files with up to 3000 records
>>>> each, so I need a more or less automated/batch solution. - so any
>>>> string substitution will have to occur in R. But for the moment, I
>>>> do not see a reason for substitution, and the wrapping still
>>>> occurs.
>>>>
>>>> Cheers
>>>> Martin
>>>>
>>>>
>>>>
>>>> jim holtman wrote:
>>>>
>>>> You need to supply the actual input line so we can see what is
>>>> happening. Are you sure you do not have unbalanced quotes in
>>>> your input (try quote='') or do you have comment characters
>>>> ("#") in your input?
>>>>
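
A small sketch of what switching off quote and comment handling looks like,
assuming free-text tags that may contain an apostrophe or a # sign. The
sample lines are invented. read.table() and count.fields() treat ' and " as
quotes and # as a comment character by default, while read.csv() defaults to
quote = "\"" and comment.char = "".

```r
# Made-up input containing an apostrophe and a "#" inside fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1;don't;tag", "2;#hashtag;tag", "3;plain;tag"), tmp)

# With quote = "" and comment.char = "" every character is taken literally.
strict <- read.table(tmp, sep = ";", header = FALSE,
                     quote = "", comment.char = "",
                     stringsAsFactors = FALSE)

n_fields <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

print(strict)
print(n_fields)
```

Passing the same two arguments to every reading function used on a file
keeps the field counts and the parsed table consistent with each other.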
>>>> On Fri, May 29, 2009 at 3:15 PM, Martin Tomko
>>>> <martin.tomko at geo.uzh.ch> wrote:
>>>>
>>>> Dear All,
>>>> I am observing a strange behavior and searching the
>>>> archives and
>>>> help pages didn't help much.
>>>> I have a csv with a variable number of fields in each line.
>>>>
>>>> I use
>>>> dataPoints <- read.csv(inputFile, head=FALSE, sep=";", fill=TRUE);
>>>>
>>>> to read it in, and it works. But - some lines are long and
>>>> 'wrap',
>>>> or split and continue on the next line. So when I check the
>>>> dim of
>>>> the frame, they are not correct and I can see when I do a
>>>> printout
>>>> that the line is split into two in the frame. I checked
>>>> the input
>>>> file and all is good.
>>>>
>>>> an example of the input is:
>>>> 37;2175168475;13;8.522729;47.19537;16366682 at N00;30;sculpture;bird;tourism;animal;statue;canon;eos;rebel;schweiz;switzerland;eagle;swiss;adler;skulptur;zug;1750;28;tamron;f28;canton;tourismus;vogel;baar;kanton;xti;tamron1750;1750mm;tamron1750mm;400d;rabbitriotnet;
>>>>
>>>> where the last value occurs on the next line in the data
>>>> frame.
>>>>
>>>> It does not have to be the last value, as in the following
>>>> example,
>>>> the word "kempten" starts the next line:
>>>> 39;167757703;12;10.309295;47.724545;21903142 at N00;36;white;building;tower;clock;clouds;germany;bayern;deutschland;bavaria;europa;europe;eagle;adler;eu;wolke;dome;townhall;rathaus;turm;weiss;allemagne;europeanunion;bundesrepublik;gebaeude;glocke;brd;allgau;kuppel;europ;kempten;niemcy;europo;federalrepublic;europaischeunion;europaeischeunion;germanio;
>>>>
>>>> What could be the reason?
>>>>
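
The behaviour can be reproduced on a tiny made-up file, as a sketch: when the
first five lines are short and a later line is longer, read.csv() with
fill=TRUE sizes the table from the early lines, so the data frame ends up
with more rows than the file has lines. The file contents below are invented
for illustration.

```r
# Made-up input: five lines with 3 fields, then one line with 7 fields.
tmp <- tempfile(fileext = ".csv")
writeLines(c(rep("1;a;b", 5), "6;a;b;c;d;e;f"), tmp)

# Same call as in the question, without col.names.
wrapped <- read.csv(tmp, header = FALSE, sep = ";", fill = TRUE)

n_lines  <- length(readLines(tmp))
per_line <- count.fields(tmp, sep = ";", quote = "", comment.char = "")

print(dim(wrapped))   # compare the row count with n_lines
print(n_lines)
print(per_line)       # field count per input line
```

Comparing nrow() of the result with the number of lines in the file is a
quick check for whether any record has been wrapped.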
>>>> I was thinking about solving the issue by using a different
>>>> separator, that I would use for the first 7 fields and
>>>> concatenating all of the remaining values into a single
>>>> string
>>>> value, but could not figure out how to do such a
>>>> substitution in
>>>> R. Unfortunately, on my system I cannot specify a range for
>>>> sed...
>>>>
>>>> Thanks for any help/pointers
>>>> Martin
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained,
>>>> reproducible code.
>>>>
>>>>
>>>>
>>>>
>>>> -- Jim Holtman
>>>> Cincinnati, OH
>>>> +1 513 646 9390
>>>>
>>>> What is the problem that you are trying to solve?
>>>>
>>>>
>>> --
>>> Martin Tomko
>>> Postdoctoral Research Assistant Geographic Information Systems
>>> Division
>>> Department of Geography
>>> University of Zurich - Irchel
>>> Winterthurerstr. 190
>>> CH-8057 Zurich, Switzerland
>>>
>>> email: martin.tomko at geo.uzh.ch
>>> site: http://www.geo.uzh.ch/~mtomko
>>> mob: +41-788 629 558
>>> tel: +41-44-6355256
>>> fax: +41-44-6356848
>>>
>>>
>>>
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 31-May-09 Time: 16:24:27
> ------------------------------ XFMail ------------------------------
>
>
>