[R] Unable to read csv files with comma in values

Bert Gunter bgunter@4567 @end|ng |rom gm@||@com
Sun Apr 7 19:35:06 CEST 2019


... and here's another perhaps simpler, perhaps more efficient (??) way of
doing it using strsplit().Note that it uses the fixed field position, 2, of
the bracketed roles. Adjust as needed.

A better solution would be a regex that avoids the loops (here, the sapply)
altogether, but I don't know how to do this. Maybe someone cleverer will
offer such a solution.

txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101",
        "Mary, [Stats, Designer, R], t14")

wh <-  grep("\\[.+\\]", txt)
spl <-  strsplit(txt[wh], "\\[|\\]")
txt[wh] <-  sapply(spl, function(y)
   paste0(y[1], gsub(" *, *","-", y[2]), y[-(1:2)]))

> txt
[1] "Sam, HadoopAnalyst-DBA-Developer, R46443 "
[2] "Jan, DBA, R101"
[3] "Mary, Stats-Designer-R, t14"

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Apr 7, 2019 at 9:55 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> (Note: This follows an earlier mistaken reply just to Duncan)
>
> Multiple "amens!" to Duncan's comments...
>
> However:
>
> Here is a start at my interpretation of how to do what you want. Note
> first that your "example" listed 4 fields in the line, but you showed only
> 3. I modified your example for 3 text fields, only one of which has
> brackets ([...]) in it I assume. Here is a little example of how to use
> regex's to replace the commas within the brackets by "-", which would
> presumably then allow you to easily convert the text into a data frame e.g.
> using textConnection() and read.csv. Obviously, if this is not what you
> meant, read no further.
>
> ##Example
> txt <-c("Sam, [HadoopAnalyst, DBA, Developer], R46443 ","Jan, DBA, R101",
>         "Mary, [Stats, Designer, R], t14")
>
> wh <- grep("\\[.+\\]",txt)  ## which records need to be modified?
> fixup <- gsub(" *, *","-",sub(".+(\\[.+\\]).+","\\1",txt[wh])) ##
> bracketed expressions, changing "," to "-"
>
> ## Unfortunately, the "replacement" argument in sub() is not vectorized,
> se we need a loop:
>
> for(i in wh) txt[wh[i]] <- sub("\\[.+\\]",fixup[i],txt[wh[i]]) ## replace
> original bracketed text with fixed up bracketed text
>
> > txt
> [1] "Sam, [HadoopAnalyst-DBA-Developer], R46443 "
> [2] "Jan, DBA, R101"
> [3] "Mary, [HadoopAnalyst-DBA-Developer], t14"
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sun, Apr 7, 2019 at 9:00 AM Duncan Murdoch <murdoch.duncan using gmail.com>
> wrote:
>
>> On 06/04/2019 10:03 a.m., Amit Govil wrote:
>> > Hi,
>> >
>> > I have a bunch of csv files to read in R. I'm unable to read them
>> correctly
>> > because in some of the files, there is a column ("Role") which has
>> comma in
>> > the values.
>> >
>> > Sample data:
>> >
>> > User, Role, Rule, GAPId
>> > Sam, [HadoopAnalyst, DBA, Developer], R46443
>> >
>> > I'm trying to play with the below code but it doesnt work:
>>
>> Since you didn't give a reproducible example, you should at least say
>> what "doesn't work" means.
>>
>> But here's some general advice:  if you want to debug code, don't write
>> huge expressions like the chain of functions below, put things in
>> temporary variables and make sure you get what you were expecting at
>> each stage.
>>
>> Instead of
>> >
>> > files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>> >
>> > tbl <- sapply(files, function(f) {
>> >    gsub('\\[|\\]', '"', readLines(f)) %>%
>> >      read.csv(text = ., check.names = FALSE)
>> > }) %>%
>> >    bind_rows(.id = "id") %>%
>> >    select(id, User, Rule) %>%
>> >    distinct()
>>
>> try
>>
>>
>> files <- list.files(pattern='.*REDUNDANT(.*).csv$')
>>
>> tmp1 <- sapply(files, function(f) {
>>    gsub('\\[|\\]', '"', readLines(f)) %>%
>>      read.csv(text = ., check.names = FALSE)
>> })
>>
>> tmp2 <- tmp1 %>% bind_rows(.id = "id")
>>
>> tmp3 <- tmp2 %>% select(id, User, Rule)
>>
>> tbl <- tmp3 %>% distinct()
>>
>> (You don't need pipes here, but it will make it easier to put the giant
>> expression back together at the end.)
>>
>> Then look at tmp1, tmp2, tmp3 as well as tbl to see where things went
>> wrong.
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list