[R] Reformatting text inside a data frame
David Winsemius
dwinsemius at comcast.net
Mon Sep 7 23:25:59 CEST 2015
> On Sep 7, 2015, at 1:20 PM, Jon BR <jonsleepy at gmail.com> wrote:
>
> Hi John,
> Thanks for the reply; I'm pasting here the output from dput, with a
> 'df <-' added in front:
>
> df <- structure(list(rowNum = c(1, 2, 3), first = structure(c(NA, 1L,
> 2L), .Label = c("AD=2;BA=8", "AD=9;BA=1"), class = "factor"),
> second = structure(c(2L, 1L, NA), .Label = c("AD=1;BA=2",
> "AD=13;BA=49"), class = "factor")), .Names = c("rowNum",
> "first", "second"), row.names = c(NA, -3L), class = "data.frame")
>
>
>
>
> To add more specifics, about what I would like; each value to be adjusted
> has the following general format:
>
> "AD=X;BA=Y"
>
> I would like to extract the values of X and Y and format them as a string
> as such:
>
> "X_X-Y"
>
>
> Here's how I would handle a specific instance using awk in a shell script:
>
> echo "AD=X;BA=Y" | awk '{split($1,a,"AD="); split(a[2],b,";");
> split(b[2],c,"BA="); print b[1]"_"b[1]"-"c[2]}'
> X_X-Y
>
> I'd like this to apply for all the entries that aren't NA to the right of
> column 1.
df[2:3] <- lapply(df[2:3], sub, patt="(AD\\=)(.+)(;BA\\=)(.+)",
                  repl="\\2_\\2-\\4" )
> df
  rowNum first   second
1      1  <NA> 13_13-49
2      2 2_2-8    1_1-2
3      3 9_9-1     <NA>
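
For readers who want the same idea spelled out, here is a minimal sketch, not a definitive implementation. It assumes every non-NA entry has exactly the form "AD=X;BA=Y". The helper name reformat_ad_ba is hypothetical (it is not from this thread), and the pattern is an anchored variant with two capture groups rather than the four-group pattern used above.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps the columns as character.
df <- data.frame(
  rowNum = c(1, 2, 3),
  first  = c(NA, "AD=2;BA=8", "AD=9;BA=1"),
  second = c("AD=13;BA=49", "AD=1;BA=2", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the thread): turn "AD=X;BA=Y" into "X_X-Y".
reformat_ad_ba <- function(x) {
  # as.character() makes factor columns safe to pass in; sub() leaves NA as NA.
  # \\1 is the value after "AD=", \\2 is the value after "BA=".
  sub("^AD=([^;]+);BA=(.+)$", "\\1_\\1-\\2", as.character(x))
}

# Apply the helper to every column to the right of the first one.
df[-1] <- lapply(df[-1], reformat_ad_ba)
print(df)
```

Entries that do not match the pattern come back unchanged, so it may be worth checking for stray formats in a large data frame before relying on the result.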
>
> Hoping this adds clarity for any others who also didn't follow my example.
>
> Thanks in advance for any tips-
>
> Best,
> Jonathan
>
> On Mon, Sep 7, 2015 at 3:48 PM, John Kane <jrkrideau at inbox.com> wrote:
>
>> I'm not making a lot of sense of the data, it looks like you want more
>> recodes than you have mentioned but in any case you might want to look at
>> the recode function in the car package. It "should" do what you want
>> though there may be faster ways to do it.
>>
>> BTW, for supplying sample data have a look at ?dput . Using dput() means
>> that we see exactly the same data as you do.
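
As an illustration of the dput() suggestion, here is a minimal sketch. The small data frame is a stand-in built from values in this thread, and the round trip through capture.output() and parse() is only there to show that the printed text recreates the object; in practice one would simply paste the dput() output into the message.

```r
# Small stand-in for the poster's data (values taken from this thread).
df <- data.frame(
  rowNum = c(1, 2, 3),
  first  = c(NA, "AD=2;BA=8", "AD=9;BA=1"),
  stringsAsFactors = FALSE
)

# dput() prints an R expression that recreates the object, including
# column types, so readers get the same data the poster has.
dput(df)

# Round trip: capture the printed expression as text and evaluate it again.
txt <- paste(capture.output(dput(df)), collapse = "\n")
df2 <- eval(parse(text = txt))
```

Because the output records whether a column is a factor or a character vector, it removes the guesswork that a printed table leaves behind.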
>>
>> Sorry not to be of more help
>> John Kane
>> Kingston ON Canada
>>
>>
>>> -----Original Message-----
>>> From: jonsleepy at gmail.com
>>> Sent: Mon, 7 Sep 2015 15:27:05 -0400
>>> To: r-help at r-project.org
>>> Subject: [R] Reformatting text inside a data frame
>>>
>>> Hi all,
>>> I've read in a large data frame that has formatting similar to the one
>>> in the small example below:
>>>
>>> df <-
>>> data.frame(c(1,2,3),c(NA,"AD=2;BA=8","AD=9;BA=1"),c("AD=13;BA=49","AD=1;BA=2",NA));
>>> names(df) <- c("rowNum","first","second")
>>>
>>>> df
>>>   rowNum     first      second
>>> 1      1      <NA> AD=13;BA=49
>>> 2      2 AD=2;BA=8   AD=1;BA=2
>>> 3      3 AD=9;BA=1        <NA>
>>>
>>>
>>> I'd like to reformat all of the non-NA entries in df from "first" and
>>> "second" and so-on such that "AD=13;BA=49" will be replaced by the
>>> following string: "13_13-49".
>>>
>>> So applied to df, the output would be the following:
>>>
>>>   rowNum first   second
>>> 1      1  <NA> 13_13-49
>>> 2      2 2_2-8    1_1-2
>>> 3      3 9_9-1     <NA>
>>>
>>>
>>> I'm generally a big proponent of shell scripting with awk, but I'd prefer
>>> an all-R solution if one exists (and also to learn how to do this more
>>> generally).
>>>
>>> Could someone point out an appropriate paradigm or otherwise point me in
>>> the right direction?
>>>
>>> Best,
>>> Jonathan
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>