[R] Extract

Bert Gunter bgunter.4567 using gmail.com
Mon Jul 22 16:22:08 CEST 2024


Thanks.

I found this to be quite informative and a nice example of how useful
R-Help can be as a resource for R users.

Best,
Bert

On Mon, Jul 22, 2024 at 4:50 AM Gabor Grothendieck
<ggrothendieck using gmail.com> wrote:
>
> Base R. Regarding code improvements:
>
> 1. Personally I find (\(...) ...)() notation hard to read (although by
> placing (\(x), the body and )() on 3 separate lines it can be improved
> somewhat). Instead let us use a named function. The name of the
> function can also serve to self document the code.
>
> 2. The use of dat both at the start of the pipeline and then again
> within a later step of the pipeline goes against a strict left to
> right flow. In general if this occurs it is either a sign that we need
> to break the pipeline into two or that we need to find another
> approach which is what we do here.
>
> We can use the base R code below. Note that the column names produced
> by transform(S = read.table(...)) are S.V1, S.V2, etc. so to fix the
> column names remove .V from all column names as in the fix_colnames
> function shown. It does no harm to apply that to all column names
> since the remaining column names will not match.
>
>   fix_colnames <- function(x) {
>     setNames(x, sub("\\.V", "", names(x)))
>   }
>
>   dat |>
>      transform(S = read.table(text = string,
>        header = FALSE, fill = TRUE, na.strings = "")) |>
>        fix_colnames()
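
A minimal sketch of the two steps above, using the sample data from the original post. The intermediate names wide and fixed are only for illustration and are not part of the posted code; the point is to show the S.V1, S.V2, ... names that transform() produces before they are cleaned up.

```r
# Sample data from the original post.
dat <- read.csv(text = "Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header = TRUE)

# Step 1: transform(S = <data frame>) prefixes the new columns with "S.".
wide <- dat |>
  transform(S = read.table(text = string,
    header = FALSE, fill = TRUE, na.strings = ""))
print(names(wide))

# Step 2: dropping ".V" turns S.V1 into S1, S.V2 into S2, and so on.
fixed <- setNames(wide, sub("\\.V", "", names(wide)))
print(names(fixed))
```

The original columns are untouched by sub() because none of their names contain ".V".
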
>
> Another way to write this, which uses neither a separately defined
> function nor the anonymous function notation, is to box the output of
> transform:
>
>   dat |>
>      transform(S = read.table(text = string,
>        header = FALSE, fill = TRUE, na.strings = "")) |>
>        list(x = _) |>
>        with( setNames(x, sub("\\.V", "", names(x))) )
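
The boxing idea can be seen in isolation with a toy input. This is only a sketch: the vector x0 and the name shares are hypothetical and not from the thread, and the named `_` placeholder assumes R 4.2 or later. Wrapping the piped value in list(x = _) gives it a name, so the following with() step can refer to it more than once.

```r
# Hypothetical toy input, not from the thread.
x0 <- c(a = 1, b = 3)

# Box the piped value as x, then use x twice inside with().
shares <- x0 |>
  list(x = _) |>
  with(x / sum(x))

print(shares)
```
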
>
> dplyr. Alternatively, use dplyr, in which case we can make use of
> rename_with. Here read.table(...) creates column names V1, V2, etc.
> and mutate does not change them, so simply replacing V with S at the
> start of each column name in the output of read.table will do. Also,
> since mutate won't change the column names, we can pipe the read.table
> output directly to rename_with using a nested pipeline (i.e. the
> second pipe is entirely within mutate rather than after it). The win
> here is that, unlike transform, mutate does not require the S= that
> is needed with transform (although it would have allowed it had we
> wanted it).
>
>   library(dplyr)
>
>   dat |>
>     mutate(
>       read.table(text = string,
>         header = FALSE, fill = TRUE, na.strings = "") |>
>         rename_with(~ sub("^V", "S", .x))
>     )
>
>
> On Sun, Jul 21, 2024 at 3:08 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> >
> > As always, good point.
> > Here's a piped version of your code for those who are pipe
> > aficionados. As I'm not very skilled with pipes, it can certainly
> > be improved.
> > dat <-
> >       dat$string |>
> >          read.table( text = _, fill = TRUE, header = FALSE, na.strings = "")  |>
> >          (\(x)'names<-'(x,paste0("S", seq_along(x))))() |>
> >          (\(x)cbind(dat, x))()
> >
> > -- Bert
> >
> >
> > On Sun, Jul 21, 2024 at 11:30 AM Gabor Grothendieck
> > <ggrothendieck using gmail.com> wrote:
> > >
> > > Fixing col.names=paste0("S", 1:5) assumes that there will be 5 columns and
> > > we may not want to do that.  If there are at most 3 fields in string,
> > > we may wish to generate only 3 columns.
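
A small sketch of the difference, assuming a hypothetical input (short) whose rows have at most 3 fields. With fill = TRUE, supplying five names through col.names forces five columns, whereas naming the columns after reading follows whatever read.table found.

```r
# Hypothetical input with at most 3 fields per row (not from the thread).
short <- c("15 xc Ab", "14", "13 25")

# Hard-coding five names forces five columns; the last two are all NA.
fixed5 <- read.table(text = short, fill = TRUE, header = FALSE,
  na.strings = "", col.names = paste0("S", 1:5))

# Naming after the fact adapts to the number of columns actually read.
auto <- read.table(text = short, fill = TRUE, header = FALSE,
  na.strings = "")
names(auto) <- paste0("S", seq_along(auto))

print(ncol(fixed5))  # 5
print(ncol(auto))    # 3
```
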
> > >
> > > On Sun, Jul 21, 2024 at 2:20 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> > > >
> > > > Nice! -- Let read.table do the work of handling the NA's.
> > > > However, even simpler is to use the 'col.names' argument of
> > > > read.table() for the column names, no?
> > > >
> > > >       string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
> > > >                            na.strings = "", col.names = paste0("S", 1:5))
> > > >       dat <- cbind(dat, string)
> > > >
> > > > -- Bert
> > > >
> > > > On Sun, Jul 21, 2024 at 10:16 AM Gabor Grothendieck
> > > > <ggrothendieck using gmail.com> wrote:
> > > > >
> > > > > We can use read.table for a base R solution
> > > > >
> > > > > string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
> > > > > na.strings = "")
> > > > > names(string) <- paste0("S", seq_along(string))
> > > > > cbind(dat[-3], string)
> > > > >
> > > > > On Fri, Jul 19, 2024 at 12:52 PM Val <valkremk using gmail.com> wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I want to extract new variables from a string and add them to the data frame.
> > > > > > The sample data is in CSV format.
> > > > > >
> > > > > > dat<-read.csv(text="Year, Sex,string
> > > > > > 2002,F,15 xc Ab
> > > > > > 2003,F,14
> > > > > > 2004,M,18 xb 25 35 21
> > > > > > 2005,M,13 25
> > > > > > 2006,M,14 ac 256 AV 35
> > > > > > 2007,F,11",header=TRUE)
> > > > > >
> > > > > > The string column has a maximum of five variables. Some rows have all
> > > > > > five and others have fewer. If a variable is missing, fill it with
> > > > > > NA. The desired result is shown below.
> > > > > >
> > > > > >
> > > > > > Year,Sex,string, S1, S2, S3, S4,S5
> > > > > > 2002,F,15 xc Ab, 15,xc,Ab, NA, NA
> > > > > > 2003,F,14, 14,NA,NA,NA,NA
> > > > > > 2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
> > > > > > 2005,M,13 25,13, 25,NA,NA,NA
> > > > > > 2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
> > > > > > 2007,F,11, 11,NA,NA,NA,NA
> > > > > >
> > > > > > Any help?
> > > > > > Thank you in advance.
> > > > > >
> > > > > > ______________________________________________
> > > > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > > > and provide commented, minimal, self-contained, reproducible code.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Statistics & Software Consulting
> > > > > GKX Group, GKX Associates Inc.
> > > > > tel: 1-877-GKX-GROUP
> > > > > email: ggrothendieck at gmail.com
> > > > >
> > > > > ______________________________________________
> > > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > > and provide commented, minimal, self-contained, reproducible code.
> > >
> > >
> > >
> > > --
> > > Statistics & Software Consulting
> > > GKX Group, GKX Associates Inc.
> > > tel: 1-877-GKX-GROUP
> > > email: ggrothendieck at gmail.com
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com


