[R] Extract

Mon Jul 22 18:04:43 CEST 2024

But have we lured you to the dark side with the tidyverse yet? ;-)

On Mon, 22 Jul 2024, 15:22 Bert Gunter, <bgunter.4567 using gmail.com> wrote:

> Thanks.
>
> I found this to be quite informative and a nice example of how useful
> R-Help can be as a resource for R users.
>
> Best,
> Bert
>
> On Mon, Jul 22, 2024 at 4:50 AM Gabor Grothendieck
> <ggrothendieck using gmail.com> wrote:
> >
> > Base R. Regarding code improvements:
> >
> > 1. Personally I find (\(...) ...)() notation hard to read (although by
> > placing (\(x), the body and )() on 3 separate lines it can be improved
> > somewhat). Instead let us use a named function. The name of the
> > function can also serve to self document the code.
> >
> > 2. The use of dat both at the start of the pipeline and then again
> > within a later step of the pipeline goes against a strict left to
> > right flow. In general if this occurs it is either a sign that we need
> > to break the pipeline into two or that we need to find another
> > approach which is what we do here.
> >
> > We can use the base R code below. Note that the column names produced
> > by transform(S = read.table(...)) are S.V1, S.V2, etc. so to fix the
> > column names remove .V from all column names as in the fix_colnames
> > function shown. It does no harm to apply that to all column names
> > since the remaining column names will not match.
> >
> >   fix_colnames <- function(x) {
> >     setNames(x, sub("\\.V", "", names(x)))
> >   }
> >
> >   dat |>
> >      transform(S = read.table(text = string,
> >        header = FALSE, fill = TRUE, na.strings = "")) |>
> >        fix_colnames()
> >
> > Another way to write this, which uses neither a separately defined
> > function nor the anonymous function notation, is to box the output of
> > transform:
> >
> >   dat |>
> >      transform(S = read.table(text = string,
> >        header = FALSE, fill = TRUE, na.strings = "")) |>
> >        list(x = _) |>
> >        with( setNames(x, sub("\\.V", "", names(x))) )
> >
> > dplyr. Alternatively, use dplyr, in which case we can make use of
> > rename_with. Here read.table(...) creates column names V1, V2, etc.
> > and mutate does not change them, so simply replacing V with S at the
> > start of each column name in the output of read.table will do. We can
> > also pipe the read.table output directly to rename_with using a
> > nested pipeline (i.e. the second pipe is entirely within mutate
> > rather than after it), since mutate won't change the column names.
> > This works because, unlike transform, mutate does not require the S=
> > that is needed with transform (although it would allow it had we
> > wanted it).
> >
> >   library(dplyr)
> >
> >   dat |>
> >      mutate(read.table(text = string,
> >        header = FALSE, fill = TRUE, na.strings = "")  |>
> >       rename_with(~ sub("^V", "S", .x))
> >     )
> >
> >
> > On Sun, Jul 21, 2024 at 3:08 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> > >
> > > As always, good point.
> > > Here's a piped version of your code for those who are pipe
> > > aficionados. As I'm not very skilled with pipes, it might certainly
> > > be improved.
> > > dat <-
> > >       dat$string |>
> > >          read.table(text = _, fill = TRUE, header = FALSE,
> > >                     na.strings = "") |>
> > >          (\(x) 'names<-'(x, paste0("s", seq_along(x))))() |>
> > >          (\(x) cbind(dat, x))()
> > >
> > > -- Bert
> > >
> > >
> > > On Sun, Jul 21, 2024 at 11:30 AM Gabor Grothendieck
> > > <ggrothendieck using gmail.com> wrote:
> > > >
> > > > Fixing col.names=paste0("S", 1:5) assumes that there will be 5
> > > > columns and we may not want to do that.  If there are only 3 fields
> > > > in string, at the most, we may wish to generate only 3 columns.
> > > >
> > > > On Sun, Jul 21, 2024 at 2:20 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
> > > > >
> > > > > Nice! -- Let read.table do the work of handling the NA's.
> > > > > However, even simpler is to use the 'col.names' argument of
> > > > > read.table() for the column names, no?
> > > > >
> > > > >       string <- read.table(text = dat$string, fill = TRUE,
> > > > >                            header = FALSE, na.strings = "",
> > > > >                            col.names = paste0("s", 1:5))
> > > > >       dat <- cbind(dat, string)
> > > > >
> > > > > -- Bert
> > > > >
> > > > > On Sun, Jul 21, 2024 at 10:16 AM Gabor Grothendieck
> > > > > <ggrothendieck using gmail.com> wrote:
> > > > > >
> > > > > > We can use read.table for a base R solution
> > > > > >
> > > > > > string <- read.table(text = dat$string, fill = TRUE, header = FALSE,
> > > > > >                      na.strings = "")
> > > > > > names(string) <- paste0("S", seq_along(string))
> > > > > > cbind(dat[-3], string)
> > > > > >
> > > > > > On Fri, Jul 19, 2024 at 12:52 PM Val <valkremk using gmail.com> wrote:
> > > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I want to extract new variables from a string and add them to
> > > > > > > the data frame. The sample data is a CSV file.
> > > > > > >
> > > > > > > dat<-read.csv(text="Year, Sex,string
> > > > > > > 2002,F,15 xc Ab
> > > > > > > 2003,F,14
> > > > > > > 2004,M,18 xb 25 35 21
> > > > > > > 2005,M,13 25
> > > > > > > 2006,M,14 ac 256 AV 35
> > > > > > > 2007,F,11",header=TRUE)
> > > > > > >
> > > > > > > The string column has a maximum of five variables. Some rows
> > > > > > > have all five and others do not. If a variable is missing,
> > > > > > > fill it with NA.
> > > > > > > The desired result is shown below:
> > > > > > >
> > > > > > >
> > > > > > > Year,Sex,string, S1, S2, S3, S4,S5
> > > > > > > 2002,F,15 xc Ab, 15,xc,Ab, NA, NA
> > > > > > > 2003,F,14, 14,NA,NA,NA,NA
> > > > > > > 2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
> > > > > > > 2005,M,13 25,13, 25,NA,NA,NA
> > > > > > > 2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
> > > > > > > 2007,F,11, 11,NA,NA,NA,NA
> > > > > > >
> > > > > > > Any help?
> > > > > > > Thank you in advance.
> > > > > > >
> > > > > > > ______________________________________________
> > > > > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > > > > and provide commented, minimal, self-contained, reproducible code.
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Statistics & Software Consulting
> > > > > > GKX Group, GKX Associates Inc.
> > > > > > tel: 1-877-GKX-GROUP
> > > > > > email: ggrothendieck at gmail.com
> > > > > >
>
>
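
For anyone who wants to run the base R approach from this thread end to end, here is a minimal sketch. It combines the two suggestions quoted above: it passes col.names to read.table, but sizes them from the data rather than fixing them at five. Assumptions that are not from the thread: the helper name split_fields is made up for illustration, the CSV header is written as "Year,Sex,string" without the stray space, and count.fields() is used to find the widest row. That last step is there because read.table with fill = TRUE decides the number of columns from the first five lines of input unless a longer col.names is given, so a wider row further down could otherwise be mishandled.

```r
# Sample data from the original question (header spacing normalised).
dat <- read.csv(text = "Year,Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11", header = TRUE)

# split_fields() is a hypothetical helper name, not something from the thread.
# It splits whitespace-separated strings into columns prefix1, prefix2, ...
split_fields <- function(x, prefix = "S") {
  x <- as.character(x)

  # Count the fields in every row so the widest row sets the column count.
  con <- textConnection(x)
  on.exit(close(con))
  n <- max(count.fields(con, sep = ""), na.rm = TRUE)

  # fill = TRUE pads short rows; na.strings = "" turns the padding into NA.
  read.table(text = x, header = FALSE, fill = TRUE, na.strings = "",
             col.names = paste0(prefix, seq_len(n)))
}

# Keep the original string column and append the extracted fields.
result <- cbind(dat, split_fields(dat$string))
print(result)
```

With the sample data this appends columns S1 to S5 to dat, with NA wherever a row has fewer than five fields. If only three fields ever occur, only S1 to S3 are created.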
