[R] Extract

Fri Jul 19 20:40:48 CEST 2024

Here is another way. For data analysis, the long ("idiomatic") result is usually more useful, though the wide result may be preferable when presenting final results.

library(dplyr)
library(tidyr)

dat<-read.csv(text=
"Year, Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11"
, header=TRUE )

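idiomatic <- (
    dat
    # split each string on spaces into a list-column of tokens
    %>% mutate( string = strsplit( string, " " ) )
    # expand the list-column to one row per token
    %>% unnest( cols = string )
    %>% group_by( Year, Sex )
    # number the tokens within each Year/Sex group: S1, S2, ...
    %>% mutate( s_name = paste0( "S", seq_along( string ) ) )
    %>% ungroup()
)
idiomatic # each row has unique Year, Sex, and s_name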

wide <- (
    idiomatic
    %>% spread( s_name, string )
)
wide
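
If the original string column should also survive into the wide result, as in the desired output, one possible variation is sketched below. It assumes a tidyr version that provides pivot_wider() (the replacement for the superseded spread()); the names token and wide2 are made up for this illustration. Like the code above, it relies on each Year/Sex pair identifying a single input row.

```r
library(dplyr)
library(tidyr)

dat <- read.csv(text =
"Year,Sex,string
2002,F,15 xc Ab
2003,F,14
2004,M,18 xb 25 35 21
2005,M,13 25
2006,M,14 ac 256 AV 35
2007,F,11"
, header = TRUE )

wide2 <- (
    dat
    # keep 'string' untouched; put the split pieces in a new list-column
    %>% mutate( token = strsplit( string, " " ) )
    %>% unnest( cols = token )
    %>% group_by( Year, Sex )
    %>% mutate( s_name = paste0( "S", seq_along( token ) ) )
    %>% ungroup()
    # Year, Sex and string act as id columns; absent tokens become NA
    %>% pivot_wider( names_from = s_name, values_from = token )
)
wide2
```

All of S1 through S5 come out as character columns, since numbers and text are mixed within the tokens.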

On July 19, 2024 11:23:48 AM PDT, Val <valkremk using gmail.com> wrote:
>Thank you and sorry for the confusion.
>The desired result should have 8 comma-separated variables in
>each line. The string variable is considered one variable.
>The output of your script is fine for me. Thank you!
>
>On Fri, Jul 19, 2024 at 1:00 PM Ebert,Timothy Aaron <tebert using ufl.edu> wrote:
>>
>> The desired result is odd.
>> 1) It looks like the string is duplicated in the desired result. The first line of data has "15, xc, Ab",  and the desired result has "15, xc, Ab, 15, xc, Ab"
>> 2) The example has S1 through S5, but the desired result has data for eight variables in the first line (not five).
>> 3) The desired result has a different number of variables for each line.
>> 4) Are you assuming that all missing data is at the end of the string? If there are 5 variables (S1 .... S5), do you know that "15, xc, Ab" is S1 = 15, S2 = 'xc', and S3 = 'Ab' rather than S2=15, S4='xc' and S5='Ab' ?
>>
>> This isn't exactly what you asked for, but maybe I was confused somewhere. This approach puts string data into variables in order. In this approach one mixes string and numeric data. The string is not duplicated.
>>
>> library(tidyr)
>>
>> dat <- read.csv(text="Year,Sex,string
>> 2002,F,15 xc Ab
>> 2003,F,14
>> 2004,M,18 xb 25 35 21
>> 2005,M,13 25
>> 2006,M,14 ac 256 AV 35
>> 2007,F,11", header=TRUE, stringsAsFactors=FALSE)
>>
>> # split the 'string' column based on spaces
>> dat_separated <- dat |>
>>   separate(string, into = paste0("S", 1:5), sep = " ",
>>            fill = "right", extra = "merge")
>>
>> Tim
>>
>>
>> -----Original Message-----
>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Val
>> Sent: Friday, July 19, 2024 12:52 PM
>> To: r-help using R-project.org (r-help using r-project.org) <r-help using r-project.org>
>> Subject: [R] Extract
>>
>> Hi All,
>>
>> I want to extract new variables from a string and add them to the data frame.
>> The sample data is a CSV file.
>>
>> dat<-read.csv(text="Year, Sex,string
>> 2002,F,15 xc Ab
>> 2003,F,14
>> 2004,M,18 xb 25 35 21
>> 2005,M,13 25
>> 2006,M,14 ac 256 AV 35
>> 2007,F,11",header=TRUE)
>>
>> The string column holds at most five variables. Some rows have all five and others have fewer. If a variable is missing, fill it with NA. The desired result is shown below.
>>
>>
>> Year,Sex,string, S1, S2, S3, S4, S5
>> 2002,F,15 xc Ab, 15,xc,Ab, NA, NA
>> 2003,F,14, 14,NA,NA,NA,NA
>> 2004,M,18 xb 25 35 21,18, xb, 25, 35, 21
>> 2005,M,13 25,13, 25,NA,NA,NA
>> 2006,M,14 ac 256 AV 35, 14, ac, 256, AV, 35
>> 2007,F,11, 11,NA,NA,NA,NA
>>
>> Any help?
>> Thank you in advance.
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.r-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

-- 
Sent from my phone. Please excuse my brevity.