[R] List of Levels for all Factor variables

Lopez, Dan lopez235 at llnl.gov
Wed Oct 17 23:10:54 CEST 2012


Hi David,

This is perfect.

Thank you very much!

FYI - I tweaked the code you gave me to exclude factor variables with more than 32 levels (based on Random Forest limits). These would be fields like employee names or department names. This is what I used:
PrintLvls2 <- function(x) {
  keep <- sapply(x, function(v) is.factor(v) && nlevels(v) <= 32)  # factors with at most 32 levels
  print(data.frame(Lvls = sapply(x[keep], nlevels),
                   Names = sapply(x[keep], function(y) paste0(levels(y), collapse = ", "))),
        right = FALSE)
}
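
For anyone reusing this, here is a minimal sketch of the same idea with the cutoff exposed as an argument. The function name factor_level_summary, the max_levels parameter and the toy data frame are all hypothetical and not part of the code above. It returns the data frame rather than printing it, so the result can be inspected or printed later.

```r
# Hypothetical helper (not from the thread): summarise the factor columns
# of a data frame that have at most `max_levels` levels.
factor_level_summary <- function(x, max_levels = 32) {
  # TRUE for each column that is a factor with few enough levels
  keep <- vapply(x, function(v) is.factor(v) && nlevels(v) <= max_levels,
                 logical(1))
  data.frame(
    Lvls  = vapply(x[keep], nlevels, integer(1)),
    Names = vapply(x[keep], function(v) paste(levels(v), collapse = ", "),
                   character(1)),
    stringsAsFactors = FALSE
  )
}

# Toy data (assumed for illustration): `id` has 40 levels and is dropped,
# `grp` has 3 levels and is kept, `n` is not a factor and is ignored.
toy <- data.frame(id  = factor(sprintf("emp%02d", 1:40)),
                  grp = factor(rep(c("a", "b", "c", "a"), 10)),
                  n   = 1:40)
res <- factor_level_summary(toy)
print(res, right = FALSE)
```

Using vapply instead of sapply fixes the return type, so the function still gives a well-formed (empty) data frame when no column passes the filter.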

Thanks again.
Dan


-----Original Message-----
From: David L Carlson [mailto:dcarlson at tamu.edu] 
Sent: Wednesday, October 17, 2012 8:29 AM
To: 'arun'; Lopez, Dan
Cc: 'R help'
Subject: RE: [R] List of Levels for all Factor variables

Given dat1, does this work?

> PrintLvls <- function(x) {print(data.frame(Lvls=sapply(x[sapply(x, is.factor)],
+      nlevels), Names=sapply(x[sapply(x, is.factor)], 
+      function(y) paste0(levels(y), collapse=", "))), right=FALSE) }
> PrintLvls(dat1)
     Lvls Names                          
col1 9    2, 6, 7, 10, 15, 16, 17, 23, 24
col2 7    b, c, d, e, g, h, j            
col3 5    1, 2, 3, 4, 5                  

It automatically extracts the columns that are factors so it should work on your original data.frame.
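
The column selection can be seen in isolation: sapply(x, is.factor) gives a named logical vector, and indexing the data frame with it keeps only the factor columns. The sketch below uses a small hand-made data frame (demo is a hypothetical name) rather than dat1. It passes stringsAsFactors = TRUE explicitly because, from R 4.0.0 on, data.frame() no longer converts character columns to factors by default, so a character column such as col2 would otherwise be skipped.

```r
# Hand-made example data; `demo` is an illustrative name, not from the thread.
demo <- data.frame(col1 = factor(c(2, 6, 6, 10)),
                   col2 = c("b", "c", "b", "d"),
                   col3 = 1:4,
                   stringsAsFactors = TRUE)  # make col2 a factor on any R version

is_fac   <- sapply(demo, is.factor)  # named logical vector, one entry per column
fac_only <- demo[is_fac]             # data frame holding only the factor columns

print(is_fac)
print(names(fac_only))
print(sapply(fac_only, nlevels))
```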

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352



> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of arun
> Sent: Tuesday, October 16, 2012 12:09 PM
> To: Lopez, Dan
> Cc: R help
> Subject: Re: [R] List of Levels for all Factor variables
> 
> HI,
> You can also try this:
> set.seed(1)
> dat1 <- data.frame(col1=factor(sample(1:25,10,replace=TRUE)),
>                    col2=sample(letters[1:10],10,replace=TRUE),
>                    col3=factor(rep(1:5,each=2)))
> 
> sapply(lapply(mapply(c,lapply(names(sapply(dat1,levels)),function(x) x),
>   sapply(dat1,levels)),function(x) paste(x[1],":",paste(x[-1],collapse=" "))),print)
> #[1] "col1 : 2 6 7 10 15 16 17 23 24"
> #[1] "col2 : b c d e g h j"
> #[1] "col3 : 1 2 3 4 5"
> #[1] "col1 : 2 6 7 10 15 16 17 23 24" "col2 : b c d e g h j"
> #[3] "col3 : 1 2 3 4 5"
> 
> A.K.
> 
> 
> 
> 
> ----- Original Message -----
> From: "Lopez, Dan" <lopez235 at llnl.gov>
> To: "R help (r-help at r-project.org)" <r-help at r-project.org>
> Cc:
> Sent: Tuesday, October 16, 2012 11:19 AM
> Subject: [R] List of Levels for all Factor variables
> 
> Hi,
> 
> I want to get a clean succinct list of all levels for all my factor 
> variables.
> 
> I have a dataframe that's something like #1 below. This is just an 
> example subset of my data and my actual dataset has 70 variables. I 
> know how to narrow down my list of variables to just my factor 
> variables by using #2 below (thanks to Bert Gunter). I can also get a 
> list of all levels for all my factor variables using #3 below. But 
> what I want to find out is if there is a way to get this list in a 
> similar fashion to what the str function returns: without all the 
> extra spacing and carriage returns. That's what I mean by "clean 
> succinct list".
> 
> BTW I also tried playing around with several of the parameters for the 
> str function itself but could not find a way to accomplish what I want 
> to accomplish.
> 
> 
> 
> 1.       DATAFRAME
> 
> > str(mydata)
> 'data.frame':  11868 obs. of  26 variables:
> $ EMPLID          : int  431108 32709 19730 10850 48786 2004 237628 558 3423 743175 ...
> $ NAME            : Factor w/ 6402 levels "Aaron Cathy E",..: 2777 242 161 104 336 4254 1595 1244 3669 4760 ...
> $ TRAIN           : int  1 1 1 1 1 1 1 1 1 1 ...
> $ TARGET          : int  0 0 0 0 0 0 0 0 0 0 ...
> $ APPT_TYP_CD_LL  : Factor w/ 3 levels "FX","IN","IP": 2 2 2 2 2 2 2 2 2 2 ...
> $ ORG_NAM_LL      : Factor w/ 18 levels "Business","Chief Financial Officer",..: 11 7 7 9 4 4 18 18 8 4 ...
> $ NEW_DISCIPLINE  : Factor w/ 15 levels "100s","300s",..: 14 6 4 1 11 11 14 2 1 1 ...
> $ SERIES          : Factor w/ 10 levels "100s","300s",..: 9 6 4 1 9 9 9 2 1 1 ...
> $ AGE             : int  62 53 46 62 55 59 50 36 34 53 ...
> $ SERVICE         : int  13 29 16 26 18 9 19 11 8 26 ...
> $ AGE_SERVICE     : int  75 82 62 87 73 69 69 47 42 79 ...
> $ HIEDUCLV        : Factor w/ 6 levels "Associate","Bachelor",..: 5 6 6 6 5 2 3 2 2 1 ...
> $ GENDER          : Factor w/ 2 levels "F","M": 2 2 2 1 2 2 2 2 2 1 ...
> $ RETCD           : Factor w/ 2 levels "TCP1","TCP2": 2 1 2 2 2 1 1 2 1 2 ...
> $ FLSASTATUS      : Factor w/ 2 levels "E","N": 1 2 2 1 1 1 1 1 1 1 ...
> $ MONTHLY_RT      : int  17640 6932 5845 9809 11473 8719 19190 8986 7231 6758 ...
> $ RETSTATUSDERIVED: Factor w/ 4 levels "401K","DOUBLE DIPPERS",..: 2 4 3 2 3 4 4 3 4 3 ...
> $ ETHNIC_GRP_CD   : Factor w/ 8 levels "AMIND","ASIAN",..: 8 8 8 8 8 8 8 8 8 8 ...
> $ COMMUTE_BIN     : Factor w/ 7 levels "","<15","15 - 24",..: 5 7 2 2 4 3 3 6 3 2 ...
> $ EEO_CLASS       : Factor w/ 4 levels "M","S1","S2",..: 1 2 4 4 4 4 1 2 4 2 ...
> $ WRK_SCHED       : Factor w/ 6 levels "12HR","4/10s",..: 3 3 3 3 3 3 3 3 4 4 ...
> $ FWT_MAR_STATUS  : Factor w/ 2 levels "M","S": 1 1 1 1 2 1 1 1 1 2 ...
> $ COVERED_DP      : int  2 2 4 0 1 3 1 2 0 0 ...
> $ YRS_IN_SERIES   : int  13 29 16 26 18 9 19 3 7 26 ...
> $ SAVINGS_PCT     : int  10 0 6 19 8 0 10 15 15 18 ...
> $ Generation      : Factor w/ 4 levels "Baby Boomers",..: 1 1 2 1 1 1 1 2 2 1 ...
> 
> 2. Create mydataF to only include factor variables (and exclude NAME 
> which I am not interested in)
> 
> > mydataF<-mydata[,sapply(mydata,function(x)is.factor(x))][,-1]
> 
> 3. Get a list of all levels
> 
> > sapply(mydataF,function(x)levels(x))
> 
> $APPT_TYP_CD_LL
> 
> [1] "FX" "IN" "IP"
> 
> 
> 
> $ORG_NAM_LL
> 
> [1] "Business"                        "Chief Financial Officer"
> "Chief Information Office"        "Computation"
> "Engineering"                     "ESH and Quality"
> 
> [7] "Facilities and Infrastructure"   "Global Security"
> "NIF"          "NO"              "Office of the Director"
> "Operations and Business Office"
> 
> [13] "Physical and Life Sciences"      "Planning and Financial 
> Services" "ST"   "Security Organization"           "Strategic Human 
> Resources Mgmt"  "WCI"
> 
> 
> 
> $NEW_DISCIPLINE
> 
> [1] "100s"                       "300s"                       "400s"
>                    "500s"                       "600s"
>      "800s"                       "900s"
> 
> [8] "Chem  Science"              "Engineering"                "Life 
> Sciences"              "Math  Computer Science  IT" "Physics"
>           "pre100s"                    "PSTS Other"
> 
> [15] "Re"
> 
> 
> 
> $SERIES   ......
> 
> Daniel Lopez
> Workforce Analyst
> HRIM - Workforce Analytics & Metrics
> 
> 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 