[R] Trim trailing space from data.frame factor variables

Marc Schwartz marc_schwartz at comcast.net
Thu Aug 16 19:06:00 CEST 2007


On Thu, 2007-08-16 at 17:52 +0100, Prof Brian Ripley wrote:
> On Thu, 16 Aug 2007, Marc Schwartz wrote:
> 
> > The easiest way might be to modify the lapply() call as follows:
> >
> > d[] <- lapply(d, function(x) if (is.factor(x)) factor(sub(" +$", "", x)) else x)
> >
> >> str(d)
> > 'data.frame':   60 obs. of  3 variables:
> > $ x: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
> > $ y: num  7.01 8.33 5.48 6.51 5.61 ...
> > $ f: Factor w/ 3 levels "lev1","lev2",..: 1 1 1 1 1 1 1 1 1 1 ...
> >
> >
> > This way the coercion back to a factor takes place within the loop as
> > needed.
> >
> > Note that I also meant to type sub() and not grep() below. The default
> > behavior for both is to return a character vector (if 'value = TRUE' in
> > grep()). There is not an argument to override that behavior.
> 
> I would have thought the thing to do was to apply sub() to the levels:
> 
> chfactor <- function(x) { levels(x) <- sub(" +$", "", levels(x)); x }
> 
> d[] <- lapply(d, function(x) if (is.factor(x)) chfactor(x) else x)
> 
> This has the advantage of not losing the order of the levels.  It will 
> merge levels if they only differ in the number of trailing spaces, which 
> is probably what you want.

Quite true. 
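
To illustrate the difference between the two approaches quoted above, here is a minimal sketch. The factors f and g are made-up examples (not the poster's data); chfactor() is the function from the quoted message.

```r
# Hypothetical factor with a deliberate, non-alphabetical level order
# and trailing spaces on some levels (illustrative data only)
f <- factor(c("low ", "high  ", "mid", "low "),
            levels = c("low ", "mid", "high  "))

# Approach 1: rebuild the factor from the trimmed character values
f1 <- factor(sub(" +$", "", f))

# Approach 2: trim the levels in place (chfactor() as quoted above)
chfactor <- function(x) { levels(x) <- sub(" +$", "", levels(x)); x }
f2 <- chfactor(f)

levels(f1)  # levels are re-sorted by factor()
levels(f2)  # original level order is kept

# Levels that differ only in trailing spaces are merged
g <- factor(c("a", "a ", "b"))
nlevels(g)            # before trimming
nlevels(chfactor(g))  # after trimming
```

Both approaches give the same values; only the level order differs, which matters if the order was set deliberately (for plotting or contrasts, say).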

Also noted in that prior thread, as I recall, is the 'strip.white'
argument to read.table() et al., which would obviate the need for
post-import trimming if it makes sense in this application for Lauri.
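
A small sketch of the 'strip.white' route, using made-up inline CSV text rather than Lauri's actual file. As I read ?read.table, 'strip.white' is only used when 'sep' has been specified (as it is in read.csv()) and only affects unquoted character fields, so it may not apply to every input format.

```r
# Illustrative CSV text with trailing spaces in the 'f' column
txt <- "x,f\n1,lev1  \n2,lev2 \n3,lev1\n"

# Default import: trailing spaces are kept, giving spurious extra levels
d_raw <- read.csv(text = txt, stringsAsFactors = TRUE)

# strip.white = TRUE trims leading/trailing white space at import time
d_trim <- read.csv(text = txt, stringsAsFactors = TRUE, strip.white = TRUE)

nlevels(d_raw$f)   # "lev1" and "lev1  " are counted separately
levels(d_trim$f)   # trimmed levels only
```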

Thanks,

Marc


