[R] How to sum values across multiple variables using a wildcard?

Tue Feb 21 02:18:47 CET 2006

On Mon, 2006-02-20 at 18:41 -0600, mtb954 at gmail.com wrote:
> I have a dataframe called "data" with 5 records (in rows) each of
> which has been scored on each of many variables (in columns).
> 
> Five of the variables are named var1, var2, var3, var4, var5 using
> headers. The other variables are named using other conventions.
> 
> I can create a new variable called var6 with the value 15 for each
> record with this code:
> 
> > var6=var1+var2+var3+var4+var5
> 
> but this is tedious for my real dataset with dozens of variables. I
> would rather use a wildcard to add up all the variables that begin
> with "Var" like this pseudocode:
> 
> > Var6=sum(var*)
> 
> Any suggestions for implementing this in R? Thanks! Mark

Here is one approach using grep().

Given a data frame called MyDF with the following structure:

> str(MyDF)
`data.frame':	10 obs. of  20 variables:
 $ other4 : num  -0.869  0.376 -2.022  0.619 -0.129 ...
 $ var8   : num  -0.380  1.428 -1.075 -0.796 -0.588 ...
 $ var4   : num  -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
 $ other9 : num   0.0210 -0.6455  0.0289  1.2405 -1.3359 ...
 $ var10  : num   0.647 -0.798  0.180  1.135 -0.258 ...
 $ other2 : num   0.1332 -0.2227  0.0423  0.6881  2.0304 ...
 $ other10: num  0.811 2.166 0.569 0.302 0.669 ...
 $ var1   : num  -0.774 -1.812 -1.230 -0.969  0.245 ...
 $ var2   : num  -0.0538  0.3712  0.8222 -0.8025 -0.6914 ...
 $ other6 : num  0.871 0.291 2.079 1.098 1.025 ...
 $ other1 : num  -0.5130  0.1358  0.8744  0.0997  1.7458 ...
 $ var9   : num   0.664 -0.456  0.415  2.090 -0.283 ...
 $ other3 : num  -0.425 -0.283  0.706 -1.879 -0.828 ...
 $ other7 : num   0.100  0.177  0.570 -0.631 -1.009 ...
 $ var3   : num   1.446 -0.862  0.184  1.077  0.146 ...
 $ var5   : num   0.402 -0.498 -0.906  0.641  1.690 ...
 $ var6   : num   0.892 -0.242  0.561  0.530 -0.291 ...
 $ other5 : num  -1.210  0.815 -1.284 -0.152  0.329 ...
 $ other8 : num  -0.265 -1.278  1.152  0.232 -1.189 ...
 $ var7   : num  -0.616 -0.994 -0.263  1.626 -1.372 ...

Note that the column names are either var* or other*.

Using grep() we get the indices of the column names that contain "other"
plus one or more following characters, where the "other" begins the
word:

> grep("\\bother.", names(MyDF))
 [1]  1  4  6  7 10 11 13 14 18 19

See ?regexp for more information. Note that I use "\\b" to being the
search at the starting word boundary and then "." to require that there
be following characters. Thus, this would not match " other1" or
" other".

You can then use the following to subset the data frame MyDF and sum the
rows for the requested columns:

> rowSums(MyDF[, grep("\\bother.", names(MyDF))])
        1         2         3         4         5         6         7 
-1.344893  1.531417  2.715234  1.616971  1.307379  4.655568  4.638446 
        8         9        10 
-2.640485 -2.226270 -2.158248 

You could use grep("other", names(MyDF)), but this would also get
"other" if it appears anywhere in the name. For example:

> grep("other", "Thisotherone")
[1] 1

It just depends upon your naming schema and how strict you need to be in
the search.

HTH,

Marc Schwartz