[R] How to sum values across multiple variables using a wildcard?
Marc Schwartz
MSchwartz at mn.rr.com
Tue Feb 21 02:18:47 CET 2006
On Mon, 2006-02-20 at 18:41 -0600, mtb954 at gmail.com wrote:
> I have a dataframe called "data" with 5 records (in rows) each of
> which has been scored on each of many variables (in columns).
>
> Five of the variables are named var1, var2, var3, var4, var5 using
> headers. The other variables are named using other conventions.
>
> I can create a new variable called var6 with the value 15 for each
> record with this code:
>
> > var6=var1+var2+var3+var4+var5
>
> but this is tedious for my real dataset with dozens of variables. I
> would rather use a wildcard to add up all the variables that begin
> with "Var" like this pseudocode:
>
> > Var6=sum(var*)
>
> Any suggestions for implementing this in R? Thanks! Mark
Here is one approach using grep().
Given a data frame called MyDF with the following structure:
> str(MyDF)
`data.frame': 10 obs. of 20 variables:
$ other4 : num -0.869 0.376 -2.022 0.619 -0.129 ...
$ var8 : num -0.380 1.428 -1.075 -0.796 -0.588 ...
$ var4 : num -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
$ other9 : num 0.0210 -0.6455 0.0289 1.2405 -1.3359 ...
$ var10 : num 0.647 -0.798 0.180 1.135 -0.258 ...
$ other2 : num 0.1332 -0.2227 0.0423 0.6881 2.0304 ...
$ other10: num 0.811 2.166 0.569 0.302 0.669 ...
$ var1 : num -0.774 -1.812 -1.230 -0.969 0.245 ...
$ var2 : num -0.0538 0.3712 0.8222 -0.8025 -0.6914 ...
$ other6 : num 0.871 0.291 2.079 1.098 1.025 ...
$ other1 : num -0.5130 0.1358 0.8744 0.0997 1.7458 ...
$ var9 : num 0.664 -0.456 0.415 2.090 -0.283 ...
$ other3 : num -0.425 -0.283 0.706 -1.879 -0.828 ...
$ other7 : num 0.100 0.177 0.570 -0.631 -1.009 ...
$ var3 : num 1.446 -0.862 0.184 1.077 0.146 ...
$ var5 : num 0.402 -0.498 -0.906 0.641 1.690 ...
$ var6 : num 0.892 -0.242 0.561 0.530 -0.291 ...
$ other5 : num -1.210 0.815 -1.284 -0.152 0.329 ...
$ other8 : num -0.265 -1.278 1.152 0.232 -1.189 ...
$ var7 : num -0.616 -0.994 -0.263 1.626 -1.372 ...
Note that the column names are either var* or other*.
Using grep() we get the indices of the column names that contain "other"
plus one or more following characters, where the "other" begins the
word:
> grep("\\bother.", names(MyDF))
[1] 1 4 6 7 10 11 13 14 18 19
See ?regexp for more information. Note that I use "\\b" to being the
search at the starting word boundary and then "." to require that there
be following characters. Thus, this would not match " other1" or
" other".
You can then use the following to subset the data frame MyDF and sum the
rows for the requested columns:
> rowSums(MyDF[, grep("\\bother.", names(MyDF))])
1 2 3 4 5 6 7
-1.344893 1.531417 2.715234 1.616971 1.307379 4.655568 4.638446
8 9 10
-2.640485 -2.226270 -2.158248
You could use grep("other", names(MyDF)), but this would also get
"other" if it appears anywhere in the name. For example:
> grep("other", "Thisotherone")
[1] 1
It just depends upon your naming schema and how strict you need to be in
the search.
HTH,
Marc Schwartz
More information about the R-help
mailing list