[R] subsetting a data.frame based on a specific group of columns
Boris Steipe
boris.steipe at utoronto.ca
Fri Nov 6 16:45:18 CET 2015
Please learn to use dput() to post example data.
# This is your data:
data <- structure(c(1232, 0, 43, 357, 71, 919, 23, 9, 1111, 0, 811, 0,
9871, 795, 76, 72, 743, 14), .Dim = c(3L, 6L), .Dimnames = list(
NULL, c("X1", "X2", "X3", "Y1", "Y2", "Y3")))
data
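For reference, a structure(...) call like the one above is what dput() prints for an object. A minimal sketch, assuming the matrix is rebuilt by hand under the hypothetical name m; the round trip at the end is only there to show that the pasted text recreates the same object:

```r
# Hypothetical example: rebuild the matrix by hand, then print it with dput()
m <- matrix(c(1232, 0, 43, 357, 71, 919, 23, 9, 1111,
              0, 811, 0, 9871, 795, 76, 72, 743, 14),
            nrow = 3,
            dimnames = list(NULL, c("X1", "X2", "X3", "Y1", "Y2", "Y3")))
dput(m)   # prints a structure(...) call that can be pasted into an email

# Round trip: evaluating the deparsed text gives the object back
m2 <- eval(parse(text = deparse(m)))
identical(m, m2)
```

Posting the dput() output lets readers recreate the object exactly, including its type and dimension names, which a pasted table does not guarantee.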
# define groups and threshold explicitly
groupA <- c(1, 2, 3)
groupB <- c(4, 5, 6)
thrsh <- 100
# Here's how you evaluate your condition on the member elements of your group
# (a group is kept if its row sum is at least the threshold)
rowSums(data[ , groupA]) >= thrsh
# note that you can convert a logical TRUE/FALSE into a numeric 1/0
as.numeric(rowSums(data[ , groupA]) >= thrsh)
# ... which you can multiply with your data (*)
data[ , groupA] * as.numeric(rowSums(data[ , groupA]) >= thrsh)
# now you could write this into your matrix
data[ , groupA] <- data[ , groupA] * as.numeric(rowSums(data[ , groupA]) >= thrsh)
# data[ , groupB] etc ...
data
# ... but you would be repeating code, therefore better to write this
# as a function:
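# m: numeric matrix, g: column indices of one group, t: threshold
clearReadsBelowThreshold <- function(m, g, t) {
    # zero the group's columns in every row whose group sum is below t
    m[ , g] <- m[ , g] * as.numeric(rowSums(m[ , g]) >= t)
    return(m)
}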
data <- clearReadsBelowThreshold(data, groupA, thrsh)
data <- clearReadsBelowThreshold(data, groupB, thrsh)
data
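With many groups, the repeated calls can be replaced by a loop over a list of column indices. A minimal sketch, not part of the original answer: the helper name zeroLowGroups is hypothetical, and it assumes that each column name is a condition name followed by a replicate number, so that stripping the trailing digits yields the group.

```r
# Example matrix from this thread
m <- structure(c(1232, 0, 43, 357, 71, 919, 23, 9, 1111, 0, 811, 0,
                 9871, 795, 76, 72, 743, 14), .Dim = c(3L, 6L),
               .Dimnames = list(NULL, c("X1", "X2", "X3", "Y1", "Y2", "Y3")))

# Hypothetical helper: zero every group whose row sum is below `threshold`
zeroLowGroups <- function(m, groups, threshold) {
  for (g in groups) {
    # drop = FALSE keeps a one-column group as a matrix so rowSums() still works
    keep <- rowSums(m[ , g, drop = FALSE]) >= threshold
    m[ , g] <- m[ , g] * as.numeric(keep)
  }
  m
}

# Assumption: stripping trailing digits from a column name gives its group
groups <- split(seq_len(ncol(m)), sub("[0-9]+$", "", colnames(m)))
res <- zeroLowGroups(m, groups, 100)
res
```

If the column names follow a different scheme, the groups list can simply be written out by hand, as with groupA and groupB.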
(*) Note that R would do this conversion implicitly but omitting
the conversion will cause confusion for those who read the code
later.
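To illustrate the footnote, here is a minimal sketch with made-up values (m and keep are illustration names only). It shows that the implicit and explicit conversions give the same values, and that the multiplication zeroes whole rows because R recycles the per-row vector down each column of the matrix:

```r
m <- matrix(1:6, nrow = 3)          # columns: 1 2 3 and 4 5 6
keep <- c(TRUE, FALSE, TRUE)        # one flag per row

explicit <- m * as.numeric(keep)    # conversion spelled out
implicit <- m * keep                # R coerces TRUE/FALSE to 1/0 itself

# Both forms give the same values; row 2 is zeroed in every column
all(explicit == implicit)
explicit
```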
Cheers,
Boris
On Nov 6, 2015, at 8:53 AM, Assa Yeroslaviz <frymor at gmail.com> wrote:
> sorry, for the misunderstanding. here is a more elaborate description of
> what i would like to achieve.
>
> I have a data set of counts from an RNA-Seq experiment and would like to
> filter reads with low counts. I don't want to set everything to 0
> automatically.
>
> I would like to set each categorical group (e.g. condition) to 0, if and
> only if all replicates in the group together have fewer than 100 reads.
> In my examples I used X and Y to represent the categories. Usually they
> have more distinct names like "control", "knockout1", "dKo" etc.
>
> So what I would really like to do is to check if the sum of all the "control"
> samples is lower than 100. If so, set all control samples to 0. This I would
> like to check *for each category* in every row of the data set.
>
> I hope it is more clear now
>
> thanks
> Assa
>
>
> On Fri, Nov 6, 2015 at 2:29 PM, jim holtman <jholtman at gmail.com> wrote:
>
>> Is this what you want:
>>
>>> x <- read.table(text = "X1 X2 X3 Y1 Y2 Y3
>> + 1232 357 23 0 9871 72
>> + 0 71 9 811 795 743
>> + 43 919 1111 0 76 14", header = TRUE)
>>> x
>> X1 X2 X3 Y1 Y2 Y3
>> 1 1232 357 23 0 9871 72
>> 2 0 71 9 811 795 743
>> 3 43 919 1111 0 76 14
>>>
>>> # create indices of columns that start with the same character
>>> indx <- split(seq(ncol(x)), substring(colnames(x), 1, 1))
>>> names(indx) <- NULL # remove names so output not messed up
>>>
>>> result <- lapply(indx, function(a){
>> + row_sum <- rowSums(x[, a])
>> + x[row_sum < 100, a] <- 0
>> + x[, a]
>> + })
>>> # combine back together
>>> do.call(cbind, result)
>> X1 X2 X3 Y1 Y2 Y3
>> 1 1232 357 23 0 9871 72
>> 2 0 0 0 811 795 743
>> 3 43 919 1111 0 0 0
>>
>>
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>> On Fri, Nov 6, 2015 at 5:40 AM, Assa Yeroslaviz <frymor at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a data frame with multiple columns, which belong to several
>>> groups,
>>> like this:
>>> X1 X2 X3 Y1 Y2 Y3
>>> 1232 357 23 0 9871 72
>>> 0 71 9 811 795 743
>>> 43 919 1111 0 76 14
>>>
>>> I would like to filter out such rows, where the sum in one group is lower
>>> than a specific value. For example, I would like to set all the values in a
>>> group of columns to zero, if the sum in that group is less than 100.
>>> In my example table I would like to set the values in the second row for
>>> the three X-columns to 0, so that the table looks like this:
>>>
>>> X1 X2 X3 Y1 Y2 Y3
>>> 1232 357 23 0 9871 72
>>> 0 0 0 811 795 743
>>> 43 919 1111 0 0 0
>>>
>>> The same applies to the Y-values in the last row.
>>> Is there a more efficient way of doing it than going row by row and use
>>> the
>>> apply function on each of the subgroups I have in the columns?
>>>
>>> thanks
>>> Assa
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>