[R] For help in R coding

Dennis Murphy djmuser at gmail.com
Sat Jul 2 21:22:30 CEST 2011


Hi:

The problem seems to arise when the string ends in , or . : strsplit()
does not return an empty final piece after a trailing split character,
so counting the pieces undercounts those characters. Here is an
alternative that splits each string into individual characters and
uses charmatch() instead:

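charsum <- function(s, char) {
    # split the string into single characters
    u <- strsplit(s, "")
    # charmatch() gives 1 for a matching character and NA otherwise,
    # so the sum (ignoring NAs) is the number of occurrences of char
    sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
}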

unname(sapply(txtvec, function(x) charsum(x, ',')))
unname(sapply(txtvec, function(x) charsum(x, '.')))
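
To see the trailing-delimiter behaviour directly, here is a small sketch.
The two strings are made up for illustration (they are not from the pileup
data), and the variable names are my own. It only shows that "pieces minus
one" comes up short when the string ends with the split character, while
counting matched characters does not:

```r
# Illustrative strings only; not taken from the pileup data.
x_mid <- "a,b"    # one comma, in the middle
x_end <- "a,b,"   # two commas, one of them trailing

# fixed = TRUE treats "," as a literal character rather than a regex.
n_mid <- length(strsplit(x_mid, ",", fixed = TRUE)[[1]])
n_end <- length(strsplit(x_end, ",", fixed = TRUE)[[1]])

# Both calls give the same number of pieces, because strsplit() does not
# add an empty final piece after a trailing match.
n_mid
n_end

# Counting the matching characters themselves is not affected.
n_commas <- sum(strsplit(x_end, "")[[1]] == ",")
n_commas
```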

Putting this into a data frame,

dfout <- data.frame(
    periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
    commas  = unname(sapply(txtvec, function(x) charsum(x, ','))))
txtvec
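
Another way to get the same kind of counts, offered only as a sketch:
gregexpr() with fixed = TRUE finds every literal occurrence of a character,
so nothing needs to be escaped and a match at the end of the string is
counted like any other. count_char() is a name I made up for this example,
and lengths() needs a reasonably recent version of R. The two rows are
copied by hand from the data above:

```r
# count_char() is a hypothetical helper, not a base R function.
count_char <- function(s, char) {
  # gregexpr() locates all literal matches of char in each element of s;
  # regmatches() extracts them, and lengths() counts them per element.
  lengths(regmatches(s, gregexpr(char, s, fixed = TRUE)))
}

# Two rows typed in from the example data.
rows <- c(".g,ggg.^!,", "a,g,,t,")

counts <- data.frame(
  periods = count_char(rows, "."),
  commas  = count_char(rows, ","),
  G       = count_char(rows, "g")
)
counts
```

Because the function is vectorised over s, it can be applied to the whole
column at once, without sapply().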

HTH,
Dennis

On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>
>>
>>
>>>> Dear all,
>>>>
>>>> I am doing a project on variant calling using R. I am working on a
>>>> pileup file. There are 10 columns in my data frame and I want to
>>>> count the number of A, C, G and T in each row for column 9. An
>>>> example of column 9 is given below:
>>>>
>>>>         .a,g,,
>>>>         .t,t,,
>>>>         .,c,c,
>>>>         .,a,,,
>>>>         .,t,t,t
>>>>         .c,,g,^!.
>>>>         .g,ggg.^!,
>>>>         .$,,,,,.,
>>>>         a,g,,t,
>>>>         ,,,,,.,^!.
>>>>         ,$,,,,.,.
>>>>
>>>> This is a bit confusing for me as these characters are in one column
>>>> and how can we scan them for each row to print number of A,C,G and T
>>>> for each row.
>>>
>>> Seems a bit clunky but this does the job (first the data):
>>>
>>> > txt <- " .a,g,,
>>> +            .t,t,,
>>> +            .,c,c,
>>> +            .,a,,,
>>> +            .,t,t,t
>>> +            .c,,g,^!.
>>> +            .g,ggg.^!,
>>> +            .$,,,,,.,
>>> +            a,g,,t,
>>> +            ,,,,,.,^!.
>>> +            ,$,,,,.,."
>>> > txtvec <- readLines(textConnection(txt))
>>>
>>> Now the clunky solution. Basically, it subtracts 1 from the counts of
>>> "fragments" that result from splitting on each letter in turn. It
>>> could be made prettier with a function that did the job.
>>>
>>> > data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>> +       split="a"), length) , "-", 1)),
>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>>> +       length) , "-", 1)),
>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>>> +       length) , "-", 1)),
>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>>> +       length) , "-", 1)) )
>>>                     A C G T
>>> .a,g,,               1 0 1 0
>>>          .t,t,,     0 0 0 2
>>>          .,c,c,     0 2 0 0
>>>          .,a,,,     1 0 0 0
>>>          .,t,t,t    0 0 0 2
>>>          .c,,g,^!.  0 1 1 0
>>>          .g,ggg.^!, 0 0 4 0
>>>          .$,,,,,.,  0 0 0 0
>>>          a,g,,t,    1 0 1 1
>>>          ,,,,,.,^!. 0 0 0 0
>>>          ,$,,,,.,.  0 0 0 0
>>>
>>> Has the advantage that the input data ends up as rownames, which was a
>>> surprise.
>>>
>>> If you wanted to count "A" and "a" as equivalent, then the split
>>> argument should be "a|A"
>>>
>>>
>>
>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT LIKE
>>>> THIS.
>>
>> BUT CAN I COUNT . AND , ALSO USING-
>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>> split=".|,"), length) , "-", 1)),
>>
>> I TRIED IT BUT IT'S NOT WORKING. IT IS GIVING THE OUTPUT BUT AT SOME
>> PLACES IT IS SHOWING MORE NUMBER OF . AND , AND SOMEWHERE IT IS NOT EVEN
>> CALCULATING AND JUST SHOWING 0.
>
> You need to use valid regular expressions for 'split'. Since "." and "," are
> special characters they need to be escaped when you want the literals to be
> recognized as such.
>
> I haven't figured out why but you need to drop the final operation of
> subtracting 1 from the values when counting commas:
>
> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
>                             split="\\."), length) , "-", 1))
>  ,commas = unlist( lapply( sapply(txtvec, strsplit,
>                             split="\\,"), length) ) )
>                       periods commas
>  .a,g,,                      1      3
>            .t,t,,           1      3
>            .,c,c,           1      3
>            .,a,,,           1      4
>            .,t,t,t          1      4
>            .c,,g,^!.        1      4
>            .g,ggg.^!,       2      2
>            .$,,,,,.,        2      6
>            a,g,,t,          0      4
>            ,,,,,.,^!.       1      7
>            ,$,,,,.,.        1      7
>
> --
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


