[R] For help in R coding

Sat Jul 2 20:57:54 CEST 2011

Thanking you,
Warm Regards
Vikas Bansal
Msc Bioinformatics
Kings College London
________________________________________
From: David Winsemius [dwinsemius at comcast.net]
Sent: Saturday, July 02, 2011 6:19 PM
To: Bansal, Vikas
Cc: r-help at r-project.org
Subject: Re: [R] For help in R coding

On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:

>
>
>>> Dear all,
>>>
>>> I am doing a project on variant calling using R. I am working on a
>>> pileup file. There are 10 columns in my data frame, and I want to
>>> count the number of A, C, G and T in each row of column 9. An example
>>> of column 9 is given below:
>>>
>>>          .a,g,,
>>>          .t,t,,
>>>          .,c,c,
>>>          .,a,,,
>>>          .,t,t,t
>>>          .c,,g,^!.
>>>          .g,ggg.^!,
>>>          .$,,,,,.,
>>>          a,g,,t,
>>>          ,,,,,.,^!.
>>>          ,$,,,,.,.
>>>
>>> This is a bit confusing for me because all of these characters are in
>>> one column. How can I scan each row and print the number of A, C, G
>>> and T for that row?
>>
>> Seems a bit clunky but this does the job (first the data):
>>> txt <- " .a,g,,
>> +            .t,t,,
>> +            .,c,c,
>> +            .,a,,,
>> +            .,t,t,t
>> +            .c,,g,^!.
>> +            .g,ggg.^!,
>> +            .$,,,,,.,
>> +            a,g,,t,
>> +            ,,,,,.,^!.
>> +            ,$,,,,.,."
>>
>>> txtvec <- readLines(textConnection(txt))
>>
>> Now the clunky solution. Basically, it subtracts 1 from the count of
>> "fragments" that result from splitting on each letter in turn. It could
>> be made prettier with a function that did the job.
>>
>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>> split="a"), length) , "-", 1)),
>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>> length) , "-", 1)),
>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>> length) , "-", 1)),
>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>> length) , "-", 1)) )
>>                      A C G T
>> .a,g,,               1 0 1 0
>>           .t,t,,     0 0 0 2
>>           .,c,c,     0 2 0 0
>>           .,a,,,     1 0 0 0
>>           .,t,t,t    0 0 0 2
>>           .c,,g,^!.  0 1 1 0
>>           .g,ggg.^!, 0 0 4 0
>>           .$,,,,,.,  0 0 0 0
>>           a,g,,t,    1 0 1 1
>>           ,,,,,.,^!. 0 0 0 0
>>           ,$,,,,.,.  0 0 0 0
>>
>> It has the advantage that the input data ends up as rownames, which
>> was a surprise.
>>
>> If you wanted to count "A" and "a" as equivalent, then the split
>> argument should be "a|A".
>>
>>
>
> As you mentioned, if I want to count both A and a I should split
> like this. But can I also count . and , using:
> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
> split=".|,"), length) , "-", 1)),
>
> I tried it, but it is not working. It gives output, but in some
> places it shows too many . and , and in others it does not count
> anything and just shows 0.

You need to use a valid regex for 'split'. Since "." is a special
character (it matches any single character), it needs to be escaped
when you want the literal to be recognized as such. A comma is not
special, so escaping it is optional.
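
A short sketch of the escaping point, using a made-up string x rather
than the poster's data. In a regex an unescaped "." matches any
character, so a pattern such as ".|," treats every character as a
separator. Escaping the period, putting it in a bracket class, or
passing fixed = TRUE all make it literal.

```r
x <- "a.b,c"   # hypothetical example string, not from the pileup data

any_char <- strsplit(x, ".")[[1]]                # every character is a separator
escaped  <- strsplit(x, "\\.")[[1]]              # splits on the literal period only
literal  <- strsplit(x, ".", fixed = TRUE)[[1]]  # same result, no regex rules
either   <- strsplit(x, "[.,]")[[1]]             # literal period or comma

print(any_char)
print(escaped)
print(literal)
print(either)
```

With the unescaped pattern only empty fragments come back, which is
probably why the counts looked arbitrary.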

I haven't figured out why but you need to drop the final operation of
subtracting 1 from the values when counting commas:

data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
                              split="\\."), length) , "-", 1))
  ,commas = unlist( lapply( sapply(txtvec, strsplit,
                              split="\\,"), length) ) )
                        periods commas
  .a,g,,                      1      3
             .t,t,,           1      3
             .,c,c,           1      3
             .,a,,,           1      4
             .,t,t,t          1      4
             .c,,g,^!.        1      4
             .g,ggg.^!,       2      2
             .$,,,,,.,        2      6
             a,g,,t,          0      4
             ,,,,,.,^!.       1      7
             ,$,,,,.,.        1      7
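
A likely explanation for the inconsistent counts, going by the
strsplit help page: a match at the very end of the string does not
produce a trailing empty fragment. So the number of fragments equals
the number of matches when the string ends with the separator, and
matches plus one otherwise. A sketch on two rows from the data above:

```r
# Two rows taken from the example data.
row_t <- ".,t,t,t"       # ends with the separator "t"
row_c <- ",,,,,.,^!."    # does not end with the separator ","

frags_t <- strsplit(row_t, "t")[[1]]
frags_c <- strsplit(row_c, ",", fixed = TRUE)[[1]]

print(length(frags_t))   # 3 fragments for 3 t's: subtracting 1 undercounts
print(length(frags_c))   # 7 fragments for 6 commas: not subtracting overcounts
```

Neither "length minus 1" nor plain "length" is right for every row,
which matches the T = 2 shown for ".,t,t,t" and the 7 commas shown for
",,,,,.,^!." in the outputs above.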

--

David Winsemius, MD
West Hartford, CT

Some of the values are coming out incorrect. I do not know why, but if you look at your output, some of the comma counts are 7 when there are actually 6. The same problem occurs with the letters when I use this:

data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,  
split="a|A"), length) , "-", 1)),C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c|C"),  
length) , "-", 1)),G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g|G"),  
length) , "-", 1)),T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t|T"),  
length) , "-", 1)) )

I do not know why this code is not calculating the exact numbers. Can you please check it?
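
One way to get exact counts is to count the matches themselves rather
than the fragments left after splitting. The helper below is only a
sketch, not something posted in this thread: count_matches is a
hypothetical name, and it relies on gregexpr() and regmatches() from
base R. The data are retyped without the leading spaces.

```r
txtvec <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t",
            ".c,,g,^!.", ".g,ggg.^!,", ".$,,,,,.,", "a,g,,t,",
            ",,,,,.,^!.", ",$,,,,.,.")

# Hypothetical helper: number of matches of `pattern` in each element of `x`.
# Extra arguments (ignore.case, fixed) are passed on to gregexpr().
count_matches <- function(x, pattern, ...) {
  hits <- regmatches(x, gregexpr(pattern, x, ...))
  vapply(hits, length, integer(1))
}

counts <- data.frame(
  A       = count_matches(txtvec, "a", ignore.case = TRUE),
  C       = count_matches(txtvec, "c", ignore.case = TRUE),
  G       = count_matches(txtvec, "g", ignore.case = TRUE),
  T       = count_matches(txtvec, "t", ignore.case = TRUE),
  periods = count_matches(txtvec, ".", fixed = TRUE),
  commas  = count_matches(txtvec, ",", fixed = TRUE)
)
print(counts)
```

Because nothing is split, a match at the start or end of a row is
counted like any other, and ignore.case = TRUE covers both "a" and "A".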