[R] For help in R coding

Sun Jul 3 19:07:21 CEST 2011

Yes, you are right. The unlist operation is unnecessary; I tried the code without it yesterday and it works. But I have one more problem, which I have worked on for a whole day without finding a solution. As I told you, I am new to R. How can I use an if condition in the following code?

df <- read.table("Case2.pileup", fill = TRUE, sep = "\t", colClasses = "character")
txtvec <- readLines(textConnection(df[, 9]))
dad <- data.frame(
  A = sapply(gregexpr("A|a", df[, 9]), function(x) if (x[[1]] != -1) length(x) else 0),
  C = sapply(gregexpr("C|c", df[, 9]), function(x) if (x[[1]] != -1) length(x) else 0),
  G = sapply(gregexpr("G|g", df[, 9]), function(x) if (x[[1]] != -1) length(x) else 0),
  T = sapply(gregexpr("T|t", df[, 9]), function(x) if (x[[1]] != -1) length(x) else 0),
  N = sapply(gregexpr("\\,|\\.", df[, 9]), function(x) if (x[[1]] != -1) length(x) else 0))

Now my problem is that my data frame also has the letters A, C, G and T in column 3. The commas (,) and dots (.) in column 9 stand for the letter that is in column 3 of the same row. I want to use an if condition like this:

if column 3 of my data frame has A then
  A = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  A = sapply(gregexpr("A|a", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

if column 3 of my data frame has C then
  C = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  C = sapply(gregexpr("C|c", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

if column 3 of my data frame has G then
  G = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  G = sapply(gregexpr("G|g", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

if column 3 of my data frame has T then
  T = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  T = sapply(gregexpr("T|t", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

So I want code that gives output like this:

DATA FRAME (Input)

   col3    col9
    T      .a,g,,
    A      .t,t,,
    A      .,c,c,
    C      .,a,,,
    G      .,t,t,t
    A      .c,,g,^!.
    A      .g,ggg.^!,
    A      .$,,,,,.,
    C      a,g,,t,
    T      ,,,,,.,^!.
    T      ,$,,,,.,.

output

A    C    G    T
1    0    1    4
4    0    0    2
4    2    0    0
1    5    0    0
0    0    4    3

This is the output for first five rows.

Can you please show me how to use this if condition in your code, or whether it can be done with some other approach instead of an if condition?
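
One possible way to express the per-row condition is the vectorized ifelse() rather than if. The sketch below is only an illustration under stated assumptions: the vectors ref and bases are toy stand-ins for df[,3] and df[,9] (the first five rows of the example above), and count_matches is a hypothetical helper name wrapping the gregexpr() idiom already used in this thread.

```r
# Hypothetical helper: number of regex matches in each string
# (gregexpr() reports "no match" as -1, so that case is mapped to 0).
count_matches <- function(pattern, x) {
  sapply(gregexpr(pattern, x),
         function(m) if (m[[1]] != -1) length(m) else 0L)
}

# Toy stand-ins for df[, 3] and df[, 9] (assumed; first five example rows).
ref   <- c("T", "A", "A", "C", "G")
bases <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t")

# Dots and commas count toward the letter given in column 3.
n_ref <- count_matches("[.,]", bases)

# ifelse() picks, row by row, either the dot/comma count or the letter count.
dad <- data.frame(
  A = ifelse(ref == "A", n_ref, count_matches("[Aa]", bases)),
  C = ifelse(ref == "C", n_ref, count_matches("[Cc]", bases)),
  G = ifelse(ref == "G", n_ref, count_matches("[Gg]", bases)),
  T = ifelse(ref == "T", n_ref, count_matches("[Tt]", bases))
)
print(dad)
```

With the real data, ref and bases would be replaced by df[,3] and df[,9]. The choice of ifelse() over if reflects that if tests a single value, whereas here the test differs for every row.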

________________________________________
From: David Winsemius [dwinsemius at comcast.net]
Sent: Sunday, July 03, 2011 3:57 AM
To: Bansal, Vikas
Cc: Dennis Murphy; r-help at r-project.org
Subject: Re: [R] For help in R coding

On Jul 2, 2011, at 4:46 PM, Bansal, Vikas wrote:

> DEAR ALL,
> I TRIED THIS CODE AND THIS IS RUNNING PERFECTLY...
>
> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
> txt=df[,9]
> txtvec <- readLines(textConnection(txt))
> dad=data.frame(A = unlist(sapply(gregexpr("A|a", txtvec),
> function(x) if ( x[[1]] != -1) length(x) else 0 )),
> C = unlist(sapply(gregexpr("C|c", txtvec),
> function(x) if ( x[[1]] != -1) length(x) else 0 )),
> G = unlist(sapply(gregexpr("G|g", txtvec),
> function(x) if ( x[[1]] != -1) length(x) else 0 )),
> T = unlist(sapply(gregexpr("T|t", txtvec),
> function(x) if ( x[[1]] != -1) length(x) else 0 )),
> N = unlist(sapply(gregexpr("\\,|\\.", txtvec),
> function(x) if ( x[[1]] != -1) length(x) else 0 )))
>

The unlist operation is unnecessary since the sapply operation returns
a vector.  (It doesn't hurt, but it is unnecessary.)
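
A small check of that point, as a sketch only: the three strings in txtvec below are assumed sample values taken from the example column, not the full data.

```r
# Assumed sample of the column-9 strings.
txtvec <- c(".a,g,,", ".t,t,,", "a,g,,t,")

# sapply() simplifies the per-string results to a plain atomic vector ...
counts <- sapply(gregexpr("t|T", txtvec),
                 function(x) if (x[[1]] != -1) length(x) else 0)
print(counts)
print(is.atomic(counts))

# ... so wrapping the result in unlist() leaves it unchanged.
print(identical(counts, unlist(counts)))
```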
>
>
>
>
> Thanking you,
> Warm Regards
> Vikas Bansal
> Msc Bioinformatics
> Kings College London
> ________________________________________
> From: David Winsemius [dwinsemius at comcast.net]
> Sent: Saturday, July 02, 2011 9:04 PM
> To: Dennis Murphy
> Cc: r-help at r-project.org; Bansal, Vikas
> Subject: Re: [R] For help in R coding
>
> On reflection and a bit of testing I think the best approach would be
> to use gregexpr. For counting the number of commas, this appears quite
> straightforward.
>
>> sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1)
> length(x) else 0 )
>  [1] 3 3 3 4 3 3 2 6 4 6 6
>
> It easily generalizes to period and the `|` (or) operation on letters.
> (I did need to add the check, since the length of a gregexpr result is
> always at least one, but it has value -1 when there is no match.)
>
>> sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1)
> length(x) else 0 )
>  [1] 0 2 0 0 3 0 0 0 1 0 0
>
>
> On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
>
>> Hi:
>>
>> There seems to be a problem if the string ends in , or . , which
>> makes
>> it difficult for strsplit() to pick up if it is splitting on those
>> characters. Here is an alternative, splitting on individual
>> characters
>> and using charmatch() instead:
>>
>> charsum <- function(s, char) {
>>   u <- strsplit(s, "")
>>   sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
>>  }
>>
>> unname(sapply(txtvec, function(x) charsum(x, ',')))
>> unname(sapply(txtvec, function(x) charsum(x, '.')))
>>
>> Putting this into a data frame,
>>
>> dfout <- data.frame(periods = unname(sapply(txtvec, function(x)
>> charsum(x, '.'))),
>>                               commas = unname(sapply(txtvec,
>> function(x) charsum(x, ','))) )
>> txtvec
>>
>> HTH,
>> Dennis
>>
>> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>>
>>>>
>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> I am doing a project on variant calling using R.I am working on
>>>>>> pileup file.There are 10 columns in my data frame and I want to
>>>>>> count the number of A,C,G and T in each row for column 9.example
>>>>>> of
>>>>>> column 9 is given below-
>>>>>>
>>>>>>        .a,g,,
>>>>>>        .t,t,,
>>>>>>        .,c,c,
>>>>>>        .,a,,,
>>>>>>        .,t,t,t
>>>>>>        .c,,g,^!.
>>>>>>        .g,ggg.^!,
>>>>>>        .$,,,,,.,
>>>>>>        a,g,,t,
>>>>>>        ,,,,,.,^!.
>>>>>>        ,$,,,,.,.
>>>>>>
>>>>>> This is a bit confusing for me as these characters are in one
>>>>>> column
>>>>>> and how can we scan them for each row to print number of A,C,G
>>>>>> and T
>>>>>> for each row.
>>>>>
>>>>> Seems a bit clunky but this does the job (first the data):
>>>>>>
>>>>>> txt <- " .a,g,,
>>>>>
>>>>> +            .t,t,,
>>>>> +            .,c,c,
>>>>> +            .,a,,,
>>>>> +            .,t,t,t
>>>>> +            .c,,g,^!.
>>>>> +            .g,ggg.^!,
>>>>> +            .$,,,,,.,
>>>>> +            a,g,,t,
>>>>> +            ,,,,,.,^!.
>>>>> +            ,$,,,,.,."
>>>>>
>>>>>> txtvec <- readLines(textConnection(txt))
>>>>>
>>>>> Now the clunky solution, Basically subtracts 1 from the counts of
>>>>> "fragments" that result from splitting on each letter in turn.
>>>>> Could
>>>>> be made prettier with a function that did the job.
>>>>>
>>>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>>>
>>>>> split="a"), length) , "-", 1)),
>>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>>>>> length) , "-", 1)),
>>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>>>>> length) , "-", 1)),
>>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>>>>> length) , "-", 1)) )
>>>>>                    A C G T
>>>>> .a,g,,               1 0 1 0
>>>>>         .t,t,,     0 0 0 2
>>>>>         .,c,c,     0 2 0 0
>>>>>         .,a,,,     1 0 0 0
>>>>>         .,t,t,t    0 0 0 2
>>>>>         .c,,g,^!.  0 1 1 0
>>>>>         .g,ggg.^!, 0 0 4 0
>>>>>         .$,,,,,.,  0 0 0 0
>>>>>         a,g,,t,    1 0 1 1
>>>>>         ,,,,,.,^!. 0 0 0 0
>>>>>         ,$,,,,.,.  0 0 0 0
>>>>>
>>>>> Has the advantage that the input data ends up as rownames, which
>>>>> was a
>>>>> surprise.
>>>>>
>>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>>> argument should be "a|A"
>>>>>
>>>>>
>>>>
>>>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT
>>>>>> LIKE
>>>>>> THIS.
>>>>
>>>> BUT CAN I COUNT . AND , ALSO USING-
>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>> split=".|,"), length) , "-", 1)),
>>>>
>>>> I TRIED IT BUT ITS NOT WORKING.IT IS GIVING THE OUTPUT BUT AT SOME
>>>> PLACES
>>>> IT IS SHOWING MORE NUMBER OF . AND , AND SOMEWHERE IT IS NOT EVEN
>>>> CALCULATING AND JUST SHOWING 0.
>>>
>>> You need to use valid regex expressions for 'split'. Since "." and
>>> "," are
>>> special characters they need to be escaped when you want the
>>> literals to be
>>> recognized as such.
>>>
>>> I haven't figured out why but you need to drop the final operation
>>> of
>>> subtracting 1 from the values when counting commas:
>>>
>>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>                            split="\\."), length) , "-", 1))
>>> ,commas = unlist( lapply( sapply(txtvec, strsplit,
>>>                            split="\\,"), length) ) )
>>>                      periods commas
>>> .a,g,,                      1      3
>>>           .t,t,,           1      3
>>>           .,c,c,           1      3
>>>           .,a,,,           1      4
>>>           .,t,t,t          1      4
>>>           .c,,g,^!.        1      4
>>>           .g,ggg.^!,       2      2
>>>           .$,,,,,.,        2      6
>>>           a,g,,t,          0      4
>>>           ,,,,,.,^!.       1      7
>>>           ,$,,,,.,.        1      7
>>>
>>> --
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>
> David Winsemius, MD
> West Hartford, CT
>

David Winsemius, MD
West Hartford, CT