[R] For help in R coding
Bansal, Vikas
vikas.bansal at kcl.ac.uk
Sat Jul 2 22:21:31 CEST 2011
Hi,
This seems a little confusing, but I am using this code as you suggested:
df <- read.table("Case2.pileup", fill = TRUE, sep = "\t", colClasses = "character")
txt <- df[, 9]
txtvec <- readLines(textConnection(txt))
vik <- data.frame(
  A = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "a|A"), length), "-", 1)),
  C = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "c|C"), length), "-", 1)),
  G = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "g|G"), length), "-", 1)),
  T = unlist(lapply(lapply(sapply(txtvec, strsplit, split = "t|T"), length), "-", 1))
)
The thing is, at some positions it counts correctly but at others it does not. I have been looking for a solution in books and on the net, but I cannot find anything related to this. I think the code looks fine, but I must be missing something.
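A minimal sketch of where the miscounts likely come from, assuming the cause is the documented behaviour of strsplit() (a match at the end of a string produces no trailing empty piece). The helper name count_matches is hypothetical and not part of the thread; it counts regex matches with gregexpr() instead of counting pieces.

```r
# strsplit() drops a trailing empty string, so "pieces minus 1"
# undercounts whenever a line ends with the character being counted.
x <- c(".,t,t,t", ".t,t,,")

pieces <- lengths(strsplit(x, "t|T"))
pieces - 1   # 2 2, although the first line contains three t's

# Hypothetical helper: count regex matches directly, which does not
# depend on where in the string the matches fall.
count_matches <- function(s, pattern) {
  m <- gregexpr(pattern, s)
  # gregexpr() returns -1 when there is no match, so count positive positions
  vapply(m, function(hit) sum(hit > 0), integer(1))
}

count_matches(x, "t|T")   # 3 2
```

Lines that do not end in the counted letter give the same answer with both approaches, which would explain why only some positions look wrong.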
For your convenience I have attached my Case2.pileup file.
I am very grateful for your help and for your precious time.
Thanking you,
Warm Regards
Vikas Bansal
Msc Bioinformatics
Kings College London
________________________________________
From: Dennis Murphy [djmuser at gmail.com]
Sent: Saturday, July 02, 2011 8:22 PM
To: r-help at r-project.org
Cc: Bansal, Vikas; David Winsemius
Subject: Re: [R] For help in R coding
Hi:
There seems to be a problem if the string ends in , or . , which makes
it difficult for strsplit() to pick up if it is splitting on those
characters. Here is an alternative, splitting on individual characters
and using charmatch() instead:
charsum <- function(s, char) {
  u <- strsplit(s, "")
  sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
}
unname(sapply(txtvec, function(x) charsum(x, ',')))
unname(sapply(txtvec, function(x) charsum(x, '.')))
Putting this into a data frame,
dfout <- data.frame(periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
                    commas = unname(sapply(txtvec, function(x) charsum(x, ','))))
txtvec
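The same character-by-character idea can be extended to the letters as well. The following is only a sketch: the helper name count_symbols is hypothetical, the four sample strings are taken from the example data in the original question, and it assumes upper- and lower-case bases should be counted together (as with the "a|A" split).

```r
# Hypothetical helper: per-row counts of A, C, G, T, periods and commas.
count_symbols <- function(s) {
  # Upper-case first so that "a" and "A" are counted together,
  # then split every string into single characters.
  chars <- strsplit(toupper(s), "")
  one <- function(sym) vapply(chars, function(x) sum(x == sym), integer(1))
  data.frame(A = one("A"), C = one("C"), G = one("G"), T = one("T"),
             periods = one("."), commas = one(","),
             row.names = NULL)
}

txtvec <- c(".a,g,,", ".,t,t,t", ".g,ggg.^!,", ",$,,,,.,.")
count_symbols(txtvec)
```

Because each character is compared literally, no regex escaping is needed and strings that end in the counted character are handled the same way as any other.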
HTH,
Dennis
On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>
> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>
>>
>>
>>>> Dear all,
>>>>
>>>> I am doing a project on variant calling using R. I am working on
>>>> a pileup file. There are 10 columns in my data frame and I want to
>>>> count the number of A, C, G and T in each row for column 9. An example of
>>>> column 9 is given below:
>>>>
>>>> .a,g,,
>>>> .t,t,,
>>>> .,c,c,
>>>> .,a,,,
>>>> .,t,t,t
>>>> .c,,g,^!.
>>>> .g,ggg.^!,
>>>> .$,,,,,.,
>>>> a,g,,t,
>>>> ,,,,,.,^!.
>>>> ,$,,,,.,.
>>>>
>>>> This is a bit confusing for me as these characters are in one column
>>>> and how can we scan them for each row to print number of A,C,G and T
>>>> for each row.
>>>
>>> Seems a bit clunky but this does the job (first the data):
>>> > txt <- " .a,g,,
>>> + .t,t,,
>>> + .,c,c,
>>> + .,a,,,
>>> + .,t,t,t
>>> + .c,,g,^!.
>>> + .g,ggg.^!,
>>> + .$,,,,,.,
>>> + a,g,,t,
>>> + ,,,,,.,^!.
>>> + ,$,,,,.,."
>>>
>>> > txtvec <- readLines(textConnection(txt))
>>>
>>> Now the clunky solution. Basically, it subtracts 1 from the counts of
>>> "fragments" that result from splitting on each letter in turn. It could
>>> be made prettier with a function that did the job.
>>>
>>> > data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split="a"), length) , "-", 1)),
>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"), length) , "-", 1)),
>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"), length) , "-", 1)),
>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"), length) , "-", 1)) )
>>>            A C G T
>>> .a,g,,     1 0 1 0
>>> .t,t,,     0 0 0 2
>>> .,c,c,     0 2 0 0
>>> .,a,,,     1 0 0 0
>>> .,t,t,t    0 0 0 2
>>> .c,,g,^!.  0 1 1 0
>>> .g,ggg.^!, 0 0 4 0
>>> .$,,,,,.,  0 0 0 0
>>> a,g,,t,    1 0 1 1
>>> ,,,,,.,^!. 0 0 0 0
>>> ,$,,,,.,.  0 0 0 0
>>>
>>> Has the advantage that the input data ends up as rownames, which was a
>>> surprise.
>>>
>>> If you wanted to count "A" and "a" as equivalent, then the split
>>> argument should be "a|A"
>>>
>>>
>>
>>>> As you mentioned, if I want to count A and a I should split like
>>>> this.
>>
>> But can I also count . and , using
>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>> split=".|,"), length) , "-", 1)),
>>
>> I tried it but it is not working. It gives output, but in some places
>> it shows too many . and , and elsewhere it does not count them at all
>> and just shows 0.
>
> You need to use valid regular expressions for 'split'. Since "." is a
> special character in a regex (it matches any character), it needs to be
> escaped when you want the literal to be recognized as such. Escaping ","
> is not required, but it is harmless.
>
> I haven't figured out why but you need to drop the final operation of
> subtracting 1 from the values when counting commas:
>
> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
> split="\\."), length) , "-", 1))
> ,commas = unlist( lapply( sapply(txtvec, strsplit,
> split="\\,"), length) ) )
>            periods commas
> .a,g,,           1      3
> .t,t,,           1      3
> .,c,c,           1      3
> .,a,,,           1      4
> .,t,t,t          1      4
> .c,,g,^!.        1      4
> .g,ggg.^!,       2      2
> .$,,,,,.,        2      6
> a,g,,t,          0      4
> ,,,,,.,^!.       1      7
> ,$,,,,.,.        1      7
>
> --
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>