[BioC] consensusString Function

Erik Wright eswright at wisc.edu
Wed Apr 7 18:06:59 CEST 2010


Hi Patrick,

Thanks for closing the loop on this issue.  I have a question about the outputs you show below:

If two strings are ACAG and ACAR where R can be A or G, then it makes sense to me that the consensus is ACAR.

Why then is an N treated differently than an R?  If two strings are NNNN and ACTG, where N can be A, C, T, or G, then based on the prior example the output should be NNNN, but the output you show is ACTG.

If there are three sequences, ACAG, ACAR, and ACAG, and the threshold is set at the default 25% (for a DNAStringSet) then why is the output ACAG?  If an R is a C or G, and the other two codons in the final position are G's then the sum is two G's and one C.  One in three (33.33...%) being C is greater than the default threshold of 25%, so I would expect the consensus character to be R.  Therefore the output should be ACAR.

Thanks for helping me understand the output.

Smiles,
Erik



On Apr 6, 2010, at 5:29 PM, Patrick Aboyoun wrote:

> Erik, Heidi, and Wolfgang,
> To bring this thread full circle, Biostrings::consensusString didn't support ambiguity letters in input strings for BioC <= 2.5 (R <= 2.10). This support has been added to BioC 2.6 (R 2.11), but as Wolfgang mentioned there was a bug in the code. This bug has now been fixed in BioC 2.6 and will be available for download from bioconductor.org within 36 hours. Here are some examples of consensusString operating on DNAStringSet objects containing ambiguity letters:
> 
> >   consensusString(DNAStringSet(c("NNNN","ACTG")))
> [1] "ACTG"
> >   consensusString(DNAStringSet(c("AANN","ACTG")))
> [1] "AMTG"
> >   consensusString(DNAStringSet(c("ACAG","ACAR")))
> [1] "ACAR"
> >   consensusString(DNAStringSet(c("ACAG","ACAR", "ACAG")))
> [1] "ACAG"
> > sessionInfo()
> R version 2.11.0 alpha (2010-04-04 r51591)
> i386-apple-darwin9.8.0
> 
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] Biostrings_2.15.27 IRanges_1.5.74
> 
> loaded via a namespace (and not attached):
> [1] Biobase_2.7.5
> 
> 
> 
> On 4/6/10 2:36 PM, Wolfgang Huber wrote:
>> Hi Erik, Herv'e
>> 
>> please always provide the output of sessionInfo(), and a complete reproducible example (you let Heidi and the others guess that you're talking about the Biostrings package).
>> 
>> With a more recent version of Biostrings, I get:
>> 
>> library("Biostrings")
>> consensusString( DNAStringSet(c("AAAA","ACTG")) )
>> # [1] "AMWR"
>> 
>> consensusString( DNAStringSet(c("AAAB","ACTG")) )
>> # Error in FUN(newX[, i], ...) :
>>  'ambiguityMap' is missing some combinations of row names
>> 
>> And going into the debugger where the error is caused, i.e. within the function "consensusLetter", the expression
>> 
>> i <- paste(all_letters[col >= threshold], collapse = "")
>> 
>> is evaluated where
>> Browse[1]> all_letters
>> [1] "A" "C" "G" "T" "B"    ## length is 5
>> Browse[1]> col
>> [1] 1 0 0 0                ## length is 4
>> Browse[1]> i
>> [1] "AB"                   ## recycling rule was applied
>> 
>> which seems unintended and with some more insight will probably lead to the fixing of a bug.
>> 
>>    Best wishes
>>    Wolfgang
>> 
>> 
>> > sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-04-06 r51617)
>> x86_64-unknown-linux-gnu
>> 
>> locale:
>> [1] LC_CTYPE=C              LC_NUMERIC=C            LC_TIME=C
>> [4] LC_COLLATE=C            LC_MONETARY=C LC_MESSAGES=it_IT.UTF-8
>> [7] LC_PAPER=C              LC_NAME=C               LC_ADDRESS=C
>> [10] LC_TELEPHONE=C          LC_MEASUREMENT=C        LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices datasets  utils     methods   base
>> 
>> other attached packages:
>> [1] Biostrings_2.15.26 IRanges_1.5.74     fortunes_1.3-7
>> 
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.7.5
>> 
>> Heidi Dvinge ha scritto:
>>> Hello Erik,
>>> 
>>> unfortunately I'm not familiar with the Biostrings package, so I can't
>>> tell you why this doesn't work, but until someone else can answer, this
>>> seems to be a work-around.
>>> 
>>> Apparently, consensusString doesn't handle Ns.
>>> 
>>>> test <- DNAStringSet(c("AANN","ACTG"))
>>>> consensusString(test)
>>> Error in .local(x, ...) :
>>>  'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)]
>>> 
>>> If there are no Ns things are okay though.
>>> 
>>>> test2 <- DNAStringSet(c("AAAA","ACTG"))
>>>> consensusString(test2)
>>> [1] "AMWR"
>>> 
>>> However, Ns seem acceptable if the consensus matrix is calculated first,
>>> although they might result in ?s where no consensus could be found.
>>> 
>>>> test3 <- consensusMatrix(test)
>>>> consensusString(test3)
>>> [1] "A???"
>>> 
>>> HTH
>>> \Heidi
>>> 
>>> 
>>>> Hello,
>>>> 
>>>> I am trying to get a consensus string for a DNAStringSet, but I am getting
>>>> an error.  The documentation for consensusString says the argument "x" is
>>>> either a consensus matrix or an XStringSet.  So this should work, right?:
>>>>> myDNAStringSet <- DNAStringSet(c("NNNN","ACTG"))
>>>>> consensusString(myDNAStringSet)
>>>> Error in .local(x, ...) :
>>>>  'threshold' must be a numeric in (0, 1/sum(rowSums(x) > 0)]
>>>> 
>>>> Specifying a threshold in the arguments doesn't seem to make a difference.
>>>> 
>>>> Thanks!,
>>>> Erik
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list