[BioC] alignShortReads

Thu Aug 22 06:42:08 CEST 2013

On 08/21/2013 02:54 PM, Sonja Althammer wrote:
> Hi!
> I am having trouble aligning a DNAStringSet to a BSgenome. The example in
> the help works for me....
> Could anyone explain to me the error I am getting, please and how I can
> make it work? Actually I would like to align these reads to a specific
> gene... Is there maybe a better way?
> Thanks!
> Sonja
>
>
>> alignShortReads(reads, Hsapiens, seqNames="chr1")
> Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,

alignShortReads is from R453Plus1Toolbox.

DNAStringSet can contain ambiguity letters such as 'M', to indicate that the 
nucleotide could be either A or C

 > IUPAC_CODE_MAP
      A      C      G      T      M      R      W      S      Y      K      V
    "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG"
      H      D      B      N
  "ACT"  "AGT"  "CGT" "ACGT"

but the aligner in alignShortReads expects only

 > DNA_BASES
[1] "A" "C" "G" "T"

If your sequences do contain just bases, it would make sense to run 
Biostrings::matchPDict directly on the sequence of the gene of interest. You 
could obtained the sequence of your gene(s) of interest by, e.g., using the 
function ?select and the Homo.sapiens package to get coordinates of the gene, 
?getSeq to extract the sequence from the Hsapiens BSgenome object you already 
have, and matchPDict to do the alignment.

Probably in a way that requires more care, I did the following to arrive at the 
sequence of the transcripts of BRCA1

     library(BSgenome.Hsapiens.UCSC.hg19)
     library(Homo.sapiens)

Discover what I can do:

     ## 'keytypes' I can query with...
     keytypes(Homo.sapiens)
     ## discover the correct way to specify a gene symbol I'm interested in
     grep("BRCA", keys(Homo.sapiens, "SYMBOL"))
     ## discover what info I can extract
     cols(Homo.sapiens)

Now create ranges for transcripts

     ## select genomic coordinates for BRCA1
     xx = select(Homo.sapiens, "BRCA1",
            c("TXID", "TXCHROM", "TXSTRAND", "TXSTART", "TXEND"),
            "SYMBOL")
     ## turn coordinates into a GRanges
     yy = with(xx, GRanges(TXCHROM, IRanges(TXSTART, TXEND), TXSTRAND))

or alternatively map to the central 'Entrez' identifier and...

     ## figure out Entrez ID corresponding to symbol
     eid = select(Homo.sapiens, "BRCA1", "ENTREZID", "SYMBOL")$ENTREZID
     ## extract transcript coordinates from TxDb
     yy = transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")[eid]

What's a gene? Maybe it's just the range of transcripts:

     gn = range(yy)

Finally retrieve the relevant sequence(s)

     seq = getSeq(Hsapiens, yy)  ## or getSeq(Hsapiens, gn)

Judging by the package you're using, you might have longer 454 reads, where the 
ungapped high fidelity matches of matchPDict are not appropriate. If you're 
interested in more general alignments, or if the sequences you're trying to 
align do have ambiguity letters, there is Biostrings::pairwiseAlignment.  It 
might still be that this general purpose aligner is not appropriate for the task 
you're trying to accomplish, so some detail (and perhaps input from others on 
the list!) might be needed.

Hope that provides some guidance.

Martin

> :
>    non base DNA letter found in Trusted Band for pattern 248
>> class(reads)
> [1] "DNAStringSet"
> attr(,"package")
> [1] "Biostrings"
>
>> class(Hsapiens)
> [1] "BSgenome"
> attr(,"package")
> [1] "BSgenome"
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793