[BioC] Rsubread alingner
Wei Shi
shi at wehi.EDU.AU
Fri Apr 22 08:54:47 CEST 2011
Hi João:
You should use a lower consensus cutoff when using Rsubread to map your 36bp reads. The default setting of requiring at least 3 consensus subreads should only be used when reads are at least 45 bases long (so that at least 10 subreads can be extracted from each read). You can use a consensus threshold of 2 (TH1=2) to map your data. This should be able to get more reads mapped at a quite low mapping error rate.
Attached is the C program for retrieving substrings from each read. Use this command to compile it on a linux computer:
> gcc -g -o get_substrings get_substrings.c
Typing ./get_substrings will tell you help to use it. It only works with a FASTQ file. The output will be a text file in which each line is a extracted substring.
After you got the output file, for example a file called "output.txt", issue the following command to get the frequencies of distinct substrings:
> sort output.txt | uniq -c > frequency.txt
If your reads include adapter sequence, you should see a 14mer sequence in your "frequency.txt" file which has a very big number.
Hope these will work for you, but let me know they don't.
Cheers,
Wei
On Apr 22, 2011, at 4:11 AM, João Moura wrote:
> Dear Wei Shi,
>
> Is it possible to send me the C code you talked about?
>
> Just another note, using bowtie with the full reads I get 66% while using Rsubread I get only 55%. Maybe the reason is that bowtie is reporting the alignments saying:
>
> "reads with at least one reported alignment"
>
> While maybe Rsubread just gets one hit per read?
>
> Thanks a lot for your help,
>
> On Thu, Apr 21, 2011 at 6:37 AM, Wei Shi <shi at wehi.edu.au> wrote:
> Dear João:
>
> You got to be certain whether the first 14 bases of your reads are adapters or not before you can trim them. You can do this by visual inspection on a small set of reads. But a better way will be to extract all first 14 bases from each read and and then look at the frequency of these 14mer sequences. I wrote a C program before to retrieve barcode sequences for each read, which might be useful for your data as well. This program has not been included in Rsubread package yet, but I will be happy to send you a copy of it if you want to use it.
>
> Cheers,
> Wei
>
> On Apr 21, 2011, at 10:44 AM, João Moura wrote:
>
>> Dear Wei Shi,
>>
>> First of all, thanks a lot for you help. Maybe I was doing it wrong also with bowtie, but the truth is I'm a master student, thus quite armature in all of this.
>>
>> Well, the reason why I trimmed my reads in first place was that sequence content across all bases is biased (which i think is a common problem in RNAseq). As you can see in the picture I send you, the first 14/15 position are biased (which I think is due to sequencing adapters?) So, the reason I followed was to trim the reads in order to be able to map the 22 bases that otherwise would be unmapped because of the adapters or other artifacts in the first 14/15 of the reads of the reads. But, since you say that 16bp is the minimum requirement for Rsubread, it seems that this problem will be solved using your software, right?
>>
>> Do you know if bowtie also uses this approach of subseting the reads? Because if I trim the reads I get an improvement of ~15% on the alignments.
>>
>> Once again, thank you very much for your help.
>>
>> On Thu, Apr 21, 2011 at 2:21 AM, Wei Shi <shi at wehi.edu.au> wrote:
>> Dear João:
>>
>> The default setting of Rsubread alignment requires at least 3 subreads (16bp substrings extracted from the read) to be mapped to the same location to report a hit. Subreads are extracted at a gap of 2bp from the read sequence, i.e. neighboring subreads are 2bp apart. 3 passes of mapping will be performed for mapping three sets of subreads respectively, which are extracted from different start positions of the read (1st base, 2nd base and 3rd base).
>>
>> Therefore, for your trimmed reads which are 20bp long, there will be only up to 2 subreads extracted from each read and they were used for mapping. However, because the consensus cutoff (minimal number of consensus subreads needed to report a hit) is 3 by default (which I believe is the value used in your analysis), no hits can be reported because of insufficient number of extracted subreads.
>>
>> DNA sequences of 20 bases are highly repetitive in human or mouse genome. But I do not know if this is the case for yeast genome. However with shorter read, you are more likely to get ambiguous mapping locations.
>>
>> I would suggest you to use the full read sequences for mapping to take advantage of all the base information in the reads. When mapping 36bp reads, Rsubread can extract 7 subreads from each read. This gives a lot more power for finding mapping locations in a sensitive and accurate way. You can use a consensus threshold of 2 (TH1=2) to call mapping locations, which we found works quite well.
>>
>> Hope this helps.
>>
>> Cheers,
>> Wei
>>
>> On Apr 21, 2011, at 12:31 AM, João Moura wrote:
>>
>>> Dear Wei Shi,
>>>
>>> Indeed I can run the example of the vignette but not my own trimmed reads. With bowtie I get more than 80% but with RSubread i get 0%. I think I got why. the thing is I'm trimming the reads in R and trying to realign them afterwards. But when I'm triming, I trimmed the quality and DNA strings. So, a line like this:
>>>
>>> @SRR014338.24 GA-EAS46_1_209DH:8:1:752:779 length=36
>>>
>>>
>>> Will contain "length=36" all the time, even if my reads and length 20. Do you think this might be a problem? Because if I run with untrimmed reads I get some reasonable results.
>>>
>>> I sent you a untrimmed version and a trimmed version. Maybe you can try...
>>>
>>>
>>> Thanks!
>>>
>>>
>>> On Wed, Apr 20, 2011 at 7:02 AM, Wei Shi <shi at wehi.edu.au> wrote:
>>> Dear Joao:
>>>
>>> Can you run the example enclosed with Rsubread package?
>>>
>>> Cheers,
>>> Wei
>>>
>>> On Apr 20, 2011, at 3:23 AM, João Moura wrote:
>>>
>>>> Dear Wei Shi,
>>>>
>>>> I'm trying out the Rsubread and I'm having strange results. That is I firstly built the reference genome, use a fasta file like this:
>>>>
>>>> [joao at goedel genomes]$ grep ">" yeast.fna
>>>> >Scchr01 1 - 230208
>>>> >Scchr02 1 - 813178
>>>> >Scchr03 1 - 316617
>>>> >Scchr04 1 - 1531917
>>>> >Scchr05 1 - 576869
>>>> >Scchr06 1 - 270148
>>>> >Scchr07 1 - 1090946
>>>> >Scchr08 1 - 562643
>>>> >Scchr09 1 - 439885
>>>> >Scchr10 1 - 745667
>>>> >Scchr11 1 - 666454
>>>> >Scchr12 1 - 1078175
>>>> >Scchr13 1 - 924429
>>>> >Scchr14 1 - 784333
>>>> >Scchr15 1 - 1091289
>>>> >Scchr16 1 - 948062
>>>> >Scmito 1 - 85779
>>>>
>>>> In R I ran the following code:
>>>>
>>>> ref <- file.path("/home/joao/bowtie-0.12.7/genomes/yeast.fna")
>>>> path<-file.path("/home/joao/bowtie-0.12.7/Rsubread")
>>>> buildindex(basename = file.path(path, "reference_index"), reference = ref)
>>>> reads <- dir("/home/joao/bowtie-0.12.7/reads/trimmed4", patt="fastq$", full=TRUE)
>>>> align(index = file.path(path, "reference_index"), readfile1 = reads[[1]],output_file = file.path(path, "alignResults.SAM"))
>>>>
>>>> 11152683 reads were processed in 121.8 seconds.
>>>> Percentage of successfully mapped reads is 0.00%.
>>>>
>>>>
>>>>
>>>> Completed successfully.
>>>>
>>>>
>>>> If I do the same with bowtie I get more than 70%..What am I doing wrong here? Thanks a lot!
>>>>
>>>> --
>>>> João Moura
>>>
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and intended solely for the addressee.
>>> You must not disclose, forward, print or use it without the permission of the sender.
>>> ______________________________________________________________________
>>>
>>>
>>>
>>> --
>>> João Moura
>>> <thousand.fastq><untri_thousand.fastq>
>>
>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the addressee.
>> You must not disclose, forward, print or use it without the permission of the sender.
>> ______________________________________________________________________
>>
>>
>>
>> --
>> João Moura
>
>
> ______________________________________________________________________
> The information in this email is confidential and intended solely for the addressee.
> You must not disclose, forward, print or use it without the permission of the sender.
> ______________________________________________________________________
>
>
>
> --
> João Moura
______________________________________________________________________
The information in this email is confidential and intended solely for the addressee.
You must not disclose, forward, print or use it without the permission of the sender.
______________________________________________________________________
More information about the Bioconductor
mailing list