[BioC] aggregate genes in DEXSeq

Fri Mar 7 10:57:54 CET 2014

Dear Xiayu,

To avoid genes from merging you have to use "-r no".
You could do the counding steps inside R using the function 
"buildExonCountSet"

Best regards,
Alejandro

> Alejandro Reyes <alejandro.reyes at ...> writes:
>
>> Dear Julien, Dear Mar and people interested in DEXSeq ,
>>
>> You recently reported some problems in DEXSeq that had to do with the
>> way the HTSeq python scripts deal with the exons that overlap with more
>> than one gene ID.
>>
>> The solution that we had taken so far was that the gene IDs sharing an
>> exon were merged into an "aggregate gene" ID.  From the input of some
>> users and our own experience, we know that it was not the most
>> appropriate solution: when the merged genes were differentially
>> expressed, DEXSeq falsely calls differential usage in other exons of the
>> aggregate genes. We have included a "-r" parameter in the script
>> "prepare_annotation_dexseq.py", for the user to decide what to do with
>> these exons: either to ignore the exons associated with more than one
>> gene IDs and treat each gene separately, or to merge the genes and take
>> these exons into account.
>>
>> Additionally, we have implemented the R/Bioconductor functions
>> equivalent to the python scripts. These functions were implemented using
>> code contributed by Mike Love.
>>
>> All these changes are available in the last svn version (1.5.9).
>>
>> Best regards,
>> Alejandro Reyes
>>
>> Hi Alejandro,
>> Just to let you know that adding the junctions to the test of
>> differential expression of DEXSeq worked fine! The "hack" was actually
>> straightforward, I just had to modify the counts files taken as input.
>>
>> On a different note, I noticed that many false positives were generated
>> because of "aggregate" gene models that were composed on different
>> overlapping genes. When these overlapping genes have different behavior
>> in different conditions, this is interpreted as differential expression
>> of some exons, while it is differential expression of genes... See the
>> attached picture, this might turn out to be easier to understand
>> Did you notice this behavior of DEXSeq, and do you have any comment on
>> this?
>>
>> Thanks again for your work on DEXSeq
>> Julien
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at ...
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> Hi, Alejandro
>
> We recently used the python script dexseq_prepare_annotation.py to
> generate exon.gtf from genes.gtf. Similar to the situations in your
> previous discussion, we found some overlapping exons which share
> patial/all regions and in the meanwhile belong to different genes, which
> you can see below, the 1st and the 4th line. Even after using "-r yes"
> parameter, we still see these happening. "-r yes" should be the default.
> We thought by merging the genes and taking all these exons into account
> would generate non-overlapping exon.gtf in the end, but now the gtf still
> have some overlapping exons. We are concerned if the further count files
> generated would be biased in these regions. Do you have any lastest update
> about the solution of this problem?
>
> In your last message, you mentioned about the R functions equivalent to
> the python scripts which were contributed by Mike Love. Could you provide
> more details about how we can find these functions, like the name or the
> website of it?
>
> Thank you very much in advance!
>
> Thanks,
> Xiayu
>
>
> 1       dexseq_prepare_annotation.py    exonic_part     13671
> 14409   .       +       .
> transcripts "ENST00000456328+ENST00000515242+ENST00000518655"; exonic_
> part_number "018"; gene_id "ENSG00000223972"
> 1       dexseq_prepare_annotation.py    exonic_part     14410
> 14412   .       +       .       transcripts "ENST00000515242";
> exonic_part_number "019"; gene_id "ENSG
> 00000223972"
> 1       dexseq_prepare_annotation.py    aggregate_gene  14363
> 29806   .       -       .       gene_id "ENSG00000227232"
> 1       dexseq_prepare_annotation.py    exonic_part     14363
> 14403   .       -       .
> transcripts "ENST00000541675+ENST00000423562+ENST00000438504"; exonic_
> part_number "001"; gene_id "ENSG00000227232"
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor