[BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno

Zhu, Lihua (Julie) Julie.Zhu at umassmed.edu
Mon Mar 18 21:09:46 CET 2013


Stefan,

FYI, Jianhong will modify the function with two new parameters, one is
NucleotideLevel (TRUE or FALSE) to allow both peak centric and nucleotide
centric view. Another parameter will be precedence, which is only applicable
to peak centric view (NucleotideLevel = FALSE). If no precedence specified,
double count will be enabled, which means that if a peak overlap with both
promoter and 5'UTR, then both promoter and 5'UTR will be incremented. If a
precedence order is specified, for example, if promoter is specified before
5'UTR, then only promoter will be incremented for the same example. In
addition, Jaccard index will be computed.

Please let us know your thoughts on this. Many thanks for your brilliant
ideas!

Best regards,

Julie


On 3/18/13 9:11 AM, "Lihua Julie Zhu" <julie.zhu at umassmed.edu> wrote:

> Stefan,
> 
> Many thanks for the detailed information on your testing procedure and
> results!
> 
> AssignChromosomeRegion is peak centric and it considers the the precedence
> of each region. Since peaks are usually in the promoter region, it got the
> highest precedence. The reason to have precedence is to ensure the peak
> number assigned to all different regions adds up to total peak number, which
> is different from nucleotide centric view implemented by bed tools.
> 
> Thanks for bringing up a very good point! We probably need to add a
> parameter to have an option to choose between peak centric view and
> nucleotide centric view.
> 
> Best regards,
> 
> Julie
> 
> 
> On 3/17/13 3:47 PM, "Stefan Pinter" <pinter at molbio.mgh.harvard.edu> wrote:
> 
>> Hi Julie,
>> 
>> maybe I'm misunderstanding something about this function. Below is what I've
>> done as a positive control (using refseq genes as "peaks" compared to ensembl
>> annotation), can you tell me whether these numbers make sense? I was
>> expecting
>> to see close to 100% overlap (according to bedtools closest on refseq genes
>> against ensembl genes, 958/967 unique refseq X-linked genes are overlapping
>> an
>> ensembl gene (distance = 0)).
>> 
>> But with prox. promoter cutoff = 0, I  get only get 42% prox. promoter and
>> 45%
>> "enhancer", which I believe could also be called intergenic, right? Even with
>> default of 1k cutoff, only 81% would be counted as "genic" (sum of everything
>> - "Enhancer").
>> Thanks,
>> - Stefan.
>> 
>> 
>> First, I pulled in the ensembl gene set from archive, because my data are in
>> mm9:
>> 
>> ensembl67=useMart(host='may2012.archive.ensembl.org',
>> biomart='ENSEMBL_MART_ENSEMBL', dataset='mmusculus_gene_ensembl')
>>> TSS <- getAnnotation(ensembl67, featureType ="TSS")
>>> Exon <- getAnnotation(ensembl67, featureType ="Exon")
>>> utr5 <- getAnnotation(ensembl67, featureType ="5utr")
>>> utr3 <- getAnnotation(ensembl67, featureType ="3utr")
>> 
>> Then, I used all X-linked genes from the conservative mm9 refseq gene set
>> (merged to contain unique names only) as my "peak" list [as a positive
>> control].
>> 
>>> bed.rGm<-read.table("genes/chrX.rGm4.bed", header = FALSE)
>>> rd.rGm<-BED2RangedData(bed.rGm, header=FALSE)
>>> rd.rGm
>> RangedData with 967 rows and 2 value columns across 1 space
>> 
>> Finally, I annotated these "peaks" relative to TSS and used
>> assignChromosomeRegion with 0k or 1k cutoff:
>> 
>>> rda.rGm = annotatePeakInBatch(rd.rGm, AnnotationData=TSS, PeakLocForDistance
>>> = "middle")
>>> l.feat0k.rGm <-ChIPpeakAnno:::assignChromosomeRegion(rda.rGm, TSS, Exon,
>>> utr5, utr3, proximal.promoter.cutoff=0, immediate.downstream.cutoff=0)
>>> l.feat0k.rGm
>> $Exon
>> [1] 3.205791
>> 
>> $Intron
>> [1] 0
>> 
>> $`5UTR`
>> [1] 3.516029
>> 
>> $`3UTR`
>> [1] 6.30817
>> 
>> $`Proximal Promoter%`
>> [1] 41.98552
>> 
>> $`Immediate Downstream`
>> [1] 0
>> 
>> $Enhancer
>> [1] 44.98449
>> 
>>> l.feat1k.rGm <-ChIPpeakAnno:::assignChromosomeRegion(rda.rGm, TSS, Exon,
>>> utr5, utr3)
>>> l.feat1k.rGm
>> $Exon
>> [1] 3.205791
>> 
>> $Intron
>> [1] 0
>> 
>> $`5UTR`
>> [1] 3.516029
>> 
>> $`3UTR`
>> [1] 6.30817
>> 
>> $`Proximal Promoter%`
>> [1] 80.97208
>> 
>> $`Immediate Downstream`
>> [1] 0.1034126
>> 
>> $Enhancer
>> [1] 5.894519
>> 
>> ----- Original Message -----
>> From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>> To: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>,
>> "<bioconductor at r-project.org>" <bioconductor at r-project.org>
>> Cc: "Jianhong Ou" <Jianhong.Ou at umassmed.edu>
>> Sent: Saturday, March 16, 2013 8:25:44 PM
>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>> 
>> Stefan,
>> 
>> Could you please send me a simple dataset to see what went wrong? Thanks for
>> letting know!
>> 
>> Thanks for hashing out the chromosome region enrichment score calculation!
>> Looks like you are looking for something very similar to GO enrichment
>> analysis (getEnrichedGO function in ChIPpeakAnno).
>> 
>> For option a, as you mentioned, we will need the genome level estimation of
>> the distribution for each category, which could be tricky. Once we have
>> these estimation, then Hypergeometric test can be applied to find whether a
>> category is enriched/depleted.
>> 
>> Jaccard index (option b) is an interesting alternative. We could implement
>> Jaccard index at the peak level first and add options for computing Jaccard
>> index at the nucleotide level later.
>> 
>> Best regards,
>> 
>> Julie
>> 
>> 
>> On 3/16/13 7:41 PM, "Stefan Pinter" <pinter at molbio.mgh.harvard.edu> wrote:
>> 
>>> Julie, I compared my results from ChipPeakAnno with a simple overlap in
>>> bedtools: the % of peaks inside genes (exon + intron + 3utr + 5utr) reported
>>> by ChIPpeakAnno is over an order of magnitude lower than what I learned from
>>> bedtools. Browsing the data indicates that bedtools is right. Something is
>>> off, do you have a test.data set you can check against?
>>> Thanks,
>>> -S.
>>> 
>>> ----- Original Message -----
>>> From: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>>> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>> Cc: "bioconductor" <bioconductor at r-project.org>
>>> Sent: Saturday, March 16, 2013 1:30:55 PM
>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>> 
>>> Hi Julie, 
>>> 
>>> well, I was more thinking of something along the lines of either:
>>> 
>>> a.) Over/Under-representation:
>>> Percentage of exons in genome = 2%
>>> Percentage of peaks overlapping exons = 10%
>>> Exons are 5-fold over-represented
>>> 
>>> But I realize now, that will only work with a full annotation, not a custom
>>> annotation.
>>> 
>>> b.) something like the jaccard index used in bedtools:
>>> 
>>> total length of intersection / total length of union
>>> 
>>> http://en.wikipedia.org/wiki/Jaccard_index
>>> 
>>> Just suggestions, there are probably many more possible ways to calculate a
>>> score.
>>> Thank you,
>>> - Stefan.
>>> 
>>> ----- Original Message -----
>>> From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>> To: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>>> Cc: "<bioconductor at r-project.org>" <bioconductor at r-project.org>
>>> Sent: Friday, March 15, 2013 1:28:09 PM
>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>> 
>>> Stefan,
>>> 
>>> Regarding the enrichment/depletion score, Does the following toy example
>>> illustrate how to calculate such score? If not, could you please give a toy
>>> example? Thanks!
>>> 
>>> For example, if the total number of peaks = 100, number of peaks assigned to
>>> promoter = 90, number of peaks assigned to enhancer = 10, then the
>>> enrichment score = 90% and 10% for promoter region and enhancer region
>>> respectively.
>>> 
>>> We will add closest to annotatePeakInBatch in the dev version. Thanks for
>>> the great suggestion!
>>> 
>>> Please cc bioconductor at r-project.org. Thanks!
>>> 
>>> Best regards,
>>> 
>>> Julie
>>> 
>>> 
>>> On 3/15/13 12:26 PM, "Stefan Pinter" <pinter at molbio.mgh.harvard.edu> wrote:
>>> 
>>>> Sure, thank you Julie.
>>>> 
>>>> Yes, that is what I meant. Since all the necessary information (peak
>>>> locations, feature locations, total count, total length) are already
>>>> available
>>>> in RangedData & Annotation, it would be very convenient to calculate
>>>> enrichment/depletion and possibly even significance scores (by permutation)
>>>> on
>>>> the fly. The significance score may be overkill, but if the function at
>>>> least
>>>> reported enrichment/depletion scores, the user could always supply a number
>>>> of
>>>> shuffled ranges to build a random model of enrichment scores and calculate
>>>> significance after.
>>>> 
>>>> In addition, would it be possible to add another definition for
>>>> PeakLocForDistance in annotatePeakInBatch?
>>>> PeakLocForDistance = "start means using start of the peak to calculate the
>>>> distance to feature"
>>>> 
>>>> It would be helpful to have "closest", meaning distance to feature measured
>>>> from peak start or peak end, whichever is closer. That would help with
>>>> broad
>>>> peaks, which using "middle" for isn't very helpful.
>>>> 
>>>> Thank you and Best wishes,
>>>> - Stefan.
>>>> 
>>>> ----- Original Message -----
>>>> From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>>> To: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>>>> Cc: "<bioconductor at r-project.org>" <bioconductor at r-project.org>
>>>> Sent: Friday, March 15, 2013 8:06:03 AM
>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>> 
>>>> Stefan,
>>>> 
>>>> You can download 2.6.1 to directly use assignChromosomeRegion without the
>>>> package prefix (I updated last night and it might take sometime to
>>>> propagate
>>>> to the installation site). To find out what parameters supported by the
>>>> function, in R, please type help( ChIPpeakAnno:::assignChromosomeRegion).
>>>> You will see that indeed PeakLocForDistance is not supported. If needed, I
>>>> can add such parameter.
>>>> 
>>>> Also regarding your previous suggestion of adding enrichment status of
>>>> feature length, do you mean enrichment of peaks in certain category of
>>>> chromosome region? For example, a significant enrichment score with 90%
>>>> peaks in promoter region would be interesting.
>>>> 
>>>> Could you please keep the thread in the Bioc for others to
>>>> contribute/benefit? Thanks!
>>>> 
>>>> Best regards,
>>>> 
>>>> Julie
>>>> 
>>>> 
>>>> On 3/14/13 10:10 PM, "Stefan Pinter" <pinter at molbio.mgh.harvard.edu> wrote:
>>>> 
>>>>> PPS. I think PeakLocForDistance is not working in that function:
>>>>> 
>>>>>> l.feat.dRF8.100k <-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>>>> Exon,
>>>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>>>> immediate.downstream.cutoff=100000, PeakLocForDistance="middle")
>>>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS, Exon, utr5,
>>>>> :
>>>>>   unused argument(s) (PeakLocForDistance = "middle")
>>>>>> l.feat.dRF8.100k <-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>>>> Exon,
>>>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>>>> immediate.downstream.cutoff=100000, PeakLocForDistance= middle)
>>>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS, Exon, utr5,
>>>>> :
>>>>>   unused argument(s) (PeakLocForDistance = middle)
>>>>>> l.feat.dRF8.100k <-ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS,
>>>>>> Exon,
>>>>>> utr5, utr3, proximal.promoter.cutoff=100000,
>>>>>> immediate.downstream.cutoff=100000, PeakLocForDistance = c("middle"))
>>>>> Error in ChIPpeakAnno:::assignChromosomeRegion(rda.dRF8, TSS, Exon, utr5,
>>>>> :
>>>>>   unused argument(s) (PeakLocForDistance = c("middle"))
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>>>>> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>>>> Sent: Thursday, March 14, 2013 9:39:10 PM
>>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>>> 
>>>>> PS. if I might add a simple feature addition to that function - it would
>>>>> be
>>>>> a
>>>>> TRUE/FALSE trigger for whether to also report enrichment/depletion stats
>>>>> based
>>>>> on feature lengths, i.e. 50% peaks labeled as intergenic (or enhancers) is
>>>>> less interesting than 50% of peaks reported as exonic (much greater
>>>>> enrichment
>>>>> as total exonic feature length is smaller). Thanks, Best...S
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Stefan Pinter" <pinter at molbio.mgh.harvard.edu>
>>>>> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>>>> Cc: "bioconductor" <bioconductor at r-project.org>
>>>>> Sent: Thursday, March 14, 2013 9:32:33 PM
>>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>>> 
>>>>> Hi Julie,
>>>>> 
>>>>> yes, that worked! Thank you for the quick help, I very much appreciate it!
>>>>> Thank you and Best wishes,
>>>>> - Stefan.
>>>>> 
>>>>> ----- Original Message -----
>>>>> From: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>
>>>>> To: "Lihua Zhu (Julie)" <Julie.Zhu at umassmed.edu>, "Stefan Pinter"
>>>>> <pinter at molbio.mgh.harvard.edu>
>>>>> Cc: "<bioconductor at r-project.org>" <bioconductor at r-project.org>
>>>>> Sent: Thursday, March 14, 2013 8:50:51 PM
>>>>> Subject: Re: [BioC] assignChromosomeRegion in 2.6.0 ChIPpeakAnno
>>>>> 
>>>>> Stefan,
>>>>> 
>>>>> Thanks for reporting this! Actually the function assignChromosomeRegion
>>>>> is
>>>>> not exported.
>>>>> 
>>>>> Could you please try the following to see if you can use it? Thanks!
>>>>> 
>>>>> ChIPpeakAnno:::assignChromosomeRegion
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> Julie
>>>>> 
>>>>> 
>>>>> On 3/14/13 7:36 PM, "Lihua Julie Zhu" <julie.zhu at umassmed.edu> wrote:
>>>>> 
>>>>>> Stefan,
>>>>>> 
>>>>>> Could you please send me the sessionInfo? I just want to make sure you
>>>>>> installed version 2.6.0. Thanks!
>>>>>> 
>>>>>> I noticed that when I try to install ChIPpeakAnno with the following
>>>>>> code,
>>>>>> I
>>>>>> get 2.4 version instead.
>>>>>> 
>>>>>> source("http://bioconductor.org/biocLite.R")
>>>>>> biocLite("ChIPpeakAnno")
>>>>>> 
>>>>>> Best regards,
>>>>>> 
>>>>>> Julie
> 



More information about the Bioconductor mailing list