[BioC] differential binding question

Wed Jan 4 23:11:42 CET 2012

Hi all,

Mali: Very interesting dataset.

Rory: As I mentioned offline, I don't think offset-style normalization will work here, since the "inputs" are themselves observations and have their own variability.  Offsets are meant to be fixed quantities; in practice they are estimated, but from a lot of data (e.g. TMM).

I'm not keen on the idea of direct subtraction -- anyone for a Skellam distribution?

What about just fitting a GLM?  If I understand the design correctly, this seems appropriate:

------
> g <- rep( 0:1, each=4 )
> p <- rep( c("input","IP"), 4 )
> d <- model.matrix( ~ g*p )
> rownames(d) <- paste(g,p,sep=".")
> d
        (Intercept) g pIP g:pIP
0.input           1 0   0     0
0.IP              1 0   1     0
0.input           1 0   0     0
0.IP              1 0   1     0
1.input           1 1   0     0
1.IP              1 1   1     1
1.input           1 1   0     0
1.IP              1 1   1     1
------

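The first three columns account for the baseline expression level, the expression difference between conditions, and the difference between IP and input.  Column 4 (the g:pIP interaction) captures how the IP-vs-input difference changes between conditions, and should be what you are interested in.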
Once you have a table of counts, edgeR and DESeq can do this.
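
To make the role of column 4 concrete, here is a minimal sketch in base R for a single feature.  The counts are invented for illustration, and a plain Poisson GLM stands in for the negative binomial models that edgeR and DESeq actually fit.  The optional second part assumes edgeR is installed and uses simulated data; treat it as one possible way to set this up, not a vetted pipeline.

```r
# Toy counts for ONE feature; all numbers are made up for illustration.
g <- rep(0:1, each = 4)                        # condition: 0 or 1
p <- factor(rep(c("input", "IP"), 4),
            levels = c("input", "IP"))         # fix level order explicitly
counts <- c(100, 50, 100, 50,                  # condition 0: input, IP, input, IP
            200, 200, 200, 200)                # condition 1: input, IP, input, IP

# Poisson GLM with the same design as above (log link).
fit  <- glm(counts ~ g * p, family = poisson())
beta <- coef(fit)

# The interaction is the log of the ratio of IP/input ratios.
ratio0 <- 50 / 100                             # IP/input in condition 0
ratio1 <- 200 / 200                            # IP/input in condition 1
interaction_expected <- log(ratio1 / ratio0)
print(round(beta, 3))
print(interaction_expected)

# Optional: the same design over many features with edgeR (simulated data).
if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  design <- model.matrix(~ g * p)
  mu <- matrix(100, nrow = 200, ncol = 8)      # 200 hypothetical features
  mu[1:20, g == 1 & p == "IP"] <- 200          # 20 features gain binding
  m <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = 200)
  y <- edgeR::DGEList(counts = m)
  y <- edgeR::calcNormFactors(y)
  y <- edgeR::estimateDisp(y, design)
  fit_e <- edgeR::glmFit(y, design)
  lrt <- edgeR::glmLRT(fit_e, coef = 4)        # test the g:pIP interaction
  print(edgeR::topTags(lrt, n = 5))
}
```

In the toy feature, expression doubles between conditions (input 100 to 200), but that change is absorbed by the g coefficient; only the change in the IP/input ratio ends up in g:pIP.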

Of course, the challenge is how you do the counting.  I'm not sure what sort of features are expected/possible with this assay, maybe a peak/region finder will do the trick.  And, there might be some normalization challenges too.

Hope that helps,
Mark

On 04.01.2012, at 17:24, Rory Stark wrote:

> That's the big question Mali! The more I think about it, the less confident I am that it will work.
> 
> As I understand it, you want to control for transcripts whose expression may change but whose affinity (rate at which the protein binds) stays the same. Without the control, higher expression = more transcripts = more RNA pulled down by the IP even at the same affinity. I'm not sure that subtracting the transcripts independent of binding (the RNA-Seq) will work. Besides the normalization issue relating to the RNA-Seq counts, the problem is that there should always be more transcripts in the control (as they include both bound and unbound transcripts) than in the IP (only bound transcripts). So even if the normalization were perfect, the subtraction would always result in a negative number of counts (set to a minimum of 1 count per peak in DiffBind).
> 
> In this case, subtracting the control is probably too crude. Mark Robinson, an author of edgeR, is doing some interesting work on incorporating copy number information into differential ChIP-Seq analysis, and I am adding it to DiffBind. He is able to cast the problem as a normalization issue. I'm thinking that would be a better approach to your problem: the RNA-Seq gives "copy number" information (overall transcript abundance), and this is incorporated as a normalization term, leaving the differential analysis to identify changes in affinity. I'm working on this right now, so if you are interested you might be a beta tester. Let me know.
> 
> I still think it is worth running your data in DiffBind to see how it looks as a start.
> 
> Cheers-
> Rory
> 
> From: mali salmon <shalmom1 at gmail.com>
> Date: Wed, 4 Jan 2012 15:09:11 +0000
> To: Cancer Research UK <rory.stark at cancer.org.uk>
> Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: Re: [BioC] Fwd: differential binding question
> 
> Thanks Rory and Heidi for replying.
> Would read subtraction be enough to account for the difference in gene expression?
> Mali
> 
> On Wed, Jan 4, 2012 at 2:11 PM, Rory Stark <Rory.Stark at cancer.org.uk> wrote:
> Hi Mali-
> 
> You can try this pretty easily using DiffBind. I suggest calling peaks on each IP separately (each IP and its matching RNA-Seq control) and read these four peaksets into DiffBind (you could also use two peak callers and read in all eight peaksets to identify more potential sites). DiffBind lets you derive an overall set of peaks (either a superset of all the peaks, or any that overlap in at least two [or more] peaksets), does the read counting (by default subtracting reads in the matching RNA-seq controls), runs edgeR and/or DESeq to identify differentially bound regions, and offers several plots and reports to characterize the results.
> 
> A couple of caveats: With only two replicates of each condition, your power to reliably identify significant differences is limited. Also, while the IP reads will be normalized, the control reads will not be (unless you do some normalization separately prior to loading them into DiffBind). However, this does seem to be a good place to start!
> 
> Cheers-
> Rory
> 
> ----------------------------------------------------------------------------
> Dr. Rory Stark
> 
> Principal Bioinformatics Analyst
> 
> Cambridge Research Institute - Cancer Research UK
> Robinson Way
> Cambridge CB2 0RE
> United Kingdom
> +44 1223 404 311
> 
> rory.stark at cancer.org.uk
> ----------------------------------------------------------------------------
> 
> On 04/01/2012 13:37, "mali salmon" <shalmom1 at gmail.com> wrote:
> 
> Dear Users
> We have RNA-IP-seq for two conditions with two biological replicates each.
> So in total we have 8 samples:
> 2 for condition1 IP
> 2 for condition1 Input
> 2 for condition2 IP
> 2 for condition2 Input
> We would like to find differential binding between the two conditions that
> is not influenced by differences in gene expression (the Input samples are
> actually regular RNA-seq).
> I thought of first finding peak regions (maybe by pooling all IP and all
> Input samples and running a ChIP-seq peak caller), counting how many reads
> fall within these regions in each of the samples, and then running DESeq
> and edgeR in order to find differential binding.
> Can this be done with edgeR and DESeq (again, the Input is different for
> the two conditions, and we would like to cancel out differential gene
> expression)?
> Thanks
> Mali
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor