[BioC] DEseq for chip-seq data normalisation
Giuseppe Gallone
giuseppe.gallone at dpag.ox.ac.uk
Tue Nov 5 13:57:11 CET 2013
Hello Lucia,
this is all great info! Thank you very much for taking the time to share
your findings. I am indeed using DiffBind and have found some interesting
results, but I now need direct access to my downsampled mapped
reads.
With DiffBind, I was basically throwing in my mapped reads and my peak
intervals and getting a differential analysis out (after setting up a
contrast).
Now I'd like to try something slightly different and need to work with
normalised BAMs of my samples. The problem I have is that my 10 BAM files
have wildly varying numbers of mapped reads. I would like to downsample them
all to a common minimum depth before examining quantitative differences in
the peak signals across them.
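
A minimal sketch of the arithmetic behind downsampling to a common minimum, in plain Python. The sample names and read counts are made up for illustration. The per-sample fraction it prints is the kind of value one would hand to a downsampling tool, for example the probability setting of Picard DownsampleSam or the subsampling option of samtools view.

```python
def downsample_fractions(mapped_reads):
    """Return, per sample, the fraction of reads to keep so that every
    sample ends up with roughly as many reads as the smallest one."""
    if not mapped_reads:
        raise ValueError("no samples given")
    target = min(mapped_reads.values())
    if target <= 0:
        raise ValueError("every sample needs at least one mapped read")
    return {name: target / n for name, n in mapped_reads.items()}


# Hypothetical mapped-read counts for three samples (assumed values).
counts = {"rep1": 12_000_000, "rep2": 30_000_000, "rep3": 6_000_000}
fractions = downsample_fractions(counts)
for name, frac in sorted(fractions.items()):
    print(f"{name}\t{frac:.3f}")
```

Random subsampling with these fractions only gives approximately equal depths, since each read is kept or dropped independently.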
I was hoping I could do this with DESeq: feed it some BAMs and obtain
normalised versions of them. But I understand this is not possible?
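
For context, DESeq works on a table of counts rather than on the reads themselves: it estimates one scaling factor per sample and divides the counts by it, so there is no normalised BAM to get back. A rough pure-Python sketch of the median-of-ratios idea behind those size factors follows. It is an illustration with made-up counts, not the package's own code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of rows (one per peak); each row holds the raw counts
    for that peak in every sample.
    """
    n_samples = len(counts[0])
    # Only peaks with a non-zero count in every sample have a finite
    # log geometric mean, so only those are used as the reference.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no peak has non-zero counts in all samples")
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Made-up counts: three peaks (rows) in two samples (columns); the second
# sample was sequenced about twice as deeply as the first.
raw = [[10, 20], [100, 200], [50, 100]]
factors = size_factors(raw)
normalised = [[c / f for c, f in zip(row, factors)] for row in raw]
print(factors)
print(normalised)
```

After dividing by the size factors, the two columns of this toy table agree, which is the sense in which the counts are "normalised".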
I guess I will try to downsample my BAMs myself, using for example
Picard, and then take it from there. Are there any alternatives
you'd suggest? I know MACS can also downsample BAM files. Thanks!
Giuseppe
On 11/04/13 18:13, Lucia Peixoto wrote:
> Hi Giuseppe,
> Unfortunately there is not much available to do stats on ChIP-seq data.
> It is my experience that the data show exactly the same overdispersion
> problem that is seen in RNA-seq, so using edgeR, DESeq or DESeq2 to
> analyze ChIP-seq data is the way to go. There are a couple of challenges
> along the way that make this undertaking not quite straightforward. The
> only Bioconductor package I know of that tries to tackle these issues is
> DiffBind, so you can give it a try.
>
> One of the main differences is that, unlike gene or exon coordinates,
> peaks in your individual replicates will not be in exactly the same
> place. If you are working with TF data this will not be too bad, but
> anything nucleosome-associated will have considerable phase shift from
> replicate to replicate. So you first have to do some sort of merging of
> reproducible peaks into regions.
>
> I do not recommend doing the peak calling with the pooled data. After
> doing several ChIP-seq experiments with replicates I have observed that
> a lot of peaks, even ones with high z-scores/low p-values, do not show
> up in more than one replicate (but maybe this is particular to my type
> of experiments). Merging all the peaks leads to a high number of false
> positives. So you need to integrate the peak locations into a single
> file but make sure you have a minimum number of carriers for each peak;
> I usually require presence in at least 2 of the replicates.
> You can make a GFF file that you can feed into HTSeq, in which you define
> the reproducible peak regions in your samples as if it were the GFF with
> the gene models, but making this file takes a little bit of work.
> We are currently preparing a package for CRAN submission to specifically
> integrate the analysis of ChIP-seq data with replicates with edgeR and
> DESeq, addressing most of what I mentioned above and including a peak
> caller for ease of flow of the analysis. I cannot finish the submission
> until the accompanying biological paper is out, so it won't be available
> until next year.
>
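
The merging of per-replicate peaks into reproducible regions described above can be sketched roughly as follows. This is one possible reading, not the method of the forthcoming package: overlapping peaks from all replicates are unioned into a region, and the region is kept only if at least `min_support` distinct replicates contributed to it. The function name, the parameter name, and the example peaks are made up. Coordinates are assumed to be BED-style (0-based, half-open).

```python
def reproducible_regions(peaks_by_replicate, min_support=2):
    """Union overlapping peaks across replicates and keep regions that
    are supported by at least `min_support` different replicates."""
    # Tag every peak with the index of the replicate it came from and
    # sort by chromosome, then start, so a single sweep can merge them.
    tagged = sorted(
        (chrom, start, end, rep)
        for rep, peaks in enumerate(peaks_by_replicate)
        for chrom, start, end in peaks
    )
    regions = []
    current = None  # [chrom, start, end, set of supporting replicates]
    for chrom, start, end, rep in tagged:
        if current and chrom == current[0] and start < current[2]:
            # Overlaps the open region: extend it and record the replicate.
            current[2] = max(current[2], end)
            current[3].add(rep)
        else:
            if current and len(current[3]) >= min_support:
                regions.append(tuple(current[:3]))
            current = [chrom, start, end, {rep}]
    if current and len(current[3]) >= min_support:
        regions.append(tuple(current[:3]))
    return regions


# Made-up peaks for three replicates.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600)]
rep2 = [("chr1", 150, 260), ("chr2", 10, 50)]
rep3 = [("chr1", 590, 700)]
merged = reproducible_regions([rep1, rep2, rep3], min_support=2)
print(merged)
```

Because peaks are chained by overlap, a run of partially overlapping peaks can grow into one wide region; with nucleosome-associated data that may need an extra width check.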
> hope this was helpful
> best
>
> Lucia
>
>
>
> On Mon, Nov 4, 2013 at 8:47 AM, Giuseppe Gallone
> <giuseppe.gallone at dpag.ox.ac.uk>
> wrote:
>
> Hi
>
> I would like to use DESeq or DESeq2 to normalise the peak signal for
> some ChIP-seq data across 10 biological replicates.
>
> I started looking at the DESeq documentation - it seems the program
> requires a matrix arrangement of raw count data, where each row is a
> peak and each column is a replicate.
>
> What is the best way to obtain this? I have BAM files for the reads,
> obtained with BWA, and BED files (or alternatively narrowPeak files)
> for the peak intervals, obtained using MACS.
>
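
A small sketch of what such a count matrix looks like and how it could be filled, in plain Python. It assumes each read has already been reduced to a (chromosome, position) pair, which in practice would come from the BAM files through a BAM-reading library or a counting tool. The peaks and reads below are invented, and the function name is hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def count_matrix(peaks, reads_by_sample):
    """Rows are peaks, columns are samples; each cell is the number of
    read positions falling inside the peak (BED-style, half-open)."""
    matrix = [[0] * len(reads_by_sample) for _ in peaks]
    for j, reads in enumerate(reads_by_sample):
        # Sorted positions per chromosome allow a binary search per peak.
        positions = defaultdict(list)
        for chrom, pos in reads:
            positions[chrom].append(pos)
        for plist in positions.values():
            plist.sort()
        for i, (chrom, start, end) in enumerate(peaks):
            plist = positions.get(chrom, [])
            # Count positions p with start <= p < end.
            matrix[i][j] = bisect_left(plist, end) - bisect_left(plist, start)
    return matrix


# Invented peaks and read positions for two samples.
peaks = [("chr1", 100, 200), ("chr1", 500, 700)]
sample_a = [("chr1", 120), ("chr1", 199), ("chr1", 200), ("chr1", 650)]
sample_b = [("chr1", 510), ("chr1", 520), ("chr2", 150)]
matrix = count_matrix(peaks, [sample_a, sample_b])
for peak, row in zip(peaks, matrix):
    print(peak, row)
```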
> I gather it is possible to use a program called HTSeq to compute
> these counts, however this program seems unable to deal with BED
> files, only with GFF files, and I'd prefer working directly with my
> BEDs if at all possible. Thank you.
>
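
If a GFF is needed after all, converting BED intervals is mostly a matter of shifting coordinates: BED starts are 0-based and ends are exclusive, while GFF uses 1-based inclusive coordinates. A minimal sketch follows. The source label, the feature type "peak", and the attribute name "peak_id" are arbitrary choices made here for illustration, so a counting tool such as htseq-count would have to be pointed at that feature type and attribute.

```python
def bed_to_gff_lines(bed_lines, source="merged_peaks", feature="peak"):
    """Convert BED-style lines (chrom, start, end[, name, ...]) into
    GFF-style lines with a peak_id attribute."""
    gff = []
    n = 0
    for line in bed_lines:
        line = line.strip()
        # Skip blanks, comments and UCSC header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        n += 1
        # Use the BED name column when present, else a running number.
        if len(fields) > 3 and fields[3] not in ("", "."):
            name = fields[3]
        else:
            name = f"peak_{n}"
        gff.append("\t".join([
            chrom, source, feature,
            str(start + 1),  # 0-based BED start -> 1-based GFF start
            str(end),        # exclusive BED end == inclusive GFF end
            ".", ".", ".",   # score, strand, frame left unset
            f'peak_id "{name}"',
        ]))
    return gff


# Invented BED lines.
bed = ["chr1\t100\t260", "chr1\t500\t700\tmy_peak"]
gff_lines = bed_to_gff_lines(bed)
print("\n".join(gff_lines))
```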
> Best regards
> Giuseppe
>
> _________________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
>
> --
> Lucia Peixoto PhD
> Postdoctoral Research Fellow
> Laboratory of Dr. Ted Abel
> Department of Biology
> School of Arts and Sciences
> University of Pennsylvania
>
> "Think boldly, don't be afraid of making mistakes, don't miss small
> details, keep your eyes open, and be modest in everything except your
> aims."
> Albert Szent-Gyorgyi