[BioC] writeFastq to gzip compressed files

Harris A. Jaffee hj at jhu.edu
Wed Sep 25 22:23:43 CEST 2013


I believe gzfile only applies when reading.

With gzcon, you may have to flush or close the connection (not sure which
one - I think your fout2), or your whole session, to get the file written.


On Sep 25, 2013, at 4:07 PM, Marcus Davy wrote:

> Thanks for the responses, Martin I will give your work-around a try.
> 
> 
> I did also try out the following but in both cases they failed, and only
> produced a zero byte sized file (equivalent to bash 'touch' utility on the
> command line) using OS X.
> 
> fout <- gzfile("test.fq.gz", "wb")
> print(fout)
> writeFastq(rfq, fout)
> 
> fout2 <- gzcon(file("test.fq.gz", "wb"))
> print(fout2)
> writeFastq(rfq, fout)
> 
> 
> cheers,
> 
> 
> Marcus
> 
> 
> On Thu, Sep 26, 2013 at 6:01 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> 
>> On 09/25/2013 09:09 AM, Michael Lawrence wrote:
>> 
>>> Ideally, it would be:
>>> writeFastq(rfq, gzfile(fout))
>>> 
>>> But I don't think that works. In theory, it would be feasible to support
>>> this now that R has exposed its connection interface...
>>> 
>> 
>> That'll be an interesting learning experience, and I'll come up with a
>> proper implementation. Here's a work-around, With
>> 
>>  writeFastq_con(rfq, stdout())
>>  yy = writeFastq_con(rfq, gzfile(tempfile()))
>>  readFastq(yy)
>> 
>> Martin
>> 
>> .write1fastq <-
>>    function(x, con, ..., full=FALSE)
>> {
>>    i <- paste0("@", as.character(id(x)), "\n")
>>    fill <- if (full)
>>        paste0("\n+", as.character(id(x)), "\n")
>>    else "\n+\n"
>>    s <- as.character(sread(x))
>>    q <- as.character(quality(quality(**x)))
>>    cat(paste(i, s, fill, q, sep=""), file=con, sep="\n")
>>    length(i)
>> }
>> 
>> writeFastq_con <-
>>    function(object, file, ..., chunkSize=1000000L)
>>    ## object: ShortReadQ instance
>>    ## file: connection,  e.g., stdout(), gzfile(<...>, "wb")
>>    ## ...: additional arguments, in particular
>>    ##   full=TRUE: logical(1), TRUE: replicate id tag on '+' line
>>    ## chunkSize: size of each chunk processed for output, smaller
>>    ##     values use less memory but are slower
>> 
>> {
>>    if (!isOpen(file)) {
>>        open(file, "wb")
>>        on.exit(close(file))
>>    }
>> 
>>    len <- length(object)
>>    start <- 1L
>>    tot <- 0L
>>    while (start < len) {
>>        end <- min(start + chunkSize - 1L, len)
>>        tot <- tot + .write1fastq(object[seq(start, end)], file, ...)
>>        start <- end + 1L
>>    }
>>    if (tot != len)
>>        warning("output length does not equal input length")
>> 
>>    summary(file)$description
>> 
>> }
>> 
>> 
>>> Michael
>>> 
>>> 
>>> On Wed, Sep 25, 2013 at 7:39 AM, Marcus Davy <mdavy86 at gmail.com> wrote:
>>> 
>>> Hi,
>>>> as fastq files are often compressed now, I would like to be able to write
>>>> out to compressed 'fq.gz' files, rather than raw text.
>>>> 
>>>> I have searched the BioC threads and found a similar topic at;
>>>> 
>>>> 
>>>> http://article.gmane.org/**gmane.science.biology.**
>>>> informatics.conductor/45306/**match=writeFastq<http://article.gmane.org/gmane.science.biology.informatics.conductor/45306/match=writeFastq>
>>>> 
>>>> A poor example of what I would like to be able to achieve follows;
>>>> 
>>>> example(readFastq)
>>>> print(rfq)
>>>> fout <- "test.fq"
>>>> writeFastq(rfq, fout)
>>>> system(paste("gzip", fout))
>>>> 
>>>> but this does not compress the data as it is streaming out to file. I
>>>> would
>>>> like to be able to do something like below, so this is a potential
>>>> feature
>>>> request;
>>>> 
>>>> fout <- "test.fq.gz"
>>>> writeFastq(rfq, fout, compress=TRUE)
>>>> 
>>>> Is there an obvious way to compress using writeFastq that I am missing?
>>>> 
>>>> 
>>>> cheers,
>>>> 
>>>> Marcus
>>>> 
>>>> 
>>>>  sessionInfo()
>>>> R version 3.0.0 (2013-04-03)
>>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>> 
>>>> locale:
>>>> [1] C
>>>> 
>>>> attached base packages:
>>>> [1] parallel  splines   stats     graphics  grDevices utils     datasets
>>>> [8] methods   base
>>>> 
>>>> other attached packages:
>>>>  [1] ShortRead_1.18.0     latticeExtra_0.6-24  RColorBrewer_1.0-5
>>>>  [4] Rsamtools_1.12.3     lattice_0.20-15      Biostrings_2.28.0
>>>>  [7] GenomicRanges_1.12.2 IRanges_1.18.0       BiocGenerics_0.6.0
>>>> [10] BiocInstaller_1.10.3 limma_3.16.7         Hmisc_3.12-2
>>>> [13] Formula_1.1-1        survival_2.37-4
>>>> 
>>>>         [[alternative HTML version deleted]]
>>>> 
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.**science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>> 
>>>> 
>>>        [[alternative HTML version deleted]]
>>> 
>>> ______________________________**_________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>> Search the archives: http://news.gmane.org/gmane.**
>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>> 
>>> 
>> 
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>> 
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list