[R] gzfile with multiple entries in the archive
Duncan Temple Lang
duncan at wald.ucdavis.edu
Sat Nov 18 23:50:13 CET 2006
Apologies for entering this late, but last week was extremely busy.
I hadn't realized that you would implement something so quickly
and I could have saved you some time.
I was in the process of adding facilities for gzipped tar
files in the Rcompression package.
A new version (0.3-0) is available, as source and for Windows,
from the www.omegahat.org repositories:
www.omegahat.org/R
This uses code from the zlib-1.2.3 contrib/ directory
to do the extraction and the table of contents, so it should be
pretty quick. It also allows for "event-driven" programming
with callback functions, and has "hints" that avoid vector
resizing issues and so make it considerably faster.
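The effect of such a hint can be sketched in plain R. This is not the Rcompression interface; it is only an illustration, assuming the hint is an estimate of the number of entries, and both function names are hypothetical.

```r
# Hypothetical illustration; these functions are not part of Rcompression.

# Grow the result one element at a time: each append may copy the vector.
collect_growing <- function(n) {
  out <- character(0)
  for (i in seq_len(n)) {
    out <- c(out, paste0("entry", i))
  }
  out
}

# Use a size hint: allocate once, fill in place, trim unused slots at the end.
collect_with_hint <- function(n, hint) {
  out <- character(hint)
  used <- 0L
  for (i in seq_len(n)) {
    if (used == length(out)) {
      # The hint was too small: double the capacity instead of growing by one.
      length(out) <- max(1L, 2L * length(out))
    }
    used <- used + 1L
    out[used] <- paste0("entry", i)
  }
  out[seq_len(used)]
}
```

Both functions return the same vector; the second resizes only when the hint turns out to be too small.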
D.
John James wrote:
> Following suggestions from Prof. Ripley and several others to use gzfile,
> here's rough code that will unzip a tgz into your working directory and
> return a list of the files. (It doesn't warn you that it is overwriting
> files!)
>
> The magic numbers refer to the current tar header specification; the block
> sizes etc. are arbitrary.
>
> It is inefficient in that it re-reads the file from the start for every
> file. I couldn't get the file pointer to stay and change the readBin mode
> back from 'character' to 'raw' although the reverse is used! Is there a
> setting I've missed?
>
> Also, is there a better way to do the convert(..) function?
>
> All criticisms gratefully received, especially being pointed to an existing
> function.
>
> John James
> Mango Solutions
>
> unzip <- function(x, archiveDirectory = '.', zipExtension = 'tgz',
>                   block = 50000, maxBlocks = 100, maxCountFiles = 100) {
>   # Example
>   #   unzip('test.tgz')
>   # Convert a number whose digits are written in base oldRoot (here the
>   # octal size field of a tar header) to base newRoot.
>   convert <- function(oct = 2, oldRoot = 8, newRoot = 10) {
>     if (newRoot == 16)
>       return(structure(convert(oct, oldRoot, 10), class = 'hexmode'))
>     if (newRoot > 10)
>       return(simpleError('WIP'))
>     if (class(oct) == 'hexmode') {
>       oct <- unclass(oct)
>       if (newRoot == 10)
>         return(oct)
>       oldRoot <- 10
>       return(simpleError('WIP'))
>     }
>     oct <- as.numeric(oct)
>     ret <- 0
>     oldPower <- 1
>     while (oct > 0.1) {
>       newoct <- floor(oct / newRoot)
>       rem <- oct - newoct * newRoot
>       ret <- rem * oldPower + ret
>       oldPower <- oldPower * oldRoot
>       oct <- newoct
>     }
>     if (newRoot == 16)
>       ret <- structure(ret, class = 'hexmode')
>     ret
>   }
>   listOfFiles <- list()
>   theArchives <- list.files(archiveDirectory, pattern = zipExtension)
>   if (length(grep(x, theArchives)) == 0)
>     return(simpleError(paste('No archive matching *', x, '*.',
>                              zipExtension, ' found')))
>   what <- paste(archiveDirectory, theArchives[grep(x, theArchives)],
>                 sep = .Platform$file.sep)
>   tmp <- tempfile()
>   nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0
>   # Decompress the archive into a temporary plain tar file.
>   zz <- gzfile(what, 'rb')
>   ww <- file(tmp, 'wb')
>   on.exit(unlink(tmp))
>   while (length(mu) > 0) {
>     if (safety > maxBlocks) {
>       return(simpleError(paste('Archive File too large')))
>     }
>     safety <- safety + 1
>     mu <- readBin(zz, 'raw', block)
>     writeBin(mu, ww)
>   }
>   close(zz)
>   close(ww)
>   # Walk the tar file one entry at a time.
>   while (countFiles < maxCountFiles) {
>     countFiles <- countFiles + 1
>     zz <- file(tmp, 'rb')
>     stuff <- readBin(zz, 'raw', n = nextBlockStartsAt)
>     header <- readBin(zz, character(), n = 100)
>     # Keep the first and fifth non-empty strings: normally the file name
>     # and the octal size.
>     header <- header[nchar(header) > 0][c(1, 5)]
>     close(zz)
>     if (any(is.na(header))) {
>       break
>     }
>     listOfFiles[[countFiles]] <- header[1]
>     zz <- file(tmp, 'rb')
>     body <- readBin(zz, 'raw', n = 512 + nextBlockStartsAt +
>                     convert(header[2]))
>     writeBin(body[-c(1:(512 + nextBlockStartsAt))], header[1])
>     readUpTo <- 512 + nextBlockStartsAt + convert(header[2])
>     # File data is padded to a multiple of 512 bytes, so round up to the
>     # next block boundary (no extra block when readUpTo is already on one).
>     nextBlockStartsAt <- ceiling(readUpTo / 512) * 512
>     close(zz)
>   }
>   listOfFiles
> }
>
> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> Sent: 14 November 2006 15:18
> To: John James
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] gzfile with multiple entries in the archive
>
> On Tue, 14 Nov 2006, John James wrote:
>
>
>>If I open a tgz archive with gzfile and then parse it using readLines I miss
>>the initial line of each member of the archive - and also the name of the
>>file although the archive is otherwise complete (but useless!).
>
>
> You can use a gzfile connection to read the underlying .tar file, but that
> is not a text file and you will need to pick its structure apart yourself
> via readBin and readChar.
>
>
>>Is there any way within R to extract both the list of files in a tgz archive
>>and to extract any one of these files?
>
>
>>Clearly I can use zcat and tar on Linux, but I need this to work within the
>>R environment on Windows!
>
>
> You could use tar on Windows: it is in the R tools set.
>
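The questions in the quoted code (re-reading the file for every entry, switching readBin between 'character' and 'raw', and the convert() function) can all be sidestepped by reading the archive once as raw bytes and slicing the header fields out of that vector. What follows is a minimal sketch in base R, not the Rcompression code. It assumes a plain tar layout (name in bytes 1-100 of each 512-byte header, octal size in bytes 125-136), entry names shorter than 100 bytes, and an archive small enough to hold in memory; read_tgz and its chunk argument are hypothetical names.

```r
# Hypothetical helper, not part of base R or Rcompression.
# Returns a named list: one raw vector of contents per archive entry.
read_tgz <- function(path, chunk = 65536L) {
  # Decompress the whole archive into one raw vector, reading it only once.
  con <- gzfile(path, "rb")
  on.exit(close(con))
  chunks <- list()
  repeat {
    buf <- readBin(con, "raw", n = chunk)
    if (length(buf) == 0L) break
    chunks[[length(chunks) + 1L]] <- buf
  }
  bytes <- unlist(chunks)  # NULL (length 0) if the file was empty

  # Extract a header field as text, dropping the NUL padding.
  field <- function(h, from, to) {
    f <- h[from:to]
    rawToChar(f[f != as.raw(0)])
  }

  entries <- list()
  pos <- 0  # 0-based offset of the current header
  while (pos + 512 <= length(bytes)) {
    h <- bytes[(pos + 1):(pos + 512)]
    if (all(h == as.raw(0))) break  # an all-zero block marks the end
    name <- field(h, 1, 100)
    # strtoi() does the octal conversion; it yields NA for sizes that do
    # not fit in an R integer.
    size <- strtoi(trimws(field(h, 125, 136)), base = 8L)
    if (is.na(size)) stop("unreadable size field for entry ", name)
    entries[[name]] <- bytes[pos + 512 + seq_len(size)]
    # File data is padded to a multiple of 512 bytes.
    pos <- pos + 512 + ceiling(size / 512) * 512
  }
  entries
}
```

With this, names(read_tgz('test.tgz')) gives the list of files, and a single member can be written out with writeBin() on the corresponding list element.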
--
Duncan Temple Lang duncan at wald.ucdavis.edu
Department of Statistics work: (530) 752-4782
4210 Mathematical Sciences Building fax: (530) 752-7099
One Shields Ave.
University of California at Davis
Davis,
CA 95616,
USA