[R] gzfile with multiple entries in the archive

Sat Nov 18 23:50:13 CET 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Apologies for entering this late, but last week was extremely busy.

I hadn't realized that you would implement something so quickly
and I could have saved you some time.
I was in the process of adding facilities for gzipped tar
files in the Rcompression package.
A new version (0.3-0) is available from the www.omegahat.org
repositories

  www.omegahat.org/R

for source and Windows.

This uses code from the zlib-1.2.3 contrib/ directory
to do the extraction and table of contents. So it should be
pretty quick.  And it allows for "event-driven" programming
with callback functions, and has "hints"
for avoiding vector resizing issues which make it considerably
faster.
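
The Rcompression interface itself is not shown here. As a generic
base-R sketch of the two ideas (a callback invoked once per archive
entry, and a size hint used to preallocate the result), something like
the following could be used. The names make_collector, on_entry and
hint are hypothetical and are not taken from the package.

```r
# Hypothetical collector: not the Rcompression API, just an illustration.
# 'hint' is the caller's guess at the number of entries, used to
# preallocate the buffer so it is not grown one element at a time.
make_collector <- function(hint = 16L) {
  buf <- character(hint)
  n <- 0L
  list(
    # Callback to be invoked once per entry by whatever walks the archive.
    on_entry = function(name) {
      if (n == length(buf)) {
        # Hint was too small: double the buffer instead of growing by one.
        buf <<- c(buf, character(max(length(buf), 1L)))
      }
      n <<- n + 1L
      buf[n] <<- name
      invisible(NULL)
    },
    # Return only the slots that were actually filled.
    result = function() buf[seq_len(n)]
  )
}

collector <- make_collector(hint = 2L)
for (nm in c("a.txt", "b.txt", "c.txt")) collector$on_entry(nm)
entries <- collector$result()
```

A good hint means the buffer is allocated once; a poor one only costs
a few doublings.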

 D.

John James wrote:
> Following suggestions from Prof. Ripley and several others to use gzfile,
> here's rough code that will unzip a tgz into your working directory and
> return a list of the files. (It doesn't warn you that it is overwriting
> files!)
> 
> The magic numbers refer to the current tar header specification; the block
> sizes etc. are arbitrary.
> 
> It is inefficient in that it re-reads the file from the start for every
> file. I couldn't get the file pointer to stay and change the readBin mode
> back from 'character' to 'raw' although the reverse is used! Is there a
> setting I've missed?
> 
> Also, is there a better way to do the convert(..) function?
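
For the octal-to-decimal case, base R's strtoi() may be enough. A
minimal sketch, assuming the size field has already been read as a
string; octal_to_int is a hypothetical name:

```r
# Tar headers store sizes as octal text, possibly padded with spaces.
# strtoi() returns NA for values that do not fit in an R integer, so
# very large members would need separate handling.
octal_to_int <- function(x) strtoi(trimws(x), base = 8L)

size <- octal_to_int("00000001750")

# as.hexmode() covers the base-16 display case.
hex <- format(as.hexmode(size))
```
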
> 
> All criticisms gratefully received, especially being pointed to an existing
> function.
> 
> John James
> Mango Solutions
> 
> unzip <- function(x, archiveDirectory = '.', zipExtension='tgz', block=50000, maxBlocks=100, maxCountFiles=100) {
> 	# Example
> 	# unzip('test.tgz')
> 	convert <- function(oct= 2, oldRoot=8, newRoot=10) {
> 		if((newRoot==16))
> 			return(structure(convert(oct, oldRoot, 10), class='hexmode'))
> 		if(newRoot>10)
> 			return(simpleError('WIP'))
> 		if(class(oct)=='hexmode') {
> 			oct <- unclass(oct)
> 			if(newRoot==10)
> 				return(oct)
> 			oldRoot  <- 10
> 			return(simpleError('WIP'))
> 		}
> 		oct <- as.numeric(oct)
> 		ret <- 0
> 		oldPower <- 1
> 		while(oct > 0.1){
> 			newoct <- floor(oct / newRoot)
> 			rem <- oct - newoct * newRoot 
> 			ret <- rem * oldPower + ret
> 			oldPower <- oldPower * oldRoot
> 			oct <- newoct
> 		}
> 		if(newRoot==16)
> 			ret <- structure(ret,  class = 'hexmode')
> 		ret
> 	}
> 	listOfFiles <- list()
>   	theArchives <- list.files(archiveDirectory, pattern = zipExtension)
>   	if(length(grep(x, theArchives))==0)
>   		return(simpleError(paste('No archive matching *', x, '*.', zipExtension, ' found')))
>   	what <- paste(archiveDirectory, theArchives[grep(x, theArchives)], sep=.Platform$file.sep)
>   	tmp <- tempfile()
> 	nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0
> 	zz <- gzfile(what, 'rb')
> 	ww <- file(tmp, 'wb')
> 	on.exit(unlink(tmp))
> 	while(length(mu)>0) {
> 		if(safety > maxBlocks)	{
> 			return(simpleError(paste('Archive File too large')))
> 		}
> 		safety <- safety + 1
> 		mu <- readBin(zz, 'raw', block)
> 		writeBin(mu, ww) 
> 	}
> 	close(zz)
> 	close(ww)
> 	while(countFiles < maxCountFiles){
> 	  	countFiles <- countFiles + 1
> 	  	zz <- file(tmp, 'rb')
> 	  	stuff <- readBin(zz, 'raw', n=nextBlockStartsAt)
> 	  	header <- readBin(zz, character(), n=100)
> 	  	header <- header[nchar(header)>0][c(1,5)]
> 	  	close(zz)
> 	  	if(any(is.na(header))) {
> 	  		break;
> 	  	}
> 	  	listOfFiles[[countFiles]] <- header[1]
> 	  	zz <- file(tmp, 'rb')
> 	  	body <- readBin(zz, 'raw', n = 512 + nextBlockStartsAt + convert(header[2]))
> 	  	writeBin(body[-c(1:(512 + nextBlockStartsAt))], header[1])
> 		readUpTo <- 512 + nextBlockStartsAt + convert(header[2])
> 		# round up to the next 512-byte boundary (no extra block if already aligned)
> 		nextBlockStartsAt <- ceiling(readUpTo / 512) * 512
> 	  	close(zz)
> 	  }
> 	listOfFiles
> }
> 
> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] 
> Sent: 14 November 2006 15:18
> To: John James
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] gzfile with multiple entries in the archive
> 
> On Tue, 14 Nov 2006, John James wrote:
> 
> 
>>If I open a tgz archive with gzfile and then parse it using readLines I miss
>>the initial line of each member of the archive - and also the name of the
>>file although the archive is otherwise complete (but useless!).
> 
> 
> You can use a gzfile connection to read the underlying .tar file, but that 
> is not a text file and you will need to pick its structure apart yourself 
> via readBin and readChar.
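
A sketch of what picking the structure apart can look like in base R,
assuming plain ustar-style headers (member name in bytes 1-100 and
octal size in bytes 125-136 of each 512-byte header) and ignoring
long-name extensions. list_tgz and field are hypothetical names.
Reading sequentially in raw mode from a single connection avoids
re-opening the file for each member and avoids switching readBin
modes.

```r
# Extract one NUL-terminated text field from a raw header block.
field <- function(block, from, to) {
  bytes <- block[from:to]
  nul <- which(bytes == as.raw(0))
  if (length(nul) > 0) bytes <- bytes[seq_len(nul[1] - 1)]
  rawToChar(bytes)
}

# List member names and sizes of a gzipped tar file.
list_tgz <- function(path) {
  con <- gzfile(path, "rb")
  on.exit(close(con))
  member <- character(0)
  bytes <- numeric(0)
  repeat {
    header <- readBin(con, "raw", n = 512)
    # An all-zero block (or a short read) marks the end of the archive.
    if (length(header) < 512 || all(header == as.raw(0))) break
    size <- strtoi(trimws(field(header, 125, 136)), base = 8L)
    if (is.na(size)) stop("unreadable size field in tar header")
    member <- c(member, field(header, 1, 100))
    bytes <- c(bytes, size)
    # Member data is padded to a multiple of 512 bytes; read past it.
    skip <- ceiling(size / 512) * 512
    if (skip > 0) readBin(con, "raw", n = skip)
  }
  data.frame(name = member, size = bytes, stringsAsFactors = FALSE)
}
```

To extract a member rather than skip it, the bytes returned by the
last readBin call could be truncated to size and written out with
writeBin.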
> 
> 
>>Is there any way within R to extract both the list of files in a tgz archive
>>and to extract any one of these files?
> 
> 
>>Clearly I can use zcat and tar on Linux, but I need this to work within the
>>R environment on Windows!
> 
> 
> You could use tar on Windows: it is in the R tools set.
> 
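
Later versions of R ship untar() in the utils package, which can both
list and extract without an external tar program when called with
tar = "internal". A minimal sketch, where extract_member is a
hypothetical wrapper and the member name is assumed to match the
listing exactly:

```r
# Hypothetical helper: list a .tgz and extract one member into exdir.
extract_member <- function(tgz, member, exdir = tempdir()) {
  # list = TRUE returns the member names without extracting anything.
  contents <- utils::untar(tgz, list = TRUE, tar = "internal")
  if (!member %in% contents) stop("no such member: ", member)
  # files = restricts extraction to the requested member.
  utils::untar(tgz, files = member, exdir = exdir, tar = "internal")
  list(contents = contents, path = file.path(exdir, member))
}
```
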

- --
Duncan Temple Lang                    duncan at wald.ucdavis.edu
Department of Statistics              work:  (530) 752-4782
4210 Mathematical Sciences Building   fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis,
CA 95616,
USA
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (Darwin)

iD8DBQFFX44l9p/Jzwa2QP4RAh9GAJ9H0HMc8YOQV3OCehf5Zk4GFc9ApACfebXn
j3Jxj57iXe935pXaR2mRA0o=
=HAmN
-----END PGP SIGNATURE-----