[R] gzfile with multiple entries in the archive

Fri Nov 17 16:27:03 CET 2006

Following suggestions from Prof. Ripley and several others to use gzfile,
here's rough code that will unzip a tgz into your working directory and
return a list of the files. (It doesn't warn you that it is overwriting
files!)

The magic numbers refer to the current tar header specification; the block
sizes etc. are arbitrary.

It is inefficient in that it re-reads the file from the start for every
file. I couldn't get the file pointer to stay and change the readBin mode
back from 'character' to 'raw' although the reverse is used! Is there a
setting I've missed?

Also, is there a better way to do the convert(..) function?

All criticisms gratefully received, especially being pointed to an existing
function.

John James
Mango Solutions

unzip <- function(x, archiveDirectory = '.', zipExtension='tgz',
block=50000, maxBlocks=100, maxCountFiles=100) {
	# Example
	# unzip('test.tgz')
	convert <- function(oct= 2, oldRoot=8, newRoot=10) {
		if((newRoot==16))
			return(structure(convert(oct, oldRoot, 10),
class='hexmode'))
		if(newRoot>10)
			return(simpleError('WIP'))
		if(class(oct)=='hexmode') {
			oct <- unclass(oct)
			if(newRoot==10)
				return(oct)
			oldRoot  <- 10
			return(simpleError('WIP'))
		}
		oct <- as.numeric(oct)
		ret <- 0
		oldPower <- 1
		while(oct > 0.1){
			newoct <- floor(oct / newRoot)
			rem <- oct - newoct * newRoot 
			ret <- rem * oldPower + ret
			oldPower <- oldPower * oldRoot
			oct <- newoct
		}
		if(newRoot==16)
			ret <- structure(ret,  class = 'hexmode')
		ret
	}
	listOfFiles <- list()
  	theArchives <- list.files(archiveDirectory, pattern = zipExtension)
  	if(length(grep(x, theArchives))==0)
  		return(simpleError(paste('No archive matching *', x, '*.',
zipExtension, ' found')))
  	what <- paste(archiveDirectory, theArchives[grep(x, theArchives)],
sep=.Platform$file.sep)
  	tmp <- tempfile()
	nextBlockStartsAt <- readUpTo <- countFiles <- mu <- safety <- 0
	zz <- gzfile(what, 'rb')
	ww <- file(tmp, 'wb')
	on.exit(unlink(tmp))
	while(length(mu)>0) {
		if(safety > maxBlocks)	{
			return(simpleError(paste('Archive File too large')))
		}
		safety <- safety + 1
		mu <- readBin(zz, 'raw', block)
		writeBin(mu, ww) 
	}
	close(zz)
	close(ww)
	while(countFiles < maxCountFiles){
	  	countFiles <- countFiles + 1
	  	zz <- file(tmp, 'rb')
	  	stuff <- readBin(zz, 'raw', n=nextBlockStartsAt)
	  	header <- readBin(zz, character(), n=100)
	  	header <- header[nchar(header)>0][c(1,5)]
	  	close(zz)
	  	if(any(is.na(header))) {
	  		break;
	  	}
	  	listOfFiles[[countFiles]] <- header[1]
	  	zz <- file(tmp, 'rb')
	  	body <- readBin(zz, 'raw', n = 512 + nextBlockStartsAt +
convert(header[2]))
	  	writeBin(body[-c(1:(512 + nextBlockStartsAt))], header[1])
		readUpTo <- 512 + nextBlockStartsAt + convert(header[2])
		nextBlockStartsAt <- (readUpTo%/%512 + 1) * 512
	  	close(zz)
	  }
	listOfFiles
}

-----Original Message-----
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] 
Sent: 14 November 2006 15:18
To: John James
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] gzfile with multiple entries in the archive

On Tue, 14 Nov 2006, John James wrote:

> If I open a tgz archive with gzfile and then parse it using readLines I
miss
> the initial line of each member of the archive - and also the name of the
> file although the archive otherwise complete (but useless!).

You can use a gzfile connection to read the underlying .tar file, but that 
is not a text file and you will need to pick its structure apart yourself 
via readBin and readChar.

> Is there any way within R to extract both the list of files in a tgz
archive
> and to extract any one of these files?

> Clearly I can use zcat and tar on Linux, but I need this to work within
the
> R environment on Windows!

You could use tar on Windows: it is in the R tools set.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595