[R] Pass By Value Questions

Matt Shotwell shotwelm at musc.edu
Thu Aug 19 21:04:17 CEST 2010


On Thu, 2010-08-19 at 14:27 -0400, Duncan Murdoch wrote:
> On 19/08/2010 12:57 PM, lists at jdadesign.net wrote:
> > I understand R is a "Pass-By-Value" language. I have a few practical
> > questions, however.
> >
> > I'm dealing with a "large" dataset (~1GB) and so my understanding of the
> > nuances of memory usage in R is becoming important.
> >
> > In an example such as:
> > > d <- read.csv("file.csv");
> > > n <- apply(d, 1, sum);
> > must "d" be copied to another location in memory in order to be used by
> > apply? In general, is copying only done when a variable is updated within
> > a function?
> >   
> 
> Generally R only copies when the variable is modified, but its rules for 
> detecting this are sometimes overly conservative, so you may get some 
> unnecessary copying.  For example,
> 
> d[1,1] <- 3
> 
> will probably not make a full copy of d when the internal version of 
> "[<-" is used, but if you have an R-level version, it probably will.  I 
> forget whether the dataframe method is internal or R level. 
> 
> In the apply(d, 1, sum) example, it would probably make a copy of each 
> row to pass to sum, but never a copy of the whole dataframe/array.
> > Would the following example be any different in terms of memory usage?
> > > d <- read.csv("file.csv");
> > > n <- apply(d[,2:10], 1, sum);
> > or can R reference the original "d" object since no changes to the object
> > are being made?
> >   
> 
> This would make a new object containing d[,2:10], and would pass that to 
> apply.

Since d is a data.frame, subsetting the columns would create a new
data.frame, as Duncan says. However, the columns of the new data.frame
would internally _reference_ the corresponding columns of d until one of
the two is modified. This does not apply to row subsetting: d[2:10,]
would create a new data.frame and copy the relevant data. Nor does it
apply to _any_ subsetting of matrices.
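
A rough way to check this on your own system, assuming R was built with
memory profiling (capabilities("profmem") is TRUE, which is the case for
the CRAN binaries as far as I know). The data.frame below is made up for
illustration. tracemem() returns the address of the object it marks, so
comparing the addresses of a column before and after subsetting hints at
whether the column vector is shared:

```r
set.seed(1)
# Made-up data standing in for the real file
d <- data.frame(a = runif(1e5), b = runif(1e5), c = runif(1e5))

cols <- d[, c("a", "b")]   # column subset
rows <- d[2:10, ]          # row subset

if (capabilities("profmem")) {
  # tracemem() returns a string holding the address of the marked object
  addr <- c(original   = tracemem(d$a),
            col_subset = tracemem(cols$a),
            row_subset = tracemem(rows$a))
  print(addr)
  untracemem(d$a); untracemem(cols$a); untracemem(rows$a)
}
```

If the column subset shares its data with d, the first two addresses
should match, while the row subset should report a different one. The
exact behavior may depend on the R version.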

> > I'm familiar with FF and BigMemory, but are there any packages/tricks
> > which allow for passing such objects by reference without having to code
> > in C?
> >   

It's difficult to determine exactly when data are copied internally by R.
The tracemem function may be used to track when entire objects are
duplicated. However, tracemem would not detect the copying that occurs,
for example, when subsetting the rows of d, because that creates a new
object rather than duplicating d. Otherwise, we can monitor memory usage
with gc() and experiment with code by trial and error.
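
As a small sketch of both tools (the functions f and g are made up for
illustration, and tracemem is only used if the R build supports memory
profiling):

```r
set.seed(1)
x <- runif(1e6)

if (capabilities("profmem")) tracemem(x)

# f modifies its argument, so R has to duplicate x first;
# tracemem reports that duplication when it happens
f <- function(v) { v[1] <- 0; v }
y <- f(x)

# g only reads its argument, so no duplication should be reported
g <- function(v) sum(v)
s <- g(x)

if (capabilities("profmem")) untracemem(x)

# gc() returns a matrix; row "Vcells", column 2 is the memory (in Mb)
# currently used for vector data
mb_before <- gc()["Vcells", 2]
z <- x * 2                      # allocates a new 1e6-element vector
mb_after <- gc()["Vcells", 2]
print(mb_after - mb_before)
```

The gc() difference is only a rough indicator, since other allocations
and collections can happen in between.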

I have had limited success in avoiding duplication by using R
environments. See for example http://biostatmatt.com/archives/663 .
However, this may be more trouble than it's worth.
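
A minimal sketch of the general idea, not necessarily what the link above
does. Environments are not copied when passed to a function, so a
function can modify data stored in one and the caller sees the change.
The names new_store and scale_first_column are hypothetical:

```r
# Wrap a data.frame in an environment (hypothetical helper)
new_store <- function(data) {
  e <- new.env()
  e$data <- data
  e
}

# Modify the stored data in place from the caller's point of view;
# nothing is returned and 'store' itself is never duplicated
scale_first_column <- function(store, factor) {
  store$data[[1]] <- store$data[[1]] * factor
  invisible(NULL)
}

store <- new_store(data.frame(a = 1:3, b = 4:6))
scale_first_column(store, 10)
print(store$data$a)
```

R may still copy the column being modified inside the function, so this
avoids passing copies around rather than all duplication.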

-Matt

> Duncan Murdoch
> 

-- 
Matthew S. Shotwell
Graduate Student 
Division of Biostatistics and Epidemiology
Medical University of South Carolina


