[R] What don't I understand about sample()?

@vi@e@gross m@iii@g oii gm@ii@com @vi@e@gross m@iii@g oii gm@ii@com
Sun Mar 16 02:31:45 CET 2025


Rui,

I just want to discuss this point you made for now:

	2.
	Systematic use of apply(., 1, mean).
	rowMeans (and colMeans) are much faster.

You are, of course, technically correct. I consider an alternative aspect in which you are teaching a programming language as a small set of "commands" that can do quite a bit in various legal combinations. These are often reserved words and primitives like "if" or operators like "<-" and so on.

You then can discuss various built-in functionality that extends your abilities such as an array of built in functions commonly used. I see these in several categories. Some, can be examined in R at the prompt to see what R code produces it. You can learn how it works and perhaps cannibalize from it to make your own code that uses helpful parts and perhaps avoids functionality you do not need.

The other kinds are relatively hidden as .Internal or a function that after someprocessing then calls .Internal(...) and can be code written in various (perhaps usually compiled) languages such as a version of C,  or perhaps pulled from libraries such as FORTRAN mathematical functions. These are sometimes faster partially because they are not being read and interpreted, or perhaps use cute programming tricks.

So, as I see it, whatever teacher or book or course is being used by the OP, they are supposedly learning ways to approach problems from some level in which they want them to learn about ideas. The goal is not just learning how to do one problem, but to add to your tool bag for additional problems. And, in R, some of those ideas revolve in how to do implicit loops and in some sense do vectorized operations. 

So, if I understood it, the assignment included getting some number of equal-sized vectors and instead of working on them one at a time, assemble them into a collection of vectors. Besides generalized lists, R supports dataframes as a list of vectors (or actually other things) and matrices implemented as vectors with attributes and some other structures.

The assignment seems to ask them to assemble the vectors into rows of a matrix object and then use something that operates on one row at a time, as already discussed. Clearly, someone has already made implementations that allow you to avoid directly using apply() and get the same, faster result, using rowMeans. But note that although there are perhaps additional rowwise operations available this way, such as rowSums, in the general case, there are many possible functions not already supplied.

If I want something like a rowMinMax, or something exotic where I consider the row to contain the coefficients of a polynomial to evaluate for some given value for X, then a less efficient solution can easily be put together using something like apply, or if using other data structures, using cousins like lapply/sapply/vapply and quite a few other methods in packages including functions like pmap and interesting methods in dplyr such as rowwise that allow per row calculations easily.

There are usually many ways you can do something in R, as in many languages, and sometimes a highly efficient solution for some situations is not the best or most efficient. An example might be how some forms of looping are better suited when the number of items is not known in advance. Often a compiled language will optimize away a loop over a fixed small number of items like asking for index going from 1 to 2 to calculate the squares of the index and storing them in an array of two. It may get replaced without a loop and calling a function twice, or even doing the calculation at compile them and inserting a data structure with the results and not doing anything major at run time.

In this case, the OP may actually have a small fixed number of such results and possibly there are faster ways to do the calculation. But, if the point is to learn how to do arbitrary calculations when the magnitude of the parts may vary, then this is reasonable. Of course, if the samples being used are not all the same length, this method can fail badly while other methods may still work well.

I note relatedly, that one of us here has expressed a wish to do some of their code as near to using base R as possible rather than jump right away into the tidyverse. I can respect that even as, I, personally, prefer using whatever works for me faster and easier. At least, that is true when I am prototyping or doing something one-off. If the code turns out to be used a lot and is slow, I might reconsider. But, having said that, I suspect some packages may allow faster code than the basics. Of course, things like the built-in native pipe |> can change things so an altered package using a slower pipe may no longer ...

-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Rui Barradas
Sent: Saturday, March 15, 2025 5:28 PM
To: Kevin Zembower <kevin using zembower.org>; r-help using r-project.org
Subject: Re: [R] What don't I understand about sample()?

Hello,

I have been following this thread and though answers have been given, 
some of them address R coding in general, not necessarily the sample() 
function. Here are some random notes I think the OP could use, prompted 
by the text linked to, chap3.pdf.

1.
Throughout the text, assignments use the equal sign instead of the left 
arrow. The left arrow is generally considered more idiomatic and there 
is an important diference beteewn he wo, see help("assignOps").


time.mean = with(CommuteAtlanta, mean(Time))
B = 1000
n = nrow(CommuteAtlanta)

# This should be used, not the above.
time.mean <- with(CommuteAtlanta, mean(Time))
B <- 1000
n <- nrow(CommuteAtlanta)


2.
Systematic use of apply(., 1, mean).
rowMeans (and colMeans) are much faster.


boot.statistics <- apply(boot.samples, 1, mean)
boot.statistics <- rowMeans(boot.samples)


3.
The first confidence interval computation seems awkward. I had never 
seen this way of computing a CI95.
Moreover, it's plain common sense to keep the results with the returned 
decimals and round for display purposes only.
And the normal intervals are computed in a more usual way later in the 
text, see sections 1.2 and 1.3.


me <- ceiling(10 * 2 * time.se)/10
round(time.mean, 1) + c(-1, 1) * me

# Straightforward.
normal_ci95 <- time.mean + c(-1, 1) * 2 * time.se
normal_ci95
round(normal_ci95, 1)

# section 1.2 , function boot.mean
interval = mean(x) + c(-1,1)*2*se

# section 1.3
with(students, mean(Height) + c(-1, 1) * 2 * sd(result))



4.
In section 1.2 there is a bootstrap function boot.mean().
The function could be improved to let users pass a conf.level of their 
choice.
And why force the function user to always have the plot displayed? Base 
functions hist() and barplot() have na argument 'plot' with default TRUE 
allowing the user to choose.

The following seems more user friendly.


boot.mean <- function(x, B, binwidth = NULL, conf.level = 0.95, plot = 
TRUE) {
   require(ggplot2)
   n <- length(x)
   boot.samples <- matrix( sample(x, size = n*B, replace = TRUE), B, n)
   boot.statistics <- rowMeans(boot.samples)
   se <- sd(boot.statistics)
   if ( is.null(binwidth) )
     binwidth <- diff(range(boot.statistics))/30

   p <- ggplot(data.frame(x = boot.statistics), aes(x = x)) +
     geom_histogram(aes(y = ..density..),binwidth = binwidth) +
     geom_density(color = "red")

   alpha <- 1 - (1 - conf.level)/2
   interval <- mean(x) + c(-1, 1) * qnorm(alpha) * se

   if(plot) {
     plot(p)
   }

   list(boot.statistics = boot.statistics, interval = interval, se = se, 
plot = p)
}


Hope this helps,

Rui Barradas



Às 17:28 de 15/03/2025, Kevin Zembower via R-help escreveu:
> Hi, Richard, thanks for replying. I should have mentioned the third
> edition, which we're using. The data file didn't change between the
> second and third editions, and the data on Body Mass Gain was the same
> as in the first edition, although the first edition data file contained
> additional variables.
> 
> According to my text, the BMGain was measured in grams. Thanks for
> pointing out that my statement of the problem lacked crucial
> information.
> 
> The matrix in my example comes from an example in
> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
> created a bootstrap example with a matrix that consisted of one row for
> every sample in the bootstrap, and one column for each mean in the
> original data. This allowed him to find the mean for each row to create
> the bootstrap statistics.
> 
> The only need for the tidyverse is to use the read_csv() function. I'm
> regrettably lazy in not determining which of the multiple functions in
> the tidyverse library loads read_csv(), and just using that one.
> 
> Thanks, again, for helping me to further understand R and this problem.
> 
> -Kevin
> 
> On Sat, 2025-03-15 at 12:00 +0100, r-help-request using r-project.org wrote:
>> Not having the book (and which of the three editions are you using?),
>> I downloaded the data and played with it for a bit.
>> dotchart() showed the Dark and Light conditions looked quite
>> different, but also showed that there are not very many cases.
>> After trying t.test, it occurred to me that I did not know whether
>> "BMGain" means gain in *grams* or gain in *percent*.
>> Reflection told me that for a growth experiment, percent made more
>> sense, which reminded my of one of my first
>> student advising experiences, where I said "never give the computer
>> percentages; let IT calculate the percentages
>> from the baseline and outcome, because once you've thrown away
>> information, the computer can't magically get it back."
>> In particular, in the real world I'd be worried about the possibility
>> that there was some confounding going on, so I would
>> much rather have initial weight and final weight as variables.
>> If BMGain is an absolute measure, the p value for a t test is teeny
>> tiny.
>> If BMGain is a percentage, the p value for a sensible t test is about
>> 0.03.
>>
>> A permutation test went like this.
>> is.light <- d$Group == "Light"
>> is.dark <- d$Group == "Dark"
>> score <- function (g) mean(g[is.light]) - mean(g[is.dark])
>> base.score <- score(d$BMGain)
>> perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
>> sum(perm.scores >= base.score) / length(perm.scores)
>>
>> I don't actually see where matrix() comes into it, still less
>> anything
>> in the tidyverse.
>>
> 
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Este e-mail foi analisado pelo software antivírus AVG para verificar a presença de vírus.
www.avg.com

______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list