[R] What don't I understand about sample()?
Ebert,Timothy Aaron
tebert @end|ng |rom u||@edu
Sat Mar 15 20:51:19 CET 2025
Brian Manly wrote a nice book about computer intensive methods in data analysis: https://www.amazon.com/Randomization-Bootstrap-Methods-Biology-Statistical/dp/1584885416. Therein there is a distinction between permutation tests and randomization tests. A permutation test uses every permutation while a randomization test uses a random selection of all permutations. In this case there are exactly 18 choose 2 possible permutations, or 18!/(10!*8!) = 43758. With such a small number, a permutation test is better. The permutation test is exact while the randomization test is approximate (you get a different quantitative answer each time you run the test). At around one million permutations is where I switch from permutation test to randomization test. A bootstrap test takes a random sample of the available data with replacement. The same observation can appear multiple times in each sample. This reduces the influence of rare events.
My guess is that the variable BMGain is body mass gain, and the units should be grams.
There are many ways to program this. A matrix is one approach. The matrix is filled with possible orderings of the data and two groups are formed by splitting the matrix. There are many ways to do this if you just want a number, but plotting the distribution of possible answers lets you see some of the mechanics of this process.
df <- read_csv("https://www.lock5stat.com/datasets3e/LightatNight.csv")
# Generate unique group assignments
unique_partitions <- t(combn(18, 10))
# Compute mean differences for unique partitions
diffs <- apply(unique_partitions, 1, function(idx) {
group_A <- df$BMGain[idx] # Values assigned to group A
group_B <- df$BMGain[-idx] # Remaining values assigned to group B
abs(mean(group_A) - mean(group_B)) # Difference in means
})
# Convert to data frame for plotting
diff_df <- data.frame(Difference = diffs)
# Calculate observed difference in original data
obs_diff <- abs(mean(df$BMGain[df$Group == "Light"]) - mean(df$BMGain[df$Group == "Dark"]))
num_below <- sum(diff_df$Difference <= obs_diff)
num_above <- sum(diff_df$Difference > obs_diff)
# Compute p-value (proportion of random diffs as extreme as observed)
p_value <- mean(abs(diffs) >= abs(obs_diff))
# Plot histogram with vline for observed difference
p <- ggplot(diff_df, aes(x = Difference)) +
geom_histogram(binwidth = 0.05, fill = "blue", alpha = 0.5, color = "black") +
geom_vline(xintercept = obs_diff, color = "red", linetype = "dashed", size = 1) +
theme_minimal() +
ggtitle("Distribution of Mean Differences") +
xlab("Mean Difference") +
ylab("Frequency")
# Extract histogram bin data
hist_data <- ggplot_build(p)$data[[1]]
max_y <- max(hist_data$count) # Get the max count from the histogram
# Add annotation with dynamic max_y
p +
annotate("text", x = obs_diff - 0.1, y = max_y -(max_y*.17),
label = paste("Below:", num_below),
hjust = 1, color = "blue", fontface = "bold") +
annotate("text", x = obs_diff + 0.1, y = max_y -(max_y*.17),
label = paste("Above:", num_above),
hjust = 0, color = "blue", fontface = "bold") +
annotate("text", x = obs_diff, y = max_y -(max_y*.1),
label = paste("Observed Diff =", round(obs_diff, 2)),
vjust = 0, color = "red", fontface = "bold")
# Print observed difference and p-value
cat("Observed Difference:", obs_diff, "\n")
cat("P-value:", p_value, "\n")
Note that unique_partitions can easily exceed the memory capacity of your computer. In that case a randomization test will be more useful. Note that the set seed part is commented out, and each run of the program will give a slightly different outcome.
library(ggplot2)
# Set number of randomizations
rand_number <- 1000 # Adjust as needed
# Create dataset
df <- read_csv("https://www.lock5stat.com/datasets3e/LightatNight.csv")
# Compute observed difference in original data
obs_diff <- abs(mean(df$BMGain[df$Group == "Light"]) - mean(df$BMGain[df$Group == "Dark"]))
# Perform randomization test
#set.seed(123) # For reproducibility in coding, Do Not Use in analyses
rand_diffs <- numeric(rand_number)
for (i in 1:rand_number) {
shuffled <- sample(df$BMGain) # Shuffle values
group_A <- shuffled[1:10] # First 10 as Light group
group_B <- shuffled[11:18] # Remaining 8 as Dark group
rand_diffs[i] <- abs(mean(group_A) - mean(group_B))
}
# Convert to data frame for plotting
diff_df <- data.frame(Difference = rand_diffs)
# Count how many are above/below observed difference
num_below <- sum(rand_diffs <= obs_diff)
num_above <- sum(rand_diffs > obs_diff)
# Compute p-value (proportion as extreme as observed)
p_value <- mean(abs(rand_diffs) >= abs(obs_diff))
# Create the base plot
p <- ggplot(diff_df, aes(x = Difference)) +
geom_histogram(binwidth = 0.05, fill = "blue", alpha = 0.5, color = "black") +
geom_vline(xintercept = obs_diff, color = "red", linetype = "dashed", size = 1) +
theme_minimal() +
ggtitle("Randomization Test: Distribution of Mean Differences") +
xlab("Mean Difference") +
ylab("Frequency")
# Extract histogram data for dynamic text placement
hist_data <- ggplot_build(p)$data[[1]]
max_y <- max(hist_data$count) # Highest bar in the histogram
# Annotate counts on either side of vline
p +
annotate("text", x = obs_diff - 0.1, y = max_y * 0.8,
label = paste("Below:", num_below),
hjust = 1, color = "blue", fontface = "bold") +
annotate("text", x = obs_diff + 0.1, y = max_y * 0.8,
label = paste("Above:", num_above),
hjust = 0, color = "blue", fontface = "bold") +
annotate("text", x = obs_diff, y = max_y * 0.9,
label = paste("Observed Diff =", round(obs_diff, 2)),
vjust = 0, color = "red", fontface = "bold")
# Print observed difference and p-value
cat("Observed Difference:", obs_diff, "\n")
cat("P-value:", p_value, "\n")
Regards,
Tim
-----Original Message-----
From: R-help <r-help-bounces using r-project.org> On Behalf Of Richard O'Keefe
Sent: Saturday, March 15, 2025 4:27 AM
To: Kevin Zembower <kevin using zembower.org>
Cc: r-help using r-project.org
Subject: Re: [R] What don't I understand about sample()?
[External Email]
Not having the book (and which of the three editions are you using?), I downloaded the data and played with it for a bit.
dotchart() showed the Dark and Light conditions looked quite different, but also showed that there are not very many cases.
After trying t.test, it occurred to me that I did not know whether "BMGain" means gain in *grams* or gain in *percent*.
Reflection told me that for a growth experiment, percent made more sense, which reminded my of one of my first student advising experiences, where I said "never give the computer percentages; let IT calculate the percentages from the baseline and outcome, because once you've thrown away information, the computer can't magically get it back."
In particular, in the real world I'd be worried about the possibility that there was some confounding going on, so I would much rather have initial weight and final weight as variables.
If BMGain is an absolute measure, the p value for a t test is teeny tiny.
If BMGain is a percentage, the p value for a sensible t test is about 0.03.
A permutation test went like this.
is.light <- d$Group == "Light"
is.dark <- d$Group == "Dark"
score <- function (g) mean(g[is.light]) - mean(g[is.dark]) base.score <- score(d$BMGain) perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain))) sum(perm.scores >= base.score) / length(perm.scores)
I don't actually see where matrix() comes into it, still less anything in the tidyverse.
On Sat, 15 Mar 2025 at 07:52, Kevin Zembower via R-help <r-help using r-project.org> wrote:
>
> Thank you all, very much, for your kind and detailed explanations. I
> didn't understand, mainly, that the matrix() call only called its
> parameters once. I was certain that this was a bug with sample()
> getting seeded with a constant value, and giving the same permutation.
>
> I think I need to make my MWE a little less minimal to continue
> learning. If you're familiar with the Lock5 statistics textbook, I'm
> working on the Light and Dark mice example, where groups of mice were
> exposed or not to light at night, then measured for weight gain. The
> statistic is mean difference in weight gain between the two groups.
>
> My understanding of how I'm supposed to construct a randomized
> distribution is to join the weight gains of the 10 mice exposed to
> light at night to the 8 mice not exposed to light at night. After
> shuffling this data, I arbitrarily group the first 10 values into the
> 'light' group, and the last 8 into the 'dark' group, and find the
> difference in their means.
>
> I think I can do this correctly with:
> ===================
> ## Less-minimal working example
> library(tidyverse)
>
> library(Lock5Data)
> data(LightatNight)
> str(LightatNight)
>
> ## Or, if you don't have the Lock5Data library:
> (d <-
> read_csv("https://nam10.safelinks.protection.outlook.com/?url=https%3A
> %2F%2Fwww.lock5stat.com%2Fdatasets3e%2FLigthtatNight.csv&data=05%7C02%
> 7Ctebert%40ufl.edu%7Cd58e9cb0dd1d493dfd3708dd639b4e26%7C0d4da0f84a314d
> 76ace60a62331e1b84%7C0%7C0%7C638776241003988479%7CUnknown%7CTWFpbGZsb3
> d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoi
> TWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=xICk3ewMwgDmCgvFq52yCaplry
> 5qZseKtrp2yBgIO%2FY%3D&reserved=0"))
>
> (lt <- d$BMGain[d$Group == "Light"])
> (dk <- d$BMGain[d$Group == "Dark"])
> (n_lt <- length(lt))
> (n_dk <- length(dk))
>
> (data <- c(lt, dk))
>
> B <- 10 #Will be 1000
> n <- length(data)
>
> random.samples <- matrix(NA, B, n)
> random.statistics <- rep(NA, B)
>
> for(i in 1:B) {
> random.samples[i,] <- sample(data)
> random.statistics[i] <- mean(random.samples[i, 1:n_lt]) -
> mean(random.samples[i, (n_lt + 1):(n_lt + n_dk)]) }
> random.samples random.statistics
>
> ## Trying to do it without a for(), using Peter's suggestion:
> (random.samples <- matrix(replicate(B, sample(data)), B, n,
> byrow=TRUE))
> compute.diff.means <- function(x) {
> return(mean(x[1:n_lt]) - mean(x[(n_lt+1):(n_lt+n_dk)])) }
> (random.statistics <- apply(random.samples, 1, compute.diff.means))
> =======================
>
> I think both of these methods give me the data I'm trying for. Any
> suggestions on my R coding techniques are welcome.
>
> Thank you all, again, for taking the time and effort to help me. Your
> help is greatly appreciated.
>
> -Kevin
>
> On Thu, 2025-03-13 at 17:00 -0400, Kevin Zembower wrote:
> > Hello, all,
> >
> > I'm learning to do randomized distributions in my Stats 101 class*.
> > I thought I could do it with a call to sample() inside a matrix(),
> > like:
> >
> > > matrix(sample(1:10, replace=TRUE), 5, 10, byrow=TRUE)
> > [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> > [1,] 8 2 3 1 8 2 8 8 9 8
> > [2,] 8 2 3 1 8 2 8 8 9 8
> > [3,] 8 2 3 1 8 2 8 8 9 8
> > [4,] 8 2 3 1 8 2 8 8 9 8
> > [5,] 8 2 3 1 8 2 8 8 9 8
> > >
> >
> > Imagine my surprise to learn that all the rows were the same
> > permutation. I thought each time sample() was called inside the
> > matrix, it would generate a different permutation.
> >
> > I modeled this after the bootstrap sample techniques in
> > https://pa/
> > ges.stat.wisc.edu%2F~larget%2Fstat302%2Fchap3.pdf&data=05%7C02%7Cteb
> > ert%40ufl.edu%7Cd58e9cb0dd1d493dfd3708dd639b4e26%7C0d4da0f84a314d76ace60a62331e1b84%7C0%7C0%7C638776241004291461%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=mTfoalDp1qnXCWjy4EaThFXMRC7b2HyhGV75nrI01sw%3D&reserved=0. I don't understand why it works in bootstrap samples (with replace=TRUE), but not in randomized distributions (with replace=FALSE).
> >
> > Thanks for any insight you can share with me, and any suggestions
> > for getting rows in a matrix with different permutations.
> >
> > -Kevin
> >
> > *No, this isn't a homework problem. We're using Lock5 as the text in
> > class, along with its StatKey web application. I'm just trying to
> > get more out of the class by also solving our problems using R, for
> > which I'm not receiving any class credit.
>
>
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat/
> .ethz.ch%2Fmailman%2Flistinfo%2Fr-help&data=05%7C02%7Ctebert%40ufl.edu
> %7Cd58e9cb0dd1d493dfd3708dd639b4e26%7C0d4da0f84a314d76ace60a62331e1b84
> %7C0%7C0%7C638776241004299940%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGki
> OnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ
> %3D%3D%7C0%7C%7C%7C&sdata=8U3Tea5HC8uDM4G%2FAKsE%2BStdoeMB7Kd%2BCgM3v6
> DG5XI%3D&reserved=0 PLEASE do read the posting guide
> https://www/.
> r-project.org%2Fposting-guide.html&data=05%7C02%7Ctebert%40ufl.edu%7Cd
> 58e9cb0dd1d493dfd3708dd639b4e26%7C0d4da0f84a314d76ace60a62331e1b84%7C0
> %7C0%7C638776241004308034%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRy
> dWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%
> 3D%7C0%7C%7C%7C&sdata=yeQqG29qVGgeTEEmxIG6e879bRn0sBMJjLEFwfYWOxk%3D&r
> eserved=0 and provide commented, minimal, self-contained, reproducible
> code.
______________________________________________
R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.r-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list