[R] What don't I understand about sample()?

Richard O'Keefe
Sun Mar 16 05:32:09 CET 2025


Rgui 4.4.3 on Windows.  When I start it up, read.csv is just *there*.
I don't need to load any package to get it.
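
The reason is that read.csv lives in the utils package, which R
attaches at startup by default.  A minimal check, assuming a default
startup (nothing has changed the defaultPackages option):

```r
# Which namespace defines read.csv?
home <- environmentName(environment(read.csv))
print(home)

# Is utils on the search path in this session?
utils_attached <- "package:utils" %in% search()
print(utils_attached)

# Packages R attaches at startup unless configured otherwise
print(getOption("defaultPackages"))
```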

I have three reasons for being very sparing in the packages I use.
1. It took me long enough to get my head around R.  More packages =
more things to learn.  I *still* have major trouble grasping
tidyverse, and as far as I can see it doesn't solve any problem that
*I* have.  I install a package only when I have a specific need for
something it does, like spatial statistics.  (And yet I have hundreds
of packages installed, because packages depend on other packages.)
2. Everything changes, and they don't all change coherently.  A
package I've used for years may not be available in the next release.
This is not a theoretical possibility; it has happened to me often.
"If I don't use it I can't lose it."  Sometimes things break because
something else on the system (tcl/tk, or the C or Fortran compiler)
has changed.  I'm tired of things breaking because the C or Fortran compiler
is now stricter.
3. The universe of R packages is vast and constantly expanding.  This
makes it *impossible* for anyone to test every possible combination.  I
used to teach software engineering, and we had a slogan "if it isn't
tested it doesn't work".  Base R plus package X?  Probably tested.
Base R plus package Y?  Probably tested.  Base R plus X plus Y?
Not unless X requires Y or Y requires X.

There is also the didactic point that the more you work with base R,
the better you will understand it, and you need that understanding to
make sense of other things like the tidyverse.  It's like mastering
the alphabet before you learn shorthand.


On Sun, 16 Mar 2025 at 06:55, <avi.e.gross using gmail.com> wrote:
>
> Kevin & Richard, and of course everyone,
>
> As the main topic here is not the tidyverse, I will mention the perils of loading in more than needed in general.
>
> If you want to use one or a very few functions, it can be more efficient and safer to load exactly what is needed. In the case of wanting to use read_csv(), I think this suffices:
>
> library(readr)
>
> If you instead use:
>
> library(tidyverse)
>
> You load a varying number of packages (the set may change), including some like lubridate or forcats or ggplot2 that you may not even be thinking of using or may never have heard of.
>
> The bigger problem is shadowing that happens. For example, you may be getting warning messages like:
>
> ✖ dplyr::filter() masks stats::filter()
> ✖ dplyr::lag()    masks stats::lag()
>
> This can interfere with some other package you had already loaded, unless its code uses a notation like mypackage::filter(...) to avoid being easily replaced. Even then, if you yourself called what you thought was filter() from base R or some package, you have a problem unless you invoke it like stats::filter(...).
>
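
The masking effect can be reproduced without installing anything.  In
this sketch a user-defined filter() stands in for dplyr::filter() (the
stand-in function is made up for illustration).  The point is only
that a bare filter() resolves to whatever comes first on the search
path, while stats::filter() always reaches the time-series function.

```r
x <- c(1, 2, 3, 4, 5)

# Before masking: stats::filter() is a linear (moving-average) filter
ma_before <- as.numeric(stats::filter(x, rep(1/3, 3)))

# Hypothetical stand-in for a package function with the same name;
# defined in the global environment, it masks stats::filter()
filter <- function(x, keep) x[keep]

masked   <- filter(x, x > 2)   # now calls the stand-in: 3 4 5
ma_after <- as.numeric(stats::filter(x, rep(1/3, 3)))  # unaffected

print(masked)
print(identical(ma_before, ma_after))

rm(filter)                     # remove the mask again
```
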
> The order in which packages like this load can matter, as can the point at which you define a function of your own. So, it may be worth some effort to zoom in and call exactly what you need and only when you need it. I have seen code that only needs a package in rare conditions and only loads the package in one branch of an IF statement right before using it.
>
> Packages can also be unloaded after use.
>
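
Loading a package only in the branch that needs it, and detaching it
afterwards, might look like the sketch below.  It uses splines
(shipped with R) purely as a stand-in for whatever package is rarely
needed; the flag need_spline is made up for illustration.

```r
need_spline <- TRUE   # hypothetical condition

was_attached <- "package:splines" %in% search()

if (need_spline && requireNamespace("splines", quietly = TRUE)) {
  # Option 1: never attach; call through the namespace
  basis <- splines::bs(1:10, df = 4)

  # Option 2: attach just for this branch, then detach again
  library(splines)
  basis2 <- bs(1:10, df = 4)
  if (!was_attached) detach("package:splines", unload = TRUE)
}

print(dim(basis))
print("package:splines" %in% search())
```

Option 1 avoids touching the search path at all, so nothing can be
masked by it.
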
> From what you describe, none of this is crucially important as you are using R for your own purposes in your own R Markdown file that you may not be distributing. And, when I write programs where I keep adjusting and adding things from the tidyverse, it is indeed much easier to just load the whole collection at the top and forget about it. That is, until I decide to do something with functional programming that uses reduce/filter/map... and have an odd error!
>
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Kevin Zembower via R-help
> Sent: Saturday, March 15, 2025 1:29 PM
> To: r-help using r-project.org
> Subject: Re: [R] What don't I understand about sample()?
>
> Hi, Richard, thanks for replying. I should have mentioned the third
> edition, which we're using. The data file didn't change between the
> second and third editions, and the data on Body Mass Gain was the same
> as in the first edition, although the first edition data file contained
> additional variables.
>
> According to my text, the BMGain was measured in grams. Thanks for
> pointing out that my statement of the problem lacked crucial
> information.
>
> The matrix in my example comes from an example in
> https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
> created a bootstrap example with a matrix that consisted of one row for
> every sample in the bootstrap, and one column for each observation in the
> original data. This allowed him to find the mean for each row to create
> the bootstrap statistics.
>
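
Read that way, the layout is B rows by n columns, with each row
holding one resample drawn with replacement, so the row means give the
bootstrap distribution.  A sketch on made-up numbers (the vector x, B,
and the seed are illustrative, not the textbook data, and this is an
interpretation of the description rather than the code from the PDF):

```r
set.seed(1)                       # arbitrary seed, for repeatability only
x <- c(10, 12, 9, 11, 17, 14, 8)  # made-up sample, not the BMGain data
n <- length(x)
B <- 1000                         # number of bootstrap resamples

# One row per bootstrap sample, one column per drawn observation
boot <- matrix(sample(x, B * n, replace = TRUE), nrow = B, ncol = n)

boot.means <- rowMeans(boot)      # one bootstrap statistic per row
print(dim(boot))
print(quantile(boot.means, c(0.025, 0.975)))  # percentile interval
```
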
> The only need for the tidyverse is to use the read_csv() function. I'm
> regrettably lazy in not determining which of the multiple packages in
> the tidyverse provides read_csv(), and just loading that one.
>
> Thanks, again, for helping me to further understand R and this problem.
>
> -Kevin
>
> On Sat, 2025-03-15 at 12:00 +0100, r-help-request using r-project.org wrote:
> > Not having the book (and which of the three editions are you using?),
> > I downloaded the data and played with it for a bit.
> > dotchart() showed the Dark and Light conditions looked quite
> > different, but also showed that there are not very many cases.
> > After trying t.test, it occurred to me that I did not know whether
> > "BMGain" means gain in *grams* or gain in *percent*.
> > Reflection told me that for a growth experiment, percent made more
> > sense, which reminded me of one of my first
> > student advising experiences, where I said "never give the computer
> > percentages; let IT calculate the percentages
> > from the baseline and outcome, because once you've thrown away
> > information, the computer can't magically get it back."
> > In particular, in the real world I'd be worried about the possibility
> > that there was some confounding going on, so I would
> > much rather have initial weight and final weight as variables.
> > If BMGain is an absolute measure, the p value for a t test is teeny
> > tiny.
> > If BMGain is a percentage, the p value for a sensible t test is about
> > 0.03.
> >
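> > A permutation test went like this.
> > # d is the data frame read from the downloaded data file
> > is.light <- d$Group == "Light"
> > is.dark <- d$Group == "Dark"
> > # test statistic: difference in group means, with the labels held fixed
> > score <- function (g) mean(g[is.light]) - mean(g[is.dark])
> > base.score <- score(d$BMGain)
> > # sample() with no size argument returns a random permutation of its input
> > perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
> > # one-sided p value: share of permutations scoring at least the observed value
> > sum(perm.scores >= base.score) / length(perm.scores)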
> >
> > I don't actually see where matrix() comes into it, still less
> > anything
> > in the tidyverse.
> >
>
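
For anyone without the data file, the permutation test quoted above
can be run end to end on simulated data.  Everything in the data frame
below is invented (group sizes, means, spread), so the resulting p
value says nothing about the real experiment.  The + 1 in numerator
and denominator is a common adjustment that counts the observed
arrangement as one of the permutations; the quoted code does not do
this.

```r
set.seed(42)  # arbitrary seed, for repeatability only

# Invented data: NOT the book's data set
d <- data.frame(
  Group  = rep(c("Light", "Dark"), times = c(9, 8)),
  BMGain = c(rnorm(9, mean = 11, sd = 2.5), rnorm(8, mean = 6, sd = 2.5))
)

is.light <- d$Group == "Light"
is.dark  <- d$Group == "Dark"
score <- function(g) mean(g[is.light]) - mean(g[is.dark])

base.score <- score(d$BMGain)

# sample() on a vector of length > 1 returns a random permutation of it
perm.scores <- vapply(1:999, function(i) score(sample(d$BMGain)), numeric(1))

p.value <- (sum(perm.scores >= base.score) + 1) / (length(perm.scores) + 1)
print(p.value)
```
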
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


