[R] Why I wrote my MWE the way I did. [WAS:] Re: What don't I understand about sample()?
Kevin Zembower
kev|n @end|ng |rom zembower@org
Mon Mar 17 15:39:28 CET 2025
Hello, all. Thanks again for the detailed comments and suggestions.
This is one reason I really enjoy this group: the lively and
knowledgeable discussions that questions generate. I'm a little
worried that a future reader, just skimming the subject lines, will
miss the true breadth of this discussion.
I'd like to clarify my use of the tidyverse library. I used it so that
I could use read_csv(). I was under the mistaken impression that
read.csv() would not fetch a file from the internet using an
'https://...' URL, and that only read_csv() would do that. I'm pretty sure
that at some point in the 15 years that I've been aware of and using R,
read.csv() would not do that. I didn't do a lot with R during the time
that the tidyverse was developed and became popular; I had to learn it
fresh just a few years ago, when I kind of came back to R.
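
A minimal sketch of the read.csv() point, for anyone who wants to check
it: the 'file' argument of read.csv() may be a complete URL as well as a
local path. So that it runs offline, the sketch writes a small made-up
CSV to a temporary file; the values are invented for illustration, and
the commented-out address is a placeholder, not the data set discussed
in this thread.

```r
# Write a tiny, made-up CSV to a temporary file (illustrative values only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Group,BMGain", "Light,9.9", "Dark,4.1"), tmp)

# Base R reads it with no extra packages loaded.
d <- read.csv(tmp)
str(d)

# The same call accepts a URL string in place of the path
# (placeholder address, shown for shape only):
# d <- read.csv("https://example.org/some-file.csv")
```
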
In this project, I was doing a 'data exploration' of sorts. I wasn't
concerned with optimizing anything but getting a correct answer. I
didn't explore whether other functions would also fetch internet files,
and I didn't compare the execution speeds of apply() versus rowMeans()
(although, since I didn't know about rowMeans(), I'm glad Tim mentioned
it; I'll be sure to file this away for the next time it comes up).
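
For the record, a small sketch of the apply() versus rowMeans()
comparison, using a made-up matrix rather than my actual data. Both
calls give the mean of each row; rowMeans() does it in one vectorised
call instead of calling mean() once per row.

```r
set.seed(1)                        # arbitrary seed, only for repeatability
m <- matrix(rnorm(20), nrow = 4)   # made-up 4 x 5 matrix

by_apply    <- apply(m, 1, mean)   # one call to mean() per row
by_rowMeans <- rowMeans(m)         # single vectorised call

# Check that the two approaches agree (up to floating-point tolerance).
agree <- isTRUE(all.equal(by_apply, by_rowMeans))
print(agree)
```
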
Almost all respondents to my original question about sample() pointed
out my example didn't use the 'size=' parameter. I constructed my first
MWE (beyond the first one-line snippet I originally posted) to explain
how I couldn't just use 'size' to fill a matrix, like the bootstrap
model I was working from, because I needed permutations of the original
dataset. (I had never heard of a permutation test. I think that's
beyond the scope of Stats101.) I wanted to make sure that anyone who
wished to participate in this discussion could do it in the easiest way
possible, without loading a library that they would otherwise have no
use for.
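
To make the 'size=' point concrete, here is a rough sketch of the
distinction as I understand it, with made-up numbers standing in for
the real data. The names x, B, boot and perm are only for illustration.
A bootstrap matrix can be filled by one call to sample() with
replacement, but a permutation matrix needs a fresh reshuffle of the
whole data set for every row.

```r
set.seed(42)                # arbitrary seed, only for repeatability
x <- c(3, 5, 8, 13, 21)     # made-up data
B <- 4                      # number of resamples (kept tiny on purpose)
n <- length(x)

# Bootstrap: sampling WITH replacement, so one call with size = n * B
# can fill the whole matrix. (Without replace = TRUE, asking for more
# than length(x) values is an error.)
boot <- matrix(sample(x, size = n * B, replace = TRUE), nrow = B)

# Permutation: each row must be a reshuffle of all of x, so sample(x)
# is called once per row. replicate() returns one permutation per
# column, hence the t().
perm <- t(replicate(B, sample(x)))

# Every row of perm holds exactly the original values, in some order.
rows_are_permutations <- all(apply(perm, 1, function(r) identical(sort(r), sort(x))))
print(rows_are_permutations)
```
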
I asked for any suggestions for my R coding style, and I appreciate all
the respondents who went way above the call and researched the sources
I was working from and made suggestions and improvements. I'm still
reading through these to fully understand them, but I'm very grateful
that you took the time to try to help me.
Thank you all, again, for your efforts, and sharing your knowledge and
experience with all of us on this list.
-Kevin
On Sun, 2025-03-16 at 16:04 -0700, Jeff Newmiller wrote:
> The original question was about sample, a base R function. Dragging
> in tidyverse along the way could be regarded as complicating the
> question unnecessarily, but in some cases there can be undesirable or
> simply unexpected interactions between functions drawn from different
> packages. Such complications can turn out to be intrinsic to the
> question being posed, in which case it will be necessary to have
> things in their example just as they are in the original environment.
> In this case that does not seem to apply... and OP may get
> fewer responses to their question because some people don't keep
> tidyverse installed and may not want to add it just to answer a
> question. In some cases no one may
> respond, and OP would be left with no help.
>
> In this case it all turned out fine, so this debate is getting
> stale, and there are reasons why including or excluding tidyverse
> might have been better. But in general, building a true Minimum
> Reproducible Example (MRE) will help communicate most clearly
> (consider using the reprex package to verify the example) and
> minimizing unnecessary packages (reprex can help pare things down)
> may avoid the dreaded "crickets" on the mailing list in the future.
> And sometimes building an MRE will help OP answer their own question.
>
> On March 16, 2025 12:52:07 PM PDT, avi.e.gross using gmail.com wrote:
> > Thanks for the clarification, Richard, as I clearly made the wrong
> > guess of what you meant.
> >
> > Your idea or objection was that you see the included read.csv
> > function as adequate and see no incentive to use read_csv, and
> > especially not if that is the only function being used. I only
> > partially agree.
> >
> > As usual, I look at things from multiple overlapping perspectives.
> >
> > There are actually more ways to read in a CSV or other such data
> > files including fread from data.table and another called feather
> > and other base functions. Some people choose ONE and use it
> > whenever possible and your choice might be the base version and
> > mine would not be.
> >
> > So, one perspective is that the base version is in some sense
> > preloaded and any other must be pre-downloaded and added with a
> > library statement. I am not sure how much that costs or if the base
> > version is also only partially preloaded and gotten only as needed.
> > But it can be a valid concern, especially as some people write
> > defensive code so that if it is not already installed, they first
> > fetch it.
> >
> > Another perspective, especially for larger files, is speed. One
> > article I have suggests the base version is quite SLOW.
> >
> > https://www.r-bloggers.com/2017/04/fast-data-loading-from-files-to-r/
> >
> > But that was in 2017, and using such concerns, you may be better
> > off with data.table ...
> >
> > Another issue is that some people have found it handy to deal with
> > tibbles rather than unenhanced data.frames and if you read it in
> > using the base, you may end up converting it later so the
> > underscore version saves a small step. The OP clearly does not need
> > this as no other tidyverse functions are used. Others may care.
> >
> > But related to this are things like not converting strings to
> > factors by default or playing around with column names. It can be time
> > consuming to read in data and then use multiple commands to change
> > it to the way you want it, such as undoing the factors (albeit you
> > can just set the default in the base too) or converting a column it
> > guessed was integer to Boolean and so on.
> >
> > And I note I have used other features that I like and base does not
> > support. But, again, if the OP does not have any plans on using any
> > such features or defaults and is reading fairly small amounts of
> > data and running it once, there is no special reason to make it
> > worth leaving the base. If they may later want to use additional
> > tidyverse functionality, switching to use this by default may be
> > wise.
> >
> > My philosophy is to keep things as simple as reasonable but no
> > simpler than reasonable. In programming languages, it is to use a
> > simple consistent set of tools that gets me what I want with
> > accuracy and thus it can be simpler to use the tidyverse a lot as
> > my default. To each their own.
> >
> > -----Original Message-----
> > From: Richard O'Keefe <raoknz using gmail.com>
> > Sent: Sunday, March 16, 2025 7:53 AM
> > To: avi.e.gross using gmail.com
> > Cc: Kevin Zembower <kevin using zembower.org>; r-help using r-project.org
> > Subject: Re: [R] What don't I understand about sample()?
> >
> > I think you think I mistook read_csv for read.csv. Not so. The point
> > was that base R with no additional packages loaded already contains a
> > CSV reader which is entirely adequate for the task at hand. When you
> > are already struggling with the basics of a system (like how often and
> > when arguments are evaluated), I think it's wisest to stick with
> > basic tools. When they taught me carpentry at school, they had me on
> > chisels before getting to lathes (and in fact never did get to lathes
> > at my school).
> >
> > Sure, R isn't perfect. But whenever I open the SAS manuals I remember
> > that things could be much worse.
> >
> > On Sun, 16 Mar 2025 at 17:51, <avi.e.gross using gmail.com> wrote:
> > >
> > > Richard,
> > >
> > > The function with a period as a separator that you cite,
> > > read.csv, is part of normal base R.
> > >
> > > We have been discussing a different function named just a tad
> > > different that uses an underscore as a separator, read_csv that
> > > is similar but has some changes in how it works and the options
> > > supported and is considered part of the tidyverse grouping of
> > > packages and can also be gotten more compactly by importing
> > > package "readr" ...
> > >
> > > The OP, for reasons of their own, wanted to use read_csv and did
> > > not want or need anything else in the related packages.
> > >
> > > Of course, nobody is required to use other packages, albeit, as
> > > you noted, many packages you may choose to use have some
> > > dependencies on others you don't.
> > >
> > > Like many good things, added functionality available to you does
> > > add complexity and room for failures. But when a package proves
> > > useful enough, it can develop enough momentum
> > > that some functionality might well be a good idea to move into
> > > base R. As an example I already mentioned, of the various pipe
> > > implementations, a version has been added to base R and I suspect
> > > many older packages, including in the tidyverse, can adjust their
> > > code in new releases to use it but with CARE. Anyone still using
> > > older versions of R will experience failures in such a scenario.
> > >
> > > Luckily, many uses within a package are likely to be safe if done
> > > properly. Can anyone share if any such methods are in use?
> > >
> > > I mean, as an example, could a package early on check if the R
> > > version being used is later than the introduction, or some other
> > > way to check if a |> operation is supported? Could they then
> > > somehow introduce an operator that is either bound to |> or
> > > perhaps %>% and use that in any places in the code where both
> > > work the same, and only use the magrittr pipe when doing
> > > something it does differently such as needing to use a period to
> > > specify which argument in a function is receiving the pipelined
> > > data?
> > >
> > > There are programs people want to keep frozen so they only use
> > > the versions of R and packages that existed at some moment so you
> > > avoid some inevitable conflicts. So, I despair that older
> > > versions of R may stick around way too long and break with any
> > > newer packages.
> > >
> > > But languages cannot remain totally static or chances are people
> > > will move on to newer languages that offer things they want. Then
> > > again, there seem to still be COBOL programs out there.
> > >
> > > -----Original Message-----
> > > From: Richard O'Keefe <raoknz using gmail.com>
> > > Sent: Sunday, March 16, 2025 12:32 AM
> > > To: avi.e.gross using gmail.com
> > > Cc: Kevin Zembower <kevin using zembower.org>; r-help using r-project.org
> > > Subject: Re: [R] What don't I understand about sample()?
> > >
> > > Rgui 4.4.3 on Windows. When I start it up, read.csv is just *there*.
> > > I don't need to load any package to get it.
> > >
> > > I have three reasons for being very sparing in the packages I use.
> > > 1. It took me long enough to get my head around R. More packages =
> > > more things to learn. I *still* have major trouble grasping
> > > tidyverse, and as far as I can see it doesn't solve any problem that
> > > *I* have. I install a package only when I have a specific need for
> > > something it does, like spatial statistics. (And yet I have hundreds
> > > of packages installed, because packages depend on other packages.)
> > > 2. Everything changes, and they don't all change coherently. A
> > > package I've used for years may not be available in the next release.
> > > This is not a theoretical possibility; it has happened to me often.
> > > "If I don't use it I can't lose it." Sometimes things break because
> > > something else on the system (tcl/tk, or the C or Fortran compiler)
> > > has changed. I'm tired of things breaking because the C or
> > > Fortran compiler is now stricter.
> > > 3. The universe of R packages is vast and constantly expanding. This
> > > makes it *impossible* for anyone to test every possible combination. I
> > > used to teach software engineering, and we had a slogan "if it isn't
> > > tested it doesn't work". Base R plus package X? Probably tested.
> > > Base R plus package Y? Probably tested. Base R plus X plus Y?
> > > Not unless X requires Y or Y requires X.
> > >
> > > There is also the didactic point that the more you work with base R
> > > the better you will understand it, which you will need to understand
> > > other things like tidyverse. It's like mastering the alphabet before you
> > > learn shorthand.
> > >
> > >
> > > On Sun, 16 Mar 2025 at 06:55, <avi.e.gross using gmail.com> wrote:
> > > >
> > > > Kevin & Richard, and of course everyone,
> > > >
> > > > As the main topic here is not the tidyverse, I will mention the
> > > > perils of loading in more than needed in general.
> > > >
> > > > If you want to use one or a very few functions, it can be more
> > > > efficient and safe to load exactly what is needed. In the case
> > > > of wanting to use read_csv(), I think this suffices:
> > > >
> > > > library(readr)
> > > >
> > > > If you instead use:
> > > >
> > > > library(tidyverse)
> > > >
> > > > You load a varying number of packages (it may change) including
> > > > some like lubridate or forcats or ggplot2 that you may not be
> > > > even thinking of using or never heard of.
> > > >
> > > > The bigger problem is shadowing that happens. For example, you
> > > > may be getting warning messages like:
> > > >
> > > > ✖ dplyr::filter() masks stats::filter()
> > > > ✖ dplyr::lag() masks stats::lag()
> > > >
> > > > This can interfere with some other package you had already
> > > > loaded unless it uses a notation like mypackage::filter(...) in
> > > > its code to avoid being easily replaced but even then, if you
> > > > yourself called what you thought was filter() from base R or
> > > > some package, you have a problem unless you invoke it like
> > > > stats::filter(...)
> > > >
> > > > The order packages like this load can matter as well as when
> > > > you define a function of your own. So, it may be worth some
> > > > effort to zoom in and call exactly what you need and only when
> > > > you need it. I have seen code that only needs a package in rare
> > > > conditions and only loads the package in one branch of an IF
> > > > statement right before using it.
> > > >
> > > > Packages can also be unloaded after use.
> > > >
> > > > From what you describe, none of this is crucially important as
> > > > you are using R for your own purposes in your own RMarkDown
> > > > file that you may not be distributing. And, when I write
> > > > programs where I keep adjusting and adding things from the
> > > > tidyverse, it is indeed much easier to just get the grouping on
> > > > top and forget about it. That is, until I decide to do
> > > > something with functional programming that uses
> > > > reduce/filter/map... and have an odd error!
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: R-help <r-help-bounces using r-project.org> On Behalf Of Kevin
> > > > Zembower via R-help
> > > > Sent: Saturday, March 15, 2025 1:29 PM
> > > > To: r-help using r-project.org
> > > > Subject: Re: [R] What don't I understand about sample()?
> > > >
> > > > Hi, Richard, thanks for replying. I should have mentioned the third
> > > > edition, which we're using. The data file didn't change between the
> > > > second and third editions, and the data on Body Mass Gain was the same
> > > > as in the first edition, although the first edition data file contained
> > > > additional variables.
> > > >
> > > > According to my text, the BMGain was measured in grams. Thanks for
> > > > pointing out that my statement of the problem lacked crucial
> > > > information.
> > > >
> > > > The matrix in my example comes from an example in
> > > > https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the author
> > > > created a bootstrap example with a matrix that consisted of one row for
> > > > every sample in the bootstrap, and one column for each mean in the
> > > > original data. This allowed him to find the mean for each row to create
> > > > the bootstrap statistics.
> > > >
> > > > The only need for the tidyverse is to use the read_csv() function. I'm
> > > > regrettably lazy in not determining which of the multiple packages in
> > > > the tidyverse library provides read_csv(), and just using that one.
> > > >
> > > > Thanks, again, for helping me to further understand R and this
> > > > problem.
> > > >
> > > > -Kevin
> > > >
> > > > On Sat, 2025-03-15 at 12:00 +0100,
> > > > r-help-request using r-project.org wrote:
> > > > > Not having the book (and which of the three editions are you using?),
> > > > > I downloaded the data and played with it for a bit.
> > > > > dotchart() showed the Dark and Light conditions looked quite
> > > > > different, but also showed that there are not very many cases.
> > > > > After trying t.test, it occurred to me that I did not know whether
> > > > > "BMGain" means gain in *grams* or gain in *percent*.
> > > > > Reflection told me that for a growth experiment, percent made more
> > > > > sense, which reminded me of one of my first
> > > > > student advising experiences, where I said "never give the computer
> > > > > percentages; let IT calculate the percentages
> > > > > from the baseline and outcome, because once you've thrown away
> > > > > information, the computer can't magically get it back."
> > > > > In particular, in the real world I'd be worried about the possibility
> > > > > that there was some confounding going on, so I would
> > > > > much rather have initial weight and final weight as variables.
> > > > > If BMGain is an absolute measure, the p value for a t test is teeny
> > > > > tiny.
> > > > > If BMGain is a percentage, the p value for a sensible t test is about
> > > > > 0.03.
> > > > >
> > > > > A permutation test went like this.
> > > > > is.light <- d$Group == "Light"
> > > > > is.dark <- d$Group == "Dark"
> > > > > score <- function (g) mean(g[is.light]) - mean(g[is.dark])
> > > > > base.score <- score(d$BMGain)
> > > > > perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
> > > > > sum(perm.scores >= base.score) / length(perm.scores)
> > > > >
> > > > > I don't actually see where matrix() comes into it, still less anything
> > > > > in the tidyverse.
> > > > >
> > > >
> > > > ______________________________________________
> > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more,
> > > > see
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide
> > > > https://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible
> > > > code.
> > >
> >
>