[R] Why I wrote my MWE the way I did. [WAS:] Re: What don't I understand about sample()?

Richard O'Keefe raoknz using gmail.com
Tue Mar 18 01:56:06 CET 2025


?read.csv
leads to the help page for read.table, of which read.csv is a special case.
In the description, the first argument is called 'file', which suggests to
the unwary reader that it can only name a file.
But if you persist and read the description of the arguments, you learn that
"file can be a readable text-mode connection" and in the next paragraph
"file can also be a complete URL."
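
A minimal sketch of the less obvious form, a text-mode connection passed
as 'file'.  The column names are borrowed from the data set discussed
further down the thread; the values are invented purely for illustration,
and the URL in the last comment is a placeholder, not a real data source.

```r
# Made-up rows; only the column names come from the thread.
csv_lines <- c("Group,BMGain",
               "Light,9.9",
               "Dark,5.0")

# textConnection() gives a readable text-mode connection over the lines,
# so read.csv() never touches the disk or the network.
con <- textConnection(csv_lines)
d <- read.csv(con)
close(con)  # the connection was already open, so we close it ourselves

str(d)

# The URL form looks the same from the caller's side (placeholder URL):
# d <- read.csv("https://example.org/some-data.csv")
```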

Chances are, I'm going to make mistakes.  I'm going to want to fix my
script and try again.  And again.  And again.  If I download a file, all
these attempts are going to my local SSD.  If I use a URL, all these
attempts are going to reach out over the network, and if my connection
goes down (I've had a pile of washing fall on my modem, which then
overheated and shut down) I can't continue.
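
One way to get the local-copy behaviour without downloading by hand is
sketched below.  The helper name read_cached_csv and both of its
arguments are hypothetical, introduced only for this illustration; it
downloads only when the local file is missing and reads the local copy
on every later run.

```r
# Hypothetical helper: fetch `url` to `path` once, then always read `path`.
read_cached_csv <- function(url, path, ...) {
  if (!file.exists(path)) {
    # Only reached on the first run (or after the local copy is deleted).
    download.file(url, destfile = path, mode = "wb")
  }
  # Every later attempt reads the local file, with no network access.
  read.csv(path, ...)
}

# Usage (placeholder URL and file name):
# d <- read_cached_csv("https://example.org/some-data.csv", "some-data.csv")
```

Deleting the local file forces a fresh download on the next call.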

When is it a good idea to use read.table or any of its wrappers with a
URL as first argument?


On Tue, 18 Mar 2025 at 03:39, Kevin Zembower <kevin using zembower.org> wrote:
>
> Hello, all, thanks, again for the detailed comments and suggestions.
> This is one reason I really enjoy this group: the lively and
> knowledgeable discussions that questions generate. I'm a little
> hesitant that a future reader, just skimming the subject lines, will
> miss the true breadth of this discussion.
>
> I'd like to clarify my use of the tidyverse library. I used it so that
> I could use read_csv(). I was under the mistaken understanding that
> read.csv() would not fetch a file from the internet, using a
> 'https://...' URL, that only read_csv() would do that. I'm pretty sure
> that at some point in the 15 years that I've been aware of and using R,
> read.csv() would not do that. I didn't do a lot with R during the time
> that the tidyverse was developed and became popular; I had to learn it
> fresh just a few years ago, when I kind of came back to R.
>
> In this project, I was doing a 'data exploration' of sorts. I wasn't
> concerned with optimizing anything but getting a correct answer. I
> didn't explore whether other functions would also fetch internet files,
> I didn't compare the execution speeds of apply() versus rowMeans()
> (although, since I didn't know about rowMeans(), I'm glad Tim mentioned
> it; I'll be sure to file this away for the next time it comes up).
>
> Almost all respondents to my original question about sample() pointed
> out my example didn't use the 'size=' parameter. I constructed my first
> MWE (beyond the first one-line snippet I originally posted) to explain
> how I couldn't just use 'size' to fill a matrix, like the bootstrap
> model I was working from, because I needed permutations of the original
> dataset. (I had never heard of a permutation test. I think that's
> beyond the scope of Stats101.) I wanted to make sure that anyone who
> wished to participate in this discussion could do it in the easiest way
> possible, without loading a library that they would otherwise have no
> use for.
>
> I asked for any suggestions for my R coding style, and I appreciate all
> the respondents who went way above the call and researched the sources
> I was working from and made suggestions and improvements. I'm still
> reading through these to fully understand them, but I'm very grateful
> that you took the time to try to help me.
>
> Thank you all, again, for your efforts, and sharing your knowledge and
> experience with all of us on this list.
>
> -Kevin
>
> On Sun, 2025-03-16 at 16:04 -0700, Jeff Newmiller wrote:
> > The original question was about sample, a base R function. Dragging
> > in tidyverse along the way could be regarded as complicating the
> > question unnecessarily, but in some cases there can be undesirable or
> > simply unexpected interactions between functions drawn from different
> > packages. Such complications can turn out to be intrinsic to the
> > question being posed, in which case it will be necessary to have
> > things in their example just as they are in the original environment.
> > > In this case that does not seem to apply, and OP may get fewer
> > > responses to their question because some people don't keep
> > > tidyverse installed and may not want to add it just to answer a
> > > question. In some cases no one may respond, and OP would be left
> > > with no help.
> >
> > > In this case it all turned out fine, so this debate is getting
> > > stale, and there are reasons why including or excluding tidyverse
> > > might have been better. But in general, building a true Minimal
> > Reproducible Example (MRE) will help communicate most clearly
> > (consider using the reprex package to verify the example) and
> > minimizing unnecessary packages (reprex can help paring things down)
> > may avoid the dreaded "crickets" on the mailing list in the future.
> > And sometimes building an MRE will help OP answer their own question.
> >
> > On March 16, 2025 12:52:07 PM PDT, avi.e.gross using gmail.com wrote:
> > > Thanks for the clarification, Richard, as I clearly made the wrong
> > > guess of what you meant.
> > >
> > > Your idea or objection was that you see the included read.csv
> > > function as adequate and see no incentive to use read_csv, and
> > > especially not if that is the only function being used. I only
> > > partially agree.
> > >
> > > As usual, I look at things from multiple overlapping perspectives.
> > >
> > > There are actually more ways to read in a CSV or other such data
> > > files including fread from data.table and another called feather
> > > and other base functions. Some people choose ONE and use it
> > > whenever possible and your choice might be the base version and
> > > mine would not be.
> > >
> > > So, one perspective is that the base version is in some sense pre-
> > > loaded and any other must be pre-downloaded and added with a
> > > library statement. I am not sure how much that costs or if the base
> > > version is also only partially preloaded and gotten only as needed.
> > > But it can be a valid concern, especially as some people write
> > > defensive code so that if it is not already installed, they first
> > > fetch it.
> > >
> > > Another perspective, especially for larger files, is speed. One
> > > article I have suggests the base version is quite SLOW.
> > >
> > > https://www.r-bloggers.com/2017/04/fast-data-loading-from-files-to-r/
> > >
> > > > But that was in 2017, and if such concerns matter, you may be
> > > > better off with data.table ...
> > >
> > > Another issue is that some people have found it handy to deal with
> > > tibbles rather than unenhanced data.frames and if you read it in
> > > using the base, you may end up converting it later so the
> > > underscore version saves a small step. The OP clearly does not need
> > > this as no other tidyverse functions are used. Others may care.
> > >
> > > > But related to this are things like not converting strings to
> > > > factors by default or playing around with column names. It can be time
> > > consuming to read in data and then use multiple commands to change
> > > it to the way you want it, such as undoing the factors (albeit you
> > > can just set the default in the base too) or converting a column it
> > > guessed was integer to Boolean and so on.
> > >
> > > And I note I have used other features that I like and base does not
> > > support. But, again, if the OP does not have any plans on using any
> > > such features or defaults and is reading fairly small amounts of
> > > data and running it once, there is no special reason to make it
> > > worth leaving the base. If they may later want to use additional
> > > tidyverse functionality, switching to use this by default may be
> > > wise.
> > >
> > > > My philosophy is to keep things as simple as reasonable but no
> > > simpler than reasonable. In programming languages, it is to use a
> > > simple consistent set of tools that gets me what I want with
> > > accuracy and thus it can be simpler to use the tidyverse a lot as
> > > my default. To each their own.
> > >
> > > -----Original Message-----
> > > From: Richard O'Keefe <raoknz using gmail.com>
> > > Sent: Sunday, March 16, 2025 7:53 AM
> > > To: avi.e.gross using gmail.com
> > > Cc: Kevin Zembower <kevin using zembower.org>; r-help using r-project.org
> > > Subject: Re: [R] What don't I understand about sample()?
> > >
> > > > I think you think I mistook read_csv for read.csv.  Not so.  The
> > > > point was that base R with no additional packages loaded already
> > > > contains a CSV reader which is entirely adequate for the task at
> > > > hand.  When you are already struggling with the basics of a system
> > > > (like how often and when arguments are evaluated), I think it's
> > > > wisest to stick with basic tools.  When they taught me carpentry
> > > > at school, they had me on chisels before getting to lathes (and in
> > > > fact never did get to lathes at my school).
> > > >
> > > > Sure, R isn't perfect.  But whenever I open the SAS manuals I
> > > > remember that things could be much worse.
> > >
> > > On Sun, 16 Mar 2025 at 17:51, <avi.e.gross using gmail.com> wrote:
> > > >
> > > > Richard,
> > > >
> > > > The function with a period as a separator that you cite,
> > > > read.csv, is part of normal base R.
> > > >
> > > > We have been discussing a different function named just a tad
> > > > different that uses an underscore as a separator, read_csv that
> > > > is similar but has some changes in how it works and the options
> > > > supported and is considered part of the tidyverse grouping of
> > > > packages and can also be gotten more compactly by importing
> > > > package "readr" ...
> > > >
> > > > The OP, for reasons of their own, wanted to use read_csv and did
> > > > not want or need anything else in the related packages.
> > > >
> > > > Of course, nobody is required to use other packages, albeit, as
> > > > you noted, many packages you may choose to use have some
> > > > dependencies on others you don't.
> > > >
> > > > Like many good things, added functionality available to you does
> > > > add complexity and room for failures. But when a package is
> > > > useful enough to be very useful, it can develop enough momentum
> > > > that some functionality might well be a good idea to move into
> > > > base R. As an example I already mentioned, of the various pipe
> > > > implementations, a version has been added to base R and I suspect
> > > > many older packages, including in the tidyverse, can adjust their
> > > > code in new releases to use it but with CARE. Anyone still using
> > > > older versions of R will experience failures in such a scenario.
> > > >
> > > > Luckily, many uses within a package are likely to be safe if done
> > > > properly. Can anyone share if any such methods are in use?
> > > >
> > > > I mean, as an example, could a package early on check if the R
> > > > version being used is later than the introduction, or some other
> > > > way to check if a |> operation is supported? Could they then
> > > > somehow introduce an operator that is either bound to |> or
> > > > perhaps %>% and use that in any places in the code where both
> > > > work the same, and only use the magrittr pipe when doing
> > > > something it does differently such as needing to use a period to
> > > > specify which argument in a function is receiving the pipelined
> > > > data.
> > > >
> > > > There are programs people want to keep frozen so they only use
> > > > the versions of R and packages that existed at some moment so you
> > > > avoid some inevitable conflicts. So, I despair that older
> > > > versions of R may stick around way too long and break with any
> > > > newer packages.
> > > >
> > > > But languages cannot remain totally static or chances are people
> > > > will move on to newer languages that offer things they want. Then
> > > > again, there seem to still be COBOL programs out there.
> > > >
> > > > -----Original Message-----
> > > > From: Richard O'Keefe <raoknz using gmail.com>
> > > > Sent: Sunday, March 16, 2025 12:32 AM
> > > > To: avi.e.gross using gmail.com
> > > > Cc: Kevin Zembower <kevin using zembower.org>; r-help using r-project.org
> > > > Subject: Re: [R] What don't I understand about sample()?
> > > >
> > > > > Rgui 4.4.3 on Windows.  When I start it up, read.csv is just
> > > > > *there*.  I don't need to load any package to get it.
> > > > >
> > > > > I have three reasons for being very sparing in the packages I
> > > > > use.
> > > > > 1. It took me long enough to get my head around R.  More packages
> > > > > = more things to learn.  I *still* have major trouble grasping
> > > > > tidyverse, and as far as I can see it doesn't solve any problem
> > > > > that *I* have.  I install a package only when I have a specific
> > > > > need for something it does, like spatial statistics.  (And yet I
> > > > > have hundreds of packages installed, because packages depend on
> > > > > other packages.)
> > > > > 2. Everything changes, and they don't all change coherently.  A
> > > > > package I've used for years may not be available in the next
> > > > > release.  This is not a theoretical possibility; it has happened
> > > > > to me often.  "If I don't use it I can't lose it."  Sometimes
> > > > > things break because something else on the system (tcl/tk, or the
> > > > > C or Fortran compiler) has changed.  I'm tired of things breaking
> > > > > because the C or Fortran compiler is now stricter.
> > > > > 3. The universe of R packages is vast and constantly expanding.
> > > > > This makes it *impossible* for anyone to test every possible
> > > > > combination.  I used to teach software engineering, and we had a
> > > > > slogan "if it isn't tested it doesn't work".  Base R plus package
> > > > > X?  Probably tested.  Base R plus package Y?  Probably tested.
> > > > > Base R plus X plus Y?  Not unless X requires Y or Y requires X.
> > > > >
> > > > > There is also the didactic point that the more you work with base
> > > > > R the better you will understand it, which you will need to
> > > > > understand other things like tidyverse.  It's like mastering the
> > > > > alphabet before you learn shorthand.
> > > >
> > > >
> > > > On Sun, 16 Mar 2025 at 06:55, <avi.e.gross using gmail.com> wrote:
> > > > >
> > > > > Kevin & Richard, and of course everyone,
> > > > >
> > > > > As the main topic here is not the tidyverse, I will mention the
> > > > > perils of loading in more than needed in general.
> > > > >
> > > > > If you want to use one or a very few functions, it can be more
> > > > > efficient and safe to load exactly what is needed. In the case
> > > > > of wanting to use read_csv(), I think this suffices:
> > > > >
> > > > > library(readr)
> > > > >
> > > > > If you instead use:
> > > > >
> > > > > library(tidyverse)
> > > > >
> > > > > You load a varying number of packages (it may change) including
> > > > > some like lubridate or forcats or ggplot2 that you may not be
> > > > > even thinking of using or never heard of.
> > > > >
> > > > > The bigger problem is shadowing that happens. For example, you
> > > > > may be getting warning messages like:
> > > > >
> > > > > ✖ dplyr::filter() masks stats::filter()
> > > > > ✖ dplyr::lag()    masks stats::lag()
> > > > >
> > > > > This can interfere with some other package you had already
> > > > > loaded unless it uses a notation like mypackage::filter(...) in
> > > > > their code to avoid being easily replaced but even then, if you
> > > > > > yourself called what you thought was filter() from base R or
> > > > > > some package, you have a problem unless you invoke it like
> > > > > > stats::filter(...)
> > > > >
> > > > > The order packages like this load can matter as well as when
> > > > > you define a function of your own. So, it may be worth some
> > > > > effort to zoom in and call exactly what you need and only when
> > > > > you need it. I have seen code that only needs a package in rare
> > > > > conditions and only loads the package in one branch of an IF
> > > > > > statement right before using it.
> > > > > >
> > > > > Packages can also be unloaded after use.
> > > > >
> > > > > From what you describe, none of this is crucially important as
> > > > > you are using R for your own purposes in your own RMarkDown
> > > > > file that you may not be distributing. And, when I write
> > > > > programs where I keep adjusting and adding things from the
> > > > > tidyverse, it is indeed much easier to just get the grouping on
> > > > > top and forget about it. That is, until I decide to do
> > > > > something with functional programming that uses
> > > > > reduce/filter/map... and have an odd error!
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: R-help <r-help-bounces using r-project.org> On Behalf Of Kevin
> > > > > Zembower via R-help
> > > > > Sent: Saturday, March 15, 2025 1:29 PM
> > > > > To: r-help using r-project.org
> > > > > Subject: Re: [R] What don't I understand about sample()?
> > > > >
> > > > > > Hi, Richard, thanks for replying. I should have mentioned the
> > > > > > third edition, which we're using. The data file didn't change
> > > > > > between the second and third editions, and the data on Body Mass
> > > > > > Gain was the same as in the first edition, although the first
> > > > > > edition data file contained additional variables.
> > > > > >
> > > > > > According to my text, the BMGain was measured in grams. Thanks
> > > > > > for pointing out that my statement of the problem lacked crucial
> > > > > > information.
> > > > > >
> > > > > > The matrix in my example comes from an example in
> > > > > > https://pages.stat.wisc.edu/~larget/stat302/chap3.pdf, where the
> > > > > > author created a bootstrap example with a matrix that consisted
> > > > > > of one row for every sample in the bootstrap, and one column for
> > > > > > each mean in the original data. This allowed him to find the
> > > > > > mean for each row to create the bootstrap statistics.
> > > > > >
> > > > > > The only need for the tidyverse is to use the read_csv()
> > > > > > function. I'm regrettably lazy in not determining which of the
> > > > > > multiple functions in the tidyverse library loads read_csv(),
> > > > > > and just using that one.
> > > > > >
> > > > > > Thanks, again, for helping me to further understand R and this
> > > > > > problem.
> > > > >
> > > > > -Kevin
> > > > >
> > > > > On Sat, 2025-03-15 at 12:00 +0100,
> > > > > r-help-request using r-project.org wrote:
> > > > > > > Not having the book (and which of the three editions are you
> > > > > > > using?), I downloaded the data and played with it for a bit.
> > > > > > > dotchart() showed the Dark and Light conditions looked quite
> > > > > > > different, but also showed that there are not very many cases.
> > > > > > > After trying t.test, it occurred to me that I did not know
> > > > > > > whether "BMGain" means gain in *grams* or gain in *percent*.
> > > > > > > Reflection told me that for a growth experiment, percent made
> > > > > > > more sense, which reminded me of one of my first student
> > > > > > > advising experiences, where I said "never give the computer
> > > > > > > percentages; let IT calculate the percentages from the baseline
> > > > > > > and outcome, because once you've thrown away information, the
> > > > > > > computer can't magically get it back."
> > > > > > > In particular, in the real world I'd be worried about the
> > > > > > > possibility that there was some confounding going on, so I
> > > > > > > would much rather have initial weight and final weight as
> > > > > > > variables.
> > > > > > > If BMGain is an absolute measure, the p value for a t test is
> > > > > > > teeny tiny.
> > > > > > > If BMGain is a percentage, the p value for a sensible t test is
> > > > > > > about 0.03.
> > > > > >
> > > > > > A permutation test went like this.
> > > > > > is.light <- d$Group == "Light"
> > > > > > is.dark <- d$Group == "Dark"
> > > > > > score <- function (g) mean(g[is.light]) - mean(g[is.dark])
> > > > > > base.score <- score(d$BMGain)
> > > > > > > perm.scores <- sapply(1:997, function (i) score(sample(d$BMGain)))
> > > > > > sum(perm.scores >= base.score) / length(perm.scores)
> > > > > >
> > > > > > I don't actually see where matrix() comes into it, still less
> > > > > > anything
> > > > > > in the tidyverse.
> > > > > >
> > > > >
> > > > > ______________________________________________
> > > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more,
> > > > > see
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide
> > > > > https://www.R-project.org/posting-guide.html
> > > > > and provide commented, minimal, self-contained, reproducible
> > > > > code.
> > > > >
> > > > > ______________________________________________
> > > > > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more,
> > > > > see
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide
> > > > > https://www.R-project.org/posting-guide.html
> > > > > and provide commented, minimal, self-contained, reproducible
> > > > > code.
> > > >
> > >
> >
>
>


