[R] Use of geometric mean .. in good data analysis

Mon Jan 22 21:23:20 CET 2024

Ah.... LOD's, typically LLOD's ("lower limits of detection").

Disclaimer: I am *NOT* in any sense an expert on such matters. What follows
are just some comments based on my personal experience. Please filter
accordingly. Also, while I kept it on list as Martin suggested it might be
useful to do so, most folks probably can safely ignore the rant that
follows as off topic and not of interest. So you've been warned!!

The rant:
My experience is: data that contain a "bunch" of values that are, e.g.
below a LLOD, are frequently reported and/or analyzed by various ad hoc,
and imho, uniformly bad methods. e.g.:

1) The censored values are recorded and analyzed as at the LLOD;
2) The censored values are recorded and analyzed at some arbitrary value
below the LLOD, like LLOD/2;
3) The censored values are "imputed" by ad hoc methods, e.g. uniform
random values between 0 and the LLOD for left censoring.

To repeat, *IMO*, all of this is junk and will produce misleading
statistical results. Whether they mislead enough to substantively affect
the science or regulatory decisions depends on the specifics of the
circumstances. I accept no general claim as to their innocuousness.
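
A minimal sketch of how one might check methods 1) - 3) on simulated data
where the truth is known. Everything here is an assumption made for
illustration: the lognormal distribution, the sample size, the 30%
censoring fraction, and the helper name geo_mean. The only result
guaranteed by construction is that recording nondetects at the LLOD pushes
the geometric mean up; the size and direction of the other errors depend on
the distribution and the censoring fraction.

```r
# Simulated illustration only: distribution, n and LLOD are assumptions.
set.seed(1)
n    <- 10000
x    <- rlnorm(n, meanlog = 0, sdlog = 1)    # "true" values, normally unseen
llod <- quantile(x, 0.30, names = FALSE)     # censor the lowest 30%
cens <- x < llod                             # TRUE where a value is a nondetect

at_llod   <- ifelse(cens, llod, x)                # method 1: record at the LLOD
half_llod <- ifelse(cens, llod / 2, x)            # method 2: record at LLOD/2
unif_imp  <- ifelse(cens, runif(n, 0, llod), x)   # method 3: uniform on (0, LLOD)

# Hypothetical helper: geometric mean of strictly positive values
geo_mean <- function(v) exp(mean(log(v)))

res <- c(truth     = geo_mean(x),
         at_llod   = geo_mean(at_llod),
         half_llod = geo_mean(half_llod),
         uniform   = geo_mean(unif_imp))
print(round(res, 3))
```

Changing sdlog or the censoring quantile and rerunning shows how much the
answers move with the circumstances.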

Further:

a) When you have a "lot" of censored values -- 50%? 75%?, 25%? -- face facts: you
have (practically) no useful information from the values that you do have
to infer what the distribution of values that you don't have looks like.
All one can sensibly do is say that x% of the values are below a LOD and
here's the distribution of what lies above. Presumably, if you have such
data conditional on covariates with the obvious intent to determine the
relationship to those covariates, you could analyze the percentages of
LLOD's and known values separately. There are undoubtedly more
sophisticated methods out there, so this is where you need to go to the
literature to see what might suit; though I think it will still have to
come down to looking at these separately (e.g. with extra parameters to
account for unmeasurable values). Another way of saying this is: any
analysis which treats all the data as arising from a single distribution
will depend more on the assumptions you make than on the data. So good luck
with that!
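
A rough sketch of the "analyze them separately" idea in a), on simulated
data. The covariate z, the data-generating model, and the LLOD of 1 are all
hypothetical, and the two-model split (a logistic regression for detection
plus a linear model on the log of the detected values) is just one simple
way to do it, not a recommended procedure.

```r
# Simulated data; covariate 'z', the model and the LLOD are hypothetical.
set.seed(2)
n    <- 500
z    <- runif(n)                                     # hypothetical covariate
conc <- rlnorm(n, meanlog = -1 + 2 * z, sdlog = 1)   # unobserved truth
llod <- 1
detected <- conc >= llod
d <- data.frame(z = z,
                detected = detected,
                conc = ifelse(detected, conc, NA))   # nondetects carry no value

# Part 1: how does the proportion of detects change with the covariate?
fit_detect <- glm(detected ~ z, family = binomial, data = d)

# Part 2: among detects only, how do measured values relate to the covariate?
# This describes the values above the LLOD, not the whole distribution.
fit_level <- lm(log(conc) ~ z, data = d, subset = detected)

print(mean(!d$detected))   # fraction of nondetects
print(coef(fit_detect))
print(coef(fit_level))
```

Neither fit says anything about the shape of the distribution below the
LLOD, which is the point made above.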

b) If you have a "modest" amount of (known) censoring -- 5%?, 20%? 10%? --
methods for the analysis of censored data should be useful. My
understanding is that MI (multiple imputation) is regarded as a generally
useful approach, and there are many R packages that can do various flavors
of this. Again, you should consult the literature: there are very likely
nontechnical reviews of this topic, too, as well as online discussions and
tutorials.
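
As one concrete instance of "methods for the analysis of censored data"
(a censored-likelihood fit, not the MI approach mentioned above), here is a
minimal sketch using survreg() from the survival package with left
censoring. The lognormal assumption, the simulated data, and the 15%
censoring fraction are assumptions for illustration only; whether a
lognormal model suits real data has to be checked.

```r
library(survival)   # a recommended package, included with standard R installs

# Simulated illustration only: distribution, n and LLOD are assumptions.
set.seed(3)
n    <- 2000
x    <- rlnorm(n, meanlog = 1, sdlog = 0.8)
llod <- quantile(x, 0.15, names = FALSE)   # about 15% left-censored
detected <- as.integer(x >= llod)          # 1 = measured, 0 = below the LLOD
obs  <- pmax(x, llod)                      # nondetects stored at the LLOD, flagged

# Left-censored lognormal fit: a nondetect contributes P(X < LLOD) to the
# likelihood instead of being treated as a measured value.
fit <- survreg(Surv(obs, detected, type = "left") ~ 1, dist = "lognormal")

est <- c(meanlog = unname(coef(fit)), sdlog = fit$scale)
print(round(est, 3))
print(exp(est[["meanlog"]]))   # geometric mean implied by the lognormal fit
```

Covariates can go on the right-hand side of the formula in place of the 1.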

So if you are serious about dealing with this and have a lot of data with
these issues, my advice would be to stop looking for ad hoc advice and dig
into the literature: it's one of the many areas of "data science" where
seemingly simple but pervasive questions require complex answers.

And, again, heed my personal caveats.

Thus endeth my rant.

Cheers to all,
Bert

On Mon, Jan 22, 2024 at 9:29 AM Rich Shepard <rshepard using appl-ecosys.com>
wrote:

> On Mon, 22 Jan 2024, Martin Maechler wrote:
>
> > I think it is a good question, not really only about geo-chemistry, but
> > about statistics in applied sciences (and engineering for that matter).
>
> > John W Tukey (and several other of the grands of the time) had the log
> > transform among the "First aid transformations":
> >
> > If the data for a continuous variable must all be positive it is also
> > typically the case that the distribution is considerably skewed to the
> > right. In such a case behave as a good human who sees another human in
> > health distress: apply First Aid -- do the things you learned to do
> > quickly without too much thought, because things must happen fast -- to
> > hopefully save the other's life.
>
> Martin,
>
> Thanks very much. I will look further into this because toxic metals and
> organic compounds in geochemical collections almost always have censored lab
> results (below method detection limits) that range from about 15% to 80% or
> more, and there almost always are very high extreme values.
>
> I'll learn to understand what benefits log transforms have over
> compositional data analyses.
>
> Best regards,
>
> Rich