[R-sig-ME] Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x
Fox, John
jfox at mcmaster.ca
Mon Oct 3 20:21:47 CEST 2016
Dear Carlos,
> -----Original Message-----
> From: Carlos Familia [mailto:carlosfamilia at gmail.com]
> Sent: Monday, October 3, 2016 2:03 PM
> To: Fox, John <jfox at mcmaster.ca>
> Cc: r-sig-mixed-models at r-project.org; Thierry Onkelinx
> <thierry.onkelinx at inbo.be>
> Subject: Re: [R-sig-ME] Model diagnostics show slope in residuals plot
> and slope on the observed vs fitted plot is different than y = x
>
> Dear John,
>
> Do you suggest, then, that I should fit a separate linear regression
> model per condition and, instead of analysing the overall effect of W
> across conditions, analyse it for each condition separately?
No, I'm not recommending that, or anything in particular beyond the observation that randomly discarding data in order to fit a fixed effect model to the remaining independent observations is probably not a good idea. I also don't see what's to be gained by fitting separate models to the conditions; if you expect interactions with condition, why not model them? I don't think that it would be responsible for me to give you statistical advice by email knowing next to nothing about your research.
> Do you think this would be fine for a referee?
I have no idea. I probably wouldn't feel that I could predict an unknown referee's response to your work even if I were in a position to recommend what you should do.
I'm sorry that I can't be more helpful.
Best,
John
> Please note that I have no problem with this.
>
>
> Many thanks,
> Carlos Família
>
>
>
> On 3 Oct 2016, at 14:19, Fox, John <jfox at mcmaster.ca> wrote:
>
> Dear Carlos,
>
> If I understand properly what's troubling you and the nature of the
> suggestion, I wouldn't do it -- that is, randomly discard data to get
> one observation per case.
>
> Paul Johnson explained clearly why the pattern you noticed in the
> residuals vs. fitted values plot occurred, as a consequence of
> shrinkage. One way of thinking about this is that using a mixed model is
> *more* important when you have few observations per case, where
> shrinkage will be greater, than when you have many observations per
> case, where the "estimates" of the random effects will nearly coincide
> with the case means (or in a more complex model, within-case
> regressions).
>
> Best,
> John
>
> -----------------------------
> John Fox, Professor
> McMaster University
> Hamilton, Ontario
> Canada L8S 4M4
> Web: http://socserv.mcmaster.ca/jfox
>
>
>
>
>
> -----Original Message-----
> From: R-sig-mixed-models [mailto:r-sig-mixed-models-bounces at r-project.org]
> On Behalf Of Carlos Familia
> Sent: October 3, 2016 8:20 AM
> To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
> Cc: r-sig-mixed-models at r-project.org
> Subject: Re: [R-sig-ME] Model diagnostics show slope in residuals plot and
> slope on the observed vs fitted plot is different than y = x
>
> Do you think this approach is sound and easily justifiable in a paper?
>
> Many thanks,
> Carlos
>
>
> On 3 Oct 2016, at 13:10, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
>
> In case you have more than one measurement per ID, select one of them
> at random. Something like
>
> library(dplyr)
> df %>%
>   group_by(id) %>%
>   sample_n(1)
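>
> (With more recent dplyr versions, slice_sample() can be used in place of
> sample_n(); set.seed() just makes the random draw reproducible.)
>
> set.seed(1)
> df %>%
>   group_by(id) %>%
>   slice_sample(n = 1)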
>
>
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more than
> asking him to perform a post-mortem examination: he may be able to say what
> the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2016-10-03 14:06 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:
>
>
> Dear Thierry,
>
> When you say sampling the data to have only one observation per ID, do you
> mean reducing the dataset to cases where the Y variable was only measured in
> one condition?
>
> I have thought about going with lm for this, but it just didn't feel right...
>
> Many thanks,
> Carlos
>
>
>
> On 3 Oct 2016, at 13:02, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
>
>
>
> Dear Carlos,
>
> I concur with Paul. After playing a bit with the data you sent me privately,
> I see a few things which cause problems:
>
> 1) the number of measurements per ID is low: 1/3 of the IDs have one
> measurement in each level of C, 1/3 in two out of three levels of C, and
> 1/3 in only one level of C
> 2) the variance of ID is larger than the residual variance
> 3) the effect of W is small compared to the variance of ID
>
> If possible, try to add extra covariates. If not, I'd fall back on a simple lm,
> either ignoring the repeated measurements or sampling the data so that you
> have only one observation per ID.
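>
> A rough sketch of that fallback (assuming the variable names Y, W, C and ID
> from the original post):
>
> library(dplyr)
> one_per_id <- df %>%
>   group_by(ID) %>%
>   sample_n(1) %>%   # keep one randomly chosen row per subject
>   ungroup()
> summary(lm(Y ~ W * C, data = one_per_id))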
>
>
>
> Best regards,
>
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25
> 1070 Anderlecht
> Belgium
>
> To call in the statistician after the experiment is done may be no more than
> asking him to perform a post-mortem examination: he may be able to say what
> the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not
> ensure that a reasonable answer can be extracted from a given body of data.
> ~ John Tukey
>
> 2016-10-03 11:17 GMT+02:00 Paul Johnson <paul.johnson at glasgow.ac.uk>:
>
>
> Hi Carlos,
>
> Your plot didn't come through, as Thierry noted. However, it's expected that,
> unlike a standard linear regression model, an LMM with a nested structure
> such as yours will give a positive linear relationship between the residuals
> and the fitted values (might also be true for other structures?), provided
> the residual and random effect variances are > 0. Somebody will hopefully
> chip in with a formal explanation, but it's basically a similar phenomenon
> to regression to the mean.
>
>
>
> Imagine a group of students, each taking an aptitude test three times.
> There are two random factors: the difference in underlying aptitude between
> the students, modelled by the student ID random effect; and random variation
> between time points within each student (e.g. how good a particular student
> is feeling on a particular day). I'm ignoring variation between tests — let's
> unrealistically assume they're all the same and students completely forget
> about them between tests.
>
>
>
> The students with the best mean scores will be a mixture of excellent
> students having three so-so (some good, some bad) days, and moderately good
> students having the good luck to have three good days, and the very best
> scores will come from students who were both excellent and lucky (although
> this category will be small). An important point is that there is no way of
> using the data to separate the moderate-student-lucky-days high scores from
> the excellent-student-average-days scores. If we simply took the mean of the
> scores, we would be overestimating the performance of the students on
> average (we'd have good estimates of the excellent students and
> overestimates of the moderate ones), so the best estimate is achieved by
> shrinking the scores towards the mean.
>
>
>
> This is what happens when the model is fitted. Each student is given a
> residual (random effect) at the student level (how good the student is
> relative to the value predicted by the fixed effects) and three residuals at
> the observation (between-test-within-student) level. For students with good
> scores, this will be a compromise between the inseparable
> excellent-student-average-days scenario and the moderate-student-lucky-days
> scenario. As a result, students with high student-level residuals (the
> student random effects) will also tend to have high inter-test residuals.
> The same is also true in the negative direction for poor students and
> students having three bad days. So the student random effects (which are
> part of the fitted values) and the residuals will be positively correlated.
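>
> A quick way to see this with made-up data (a rough sketch; the numbers are
> arbitrary and not meant to match your data):
>
> library(lme4)
> set.seed(1)
> n_student <- 200
> student <- rep(1:n_student, each = 3)        # 3 test scores per student
> aptitude <- rnorm(n_student, sd = 2)         # student-level random effect
> score <- 50 + aptitude[student] + rnorm(length(student), sd = 1)
> dat <- data.frame(student = factor(student), score = score)
> m <- lmer(score ~ 1 + (1 | student), data = dat)
> plot(fitted(m), resid(m))    # residuals vs fitted: positive slope, from shrinkage alone
> abline(lm(resid(m) ~ fitted(m)), lty = 2)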
>
>
>
> You can check this by simulating new data from the fitted model, re-fitting
> the model, and comparing the residuals-vs-fitted plot (which will be
> "perfect") to the one from your data. Here's a function that does this for
> lme4 fits:
>
>
>
> devtools::install_github("pcdjohnson/GLMMmisc")
> library(GLMMmisc)
> sim.residplot(fit)  # repeat this a few times to account for sampling error
>
> If all is well, you should see a similar slope between the real and the
> simulated plots; in fact, the general pattern of the residuals should be
> similar.
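>
> Roughly the same check can be done by hand with lme4's simulate() and refit()
> (a minimal sketch; fit is assumed to be your fitted lmer model):
>
> library(lme4)
> simfit <- refit(fit, simulate(fit)[[1]])  # re-fit the model to data simulated from it
> par(mfrow = c(1, 2))
> plot(fitted(fit), resid(fit), main = "observed")
> plot(fitted(simfit), resid(simfit), main = "simulated")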
>
>
>
> (The new package DHARMa —
> https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-package-for-residual-diagnostics-of-glmms/
> — takes a similar approach to assessing residuals, but in a less
> quick-and-dirty, more formally justified way.)
>
> All the best,
> Paul
>
>
>
>
>
>
> On 2 Oct 2016, at 16:57, Carlos Familia <carlosfamilia at gmail.com> wrote:
>
>
>
> Hello,
>
> I have in hand a quite large and unbalanced dataset, for which a continuous
> dependent variable Y was measured in 3 different conditions (C) for about
> 3000 subjects (ID) (although not all subjects have Y values for all 3
> conditions). Additionally, there is a continuous measure W which was
> measured for all subjects.
>
>
>
> I am interested in testing the following:
>
> - Is the effect of W significant overall
> - Is the effect of W significant at each level of C
> - Is the effect of C significant
>
> In order to try to answer this, I have specified the following model with
> lmer:
>
>
>
> lmer( Y ~ W * C + (1 | ID), data = df)
>
> This seems to properly reflect the structure of the data (I might be wrong
> here; any suggestions would be welcome).
>
>
> However, when running the diagnostic plots, I noticed a slope in the
> residuals plot and a slope different from y = x in the observed vs fitted
> plot (as shown below), which made me question the validity of the model for
> inference.
>
>
>
> Could I still use this model for inference? Should I specify a different
> formula? Should I turn to lme and try to include different variances for
> each level of conditions (C)? Any ideas?
>
>
>
> I would really appreciate it if anyone could help me with this.
>
> Thanks in advance,
> Carlos Família
>
>
>
>
>
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
More information about the R-sig-mixed-models mailing list