[R] Fwd: Potential Issue with lm.influence
peter dalgaard
pd@|gd @end|ng |rom gm@||@com
Wed Apr 3 11:36:02 CEST 2019
Yes, also notice that
> predict(fit3, new=moon_data, type="resp")
1 2 3 4 5 6
1.060694e+00 1.102008e+00 1.109695e+00 1.065515e+00 1.057896e+00 1.892312e+29
7 8 9 10 11 12
3.531271e+17 2.295015e+01 1.739889e+01 1.058165e+00 1.058041e+00 1.057957e+00
13
1.058217e+00
so the model of fit3 predicts that Jupiter and Saturn should have several bazillions of moons each!
-pd
> On 3 Apr 2019, at 01:53 , Fox, John <jfox using mcmaster.ca> wrote:
>
> Dear Eric,
>
> Have you looked at your data? -- for example:
>
> plot(log(Moons) ~ Volume, data = moon_data)
> text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1, subset = Volume > 400)
>
> The negative-binomial model doesn't look reasonable, does it?
>
> After you eliminate Jupiter there's one very high leverage point left, Saturn. Computing studentized residuals entails an approximation to deleting that as well from the model, so try fitting
>
> fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
> summary(fit3)
>
> which runs into numeric difficulties.
>
> Then look at:
>
> plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
>
> Finally, try
>
> plot(log(Moons) ~ log(Volume), data = moon_data)
> fit4 <- update(fit2, . ~ log(Volume))
> rstudent(fit4)
>
> I hope this helps,
> John
>
> -----------------------------------------------------------------
> John Fox
> Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: https://socialsciences.mcmaster.ca/jfox/
>
>
>
>
>> -----Original Message-----
>> From: R-help [mailto:r-help-bounces using r-project.org] On Behalf Of Eric
>> Bridgeford
>> Sent: Tuesday, April 2, 2019 5:01 PM
>> To: Bert Gunter <bgunter.4567 using gmail.com>
>> Cc: R-help <r-help using r-project.org>
>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
>>
>> I agree the influence documentation suggests NaNs may result; however, as
>> these can be manually computed and are, indeed, finite/existing (ie,
>> computing the held-out influence by manually training n models for n points
>> to obtain n leave one out influence measures), I don't possibly see how the
>> function SHOULD return NaN, and given that it is returning NaN, that
>> suggests to me that there should be either a) Providing an alternative
>> method to compute them that (may be slower) that returns the correct
>> results in the even that lm.influence does not return a good approximation
>> (ie, a command line argument for type="approx" that does the
>> approximation strategy employed currently, or an alternative type="direct"
>> or something like that that computes them manually), or b) a heuristic to
>> suggest why NaNs might result from one's particular inputs/what can be
>> done to fix it (if the approximation strategy is the source of the problem) or
>> what the issue is with the data that will cause NaNs. Hence I was looking to
>> start a discussion around the specific strategy employed to compute the
>> elements.
>>
>> Below is the code:
>> moon_data <- structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L, 11L,
>> 12L, 9L, 10L, 4L, 6L, 3L), .Label = c("Ceres ", "Earth",
>> "Eris ",
>>
>> "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ", "Neptune ",
>>
>> "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
>> Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2, 9.54, 19.22,
>> 30.06, 39.5, 43.35, 45.8, 67.7), Diameter = c(0.382, 0.949,
>>
>> 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
>>
>> 0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e-04, 317.8,
>>
>> 95.2, 14.6, 17.2, 0.0022, 7e-04, 7e-04, 0.0025), Moons = c(0L,
>>
>>
>> 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L), Volume =
>> c(0.0291869497930152,
>>
>>
>>
>> 0.447504348276571, 0.523598775598299, 0.0788376225681443,
>>
>>
>>
>> 0.000268082573106329, 737.393372232996, 441.729261571372,
>>
>>
>>
>> 33.6865588825666, 30.6549628355953, 0.00305362805928928,
>>
>>
>>
>> 0.00176714586764426, 0.00090477868423386, 0.00359136400182873
>>
>>
>> )), row.names = c(NA, -13L), class = "data.frame")
>>
>> fit <- glm.nb(Moons ~ Volume, data = moon_data)
>> rstudent(fit)
>>
>> fit2 <- update(fit, subset = Name != "Jupiter ")
>> rstudent(fit2)
>>
>> influence(fit2)$sigma
>>
>> # 1 2 3 4 5 7 8 9
>> 10 11 12 13
>> # 1.077945 1.077813 1.165025 1.181685 1.077954 NaN 1.044454 1.152110
>> 1.187586 1.181696 1.077954 1.165147
>>
>> Sincerely,
>> Eric
>>
>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter <bgunter.4567 using gmail.com>
>> wrote:
>>
>>> Also, I suggest you read ?influence which may explain the source of
>>> your NaN's .
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>>
>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter <bgunter.4567 using gmail.com>
>> wrote:
>>>
>>>> I told you already: **Include code inline **
>>>>
>>>> See ?dput for how to include a text version of objects, such as data
>>>> frames, inline.
>>>>
>>>> Otherwise, I believe .txt text files are not stripped if you insist
>>>> on
>>>> *attaching* data or code. Others may have better advice.
>>>>
>>>>
>>>> Bert Gunter
>>>>
>>>> "The trouble with having an open mind is that people keep coming
>>>> along and sticking things into it."
>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>
>>>>
>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford <ericwb95 using gmail.com>
>>>> wrote:
>>>>
>>>>> How can I add attachments? The following two files were attached in
>>>>> the initial message
>>>>>
>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter <bgunter.4567 using gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Nothing was attached. The r-help server strips most attachments.
>>>>>> Include your code inline.
>>>>>>
>>>>>> Also note that
>>>>>>
>>>>>>> 0/0
>>>>>> [1] NaN
>>>>>>
>>>>>> so maybe something like that occurs in the course of your calculations.
>>>>>> But that's just a guess, so feel free to disregard.
>>>>>>
>>>>>>
>>>>>> Bert Gunter
>>>>>>
>>>>>> "The trouble with having an open mind is that people keep coming
>>>>>> along and sticking things into it."
>>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
>>>>>> <ericwb95 using gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi R core team,
>>>>>>>
>>>>>>> I experienced the following issue with the attached data/code
>>>>>>> snippet, where the studentized residual for a single observation
>>>>>>> appears to be NaN given finite predictors/responses, which appears
>>>>>>> to be driven by the glm.influence method in the stats package. I
>>>>>>> am curious to whether this is a consequence of the specific
>>>>>>> implementation used for computing the influence, which it would
>>>>>>> appear is the driving force for the NaN influence for the point,
>>>>>>> that I was ultimately able to trace back through the lm.influence
>>>>>>> method to this specific line <
>>>>>>> https://github.com/SurajGupta/r-
>> source/blob/a28e609e72ed7c47f6ddfb
>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
>>>>>>>>
>>>>>>> which
>>>>>>> calls C code which calls iminfl.f
>>>>>>> <
>>>>>>> https://github.com/SurajGupta/r-source/blob/master/src/library/sta
>>>>>>> ts/src/lminfl.f
>>>>>>>>
>>>>>>> (I
>>>>>>> don't know fortran so I can't debug further). My understanding is
>>>>>>> that the specific issue would have to do with the leave-one-out
>>>>>>> variance estimate associated with this particular point, which it
>>>>>>> seems based on my understanding should be finite given finite
>>>>>>> predictors/responses. Let me know. Thanks!
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> --
>>>>>>> Eric Bridgeford
>>>>>>> ericwb.me
>>>>>>> ______________________________________________
>>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Eric Bridgeford
>>>>> ericwb.me
>>>>>
>>>>
>>
>> --
>> Eric Bridgeford
>> ericwb.me
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk Priv: PDalgd using gmail.com
More information about the R-help
mailing list