[R] Fwd: Potential Issue with lm.influence

Wed Apr 3 11:36:02 CEST 2019

Yes, also notice that 

> predict(fit3, newdata=moon_data, type="response")
           1            2            3            4            5            6 
1.060694e+00 1.102008e+00 1.109695e+00 1.065515e+00 1.057896e+00 1.892312e+29 
           7            8            9           10           11           12 
3.531271e+17 2.295015e+01 1.739889e+01 1.058165e+00 1.058041e+00 1.057957e+00 
          13 
1.058217e+00 

so fit3 predicts that Jupiter and Saturn (rows 6 and 7) should have several bazillions of moons each!
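
A rough way to see where those numbers come from, assuming fit3 uses the default log link of glm.nb: taking logs of the predictions above gives the linear predictor, and regressing that on Volume recovers the implied intercept and slope. This is only a back-of-the-envelope sketch built from the printed output and the Volume column posted in the thread; the names pred, volume, eta and implied are made up for the illustration and are not part of the original code.

```r
# Response-scale predictions copied from the predict() output above.
pred <- c(1.060694e+00, 1.102008e+00, 1.109695e+00, 1.065515e+00,
          1.057896e+00, 1.892312e+29, 3.531271e+17, 2.295015e+01,
          1.739889e+01, 1.058165e+00, 1.058041e+00, 1.057957e+00,
          1.058217e+00)

# Volume column from the moon_data posted in the thread (same row order).
volume <- c(0.0291869497930152, 0.447504348276571, 0.523598775598299,
            0.0788376225681443, 0.000268082573106329, 737.393372232996,
            441.729261571372, 33.6865588825666, 30.6549628355953,
            0.00305362805928928, 0.00176714586764426, 0.00090477868423386,
            0.00359136400182873)

# Assumption: log link, so the linear predictor is log(prediction).
eta <- log(pred)

# Recover the implied intercept and slope with a straight-line fit.
implied <- coef(lm(eta ~ volume))
print(round(implied, 4))

# Linear predictor for Jupiter (row 6) and Saturn (row 7).
print(round(eta[6:7], 2))
```

With a slope of roughly 0.09 per unit of Volume, fitted on planets whose Volume is below 34, extrapolating to Volume = 442 and 737 puts the linear predictor at about 40 and 67, and exponentiating that is what produces the absurd moon counts.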

-pd

> On 3 Apr 2019, at 01:53 , Fox, John <jfox using mcmaster.ca> wrote:
> 
> Dear Eric,
> 
> Have you looked at your data? -- for example:
> 
> 	plot(log(Moons) ~ Volume, data = moon_data)
> 	text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1, subset = Volume > 400)
> 
> The negative-binomial model doesn't look reasonable, does it?
> 
> After you eliminate Jupiter there's one very high leverage point left, Saturn. Computing studentized residuals entails an approximation to deleting that as well from the model, so try fitting
> 
> 	fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
> 	summary(fit3)
> 
> which runs into numeric difficulties.
> 
> Then look at:
> 
> 	plot(log(Moons) ~ Volume, data = moon_data, subset = Volume < 400)
> 
> Finally, try
> 
> 	plot(log(Moons) ~ log(Volume), data = moon_data)
> 	fit4 <- update(fit2, . ~ log(Volume))
> 	rstudent(fit4)
> 
> I hope this helps,
> John
> 
> -----------------------------------------------------------------
> John Fox
> Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: https://socialsciences.mcmaster.ca/jfox/
> 
> 
> 
> 
>> -----Original Message-----
>> From: R-help [mailto:r-help-bounces using r-project.org] On Behalf Of Eric Bridgeford
>> Sent: Tuesday, April 2, 2019 5:01 PM
>> To: Bert Gunter <bgunter.4567 using gmail.com>
>> Cc: R-help <r-help using r-project.org>
>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
>> 
>> I agree that the influence documentation suggests NaNs may result. However,
>> these quantities can be computed manually and are, indeed, finite (i.e., by
>> fitting n models, each leaving out one of the n points, to obtain n
>> leave-one-out influence measures), so I don't see why the function SHOULD
>> return NaN. Given that it does return NaN, I would suggest either:
>> a) an alternative, possibly slower, method that returns the correct results
>> in the event that lm.influence does not give a good approximation (e.g., an
>> argument type="approx" for the approximation strategy currently employed,
>> and an alternative type="direct", or something like that, which computes
>> them manually); or
>> b) a heuristic that suggests why NaNs might result from one's particular
>> inputs and what can be done to fix it (if the approximation strategy is the
>> source of the problem), or what issue with the data causes the NaNs.
>> Hence I was looking to start a discussion around the specific strategy
>> employed to compute the elements.
>> 
>> Below is the code:
>> moon_data <- structure(list(
>>     Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L, 11L, 12L, 9L, 10L, 4L, 6L, 3L),
>>         .Label = c("Ceres ", "Earth", "Eris ", "Haumea ", "Jupiter ",
>>             "Makemake ", "Mars ", "Mercury ", "Neptune ", "Pluto ",
>>             "Saturn ", "Uranus ", "Venus "), class = "factor"),
>>     Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2, 9.54, 19.22, 30.06, 39.5,
>>         43.35, 45.8, 67.7),
>>     Diameter = c(0.382, 0.949, 1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883,
>>         0.18, 0.15, 0.12, 0.19),
>>     Mass = c(0.06, 0.82, 1, 0.11, 2e-04, 317.8, 95.2, 14.6, 17.2, 0.0022,
>>         7e-04, 7e-04, 0.0025),
>>     Moons = c(0L, 0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
>>     Volume = c(0.0291869497930152, 0.447504348276571, 0.523598775598299,
>>         0.0788376225681443, 0.000268082573106329, 737.393372232996,
>>         441.729261571372, 33.6865588825666, 30.6549628355953,
>>         0.00305362805928928, 0.00176714586764426, 0.00090477868423386,
>>         0.00359136400182873)),
>>     row.names = c(NA, -13L), class = "data.frame")
>> 
>> library(MASS)  # provides glm.nb
>> fit <- glm.nb(Moons ~ Volume, data = moon_data)
>> rstudent(fit)
>> 
>> fit2 <- update(fit, subset = Name != "Jupiter ")
>> rstudent(fit2)
>> 
>> influence(fit2)$sigma
>> 
>> #        1        2        3        4        5        7        8        9       10       11       12       13
>> # 1.077945 1.077813 1.165025 1.181685 1.077954      NaN 1.044454 1.152110 1.187586 1.181696 1.077954 1.165147
>> 
>> Sincerely,
>> Eric
>> 
>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>> 
>>> Also, I suggest you read ?influence, which may explain the source of
>>> your NaNs.
>>> 
>>> Bert Gunter
>>> 
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>> 
>>> 
>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>>> 
>>>> I told you already: **Include code inline **
>>>> 
>>>> See ?dput for how to include a text version of objects, such as data
>>>> frames, inline.
>>>> 
>>>> Otherwise, I believe .txt text files are not stripped if you insist on
>>>> *attaching* data or code. Others may have better advice.
>>>> 
>>>> 
>>>> Bert Gunter
>>>> 
>>>> 
>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford <ericwb95 using gmail.com> wrote:
>>>> 
>>>>> How can I add attachments? The following two files were attached in
>>>>> the initial message
>>>>> 
>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter <bgunter.4567 using gmail.com> wrote:
>>>>> 
>>>>>> Nothing was attached. The r-help server strips most attachments.
>>>>>> Include your code inline.
>>>>>> 
>>>>>> Also note that
>>>>>> 
>>>>>>> 0/0
>>>>>> [1] NaN
>>>>>> 
>>>>>> so maybe something like that occurs in the course of your calculations.
>>>>>> But that's just a guess, so feel free to disregard.
>>>>>> 
>>>>>> 
>>>>>> Bert Gunter
>>>>>> 
>>>>>> 
>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford <ericwb95 using gmail.com> wrote:
>>>>>> 
>>>>>>> Hi R core team,
>>>>>>> 
>>>>>>> I experienced the following issue with the attached data/code
>>>>>>> snippet: the studentized residual for a single observation is NaN
>>>>>>> even though the predictors and responses are finite. This appears to
>>>>>>> be driven by the glm.influence method in the stats package. I am
>>>>>>> curious as to whether this is a consequence of the specific
>>>>>>> implementation used for computing the influence, which appears to be
>>>>>>> the source of the NaN influence for that point. I was ultimately able
>>>>>>> to trace it back through the lm.influence method to this specific line
>>>>>>> <https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67>
>>>>>>> which calls C code which calls lminfl.f
>>>>>>> <https://github.com/SurajGupta/r-source/blob/master/src/library/stats/src/lminfl.f>
>>>>>>> (I don't know Fortran, so I can't debug further). My understanding is
>>>>>>> that the specific issue has to do with the leave-one-out variance
>>>>>>> estimate associated with this particular point, which, as far as I
>>>>>>> can tell, should be finite given finite predictors/responses. Let me
>>>>>>> know. Thanks!
>>>>>>> 
>>>>>>> Sincerely,
>>>>>>> 
>>>>>>> --
>>>>>>> Eric Bridgeford
>>>>>>> ericwb.me
>>>>>>> ______________________________________________
>>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> Eric Bridgeford
>>>>> ericwb.me
>>>>> 
>>>> 
>> 
>> --
>> Eric Bridgeford
>> ericwb.me
>> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes using cbs.dk  Priv: PDalgd using gmail.com