[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives

Lyndon Estes lestes at princeton.edu
Wed Jul 13 17:35:00 CEST 2011


Hi Bert,

Just to clarify, I used that particular construction simply to remove
the name from the vector, which I thought had something to do with the
problem I was having, which was that two seemingly identical answers
resulting from the quantile function were producing different results
in a construction for calculating true positives:

bin.ct <- quantile(x, 0.05) # 430.29
length(x) - length(x[x <= bin.ct])  # 2875
bin.ct <- quantile(x, 0.055) # 430.29
length(x) - length(x[x <= bin.ct])  # 3029 (wrong answer)
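
To illustrate what is going on, here is a minimal sketch with made-up data (not the rainfall values). The offset 2^-44 is an assumption, chosen because it is one representable step at this magnitude and matches the 5.684342e-14 difference reported further down the thread:

```r
# Hypothetical data: several values tied at exactly 430.29
x <- c(rep(430.29, 5), 500, 600, 700)

exact <- 430.29
below <- 430.29 - 2^-44       # one representable step below 430.29

print(below)                  # prints as 430.29 at the default 7 digits
print(below, digits = 17)     # extra digits reveal the difference
exact == below                # FALSE

# Count values strictly above each threshold
n_above_exact <- sum(x > exact)   # ties are not above the threshold: 3
n_above_below <- sum(x > below)   # ties now exceed the threshold: 8
```

The two thresholds print identically, yet every value tied at 430.29 falls on a different side of the comparison.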

It turns out that the vector name had nothing to do with it, as Eik
showed, but converting to character and then back to numeric actually
did fix the problem:

bin.ct <- as.numeric(as.character(quantile(x, 0.05))) # 430.29
length(x) - length(x[x <= bin.ct])  # 2875
bin.ct <- as.numeric(as.character(quantile(x, 0.055))) # 430.29
length(x) - length(x[x <= bin.ct])  # 2875
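
The round trip through character presumably works because only 15 significant digits survive, as the ?as.character excerpt quoted below describes. A rough sketch of that effect, using sprintf("%.15g", ...) as a stand-in for the 15-digit formatting (an assumption, so the result does not depend on how a given R version implements as.character), with a hypothetical threshold value:

```r
below <- 430.29 - 2^-44            # hypothetical threshold, one step below 430.29
txt   <- sprintf("%.15g", below)   # keep 15 significant digits
back  <- as.numeric(txt)           # parse the shortened text again

txt                                # "430.29"
back == 430.29                     # TRUE: the tiny offset is gone
back == below                      # FALSE: the round trip changed the value
```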

Alternatively, rounding works and is, I guess, preferable (though better
still would be an entire code rewrite, as Eik kindly pointed out):

bin.ct <- round(quantile(x, 0.05), 2) # 430.29
length(x) - length(x[x <= bin.ct])  # 2875
bin.ct <- round(quantile(x, 0.055), 2) # 430.29
length(x) - length(x[x <= bin.ct])  # 2875
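
As a possible alternative to rounding, the comparison itself can allow a small tolerance, and the counting can be done in one pass over sorted data. This is only a sketch on made-up values: count_above(), the tolerance of 1e-8, and the bins vector are illustrative assumptions, not part of the original function.

```r
# Hypothetical data with ties at 430.29; `below` mimics a quantile that
# landed one representable step under the tied value.
x     <- c(rep(430.29, 5), 500, 600, 700)
below <- 430.29 - 2^-44

# Illustrative helper: values within `tol` of the threshold are treated
# as equal to it, so they are not counted as above.
count_above <- function(x, thr, tol = 1e-8) sum(x > thr + tol)

count_above(x, 430.29)   # 3
count_above(x, below)    # 3 as well, instead of 8

# Sort once, then count the values <= each bin with findInterval()
xs   <- sort(x)
bins <- c(400, 430.29, 550, 700)          # made-up thresholds
tp   <- length(xs) - findInterval(bins, xs)
tp                                         # 8 3 2 0
```

A suitable tolerance depends on the precision of the data; 1e-8 is just a placeholder here.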

Thanks for your time and comments.

Cheers, Lyndon



On Wed, Jul 13, 2011 at 11:08 AM, Bert Gunter <gunter.berton at gene.com> wrote:
> Rather strange ...
>
> Why would one convert a numeric to character and then back again to
> numeric? Why would one assume that such a conversion would retain full
> machine precision?
>
> In fact,
>  ?as.character
>
> tells you:
>
> as.character represents real and complex numbers to 15 significant
> digits (technically the compiler's setting of the ISO C constant
> DBL_DIG, which will be 15 on machines supporting IEC60559 arithmetic
> according to the C99 standard). This ensures that all the digits in
> the result will be reliable (and not the result of representation
> error), but does mean that conversion to character and back to numeric
> may change the number. If you want to convert numbers to character
> with the maximum possible precision, use format.
>
> ... which is exactly what you saw.
>
> -- Bert
>
>
> On Wed, Jul 13, 2011 at 7:46 AM, Lyndon Estes <lestes at princeton.edu> wrote:
>> Hello Eik,
>>
>> Thanks very much for your response and for directing me to a useful
>> explanation.
>>
>> To make sure I am grasping your answer correctly, the two problems I
>> was experiencing are related to the fact that the floating point
>> numbers I was calculating and using in subsequent indices were not
>> entirely equivalent, even if they were calculated using the same
>> function (quantile).
>>
>> I confirmed this as follows:
>>
>> j <- 0.055  # bins value for 5.5% threshold
>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>> bin.ct
>> #430.29
>> bin.ct2 <- quantile(x, j, na.rm = TRUE)
>> bin.ct2
>> #  5.5%
>> #430.29
>> bin.ct - bin.ct2
>> #        5.5%
>> #5.684342e-14
>> length(x) - length(x[x <= bin.ct]) # 2875
>> length(x) - length(x[x <= bin.ct2])  # 3029
>>
>> I should note, however, that the unname() option does not fix the result:
>> bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
>> bin.ct2 <- unname(quantile(x, j, na.rm = TRUE))
>> bin.ct - bin.ct2  # 5.684342e-14
>>
>> But rounding, as an alternative approach, works:
>> bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2)
>> bin.ct - bin.ct2
>> #5.5%
>> #   0
>> length(x) - length(x[x <= bin.ct]) # 2875
>> length(x) - length(x[x <= bin.ct2])  # 2875
>>
>> As to my code, it is part of a custom ROC function I created a while
>> back and started using again recently (the data are rainfall
>> values). I can't remember why I did this rather than using one of the
>> existing ROC functions, but I thought (probably incorrectly) that I
>> had some compelling reason. In any case, it is quite unwieldy, so I
>> will explore those other packages, or try to revise this to be more
>> efficient (e.g. maybe this is a better approach, although the answers
>> are fairly different?).
>>
>> x2 <- x[order(x)]
>> y2 <- y[order(y)]
>> bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2)
>> threshold <- seq(0, 100, by = 0.5)
>> tp <- rep(0, length(bins))
>> fp <- rep(0, length(bins))
>> for(i in 1:length(threshold)) {
>>  tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
>>  fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
>> }
>> ctch <- cbind(threshold, bins, tp, fp)
>> ctch[1:20, ]
>>
>> Thanks again for your help.
>>
>> Cheers, Lyndon
>>
>>
>> On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi
>> <E.Vettorazzi at uke.uni-hamburg.de> wrote:
>>> Hi,
>>>
>>> Am 11.07.2011 22:57, schrieb Lyndon Estes:
>>>> ctch[ctch$threshold == 3.5, ]
>>>> # [1] threshold val       tp        fp        tn        fn        tpr       fpr       tnr       fnr
>>>> #<0 rows> (or 0-length row.names)
>>>
>>> this is the very effective FAQ 7.31 trap.
>>> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
>>>
>>> Welcome to the first circle of Patrick Burns' R Inferno!
>>>
>>> Also, unname() is a more intuitive way of removing names.
>>>
>>> And I think your code is quite inefficient, because you calculate
>>> quantiles many times, which involves repeated ordering of x, and you may
>>> use an inefficient bin size (either too small, and therefore calculating
>>> the same split many times, or too large, and then missing some splits).
>>> I'm a bit puzzled about what x and y are in your code, so any further
>>> advice is vague, but you might have a look at any package that calculates
>>> ROC curves, such as ROCR or pROC (and many more).
>>>
>>> Hth
>>>
>>> --
>>> Eik Vettorazzi
>>>
>>> Department of Medical Biometry and Epidemiology
>>> University Medical Center Hamburg-Eppendorf
>>>
>>> Martinistr. 52
>>> 20246 Hamburg
>>>
>>> T ++49/40/7410-58243
>>> F ++49/40/7410-57790
>>>
>>
>>
>>
>> --
>> Lyndon Estes
>> Research Associate
>> Woodrow Wilson School
>> Princeton University
>> lestes at princeton.edu
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> "Men by nature long to get on to the ultimate truths, and will often
> be impatient with elementary studies or fight shy of them. If it were
> possible to reach the ultimate truths without the elementary studies
> usually prefixed to them, these would not be preparatory studies but
> superfluous diversions."
>
> -- Maimonides (1135-1204)
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>



-- 
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
+1-609-258-2392 (o)
+1-609-258-6082 (f)
+1-202-431-0496 (m)
lestes at princeton.edu


