[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives

Lyndon Estes lestes at princeton.edu
Wed Jul 13 16:46:27 CEST 2011


Hello Eik,

Thanks very much for your response and for directing me to a useful
explanation.

To make sure I am grasping your answer correctly, the two problems I
was experiencing are related to the fact that the floating point
numbers I was calculating and using in subsequent indices were not
entirely equivalent, even if they were calculated using the same
function (quantile).

I confirmed this as follows:

j <- 0.055  # bins value for 5.5% threshold
bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
bin.ct
#430.29
bin.ct2 <- quantile(x, j, na.rm = TRUE)
bin.ct2
#  5.5%
#430.29
bin.ct - bin.ct2
#        5.5%
#5.684342e-14
length(x) - length(x[x <= bin.ct])
length(x) - length(x[x <= bin.ct]) # 2875
length(x) - length(x[x <= bin.ct2])  # 3029

Testing the unname() option does not fix the result, I should however note:
bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
bin.ct2 <- unname(quantile(x, j, na.rm = TRUE))
bin.ct - bin.ct2  # 5.684342e-14

But rounding, as an alternative approach, works:
bin.ct2 <- round(quantile(x, j, na.rm = TRUE), 2)
bin.ct - bin.ct2
#5.5%
#   0
length(x) - length(x[x <= bin.ct]) # 2875
length(x) - length(x[x <= bin.ct2])  # 2875

As to my code, it is part of a custom ROC function I created a while
back and started using again recently (the data are rainfall
values). I can't remember why I did this rather than using one of the
existing ROC functions, but I thought (probably incorrectly) that I
had some compelling reason. In
any case, it is quite unwieldy, so I will explore those other
packages, or try revise this to be more efficient

(e.g. maybe this is a better approach, although the answers are fairly
different?).

x2 <- x[order(x)]
y2 <- y[order(y)]
bins <- round(seq(min(x2), max(x2), by = diff(range(x2)) / 200), 2)
threshold <- seq(0, 100, by = 0.5)
tp <- rep(0, length(bins))
fp <- rep(0, length(bins))
for(i in 1:length(threshold)) {
  tp[i] <- length(x2) - length(x2[x2 <= bins[i]])
  fp[i] <- length(y2) - length(y2[y2 <= bins[i]])
}
ctch <- cbind(threshold, bins, tp, fp)
ctch[1:20, ]

Thanks again for your help.

Cheers, Lyndon


On Tue, Jul 12, 2011 at 5:09 AM, Eik Vettorazzi
<E.Vettorazzi at uke.uni-hamburg.de> wrote:
> Hi,
>
> Am 11.07.2011 22:57, schrieb Lyndon Estes:
>> ctch[ctch$threshold == 3.5, ]
>> # [1] threshold val       tp        fp        tn        fn        tpr
>>      fpr       tnr       fnr
>> #<0 rows> (or 0-length row.names)
>
> this is the very effective FAQ 7.31 trap.
> http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
>
> Welcome to the first circle of Patrick Burns' R Inferno!
>
> Also, unname() is a more intuitive way of removing names.
>
> And I think your code is quite inefficient, because you calculate
> quantiles many times, which involves repeated ordering of x, and you may
> use a inefficient size of bin (either to small and therefore calculating
> the same split many times or to large and then missing some splits).
> I'm a bit puzzled what is x and y in your code, so any further advise is
> vague but you might have a look at any package that calculates
> ROC-curves such as ROCR or pROC (and many more).
>
> Hth
>
> --
> Eik Vettorazzi
>
> Department of Medical Biometry and Epidemiology
> University Medical Center Hamburg-Eppendorf
>
> Martinistr. 52
> 20246 Hamburg
>
> T ++49/40/7410-58243
> F ++49/40/7410-57790
>



-- 
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
lestes at princeton.edu



More information about the R-help mailing list