[R] breaks

Fri Jun 13 19:35:14 CEST 2003

>>>>> "DavidB" == David Brahm <brahm at alum.mit.edu>
>>>>>     on Fri, 13 Jun 2003 10:56:29 -0400 writes:

    DavidB> Martin Maechler <maechler at stat.math.ethz.ch> wrote:
    >> findInterval()

    DavidB> Hi, Martin.  I wasn't aware of findInterval().  findInterval(x, vec) looks to
    DavidB> me very similar to:
    R> cut(x, c(-Inf,vec,Inf), labels=FALSE, right=FALSE) - 1

    DavidB> so I'm curious what the differences are (e.g. speed,
    DavidB> duplicates in vec?).  In any case, findInterval()
    DavidB> and cut() ought to be in each other's "See Also",
    DavidB> don't you think?

When I wrote the precursor of findInterval() about 10 years ago (to be
dyn.load()ed into S-plus), I wasn't yet aware of the
several alternatives.  

However, when I added it to R, I knew about the N*ecdf()
alternative, i.e., ecdf() from package:stepfun which relies on
approx(....., method = "constant").
I found that findInterval() was slightly faster than approx()
even for unsorted `x' (by about a factor of 2 for large `vec') in my
test cases, but the real speed of findInterval() comes into play
when `x' is sorted -- something which is very typical e.g. for
evaluation of piecewise functions (splines etc).
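
To make the alternatives above concrete, here is a minimal sketch (not the
original test code) showing that, for a sorted `vec' of distinct values,
findInterval(), N*ecdf() and approx(..., method = "constant") all return the
number of `vec' entries that are <= x.  The data and the names `fi', `ec'
and `ap' are made up for illustration; in current R versions ecdf() is
found in package stats rather than stepfun.

```r
set.seed(1)
vec <- sort(runif(1000))        # break points, sorted and distinct
x   <- runif(5000, -0.1, 1.1)   # unsorted evaluation points, some outside range(vec)

# number of vec[] entries <= x[i]
fi <- findInterval(x, vec)

# N * ecdf(): ecdf() returns a proportion, so rescale and round to a count
ec <- round(length(vec) * ecdf(vec)(x))

# approx() as a left-continuous step function through the points (vec[i], i)
ap <- approx(vec, seq_along(vec), xout = x, method = "constant",
             yleft = 0, yright = length(vec), f = 0)$y

stopifnot(all(fi == ec), all(fi == ap))
```

For a piecewise function, one would typically call findInterval(sort(x), vec)
and use the result to index the per-interval coefficients.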

    R> xx <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
    R> xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)
    R> findInterval(xx, xx.y)
    DavidB> [1] 1 3 1 1 3 4 1 2 2 2
    R> cut(xx, c(-Inf,xx.y,Inf), labels=FALSE, right=FALSE) - 1
    DavidB> [1] 1 3 1 1 3 4 1 2 2 2
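
On the question of duplicates in `vec', a small sketch with made-up values
(`vec.dup' is a hypothetical name): findInterval() only needs `vec' to be
non-decreasing, whereas cut() stops with an error when its breaks are not
unique, so the two are not interchangeable in that case.

```r
vec.dup <- c(1, 2, 2, 3)        # non-decreasing, with a tie
x <- c(0.5, 1, 2, 2.5, 3)

# ties in vec are accepted; the result counts the vec entries <= x
fi <- findInterval(x, vec.dup)

# cut() refuses non-unique breaks; capture the error message instead of stopping
ct <- tryCatch(cut(x, c(-Inf, vec.dup, Inf), labels = FALSE, right = FALSE) - 1,
               error = function(e) conditionMessage(e))

print(fi)
print(ct)
```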

cut() is still considerably slower than the ecdf() / approx() version
for long `vec'  ...
I really should write a small article about this for "R News",
where I'd also show the simulation results...
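
Until then, a rough way to try the comparison oneself might look like the
sketch below.  The sizes, the helper `time.it' and the vector `timings' are
assumptions chosen for illustration, not the simulations referred to above,
and the numbers will depend on the R version and the machine.

```r
set.seed(2)
vec <- sort(runif(1e4))   # long vector of break points
x   <- runif(1e5)         # unsorted evaluation points
xs  <- sort(x)            # the same points, sorted

# elapsed wall-clock seconds for evaluating one expression
time.it <- function(expr) system.time(expr)[["elapsed"]]

timings <- c(
  findInterval.unsorted = time.it(findInterval(x,  vec)),
  findInterval.sorted   = time.it(findInterval(xs, vec)),
  approx.constant       = time.it(approx(vec, seq_along(vec), xout = x,
                                         method = "constant",
                                         yleft = 0, yright = length(vec))),
  cut                   = time.it(cut(x, c(-Inf, vec, Inf),
                                      labels = FALSE, right = FALSE))
)
print(timings)
```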

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><