[R] Median of streaming data

Rolf Turner r.turner at auckland.ac.nz
Sat Sep 27 00:53:08 CEST 2014

On 26/09/14 21:48, Martin Maechler wrote:
>>>>>> Rolf Turner <r.turner at auckland.ac.nz>


>      > I have coded up the algorithm from the Cameron and Turner
>      > paper.  Dunno if it gives exactly the same results as my
>      > (Splus?) code from lo these many years ago (the code that
>      > is lost in the mists of time), but it *seems* to work.
> excellent, thank you, Rolf!
>      > It is not designed to work with actual "streaming" data
>      > --- I don't know how to do that.  It takes a complete data
>      > vector as input.  Someone who knows about streaming data
>      > should be able to adapt it pretty easily.  Said he, the
>      > proverbial optimist.
> I agree; that should not be hard.
> One way is to replace   'y[ind]' by   'getY(ind)' everywhere in the code
> and let 'getY' be an argument to rlas() provided by the user.
>      > The function code and a help file are attached.  These
>      > files have had their names changed to end in ".txt" so
>      > that they will get through the mailing list processor
>      > without being stripped.  With a bit of luck.
> ;-)
> It did work indeed.
> I've added them to  'robustX' -- on R-forge,
> including a plot() method and some little more flexibility.
>    --> https://r-forge.r-project.org/R/?group_id=59


Since I posted my previous email I have found some typos in the 
documentation and made some adjustments to the code.

I also realized that the name "rlas" sounds a bit too much like a random 
number generator, according to R's conventions.  So I have decided that 
"lasr" would be a better name.

I hope this change of horses in midstream doesn't mess things up too 
much for you.  If it does, of course feel free to stick with "rlas".

I have attached revised *.R and *.Rd files.  Not sure that the *.Rd file 
will get through as-is.  Please let me know if it doesn't.



Rolf Turner
Technical Editor ANZJS
-------------- next part --------------
lasr <- function(y,b=0.2,mfn=function(n){0.1*n^(-0.25)},
                 nstart=30,scon=NULL) {
# Initialize:
y0    <- y[1:nstart]
alpha <- median(y0)
s     <- if(is.null(scon)) mean(abs(y0-alpha)) else scon
mu    <- mfn(nstart)/s
eps   <- s/nstart^b
kount <- sum(abs(alpha-y0) < eps)
g     <- kount/(eps*nstart)
ny    <- length(y)
n     <- nstart+1
locn  <- numeric(ny)
locn[1:(nstart-1)] <- NA
locn[nstart] <- alpha
scale  <- numeric(ny)
scale[1:(nstart-1)] <- if(is.null(scon)) NA else scon
scale[nstart] <- s

# Calculate recursively:
while(n <= ny) {
   s     <- if(is.null(scon)) ((n-1)*s + abs(y[n]-alpha))/n else scon
   mu    <- mfn(n)/s
   eye   <- if(abs(alpha - y[n]) < s/n^b) 1 else 0
   g     <- (1-1/n)*g + n^(b-1)*eye/s
   a     <- max(mu,g)
   alpha <- alpha + sign(y[n]-alpha)/(a*n)
   locn[n]  <- alpha
   scale[n] <- s
   n <- n+1
-------------- next part --------------
  Recursive location and scale.
  Calculate an estimate of location, asymptotically
  equivalent to the median, and an estimate of scale
  equal to the \bold{MEAN} absolute deviation.  Both
  done recursively.
lasr(y, b = 0.2, mfn = function(n){0.1 * n^(-0.25)},
     nstart = 30, scon = NULL)
  A numeric vector of i.i.d. data whose location and scale
  parameters are to be estimated.
  A tuning parameter (default value equal to that used by
  Holst, 1987).
  Function of the index of the data which must be positive
  and tend to 0 as the index tends to infinity.  The default
  function is that used by Holst, 1987.
  Starting values for the algorithm are formed from the first
  \code{nstart} values of \code{y}.  The default value is that
  used in Cameron and Turner, 1993.
  A constant value for the scale parameter \code{s}. If this
  is provided then the algorithm becomes equivalent to the
  algorithm of Holst, 1987.
A list with entries
  \item{locn}{The successive recursive estimates of location.  The
  first \code{nstart - 1} of these are \code{NA}.}
  \item{scale}{The successive recursive estimates of scale. If
  \code{scon} is specified these are all equal to \code{scon}. Otherwise
  the first \code{nstart - 1} values are \code{NA}.}

The \bold{mean} absolute deviation is used as an estimate of scale
(rather than the more expected \bold{median} absolute deviation) simply
because the former can be calculated recursively.

Cameron, Murray A. and Turner, T. Rolf (1993). Recursive location and
scale estimators. \emph{Commun. Statist. --- Theory Meth.} \bold{22}
(9) 2503--2515.

Holst, U. (1987). Recursive estimators of location.
\emph{Commun. Statist. --- Theory Meth.} \bold{16} (8) 2201--2226.
 \email{r.turner at auckland.ac.nz}
y  <- rnorm(10000)
z1 <- lasr(y)
z2 <- lasr(y,scon=1)
z3 <- lasr(y,scon=100)
OP <- par(mfrow=c(3,1))

\keyword{ univar }
\keyword{ robust }

More information about the R-help mailing list