[R] lokern package
Ravi Varadhan
RVaradhan at jhmi.edu
Thu Jul 2 19:00:03 CEST 2009
Dear Martin,
Thank you for the propmt and highly informative reply!
In my experiments, I compared (a) smooth.spline (with both GCV and CV for
automatic bw selection), (b) spm() in "SemiPar", which implements penalized
splines smoothing with REML estimation of penalty, (c) glkerns() in
"lokern", which implements the Gasser-Mueller kernel methodology, and (d)
locpoly() in "KernSmooth" with direct-plugin method of Ruppert and Wand to
estimate bandwidth. All of these smothers also have the derivative
estimation.
The key point that you observed as the reason for why glkerns() cannot be
made more efficient, without making it inferior, is that the bandwidth that
is "optimal" for function estimation is not necessarily optimal for
derivative estimation. This is explicitly considered in derivative
estimation in glkerns(). Other smoothers, however, do not recognize this,
i.e. they use the same bandwith (and the same function estimate, of course)
that is optimal only for function estimation to estimate derivatives. In
fact, this is one of the main points of my work. Not surprisingly,
glkerns() does a nice job of estimating f, f', and f'', whereas the other
methods, with the exception of spm(), get progressively worse with
increasing order of derivatives (I use RMISE - root-mean-integrated squared
error to judge quality of smoothers). A slight plug - I will be presenting
these results at UseR! 2009 next week.
I knew that glkerns(x, y, deriv=1) is a completely independent estimation
from glkerns(x, y, deriv=0). What I was really wondering was whether it
might not be possible to have a call such as: glkerns(x, y, deriv=c(0, 1,
2), x.out=x.out). This would minimize some overhead associated with setting
up repeated calls each time in R and Fortran. While the savings may not be
huge for a single function smoothing, it might be substantial for large
numbers of time-series.
Best regards,
Ravi.
----------------------------------------------------------------------------
-------
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: rvaradhan at jhmi.edu
Webpage:
http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.h
tml
----------------------------------------------------------------------------
--------
-----Original Message-----
From: Martin Maechler [mailto:maechler at stat.math.ethz.ch]
Sent: Thursday, July 02, 2009 12:35 PM
To: Ravi Varadhan
Cc: r-help at r-project.org; Martin Maechler
Subject: Re: [R] lokern package
>>>>> "RV" == Ravi Varadhan <RVaradhan at jhmi.edu>
>>>>> on Thu, 2 Jul 2009 10:51:03 -0400 writes:
RV> Dear Martin, I have been playing a lot with the
RV> glkerns() function in the "lokern" package for
RV> "automatic" smoothing of time-series data. This kernel
RV> smoothing approach of Gasser and Mueller seems to
RV> perform quite well for estimating the function and its
RV> derivatives (first and second derivatives). In fact,
RV> this is one of the best methods based on my simulation
RV> studies for comparing a number of "automatic" bandwidth
RV> selection methods.
That's quite interesting!
What methods did you use in your comparison?
[I assume you did use smooth.spline() as well]
RV> I am interested in applying this to
RV> automatic smoothing and feature extraction for a "large"
RV> number (thousands) of time series, with hundreds of
RV> points per time series. This is where I am interested
RV> in seeing if the efficiency of "glkerns" can be
RV> improved.
RV> Here follows my specific question:
RV> You have to call glkerns() separately each time to
RV> compute the function and its derivatives, ie. if I want
RV> the function and its first 2 derivatives, I have to make
RV> 3 calls to glkerns().
Yes. Note however that each of these calls chooses a different kernel order
('korder') by default { korder <- deriv + 2 }, and uses *different*
automatic bandwidths depending on both deriv and korder.
Consequently, the result of glkerns(x,y, deriv=1) is
*not* the derivative of the estimate glkerns(x,y, deriv=0) but rather an
independent estimate of the derivative of the unknown g(). The same applies
for deriv=2 etc.
But the problem is even "worse": Let's assume you get the "optimal" global
bandwidth estimate 'bandwidth' = h from the deriv=0 case.
Then you could set this bandwidth explicitly for the deriv=1 and
deriv=2 calls {and you'd gain quite a bit of execution time !}.
*HOWEVER*, as the internal codes absolutely require
k_{ord} - nu == korder - nu to be an *even* number,
you can *not* keep 'korder' fixed together with 'bandwidth'.
But if you change 'korder', the kernels change and the meaning of the
bandwidth has changed too.
I have just uploaded version 1.0-8 of package 'lokern' to CRAN, and this now
contains a demo("gkl-derivs") which defines a utility function to play a bit
with this (keeping bandwidth fixed).
RV> This seems to me to be inefficient, especially for large
RV> time series. In smooth.spline(), for example, you can
RV> call it once to get the fit, and then use the fitted
RV> object to compute the derivatives using predict().
Note that smooth.spline() is very different:
If you use deriv=1 or deriv=2 in the predict() call, you get exactly the 1st
or 2nd derivative of the same smoothing spline function.
RV> Is it possible to have a similar feature in glkerns?
As I have explained, this is not easily possible currently more for
conceptual reasons than others.
In principle it should be possible however, but that would both need some
"theoretical" work {maybe just collecting the papers and formulae used} and
then some Fortran code digging and shuffling.
Would be a nice "semester work" for a student...
Martin
--
Martin <Maechler at stat.math.ethz.ch> http://stat.ethz.ch/people/maechler
Seminar für Statistik, ETH Zürich HG G 16 Rämistrasse 101
CH-8092 Zurich, SWITZERLAND
phone: +41-44-632-3408 fax: ...-1228 <><
RV> Thanks for any suggestions.
RV> Ravi.
RV>
----------------------------------------------------------------------------
RV> -------
RV> Ravi Varadhan, Ph.D.
RV> Assistant Professor, The Center on Aging and Health
RV> Division of Geriatric Medicine and Gerontology
RV> Johns Hopkins University
RV> Ph: (410) 502-2619
RV> Fax: (410) 614-9625
RV> Email: rvaradhan at jhmi.edu
RV> Webpage:
RV>
http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.h
RV> tml
RV>
----------------------------------------------------------------------------
RV> --------
RV> [[alternative HTML version deleted]]
RV> ______________________________________________
RV> R-help at r-project.org mailing list
RV> https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do
RV> read the posting guide
RV> http://www.R-project.org/posting-guide.html and provide
RV> commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list