[R] C/C++/Fortran Rolling Window Regressions
jeremiah rounds
roundsjeremiah at gmail.com
Thu Jul 21 22:56:17 CEST 2016
I appreciate the timing, so much so I changed the code to show the issue.
It is a problem of scale.
roll_lm probably has a heavy start-up cost but otherwise completely
out-performs those other versions at scale. I suspect you are timing the
nearly constant time start-up cost in small data. I did give code to
paint a picture, but it was just cartoon code lifted from stackexchange.
If you want to characterize the real problem it is closer to:
30 day rolling windows on 24 daily (by hour) measurements for 5 years with
24+7 -1 dummy predictor variables and finally you need to do this for 300
sets of data.
Pseudo-code is closer to what follows and roll_lm can handle that input in
a timely manner. You can do it with lm.fit, but you need to spend a lot of
time waiting. The issue of accuracy needs a follow-up check. Not sure why
it would be different. Worth a check on that.
Thanks,
Jeremiah
library(rbenchmark)
N = 30*24*12*5
window = 30*24
npred = 15 #15 chosen arbitrarily...
set.seed(1)
library(zoo)
library(RcppArmadillo)
library(roll)
x = matrix(rnorm(N*(npred+1)), ncol = npred+1)
colnames(x) <- c("y", paste0("x", 1:npred))
z <- zoo(x)
benchmark(
roll_lm = roll_lm(coredata(z[, 1, drop = F]), coredata(z[, -1, drop =
F]), window,
center = FALSE), replications=3)
Which comes out as:
test replications elapsed relative user.self sys.self user.child
sys.child
1 roll_lm 3 6.273 1 38.312 0.654 0
0
## You arn't going to get that below...
benchmark(fastLm = rollapplyr(z, width = window,
function(x) coef(fastLm(cbind(1, x[, -1]), x[, 1])),
by.column = FALSE),
lm = rollapplyr(z, width = window,
function(x) coef(lm(y ~ ., data = as.data.frame(x))),
by.column = FALSE), replications=3)
On Thu, Jul 21, 2016 at 1:28 PM, Gabor Grothendieck <ggrothendieck at gmail.com
> wrote:
> I would be careful about making assumptions regarding what is faster.
> Performance tends to be nonintuitive.
>
> When I ran rollapply/lm, rollapply/fastLm and roll_lm on the example
> you provided rollapply/fastLm was three times faster than roll_lm. Of
> course this could change with data of different dimensions but it
> would be worthwhile to do actual benchmarks before making assumptions.
>
> I also noticed that roll_lm did not give the same coefficients as the
> other two.
>
> set.seed(1)
> library(zoo)
> library(RcppArmadillo)
> library(roll)
> z <- zoo(matrix(rnorm(10), ncol = 2))
> colnames(z) <- c("y", "x")
>
> ## rolling regression of width 4
> library(rbenchmark)
> benchmark(fastLm = rollapplyr(z, width = 4,
> function(x) coef(fastLm(cbind(1, x[, 2]), x[, 1])),
> by.column = FALSE),
> lm = rollapplyr(z, width = 4,
> function(x) coef(lm(y ~ x, data = as.data.frame(x))),
> by.column = FALSE),
> roll_lm = roll_lm(coredata(z[, 1, drop = F]), coredata(z[, 2, drop =
> F]), 4,
> center = FALSE))[1:4]
>
>
> test replications elapsed relative
> 1 fastLm 100 0.22 1.000
> 2 lm 100 0.72 3.273
> 3 roll_lm 100 0.64 2.909
>
> On Thu, Jul 21, 2016 at 3:45 PM, jeremiah rounds
> <roundsjeremiah at gmail.com> wrote:
> > Thanks all. roll::roll_lm was essentially what I wanted. I think
> maybe
> > I would prefer it to have options to return a few more things, but it is
> > the coefficients, and the remaining statistics you might want can be
> > calculated fast enough from there.
> >
> >
> > On Thu, Jul 21, 2016 at 12:36 PM, Achim Zeileis <
> Achim.Zeileis at uibk.ac.at>
> > wrote:
> >
> >> Jeremiah,
> >>
> >> for this purpose there are the "roll" and "RcppRoll" packages. Both use
> >> Rcpp and the former also provides rolling lm models. The latter has a
> >> generic interface that let's you define your own function.
> >>
> >> One thing to pay attention to, though, is the numerical reliability.
> >> Especially on large time series with relatively short windows there is a
> >> good chance of encountering numerically challenging situations. The QR
> >> decomposition used by lm is fairly robust while other more
> straightforward
> >> matrix multiplications may not be. This should be kept in mind when
> writing
> >> your own Rcpp code for plugging it into RcppRoll.
> >>
> >> But I haven't check what the roll package does and how reliable that
> is...
> >>
> >> hth,
> >> Z
> >>
> >>
> >> On Thu, 21 Jul 2016, jeremiah rounds wrote:
> >>
> >> Hi,
> >>>
> >>> A not unusual task is performing a multiple regression in a rolling
> window
> >>> on a time-series. A standard piece of advice for doing in R is
> >>> something
> >>> like the code that follows at the end of the email. I am currently
> using
> >>> an "embed" variant of that code and that piece of advice is out there
> too.
> >>>
> >>> But, it occurs to me that for such an easily specified matrix operation
> >>> standard R code is really slow. rollapply constantly returns to R
> >>> interpreter at each window step for a new lm. All lm is at its heart
> is
> >>> (X^t X)^(-1) * Xy, and if you think about doing that with Rcpp in
> rolling
> >>> window you are just incrementing a counter and peeling off rows (or
> >>> columns
> >>> of X and y) of a particular window size, and following that up with
> some
> >>> matrix multiplication in a loop. The psuedo-code for that Rcpp
> >>> practically writes itself and you might want a wrapper of something
> like:
> >>> rolling_lm (y=y, x=x, width=4).
> >>>
> >>> My question is this: has any of the thousands of R packages out there
> >>> published anything like that. Rolling window multiple regressions that
> >>> stay in C/C++ until the rolling window completes? No sense and
> writing it
> >>> if it exist.
> >>>
> >>>
> >>> Thanks,
> >>> Jeremiah
> >>>
> >>> Standard (slow) advice for "rolling window regression" follows:
> >>>
> >>>
> >>> set.seed(1)
> >>> z <- zoo(matrix(rnorm(10), ncol = 2))
> >>> colnames(z) <- c("y", "x")
> >>>
> >>> ## rolling regression of width 4
> >>> rollapply(z, width = 4,
> >>> function(x) coef(lm(y ~ x, data = as.data.frame(x))),
> >>> by.column = FALSE, align = "right")
> >>>
> >>> ## result is identical to
> >>> coef(lm(y ~ x, data = z[1:4,]))
> >>> coef(lm(y ~ x, data = z[2:5,]))
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
[[alternative HTML version deleted]]
More information about the R-help
mailing list