[R] Out-of-sample prediction with VAR
peter at linelink.nl
Sun Feb 7 23:37:16 CET 2010
Good day,
I'm using a VAR model to forecast sales with some extra variables (Google
Trends data). I have divided my dataset into a training set (weekly sales +
vars in 2006 and 2007) and a holdout set (2008).
It is unclear to me how I should predict the out-of-sample data, because
the predict() function in the vars package seems to forecast my
Google Trends vars as well. However, I want to forecast the sales figures
given the actual Google Trends data.
My questions:
1. How should I do this? I currently extract the linear model that VAR()
(with p=3) fits for the sales equation and use it to predict the holdout
set, but that seems inappropriate?
2. If I am doing it right, how is it possible that an
automatically fitted model with more variables actually performs worse
(in terms of MAPE)? Shouldn't it predict at least as well as the
simple AR(3), by finding that the extra variables have no added value?
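To make question 1 concrete, here is a minimal sketch of the kind of forecast I am after. It is not my actual code: it uses simulated data and a single-equation regression with one lag instead of the vars package, and all names (x, y, fit, y_hat) are assumptions for illustration only. The sales lags in the holdout period come from the model's own forecasts, while the trends lags come from the observed series.

```r
set.seed(1)
n <- 155

# Simulated stand-ins: x plays the role of a Google Trends series, y of detrended sales
x <- as.numeric(arima.sim(list(ar = 0.5), n = n))
y <- numeric(n)
for (t in 2:n) y[t] <- 0.6 * y[t - 1] + 0.3 * x[t - 1] + rnorm(1, sd = 0.1)

# Row k of d holds the target y[k + 1] and its lag-1 regressors
d <- data.frame(y = y[-1], y.l1 = y[-n], x.l1 = x[-n])
train <- d[1:103, ]                      # targets y[2], ..., y[104]
fit <- lm(y ~ y.l1 + x.l1, data = train)

# Forecast weeks 105..155: lagged y is the previous forecast, lagged x is observed
h <- 105:n
y_hat <- numeric(length(h))
y_prev <- y[104]                         # last observed training value
for (i in seq_along(h)) {
  nd <- data.frame(y.l1 = y_prev, x.l1 = x[h[i] - 1])
  y_hat[i] <- predict(fit, newdata = nd)
  y_prev <- y_hat[i]
}
```

Replacing y_prev with the observed y[h[i] - 1] would instead give one-step-ahead forecasts, which is closer to what predicting from the extracted linear model on a lagged holdout matrix does.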
My code:
ts_Y <- ts(log_residuals[1:104]); # detrended sales data
ts_XGG <- ts(salesmodeldata$gtrends_global[1:104]);
ts_XGL <- ts(salesmodeldata$gtrends_local[1:104]);
training_matrix <- data.frame(ts_Y, ts_XGG, ts_XGL);
### Try VAR(3)
var_model <- VAR (y=training_matrix, p=3, type="both", season=NULL,
exogen=NULL, lag.max=NULL);
## Out of sample forecasting
var.lm = lm(var_model$varresult$ts_Y); # the generated LM
ts_Y <- ts(log_residuals[105:155]);
ts_XGG <- ts(salesmodeldata$gtrends_global[105:155]);
ts_XGL <- ts(salesmodeldata$gtrends_local[105:155]);
# Notice how I manually create the lagged values to be used in the Linear Model
holdout_matrix <- na.omit(data.frame(ts.union(ts_Y, ts_XGG, ts_XGL,
  ts_Y.l1 = lag(ts_Y,-1), ts_Y.l2 = lag(ts_Y,-2), ts_Y.l3 = lag(ts_Y,-3),
  ts_XGG.l1 = lag(ts_XGG,-1), ts_XGG.l2 = lag(ts_XGG,-2), ts_XGG.l3 = lag(ts_XGG,-3),
  ts_XGL.l1 = lag(ts_XGL,-1), ts_XGL.l2 = lag(ts_XGL,-2), ts_XGL.l3 = lag(ts_XGL,-3),
  const=1, trend=0.0001514194 )));
var.predict = predict(object=var_model, n.ahead=52, dumvar=holdout_matrix);
## Assess accuracy
calc_mape (holdout_matrix$ts_Y, var.predict, islog=T, print=T)
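calc_mape() is not a base R or vars function, and its definition is not included here. A hypothetical stand-in, assuming that islog=TRUE means both series are on the log scale and should be exponentiated before the percentage errors are computed (the name mape_sketch and this behaviour are assumptions, not the real helper):

```r
# Hypothetical stand-in for calc_mape(); the real helper is not shown in this post
mape_sketch <- function(actual, predicted, islog = FALSE) {
  actual <- as.numeric(actual)
  predicted <- as.numeric(predicted)
  stopifnot(length(actual) == length(predicted))
  if (islog) {
    # Assumed meaning of islog: back-transform from logs before comparing
    actual <- exp(actual)
    predicted <- exp(predicted)
  }
  # Mean absolute percentage error, in percent
  100 * mean(abs((actual - predicted) / actual))
}

mape_sketch(c(100, 200), c(110, 180))
```

Note that a function like this expects a plain numeric vector of forecasts with the same length as the actual values.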
Some context:
For my Master's thesis I'm using R to test the predictive power of web
metrics (such as Google Trends data and pageviews) in sales forecasting. To
properly assess this, I employ a simple AR model (for time series without
the extra variables) and a VAR model for the predictions with the extra
variables. I also fit a random forest with and without the buzz
variables to see whether MAPE improves.
Many thanks in advance!