[R] Possible overfitting of a GAM

Thomas L Jones, PhD jones3745 at verizon.net
Sat Feb 16 23:25:03 CET 2008


The subject is a Generalized Additive Model. Experts caution us against 
overfitting the data, which can cause inaccurate results. I am not a 
statistician (my background is in Computer Science). Perhaps some kind soul 
would take a look and vet the model for overfitting the data.

The study estimated the ebb and flow of traffic through a voting place. Just 
one voting place was studied; the election was the U.S. mid-term election 
about a year ago. Procedure: The voting day was divided into five-minute 
bins, and the number of voters arriving in each bin was recorded. The voting 
day was 13 hours long, giving 156 bins.
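
The binning step can be sketched in R roughly as follows. This is only an illustration: the vector arrival_min (arrival times in minutes after the polls open) is simulated here and is not the study's data, and all variable names are made up for the example.

```r
# Hypothetical arrival times, in minutes after the polls open
# (simulated for illustration; not the study data)
set.seed(1)
arrival_min <- runif(2000, min = 0, max = 13 * 60)

# Five-minute bins over a 13-hour day: 13 * 60 / 5 = 156 bins
breaks <- seq(0, 13 * 60, by = 5)
bin    <- cut(arrival_min, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Number of voters arriving in each bin (empty bins are kept as zeros)
count <- as.vector(table(bin))

length(count)  # 156
sum(count)     # 2000
```

Narrow bins keep more timing detail, but each count is small, so the bin-to-bin noise is relatively large.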

See http://tinyurl.com/36vzop for the scatterplot. The counts show rather 
high random variation, due in part to the fact that the bin width was 
intentionally set to be narrow, in order to capture as much timing 
information as possible.

http://tinyurl.com/3xjsyo displays the fitted curve. A GAM was used, with 
the loess smoothing algorithm (locally weighted regression). The default 
span was used. http://tinyurl.com/34av6l gives the scatterplot and the 
fitted curve. The two seem to match reasonably well.
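
A minimal sketch of a fit of this kind, under stated assumptions: the counts are simulated from a made-up smooth arrival rate, and base R's loess() with its default span stands in for the GAM's loess smoother. It is not the code used in the study, and the default span of loess() need not match the default used by the GAM fit.

```r
# Simulated five-minute counts (hypothetical; not the study data)
set.seed(1)
minute <- seq(2.5, 13 * 60 - 2.5, by = 5)        # bin midpoints, 156 bins
rate   <- 8 + 6 * sin(2 * pi * minute / 780)     # made-up smooth arrival rate
count  <- rpois(length(minute), lambda = rate)

# Locally weighted regression with the default span,
# used here as a stand-in for the GAM's loess term
fit <- loess(count ~ minute)

# Scatterplot with the fitted curve overlaid
plot(minute, count, xlab = "Minutes after opening", ylab = "Voters per bin")
lines(minute, fitted(fit), lwd = 2)
```

Refitting with a smaller or larger span argument shows how strongly the smoothness of the curve depends on that one setting, which is the usual place to look when overfitting is suspected.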

However, when I tried to generate the standard errors, things went awry. 
(Please see http://tinyurl.com/38ej2t ) There are three curves, seemingly 
the fitted curve and the curves for plus and minus two standard errors. The 
shapes look right, but the y values are badly off.
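
For comparison, here is a hedged sketch of how a fitted curve with plus and minus two standard errors can be computed so that all three curves are on the same scale as the raw counts. The data are again simulated, and base R's loess() and its predict() method stand in for the GAM; this is an assumption made for the example, not the method that produced the plot above.

```r
# Simulated five-minute counts (hypothetical; not the study data)
set.seed(1)
minute <- seq(2.5, 13 * 60 - 2.5, by = 5)
count  <- rpois(length(minute), lambda = 8 + 6 * sin(2 * pi * minute / 780))

fit <- loess(count ~ minute)

# Pointwise predictions with standard errors: a list with $fit and $se.fit
pred <- predict(fit, newdata = data.frame(minute = minute), se = TRUE)

upper <- pred$fit + 2 * pred$se.fit
lower <- pred$fit - 2 * pred$se.fit

# Fitted curve (solid) and the +/- 2 SE curves (dashed), over the raw counts
matplot(minute, cbind(pred$fit, lower, upper), type = "l",
        lty = c(1, 2, 2), col = 1,
        xlab = "Minutes after opening", ylab = "Voters per bin")
points(minute, count, pch = 20, cex = 0.5)

# Sanity check: the fitted values should lie within the range of the counts
range(count)
range(pred$fit)
```

If the two ranges printed at the end differ widely, the curves are probably being drawn on a different scale from the data rather than being a sign of overfitting.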

Question: Have I overfitted the data?

Feedback?

Tom
Thomas L. Jones, PhD, Computer Science


