[R] Strange question/result about SVM
Ravi Varadhan
RVaradhan at jhmi.edu
Mon Sep 14 19:12:28 CEST 2009
Noah,
It may be just me - but how does "any" of your questions on prediction
modeling relate to R?
It seems to me that you have been getting a lot of "free" consulting from
this forum that is supposed to be a forum for help on R-related issues.
Ravi.
-----------------------------------------------------------------------------------
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: rvaradhan at jhmi.edu
Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.html
------------------------------------------------------------------------------------
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Noah Silverman
Sent: Monday, September 14, 2009 1:00 PM
To: r help
Subject: [R] Strange question/result about SVM
Hello,
I have a very unusual situation with an SVM and wanted to get the
group's opinion.
We developed an experiment where we train the SVM with one set of data
(train data) and then test with a completely independent set of data
(test data). The results were VERY good.
I found an error in how we generate one of our training variables. We
discovered that it was indirectly influenced by future events. Clearly
that needed to be fixed. Fixing the variable immediately changed our
results from good to terrible. (Not a surprise since the erroneous
variable had future influence.)
A friend, who knows NOTHING of statistics or math, innocently asked,
"Why don't you just keep that variable, since it seems to make your
results so much better?" The idea, while naive, got me thinking. We
can include the future-influenced variable in the training set, since
those events have already occurred, but what do we do with the test data
from today? As a test, I tried simply setting the variable to the
average of its values in the training data. The results were great!
Since the data is scaled, and we set the variable to a single constant
(the training-data average), it scaled to 0. Still, great results.
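
To make the scaling remark concrete, here is a minimal sketch in plain Python. It assumes the scaling is ordinary standardization (subtract the training mean, divide by the training standard deviation); the numbers are made up for illustration and are not from the experiment described above.

```python
from statistics import mean, stdev

# Hypothetical training values for the leaked variable (made-up numbers).
train_values = [2.0, 4.0, 6.0, 8.0]

# Scaling parameters are estimated from the training data only.
center = mean(train_values)
spread = stdev(train_values)

def scale(x):
    """Standardize x with the training mean and standard deviation."""
    return (x - center) / spread

# At test time the variable is replaced by the training average,
# so every test row gets the same scaled value.
test_value = center
scaled_test_value = scale(test_value)
print(scaled_test_value)
```

Whatever the training values are, a test value equal to the training mean always standardizes to zero, so the variable carries no information that distinguishes one test row from another.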
To summarize:
Bad var in training + Bad var in testing = great results
Good var in training + Good var in testing = bad results
Bad var in training + Constant in testing = great results
I'm not an expert with the internals of the SVM, but clearly the bad
variable is setting some kind of threshold or intercept when defining
the model. Can someone help me figure out why/how this is working?
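
One way to reason about the question, sketched in plain Python under a strong assumption: the kernel is linear, so the decision function is f(x) = w . x + b. The weights and bias below are hypothetical, not taken from any fitted model. With a nonlinear kernel such as RBF, the zeroed variable still enters the distance to each support vector, so this sketch would not apply directly.

```python
# Hypothetical weights of a linear decision function f(x) = w . x + b.
# The last weight belongs to the leaked variable; all values are made up.
weights = [0.8, -0.5, 2.0]
bias = 0.1

def decision(features):
    """Linear decision value: dot product of weights and features, plus bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Test row with the leaked variable scaled to 0.
with_constant = decision([1.0, 2.0, 0.0])

# The same row scored with the leaked variable dropped entirely.
without_variable = sum(w * x for w, x in zip(weights[:2], [1.0, 2.0])) + bias

print(with_constant, without_variable)
```

In this linear sketch the two scores agree: the zeroed variable adds nothing at test time, and predictions depend only on the other variables and the bias, both of which were fitted while the leaked variable was present in training.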
Thanks!
--
N
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.