[R] quantreg speed
Yunqi Zhang
yqzhang at ucsd.edu
Sun Nov 16 02:40:48 CET 2014
Hi William,
Thank you very much for your reply.
I subsampled the data down to ~1.8 million records. The fit seems to work
fine except at the 99th percentile, where the p-values for all the
features are 1.0. Does this mean I’m subsampling too much? How should I
interpret the result?
tau: [1] 0.25
Coefficients:
                 Value Std. Error    t value Pr(>|t|)
(Intercept)   72.15700    0.03651 1976.10513  0.00000
f1            -0.51000    0.04906  -10.39508  0.00000
f2           -20.44200    0.03933 -519.78766  0.00000
f3            -2.37000    0.04871  -48.65117  0.00000
f1:f2         -2.52500    0.05315  -47.50361  0.00000
f1:f3          1.03600    0.06573   15.76193  0.00000
f2:f3          3.41300    0.05247   65.05075  0.00000
f1:f2:f3      -0.83800    0.07120  -11.77002  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.5
Coefficients:
                 Value Std. Error    t value Pr(>|t|)
(Intercept)   83.80900    0.05626 1489.61222  0.00000
f1            -0.92200    0.07528  -12.24692  0.00000
f2           -27.90700    0.05937 -470.07189  0.00000
f3            -6.45000    0.07204  -89.53909  0.00000
f1:f2         -2.66500    0.07933  -33.59275  0.00000
f1:f3          1.99000    0.09869   20.16440  0.00000
f2:f3          7.09600    0.07611   93.23813  0.00000
f1:f2:f3      -1.71200    0.10390  -16.47660  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.75
Coefficients:
                 Value Std. Error    t value Pr(>|t|)
(Intercept)  102.71700    0.10175 1009.45946  0.00000
f1            -1.59300    0.13241  -12.03125  0.00000
f2           -40.64200    0.10623 -382.58456  0.00000
f3           -14.40900    0.12096 -119.11988  0.00000
f1:f2         -2.97600    0.13867  -21.46071  0.00000
f1:f3          3.74600    0.16335   22.93165  0.00000
f2:f3         14.14800    0.12692  111.47217  0.00000
f1:f2:f3      -3.16400    0.17159  -18.43899  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.9
Coefficients:
                 Value Std. Error    t value Pr(>|t|)
(Intercept)  130.89400    0.20609  635.12464  0.00000
f1            -2.55500    0.28139   -9.07995  0.00000
f2           -60.90500    0.21322 -285.64558  0.00000
f3           -29.42300    0.23409 -125.69092  0.00000
f1:f2         -2.77700    0.29052   -9.55870  0.00000
f1:f3          7.89700    0.33308   23.70870  0.00000
f2:f3         27.78100    0.24338  114.14722  0.00000
f1:f2:f3      -6.95800    0.34491  -20.17327  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.95
Coefficients:
                 Value Std. Error    t value Pr(>|t|)
(Intercept)  157.45900    0.42733  368.47413  0.00000
f1            -4.10200    0.55834   -7.34678  0.00000
f2           -81.24000    0.44012 -184.58697  0.00000
f3           -46.17500    0.46235  -99.87033  0.00000
f1:f2         -2.01700    0.57651   -3.49866  0.00047
f1:f3         15.67000    0.67409   23.24600  0.00000
f2:f3         43.00100    0.47973   89.63500  0.00000
f1:f2:f3     -14.05100    0.69737  -20.14843  0.00000
Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
0.75, 0.9, 0.95, 0.99), data = data_stats)
tau: [1] 0.99
Coefficients:
                    Value   Std. Error      t value     Pr(>|t|)
(Intercept)  2.544860e+02 3.878303e+07 1.000000e-05 9.999900e-01
f1          -1.420000e+01 5.917548e+11 0.000000e+00 1.000000e+00
f2          -1.582920e+02 3.450261e+07 0.000000e+00 1.000000e+00
f3          -1.139210e+02 4.763057e+07 0.000000e+00 1.000000e+00
f1:f2        5.725000e+00 1.324283e+12 0.000000e+00 1.000000e+00
f1:f3        6.811780e+02 1.153645e+13 0.000000e+00 1.000000e+00
f2:f3        1.042510e+02 2.299953e+24 0.000000e+00 1.000000e+00
f1:f2:f3    -6.763210e+02 2.299953e+24 0.000000e+00 1.000000e+00
Warning message:
In summary.rq(xi, ...) : 288000 non-positive fis
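
One possible reading of the warning, offered as a sketch rather than a
diagnosis: with the "nid" standard errors, summary.rq estimates a local
density at each observation from the difference between the fits at
tau + h and tau - h, and "non-positive fis" counts the observations where
that difference is not positive. With 288000 such cases at tau = 0.99 the
covariance estimate may be close to degenerate, which would explain huge
standard errors and p-values of 1 even though the coefficients themselves
look plausible. The example below uses simulated data (dat, x, y are made
up for illustration and are not the data above) and asks for bootstrap
standard errors through the se argument instead. Whether the bootstrap is
affordable on 1.8 million rows has not been tried here.

```r
# Hedged sketch on simulated data; it only runs if quantreg is installed.
if (requireNamespace("quantreg", quietly = TRUE)) {
  set.seed(1)
  n <- 2000
  dat <- data.frame(x = rbinom(n, 1, 0.5))        # one binary feature (made up)
  dat$y <- 70 - 20 * dat$x + rexp(n, rate = 0.1)  # right-skewed response (made up)

  # Same kind of fit as above, but only at the extreme quantile.
  fit <- quantreg::rq(y ~ x, tau = 0.99, data = dat)

  # Bootstrap standard errors avoid the local density ("fis") estimates
  # that the "nid" method depends on.
  s_boot <- summary(fit, se = "boot", R = 200)
  print(s_boot$coefficients)
}
```

If the bootstrap standard errors are finite and of a sensible size while
the "nid" ones are not, that would point at the density estimate in the
tail rather than at the subsampling itself.
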
On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
> You can time it yourself on increasingly large subsets of your data. E.g.,
>
> > library(quantreg)  # provides rq()
> > dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
>                     x3=sample(c("A","B","C"), size=1e6, replace=TRUE))
> > dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
> > # time rq() on the first n rows, for n = 4^3, ..., 4^10
> > t <- vapply(n <- 4^(3:10), FUN=function(n) {
>       d <- dat[seq_len(n), ]
>       print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))
>   }, FUN.VALUE=numeric(5))
> user system elapsed
> 0 0 0
> user system elapsed
> 0 0 0
> user system elapsed
> 0.02 0.00 0.01
> user system elapsed
> 0.01 0.00 0.02
> user system elapsed
> 0.10 0.00 0.11
> user system elapsed
> 1.09 0.00 1.10
> user system elapsed
> 13.05 0.02 13.07
> user system elapsed
> 273.30 0.11 273.74
> > t
>            [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
> user.child   NA   NA   NA   NA   NA   NA    NA     NA
> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>
> Do some regressions on t["elapsed",] as a function of n and predict up to
> n=10^7. E.g.,
> > summary(lm(t["elapsed",] ~ poly(n,4)))
>
> Call:
> lm(formula = t["elapsed", ] ~ poly(n, 4))
>
> Residuals:
>          1          2          3          4          5          6          7          8
> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>
> Coefficients:
>              Estimate Std. Error  t value Pr(>|t|)
> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.003565 on 3 degrees of freedom
> Multiple R-squared: 1, Adjusted R-squared: 1
> F-statistic: 1.273e+09 on 4 and 3 DF, p-value: 3.575e-14
>
>
> It does not look good for n=10^7.
>
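
A degree-4 polynomial through eight timings fits almost perfectly by
construction, so it may be a shaky basis for extrapolating to 10^7. As an
alternative sketch (not the approach used above), the elapsed times from
the timing table can be fitted on a log-log scale, assuming run time grows
roughly like a power of n. The names n_obs, elapsed, timings, fit,
growth_exponent and pred_hours are introduced here for illustration only,
and the near-zero timings are dropped because they are too small to
measure reliably.

```r
# Elapsed times (seconds) copied from the timing table for n = 4^3, ..., 4^10.
n_obs   <- 4^(3:10)
elapsed <- c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)

# Keep only timings of at least 0.1 s (a judgment call, not a rule).
timings <- data.frame(n_obs = n_obs, elapsed = elapsed)
timings <- timings[timings$elapsed >= 0.1, ]

# Power-law model: log(elapsed) = a + b * log(n); b is the growth exponent.
fit <- lm(log(elapsed) ~ log(n_obs), data = timings)
growth_exponent <- unname(coef(fit)[2])
print(growth_exponent)

# Extrapolate to the subsample size and to the full data set, in hours.
new_n <- data.frame(n_obs = c(1.8e6, 1.8e7))
pred_hours <- exp(predict(fit, newdata = new_n)) / 3600
print(pred_hours)
```

The jump between the last two timings is steeper than the fitted slope, so
an extrapolation like this may well be optimistic. It also reflects this
particular model and machine, not the 8-coefficient model in the reply.
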
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>
>> Hi all,
>>
>> I'm using quantreg rq() to perform quantile regression on a large data
>> set. Each record has 4 fields and there are about 18 million records in
>> total. I wonder if anyone has tried rq() on a large dataset and how long
>> I should expect it to take. Or is it simply too large, so that I should
>> subsample the data? I would like to have an idea before I start the run
>> and wait forever.
>>
>> In addition, I would appreciate it if anyone could give me a rough idea
>> of how long rq() takes to run for a given dataset size.
>>
>> Yunqi
>>