[R] Bootstrapping in R
Bryan Mac
bryanmac.24 at gmail.com
Wed Oct 19 17:28:00 CEST 2016
Hi,
After running the bootstrap, I would like to see the output of the bootstrapped samples. How can I view the bootstrapped samples of each variable?
Bryan Mac
bryanmac.24 at gmail.com
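
One way to look at the replicates, sketched under assumptions: the data below are simulated (only the column names NAR and NIC come from the thread), and the names ols_coef, boot_out, idx and first_sample are made up for illustration. The object returned by boot() stores the statistic for every resample in its t component, and boot.array() can recover which rows went into each resample.

```r
library(boot)

set.seed(1)  # hypothetical seed, only so the sketch is reproducible
# Simulated stand-in for n_data; the real data are not in the thread
n_data <- data.frame(NIC = runif(100, 0, 10))
n_data$NAR <- 1.7 + 0.55 * n_data$NIC + rnorm(100, sd = 1.9)

# Statistic in the form boot() expects: (data, indices) -> numeric vector
ols_coef <- function(d, indices) {
  coef(lm(NAR ~ NIC, data = d[indices, ]))
}

boot_out <- boot(n_data, statistic = ols_coef, R = 100)

boot_out$t0        # statistic on the original data (the "original" column)
head(boot_out$t)   # one row per replicate; column 1 is t1*, column 2 is t2*

# The "std. error" column is the standard deviation of each column of t
apply(boot_out$t, 2, sd)

# Row indices drawn for each replicate: an R x n matrix
idx <- boot.array(boot_out, indices = TRUE)
first_sample <- n_data[idx[1, ], ]   # the rows drawn for replicate 1
```

Here boot_out$t has R rows and one column per returned statistic, so a statistic function that returns 28 values would give 28 columns, matching t1 to t28 in the printed output.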
> On Oct 18, 2016, at 3:57 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>
> It means that the sd of the bootstrap samples is 0.21.
> See function ?boot.ci for confidence intervals.
> You should also start a new thread in R-Help; you will get more and better answers.
>
> On 18-10-2016 08:15, Bryan Mac wrote:
>> Hi Rui,
>>
>> I am having trouble understanding what this means exactly. Does this
>> mean that the bootstrapped number is +/-0.21 from the original?
>>
>>
>> How would I show all of the t's in the bootstrap? I have about t1 to t28
>> so far. Would it be possible to show all of them?
>
> I don't understand what you mean by this. All of the results are returned and printed by boot().
>
> Rui Barradas
>>
>> Bryan Mac
>> bryanmac.24 at gmail.com
>>
>>
>>
>>> On Oct 7, 2016, at 1:41 PM, ruipbarradas at sapo.pt wrote:
>>>
>>> Hello,
>>>
>>> That's just the definition of a function; you have to actually call
>>> it, in a call to boot(), for instance.
>>>
>>>
>>> OLSCoef_NAR_NIC <- function(df, indices){
>>> sample <- df[indices, ]
>>> OLS_NAR_NIC_relation <- lm(NAR ~ NIC, data = sample)
>>> coef_ols_nar_nic <- coef(OLS_NAR_NIC_relation)
>>> coef_ols_nar_nic
>>> }
>>>
>>> boot(n_data, statistic = OLSCoef_NAR_NIC, R = 100)
>>>
>>> ORDINARY NONPARAMETRIC BOOTSTRAP
>>>
>>>
>>> Call:
>>> boot(data = n_data, statistic = OLSCoef_NAR_NIC, R = 100)
>>>
>>>
>>> Bootstrap Statistics :
>>>      original       bias    std. error
>>> t1* 1.8788189 -0.013771706  0.59596631
>>> t2* 0.5003911  0.002478478  0.09016857
>>>
>>>
>>> As for the output in the format you want, I suggest you call lm() with
>>> your entire df; since it is big, there's no reason to bootstrap it.
>>> Something like this:
>>>
>>> > model <- lm(NAR ~ NIC, data = data)
>>> > summary(model)
>>>
>>> Call:
>>> lm(formula = NAR ~ NIC, data = data)
>>>
>>> Residuals:
>>>     Min      1Q  Median      3Q     Max
>>> -6.0459 -1.1916  0.2126  1.3424  4.8094
>>>
>>> Coefficients:
>>>             Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)  1.66395    0.18859   8.823   <2e-16 ***
>>> NIC          0.56384    0.02588  21.783   <2e-16 ***
>>> ---
>>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>>
>>> Residual standard error: 1.886 on 1267 degrees of freedom
>>> Multiple R-squared: 0.2725, Adjusted R-squared: 0.2719
>>> F-statistic: 474.5 on 1 and 1267 DF, p-value: < 2.2e-16
>>>
>>> Rui Barradas
>>>
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>>
>>>> By the way, when I ran the code, I didn't see any output of results.
>>>>
>>>> This is what I got.
>>>> <OLS.PNG>
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
>>>>> On Oct 6, 2016, at 3:48 AM, ruipbarradas at sapo.pt wrote:
>>>
>>> Hello,
>>>
>>> I believe that your code is correct, I don't understand what you mean
>>> by "not showing up".
>>> If you want the coefficients, residuals, etc, your bootstrap statistic
>>> function needs to return those values. You can, for instance, use a
>>> different function, one to return the r-squared, another to return the
>>> coefficients or t.value, etc.
>>>
>>> This function would return the coefficients. Note that if you use the
>>> argument data = ... you don't need the name of the df in your formula.
>>> It makes the code more readable.
>>>
>>>
>>> OLSCoef_NAR_NIC <- function(df, indices){
>>> sample <- df[indices, ]
>>> OLS_NAR_NIC_relation <- lm(NAR ~ NIC, data = sample)
>>> coef_ols_nar_nic <- coef(OLS_NAR_NIC_relation)
>>> coef_ols_nar_nic
>>> }
>>>
>>> Rui Barradas
>>>
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>>
>>>> Hi Rui,
>>>>
>>>> My next step is to run both Least Median of Squares (LMS) regression and
>>>> Ordinary Least Squares (OLS) regression after the bootstrap.
>>>> My colleague and I wrote the code for it, but I doubt that it
>>>> is correct. Is this how you compute the OLS and LMS regressions?
>>>> Shouldn't my output look like the sample below? I believe the code
>>>> fits the models, but I do not see the residuals or the coefficient
>>>> table (estimate, std. error, t value, etc.).
>>>> Sample output:
>>>> Call:
>>>> ## lm(formula = crime ~ poverty + single, data = cdata)
>>>> ##
>>>> ## Residuals:
>>>> ##    Min     1Q Median     3Q    Max
>>>> ## -811.1 -114.3  -22.4  121.9  689.8
>>>> ##
>>>> ## Coefficients:
>>>> ##             Estimate Std. Error t value Pr(>|t|)
>>>> ## (Intercept) -1368.19     187.21   -7.31  2.5e-09 ***
>>>> ## poverty         6.79       8.99    0.76     0.45
>>>> ## single        166.37      19.42    8.57  3.1e-11 ***
>>>> ## ---
>>>> ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>> ##
>>>> ## Residual standard error: 244 on 48 degrees of freedom
>>>> ## Multiple R-squared: 0.707, Adjusted R-squared: 0.695
>>>> ## F-statistic: 58 on 2 and 48 DF, p-value: 1.58e-13
>>>> This is our code:
>>>> OLSRegression <- function(df, indices){
>>>> sample <- df[indices, ]
>>>> OLS_NAR_NIC_relation <- lm(sample$NAR~sample$NIC, data = sample)
>>>> rsquared_ols_nar_nic <- summary(OLS_NAR_NIC_relation)$r.square
>>>>
>>>>
>>>> OLS_SQRTNAR_SQRTNIC_relation <- lm(sample$SQRTNAR~sample$SQRTNIC,
>>>> data = sample)
>>>> rsquared_ols_sqrtnar_sqrtnic <-
>>>> summary(OLS_SQRTNAR_SQRTNIC_relation)$r.square
>>>>
>>>>
>>>> out <- c(rsquared_ols_nar_nic, rsquared_ols_sqrtnar_sqrtnic)
>>>> return(out)
>>>> }
>>>> LMSRegression <- function(df, indices){
>>>> sample <- df[indices, ]
>>>> LMS_NAR_NIC_relation <- lm(sample$NAR~sample$NIC, data = sample,
>>>> method = "lms")
>>>> rsquared_lms_nar_nic <- summary(LMS_NAR_NIC_relation)$r.square
>>>>
>>>>
>>>> LMS_SQRTNAR_SQRTNIC_relation <- lm(sample$SQRTNAR~sample$SQRTNIC,
>>>> data = sample, method = "lms")
>>>> rsquared_lms_sqrtnar_sqrtnic <-
>>>> summary(LMS_SQRTNAR_SQRTNIC_relation)$r.square
>>>>
>>>>
>>>> out <- c(rsquared_lms_nar_nic, rsquared_lms_sqrtnar_sqrtnic)
>>>> return(out)
>>>> }
>>>> boot.out.ols <- boot(n_data, statistic = OLSRegression, R = 100)
>>>> boot.out.ols
>>>> plot(boot.out.ols, index = 1)
>>>> title(sub = "Histogram and Q-Q plot for relation between NAR-NIC
>>>> (OLS; R-Squared Value)", line = 4)
>>>> plot(boot.out.ols, index = 2)
>>>> title(sub = "Histogram and Q-Q plot for relation between
>>>> SQRTNAR-SQRTNIC (OLS; R-Squared Value)", line = 4)
>>>> ci_ols_1 <- boot.ci(boot.out.ols, index = 1, type = "all")
>>>> ci_ols_1
>>>> ci_ols_filtered_1 <- ci_ols_1$bca[, c(4,5)]
>>>> ci_ols_filtered_1
>>>> hist(boot.out.ols$t[,1], main = 'Determination of Coefficient:
>>>> NAR-NIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.ols$t[,1]), col = 'Red')
>>>> abline(v = ci_ols_filtered_1, col = 'brown')
>>>> ci_ols_2 <- boot.ci(boot.out.ols, index = 2, type = "all")
>>>> ci_ols_2
>>>> ci_ols_filtered_2 <- ci_ols_2$bca[, c(4,5)]
>>>> ci_ols_filtered_2
>>>> hist(boot.out.ols$t[,2], main = 'Determination of Coefficient:
>>>> SQRTNAR-SQRTNIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.ols$t[,2]), col = 'Red')
>>>> abline(v = ci_ols_filtered_2, col = 'brown')
>>>> boot.out.lms <- boot(n_data, statistic = LMSRegression, R = 100)
>>>> boot.out.lms
>>>> plot(boot.out.lms, index = 1)
>>>> title(sub = "Histogram and Q-Q plot for relation between NAR-NIC
>>>> (LMS; R-Squared Value)", line = 4)
>>>> plot(boot.out.lms, index = 2)
>>>> title(sub = "Histogram and Q-Q plot for relation between
>>>> SQRTNAR-SQRTNIC (LMS; R-Squared Value)", line = 4)
>>>> ci_lms_1<- boot.ci(boot.out.lms, index = 1, type = "all")
>>>> ci_lms_1
>>>> ci_lms_filtered_1 <- ci_lms_1$bca[, c(4,5)]
>>>> ci_lms_filtered_1
>>>> hist(boot.out.lms$t[,1], main = 'Determination of Coefficient:
>>>> NAR-NIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.lms$t[,1]), col = 'Red')
>>>> abline(v = ci_lms_filtered_1, col = 'brown')
>>>> ci_lms_2<- boot.ci(boot.out.lms, index = 2, type = "all")
>>>> ci_lms_2
>>>> ci_lms_filtered_2 <- ci_lms_2$bca[, c(4,5)]
>>>> ci_lms_filtered_2
>>>> hist(boot.out.lms$t[,2], main = 'Determination of Coefficient:
>>>> SQRTNAR-SQRTNIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.lms$t[,2]), col = 'Red')
>>>> abline(v = ci_lms_filtered_2, col = 'brown')
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
>>>>> On Oct 5, 2016, at 3:27 AM, ruipbarradas at sapo.pt wrote:
>>>
>>> Hello,
>>>
>>> Inline.
>>>
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>>
>>>> Hi Rui, Thanks.
>>>>
>>>> About this part of the code, I thought because we are bootstrapping,
>>>> which is random sampling WITH replacement, it would be replace=TRUE?
>>>> Or is it replace=FALSE because it's not trying to replace the values
>>>> in the columns, but just trying to randomly pick 100 cases out of the
>>>> total?
>>>>
>>> Yes, you said that you want to select a sub-df and _then_ bootstrap
>>> it, so you should choose it without replacement; it's the bootstrap
>>> that uses sampling with replacement.
>>>
>>> Rui Barradas
>>>> ix <- sample(1269, 100, replace = FALSE)
>>>> n_data <- data[ix, cols]
>>>> Also, I got no errors as well. Thanks.
>>>> Best,
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
>>>>> On Oct 4, 2016, at 3:56 AM, ruipbarradas at sapo.pt wrote:
>>>
>>> Two more things.
>>>
>>> 1) Don't call your df data or df; those are names of R functions.
>>> 2) I've just run boot(data, statistic = DataSummary, R = 100), with
>>> the 1269 rows, and it gave me no error.
>>>
>>> Rui Barradas
>>>
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>>
>>>> Hi Rui,
>>>>
>>>> It's for a project that I am dealing with at work. It has to do with
>>>> estimation of advertisement performance.
>>>> What the code has to accomplish is to randomly select 100 cases each
>>>> time it is run and bootstrap them 100 times.
>>>> It can't be just the first 100 cases of the 1269 rows. The cases can come
>>>> from anywhere between the first row and row 1269.
>>>> For now, what I am asking for help with is this: is there working
>>>> code that randomly selects 100 rows out of my total (1269),
>>>> so that each time it is run I get a different df/DataSummary and
>>>> bootstrap sample?
>>>> I think I need to edit this to achieve my purpose of randomly
>>>> selecting 100 rows out of my total:
>>>> cols <- c('NAR','SQRTNAR','NIC','SQRTNIC')
>>>> data[,cols] <- lapply(data[,cols],as.numeric) # convert the variables to numeric if they are not already
>>>> n_data <- data[(1:100),cols]
>>>> I wanted to look at the trend as I increased the number of
>>>> bootstrapped samples (e.g., 100, 200, 300). When I increased the
>>>> number of bootstrapped samples, the distribution got exponentially larger.
>>>> I thought that due to random sampling/bootstrapping you would get a
>>>> variation of scores.
>>>> I ran the df through DataSummary and through the bootstrap;
>>>> I compared the results and they are identical.
>>>> By the way, I kept getting errors when I did 100 bootstrap samples
>>>> and had 1269 rows. It said that the sample was too small.
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
>>>> P.S. I am attaching an Excel file to show you what I mean. I
>>>> essentially randomly choose 100 cases out of the total in the NAR column.
>>>> Once those 100 cases are randomly selected, I bootstrap them 100 times.
>>>> That's what I am looking to do.
>>>
>>
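
A note on the LMSRegression function quoted in this thread: it passes method = "lms" to lm(), but lm() documents only method = "qr" for fitting, so that call most likely falls back to ordinary least squares with a warning. A minimal sketch of one alternative, assuming the MASS package is acceptable: lqs() with method = "lms" fits a least median of squares regression. The data are simulated and the names lms_coef and boot_lms are hypothetical. As far as I know an lqs fit does not carry an r.squared value the way summary.lm() does, so this sketch bootstraps the coefficients instead.

```r
library(boot)
library(MASS)   # provides lqs()

set.seed(2)  # hypothetical seed, only so the sketch is reproducible
# Simulated stand-in for n_data; only the column names come from the thread
n_data <- data.frame(NIC = runif(100, 0, 10))
n_data$NAR <- 1.7 + 0.55 * n_data$NIC + rnorm(100, sd = 1.9)

# Bootstrap statistic: LMS intercept and slope for one resample
lms_coef <- function(d, indices) {
  fit <- lqs(NAR ~ NIC, data = d[indices, ], method = "lms")
  coef(fit)
}

boot_lms <- boot(n_data, statistic = lms_coef, R = 100)
boot_lms            # prints original, bias and std. error for t1* and t2*
head(boot_lms$t)    # LMS coefficients for the first few replicates
```

The resulting object can be passed to plot() and boot.ci() in the same way as the OLS one.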