[R] Bootstrapping in R

Wed Oct 19 17:28:00 CEST 2016

Hi,

After running the bootstrap, I would like to see the output of the bootstrapped samples. How can I view the bootstrapped samples of each variable?

Bryan Mac
bryanmac.24 at gmail.com
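
For reference, a minimal sketch of one way to look at the individual replicates, assuming the boot package and simulated data standing in for NAR and NIC (the names `sim_df`, `ols_coef` and `boot_out` are illustrative, not from the thread). The replicates are stored in the `t` component of the object returned by boot(), one row per bootstrap sample and one column per statistic; boot.array() recovers which rows were drawn.

```r
library(boot)

# Simulated stand-in for the NAR/NIC data (assumption, not the real data).
set.seed(1)
sim_df <- data.frame(NIC = rnorm(100, mean = 5, sd = 2))
sim_df$NAR <- 1.5 + 0.5 * sim_df$NIC + rnorm(100)

# Statistic: OLS coefficients fitted to the resampled rows.
ols_coef <- function(d, indices) {
  coef(lm(NAR ~ NIC, data = d[indices, ]))
}

boot_out <- boot(sim_df, statistic = ols_coef, R = 100)

boot_out$t0        # statistic on the original data (the "original" column)
dim(boot_out$t)    # R rows, one column per statistic (t1*, t2*, ...)
head(boot_out$t)   # first few bootstrap replicates of each statistic

# Row indices drawn in each replicate: an R x n matrix.
idx <- boot.array(boot_out, indices = TRUE)
head(sim_df[idx[1, ], ])   # some of the rows in the first bootstrap sample
```

Column 1 of `boot_out$t` corresponds to t1* in the printed output, column 2 to t2*, and so on.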

> On Oct 18, 2016, at 3:57 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> 
> It means that the sd of the bootstrap samples is 0.21.
> See function ?boot.ci for confidence intervals.
> You should also start a new thread in R-Help; you will get more and better answers.
> 
> On 18-10-2016 08:15, Bryan Mac wrote:
>> Hi Rui,
>> 
>> I am having trouble understanding exactly what this means. Does this
>> mean that the bootstrapped number is +/-0.21 from the original?
>> 
>> 
>> How would I show all of the t’s in the bootstrap? I have about t1 to t28
>> so far. Would it be possible to show all of them?
> 
> I don't understand what you mean by this. All of the results are returned and printed by boot().
> 
> Rui Barradas
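
To make the "sd of the bootstrap samples" point concrete, here is a small sketch under assumed data (an exponential sample `x` and a mean statistic, neither from the thread). It recomputes the "std. error" and "bias" columns by hand from the replicates and then asks boot.ci() for a percentile interval.

```r
library(boot)

set.seed(42)
x <- rexp(50)   # made-up sample (assumption)

mean_stat <- function(d, indices) mean(d[indices])
b <- boot(x, statistic = mean_stat, R = 999)

# The "std. error" column printed by boot() is the standard deviation
# of the replicates stored in b$t; "bias" is their mean minus b$t0.
boot_se   <- sd(b$t[, 1])
boot_bias <- mean(b$t[, 1]) - b$t0

# Percentile confidence interval for the statistic.
ci <- boot.ci(b, conf = 0.95, type = "perc")
ci$percent[4:5]   # lower and upper limits
```

So the standard error describes the spread of the replicates around their own mean; it is not a bound on how far any single replicate lies from the original estimate.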
>> 
>> Bryan Mac
>> bryanmac.24 at gmail.com
>> 
>> 
>> 
>>> On Oct 7, 2016, at 1:41 PM, ruipbarradas at sapo.pt wrote:
>>> 
>>> Hello,
>>> 
>>> That's just the definition of a function; you have to actually call
>>> it, in a call to boot(), for instance.
>>> 
>>> 
>>> OLSCoef_NAR_NIC <- function(df, indices){
>>>  sample <- df[indices, ]
>>>  OLS_NAR_NIC_relation <- lm(NAR ~ NIC, data = sample)
>>>  coef_ols_nar_nic <- coef(OLS_NAR_NIC_relation)
>>>  coef_ols_nar_nic
>>> }
>>> 
>>> boot(n_data, statistic = OLSCoef_NAR_NIC, R = 100)
>>> 
>>> ORDINARY NONPARAMETRIC BOOTSTRAP
>>> 
>>> 
>>> Call:
>>> boot(data = n_data, statistic = OLSCoef_NAR_NIC, R = 100)
>>> 
>>> 
>>> Bootstrap Statistics :
>>>     original       bias    std. error
>>> t1* 1.8788189 -0.013771706  0.59596631
>>> t2* 0.5003911  0.002478478  0.09016857
>>> 
>>> 
>>> As for the output in the format you want, I suggest you call lm() with
>>> your entire df; since it is big, there's no reason to bootstrap it.
>>> Something like this:
>>> 
>>> > model <- lm(NAR ~ NIC, data = data)
>>> > summary(model)
>>> 
>>> Call:
>>> lm(formula = NAR ~ NIC, data = data)
>>> 
>>> Residuals:
>>>    Min      1Q  Median      3Q     Max
>>> -6.0459 -1.1916  0.2126  1.3424  4.8094
>>> 
>>> Coefficients:
>>>            Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)  1.66395    0.18859   8.823   <2e-16 ***
>>> NIC          0.56384    0.02588  21.783   <2e-16 ***
>>> ---
>>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> 
>>> Residual standard error: 1.886 on 1267 degrees of freedom
>>> Multiple R-squared:  0.2725,    Adjusted R-squared:  0.2719
>>> F-statistic: 474.5 on 1 and 1267 DF,  p-value: < 2.2e-16
>>> 
>>> Rui Barradas
>>> 
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>> 
>>>> By the way, when I ran the code, I didn’t see any output of results.
>>>> 
>>>> This is what I got.
>>>> <OLS.PNG>
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
>>>>> On Oct 6, 2016, at 3:48 AM, ruipbarradas at sapo.pt wrote:
>>> 
>>> Hello,
>>> 
>>> I believe that your code is correct; I don't understand what you mean
>>> by "not showing up".
>>> If you want the coefficients, residuals, etc., your bootstrap statistic
>>> function needs to return those values. You can, for instance, use
>>> different functions: one to return the R-squared, another to return the
>>> coefficients or t values, etc.
>>> 
>>> This function would return the coefficients. Note that if you use the
>>> argument data = ... you don't need the name of the df in your formula.
>>> It makes the code more readable.
>>> 
>>> 
>>> OLSCoef_NAR_NIC <- function(df, indices){
>>>  sample <- df[indices, ]
>>>  OLS_NAR_NIC_relation <- lm(NAR ~ NIC, data = sample)
>>>  coef_ols_nar_nic <- coef(OLS_NAR_NIC_relation)
>>>  coef_ols_nar_nic
>>> }
>>> 
>>> Rui Barradas
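
As a sketch of the "return those values" idea, the statistic function can also return several quantities in one vector; boot() then labels them t1*, t2*, ... in the order returned. The data and the names `sim_df`, `ols_summary` and `b_sum` below are assumptions made for illustration.

```r
library(boot)

set.seed(7)
# Simulated stand-in for the real data (assumption).
sim_df <- data.frame(NIC = runif(100, 0, 10))
sim_df$NAR <- 2 + 0.5 * sim_df$NIC + rnorm(100)

# Return several quantities at once from each resampled fit.
ols_summary <- function(d, indices) {
  s <- summary(lm(NAR ~ NIC, data = d[indices, ]))
  c(intercept = s$coefficients[1, "Estimate"],
    slope     = s$coefficients[2, "Estimate"],
    slope_t   = s$coefficients[2, "t value"],
    r_squared = s$r.squared)
}

b_sum <- boot(sim_df, statistic = ols_summary, R = 200)
colnames(b_sum$t) <- names(b_sum$t0)   # label the replicate columns
b_sum$t0
head(b_sum$t)
```

Labelling the columns of `b_sum$t` is optional, but it makes it easier to tell which t index is which when there are many statistics.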
>>> 
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>> 
>>>> Hi Rui,
>>>> 
>>>> My next step is to run both Least Median of Squares regression and
>>>> Ordinary Least Squares regression after the bootstrap.
>>>> My colleague and I wrote the code for it, but I doubt that it
>>>> is correct. Is this how you compute the OLS and LMS regressions?
>>>> Doesn’t my output have to look like the sample below? I believe I do have
>>>> the code that can produce it, but it's not showing up: I do not see
>>>> the residuals or the coefficients (estimate, std. error, t value, etc.).
>>>> Sample output:
>>>> ## Call:
>>>> ## lm(formula = crime ~ poverty + single, data = cdata)
>>>> ##
>>>> ## Residuals:
>>>> ##    Min     1Q Median     3Q    Max
>>>> ## -811.1 -114.3  -22.4  121.9  689.8
>>>> ##
>>>> ## Coefficients:
>>>> ##             Estimate Std. Error t value Pr(>|t|)
>>>> ## (Intercept) -1368.19     187.21   -7.31  2.5e-09 ***
>>>> ## poverty         6.79       8.99    0.76     0.45
>>>> ## single        166.37      19.42    8.57  3.1e-11 ***
>>>> ## ---
>>>> ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>> ##
>>>> ## Residual standard error: 244 on 48 degrees of freedom
>>>> ## Multiple R-squared:  0.707, Adjusted R-squared:  0.695
>>>> ## F-statistic:   58 on 2 and 48 DF,  p-value: 1.58e-13
>>>> This is our code:
>>>> OLSRegression <- function(df, indices){
>>>> sample <- df[indices, ]
>>>> OLS_NAR_NIC_relation <- lm(sample$NAR~sample$NIC, data = sample)
>>>> rsquared_ols_nar_nic <- summary(OLS_NAR_NIC_relation)$r.square
>>>> 
>>>> 
>>>> OLS_SQRTNAR_SQRTNIC_relation <- lm(sample$SQRTNAR~sample$SQRTNIC,
>>>> data = sample)
>>>> rsquared_ols_sqrtnar_sqrtnic <-
>>>> summary(OLS_SQRTNAR_SQRTNIC_relation)$r.square
>>>> 
>>>> 
>>>> out <- c(rsquared_ols_nar_nic, rsquared_ols_sqrtnar_sqrtnic)
>>>>  return(out)
>>>> }
>>>> LMSRegression <- function(df, indices){
>>>> sample <- df[indices, ]
>>>> LMS_NAR_NIC_relation <- lm(sample$NAR~sample$NIC, data = sample,
>>>> method = "lms")
>>>> rsquared_lms_nar_nic <- summary(LMS_NAR_NIC_relation)$r.square
>>>> 
>>>> 
>>>> LMS_SQRTNAR_SQRTNIC_relation <- lm(sample$SQRTNAR~sample$SQRTNIC,
>>>> data = sample, method = "lms")
>>>> rsquared_lms_sqrtnar_sqrtnic <-
>>>> summary(LMS_SQRTNAR_SQRTNIC_relation)$r.square
>>>> 
>>>> 
>>>> out <- c(rsquared_lms_nar_nic, rsquared_lms_sqrtnar_sqrtnic)
>>>>  return(out)
>>>> }
>>>> boot.out.ols <- boot(n_data, statistic = OLSRegression, R = 100)
>>>> boot.out.ols
>>>> plot(boot.out.ols, index = 1)
>>>> title(sub = "Histogram and Q-Q plot for relation between NAR-NIC
>>>> (OLS; R-Squared Value)", line = 4)
>>>> plot(boot.out.ols, index = 2)
>>>> title(sub = "Histogram and Q-Q plot for relation between
>>>> SQRTNAR-SQRTNIC (OLS; R-Squared Value)", line = 4)
>>>> ci_ols_1 <- boot.ci(boot.out.ols, index = 1, type = "all")
>>>> ci_ols_1
>>>> ci_ols_filtered_1 <- ci_ols_1$bca[, c(4,5)]
>>>> ci_ols_filtered_1
>>>> hist(boot.out.ols$t[,1], main = 'Determination of Coefficient:
>>>> NAR-NIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.ols$t[,1]), col = 'Red')
>>>> abline(v = ci_ols_filtered_1, col = 'brown')
>>>> ci_ols_2 <- boot.ci(boot.out.ols, index = 2, type = "all")
>>>> ci_ols_2
>>>> ci_ols_filtered_2 <- ci_ols_2$bca[, c(4,5)]
>>>> ci_ols_filtered_2
>>>> hist(boot.out.ols$t[,2], main = 'Determination of Coefficient:
>>>> SQRTNAR-SQRTNIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.ols$t[,2]), col = 'Red')
>>>> abline(v = ci_ols_filtered_2, col = 'brown')
>>>> boot.out.lms <- boot(n_data, statistic = LMSRegression, R = 100)
>>>> boot.out.lms
>>>> plot(boot.out.lms, index = 1)
>>>> title(sub = "Histogram and Q-Q plot for relation between NAR-NIC
>>>> (LMS; R-Squared Value)", line = 4)
>>>> plot(boot.out.lms, index = 2)
>>>> title(sub = "Histogram and Q-Q plot for relation between
>>>> SQRTNAR-SQRTNIC (LMS; R-Squared Value)", line = 4)
>>>> ci_lms_1<- boot.ci(boot.out.lms, index = 1, type = "all")
>>>> ci_lms_1
>>>> ci_lms_filtered_1 <- ci_lms_1$bca[, c(4,5)]
>>>> ci_lms_filtered_1
>>>> hist(boot.out.lms$t[,1], main = 'Determination of Coefficient:
>>>> NAR-NIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.lms$t[,1]), col = 'Red')
>>>> abline(v = ci_lms_filtered_1, col = 'brown')
>>>> ci_lms_2<- boot.ci(boot.out.lms, index = 2, type = "all")
>>>> ci_lms_2
>>>> ci_lms_filtered_2 <- ci_lms_2$bca[, c(4,5)]
>>>> ci_lms_filtered_2
>>>> hist(boot.out.lms$t[,2], main = 'Determination of Coefficient:
>>>> SQRTNAR-SQRTNIC', xlab = 'R-Squared', col = 'LightBlue', probability = T)
>>>> lines(density(boot.out.lms$t[,2]), col = 'Red')
>>>> abline(v = ci_lms_filtered_2, col = 'brown')
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
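
One point worth checking in the LMS part: lm() only fits by least squares, and passing method = "lms" produces a warning and an ordinary QR fit rather than a least-median-of-squares fit. A possible alternative, sketched here with simulated data and hypothetical names (`sim_df`, `lms_coef`, `b_lms`), is lqs() from the MASS package. Its fits do not report an R-squared, so this sketch bootstraps the coefficients instead.

```r
library(boot)
library(MASS)

set.seed(3)
# Simulated stand-in for the real data, with a few outliers (assumption).
sim_df <- data.frame(NIC = runif(100, 0, 10))
sim_df$NAR <- 2 + 0.5 * sim_df$NIC + rnorm(100)
sim_df$NAR[1:5] <- sim_df$NAR[1:5] + 15

# Least median of squares fit on the resampled rows; returns
# the intercept and slope.
lms_coef <- function(d, indices) {
  coef(lqs(NAR ~ NIC, data = d[indices, ], method = "lms"))
}

b_lms <- boot(sim_df, statistic = lms_coef, R = 100)
b_lms
```

Whether bootstrapping LMS coefficients answers the original question about R-squared is a separate modelling decision; the sketch only shows a call that actually performs an LMS fit.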
>>>>> On Oct 5, 2016, at 3:27 AM, ruipbarradas at sapo.pt wrote:
>>> 
>>> Hello,
>>> 
>>> Inline.
>>> 
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>> 
>>>> Hi Rui, Thanks.
>>>> 
>>>> About this part of the code, I thought that because we are bootstrapping,
>>>> which is random sampling WITH replacement, it would be replace=TRUE?
>>>> Or is it replace=FALSE because it's not trying to replace the values
>>>> in the columns, but just trying to randomly draw 100 cases out of the
>>>> total?
>>>> 
>>> Yes, you said that you want to select a sub-df and _then_ bootstrap
>>> it, so you should choose it without replacement; it's the bootstrap
>>> that uses sampling with replacement.
>>> 
>>> Rui Barradas
>>>> ix <- sample(1269, 100, replace = FALSE)
>>>> n_data <- data[ix, cols]
>>>> Also, I got no errors as well. Thanks.
>>>> Best,
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
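
The two-step procedure discussed above (subsample without replacement, then bootstrap with replacement) can be sketched as follows. The 1269-row data frame is simulated, and the names `full_df`, `sub_df`, `r2_stat` and `b_sub` are hypothetical.

```r
library(boot)

set.seed(2016)
# Hypothetical 1269-row data frame standing in for the real one.
full_df <- data.frame(NIC = runif(1269, 0, 10))
full_df$NAR <- 1.7 + 0.56 * full_df$NIC + rnorm(1269, sd = 1.9)

# Step 1: pick 100 distinct rows (sampling WITHOUT replacement).
ix <- sample(nrow(full_df), 100, replace = FALSE)
sub_df <- full_df[ix, ]

# Step 2: boot() resamples those 100 rows WITH replacement, R times.
r2_stat <- function(d, indices) {
  summary(lm(NAR ~ NIC, data = d[indices, ]))$r.squared
}
b_sub <- boot(sub_df, statistic = r2_stat, R = 100)
b_sub
```

Dropping the set.seed() call gives a different subsample and different replicates on each run, which is the behaviour asked for.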
>>>>> On Oct 4, 2016, at 3:56 AM, ruipbarradas at sapo.pt wrote:
>>> 
>>> Two more things.
>>> 
>>> 1) Don't call your df data or df; those are names of R functions.
>>> 2) I've just run boot(data, statistic = DataSummary, R = 100), with
>>> the 1269 rows, and it gave me no error.
>>> 
>>> Rui Barradas
>>> 
>>> Quoting Bryan Mac <bryanmac.24 at gmail.com>:
>>> 
>>>> Hi Rui,
>>>> 
>>>> It's for a project that I am dealing with at work. It has to do with
>>>> estimation of advertisement performance.
>>>> What the code has to accomplish is to randomly select 100 cases each
>>>> time it is run and bootstrap them 100 times.
>>>> It can’t be just the first 100 cases of the 1269 rows. They can be
>>>> anywhere between the first row and row 1269.
>>>> What I am asking for help with, for now, is whether there is working
>>>> code that will randomly select 100 rows out of my total (1269),
>>>> so that each time it is run, you get a different df/DataSummary and
>>>> bootstrap sample.
>>>> I think I need to edit this to achieve my purpose of randomly
>>>> selecting 100 rows out of my total:
>>>> cols <- c('NAR','SQRTNAR','NIC','SQRTNIC')
>>>> data[,cols] <- lapply(data[,cols], as.numeric)  # convert the variables to numeric if they are not already
>>>> n_data <- data[(1:100),cols]
>>>> I wanted to look at the trend if I increased the number of
>>>> bootstrapped samples (i.e., 100, 200, 300, etc.). When I increased the
>>>> number of bootstrapped samples, the distribution got exponentially larger.
>>>> I thought that due to random sampling/bootstrapping you would get a
>>>> variation of scores.
>>>> I ran the df through DataSummary and compared that with the bootstrap
>>>> results; they are identical.
>>>> By the way, I kept getting errors when I did 100 bootstrap samples
>>>> and had 1269 rows. It said that the sample was too small.
>>>> Bryan Mac
>>>> bryanmac.24 at gmail.com
>>>> P.S. I am attaching an Excel file to show you what I mean. I
>>>> essentially randomly choose 100 cases out of the total in the NAR column.
>>>> Once those 100 cases are randomly selected, bootstrap them 100 times.
>>>> That's what I am looking to do.