1 Cars93 dataset

Lots of cars

We will continue working with the Cars93 dataset included in the MASS package. As we did in Applied Lecture 1, let's first load the data, remove the rows with missing entries and attach the variables contained in Cars93.

library('MASS')
data(Cars93)
sum(is.na(Cars93))
## [1] 13
Cars93 = na.omit(Cars93)
n = nrow(Cars93)
n
## [1] 82
attach(Cars93)

2 Package bayesboot

The first thing to do is to install the package (if it isn't installed already), so simply execute

install.packages('bayesboot')

and then load it

library(bayesboot)

The basic command we are going to use from this package is bayesboot; type

?bayesboot

As we see the command has several arguments:

bayesboot(data, statistic, R = 4000, R2 = 4000, use.weights = FALSE, .progress = "none", .parallel = FALSE, ...)

Let's briefly go through the most important ones for our purposes: data (the data, as a vector, matrix or data frame), statistic (a function computing the statistic of interest), R (the number of posterior draws) and use.weights (whether the statistic expects a weighted sample).

Two important observations:

  1. The results from use.weights = FALSE and use.weights = TRUE are in most cases more or less the same. The option use.weights = FALSE is simple to use and is recommended (which is why it is the default). The option use.weights = TRUE requires the use of weighted statistics inside bayesboot and is more tricky. It should be noted however that for complicated problems use.weights = TRUE can be much faster (because it just uses weighted sums). In this practical we will see the use of both.
  2. The bayesboot function uses an internal command for simulating the weights from a Dirichlet distribution as \[\boldsymbol{\pi}\sim \mathrm{Dir}(1,\ldots,1).\] This is the only option that we have (we cannot change the concentration parameters to, say, \(\alpha_1=0.001,\ldots,\alpha_n=0.001\)) and we will have to live with that!
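
To get a feel for these weights, here is a minimal sketch (not bayesboot's exact internals) of how one draw of \(\mathrm{Dir}(1,\ldots,1)\) weights can be generated by normalising standard exponential draws; the helper name rdirichlet1 is ours, purely for illustration.

rdirichlet1 = function(n) {     # hypothetical helper, for illustration only
  g = rexp(n, rate = 1)         # n independent Exp(1) draws
  g / sum(g)                    # normalising gives one Dirichlet(1, ..., 1) weight vector
}
w = rdirichlet1(n)              # n = 82 was computed above
sum(w)                          # the weights are positive and sum to 1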

3 Motivating Bayesian bootstrap: smoothness

Prior to proceeding to the main material of this practical, let's briefly motivate the use of the Bayesian bootstrap. In Lecture 7 a point was made that the results from classical and Bayesian bootstrap are usually very similar. Although generally true, this is not always the case. Here we consider the following small sample with \(n = 5\).

samples = c(1.865, 3.053, 1.401, 0.569, 4.132)

Suppose we are interested in the mean. We could use the manual code presented in Example 8.1 of Lecture 7 to obtain histograms of the mean from classical and Bayesian bootstrap, but now let's use the functions boot and bayesboot.

First the classical bootstrap with boot. As we have seen, we must first construct a function with arguments data and indices, even if it is for a single statistic like the mean.

library(boot)
foo = function(data, indices){
  dt <- data[indices]  # the bootstrap resample defined by 'indices'
  mean(dt)             # return the statistic of interest (here, the mean)
}
cb = boot(data = samples, statistic = foo, R = 10000)
head(cb$t)  # First 6 bootstrap replications
##        [,1]
## [1,] 2.6080
## [2,] 1.1610
## [3,] 2.4632
## [4,] 2.3704
## [5,] 3.3700
## [6,] 2.6574
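
If we also wanted a classical 95% bootstrap confidence interval from the object cb (not needed for the comparison here, but useful to keep in mind for later), something along the following lines should work.

boot.ci(cb, conf = 0.95, type = 'perc')  # percentile-based 95% interval for the mean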

Next, the Bayesian approach with bayesboot. For a single statistic this is simpler as we do not need to construct a function; we can simply use statistic = mean inside bayesboot. Below we also plot the resulting histograms.

bb = bayesboot(data = samples, statistic = mean, R = 10000)
names(bb)
## [1] "V1"
head(bb$V1) # First 6 Bayesian bootstrap posterior samples
## [1] 1.778630 2.544689 3.388796 3.057806 2.545925 2.122961
par(mfrow=c(1,2))
hist(cb$t, main = 'Classical bootstrap', col = 'lightgreen', xlab = '')
hist(bb$V1, main = 'Bayesian bootstrap', col = 'lightblue', xlab = '')

We see that here the Bayesian bootstrap produces a smoother histogram (which is generally desirable) in comparison to the classical bootstrap. Concerning presentation, we could also show the histograms on the same plot using the following code.

hist(bb$V1, col = 'lightblue', xlab = '', main = 'Bootstrap estimates of the mean')
hist(cb$t, col = 'lightgreen', xlab = '', add = TRUE, main = '')
legend('topright', legend = c('Classical bootstrap','Bayesian bootstrap'),
       col = c('lightgreen','lightblue'), pch = 15, bty = 'n')

Alternatively, and even better, we can use more transparent colors to show the overlap. For this we can use the command rgb:

rgb(red, green, blue, alpha, names = NULL, maxColorValue = 1)

The parameter alpha controls transparency; if we set it below 1 (e.g., 1/4) we get the following plot.

hist(bb$V1, col = rgb(0,0,1,1/4), xlab = '', main = 'Bootstrap estimates of the mean')
hist(cb$t, col = rgb(1,0,0,1/4), xlab = '', add = TRUE, main = '')
legend('topright', legend = c('Classical bootstrap','Bayesian bootstrap'),
       col = c(rgb(1,0,0,1/4),rgb(0,0,1,1/4)), pch = 15, bty = 'n')

4 Using the bayesboot package for non-regression analysis

Now let's use bayesboot for some Bayesian bootstrap analyses of sample statistics from the Cars93 dataset.

4.1 Means and variances

Let's start by considering the mean of the variable Price using 5000 posterior samples. We can use the code below, storing the samples in the object bb.mean1.

set.seed(1)
bb.mean1 = bayesboot(Price, mean, R = 5000)

Above we are implicitly providing the data and statistic arguments of bayesboot and also using the default option for the weights. The corresponding explicit code is the following.

bb.mean1 = bayesboot(data = Price, statistic = mean, R = 5000 , use.weights = FALSE)

For the remainder of this practical we will use explicit code. This is generally recommended for beginners.

Now, let's say that we do not want to use the use.weights = FALSE resampling approach and prefer the original Bayesian bootstrap based on weighted discrete sums with use.weights = TRUE. Running the piece of code below produces an error message.

bb.mean2 = bayesboot(data = Price, statistic = mean, R = 5000 , use.weights = TRUE)
## Applying the statistic on the original data and with uniform weights resulted in an error
## Error in mean.default(data, w, ...): 'trim' must be numeric of length one

This is because we now need to use the corresponding weighted version of the mean. The command in R for that is weighted.mean(). So, as we see, the code below runs without errors.

set.seed(1)
bb.mean2 = bayesboot(data = Price, statistic = weighted.mean, R = 5000 , use.weights = TRUE)
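
As a quick sanity check (a small illustration of our own, not part of the practical's code), note that with uniform weights the weighted statistic reduces to the ordinary mean.

w = rep(1/n, n)            # uniform weights over the n = 82 observations
weighted.mean(Price, w)    # identical to the unweighted mean
mean(Price)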

Next, let's see what we can extract from bb.mean1:

bb.mean1
## The first 10 draws (out of 5000) from the Bayesian bootstrap:
## 
##          V1
## 1  18.42533
## 2  17.73820
## 3  17.66150
## 4  19.01128
## 5  19.27875
## 6  18.71575
## 7  21.00473
## 8  20.52713
## 9  20.01837
## 10 21.28450
## .. ...
## 
## Use summary() to produce a summary of the posterior distribution.
summary(bb.mean1)
## Bayesian bootstrap
## 
## Number of posterior draws: 5000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##  statistic     mean       sd  hdi.low hdi.high
##         V1 19.15935 1.100644 17.02798 21.29605
## 
## Quantiles:
##  statistic    q2.5%     q25%  median     q75%  q97.5%
##         V1 17.21363 18.39279 19.0907 19.85719 21.5667
## 
## Call:
##  bayesboot(data = Price, statistic = mean, R = 5000, use.weights = FALSE)

Typing bb.mean1 gives us the first 10 samples of the mean of Price, but we also get a useful hint to use summary(), which, as we see, provides a lot of relevant information about the bootstrap posterior of the mean. Note that the 95% Highest Density Interval (HDI) is the Bayesian analogue of a classical bootstrap 95% confidence interval. The posterior sample is contained in bb.mean1$V1, so we can use it to, say, plot a histogram in the following way.

hist(bb.mean1$V1, main = 'Bootstrap means of Price', xlab = '', col = 'pink')
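
For instance, an equal-tailed 95% credible interval can be computed directly from these samples (a quick sketch of our own; compare it with the quantiles reported by summary() above).

quantile(bb.mean1$V1, probs = c(0.025, 0.975))  # equal-tailed 95% interval for the mean of Price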

Another option is to use the plot function, which is compatible with bayesboot objects and also provides additional information (the mean and the 95% HDI).

plot(bb.mean1)

As we see it also allows for customisation (e.g., main title and colour).

plot(bb.mean1, main = 'Bootstrap means of Price', cex.main = 1, col = 'pink')

Above, the argument cex.main = 1 controls the size of the main title so that it remains clearly visible.

Tasks:

1. Produce the corresponding histogram and summary from the object bb.mean2. The results should look similar to those below.

## Bayesian bootstrap
## 
## Number of posterior draws: 5000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##  statistic     mean       sd  hdi.low hdi.high
##         V1 19.20242 1.095229 17.17256  21.4256
## 
## Quantiles:
##  statistic    q2.5%     q25%   median     q75%   q97.5%
##         V1 17.23141 18.44833 19.14011 19.88169 21.52821
## 
## Call:
##  bayesboot(data = Price, statistic = weighted.mean, R = 5000,      use.weights = TRUE)

2. Say that we are interested in the variable Horsepower, specifically its variance. Run bayesboot with the resampling approach using 5000 samples. Store it in an object bb.var (use set.seed(1)). Then, as above, produce the corresponding histogram and summary from the object bb.var. The results should look like the ones below.

## Bayesian bootstrap
## 
## Number of posterior draws: 5000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##  statistic     mean       sd  hdi.low hdi.high
##         V1 2547.584 483.5881 1737.703 3530.944
## 
## Quantiles:
##  statistic    q2.5%     q25%   median     q75%  q97.5%
##         V1 1781.352 2197.792 2481.094 2834.723 3654.47
## 
## Call:
##  bayesboot(data = Horsepower, statistic = var, R = 5000, use.weights = FALSE)

4.2 Correlations

Here we are interested in the correlation between Price and Horsepower. To implement this with bayesboot we need to join the two variables in a data.frame or matrix object. Below we use data.frame (but a matrix with cbind(Price,Horsepower) would also be fine).

PrHo = data.frame(Price, Horsepower)
cor(PrHo)
##                Price Horsepower
## Price      1.0000000  0.7872511
## Horsepower 0.7872511  1.0000000

As we see we have a small problem, because cor returns the estimated correlation matrix, while we only need one of its off-diagonal elements. However, we can easily define our own function which extracts only the upper off-diagonal cell.

cor.fun = function(data){
  cor(data)[1,2]
}
cor.fun(PrHo) 
## [1] 0.7872511

Now we can use this function in bayesboot with the resampling approach and get a histogram of the posterior samples.

set.seed(1)
bb.cor1 = bayesboot(data = PrHo, statistic = cor.fun, R = 5000, use.weights = FALSE)
plot(bb.cor1, main = 'Bootstrap correlations of Price and Horsepower',
     cex.main = 1, xlab = '', col = 'magenta')

What if we want to use the original Bayesian bootstrap with use.weights = TRUE? Writing a function for a weighted correlation may not be straightforward. Fortunately, the package boot has a function for that called corr.

corr(PrHo) # same result as above because all weights are equal by default (this will be handled internally in bayesboot)
## [1] 0.7872511
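
For illustration only, a weighted Pearson correlation could also be coded by hand along the following lines (the helper name wcor is ours and purely hypothetical; in the task below we simply use corr from boot).

wcor = function(data, w = rep(1, nrow(data)) / nrow(data)) {
  x = data[, 1]; y = data[, 2]
  mx = sum(w * x) / sum(w); my = sum(w * y) / sum(w)      # weighted means
  sxy = sum(w * (x - mx) * (y - my))                      # weighted cross-moment
  sxy / sqrt(sum(w * (x - mx)^2) * sum(w * (y - my)^2))   # weighted correlation
}
wcor(PrHo)  # with equal weights this matches cor(PrHo)[1, 2]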

Task:

3. Run bayesboot with the weighted approach using the function corr and 5000 samples. Store it in an object bb.cor2 (use set.seed(1)). Then, as above, use plot, but with the following additional arguments: (i) showMode = TRUE, (ii) showCurve = TRUE and (iii) cred.mass = 0.99. The result should look like this.

With the additional provided arguments we see that we now get: (i) the posterior mode (or Maximum-A-Posteriori [MAP] estimate) instead of the posterior mean, (ii) a density instead of a histogram and (iii) a 99% HDI instead of a 95% HDI.
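
For reference, the same arguments can also be tried on bb.cor1 from above (a small illustration, not the solution to the task).

plot(bb.cor1, showMode = TRUE, showCurve = TRUE, cred.mass = 0.99,
     main = 'Bootstrap correlations of Price and Horsepower', cex.main = 1)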

4.3 Multiple statistics

Finally, we might be interested in evaluating bootstrap estimates and posterior distributions for more than one quantity at the same time. In this case we have to define our own function to be used as input to bayesboot.

The code below first defines a function that returns the mean and the standard deviation of a sample. Then, bayesboot is applied to the variable Price.

foo1 = function(data){
  results = c(mean(data), sd(data))
  results
}
set.seed(1)
bb.mult1 = bayesboot(data = Price, statistic = foo1, R = 5000, use.weights = FALSE)
plot(bb.mult1)

The above does not look very good in terms of presentation (although in the R console it may look just fine). We can improve it a bit by reducing the size of the numbers/letters on the histograms with cex and by reducing the size of the x-axis labels with cex.lab (note that the file with the R code does not use those). We would also like to replace “V1” and “V2” with “Mean” and “Standard deviation”, respectively, using the command names. The code is as follows.

names(bb.mult1)
## [1] "V1" "V2"
names(bb.mult1) = c('Mean', 'Standard deviation')
plot(bb.mult1, cex = 1, cex.lab = 1)

Also, remember the showCurve argument in plot as well as the summary command.

plot(bb.mult1, showCurve = TRUE, cex = 1, cex.lab = 1)

summary(bb.mult1)
## Bayesian bootstrap
## 
## Number of posterior draws: 5000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##           statistic     mean       sd   hdi.low hdi.high
##                Mean 19.20463 1.107464 17.104500 21.33873
##  Standard deviation  9.78327 1.175866  7.671027 12.15408
## 
## Quantiles:
##           statistic     q2.5%      q25%    median     q75%   q97.5%
##                Mean 17.235688 18.432300 19.152950 19.90166 21.54428
##  Standard deviation  7.853313  8.962115  9.631686 10.48639 12.45521
## 
## Call:
##  bayesboot(data = Price, statistic = foo1, R = 5000, use.weights = FALSE)

Of course, always remember that you can extract the posterior samples from bayesboot and store them in new objects, for instance,

head(bb.mult1)
## The first 10 draws (out of 6) from the Bayesian bootstrap:
## 
##       Mean Standard deviation
## 1 18.20680           8.174865
## 2 18.61832           8.485256
## 3 18.06523           9.547009
## 4 17.97355          11.799315
## 5 20.52275          11.600940
## 6 17.79648           9.755993
## .. ...
## 
## Use summary() to produce a summary of the posterior distribution.
bb.mean = bb.mult1$Mean
bb.sd = bb.mult1$`Standard deviation` # Using `` because there is a space

which you can use to create your own customised plots.
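
As a minimal sketch of such a customised plot (our own example, reusing the transparent colours from earlier), we could look at the joint bootstrap posterior of the two statistics.

plot(bb.mean, bb.sd, pch = 16, col = rgb(0, 0, 1, 1/4),
     xlab = 'Mean of Price', ylab = 'Standard deviation of Price',
     main = 'Joint bootstrap posterior samples', cex.main = 1)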

As a final observation, we do not necessarily need to define a function separately before using bayesboot (in our case the function foo1 above). We can define it inside bayesboot. In our example the following code gives exactly the same results.

set.seed(1)
bb.mult1 = bayesboot(data = Price, statistic = function(data){c(mean(data), sd(data))},
                     R = 5000, use.weights = FALSE)

Of course, if the function is complicated it is better to define it outside bayesboot.

Task:

4. Load the package moments. This includes the functions skewness() and kurtosis(). Run bayesboot on the variable Horsepower for these two statistics using 5000 samples and the resampling option, storing the results in an object bb.mult2 (as usual, use set.seed(1)). Define appropriate names for the variables and produce a density plot.

5 Using the bayesboot package for regression analysis

5.1 With one predictor

Given everything we have learned so far, together with Applied Lecture 1 on regression, this should be relatively straightforward. Say that we are again interested in Price as the response variable and Horsepower as the predictor, and that we want the regression coefficients of the intercept and the slope.

We must first define a function that returns the coefficient estimates. This is relatively easy to do using a combination of the commands coef and lm.

lm.coefs1 = function(data) {
  coef( lm(Price ~ Horsepower, data = data) )
}
lm.coefs1(Cars93)
## (Intercept)  Horsepower 
##  -2.3214825   0.1535693

We can now apply bayesboot using the resampling method.

# The below takes some time for R = 5000, so let's use R = 1000
bb.reg1 = bayesboot(data = Cars93, statistic = lm.coefs1, R = 1000, use.weights = FALSE)
plot(bb.reg1, cex = 1, cex.lab = 1)

summary(bb.reg1)
## Bayesian bootstrap
## 
## Number of posterior draws: 1000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##    statistic       mean         sd    hdi.low  hdi.high
##  (Intercept) -2.5226059 2.16190701 -6.8118282 1.8948591
##   Horsepower  0.1551188 0.01844872  0.1179382 0.1916066
## 
## Quantiles:
##    statistic      q2.5%       q25%     median       q75%    q97.5%
##  (Intercept) -7.1454985 -3.8838908 -2.4897783 -1.2030188 1.5968906
##   Horsepower  0.1211688  0.1438173  0.1539134  0.1665158 0.1965004
## 
## Call:
##  bayesboot(data = Cars93, statistic = lm.coefs1, R = 1000, use.weights = FALSE)

Note that now bayesboot automatically assigns proper names in the plot above ((Intercept) and Horsepower instead of V1 and V2).

We can also produce plots of the bootstrap regression lines, the original OLS regression line and the data points (as we have done in Applied Lecture 1).

plot(1, bty="l", xlab="Horsepower", ylab="Price", xlim = range(Horsepower), ylim = range(Price))
for (j in 1:1000) {
  abline(coef = bb.reg1[j,], col='pink', lwd = 2)
}
abline(coef = coef(lm(Price ~ Horsepower, data = Cars93)), col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)

Of course, we can again store the posterior samples in objects for any type of post-processing.

b0 = bb.reg1$`(Intercept)`
b1 = bb.reg1$Horsepower
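
For instance (a small sketch of such post-processing; the threshold 0.15 is arbitrary and only for illustration):

quantile(b1, probs = c(0.025, 0.975))  # equal-tailed 95% interval for the slope
mean(b1 > 0.15)                        # posterior probability that the slope exceeds 0.15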

Task:

5. Define a function lm.coefs2 similar to lm.coefs1 above, but now with two arguments data and w, where lm has the additional argument weights = w. Then use this function to run bayesboot with the weighted method for 1000 samples (set.seed(1)), storing the results in an object bb.reg2. Produce density plots of the regression coefficients.

5.2 With multiple predictors

The extension to multiple predictors is pretty straightforward. We construct again a function using coef and lm and this time we add the predictors that we want inside lm. Like in Applied Lecture 1, assume that we are interested in the model with response Price and predictors Horsepower, Fuel.tank.capacity and Passengers.

lm.coefs3 = function(data) {
  coef( lm(Price ~ Horsepower + Fuel.tank.capacity + Passengers, data = data) )
}
lm.coefs3(Cars93)
##        (Intercept)         Horsepower Fuel.tank.capacity         Passengers 
##        -10.7166535          0.1215349          0.6476156          0.4946594

Then, we have the following code and results.

bb.reg3 = bayesboot(data = Cars93, statistic = lm.coefs3, R = 1000, use.weights = FALSE)
summary(bb.reg3)
## Bayesian bootstrap
## 
## Number of posterior draws: 1000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##           statistic        mean         sd      hdi.low   hdi.high
##         (Intercept) -10.3364986 4.08028525 -17.61199056 -2.4718820
##  Fuel.tank.capacity   0.6145359 0.39545548  -0.20924737  1.3368692
##          Horsepower   0.1255108 0.03056119   0.06733603  0.1875315
##          Passengers   0.4219182 0.86379843  -1.22119980  2.2390925
## 
## Quantiles:
##           statistic        q2.5%        q25%      median       q75%     q97.5%
##         (Intercept) -18.17511637 -13.2777616 -10.1779113 -7.4560290 -3.0413979
##  Fuel.tank.capacity  -0.20352609   0.3699071   0.6385119  0.8680470  1.3557652
##          Horsepower   0.07173276   0.1049201   0.1232754  0.1427855  0.1927868
##          Passengers  -1.22012762  -0.1807864   0.4026125  0.9777024  2.2407161
## 
## Call:
##  bayesboot(data = Cars93, statistic = lm.coefs3, R = 1000, use.weights = FALSE)
plot(bb.reg3, col = 'lightgreen', cex = 1, cex.lab = 1)

Similarly to the conclusions in Applied Lecture 1, we see that Horsepower seems to be the only influential predictor as it is the only one with a 95% HDI that does not contain 0.

Note: In the regression analyses R = 1000 was used simply to speed up the computations.