We will continue working with the Cars93 dataset included in the MASS package. As we did in Applied Lecture 1, let's initially load the data, remove the missing entries and attach the variables contained in Cars93.
library('MASS')
data(Cars93)
sum(is.na(Cars93))
## [1] 13
Cars93 = na.omit(Cars93)
n = nrow(Cars93)
n
## [1] 82
attach(Cars93)
The first thing to do is to install the package (if it isn't installed already), so simply execute
install.packages('bayesboot')
and then load it
library(bayesboot)
The basic command we are going to use from this package is bayesboot; to see its help page, type
?bayesboot
As we see, the command has several arguments:
bayesboot(data, statistic, R = 4000, R2 = 4000, use.weights = FALSE, .progress = "none", .parallel = FALSE, ...)
Let's briefly go through the most important ones for our purposes.
data: This corresponds to the sample used as input and it can be a vector, a list, a matrix or a data.frame.
statistic: The estimator/statistical functional (in Lecture terms) of interest. It can be one or more statistics (in the latter case we have to write our own function).
R: The number of posterior samples from the Bayesian bootstrap (the analogue of the number of classical bootstrap replications).
R2: This is used when use.weights = FALSE (see below). It is the size of the samples used to approximate weighted statistics. No need to worry much about setting this; the default value is again 4000.
use.weights: This argument is important. When TRUE, the data are reweighted and we calculate weighted statistics according to the original Bayesian bootstrap, as described in Lecture 7. When FALSE (the default option) the reweighting is approximated by resampling the data.
Two important observations:
First, the results obtained with use.weights = FALSE and use.weights = TRUE are in most cases more or less the same.
Second, the option use.weights = FALSE is simple to use and is recommended (which is why it is the default). The option use.weights = TRUE requires the use of weighted statistics inside bayesboot and is trickier. It should be noted, however, that for complicated problems use.weights = TRUE can be much faster (because it just uses weighted sums). In this practical we will see the use of both.
The bayesboot function uses an internal command for simulating the weights from a Dirichlet distribution as \[\boldsymbol{\pi}\sim \mathrm{Dir}(1,\ldots,1).\] This is the only option that we have (we cannot change the concentration parameters to, say, \(\alpha_1=0.001,\ldots,\alpha_n=0.001\)) and we will have to live with that!
Prior to proceeding to the main material of this practical, let's briefly motivate the use of the Bayesian bootstrap. In Lecture 7 a point was made that the results from the classical and the Bayesian bootstrap are usually very similar. Although generally true, this is not always the case. Here we consider the following small sample with \(n = 5\).
samples = c(1.865, 3.053, 1.401, 0.569, 4.132)
Suppose we are interested in the mean. We could use the manual code presented in Example 8.1 of Lecture 7 to obtain histograms of the mean from the classical and the Bayesian bootstrap, but now let's use the functions boot and bayesboot.
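For reference, here is a minimal manual sketch of the Bayesian bootstrap for the mean of this sample, drawing the Dirichlet(1, ..., 1) weights by normalising standard exponential draws (this is only an illustration, not the exact code of Example 8.1; the object names R.manual and bb.manual are made up).
set.seed(1)
R.manual = 10000
bb.manual = numeric(R.manual)
for (r in 1:R.manual) {
  g = rexp(length(samples))        # independent standard exponentials
  w = g / sum(g)                   # normalised weights, distributed as Dirichlet(1, ..., 1)
  bb.manual[r] = sum(w * samples)  # weighted mean = one posterior draw of the mean
}
hist(bb.manual, main = 'Manual Bayesian bootstrap', col = 'lightblue', xlab = '')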
First, the classical bootstrap with boot. As we have seen, we must first construct a function with arguments data and indices, even if it is for a single statistic like the mean.
library(boot)
foo = function(data, indices){
  dt <- data[indices]  # resampled data for this bootstrap replication
  c(mean(dt))          # the statistic of interest (the mean)
}
cb = boot(data = samples, statistic = foo, R = 10000)
head(cb$t) # First 6 bootstrap replications
## [,1]
## [1,] 2.6080
## [2,] 1.1610
## [3,] 2.4632
## [4,] 2.3704
## [5,] 3.3700
## [6,] 2.6574
Next, the Bayesian approach with bayesboot. For a single statistic this is simpler as we do not need to construct a function; we can simply use statistic = mean inside bayesboot. Below we also plot the resulting histograms.
bb = bayesboot(data = samples, statistic = mean, R = 10000)
names(bb)
## [1] "V1"
head(bb$V1) # First 6 Bayesian bootstrap posterior samples
## [1] 1.778630 2.544689 3.388796 3.057806 2.545925 2.122961
par(mfrow=c(1,2))
hist(cb$t, main = 'Classical bootstrap', col = 'lightgreen', xlab = '')
hist(bb$V1, main = 'Bayesian bootstrap', col = 'lightblue', xlab = '')
We see that here the Bayesian bootstrap produces a smoother histogram (this is generally desirable) in comparison to the classical bootstrap. Concerning presentation, we could also show the histograms on the same plot using the following code.
hist(bb$V1, col = 'lightblue', xlab = '', main = 'Bootstrap estimates of the mean')
hist(cb$t, col = 'lightgreen', xlab = '', add = TRUE, main = '')
legend('topright', legend = c('Classical bootstrap','Bayesian bootstrap'),
col = c('lightgreen','lightblue'), pch = 15, bty = 'n')
Alternatively, and even better, we can use more transparent colors to show the overlap. For this we can use the command rgb:
rgb(red, green, blue, alpha, names = NULL, maxColorValue = 1)
The parameter alpha controls transparency; if we set it below 1 (e.g., 1/4) we get the following plot.
hist(bb$V1, col = rgb(0,0,1,1/4), xlab = '', main = 'Bootstrap estimates of the mean')
hist(cb$t, col = rgb(1,0,0,1/4), xlab = '', add = TRUE, main = '')
legend('topright', legend = c('Classical bootstrap','Bayesian bootstrap'),
col = c(rgb(1,0,0,1/4),rgb(0,0,1,1/4)), pch = 15, bty = 'n')
Now let's use bayesboot for some Bayesian bootstrap analyses of sample statistics using the dataset Cars93.
Let's start by considering the mean of the variable Price using 5000 posterior samples. We can use the code below, storing the samples in the object bb.mean1.
set.seed(1)
bb.mean1 = bayesboot(Price, mean, R = 5000)
Above we are implicitly providing the data and statistic arguments of bayesboot and also using the default option for the weights. The corresponding explicit piece of code is the following.
bb.mean1 = bayesboot(data = Price, statistic = mean, R = 5000, use.weights = FALSE)
For the remainder of this practical we will use explicit code, which is generally recommended for beginners.
Now, let's say that we do not want to use the use.weights = FALSE resampling approach and prefer the original Bayesian bootstrap based on weighted discrete sums with use.weights = TRUE. If we run the piece of code below, we notice an error message.
bb.mean2 = bayesboot(data = Price, statistic = mean, R = 5000 , use.weights = TRUE)
## Applying the statistic on the original data and with uniform weights resulted in an error
## Error in mean.default(data, w, ...): 'trim' must be numeric of length one
This is because we now need to use the corresponding weighted statistic of the mean. The command in R for that is weighted.mean().
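As a quick illustration of what weighted.mean does (the object name w.unif is just for demonstration), note that with uniform weights it simply reproduces the ordinary mean:
w.unif = rep(1/n, n)          # uniform weights over the n = 82 observations
weighted.mean(Price, w.unif)  # identical to mean(Price)
mean(Price)
With weighted.mean supplied as the statistic, the code below runs without errors.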
set.seed(1)
bb.mean2 = bayesboot(data = Price, statistic = weighted.mean, R = 5000 , use.weights = TRUE)
Next, let's see what we can extract from bb.mean1:
bb.mean1
## The first 10 draws (out of 5000) from the Bayesian bootstrap:
##
## V1
## 1 18.42533
## 2 17.73820
## 3 17.66150
## 4 19.01128
## 5 19.27875
## 6 18.71575
## 7 21.00473
## 8 20.52713
## 9 20.01837
## 10 21.28450
## .. ...
##
## Use summary() to produce a summary of the posterior distribution.
summary(bb.mean1)
## Bayesian bootstrap
##
## Number of posterior draws: 5000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## V1 19.15935 1.100644 17.02798 21.29605
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## V1 17.21363 18.39279 19.0907 19.85719 21.5667
##
## Call:
## bayesboot(data = Price, statistic = mean, R = 5000, use.weights = FALSE)
Typing bb.mean1 gives us the first 10 samples of the mean of Price, but we also get a useful hint to use summary(), which, as we see, provides a lot of relevant information for the bootstrap posterior of the mean. Note that the 95% Highest Density Interval (HDI) is the Bayesian analogue of a classical bootstrap 95% confidence interval. The posterior sample is contained in bb.mean1$V1, so we can use this to, say, plot a histogram; for instance, in the following way.
hist(bb.mean1$V1, main = 'Bootstrap means of Price', xlab = '', col = 'pink')
Another option is to use the plot function, which is compatible with bayesboot and also provides additional information (the mean and the 95% HDI).
plot(bb.mean1)
As we see it also allows for customisation (e.g., main title and colour).
plot(bb.mean1, main = 'Bootstrap means of Price', cex.main = 1, col = 'pink')
Above, the argument cex.main = 1 ensures better visibility of the main title.
Tasks:
1. Produce the corresponding histograms and also summaries from the object bb.mean2. The results should look similar to the output below.
## Bayesian bootstrap
##
## Number of posterior draws: 5000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## V1 19.20242 1.095229 17.17256 21.4256
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## V1 17.23141 18.44833 19.14011 19.88169 21.52821
##
## Call:
## bayesboot(data = Price, statistic = weighted.mean, R = 5000, use.weights = TRUE)
2. Say that we are interested in the variable Horsepower, specifically its variance. Run bayesboot with the resampling approach using 5000 samples. Store the result in an object bb.var (use set.seed(1)). Then, as above, produce the corresponding histograms and summaries from the object bb.var. The results should look like the ones below.
## Bayesian bootstrap
##
## Number of posterior draws: 5000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## V1 2547.584 483.5881 1737.703 3530.944
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## V1 1781.352 2197.792 2481.094 2834.723 3654.47
##
## Call:
## bayesboot(data = Horsepower, statistic = var, R = 5000, use.weights = FALSE)
Here we are interested in the correlation between Price and Horsepower. To implement this with bayesboot we need to join the two variables in a data.frame or matrix object. Below we use data.frame (but a matrix with cbind(Price, Horsepower) would also be fine).
PrHo = data.frame(Price, Horsepower)
cor(PrHo)
## Price Horsepower
## Price 1.0000000 0.7872511
## Horsepower 0.7872511 1.0000000
As we see, we have a small problem because cor returns the estimated correlation matrix, while we only need one of the off-diagonal elements. However, we can easily define our own function which extracts only the upper off-diagonal cell.
cor.fun = function(data){
  cor(data)[1, 2]  # the off-diagonal element, i.e. the correlation between the two variables
}
cor.fun(PrHo)
## [1] 0.7872511
Now we can use this function in bayesboot with the resampling approach and get a histogram of the posterior samples.
set.seed(1)
bb.cor1 = bayesboot(data = PrHo, statistic = cor.fun, R = 5000, use.weights = FALSE)
plot(bb.cor1, main = 'Bootstrap correlations of Price and Horsepower',
cex.main = 1, xlab = '', col = 'magenta')
What if we want to use the original Bayesian bootstrap with use.weights = TRUE? Making a function for weighted correlation may not be straightforward. Fortunately, package boot has a function for that called corr.
corr(PrHo) # same result as above because all weights are equal by default (this will be handled internally in bayesboot)
## [1] 0.7872511
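Note also that corr accepts a vector of weights as its second argument; this is what bayesboot supplies internally when use.weights = TRUE. As a purely illustrative example (the weights w.demo below are arbitrary):
w.demo = runif(n)             # arbitrary positive weights, for illustration only
w.demo = w.demo / sum(w.demo) # normalise them
corr(PrHo, w.demo)            # weighted correlation with these non-uniform weights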
Task:
3. Run bayesboot with the weighted approach using the function corr and 5000 samples. Store the result in an object bb.cor2 (use set.seed(1)). Then, as above, use plot, but with the additional arguments: (i) showMode = TRUE, (ii) showCurve = TRUE and (iii) cred.mass = 0.99. The result should look like this.
With the additional arguments provided we see that we now get: (i) the posterior mode (or Maximum-A-Posteriori [MAP] estimate) instead of the posterior mean, (ii) a density instead of a histogram and (iii) a 99% HDI instead of a 95% HDI.
Finally, we might be interested in evaluating bootstrap estimates and posterior distributions for more than one quantity at the same time. In this case we will have to define our own functions to be used as input in bayesboot.
The below code initially defines a function for returning the mean and the standard deviation of a sample. Then, bayesboot is applied to the variable Price.
foo1 = function(data){
  results = c(mean(data), sd(data))  # mean and standard deviation of the (re)sampled data
  results
}
set.seed(1)
bb.mult1 = bayesboot(data = Price, statistic = foo1, R = 5000, use.weights = FALSE)
plot(bb.mult1)
The above does not look very good in terms of presentation (although in the R console it may look just fine). We can improve it a bit by reducing the size of numbers/letters on the histograms with cex and also by reducing the size of the x-lab names with cex.lab (note that the file with R code does not use those). We would also like to replace "V1" and "V2" with "Mean" and "Standard deviation", respectively, using the command names. The code is as follows.
names(bb.mult1)
## [1] "V1" "V2"
names(bb.mult1) = c('Mean', 'Standard deviation')
plot(bb.mult1, cex = 1, cex.lab = 1)
Also, remember the showCurve argument in plot as well as the summary command.
plot(bb.mult1, showCurve = TRUE, cex = 1, cex.lab = 1)
summary(bb.mult1)
## Bayesian bootstrap
##
## Number of posterior draws: 5000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## Mean 19.20463 1.107464 17.104500 21.33873
## Standard deviation 9.78327 1.175866 7.671027 12.15408
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## Mean 17.235688 18.432300 19.152950 19.90166 21.54428
## Standard deviation 7.853313 8.962115 9.631686 10.48639 12.45521
##
## Call:
## bayesboot(data = Price, statistic = foo1, R = 5000, use.weights = FALSE)
Of course, always remember that you can extract the posterior samples from bayesboot and store them in new objects, for instance,
head(bb.mult1)
## The first 10 draws (out of 6) from the Bayesian bootstrap:
##
## Mean Standard deviation
## 1 18.20680 8.174865
## 2 18.61832 8.485256
## 3 18.06523 9.547009
## 4 17.97355 11.799315
## 5 20.52275 11.600940
## 6 17.79648 9.755993
## .. ...
##
## Use summary() to produce a summary of the posterior distribution.
bb.mean = bb.mult1$Mean
bb.sd = bb.mult1$`Standard deviation` # Using `` because there is a space
which you can use to create your own customised plots.
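For instance, a simple customised plot (purely illustrative) could show the joint posterior draws of the two statistics:
plot(bb.mean, bb.sd, pch = 16, col = rgb(0, 0, 1, 1/4),
     xlab = 'Mean', ylab = 'Standard deviation',
     main = 'Joint posterior draws for Price')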
As a final observation, we do not necessarily need to define a function separately before using bayesboot (in our case the function foo1 above). We can define it inside bayesboot. In our example the following code gives exactly the same results.
set.seed(1)
bb.mult1 = bayesboot(data = Price, statistic = function(data){c(mean(data), sd(data))},
R = 5000, use.weights = FALSE)
Of course, if the function is complicated it is better to define it outside bayesboot.
Task:
4. Load the package moments. This includes the functions skewness() and kurtosis(). Run bayesboot for the variable Horsepower and these two statistics using 5000 samples and the resampling option, storing the results in an object bb.mult2 (as usual, use set.seed(1)). Define appropriate names for the variables and produce a density plot.
Given everything we have learned so far, together with Applied Lecture 1 on regression, this should be relatively straightforward. Say that we are again interested in Price as the response variable and Horsepower as the predictor, and that we are interested in the regression coefficients of the intercept and the slope.
We must first define a function that returns the coefficient estimates. This is relatively easy to do using a combination of the commands coef and lm.
lm.coefs1 = function(data) {
  coef( lm(Price ~ Horsepower, data = data) )  # intercept and slope estimates
}
lm.coefs1(Cars93)
## (Intercept) Horsepower
## -2.3214825 0.1535693
We can now apply bayesboot using the resampling method.
# The below takes some time for R = 5000, so let's use R = 1000
bb.reg1 = bayesboot(data = Cars93, statistic = lm.coefs1, R = 1000, use.weights = FALSE)
plot(bb.reg1, cex = 1, cex.lab = 1)
summary(bb.reg1)
## Bayesian bootstrap
##
## Number of posterior draws: 1000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## (Intercept) -2.5226059 2.16190701 -6.8118282 1.8948591
## Horsepower 0.1551188 0.01844872 0.1179382 0.1916066
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## (Intercept) -7.1454985 -3.8838908 -2.4897783 -1.2030188 1.5968906
## Horsepower 0.1211688 0.1438173 0.1539134 0.1665158 0.1965004
##
## Call:
## bayesboot(data = Cars93, statistic = lm.coefs1, R = 1000, use.weights = FALSE)
Note that now bayesboot automatically assigns proper names in the plot above ((Intercept) and Horsepower instead of V1 and V2).
We can also produce plots of the bootstrap regression lines, the original OLS regression line and the data points (as we have done in Applied Lecture 1).
plot(1, bty="l", xlab="Horsepower", ylab="Price", xlim = range(Horsepower), ylim = range(Price))
for (j in 1:1000) {
  abline(coef = bb.reg1[j, ], col = 'pink', lwd = 2)
}
abline(coef = coef(lm(Price ~ Horsepower, data = Cars93)), col=1, lwd = 2)
points(Horsepower, Price, bty = 'l', pch = 16)
Of course, we can again store the posterior samples in objects for any type of post-processing.
b0 = bb.reg1$`(Intercept)`
b1 = bb.reg1$Horsepower
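For example (purely illustrative; the threshold 0.15 is arbitrary), we can compute posterior summaries or probabilities directly from these draws:
quantile(b1, c(0.025, 0.975))  # 95% posterior interval for the slope
mean(b1 > 0.15)                # posterior probability that the slope exceeds 0.15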
Task:
5. Define a function lm.coefs2 similar to lm.coefs1 above, but now with two arguments data and w, where inside lm we add the additional argument weights = w. Then use this function to run bayesboot with the weighted method for 1000 samples (set.seed(1)), storing the results in an object bb.reg2. Produce density plots of the regression coefficients.
The extension to multiple predictors is pretty straightforward. We again construct a function using coef and lm, and this time we add the predictors that we want inside lm. As in Applied Lecture 1, assume that we are interested in the model with response Price and predictors Horsepower, Fuel.tank.capacity and Passengers.
lm.coefs3 = function(data) {
  coef( lm(Price ~ Horsepower + Fuel.tank.capacity + Passengers, data = data) )
}
lm.coefs3(Cars93)
## (Intercept) Horsepower Fuel.tank.capacity Passengers
## -10.7166535 0.1215349 0.6476156 0.4946594
Then, we have the following code and results.
bb.reg3 = bayesboot(data = Cars93, statistic = lm.coefs3, R = 1000, use.weights = FALSE)
summary(bb.reg3)
## Bayesian bootstrap
##
## Number of posterior draws: 1000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## (Intercept) -10.3364986 4.08028525 -17.61199056 -2.4718820
## Fuel.tank.capacity 0.6145359 0.39545548 -0.20924737 1.3368692
## Horsepower 0.1255108 0.03056119 0.06733603 0.1875315
## Passengers 0.4219182 0.86379843 -1.22119980 2.2390925
##
## Quantiles:
## statistic q2.5% q25% median q75% q97.5%
## (Intercept) -18.17511637 -13.2777616 -10.1779113 -7.4560290 -3.0413979
## Fuel.tank.capacity -0.20352609 0.3699071 0.6385119 0.8680470 1.3557652
## Horsepower 0.07173276 0.1049201 0.1232754 0.1427855 0.1927868
## Passengers -1.22012762 -0.1807864 0.4026125 0.9777024 2.2407161
##
## Call:
## bayesboot(data = Cars93, statistic = lm.coefs3, R = 1000, use.weights = FALSE)
plot(bb.reg3, col = 'lightgreen', cex = 1, cex.lab = 1)
Similarly to the conclusions in Applied Lecture 1, we see that Horsepower seems to be the only influential predictor as it is the only one with a 95% HDI that does not contain 0.
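As a quick illustrative check (not part of the original output), we can compute the share of posterior draws above zero for each coefficient:
colMeans(bb.reg3 > 0)  # proportion of draws above 0; values near 0 or 1 indicate a clearly signed effect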
Note: In the regression analyses R = 1000 was used simply to speed up the computations.