We return to the Cars93 dataset and follow the usual steps.
library('MASS')
data(Cars93)
class(Cars93)
## [1] "data.frame"
dim(Cars93)
## [1] 93 27
sum(is.na(Cars93))
## [1] 13
Cars93 = na.omit(Cars93)
n = nrow(Cars93)
n
## [1] 82
head(Cars93, n = 2)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
## 1 Acura Integra Small 12.9 15.9 18.8 25 31
## 2 Acura Legend Midsize 29.2 33.9 38.7 18 25
## AirBags DriveTrain Cylinders EngineSize Horsepower RPM
## 1 None Front 4 1.8 140 6300
## 2 Driver & Passenger Front 6 3.2 200 5500
## Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1 2890 Yes 13.2 5 177 102
## 2 2335 Yes 18.0 5 195 115
## Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
## 1 68 37 26.5 11 2705 non-USA Acura Integra
## 2 71 38 30.0 15 3560 non-USA Acura Legend
In the following we are initially interested in a multiple regression
application, considering Price as the response variable. Obviously,
we cannot consider Min.Price and Max.Price as
predictors since Price is just the average of those (see
information in ?Cars93
). We also want to consider only
continuous predictors. So, we define keep
as below and
create the data frame cars93
which contains only the
variables of interest.
keep = c('Price', 'MPG.city', 'MPG.highway', 'EngineSize',
'Horsepower', 'RPM', 'Rev.per.mile', 'Fuel.tank.capacity',
'Length', 'Wheelbase', 'Width', 'Turn.circle', 'Rear.seat.room')
cars93 = Cars93[,keep]
head(cars93,n=2)
## Price MPG.city MPG.highway EngineSize Horsepower RPM Rev.per.mile
## 1 15.9 25 31 1.8 140 6300 2890
## 2 33.9 18 25 3.2 200 5500 2335
## Fuel.tank.capacity Length Wheelbase Width Turn.circle Rear.seat.room
## 1 13.2 177 102 68 37 26.5
## 2 18.0 195 115 71 38 30.0
We can use lm
to fit the full model that contains all
predictors and then the command summary
for the results.
Note that the “.” symbol below means that we include all predictors in
cars93
.
linear.model = lm(Price ~ . , data = cars93)
summary(linear.model)
##
## Call:
## lm(formula = Price ~ ., data = cars93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4747 -2.7390 -0.0843 2.6081 23.5256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.467131 30.727240 1.252 0.21484
## MPG.city 0.077612 0.441156 0.176 0.86087
## MPG.highway -0.327216 0.413307 -0.792 0.43125
## EngineSize 0.009996 2.607578 0.004 0.99695
## Horsepower 0.155249 0.037247 4.168 8.77e-05 ***
## RPM -0.003068 0.002254 -1.361 0.17787
## Rev.per.mile 0.002801 0.002513 1.115 0.26891
## Fuel.tank.capacity 0.340698 0.492717 0.691 0.49159
## Length 0.011950 0.125946 0.095 0.92469
## Wheelbase 0.787171 0.286581 2.747 0.00767 **
## Width -1.345927 0.489693 -2.749 0.00763 **
## Turn.circle -0.473591 0.376897 -1.257 0.21315
## Rear.seat.room -0.044031 0.332012 -0.133 0.89488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.533 on 69 degrees of freedom
## Multiple R-squared: 0.7371, Adjusted R-squared: 0.6913
## F-statistic: 16.12 on 12 and 69 DF, p-value: 1.362e-15
Note that the above is equivalent to (but much simpler than) the following:
linear.model = lm(Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + RPM +
Rev.per.mile + Fuel.tank.capacity + Length + Wheelbase + Width +
Turn.circle + Rear.seat.room, data = cars93)
Another equivalent alternative is to use attach
(this
extracts the variables in cars93
as vectors) in which case
we do not need to add data = cars93
inside
lm
:
attach(cars93)
linear.model = lm(Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + RPM +
Rev.per.mile + Fuel.tank.capacity + Length + Wheelbase + Width +
Turn.circle + Rear.seat.room)
In any case, regardless of the coding approach, from
summary
we observe initially that the estimate of \(\beta_0\) is quite large (38.467131) and
also not statistically significant. This is due to differences in the scales
of the variables. We could bring it down by reducing the scale of the
response, by taking the logarithm for instance. As we observe below, this
also removes, to some degree, the skewness from the distribution of the
response variable.
logPrice = log(Price)
par(mfrow = c(1,2))
hist(Price, probability = TRUE)
lines(density(Price))
hist(logPrice, probability = TRUE)
lines(density(logPrice))
So let’s use logPrice
, replacing Price
in
cars93
, and see the result.
cars93[, 'Price'] = logPrice
names(cars93)[1] = 'logPrice'
head(cars93, n = 2)
## logPrice MPG.city MPG.highway EngineSize Horsepower RPM Rev.per.mile
## 1 2.766319 25 31 1.8 140 6300 2890
## 2 3.523415 18 25 3.2 200 5500 2335
## Fuel.tank.capacity Length Wheelbase Width Turn.circle Rear.seat.room
## 1 13.2 177 102 68 37 26.5
## 2 18.0 195 115 71 38 30.0
linear.model = lm(logPrice ~ ., data = cars93)
summary(linear.model)
##
## Call:
## lm(formula = logPrice ~ ., data = cars93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44103 -0.13519 -0.00941 0.14358 0.50699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.398e+00 1.260e+00 1.903 0.061246 .
## MPG.city -5.545e-03 1.810e-02 -0.306 0.760198
## MPG.highway -1.041e-02 1.695e-02 -0.614 0.541103
## EngineSize -9.016e-02 1.070e-01 -0.843 0.402189
## Horsepower 6.178e-03 1.528e-03 4.043 0.000135 ***
## RPM -1.101e-04 9.244e-05 -1.191 0.237702
## Rev.per.mile 4.638e-05 1.031e-04 0.450 0.654210
## Fuel.tank.capacity 3.424e-02 2.021e-02 1.694 0.094755 .
## Length -2.201e-04 5.166e-03 -0.043 0.966140
## Wheelbase 3.316e-02 1.176e-02 2.821 0.006254 **
## Width -4.357e-02 2.009e-02 -2.169 0.033530 *
## Turn.circle -3.902e-03 1.546e-02 -0.252 0.801486
## Rear.seat.room -1.940e-03 1.362e-02 -0.142 0.887148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.227 on 69 degrees of freedom
## Multiple R-squared: 0.8009, Adjusted R-squared: 0.7663
## F-statistic: 23.13 on 12 and 69 DF, p-value: < 2.2e-16
The intercept is now much lower, while working on the logarithmic scale also changes some \(p\)-values.
A last thing we might want to do in such regression analyses is to bring all the predictors to the same scale. As we see below, the predictors are on very different scales, which means that the magnitudes of the regression coefficients are not directly comparable.
apply(cars93[, -1], 2, range)
## MPG.city MPG.highway EngineSize Horsepower RPM Rev.per.mile
## [1,] 16 22 1.0 55 3800 1320
## [2,] 46 50 5.7 300 6500 3755
## Fuel.tank.capacity Length Wheelbase Width Turn.circle Rear.seat.room
## [1,] 9.2 141 90 60 32 19
## [2,] 23.0 219 117 78 45 36
We can put the predictors on the same scale by standardising them;
i.e., making each have mean equal to 0 and variance equal to 1. We can use the
command scale
for that (type ?scale
for
options). Let’s keep cars93
as is (we are going to need it
later) and create its scaled version stored in
scaled.cars93
.
scaled.cars93 = cars93
scaled.cars93[, -1] = scale(cars93[, -1])
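# Optional sanity check (a quick sketch, output omitted): after scaling, each
# predictor should have mean (numerically) zero and standard deviation one.
round(colMeans(scaled.cars93[, -1]), 10)
apply(scaled.cars93[, -1], 2, sd)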
linear.model = lm(logPrice ~ ., data = scaled.cars93)
summary(linear.model)
##
## Call:
## lm(formula = logPrice ~ ., data = scaled.cars93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44103 -0.13519 -0.00941 0.14358 0.50699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.839752 0.025064 113.299 < 2e-16 ***
## MPG.city -0.031018 0.101222 -0.306 0.760198
## MPG.highway -0.052179 0.084955 -0.614 0.541103
## EngineSize -0.090525 0.107396 -0.843 0.402189
## Horsepower 0.315399 0.078004 4.043 0.000135 ***
## RPM -0.064271 0.053960 -1.191 0.237702
## Rev.per.mile 0.023062 0.051264 0.450 0.654210
## Fuel.tank.capacity 0.103069 0.060840 1.694 0.094755 .
## Length -0.003362 0.078909 -0.043 0.966140
## Wheelbase 0.214423 0.076021 2.821 0.006254 **
## Width -0.160938 0.074198 -2.169 0.033530 *
## Turn.circle -0.012349 0.048928 -0.252 0.801486
## Rear.seat.room -0.005512 0.038695 -0.142 0.887148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.227 on 69 degrees of freedom
## Multiple R-squared: 0.8009, Adjusted R-squared: 0.7663
## F-statistic: 23.13 on 12 and 69 DF, p-value: < 2.2e-16
The above regression coefficients are now comparable, and we again see changes in significance levels. One might ask: is re-scaling always necessary? Not necessarily; if we prefer interpretability (coefficient estimates corresponding to the real scales of the predictors) we may work on the original scales. Here we will work with the scaled predictors.
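If we ever need a coefficient on the original scale of a predictor, we can recover it from the standardised fit by dividing by that predictor's sample standard deviation (since scale() divides each predictor by its sample standard deviation). A minimal sketch, assuming cars93 and scaled.cars93 as defined above (output omitted):
coef(linear.model)['Horsepower'] / sd(cars93[, 'Horsepower'])   # back-transformed coefficient
coef(lm(logPrice ~ ., data = cars93))['Horsepower']             # coefficient from the unscaled fit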
So, based on the above estimated model, the coefficients that are significant at the 5% level (apart from the intercept) are those corresponding to Horsepower, Wheelbase and Width. The question is: can we trust those estimates? Not before checking for potential multicollinearity. Multicollinearity refers to the situation where the predictors in a regression model are (close to) linearly dependent, typically because some of them are highly correlated (see the Wikipedia page for more information). Under multicollinearity, the estimates of the regression coefficients are subject to high variance and are, therefore, unreliable.
We see below that we have indeed some very high correlations! So, ideally we should do something to address this issue.
cov93 = cov(scaled.cars93[, -1])
cor93 = cov2cor(cov93)
library(pheatmap)
pheatmap(cor93, display_numbers = TRUE, cluster_rows = FALSE,
cluster_cols = FALSE, fontsize_number = 10,
main = 'Correlations between predictors')
There are several approaches to deal with the problem of multicollinearity; for instance, one can use ridge regression instead of standard regression. Here we will use a traditional approach based on Variance Inflation Factors (VIFs). Briefly, under this approach each predictor is regressed against all the other predictors and the VIF is calculated from the resulting \(R^2\) measure. The procedure is simple and we will not go into details (see the Wikipedia page); for predictor \(X_j\) we basically calculate \[ VIF_j = \frac{1}{1-R^2_j}.\] Common rule-of-thumb thresholds that suggest high multicollinearity are VIF > 5 or VIF > 10, depending on how strict we want to be.
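To make the definition concrete, the following is a minimal sketch computing the VIF of Horsepower "by hand", assuming scaled.cars93 as created above; it should agree with the value that vif (introduced next) returns for Horsepower in the full model.
# Regress Horsepower on all the other predictors (excluding the response logPrice)
aux.fit = lm(Horsepower ~ . - logPrice, data = scaled.cars93)
R2.j = summary(aux.fit)$r.squared
1 / (1 - R2.j)   # VIF for Horsepower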
In R we can use the vif
command from package
car
. Let’s see what this gives for the full model
considered above.
library(car)
## Loading required package: carData
vif(linear.model)
## MPG.city MPG.highway EngineSize Horsepower
## 16.110561 11.348431 18.135714 9.567437
## RPM Rev.per.mile Fuel.tank.capacity Length
## 4.578366 4.132175 5.820200 9.790667
## Wheelbase Width Turn.circle Rear.seat.room
## 9.087039 8.656598 3.764142 2.354300
Very high VIFs indeed! The highest VIF is that of the variable EngineSize. So, let's remove this predictor, re-fit the model and recalculate the VIFs.
linear.model = lm(logPrice ~ . - EngineSize, data = scaled.cars93)
vif(linear.model)
## MPG.city MPG.highway Horsepower RPM
## 15.818730 11.312104 5.141064 2.435494
## Rev.per.mile Fuel.tank.capacity Length Wheelbase
## 3.712738 5.609562 9.762725 9.007541
## Width Turn.circle Rear.seat.room
## 8.328024 3.745159 2.187627
The VIFs have decreased, yet some are still high. We repeat this process until all VIFs are below 5 (we choose the stricter cut-off threshold).
linear.model = lm(logPrice ~ . - EngineSize - MPG.city, data = scaled.cars93)
vif(linear.model)
## MPG.highway Horsepower RPM Rev.per.mile
## 2.777327 5.114868 2.431202 3.424145
## Fuel.tank.capacity Length Wheelbase Width
## 5.342574 9.517154 9.006463 8.316865
## Turn.circle Rear.seat.room
## 3.738376 2.160643
linear.model = lm(logPrice ~ . - EngineSize - MPG.city - Length, data = scaled.cars93)
vif(linear.model)
## MPG.highway Horsepower RPM Rev.per.mile
## 2.701898 4.893156 2.403916 3.422715
## Fuel.tank.capacity Wheelbase Width Turn.circle
## 5.228544 7.021996 7.327134 3.464725
## Rear.seat.room
## 2.140781
linear.model = lm(logPrice ~ . - EngineSize - MPG.city - Length -
Width, data = scaled.cars93)
vif(linear.model)
## MPG.highway Horsepower RPM Rev.per.mile
## 2.622137 4.668149 2.226758 3.232764
## Fuel.tank.capacity Wheelbase Turn.circle Rear.seat.room
## 5.089194 5.739041 3.008665 1.968458
linear.model = lm(logPrice ~ . - EngineSize - MPG.city - Length -
Width - Wheelbase, data = scaled.cars93)
vif(linear.model)
## MPG.highway Horsepower RPM Rev.per.mile
## 2.527488 4.477091 2.159676 3.023761
## Fuel.tank.capacity Turn.circle Rear.seat.room
## 4.285448 2.922203 1.407649
We finally arrive at a model where all VIFs are below 5. As we can see below, the statistically significant coefficients (at the 5% level) are those of Horsepower and Fuel.tank.capacity. The predictors MPG.highway and Rear.seat.room are significant only at the 10% level.
summary(linear.model)
##
## Call:
## lm(formula = logPrice ~ . - EngineSize - MPG.city - Length -
## Width - Wheelbase, data = scaled.cars93)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46567 -0.15178 -0.01096 0.14059 0.71803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.83975 0.02628 108.039 < 2e-16 ***
## MPG.highway -0.08020 0.04204 -1.908 0.0603 .
## Horsepower 0.26305 0.05596 4.701 1.17e-05 ***
## RPM -0.02127 0.03886 -0.547 0.5858
## Rev.per.mile 0.03364 0.04599 0.731 0.4668
## Fuel.tank.capacity 0.12478 0.05475 2.279 0.0255 *
## Turn.circle -0.03595 0.04521 -0.795 0.4291
## Rear.seat.room 0.05274 0.03138 1.681 0.0970 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.238 on 74 degrees of freedom
## Multiple R-squared: 0.7652, Adjusted R-squared: 0.743
## F-statistic: 34.45 on 7 and 74 DF, p-value: < 2.2e-16
The corresponding approximate large-sample Wald confidence intervals are as follows.
Wald = confint(linear.model)
Wald
## 2.5 % 97.5 %
## (Intercept) 2.787379055 2.89212501
## MPG.highway -0.163977289 0.00357341
## Horsepower 0.151555277 0.37455261
## RPM -0.098714305 0.05616593
## Rev.per.mile -0.057993982 0.12526927
## Fuel.tank.capacity 0.015692929 0.23386533
## Turn.circle -0.126025170 0.05413421
## Rear.seat.room -0.009783874 0.11525610
Let’s remember now how we can obtain CIs using the function
Boot
from package car, which allows for both
bootstrap regression approaches shown in lectures.
For NPB we define method = 'case'
in Boot
(the default actually) and we use 10000 bootstrap replications,
specified by the argument R
.
set.seed(1)
NPB.results = Boot(linear.model, method = 'case', R = 10000)
## Loading required namespace: boot
summary(NPB.results)
##
## Number of bootstrap replications R = 10000
## original bootBias bootSE bootMed
## (Intercept) 2.839752 0.00099434 0.026800 2.840731
## MPG.highway -0.080202 -0.00082885 0.044753 -0.078731
## Horsepower 0.263054 0.00512053 0.071841 0.264146
## RPM -0.021274 -0.00439467 0.046032 -0.024978
## Rev.per.mile 0.033638 0.00069709 0.042386 0.032964
## Fuel.tank.capacity 0.124779 0.00628697 0.058936 0.129299
## Turn.circle -0.035945 -0.00426736 0.057202 -0.039391
## Rear.seat.room 0.052736 -0.00676223 0.037725 0.048324
As we can see, the estimated bias of the coefficient estimates is very low. For
confidence intervals the function Boot
provides the
following options: (i) normal, (ii) basic, (iii) percentile and (iv)
bias-corrected (BC); see Practical 1 for a reminder.
normal.NPB = confint(NPB.results, level = 0.95, type = "norm")
basic.NPB = confint(NPB.results, level = 0.95, type = "basic")
percent.NPB = confint(NPB.results, level = 0.95, type = "perc")
bc.NPB = confint(NPB.results, level = 0.95, type = "bca")
normal.NPB
## Bootstrap normal confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 2.786230295 2.891285093
## MPG.highway -0.167086563 0.008340383
## Horsepower 0.117127886 0.398738935
## RPM -0.107100496 0.073341472
## Rev.per.mile -0.050133850 0.116014950
## Fuel.tank.capacity 0.002980227 0.234004091
## Turn.circle -0.143791482 0.080435245
## Rear.seat.room -0.014440799 0.133437490
basic.NPB
## Bootstrap basic confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 2.7861754223 2.89077450
## MPG.highway -0.1610190342 0.01627329
## Horsepower 0.1024429956 0.39046871
## RPM -0.1055911626 0.07447638
## Rev.per.mile -0.0528113389 0.11305539
## Fuel.tank.capacity -0.0002177163 0.23092675
## Turn.circle -0.1382376262 0.08425570
## Rear.seat.room -0.0087907564 0.14042088
percent.NPB
## Bootstrap percent confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 2.78872957 2.893328644
## MPG.highway -0.17667717 0.000615155
## Horsepower 0.13563918 0.423664893
## RPM -0.11702475 0.063042793
## Rev.per.mile -0.04578010 0.120086628
## Fuel.tank.capacity 0.01863151 0.249775973
## Turn.circle -0.15614665 0.066346667
## Rear.seat.room -0.03494866 0.114262978
bc.NPB
## Bootstrap bca confidence intervals
##
## 2.5 % 97.5 %
## (Intercept) 2.788024500 2.8923466961
## MPG.highway -0.178980509 -0.0009383059
## Horsepower 0.134059564 0.4213289277
## RPM -0.111189082 0.0681630419
## Rev.per.mile -0.046210250 0.1195165088
## Fuel.tank.capacity 0.002479506 0.2346471559
## Turn.circle -0.161732317 0.0628412266
## Rear.seat.room -0.017640247 0.1280065885
As we can see, overall, the NPB CIs are similar to the Wald intervals. One observation here is that the BC method (which is also generally considered more reliable) gives a (marginally) statistically significant result for the coefficient of MPG.highway.
Package car
is compatible with the base-R
hist
command, which we can use. Note that by default the BC
bootstrap CIs are used (represented as the horizontal black lines
below).
hist(NPB.results, legend ="top", col = 'red', main = 'Nonparametric Bootstrap')
To implement SRB we simply use method = 'residual'
in
the command Boot
. The rest of the code is essentially the
same. Here we just calculate the corresponding intervals
without presenting them (visual comparisons are presented in subsection
2.3).
set.seed(1)
SRB.results = Boot(linear.model, method = 'residual', R=10000)
summary(SRB.results)
##
## Number of bootstrap replications R = 10000
## original bootBias bootSE bootMed
## (Intercept) 2.839752 1.1535e-05 0.026659 2.839548
## MPG.highway -0.080202 -2.2133e-04 0.042582 -0.080844
## Horsepower 0.263054 4.5880e-04 0.056662 0.263029
## RPM -0.021274 -3.9374e-04 0.039587 -0.021726
## Rev.per.mile 0.033638 2.5390e-04 0.046956 0.034025
## Fuel.tank.capacity 0.124779 -2.8753e-04 0.055116 0.124016
## Turn.circle -0.035945 -3.3380e-04 0.045648 -0.035837
## Rear.seat.room 0.052736 5.4665e-04 0.031961 0.053097
normal.SRB = confint(SRB.results, level= 0.95, type="norm")
basic.SRB = confint(SRB.results, level= 0.95, type="basic")
percent.SRB = confint(SRB.results, level= 0.95, type="perc")
bc.SRB = confint(SRB.results, level= 0.95, type="bca")
hist(SRB.results, legend ="top", col = 'lightgreen',
main = 'Semiparametric Residual Bootstrap')
We can further consider the Bayesian bootstrap approach. We first
load package bayesboot and then define a function which
will return the coefficients of a linear model and which will be used as
input to the command bayesboot
. We check the correctness of
the coefficients for scaled.cars93
.
library(bayesboot)
betas = function(data) {
coef( lm(logPrice ~ . - EngineSize - MPG.city - Length -
Width - Wheelbase, data = data) )
}
betas(scaled.cars93)
## (Intercept) MPG.highway Horsepower RPM
## 2.83975203 -0.08020194 0.26305394 -0.02127419
## Rev.per.mile Fuel.tank.capacity Turn.circle Rear.seat.room
## 0.03363764 0.12477913 -0.03594548 0.05273611
We then run bayesboot
with
use.weights = FALSE
, which is the default option (actually
use.weights = TRUE
is to be avoided in the regression
setting as it is not straightforward to define a weighted statistic; see
Practical 2 for a reminder).
The command summary
gives all the relevant information.
Note: (i) the variables are presented in alphabetical
order and (ii) the 95% credible intervals are given in
the first table in the columns hdi.low
and
hdi.high
. As we see, the Bayesian approach supports
the variables Horsepower and Fuel.tank.capacity, but
also MPG.highway (like the NPB-BC method).
set.seed(1)
BBR.results = bayesboot(data = scaled.cars93, statistic = betas, R = 10000, use.weights = FALSE)
summary(BBR.results)
## Bayesian bootstrap
##
## Number of posterior draws: 10000
##
## Summary of the posterior (with 95% Highest Density Intervals):
## statistic mean sd hdi.low hdi.high
## (Intercept) 2.84031547 0.02423927 2.79470144 2.889922904
## Fuel.tank.capacity 0.13097722 0.05247347 0.03098518 0.235375479
## Horsepower 0.26680123 0.06363847 0.14448541 0.392370433
## MPG.highway -0.07975419 0.03860207 -0.15734644 -0.005605435
## Rear.seat.room 0.04716027 0.03243652 -0.01548012 0.109775027
## Rev.per.mile 0.03389027 0.03834551 -0.04109425 0.110426610
## RPM -0.02401111 0.04227813 -0.10427614 0.060624927
## Turn.circle -0.03874963 0.05287822 -0.14476889 0.059022524
##
## Quantiles:
## statistic q2.5% q25% median q75%
## (Intercept) 2.79469958 2.823310666 2.83989590 2.856379857
## Fuel.tank.capacity 0.02812839 0.094937163 0.13077755 0.166534177
## Horsepower 0.14944600 0.222538960 0.26450391 0.307654357
## MPG.highway -0.16202987 -0.104466793 -0.07740673 -0.053252498
## Rear.seat.room -0.01695058 0.024960232 0.04783397 0.069900896
## Rev.per.mile -0.04007993 0.008415042 0.03324996 0.059218810
## RPM -0.11004861 -0.052238577 -0.02271270 0.005157555
## Turn.circle -0.15052118 -0.071510345 -0.03549737 -0.001635448
## q97.5%
## 2.889918136
## 0.233565821
## 0.400029048
## -0.008824718
## 0.108712737
## 0.111647139
## 0.055668595
## 0.055752603
##
## Call:
## bayesboot(data = scaled.cars93, statistic = betas, R = 10000, use.weights = FALSE)
When there are many methods under consideration it is more convenient to compare the different CIs graphically. The code below is one possible approach to doing this for our application (note that more efficient, cleverer and nicer solutions certainly exist!). We will use package plotrix for this:
install.packages('plotrix')
library(plotrix)
We will focus on the variables Horsepower,
Fuel.tank.capacity and MPG.highway which seem to be
influential on the response logPrice (although there is a
question mark for MPG.highway). Let's start with the last of these
predictors. MPG.highway is always the second row in the Wald and
bootstrap CI objects which we have created (see above). What we want is
a matrix with 2 columns (lower and upper limits) and rows corresponding
to each method. For the Bayesian method there does not seem to be a way
to extract directly the 95% credible interval from the object
BBR.results
so we need to look it up in
summary(BBR.results)
. We then have:
bayes.MPG = c(-0.15734644, -0.005605435)
CI.MPG = rbind(Wald[2,],normal.NPB[2,],basic.NPB[2,],percent.NPB[2,],bc.NPB[2,],
normal.SRB[2,],basic.SRB[2,],percent.SRB[2,],bc.SRB[2,], bayes.MPG)
rownames(CI.MPG) = c('Wald.MPG', 'normal.NPB.MPG', 'basic.NPB.MPG', 'percent.NPB.MPG', 'bc.NPB.MPG',
'normal.SRB.MPG', 'basic.SRB.MPG', 'percent.SRB.MPG', 'bc.SRB.MPG','bayes.MPG')
CI.MPG
## 2.5 % 97.5 %
## Wald.MPG -0.1639773 0.0035734098
## normal.NPB.MPG -0.1670866 0.0083403832
## basic.NPB.MPG -0.1610190 0.0162732928
## percent.NPB.MPG -0.1766772 0.0006151550
## bc.NPB.MPG -0.1789805 -0.0009383059
## normal.SRB.MPG -0.1634390 0.0034777620
## basic.SRB.MPG -0.1642900 0.0035792410
## percent.SRB.MPG -0.1639831 0.0038861222
## bc.SRB.MPG -0.1621129 0.0056865806
## bayes.MPG -0.1573464 -0.0056054350
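As an aside, an equal-tailed 95% credible interval can also be computed directly from the posterior draws stored in BBR.results, as in the sketch below; note that this quantile-based interval will differ slightly from the HDI reported by summary(BBR.results).
quantile(BBR.results[, 'MPG.highway'], probs = c(0.025, 0.975))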
We will also need a vector of length equal to 10 (because we consider
10 methods) where the first 9 elements are equal to the frequentist
estimate of \(\beta_{MPG}\) and the
last element is its posterior mean (again we look this up in
summary(BBR.results)
).
bayes.mean.MPG = -0.07975419
point.MPG = c(rep(coef(linear.model)[2],9), bayes.mean.MPG)
point.MPG
## MPG.highway MPG.highway MPG.highway MPG.highway MPG.highway MPG.highway
## -0.08020194 -0.08020194 -0.08020194 -0.08020194 -0.08020194 -0.08020194
## MPG.highway MPG.highway MPG.highway
## -0.08020194 -0.08020194 -0.08020194 -0.07975419
We can now use plotCI
from package plotrix
(type ?plotCI
for information). We see a lot of
customisation below; the important argument is xaxt ='n'
which suppresses the default x-labels in plotCI
and then
allows us to use axis
in order to insert the names of the
methods. Important note: All the base-R graphical
parameters are controlled by par
: you can find all the
information about them by typing ?par
.
plotCI(x = 1:10, y = point.MPG,li = CI.MPG[,1], ui = CI.MPG[,2], lwd = 2, main = 'MPG.highway',
col = c(1,rep(2,4),rep(3,4),4), bty = 'l', ylab = '',xlab = 'Methods', xaxt ='n')
axis(1, at = 1:10, labels = c('Wald', 'normal', 'basic', 'percent', 'BC',
'normal', 'basic', 'percent', 'BC','Bayes'), cex.axis = 0.8)
legend('topright', legend = c('NPB','SRB'), col = c(2,3), pch = c(15,15), bty = 'n', cex = 0.8)
abline(h = 0, lty = 3)
We note that SRB produces shorter intervals than NPB. Also, the Bayesian
bootstrap CI is the shortest and does not include 0 (as is also the case for the
NPB-BC interval). The code and corresponding figures for
Horsepower and Fuel.tank.capacity are presented
below.
bayes.horsepower = c(0.14448541, 0.392370433)
CI.horsepower = rbind(Wald[3,],normal.NPB[3,],basic.NPB[3,],percent.NPB[3,],bc.NPB[3,],
normal.SRB[3,],basic.SRB[3,],percent.SRB[3,],bc.SRB[3,],bayes.horsepower)
bayes.mean.horsepower= 0.26680123
point.horsepower = c(rep(coef(linear.model)[3],9), bayes.mean.horsepower)
plotCI(x = 1:10, y = point.horsepower,li = CI.horsepower[,1], ui = CI.horsepower[,2], lwd = 2,
main = 'Horsepower', col = c(1,rep(2,4),rep(3,4),4), bty = 'l', ylab = '',
xlab = 'Methods', xaxt ='n')
axis(1, at = 1:10, labels = c('Wald', 'normal', 'basic', 'percent', 'BC',
'normal', 'basic', 'percent', 'BC','Bayes'), cex.axis = 0.8)
legend('bottomright', legend = c('NPB','SRB'), col = c(2,3), pch = c(15,15), bty = 'n', cex = 0.8)
bayes.fuel = c(0.03098518, 0.235375479)
CI.fuel = rbind(Wald[6,],normal.NPB[6,],basic.NPB[6,],percent.NPB[6,],bc.NPB[6,],
normal.SRB[6,],basic.SRB[6,],percent.SRB[6,],bc.SRB[6,],bayes.fuel)
bayes.mean.fuel= 0.13097722
point.fuel = c(rep(coef(linear.model)[6],9), bayes.mean.fuel)
plotCI(x = 1:10, y = point.fuel,li = CI.fuel[,1], ui = CI.fuel[,2], lwd = 2,
main = 'Fuel Tank Capacity', col = c(1,rep(2,4),rep(3,4),4), bty = 'l', ylab = '',
xlab = 'Methods', xaxt ='n')
axis(1, at = 1:10, labels = c('Wald', 'normal', 'basic', 'percent', 'BC',
'normal', 'basic', 'percent', 'BC','Bayes'),
cex.axis = 0.8)
legend('bottomright', legend = c('NPB','SRB'), col = c(2,3), pch = c(15,15),
bty = 'n', cex = 0.8)
abline(h = 0, lty = 3)
Assume now that, based on the previous findings, we are comfortable working with Horsepower, Fuel.tank.capacity and also MPG.highway as trustworthy predictors/explanatory variables of the response logPrice. (Note, of course, that different approaches could have been used for selecting predictors, for instance best subset or stepwise selection, lasso regression, and so forth.)
Say that we are interested in inferring the conditional densities of logPrice across different quantiles of Horsepower, for low and high fuel consumption (variable MPG.highway) and for the median Fuel.tank.capacity across the different cars in the dataset. This is a question that can be addressed by a flexible DPM regression model. We load the required packages for the analysis:
library(BNPmix)
library(gridExtra)
Note that here we will work with the unscaled data frame
cars93
as we want the grid cut-offs to be on a natural
scale. Below we define the grids for the response and the predictor
matrix X
. Note that we select the 10th and
90th percentiles of MPG.highway for
high and low fuel consumption,
respectively (extreme choices like the 1st and 99th percentiles are not recommended since
very few cars will fall in these categories). Also note that we could
have used something like
grid_x2 = quantile(X[,2], probs = c(0.01, 0.25, 0.50, 0.75, 0.99))
for Horsepower, but below we just set the cut-offs according to
our own preference.
grid_y = seq(min(logPrice), max(logPrice), length.out = 100)
X = cars93[, c('MPG.highway', 'Horsepower', 'Fuel.tank.capacity')]
grid_x1 = quantile(X[,1], probs = c(0.1, 0.9))
grid_x2 = c(50, 100, 150, 200, 300)
grid_x3 = quantile(X[,3], probs = c(0.50))
grid_x1
## 10% 90%
## 25 36
grid_x2
## [1] 50 100 150 200 300
grid_x3
## 50%
## 15.95
grid_X = expand.grid(grid_x1, grid_x2, grid_x3)
Next, we have to make some choices about our model. Do we expect to
find any clusters and, if yes, how many? We do not have any prior
expectation and the main question is not about clustering, so we just
set a large number of expected potential clusters with
Ek = 10
. Should we use a hierarchical model or not? We
decide to keep things simple and set hyper = FALSE
.
Finally, should we stick to the default semi-empirical Bayes prior
specifications of BNPmix? We could, but we decide to pursue a fully
Bayesian approach and set m0 = c(0,0,0,0)
,
a0 = 3
and b0 = 100
(see Practical 4 for a
reminder of the model).
DPprior = PYcalibrate(Ek = 10, n = n, discount = 0)
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE,
m0 = c(0,0,0,0), a0 = 3, b0 = 100)
output <- list(grid_x = grid_X, grid_y = grid_y, out_type = 'FULL',
out_param = TRUE)
mcmc.total.n = 11000
mcmc.burnin = 1000
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'ICS')
set.seed(1)
fit.reg = PYregression(y = logPrice, x = X, prior = prior, mcmc = mcmc,
output = output)
## Completed: 1100/11000 - in 0.155 sec
## Completed: 2200/11000 - in 0.439 sec
## Completed: 3300/11000 - in 0.697 sec
## Completed: 4400/11000 - in 1.019 sec
## Completed: 5500/11000 - in 1.262 sec
## Completed: 6600/11000 - in 1.505 sec
## Completed: 7700/11000 - in 1.809 sec
## Completed: 8800/11000 - in 2.083 sec
## Completed: 9900/11000 - in 2.335 sec
## Completed: 11000/11000 - in 2.579 sec
##
## Estimation done in 2.579 seconds
summary(fit.reg)
## PYdensity function call:
## 1000 burn-in iterations
## 11000 iterations
## Global estimation time: 2.58 seconds
## Average number of groups: 1.0042
## Min number of groups: 1 ; Max number of groups: 2
Based on the output from summary
we see that this
model essentially detects a single group; that is, its structure does not
reveal any clustering effect in the data.
Now, as seen in Practical 4 we need to use the information in
fit.reg$density
in order to produce the plots. This object
has 10 columns because we have 2 percentiles of MPG.highway, 5
values of Horsepower and only 1 percentile (the median) of
Fuel.tank.capacity.
dim(fit.reg$density)
## [1] 100 10 10000
So, we can use the following code with package ggplot2
to produce the plots of interest.
# For 10% MPG.highway
regplot1 <- data.frame(
dens = as.vector(apply(fit.reg$density[, 1:5,], c(1, 2), mean)),
qlow = as.vector(apply(fit.reg$density[, 1:5,], c(1, 2),
quantile, probs = 0.025)),
qupp = as.vector(apply(fit.reg$density[, 1:5,], c(1, 2),
quantile, probs = 0.975)),
grid = rep(grid_y, length(grid_x2)),
label = factor(rep(paste("Horsepower = ", grid_x2), each = length(grid_y)),
level = rep(paste("Horsepower = ", grid_x2)))
)
# For 90% MPG.highway
regplot2 <- data.frame(
dens = as.vector(apply(fit.reg$density[, 6:10,], c(1, 2), mean)),
qlow = as.vector(apply(fit.reg$density[, 6:10,], c(1, 2),
quantile, probs = 0.025)),
qupp = as.vector(apply(fit.reg$density[, 6:10,], c(1, 2),
quantile, probs = 0.975)),
grid = rep(grid_y, length(grid_x2)),
label = factor(rep(paste("Horsepower = ", grid_x2), each = length(grid_y)),
level = rep(paste("Horsepower = ", grid_x2)))
)
library(ggplot2)
plot1 = ggplot(regplot1) + theme_bw() +
geom_line(data = regplot1, map = aes(x = grid, y = dens)) +
geom_ribbon(data = regplot1, map = aes(x = grid, ymin = qlow,
ymax = qupp), fill = "green", alpha = 0.3) +
facet_wrap(~label, ncol = 5, nrow = 1) +
labs(x = "logPrice", y = "MPG.highway = 22")
plot2 = ggplot(regplot2) + theme_bw() +
geom_line(data = regplot2, map = aes(x = grid, y = dens)) +
geom_ribbon(data = regplot2, map = aes(x = grid, ymin = qlow,
ymax = qupp), fill = "red", alpha = 0.3) +
facet_wrap(~label, ncol = 5, nrow = 1) +
labs(x = "logPrice", y = "MPG.highway = 55")
grid.arrange(plot1, plot2, layout_matrix = matrix(1:2, 2, 1))
We generally observe the following: (i) within both categories (high/low
consumption) the mode of the distribution of logPrice shifts from
left to right as Horsepower increases (understandable), (ii)
for Horsepower = 50 there are no major differences in the
distributions of high and low fuel consumption cars; however, as
Horsepower increases, low-fuel-consumption cars
(MPG.highway = 36) are more expensive (also understandable) and
(iii) there is a lot of uncertainty in the distributions of low-fuel
consumption cars with high Horsepower; this is because there
are only a few cars (data points) with this combination of
attributes.
Here we will consider the unsupervised setting; i.e., we do not consider logPrice as a response variable. Previously, we saw that the DPM regression model for logPrice did not detect any clusters in the data. Does this mean there are no clusters? Not necessarily; it may simply be that the particular model and its specific structure do not relate to potential latent subgroups.
Recall that we have an interesting variable in the 1st column of
Cars93
, namely, the car brand.
Cars93[,1]
## [1] Acura Acura Audi Audi BMW
## [6] Buick Buick Buick Buick Cadillac
## [11] Cadillac Chevrolet Chevrolet Chevrolet Chevrolet
## [16] Chevrolet Chrylser Chrysler Chrysler Dodge
## [21] Dodge Dodge Dodge Dodge Eagle
## [26] Eagle Ford Ford Ford Ford
## [31] Ford Ford Ford Geo Geo
## [36] Honda Honda Honda Hyundai Hyundai
## [41] Hyundai Hyundai Infiniti Lexus Lexus
## [46] Lincoln Lincoln Mazda Mazda Mazda
## [51] Mercedes-Benz Mercedes-Benz Mercury Mercury Mitsubishi
## [56] Mitsubishi Nissan Nissan Nissan Oldsmobile
## [61] Oldsmobile Oldsmobile Plymouth Pontiac Pontiac
## [66] Pontiac Pontiac Pontiac Saab Saturn
## [71] Subaru Subaru Subaru Suzuki Toyota
## [76] Toyota Toyota Volkswagen Volkswagen Volkswagen
## [81] Volvo Volvo
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser Chrysler ... Volvo
Initially, we see 32 different car brands (levels) but we notice that
one Chrysler
is misspelled Chrylser
, so let’s
fix that. In the third line we use factor()
to reset the
levels of Cars93[,1]
to the proper number which is now
31.
which(Cars93[,1] == 'Chrylser')
## [1] 17
Cars93[17,1] = 'Chrysler'
Cars93[,1] = factor(Cars93[,1])
table(Cars93[,1])
##
## Acura Audi BMW Buick Cadillac
## 2 2 1 4 2
## Chevrolet Chrysler Dodge Eagle Ford
## 5 3 5 2 7
## Geo Honda Hyundai Infiniti Lexus
## 2 3 4 1 2
## Lincoln Mazda Mercedes-Benz Mercury Mitsubishi
## 2 3 2 2 2
## Nissan Oldsmobile Plymouth Pontiac Saab
## 3 3 1 5 1
## Saturn Subaru Suzuki Toyota Volkswagen
## 1 3 1 3 3
## Volvo
## 2
Here we will consider the previous 4 variables (logPrice,
MPG.highway, Horsepower, Fuel.tank.capacity)
forming a matrix Y
for a multivariate DPM model. Note that
the choice of the variables is rather arbitrary, mainly following the
previous analysis, so we could consider other variables. We proceed by
defining the multivariate grid.
Y = cars93[, c('logPrice', 'MPG.highway', 'Horsepower', 'Fuel.tank.capacity')]
grid.length = 20
grid1 = seq(min(Y[,1]), max(Y[,1]), length.out = grid.length)
grid2 = seq(min(Y[,2]), max(Y[,2]), length.out = grid.length)
grid3 = seq(min(Y[,3]), max(Y[,3]), length.out = grid.length)
grid4 = seq(min(Y[,4]), max(Y[,4]), length.out = grid.length)
grid = expand.grid(grid1, grid2, grid3, grid4)
dim(grid)
## [1] 160000 4
Following that, we need to make some modelling choices again.
Similarly to before we set Ek = 10
,
hyper = FALSE
and follow a fully Bayesian approach by
setting m0 = rep(0,4)
and
Sigma0 = diag(1000,4)
(see Application 2 for a
reminder).
DPprior = PYcalibrate(Ek = 10, n = n, discount = 0)
DPprior
## $strength
## [1] 2.774189
##
## $discount
## [1] 0
mcmc.total.n = 5000
mcmc.burnin = 1000
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'ICS')
prior = list(strength = DPprior$strength, discount = 0, m0 = rep(0,4),
Sigma0 = diag(1000,4), hyper = FALSE)
output = list(grid = grid, out_param = TRUE)
set.seed(1)
mult.fit = PYdensity(y = Y, mcmc = mcmc, prior = prior, output = output)
## Completed: 500/5000 - in 0.244 sec
## Completed: 1000/5000 - in 0.453 sec
## Completed: 1500/5000 - in 62.549 sec
## Completed: 2000/5000 - in 120.372 sec
## Completed: 2500/5000 - in 170.146 sec
## Completed: 3000/5000 - in 218.886 sec
## Completed: 3500/5000 - in 267.226 sec
## Completed: 4000/5000 - in 316.627 sec
## Completed: 4500/5000 - in 364.322 sec
## Completed: 5000/5000 - in 413.641 sec
##
## Estimation done in 413.641 seconds
summary(mult.fit)
## PYdensity function call:
## 1000 burn-in iterations
## 5000 iterations
## Global estimation time: 413.64 seconds
## Average number of groups: 2.0005
## Min number of groups: 2 ; Max number of groups: 3
cluster.labels = mult.fit$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
## 2 3
## 3998 2
We see that the model shows quite decisively the existence of two
clusters in the data (despite us setting Ek = 10
), which is
quite interesting. Let’s visualise the univariate and bivariate density
estimates.
p1 = plot(mult.fit, dim = c(1, 1), xlab = "logPrice", ylab = "Density",
col = 2)
p2 = plot(mult.fit, dim = c(2, 2), xlab = "MPG.highway", ylab = "Density",
col = 3)
p3 = plot(mult.fit, dim = c(3, 3), xlab = "Horsepower", ylab = "Density",
col = 4)
p4 = plot(mult.fit, dim = c(4, 4), xlab = "Fuel.tank.capacity", ylab = "Density",
col = 5)
p12 = plot(mult.fit, dim = c(1, 2), show_clust = TRUE, xlab = "logPrice",
ylab = "MPG.highway", col = 2)
p13 = plot(mult.fit, dim = c(1, 3), show_clust = TRUE, xlab = "logPrice",
ylab = "Horsepower", col = 2)
p14 = plot(mult.fit, dim = c(1, 4), show_clust = TRUE, xlab = "logPrice",
ylab = "Fuel.tank.capacity", col = 2)
p21 = plot(mult.fit, dim = c(2, 1), show_clust = TRUE, xlab = "MPG.highway",
ylab = "logPrice", col = 3)
p23 = plot(mult.fit, dim = c(2, 3), show_clust = TRUE, xlab = "MPG.highway",
ylab = "Horsepower", col = 3)
p24 = plot(mult.fit, dim = c(2, 4), show_clust = TRUE, xlab = "MPG.highway",
ylab = "Fuel.tank.capacity", col = 3)
p31 = plot(mult.fit, dim = c(3, 1), show_clust = TRUE, xlab = "Horsepower",
ylab = "logPrice", col = 4)
p32 = plot(mult.fit, dim = c(3, 2), show_clust = TRUE, xlab = "Horsepower",
ylab = "MPG.highway", col = 4)
p34 = plot(mult.fit, dim = c(3, 4), show_clust = TRUE, xlab = "Horsepower",
ylab = "Fuel.tank.capacity", col = 4)
p41 = plot(mult.fit, dim = c(4, 1), show_clust = TRUE, xlab = "Fuel.tank.capacity",
ylab = "logPrice", col = 5)
p42 = plot(mult.fit, dim = c(4, 2), show_clust = TRUE, xlab = "Fuel.tank.capacity",
ylab = "MPG.highway", col = 5)
p43 = plot(mult.fit, dim = c(4, 3), show_clust = TRUE, xlab = "Fuel.tank.capacity",
ylab = "Horsepower", col = 5)
library(gridExtra)
grid.arrange(p1, p12, p13, p14, p21, p2, p23, p24, p31, p32, p3, p34, p41, p42,
p43, p4, layout_matrix = matrix(1:16, 4, 4))
It would be interesting to inspect the cluster
allocations of the data points (corresponding to the colored
labels of the points in the bivariate plots) in relation to the car
brands. This is something we have not seen before! The cluster
allocations for every MCMC sample are contained in
mult.fit$clust
. What we want to do is to find the average
allocation per data point and then round it. Following that we can see
in which cluster each car belongs.
names(mult.fit)
## [1] "density" "data" "grideval" "grid_x" "grid_y"
## [6] "clust" "mean" "beta" "sigma2" "probs"
## [11] "niter" "nburn" "tot_time" "univariate" "regression"
## [16] "dep" "group_log" "group" "wvals"
MCMC.allocations = mult.fit$clust
dim(MCMC.allocations)
## [1] 4000 82
allocations = round(apply(MCMC.allocations, 2, mean))
allocations
## [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 1 1 2 2 1 2 1
## [39] 2 1 2 1 1 1 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 2 1
## [77] 1 2 1 1 1 1
table(allocations)
## allocations
## 1 2
## 58 24
clusters = c()
clusters[allocations == 1] = 'Cluster 1'
clusters[allocations == 2] = 'Cluster 2'
cbind(as.character(Cars93[,1]),clusters)
## clusters
## [1,] "Acura" "Cluster 1"
## [2,] "Acura" "Cluster 1"
## [3,] "Audi" "Cluster 1"
## [4,] "Audi" "Cluster 1"
## [5,] "BMW" "Cluster 1"
## [6,] "Buick" "Cluster 1"
## [7,] "Buick" "Cluster 1"
## [8,] "Buick" "Cluster 1"
## [9,] "Buick" "Cluster 1"
## [10,] "Cadillac" "Cluster 1"
## [11,] "Cadillac" "Cluster 1"
## [12,] "Chevrolet" "Cluster 2"
## [13,] "Chevrolet" "Cluster 2"
## [14,] "Chevrolet" "Cluster 1"
## [15,] "Chevrolet" "Cluster 1"
## [16,] "Chevrolet" "Cluster 1"
## [17,] "Chrysler" "Cluster 1"
## [18,] "Chrysler" "Cluster 1"
## [19,] "Chrysler" "Cluster 1"
## [20,] "Dodge" "Cluster 2"
## [21,] "Dodge" "Cluster 2"
## [22,] "Dodge" "Cluster 1"
## [23,] "Dodge" "Cluster 1"
## [24,] "Dodge" "Cluster 1"
## [25,] "Eagle" "Cluster 2"
## [26,] "Eagle" "Cluster 1"
## [27,] "Ford" "Cluster 2"
## [28,] "Ford" "Cluster 1"
## [29,] "Ford" "Cluster 2"
## [30,] "Ford" "Cluster 1"
## [31,] "Ford" "Cluster 1"
## [32,] "Ford" "Cluster 1"
## [33,] "Ford" "Cluster 1"
## [34,] "Geo" "Cluster 2"
## [35,] "Geo" "Cluster 2"
## [36,] "Honda" "Cluster 1"
## [37,] "Honda" "Cluster 2"
## [38,] "Honda" "Cluster 1"
## [39,] "Hyundai" "Cluster 2"
## [40,] "Hyundai" "Cluster 1"
## [41,] "Hyundai" "Cluster 2"
## [42,] "Hyundai" "Cluster 1"
## [43,] "Infiniti" "Cluster 1"
## [44,] "Lexus" "Cluster 1"
## [45,] "Lexus" "Cluster 1"
## [46,] "Lincoln" "Cluster 1"
## [47,] "Lincoln" "Cluster 1"
## [48,] "Mazda" "Cluster 2"
## [49,] "Mazda" "Cluster 2"
## [50,] "Mazda" "Cluster 1"
## [51,] "Mercedes-Benz" "Cluster 1"
## [52,] "Mercedes-Benz" "Cluster 1"
## [53,] "Mercury" "Cluster 2"
## [54,] "Mercury" "Cluster 1"
## [55,] "Mitsubishi" "Cluster 2"
## [56,] "Mitsubishi" "Cluster 1"
## [57,] "Nissan" "Cluster 2"
## [58,] "Nissan" "Cluster 1"
## [59,] "Nissan" "Cluster 1"
## [60,] "Oldsmobile" "Cluster 1"
## [61,] "Oldsmobile" "Cluster 1"
## [62,] "Oldsmobile" "Cluster 1"
## [63,] "Plymouth" "Cluster 1"
## [64,] "Pontiac" "Cluster 2"
## [65,] "Pontiac" "Cluster 2"
## [66,] "Pontiac" "Cluster 1"
## [67,] "Pontiac" "Cluster 1"
## [68,] "Pontiac" "Cluster 1"
## [69,] "Saab" "Cluster 1"
## [70,] "Saturn" "Cluster 2"
## [71,] "Subaru" "Cluster 2"
## [72,] "Subaru" "Cluster 1"
## [73,] "Subaru" "Cluster 1"
## [74,] "Suzuki" "Cluster 2"
## [75,] "Toyota" "Cluster 2"
## [76,] "Toyota" "Cluster 1"
## [77,] "Toyota" "Cluster 1"
## [78,] "Volkswagen" "Cluster 2"
## [79,] "Volkswagen" "Cluster 1"
## [80,] "Volkswagen" "Cluster 1"
## [81,] "Volvo" "Cluster 1"
## [82,] "Volvo" "Cluster 1"
Some brands fall entirely in one cluster, while others have cars in both; logically, this should depend on the specific car model, its specifications, and so on. However, clearly, luxury brands like Acura, BMW, Cadillac, Infiniti, Lexus and Mercedes-Benz, among others, all fall in the same cluster. In the plot below we have logPrice on the \(y\)-axis and MPG.highway on the \(x\)-axis, labelling the data points with the name of the brand and using different colours and point types for the two clusters. In the top-left corner Infiniti and Mercedes-Benz stand out, while near the bottom-right corner we find Honda and, especially, Geo as economical and very efficient options!
plot(MPG.highway, logPrice, col = allocations, pch = allocations, bty = 'l', cex = 1)
text(MPG.highway, logPrice,as.character(Cars93[,1]), cex = 0.6, pos = 4)
As a final remark, we note that the above clustering analysis could potentially be improved by adding further variables. For instance, one might discover an extra cluster in the data for the middle region (roughly 27-33 MPG.highway). That would especially be the case if the added variables provide new information, relevant to the clusters, that is not contained in the current variables.