1 Cars93 dataset

More cars

We return to the Cars93 dataset and follow the usual steps.

library('MASS')
data(Cars93)
class(Cars93)
## [1] "data.frame"
dim(Cars93)
## [1] 93 27
sum(is.na(Cars93))
## [1] 13
Cars93 = na.omit(Cars93)
n = nrow(Cars93)
n
## [1] 82
head(Cars93, n = 2)
##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
## 1        Acura Integra   Small      12.9  15.9      18.8       25          31
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18          25
##              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
## 1               None      Front         4        1.8        140 6300
## 2 Driver & Passenger      Front         6        3.2        200 5500
##   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
## 1         2890             Yes               13.2          5    177       102
## 2         2335             Yes               18.0          5    195       115
##   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
## 1    68          37           26.5           11   2705 non-USA Acura Integra
## 2    71          38           30.0           15   3560 non-USA  Acura Legend

In the following we are initially interested in a multiple regression application, with Price as the response variable. Obviously, we cannot consider Min.Price and Max.Price as predictors, since Price is just the average of those two (see the information in ?Cars93). We also want to consider only continuous predictors. So, we define keep as below and create the data frame cars93, which contains only the variables of interest.
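
As a quick sanity check, we can confirm that Price is, up to rounding, the midpoint of Min.Price and Max.Price (a minimal sketch):

# Price should be (up to rounding) the average of Min.Price and Max.Price,
# as described in ?Cars93.
head(cbind(Price = Cars93$Price,
           midpoint = (Cars93$Min.Price + Cars93$Max.Price) / 2))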

keep = c('Price', 'MPG.city', 'MPG.highway', 'EngineSize', 
  'Horsepower', 'RPM', 'Rev.per.mile', 'Fuel.tank.capacity', 
  'Length', 'Wheelbase', 'Width', 'Turn.circle', 'Rear.seat.room')

cars93 = Cars93[,keep]
head(cars93,n=2)
##   Price MPG.city MPG.highway EngineSize Horsepower  RPM Rev.per.mile
## 1  15.9       25          31        1.8        140 6300         2890
## 2  33.9       18          25        3.2        200 5500         2335
##   Fuel.tank.capacity Length Wheelbase Width Turn.circle Rear.seat.room
## 1               13.2    177       102    68          37           26.5
## 2               18.0    195       115    71          38           30.0

2 Multiple linear regression

2.1 Estimating the linear model

Transformations and re-scaling

We can use lm to fit the full model that contains all predictors and then the command summary for the results. Note that the “.” symbol below means that we include all predictors in cars93.

linear.model = lm(Price ~ . , data = cars93)
summary(linear.model)
## 
## Call:
## lm(formula = Price ~ ., data = cars93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4747  -2.7390  -0.0843   2.6081  23.5256 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        38.467131  30.727240   1.252  0.21484    
## MPG.city            0.077612   0.441156   0.176  0.86087    
## MPG.highway        -0.327216   0.413307  -0.792  0.43125    
## EngineSize          0.009996   2.607578   0.004  0.99695    
## Horsepower          0.155249   0.037247   4.168 8.77e-05 ***
## RPM                -0.003068   0.002254  -1.361  0.17787    
## Rev.per.mile        0.002801   0.002513   1.115  0.26891    
## Fuel.tank.capacity  0.340698   0.492717   0.691  0.49159    
## Length              0.011950   0.125946   0.095  0.92469    
## Wheelbase           0.787171   0.286581   2.747  0.00767 ** 
## Width              -1.345927   0.489693  -2.749  0.00763 ** 
## Turn.circle        -0.473591   0.376897  -1.257  0.21315    
## Rear.seat.room     -0.044031   0.332012  -0.133  0.89488    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.533 on 69 degrees of freedom
## Multiple R-squared:  0.7371, Adjusted R-squared:  0.6913 
## F-statistic: 16.12 on 12 and 69 DF,  p-value: 1.362e-15

Note that the model formula above is equivalent to (but much simpler to write than) the explicit version below:

linear.model = lm(Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + RPM +
                    Rev.per.mile + Fuel.tank.capacity + Length + Wheelbase + Width +
                    Turn.circle + Rear.seat.room, data = cars93)

Another equivalent alternative is to use attach (this makes the variables in cars93 directly accessible as vectors), in which case we do not need to supply data = cars93 inside lm:

attach(cars93)
linear.model = lm(Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + RPM +
                    Rev.per.mile + Fuel.tank.capacity + Length + Wheelbase + Width +
                    Turn.circle + Rear.seat.room)

In any case, regardless of the coding approach, we observe from summary that the estimate of \(\beta_0\) is quite large (38.467131) and also not statistically significant. This is due to the differing scales of the variables. We can bring it down by reducing the scale of the response, for instance by taking its logarithm. As we observe below, this also removes, to some degree, the skewness from the distribution of the response variable.

logPrice = log(Price)
par(mfrow = c(1,2))
hist(Price, probability = TRUE)
lines(density(Price))
hist(logPrice, probability = TRUE)
lines(density(logPrice))
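
Besides the visual comparison, we can also check the skewness numerically; below is a minimal sketch using a hand-rolled sample skewness (packages such as e1071 or moments also provide skewness functions).

skew = function(x) mean((x - mean(x))^3) / sd(x)^3  # simple sample skewness
skew(Price)     # clearly right-skewed on the original scale
skew(logPrice)  # closer to 0 after the log transform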

So let’s use logPrice, replacing Price in cars93, and see the result.

cars93[, 'Price'] = logPrice
names(cars93)[1] = 'logPrice'
head(cars93, n = 2)
##   logPrice MPG.city MPG.highway EngineSize Horsepower  RPM Rev.per.mile
## 1 2.766319       25          31        1.8        140 6300         2890
## 2 3.523415       18          25        3.2        200 5500         2335
##   Fuel.tank.capacity Length Wheelbase Width Turn.circle Rear.seat.room
## 1               13.2    177       102    68          37           26.5
## 2               18.0    195       115    71          38           30.0
linear.model = lm(logPrice ~ ., data = cars93)
summary(linear.model)
## 
## Call:
## lm(formula = logPrice ~ ., data = cars93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44103 -0.13519 -0.00941  0.14358  0.50699 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.398e+00  1.260e+00   1.903 0.061246 .  
## MPG.city           -5.545e-03  1.810e-02  -0.306 0.760198    
## MPG.highway        -1.041e-02  1.695e-02  -0.614 0.541103    
## EngineSize         -9.016e-02  1.070e-01  -0.843 0.402189    
## Horsepower          6.178e-03  1.528e-03   4.043 0.000135 ***
## RPM                -1.101e-04  9.244e-05  -1.191 0.237702    
## Rev.per.mile        4.638e-05  1.031e-04   0.450 0.654210    
## Fuel.tank.capacity  3.424e-02  2.021e-02   1.694 0.094755 .  
## Length             -2.201e-04  5.166e-03  -0.043 0.966140    
## Wheelbase           3.316e-02  1.176e-02   2.821 0.006254 ** 
## Width              -4.357e-02  2.009e-02  -2.169 0.033530 *  
## Turn.circle        -3.902e-03  1.546e-02  -0.252 0.801486    
## Rear.seat.room     -1.940e-03  1.362e-02  -0.142 0.887148    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.227 on 69 degrees of freedom
## Multiple R-squared:  0.8009, Adjusted R-squared:  0.7663 
## F-statistic: 23.13 on 12 and 69 DF,  p-value: < 2.2e-16

The intercept is now much lower, while working on the logarithmic scale also changes some \(p\)-values.

A last thing we might want to do in such regression analyses is to bring all the predictors onto the same scale. As we see below, the predictors are on very different scales, which means that the magnitudes of the regression coefficients are not directly comparable.

apply(cars93[, -1], 2, range)
##      MPG.city MPG.highway EngineSize Horsepower  RPM Rev.per.mile
## [1,]       16          22        1.0         55 3800         1320
## [2,]       46          50        5.7        300 6500         3755
##      Fuel.tank.capacity Length Wheelbase Width Turn.circle Rear.seat.room
## [1,]                9.2    141        90    60          32             19
## [2,]               23.0    219       117    78          45             36

We can put the predictors on the same scale by standardising them; i.e., transforming them to have mean equal to 0 and variance equal to 1. We can use the command scale for that (type ?scale for options). Let's keep cars93 as is (we are going to need it later) and create its scaled version, stored in scaled.cars93.
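
For intuition, with its default arguments scale() simply centres each column by its mean and divides it by its standard deviation; a minimal sketch of the equivalent manual calculation is:

Xmat = as.matrix(cars93[, -1])
manual.scaled = sweep(sweep(Xmat, 2, colMeans(Xmat), '-'), 2,
                      apply(Xmat, 2, sd), '/')
# range(manual.scaled - scale(Xmat)) should be essentially zero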

scaled.cars93 = cars93
scaled.cars93[, -1] = scale(cars93[, -1])
linear.model = lm(logPrice ~ ., data = scaled.cars93)
summary(linear.model)
## 
## Call:
## lm(formula = logPrice ~ ., data = scaled.cars93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44103 -0.13519 -0.00941  0.14358  0.50699 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.839752   0.025064 113.299  < 2e-16 ***
## MPG.city           -0.031018   0.101222  -0.306 0.760198    
## MPG.highway        -0.052179   0.084955  -0.614 0.541103    
## EngineSize         -0.090525   0.107396  -0.843 0.402189    
## Horsepower          0.315399   0.078004   4.043 0.000135 ***
## RPM                -0.064271   0.053960  -1.191 0.237702    
## Rev.per.mile        0.023062   0.051264   0.450 0.654210    
## Fuel.tank.capacity  0.103069   0.060840   1.694 0.094755 .  
## Length             -0.003362   0.078909  -0.043 0.966140    
## Wheelbase           0.214423   0.076021   2.821 0.006254 ** 
## Width              -0.160938   0.074198  -2.169 0.033530 *  
## Turn.circle        -0.012349   0.048928  -0.252 0.801486    
## Rear.seat.room     -0.005512   0.038695  -0.142 0.887148    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.227 on 69 degrees of freedom
## Multiple R-squared:  0.8009, Adjusted R-squared:  0.7663 
## F-statistic: 23.13 on 12 and 69 DF,  p-value: < 2.2e-16

The above regression coefficients are now comparable, and we again see changes in significance levels. One might ask: is re-scaling always necessary? Not necessarily; if we prefer interpretability (coefficient estimates corresponding to the original scales of the predictors) we may work on the given scales. Here we will work with the scaled predictors.
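
The link to the unscaled fit is straightforward: each standardised coefficient equals the corresponding unstandardised coefficient multiplied by the standard deviation of that predictor (the response is unchanged). A quick check for Horsepower (a sketch):

unscaled.fit = lm(logPrice ~ ., data = cars93)
coef(unscaled.fit)['Horsepower'] * sd(cars93$Horsepower)  # slope times sd of the predictor
coef(linear.model)['Horsepower']                          # should agree up to rounding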

So, based on the above estimated model, the coefficients that are significant at the 5% level (apart from the intercept) are the ones corresponding to Horsepower, Wheelbase and Width. The question is, can we trust those estimates? The answer is: not before checking for potential multicollinearity. Multicollinearity refers to the situation where the predictors in a regression model are (nearly) linearly dependent, i.e., highly correlated (see the Wikipedia page for more information). Under multicollinearity the estimates of the regression coefficients are subject to high variance and are, therefore, unreliable.

We see below that we have indeed some very high correlations! So, ideally we should do something to address this issue.

cov93 = cov(scaled.cars93[, -1])
cor93 = cov2cor(cov93)
library(pheatmap)
pheatmap(cor93, display_numbers = TRUE, cluster_rows = FALSE,
         cluster_cols = FALSE, fontsize_number = 10,
         main = 'Correlations between predictors')

Dealing with multicollinearity

There are several approaches to deal with the problem of multicollinearity; for instance, one can use ridge regression instead of standard regression. Here we will use a traditional approach based on Variance Inflation Factors (VIFs). Briefly, under this approach each predictor is regressed against all other predictors and the VIF is calculated from the resulting \(R^2\). The procedure is simple and we will not go into details (see the Wikipedia page): for predictor \(X_j\) we calculate \[ VIF_j = \frac{1}{1-R^2_j},\] where \(R^2_j\) is the \(R^2\) from regressing \(X_j\) on all the other predictors. Common rule-of-thumb thresholds that suggest high multicollinearity are VIF > 5 or VIF > 10, depending on how strict we want to be.
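
To make the definition concrete, here is a minimal sketch of the calculation for a single predictor (EngineSize): regress it on all the other predictors and apply the formula above. The result should match the value reported by vif below.

# VIF for EngineSize by hand: R^2 from regressing it on the other predictors
# (excluding the response), then 1 / (1 - R^2).
r2 = summary(lm(EngineSize ~ . - logPrice, data = scaled.cars93))$r.squared
1 / (1 - r2)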

In R we can use the vif command from package car. Let's see what this gives for the full model considered above.

library(car)
## Loading required package: carData
vif(linear.model)
##           MPG.city        MPG.highway         EngineSize         Horsepower 
##          16.110561          11.348431          18.135714           9.567437 
##                RPM       Rev.per.mile Fuel.tank.capacity             Length 
##           4.578366           4.132175           5.820200           9.790667 
##          Wheelbase              Width        Turn.circle     Rear.seat.room 
##           9.087039           8.656598           3.764142           2.354300

Very high VIFs indeed! The highest VIF is that of the variable EngineSize. So, let's remove this predictor, re-fit the model and recalculate the VIFs.

linear.model = lm(logPrice ~ . - EngineSize, data = scaled.cars93)
vif(linear.model)
##           MPG.city        MPG.highway         Horsepower                RPM 
##          15.818730          11.312104           5.141064           2.435494 
##       Rev.per.mile Fuel.tank.capacity             Length          Wheelbase 
##           3.712738           5.609562           9.762725           9.007541 
##              Width        Turn.circle     Rear.seat.room 
##           8.328024           3.745159           2.187627

The VIFs have decreased, yet are still high. We repeat this process until all VIFs are below 5 (we choose a strict cut-off threshold).

linear.model = lm(logPrice ~ . - EngineSize - MPG.city, data = scaled.cars93)
vif(linear.model)
##        MPG.highway         Horsepower                RPM       Rev.per.mile 
##           2.777327           5.114868           2.431202           3.424145 
## Fuel.tank.capacity             Length          Wheelbase              Width 
##           5.342574           9.517154           9.006463           8.316865 
##        Turn.circle     Rear.seat.room 
##           3.738376           2.160643
linear.model = lm(logPrice ~ . - EngineSize - MPG.city - Length, data = scaled.cars93)
vif(linear.model)
##        MPG.highway         Horsepower                RPM       Rev.per.mile 
##           2.701898           4.893156           2.403916           3.422715 
## Fuel.tank.capacity          Wheelbase              Width        Turn.circle 
##           5.228544           7.021996           7.327134           3.464725 
##     Rear.seat.room 
##           2.140781
linear.model = lm(logPrice ~ . - EngineSize - MPG.city - Length - 
                    Width, data = scaled.cars93)
vif(linear.model)
##        MPG.highway         Horsepower                RPM       Rev.per.mile 
##           2.622137           4.668149           2.226758           3.232764 
## Fuel.tank.capacity          Wheelbase        Turn.circle     Rear.seat.room 
##           5.089194           5.739041           3.008665           1.968458
linear.model = lm(logPrice ~ . - EngineSize - MPG.city - Length - 
                    Width - Wheelbase, data = scaled.cars93)
vif(linear.model)
##        MPG.highway         Horsepower                RPM       Rev.per.mile 
##           2.527488           4.477091           2.159676           3.023761 
## Fuel.tank.capacity        Turn.circle     Rear.seat.room 
##           4.285448           2.922203           1.407649

We finally arrive at a model where all VIFs are below 5. As we can see below the statistically significant coefficients (at a 5% level) are those of Horsepower and Fuel.tank.capacity. Predictors MPG.highway and Rear.seat.room are significant only at a 10% level.

summary(linear.model)
## 
## Call:
## lm(formula = logPrice ~ . - EngineSize - MPG.city - Length - 
##     Width - Wheelbase, data = scaled.cars93)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46567 -0.15178 -0.01096  0.14059  0.71803 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.83975    0.02628 108.039  < 2e-16 ***
## MPG.highway        -0.08020    0.04204  -1.908   0.0603 .  
## Horsepower          0.26305    0.05596   4.701 1.17e-05 ***
## RPM                -0.02127    0.03886  -0.547   0.5858    
## Rev.per.mile        0.03364    0.04599   0.731   0.4668    
## Fuel.tank.capacity  0.12478    0.05475   2.279   0.0255 *  
## Turn.circle        -0.03595    0.04521  -0.795   0.4291    
## Rear.seat.room      0.05274    0.03138   1.681   0.0970 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.238 on 74 degrees of freedom
## Multiple R-squared:  0.7652, Adjusted R-squared:  0.743 
## F-statistic: 34.45 on 7 and 74 DF,  p-value: < 2.2e-16

The corresponding approximate large-sample Wald confidence intervals are as follows.

Wald = confint(linear.model)
Wald 
##                           2.5 %     97.5 %
## (Intercept)         2.787379055 2.89212501
## MPG.highway        -0.163977289 0.00357341
## Horsepower          0.151555277 0.37455261
## RPM                -0.098714305 0.05616593
## Rev.per.mile       -0.057993982 0.12526927
## Fuel.tank.capacity  0.015692929 0.23386533
## Turn.circle        -0.126025170 0.05413421
## Rear.seat.room     -0.009783874 0.11525610
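
For reference, confint() applied to an lm object computes estimate \(\pm\) \(t\)-quantile \(\times\) standard error; below is a sketch of the same calculation by hand.

# Manual reconstruction of the intervals above: estimate +/- t_{0.975, df} * SE.
est = coef(summary(linear.model))[, 'Estimate']
se  = coef(summary(linear.model))[, 'Std. Error']
cbind(est - qt(0.975, linear.model$df.residual) * se,
      est + qt(0.975, linear.model$df.residual) * se)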

2.2 Bootstrap regression confidence intervals

Let’s remember now how we can obtain CIs using the function Boot from package car that allows for both bootstrap regression approaches shown in lectures.

Nonparametric bootstrap (NPB)

For NPB we set method = 'case' in Boot (actually the default) and use 10000 bootstrap replications, specified by the argument R.

set.seed(1)
NPB.results = Boot(linear.model, method = 'case', R = 10000)
## Loading required namespace: boot
summary(NPB.results)
## 
## Number of bootstrap replications R = 10000 
##                     original    bootBias   bootSE   bootMed
## (Intercept)         2.839752  0.00099434 0.026800  2.840731
## MPG.highway        -0.080202 -0.00082885 0.044753 -0.078731
## Horsepower          0.263054  0.00512053 0.071841  0.264146
## RPM                -0.021274 -0.00439467 0.046032 -0.024978
## Rev.per.mile        0.033638  0.00069709 0.042386  0.032964
## Fuel.tank.capacity  0.124779  0.00628697 0.058936  0.129299
## Turn.circle        -0.035945 -0.00426736 0.057202 -0.039391
## Rear.seat.room      0.052736 -0.00676223 0.037725  0.048324

As we can see, the estimated bias of the estimates is very low. For confidence intervals the function Boot provides the following options: (i) normal, (ii) basic, (iii) percentile and (iv) bias-corrected (BC); see Practical 1 for a reminder.

normal.NPB = confint(NPB.results, level = 0.95, type = "norm")
basic.NPB = confint(NPB.results, level = 0.95, type = "basic")
percent.NPB = confint(NPB.results, level = 0.95, type = "perc")
bc.NPB = confint(NPB.results, level = 0.95, type = "bca")
normal.NPB
## Bootstrap normal confidence intervals
## 
##                           2.5 %      97.5 %
## (Intercept)         2.786230295 2.891285093
## MPG.highway        -0.167086563 0.008340383
## Horsepower          0.117127886 0.398738935
## RPM                -0.107100496 0.073341472
## Rev.per.mile       -0.050133850 0.116014950
## Fuel.tank.capacity  0.002980227 0.234004091
## Turn.circle        -0.143791482 0.080435245
## Rear.seat.room     -0.014440799 0.133437490
basic.NPB 
## Bootstrap basic confidence intervals
## 
##                            2.5 %     97.5 %
## (Intercept)         2.7861754223 2.89077450
## MPG.highway        -0.1610190342 0.01627329
## Horsepower          0.1024429956 0.39046871
## RPM                -0.1055911626 0.07447638
## Rev.per.mile       -0.0528113389 0.11305539
## Fuel.tank.capacity -0.0002177163 0.23092675
## Turn.circle        -0.1382376262 0.08425570
## Rear.seat.room     -0.0087907564 0.14042088
percent.NPB 
## Bootstrap percent confidence intervals
## 
##                          2.5 %      97.5 %
## (Intercept)         2.78872957 2.893328644
## MPG.highway        -0.17667717 0.000615155
## Horsepower          0.13563918 0.423664893
## RPM                -0.11702475 0.063042793
## Rev.per.mile       -0.04578010 0.120086628
## Fuel.tank.capacity  0.01863151 0.249775973
## Turn.circle        -0.15614665 0.066346667
## Rear.seat.room     -0.03494866 0.114262978
bc.NPB 
## Bootstrap bca confidence intervals
## 
##                           2.5 %        97.5 %
## (Intercept)         2.788024500  2.8923466961
## MPG.highway        -0.178980509 -0.0009383059
## Horsepower          0.134059564  0.4213289277
## RPM                -0.111189082  0.0681630419
## Rev.per.mile       -0.046210250  0.1195165088
## Fuel.tank.capacity  0.002479506  0.2346471559
## Turn.circle        -0.161732317  0.0628412266
## Rear.seat.room     -0.017640247  0.1280065885

As we can see, overall, the NPB CIs are similar to the Wald intervals. One observation here is that the BC method (which is generally considered more reliable) gives a (marginally) statistically significant result for the coefficient of MPG.highway.

Package car is compatible with the base-R hist command, which we can use to visualise the bootstrap distributions. Note that by default the BC bootstrap CIs are shown (represented by the horizontal black lines below).

hist(NPB.results, legend ="top", col = 'red', main = 'Nonparametric Bootstrap')

Semiparametric regression bootstrap (SRB)

To implement SRB we simply use method = 'residual' in the command Boot. The remaining code is essentially the same. Here we just calculate the corresponding intervals without printing them (visual comparisons are presented in Section 2.3).

set.seed(1)
SRB.results = Boot(linear.model, method = 'residual', R=10000)
summary(SRB.results)
## 
## Number of bootstrap replications R = 10000 
##                     original    bootBias   bootSE   bootMed
## (Intercept)         2.839752  1.1535e-05 0.026659  2.839548
## MPG.highway        -0.080202 -2.2133e-04 0.042582 -0.080844
## Horsepower          0.263054  4.5880e-04 0.056662  0.263029
## RPM                -0.021274 -3.9374e-04 0.039587 -0.021726
## Rev.per.mile        0.033638  2.5390e-04 0.046956  0.034025
## Fuel.tank.capacity  0.124779 -2.8753e-04 0.055116  0.124016
## Turn.circle        -0.035945 -3.3380e-04 0.045648 -0.035837
## Rear.seat.room      0.052736  5.4665e-04 0.031961  0.053097
normal.SRB = confint(SRB.results, level= 0.95, type="norm")
basic.SRB = confint(SRB.results, level= 0.95, type="basic")
percent.SRB = confint(SRB.results, level= 0.95, type="perc")
bc.SRB = confint(SRB.results, level= 0.95, type="bca")
hist(SRB.results, legend ="top", col = 'lightgreen', 
     main = 'Semiparametric Residual Bootstrap')

Bayesian bootstrap regression

We can further consider the Bayesian approach to the bootstrap. We first load package bayesboot and then define a function that returns the coefficients of the linear model and will be used as input to the command bayesboot. We check that it returns the correct coefficients for scaled.cars93.

library(bayesboot)
betas = function(data) {
  coef( lm(logPrice ~ . - EngineSize - MPG.city - Length - 
             Width - Wheelbase, data = data) )
}
betas(scaled.cars93)
##        (Intercept)        MPG.highway         Horsepower                RPM 
##         2.83975203        -0.08020194         0.26305394        -0.02127419 
##       Rev.per.mile Fuel.tank.capacity        Turn.circle     Rear.seat.room 
##         0.03363764         0.12477913        -0.03594548         0.05273611

We then run bayesboot with use.weights = FALSE, which is the default option (actually use.weights = TRUE is to be avoided in the regression setting as it is not straightforward to define a weighted statistic; see Practical 2 for a reminder).

The command summary gives all the relevant information. Note: (i) the variables are presented in alphabetical order and (ii) the 95% credible intervals are given in the first table in the columns hdi.low and hdi.high. As we see, the Bayesian approach supports the variables Horsepower and Fuel.tank.capacity, but also MPG.highway (like the NPB-BC method).

set.seed(1)
BBR.results = bayesboot(data = scaled.cars93, statistic = betas, R = 10000, use.weights = FALSE)
summary(BBR.results)
## Bayesian bootstrap
## 
## Number of posterior draws: 10000 
## 
## Summary of the posterior (with 95% Highest Density Intervals):
##           statistic        mean         sd     hdi.low     hdi.high
##         (Intercept)  2.84031547 0.02423927  2.79470144  2.889922904
##  Fuel.tank.capacity  0.13097722 0.05247347  0.03098518  0.235375479
##          Horsepower  0.26680123 0.06363847  0.14448541  0.392370433
##         MPG.highway -0.07975419 0.03860207 -0.15734644 -0.005605435
##      Rear.seat.room  0.04716027 0.03243652 -0.01548012  0.109775027
##        Rev.per.mile  0.03389027 0.03834551 -0.04109425  0.110426610
##                 RPM -0.02401111 0.04227813 -0.10427614  0.060624927
##         Turn.circle -0.03874963 0.05287822 -0.14476889  0.059022524
## 
## Quantiles:
##           statistic       q2.5%         q25%      median         q75%
##         (Intercept)  2.79469958  2.823310666  2.83989590  2.856379857
##  Fuel.tank.capacity  0.02812839  0.094937163  0.13077755  0.166534177
##          Horsepower  0.14944600  0.222538960  0.26450391  0.307654357
##         MPG.highway -0.16202987 -0.104466793 -0.07740673 -0.053252498
##      Rear.seat.room -0.01695058  0.024960232  0.04783397  0.069900896
##        Rev.per.mile -0.04007993  0.008415042  0.03324996  0.059218810
##                 RPM -0.11004861 -0.052238577 -0.02271270  0.005157555
##         Turn.circle -0.15052118 -0.071510345 -0.03549737 -0.001635448
##        q97.5%
##   2.889918136
##   0.233565821
##   0.400029048
##  -0.008824718
##   0.108712737
##   0.111647139
##   0.055668595
##   0.055752603
## 
## Call:
##  bayesboot(data = scaled.cars93, statistic = betas, R = 10000,      use.weights = FALSE)

2.3 Visualising multiple CIs

When there are many methods under consideration it is more convenient to compare the different CIs graphically. The code below is one possible approach to do this for our application (note that more efficient, clever and nicer solutions may certainly exist!). We will use package plotrix for this:

install.packages('plotrix')
library(plotrix)

We will focus on the variables Horsepower, Fuel.tank.capacity and MPG.highway, which seem to be influential for the response logPrice (although there is a question mark over MPG.highway). Let's start with the latter. MPG.highway always corresponds to the second row in the Wald and bootstrap CI objects created above. What we want is a matrix with 2 columns (lower and upper limits) and one row per method. For the Bayesian method there does not seem to be a way to extract the 95% credible interval directly from the object BBR.results, so we look it up in summary(BBR.results). We therefore have:

bayes.MPG = c(-0.15734644, -0.005605435)
CI.MPG = rbind(Wald[2,],normal.NPB[2,],basic.NPB[2,],percent.NPB[2,],bc.NPB[2,],
               normal.SRB[2,],basic.SRB[2,],percent.SRB[2,],bc.SRB[2,], bayes.MPG)
rownames(CI.MPG) = c('Wald.MPG', 'normal.NPB.MPG', 'basic.NPB.MPG', 'percent.NPB.MPG', 'bc.NPB.MPG',
                     'normal.SRB.MPG', 'basic.SRB.MPG', 'percent.SRB.MPG', 'bc.SRB.MPG','bayes.MPG')
CI.MPG
##                      2.5 %        97.5 %
## Wald.MPG        -0.1639773  0.0035734098
## normal.NPB.MPG  -0.1670866  0.0083403832
## basic.NPB.MPG   -0.1610190  0.0162732928
## percent.NPB.MPG -0.1766772  0.0006151550
## bc.NPB.MPG      -0.1789805 -0.0009383059
## normal.SRB.MPG  -0.1634390  0.0034777620
## basic.SRB.MPG   -0.1642900  0.0035792410
## percent.SRB.MPG -0.1639831  0.0038861222
## bc.SRB.MPG      -0.1621129  0.0056865806
## bayes.MPG       -0.1573464 -0.0056054350
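
As a side note, BBR.results essentially stores the posterior draws (one column per coefficient), so an equal-tailed interval can also be computed directly from the draws. This is not identical to the HDI used above; a sketch, assuming the column is named MPG.highway:

# Equal-tailed 95% interval from the posterior draws of the MPG.highway
# coefficient; close to, but not the same as, the HDI reported by summary().
quantile(BBR.results[, 'MPG.highway'], probs = c(0.025, 0.975))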

We will also need a vector of length 10 (because we consider 10 methods), where the first 9 elements are equal to the frequentist estimate of \(\beta_{MPG}\) and the last element is its posterior mean (again looked up in summary(BBR.results)).

bayes.mean.MPG = -0.07975419
point.MPG = c(rep(coef(linear.model)[2],9), bayes.mean.MPG)
point.MPG
## MPG.highway MPG.highway MPG.highway MPG.highway MPG.highway MPG.highway 
## -0.08020194 -0.08020194 -0.08020194 -0.08020194 -0.08020194 -0.08020194 
## MPG.highway MPG.highway MPG.highway             
## -0.08020194 -0.08020194 -0.08020194 -0.07975419

We can now use plotCI from package plotrix (type ?plotCI for information). We see a lot of customisation below; the important argument is xaxt = 'n', which suppresses the default x-axis labels in plotCI and allows us to use axis in order to insert the names of the methods. Important note: all the base-R graphical parameters are controlled by par; you can find all the information about them by typing ?par.

plotCI(x = 1:10, y = point.MPG,li = CI.MPG[,1], ui = CI.MPG[,2], lwd = 2, main = 'MPG.highway',
       col = c(1,rep(2,4),rep(3,4),4), bty = 'l', ylab = '',xlab = 'Methods', xaxt ='n')
axis(1, at = 1:10, labels = c('Wald', 'normal', 'basic', 'percent', 'BC', 
                              'normal', 'basic', 'percent', 'BC','Bayes'), cex.axis = 0.8)
legend('topright', legend = c('NPB','SRB'), col = c(2,3), pch = c(15,15), bty = 'n', cex = 0.8)
abline(h = 0, lty = 3)

We note that SRB produces shorter intervals than NPB. Also, the Bayesian bootstrap CI is the shortest and does not include 0 (neither does the NPB-BC interval). The code and corresponding figures for Horsepower and Fuel.tank.capacity are presented below.

bayes.horsepower = c(0.14448541, 0.392370433)
CI.horsepower = rbind(Wald[3,],normal.NPB[3,],basic.NPB[3,],percent.NPB[3,],bc.NPB[3,],
               normal.SRB[3,],basic.SRB[3,],percent.SRB[3,],bc.SRB[3,],bayes.horsepower)
bayes.mean.horsepower= 0.26680123 
point.horsepower = c(rep(coef(linear.model)[3],9), bayes.mean.horsepower)
plotCI(x = 1:10, y = point.horsepower,li = CI.horsepower[,1], ui = CI.horsepower[,2], lwd = 2,
       main = 'Horsepower', col = c(1,rep(2,4),rep(3,4),4), bty = 'l', ylab = '',
       xlab = 'Methods', xaxt ='n')
axis(1, at = 1:10, labels = c('Wald', 'normal', 'basic', 'percent', 'BC',
                              'normal', 'basic', 'percent', 'BC','Bayes'), cex.axis = 0.8)
legend('bottomright', legend = c('NPB','SRB'), col = c(2,3), pch = c(15,15), bty = 'n', cex = 0.8)

bayes.fuel = c(0.03098518,  0.235375479)
CI.fuel = rbind(Wald[6,],normal.NPB[6,],basic.NPB[6,],percent.NPB[6,],bc.NPB[6,],
                      normal.SRB[6,],basic.SRB[6,],percent.SRB[6,],bc.SRB[6,],bayes.fuel)
bayes.mean.fuel= 0.13097722
point.fuel = c(rep(coef(linear.model)[6],9), bayes.mean.fuel)
plotCI(x = 1:10, y = point.fuel,li = CI.fuel[,1], ui = CI.fuel[,2], lwd = 2, 
       main = 'Fuel Tank Capacity', col = c(1,rep(2,4),rep(3,4),4), bty = 'l', ylab = '',
       xlab = 'Methods', xaxt ='n')
axis(1, at = 1:10, labels = c('Wald', 'normal', 'basic', 'percent', 'BC',
                              'normal', 'basic', 'percent', 'BC','Bayes'),
     cex.axis = 0.8)
legend('bottomright', legend = c('NPB','SRB'), col = c(2,3), pch = c(15,15),
       bty = 'n', cex = 0.8)
abline(h = 0, lty = 3)

2.4 A DPM regression model

Assume now that, based on the previous findings, we are comfortable working with Horsepower, Fuel.tank.capacity and also MPG.highway as trustworthy predictors/explanatory variables of the response logPrice (note, of course, that different approaches could have been used for selecting predictors, for instance best-subset or stepwise selection, lasso regression and so forth; one such alternative is sketched below).
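
We do not pursue such alternatives here, but as an illustration, a backward stepwise search by AIC could be run on the full standardised model with base R's step(); a minimal sketch:

# Sketch of an alternative selection strategy (not used in what follows):
# backward stepwise selection by AIC on the full standardised model.
full.fit = lm(logPrice ~ ., data = scaled.cars93)
step(full.fit, direction = 'backward', trace = 0)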

Say that we are interested in inferring the conditional densities of logPrice across different values of Horsepower, for low and high fuel consumption (variable MPG.highway), and at the median Fuel.tank.capacity across the cars in the dataset. This is a question that can be addressed by a flexible DPM regression model. We load the required packages for the analysis:

library(BNPmix)
library(gridExtra)

Note that here we will work with the unscaled data frame cars93, as we want the grid cut-offs to be on a natural scale. Below we define the grids for the response and the predictor matrix X. Note that we select the 10th and 90th percentiles of MPG.highway for high and low fuel consumption, respectively (extreme choices like 1% and 99% are not recommended since very few cars fall in these categories). Also note that we could have used something like grid_x2 = quantile(X[,2], probs = c(0.01, 0.25, 0.50, 0.75, 0.99)) for Horsepower, but below we just set the cut-offs according to our own preference.

grid_y = seq(min(logPrice), max(logPrice), length.out = 100)
X = cars93[, c('MPG.highway', 'Horsepower', 'Fuel.tank.capacity')]
grid_x1 = quantile(X[,1], probs = c(0.1, 0.9))
grid_x2 = c(50, 100, 150, 200, 300)
grid_x3 = quantile(X[,3], probs = c(0.50))
grid_x1
## 10% 90% 
##  25  36
grid_x2
## [1]  50 100 150 200 300
grid_x3
##   50% 
## 15.95
grid_X = expand.grid(grid_x1, grid_x2, grid_x3)

Next, we have to make some choices about our model. Do we expect to find any clusters and, if yes, how many? We do not have any prior expectation and the main question is not about clustering, so we just set a large number of expected potential clusters with Ek = 10. Should we use a hierarchical model or not? We decide to keep things simple and set hyper = FALSE. Finally, should we stick to the default semi-empirical Bayes prior specifications of BNPmix? We could, but we decide to pursue a fully Bayesian approach and set m0 = c(0,0,0,0), a0 = 3 and b0 = 100 (see Practical 4 for a reminder of the model).

DPprior = PYcalibrate(Ek = 10, n = n, discount = 0)
prior = list(strength = DPprior$strength, discount = 0, hyper = FALSE,
             m0 = c(0,0,0,0), a0 = 3, b0 = 100)
output <- list(grid_x = grid_X, grid_y = grid_y, out_type = 'FULL',
               out_param = TRUE)
mcmc.total.n = 11000
mcmc.burnin = 1000
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'ICS')
set.seed(1)
fit.reg = PYregression(y = logPrice, x = X, prior = prior, mcmc = mcmc,
                       output = output)
## Completed:   1100/11000 - in 0.155 sec
## Completed:   2200/11000 - in 0.439 sec
## Completed:   3300/11000 - in 0.697 sec
## Completed:   4400/11000 - in 1.019 sec
## Completed:   5500/11000 - in 1.262 sec
## Completed:   6600/11000 - in 1.505 sec
## Completed:   7700/11000 - in 1.809 sec
## Completed:   8800/11000 - in 2.083 sec
## Completed:   9900/11000 - in 2.335 sec
## Completed:   11000/11000 - in 2.579 sec
## 
## Estimation done in 2.579 seconds
summary(fit.reg)
## PYdensity function call:
##  1000    burn-in iterations
##  11000   iterations 
##  Global estimation time: 2.58 seconds
##  Average number of groups:  1.0042 
##  Min number of groups:  1 ; Max number of groups:  2

Based on the output from summary we see that this model does not detect any clustering structure (the average number of groups is essentially 1).
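
If the cluster allocations have been stored (we requested out_type = 'FULL'), we can also tabulate the number of occupied clusters across the MCMC draws, similarly to what we do for the multivariate model later on; a sketch, assuming the allocations are available in fit.reg$clust:

# Number of occupied clusters per saved MCMC iteration (assuming one row of
# fit.reg$clust per iteration and one column per observation).
n.groups = apply(fit.reg$clust, 1, function(z) length(unique(z)))
table(n.groups)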

Now, as seen in Practical 4, we need to use the information in fit.reg$density in order to produce the plots. This object has 10 columns because we have 2 percentiles of MPG.highway, 5 values of Horsepower and only 1 value (the median) of Fuel.tank.capacity.

dim(fit.reg$density)
## [1]   100    10 10000

So, we can use the following code with package ggplot2 to produce the plots of interest.

# For 10% MPG.highway
regplot1 <- data.frame(
  dens = as.vector(apply(fit.reg$density[, 1:5,], c(1, 2), mean)),
  qlow = as.vector(apply(fit.reg$density[, 1:5,], c(1, 2),
                         quantile, probs = 0.025)),
  qupp = as.vector(apply(fit.reg$density[, 1:5,], c(1, 2),
                         quantile, probs = 0.975)),
  grid = rep(grid_y, length(grid_x2)),
  label = factor(rep(paste("Horsepower = ", grid_x2), each = length(grid_y)),
                 level = rep(paste("Horsepower = ", grid_x2)))
)
# For 90% MPG.highway
regplot2 <- data.frame(
  dens = as.vector(apply(fit.reg$density[, 6:10,], c(1, 2), mean)),
  qlow = as.vector(apply(fit.reg$density[, 6:10,], c(1, 2),
                         quantile, probs = 0.025)),
  qupp = as.vector(apply(fit.reg$density[, 6:10,], c(1, 2),
                         quantile, probs = 0.975)),
  grid = rep(grid_y, length(grid_x2)),
  label = factor(rep(paste("Horsepower = ", grid_x2), each = length(grid_y)),
                 level = rep(paste("Horsepower = ", grid_x2)))
)
library(ggplot2)
plot1 = ggplot(regplot1) + theme_bw() +
  geom_line(data = regplot1, map = aes(x = grid, y = dens)) +
  geom_ribbon(data = regplot1, map = aes(x = grid, ymin = qlow,
                                         ymax = qupp), fill = "green", alpha = 0.3) +
  facet_wrap(~label, ncol = 5, nrow = 1) +
  labs(x = "logPrice", y = "MPG.highway = 25")

plot2 = ggplot(regplot2) + theme_bw() +
  geom_line(data = regplot2, map = aes(x = grid, y = dens)) +
  geom_ribbon(data = regplot2, map = aes(x = grid, ymin = qlow,
                                         ymax = qupp), fill = "red", alpha = 0.3) +
  facet_wrap(~label, ncol = 5, nrow = 1) +
  labs(x = "logPrice", y = "MPG.highway = 36")

grid.arrange(plot1, plot2, layout_matrix = matrix(1:2, 2, 1))

We generally observe the following: (i) within both categories (high/low consumption) the mode of the distribution of logPrice shifts from left to right as Horsepower increases (understandable), (ii) for Horsepower = 50 there are no major differences between the distributions of high and low fuel consumption cars; however, as Horsepower increases, low-fuel-consumption cars (MPG.highway = 36) become more expensive (also understandable) and (iii) there is a lot of uncertainty in the distributions of low-fuel-consumption cars with high Horsepower; this is because there are only a few cars (data points) with this combination of attributes.

3 Multivariate unsupervised analysis

Here we will consider the unsupervised setting; i.e., we do not consider logPrice as a response variable. Previously, we saw that the DPM regression model for logPrice did not detect any clusters in the data. Does this mean there are no clusters? Not necessarily; it may simply be that the particular model and its specific structure do not relate to potential latent subgroups.

Recall that we have an interesting variable in the 1st column of Cars93, namely, the car brand.

Cars93[,1]
##  [1] Acura         Acura         Audi          Audi          BMW          
##  [6] Buick         Buick         Buick         Buick         Cadillac     
## [11] Cadillac      Chevrolet     Chevrolet     Chevrolet     Chevrolet    
## [16] Chevrolet     Chrylser      Chrysler      Chrysler      Dodge        
## [21] Dodge         Dodge         Dodge         Dodge         Eagle        
## [26] Eagle         Ford          Ford          Ford          Ford         
## [31] Ford          Ford          Ford          Geo           Geo          
## [36] Honda         Honda         Honda         Hyundai       Hyundai      
## [41] Hyundai       Hyundai       Infiniti      Lexus         Lexus        
## [46] Lincoln       Lincoln       Mazda         Mazda         Mazda        
## [51] Mercedes-Benz Mercedes-Benz Mercury       Mercury       Mitsubishi   
## [56] Mitsubishi    Nissan        Nissan        Nissan        Oldsmobile   
## [61] Oldsmobile    Oldsmobile    Plymouth      Pontiac       Pontiac      
## [66] Pontiac       Pontiac       Pontiac       Saab          Saturn       
## [71] Subaru        Subaru        Subaru        Suzuki        Toyota       
## [76] Toyota        Toyota        Volkswagen    Volkswagen    Volkswagen   
## [81] Volvo         Volvo        
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser Chrysler ... Volvo

Initially, we see 32 different car brands (levels), but we notice that one Chrysler is misspelled as Chrylser, so let's fix that. In the third line below we use factor() to reset the levels of Cars93[,1], so that the number of levels becomes the correct 31.

which(Cars93[,1] == 'Chrylser')
## [1] 17
Cars93[17,1] = 'Chrysler'
Cars93[,1] = factor(Cars93[,1])
table(Cars93[,1])
## 
##         Acura          Audi           BMW         Buick      Cadillac 
##             2             2             1             4             2 
##     Chevrolet      Chrysler         Dodge         Eagle          Ford 
##             5             3             5             2             7 
##           Geo         Honda       Hyundai      Infiniti         Lexus 
##             2             3             4             1             2 
##       Lincoln         Mazda Mercedes-Benz       Mercury    Mitsubishi 
##             2             3             2             2             2 
##        Nissan    Oldsmobile      Plymouth       Pontiac          Saab 
##             3             3             1             5             1 
##        Saturn        Subaru        Suzuki        Toyota    Volkswagen 
##             1             3             1             3             3 
##         Volvo 
##             2

Here we will consider the previous 4 variables (logPrice, MPG.highway, Horsepower, Fuel.tank.capacity) forming a matrix Y for a multivariate DPM model. Note that the choice of the variables is rather arbitrary, mainly following the previous analysis, so we could consider other variables. We proceed by defining the multivariate grid.

Y = cars93[, c('logPrice', 'MPG.highway', 'Horsepower', 'Fuel.tank.capacity')]
grid.length = 20
grid1 = seq(min(Y[,1]), max(Y[,1]), length.out = grid.length)
grid2 = seq(min(Y[,2]), max(Y[,2]), length.out = grid.length)
grid3 = seq(min(Y[,3]), max(Y[,3]), length.out = grid.length)
grid4 = seq(min(Y[,4]), max(Y[,4]), length.out = grid.length)
grid = expand.grid(grid1, grid2, grid3, grid4)
dim(grid)
## [1] 160000      4

Following that, we need to make some modelling choices again. Similarly to before we set Ek = 10, hyper = FALSE and follow a fully Bayesian approach by setting m0 = rep(0,4) and Sigma0 = diag(1000,4) (see Application 2 for a reminder).

DPprior = PYcalibrate(Ek = 10, n = n, discount = 0)
DPprior
## $strength
## [1] 2.774189
## 
## $discount
## [1] 0
mcmc.total.n = 5000
mcmc.burnin = 1000
mcmc = list(niter = mcmc.total.n, nburn = mcmc.burnin, method = 'ICS')
prior = list(strength =  DPprior$strength, discount = 0, m0 = rep(0,4), 
             Sigma0 = diag(1000,4), hyper = FALSE)
output = list(grid = grid, out_param = TRUE)
set.seed(1)
mult.fit = PYdensity(y = Y, mcmc = mcmc, prior = prior, output = output)
## Completed:   500/5000 - in 0.244 sec
## Completed:   1000/5000 - in 0.453 sec
## Completed:   1500/5000 - in 62.549 sec
## Completed:   2000/5000 - in 120.372 sec
## Completed:   2500/5000 - in 170.146 sec
## Completed:   3000/5000 - in 218.886 sec
## Completed:   3500/5000 - in 267.226 sec
## Completed:   4000/5000 - in 316.627 sec
## Completed:   4500/5000 - in 364.322 sec
## Completed:   5000/5000 - in 413.641 sec
## 
## Estimation done in 413.641 seconds
summary(mult.fit)
## PYdensity function call:
##  1000    burn-in iterations
##  5000    iterations 
##  Global estimation time: 413.64 seconds
##  Average number of groups:  2.0005 
##  Min number of groups:  2 ; Max number of groups:  3
cluster.labels = mult.fit$clust
n.cluster = apply(cluster.labels, 1, unique)
n.cluster = lapply(n.cluster, length)
n.cluster = unlist(n.cluster)
counts = table(n.cluster)
counts
## n.cluster
##    2    3 
## 3998    2

We see that the model quite decisively indicates the existence of two clusters in the data (despite our setting Ek = 10), which is quite interesting. Let's visualise the univariate and bivariate density estimates.

p1 = plot(mult.fit, dim = c(1, 1), xlab = "logPrice", ylab = "Density",
          col = 2)
p2 = plot(mult.fit, dim = c(2, 2), xlab = "MPG.highway", ylab = "Density",
          col = 3)
p3 = plot(mult.fit, dim = c(3, 3), xlab = "Horsepower", ylab = "Density",
          col = 4)
p4 = plot(mult.fit, dim = c(4, 4), xlab = "Fuel.tank.capacity", ylab = "Density",
          col = 5)

p12 = plot(mult.fit, dim = c(1, 2), show_clust = TRUE, xlab = "logPrice",
           ylab = "MPG.highway", col = 2)
p13 = plot(mult.fit, dim = c(1, 3), show_clust = TRUE, xlab = "logPrice",
           ylab = "Horsepower", col = 2)
p14 = plot(mult.fit, dim = c(1, 4), show_clust = TRUE, xlab = "logPrice",
           ylab = "Fuel.tank.capacity", col = 2)

p21 = plot(mult.fit, dim = c(2, 1), show_clust = TRUE, xlab = "MPG.highway",
           ylab = "logPrice", col = 3)
p23 = plot(mult.fit, dim = c(2, 3), show_clust = TRUE, xlab = "MPG.highway",
           ylab = "Horsepower", col = 3)
p24 = plot(mult.fit, dim = c(2, 4), show_clust = TRUE, xlab = "MPG.highway",
           ylab = "Fuel.tank.capacity", col = 3)

p31 = plot(mult.fit, dim = c(3, 1), show_clust = TRUE, xlab = "Horsepower",
           ylab = "logPrice", col = 4)
p32 = plot(mult.fit, dim = c(3, 2), show_clust = TRUE, xlab = "Horsepower",
           ylab = "MPG.highway", col = 4)
p34 = plot(mult.fit, dim = c(3, 4), show_clust = TRUE, xlab = "Horsepower",
           ylab = "Fuel.tank.capacity", col = 4)

p41 = plot(mult.fit, dim = c(4, 1), show_clust = TRUE, xlab = "Fuel.tank.capacity",
           ylab = "logPrice", col = 5)
p42 = plot(mult.fit, dim = c(4, 2), show_clust = TRUE, xlab = "Fuel.tank.capacity",
           ylab = "MPG.highway", col = 5)
p43 = plot(mult.fit, dim = c(4, 3), show_clust = TRUE, xlab = "Fuel.tank.capacity",
           ylab = "Horsepower", col = 5)

library(gridExtra)
grid.arrange(p1, p12, p13, p14, p21, p2, p23, p24, p31, p32, p3, p34, p41, p42,
             p43, p4, layout_matrix = matrix(1:16, 4, 4))

It would be interesting to inspect the cluster allocations of the data points (corresponding to the coloured labels of the points in the bivariate plots) in relation to the car brands. This is something we have not seen before! The cluster allocations for every MCMC sample are contained in mult.fit$clust. What we want to do is find the average allocation per data point and then round it. Following that, we can see to which cluster each car belongs.

names(mult.fit)
##  [1] "density"    "data"       "grideval"   "grid_x"     "grid_y"    
##  [6] "clust"      "mean"       "beta"       "sigma2"     "probs"     
## [11] "niter"      "nburn"      "tot_time"   "univariate" "regression"
## [16] "dep"        "group_log"  "group"      "wvals"
MCMC.allocations = mult.fit$clust
dim(MCMC.allocations)
## [1] 4000   82
allocations = round(apply(MCMC.allocations, 2, mean))
allocations
##  [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 1 1 2 2 1 2 1
## [39] 2 1 2 1 1 1 1 1 1 2 2 1 1 1 2 1 2 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 2 2 1
## [77] 1 2 1 1 1 1
table(allocations)
## allocations
##  1  2 
## 58 24
clusters = c()
clusters[allocations == 1] = 'Cluster 1'
clusters[allocations == 2] = 'Cluster 2'
cbind(as.character(Cars93[,1]),clusters)
##                       clusters   
##  [1,] "Acura"         "Cluster 1"
##  [2,] "Acura"         "Cluster 1"
##  [3,] "Audi"          "Cluster 1"
##  [4,] "Audi"          "Cluster 1"
##  [5,] "BMW"           "Cluster 1"
##  [6,] "Buick"         "Cluster 1"
##  [7,] "Buick"         "Cluster 1"
##  [8,] "Buick"         "Cluster 1"
##  [9,] "Buick"         "Cluster 1"
## [10,] "Cadillac"      "Cluster 1"
## [11,] "Cadillac"      "Cluster 1"
## [12,] "Chevrolet"     "Cluster 2"
## [13,] "Chevrolet"     "Cluster 2"
## [14,] "Chevrolet"     "Cluster 1"
## [15,] "Chevrolet"     "Cluster 1"
## [16,] "Chevrolet"     "Cluster 1"
## [17,] "Chrysler"      "Cluster 1"
## [18,] "Chrysler"      "Cluster 1"
## [19,] "Chrysler"      "Cluster 1"
## [20,] "Dodge"         "Cluster 2"
## [21,] "Dodge"         "Cluster 2"
## [22,] "Dodge"         "Cluster 1"
## [23,] "Dodge"         "Cluster 1"
## [24,] "Dodge"         "Cluster 1"
## [25,] "Eagle"         "Cluster 2"
## [26,] "Eagle"         "Cluster 1"
## [27,] "Ford"          "Cluster 2"
## [28,] "Ford"          "Cluster 1"
## [29,] "Ford"          "Cluster 2"
## [30,] "Ford"          "Cluster 1"
## [31,] "Ford"          "Cluster 1"
## [32,] "Ford"          "Cluster 1"
## [33,] "Ford"          "Cluster 1"
## [34,] "Geo"           "Cluster 2"
## [35,] "Geo"           "Cluster 2"
## [36,] "Honda"         "Cluster 1"
## [37,] "Honda"         "Cluster 2"
## [38,] "Honda"         "Cluster 1"
## [39,] "Hyundai"       "Cluster 2"
## [40,] "Hyundai"       "Cluster 1"
## [41,] "Hyundai"       "Cluster 2"
## [42,] "Hyundai"       "Cluster 1"
## [43,] "Infiniti"      "Cluster 1"
## [44,] "Lexus"         "Cluster 1"
## [45,] "Lexus"         "Cluster 1"
## [46,] "Lincoln"       "Cluster 1"
## [47,] "Lincoln"       "Cluster 1"
## [48,] "Mazda"         "Cluster 2"
## [49,] "Mazda"         "Cluster 2"
## [50,] "Mazda"         "Cluster 1"
## [51,] "Mercedes-Benz" "Cluster 1"
## [52,] "Mercedes-Benz" "Cluster 1"
## [53,] "Mercury"       "Cluster 2"
## [54,] "Mercury"       "Cluster 1"
## [55,] "Mitsubishi"    "Cluster 2"
## [56,] "Mitsubishi"    "Cluster 1"
## [57,] "Nissan"        "Cluster 2"
## [58,] "Nissan"        "Cluster 1"
## [59,] "Nissan"        "Cluster 1"
## [60,] "Oldsmobile"    "Cluster 1"
## [61,] "Oldsmobile"    "Cluster 1"
## [62,] "Oldsmobile"    "Cluster 1"
## [63,] "Plymouth"      "Cluster 1"
## [64,] "Pontiac"       "Cluster 2"
## [65,] "Pontiac"       "Cluster 2"
## [66,] "Pontiac"       "Cluster 1"
## [67,] "Pontiac"       "Cluster 1"
## [68,] "Pontiac"       "Cluster 1"
## [69,] "Saab"          "Cluster 1"
## [70,] "Saturn"        "Cluster 2"
## [71,] "Subaru"        "Cluster 2"
## [72,] "Subaru"        "Cluster 1"
## [73,] "Subaru"        "Cluster 1"
## [74,] "Suzuki"        "Cluster 2"
## [75,] "Toyota"        "Cluster 2"
## [76,] "Toyota"        "Cluster 1"
## [77,] "Toyota"        "Cluster 1"
## [78,] "Volkswagen"    "Cluster 2"
## [79,] "Volkswagen"    "Cluster 1"
## [80,] "Volkswagen"    "Cluster 1"
## [81,] "Volvo"         "Cluster 1"
## [82,] "Volvo"         "Cluster 1"

Several brands have cars in both clusters; logically, this should depend on the specific car model, its specifications, etc. However, luxury brands like Acura, BMW, Cadillac, Infiniti, Lexus and Mercedes-Benz, among others, all fall in the same cluster. In the plot below we have logPrice on the \(y\)-axis and MPG.highway on the \(x\)-axis, labelling the data points with the name of the brand and using different colours and point types for the two clusters. In the top-left corner Infiniti and Mercedes-Benz stand out, while near the bottom-right corner we find Honda and, especially, Geo as economical and very efficient options!

plot(MPG.highway, logPrice, col = allocations, pch = allocations, bty = 'l', cex = 1)
text(MPG.highway, logPrice,as.character(Cars93[,1]), cex = 0.6, pos = 4)

As a final remark, we note that the above clustering analysis could potentially be improved by adding further variables. For instance, one might discover an extra cluster in the data in the middle region (roughly 27-33 MPG.highway). That would especially be the case if the added variables provide new information relevant to the clusters that is not contained in the current variables.