We have 1,599 observations of Portuguese red wines. The dataset was originally used to predict the quality of the wine based on 11 physicochemical tests; the quality is a score between 0 and 10. We want a continuous output, so we will use alcohol as the output and ignore quality.
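A minimal sketch of how the data might be read in, assuming the standard UCI file name; the standardisation step is also an assumption, suggested by the coefficient scales in the regression output below.

```r
# Red-wine data from the UCI repository: a semicolon-separated CSV
# (file name assumed)
wine <- read.csv("winequality-red.csv", sep = ";")

# Standardise the physicochemical predictors (assumed step; alcohol is
# left on its original scale as the response)
predictors <- setdiff(names(wine), c("alcohol", "quality"))
wine[predictors] <- scale(wine[predictors])
```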
##
## Call:
## lm(formula = alcohol ~ . - quality, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07175 -0.39267 -0.04056 0.35396 2.44365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.42298 0.01536 678.791 < 2e-16 ***
## fixed.acidity 0.92702 0.03594 25.796 < 2e-16 ***
## volatile.acidity 0.06461 0.02048 3.154 0.001638 **
## citric.acid 0.16181 0.02686 6.024 2.11e-09 ***
## residual.sugar 0.40100 0.01733 23.135 < 2e-16 ***
## chlorides -0.06881 0.01862 -3.696 0.000227 ***
## free.sulfur.dioxide -0.02242 0.02151 -1.042 0.297517
## total.sulfur.dioxide -0.07552 0.02264 -3.336 0.000868 ***
## density -1.16521 0.02533 -45.998 < 2e-16 ***
## pH 0.58085 0.02394 24.263 < 2e-16 ***
## sulphates 0.21134 0.01758 12.020 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.614 on 1588 degrees of freedom
## Multiple R-squared: 0.6701, Adjusted R-squared: 0.668
## F-statistic: 322.5 on 10 and 1588 DF, p-value: < 2.2e-16
There are issues of dependence between the variables: several of the predictors are strongly correlated with one another, as the correlation matrix below shows.
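A sketch of how the correlation matrix can be produced (which columns are dropped is an assumption based on the variables shown):

```r
# Pairwise correlations between the predictors, rounded to two decimals
round(cor(wine[, setdiff(names(wine), c("alcohol", "quality"))]), 2)
```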
## fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity 1.00 -0.26 0.67 0.11
## volatile.acidity -0.26 1.00 -0.55 0.00
## citric.acid 0.67 -0.55 1.00 0.14
## residual.sugar 0.11 0.00 0.14 1.00
## chlorides 0.09 0.06 0.20 0.06
## free.sulfur.dioxide -0.15 -0.01 -0.06 0.19
## total.sulfur.dioxide -0.11 0.08 0.04 0.20
## density 0.67 0.02 0.36 0.36
## pH -0.68 0.23 -0.54 -0.09
## sulphates 0.18 -0.26 0.31 0.01
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity 0.09 -0.15 -0.11 0.67
## volatile.acidity 0.06 -0.01 0.08 0.02
## citric.acid 0.20 -0.06 0.04 0.36
## residual.sugar 0.06 0.19 0.20 0.36
## chlorides 1.00 0.01 0.05 0.20
## free.sulfur.dioxide 0.01 1.00 0.67 -0.02
## total.sulfur.dioxide 0.05 0.67 1.00 0.07
## density 0.20 -0.02 0.07 1.00
## pH -0.27 0.07 -0.07 -0.34
## sulphates 0.37 0.05 0.04 0.15
## pH sulphates
## fixed.acidity -0.68 0.18
## volatile.acidity 0.23 -0.26
## citric.acid -0.54 0.31
## residual.sugar -0.09 0.01
## chlorides -0.27 0.37
## free.sulfur.dioxide 0.07 0.05
## total.sulfur.dioxide -0.07 0.04
## density -0.34 0.15
## pH 1.00 -0.20
## sulphates -0.20 1.00
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27171 -0.18006 0.02867 0.25684 0.74620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1490 0.1167 1.277 0.2047
## x1 0.3952 0.1410 2.804 0.0061 **
## x2 -0.1196 0.1523 -0.786 0.4340
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3982 on 97 degrees of freedom
## Multiple R-squared: 0.0843, Adjusted R-squared: 0.06542
## F-statistic: 4.465 on 2 and 97 DF, p-value: 0.01396
As we heard earlier in the course, regularisation is a way to prevent overfitting by adding a penalty term to the loss function.
See notes
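For reference, with \(\lambda \ge 0\) controlling the strength of the penalty, the lasso and ridge estimates minimise the penalised residual sum of squares

\[
\hat{\beta}^{\text{lasso}} = \operatorname*{arg\,min}_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
\qquad
\hat{\beta}^{\text{ridge}} = \operatorname*{arg\,min}_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.
\]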
We can use the caret package to fit a lasso model. I will take out a test set to evaluate the model.
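A minimal sketch of that workflow; the split proportion, random seed, model formula and lambda grid are assumptions inferred from the output below.

```r
library(caret)

# Hold out roughly 20% of the observations as a test set
set.seed(1)                                        # seed assumed
train_idx  <- createDataPartition(wine$alcohol, p = 0.8, list = FALSE)
wine_train <- wine[train_idx, ]
wine_test  <- wine[-train_idx, ]

# Lasso = glmnet with alpha fixed at 1; lambda tuned by 10-fold CV
lasso_fit <- train(
  alcohol ~ ., data = wine_train,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(alpha = 1, lambda = seq(0.001, 0.451, by = 0.05))
)
lasso_fit

# Performance of the selected model on the held-out test set
postResample(predict(lasso_fit, wine_test), wine_test$alcohol)
```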
## glmnet
##
## 1279 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1152, 1151, 1152, 1150, 1151, 1150, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.001 0.6171850 0.6738015 0.4709744
## 0.051 0.6587054 0.6580512 0.5174123
## 0.101 0.7668547 0.5735796 0.6181435
## 0.151 0.8899249 0.3813854 0.7262289
## 0.201 0.9424549 0.2922005 0.7760478
## 0.251 0.9697889 0.2512246 0.8034673
## 0.301 0.9840139 0.2507113 0.8170962
## 0.351 1.0003742 0.2507113 0.8312198
## 0.401 1.0189397 0.2507113 0.8467446
## 0.451 1.0395921 0.2507113 0.8636891
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.001.
We can use the caret package to fit a ridge regression model.
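A sketch of the ridge fit, reusing the same training set and lambda grid but fixing alpha at 0:

```r
# Ridge = glmnet with alpha fixed at 0
ridge_fit <- train(
  alcohol ~ ., data = wine_train,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(alpha = 0, lambda = seq(0.001, 0.451, by = 0.05))
)
ridge_fit
```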
## glmnet
##
## 1279 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1151, 1152, 1150, 1151, 1149, 1151, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.001 0.6338490 0.6666922 0.4933181
## 0.051 0.6338490 0.6666922 0.4933181
## 0.101 0.6588921 0.6525803 0.5168654
## 0.151 0.6835236 0.6374974 0.5392489
## 0.201 0.7051758 0.6235013 0.5584318
## 0.251 0.7240786 0.6108218 0.5750330
## 0.301 0.7407173 0.5993372 0.5895403
## 0.351 0.7555061 0.5889190 0.6025471
## 0.401 0.7686677 0.5795473 0.6143427
## 0.451 0.7806062 0.5709360 0.6250983
##
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.051.
Perhaps we have to accept that the relationship between the variables is too complex for a linear model.
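One option is to expand the inputs, for example with a polynomial basis, and then regularise the expanded model. A sketch of the two fits summarised below, assuming the simulated vectors x and y are in the workspace:

```r
# Ordinary least squares on an orthogonal quintic polynomial basis
quintic_lm <- lm(y ~ poly(x, 5))
summary(quintic_lm)

# The same basis, regularised with the lasso via caret/glmnet
sim <- data.frame(x = x, y = y)   # helper data frame (assumed)
quintic_lasso <- train(
  y ~ poly(x, 5), data = sim,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(alpha = 1, lambda = seq(0.001, 0.451, by = 0.05))
)
```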
##
## Call:
## lm(formula = y ~ poly(x, 5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.198 -1.587 -0.007 1.274 5.481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.9501 0.2084 47.739 < 2e-16 ***
## poly(x, 5)1 58.0370 2.0947 27.707 < 2e-16 ***
## poly(x, 5)2 4.4511 2.0947 2.125 0.036186 *
## poly(x, 5)3 -2.4853 2.0947 -1.187 0.238384
## poly(x, 5)4 5.1229 2.0947 2.446 0.016300 *
## poly(x, 5)5 7.6516 2.0947 3.653 0.000425 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.095 on 95 degrees of freedom
## Multiple R-squared: 0.893, Adjusted R-squared: 0.8874
## F-statistic: 158.6 on 5 and 95 DF, p-value: < 2.2e-16
## glmnet
##
## 101 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 90, 91, 93, 90, 91, 93, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.001 2.084721 0.8894493 1.796939
## 0.051 2.083582 0.8897003 1.797469
## 0.101 2.087353 0.8898525 1.799477
## 0.151 2.098398 0.8896972 1.811938
## 0.201 2.116231 0.8892630 1.827760
## 0.251 2.137840 0.8888947 1.845880
## 0.301 2.163765 0.8884723 1.865730
## 0.351 2.195501 0.8876170 1.888634
## 0.401 2.224451 0.8869069 1.911221
## 0.451 2.250752 0.8862820 1.930449
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.051.
Here’s a table of the fitted coefficients for the quintic model.
| Variable | Standard | Lasso |
|---|---|---|
| (Intercept) | 9.95 | 9.95 |
| poly(x, 5)1 | 58.04 | 57.86 |
| poly(x, 5)2 | 4.45 | 4.27 |
| poly(x, 5)3 | -2.49 | -2.3 |
| poly(x, 5)4 | 5.12 | 4.94 |
| poly(x, 5)5 | 7.65 | 7.47 |
Here’s the corresponding table for the second example.

| Variable | Standard | Lasso |
|---|---|---|
| (Intercept) | 9.46 | 9.46 |
| poly(x, 5)1 | 21.52 | 21.45 |
| poly(x, 5)2 | -2.82 | -2.76 |
| poly(x, 5)3 | -3.32 | -3.25 |
| poly(x, 5)4 | 4.3 | 4.24 |
| poly(x, 5)5 | 1.27 | 1.2 |
For regression, the \(k\)-nearest neighbours algorithm is still a lazy algorithm: it predicts the value for a new data point from the responses of its \(k\) nearest neighbours in the training data, typically by averaging them. The regression variant involves the same choices as the classification variant, such as the value of \(k\) and the distance metric.
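A sketch of the caret call for the wine data; the grid of \(k\) values is inferred from the output below.

```r
# kNN regression on the (standardised) wine training set
knn_fit <- train(
  alcohol ~ ., data = wine_train,
  method    = "knn",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = data.frame(k = 1:10)
)
knn_fit
```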
## k-Nearest Neighbors
##
## 1279 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1151, 1152, 1151, 1151, 1150, 1151, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.6640768 0.6532562 0.4004811
## 2 0.6288326 0.6702197 0.4368942
## 3 0.6219524 0.6718558 0.4475138
## 4 0.6244981 0.6683918 0.4590103
## 5 0.6241265 0.6695391 0.4632466
## 6 0.6228080 0.6717532 0.4640830
## 7 0.6264861 0.6683710 0.4700554
## 8 0.6252424 0.6722159 0.4720161
## 9 0.6326032 0.6659768 0.4807979
## 10 0.6382291 0.6610378 0.4889308
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
## k-Nearest Neighbors
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 89, 91, 88, 90, 90, 90, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.2440621 0.6845630 0.1522891
## 2 0.2249974 0.7055742 0.1437213
## 3 0.2181852 0.6967848 0.1454025
## 4 0.2189395 0.7194136 0.1521513
## 5 0.2121521 0.7379648 0.1471030
## 6 0.2176583 0.7114566 0.1536681
## 7 0.2204738 0.7081920 0.1574887
## 8 0.2276576 0.6983430 0.1639295
## 9 0.2327024 0.6819850 0.1670122
## 10 0.2350292 0.6824917 0.1706130
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
An interesting variant is the weighted \(k\)-nearest neighbours algorithm, which gives closer neighbours more influence on the prediction (use kknn in caret).
See notes
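A sketch of the weighted variant; the tuning grid here is an assumption (caret's kknn method tunes kmax, distance and kernel).

```r
# Weighted kNN: neighbours are weighted by a kernel applied to their distance
wknn_fit <- train(
  alcohol ~ ., data = wine_train,
  method    = "kknn",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(kmax     = 3:9,
                          distance = 2,
                          kernel   = c("triangular", "gaussian"))
)
```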
Just like in the classification case, we have a tree structure where each node is a decision based on a variable, but at the leaves we have a continuous value rather than a class.
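A sketch of the tree fit, tuning the complexity parameter cp over the grid seen in the output below:

```r
# Regression tree (CART) via rpart; larger cp means more aggressive pruning
tree_fit <- train(
  alcohol ~ ., data = wine_train,
  method    = "rpart",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = data.frame(cp = seq(0.01, 0.10, by = 0.01))
)
tree_fit
```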
## CART
##
## 1279 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1151, 1151, 1151, 1152, 1151, 1152, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01 0.7685858 0.5051059 0.5945301
## 0.02 0.8019660 0.4590454 0.6310750
## 0.03 0.8234011 0.4278642 0.6526073
## 0.04 0.8570786 0.3775874 0.6725195
## 0.05 0.8738034 0.3519282 0.6979339
## 0.06 0.9054071 0.3029838 0.7261190
## 0.07 0.9099455 0.2961153 0.7277403
## 0.08 0.9099455 0.2961153 0.7277403
## 0.09 0.9099455 0.2961153 0.7277403
## 0.10 0.9099455 0.2961153 0.7277403
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01.
## CART
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 90, 90, 90, 91, 90, 90, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.001 0.3312531 0.4661436 0.2275115
## 0.011 0.3312531 0.4661436 0.2275115
## 0.021 0.3350106 0.4602345 0.2339854
## 0.031 0.3333469 0.4477083 0.2429868
## 0.041 0.3351225 0.4413603 0.2494965
## 0.051 0.3259153 0.4395412 0.2477909
## 0.061 0.3384965 0.4366370 0.2568656
## 0.071 0.3363805 0.4728246 0.2554663
## 0.081 0.3396735 0.4647555 0.2601945
## 0.091 0.3401932 0.4732604 0.2592124
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.051.
In the final decision tree example, it would have been nice to allow something more flexible than a constant within each region of the split. Fitting straight lines to data is easy but, as we saw with regularised regression, the linear assumption is often too restrictive.
See notes
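Multivariate adaptive regression splines (MARS) build the fit from piecewise-linear hinge functions, so each region of the input space gets its own slope rather than a single constant. A sketch of the caret call (method "earth"), with the tuning grid inferred from the output below:

```r
# MARS: tune the interaction degree and the number of retained terms
mars_fit <- train(
  alcohol ~ ., data = wine_train,
  method    = "earth",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(degree = 1:2, nprune = 5:15)
)
mars_fit
summary(mars_fit$finalModel)   # lists the selected hinge terms
```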
## Multivariate Adaptive Regression Spline
##
## 1279 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1151, 1150, 1150, 1152, 1151, 1150, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 5 0.6882569 0.5988978 0.5417483
## 1 6 0.6165495 0.6790580 0.4781320
## 1 7 0.5714513 0.7231695 0.4329544
## 1 8 0.5611583 0.7338352 0.4276597
## 1 9 0.5557677 0.7381726 0.4220416
## 1 10 0.5561513 0.7369281 0.4218393
## 1 11 0.5573855 0.7357573 0.4236253
## 1 12 0.5560040 0.7362815 0.4228172
## 1 13 0.5525987 0.7394891 0.4205849
## 1 14 0.5500273 0.7417924 0.4183758
## 1 15 0.5473462 0.7439594 0.4174334
## 2 5 0.7041022 0.5794912 0.5504140
## 2 6 0.6343698 0.6590481 0.4913669
## 2 7 0.5787956 0.7161153 0.4420777
## 2 8 0.5592512 0.7355229 0.4297027
## 2 9 0.5502236 0.7432360 0.4224084
## 2 10 0.5457894 0.7465072 0.4191210
## 2 11 0.5394854 0.7522509 0.4152162
## 2 12 0.5362836 0.7546599 0.4149909
## 2 13 0.5343549 0.7562030 0.4136871
## 2 14 0.5313146 0.7589206 0.4118940
## 2 15 0.5313511 0.7591761 0.4125838
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
## Call: earth(x=matrix[1279,10], y=c(9.5,11.8,9.6,...), keepxy=TRUE, degree=2,
## nprune=14)
##
## coefficients
## (Intercept) 9.7356673
## h(-0.643065-fixed.acidity) -0.3428044
## h(fixed.acidity- -0.643065) 1.1524476
## h(citric.acid-1.22702) 0.3906266
## h(0.327105-residual.sugar) -0.5031012
## h(0.0282519-density) 1.3645855
## h(density-0.0282519) -1.0505630
## h(pH- -1.82084) 0.4315024
## h(0.187905-sulphates) -0.7474732
## h(fixed.acidity- -0.643065) * h(2.45487-residual.sugar) -0.1750150
## h(1.19486-fixed.acidity) * h(0.327105-residual.sugar) -0.3948844
## h(fixed.acidity-1.19486) * h(0.327105-residual.sugar) -0.5944470
## h(1.17136-total.sulfur.dioxide) * h(0.187905-sulphates) 0.2032019
## h(density-1.03496) * h(pH- -1.82084) 0.3379802
##
## Selected 14 of 21 terms, and 7 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: density, fixed.acidity, residual.sugar, pH, sulphates, ...
## Number of terms at each degree of interaction: 1 8 5
## GCV 0.2775733 RSS 336.6625 GRSq 0.7625749 RSq 0.774497
## Multivariate Adaptive Regression Spline
##
## 100 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 89, 91, 88, 90, 90, 90, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 0.3556969 0.3532962 0.2577542
## 1 3 0.3233225 0.4408176 0.2416776
## 1 4 0.3358005 0.4073014 0.2483367
## 1 5 0.3373541 0.4059216 0.2483129
## 1 6 0.3146914 0.4518349 0.2356478
## 1 7 0.3229163 0.4290682 0.2460686
## 2 2 0.3556969 0.3532962 0.2577542
## 2 3 0.3186444 0.4351783 0.2363528
## 2 4 0.2845204 0.5428754 0.2139715
## 2 5 0.2534200 0.6307710 0.1933713
## 2 6 0.2588088 0.6263873 0.1915131
## 2 7 0.2591623 0.6249842 0.1959579
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 2.
## Call: earth(x=matrix[100,2], y=c(0.2447,0.6471...), keepxy=TRUE, degree=2,
## nprune=5)
##
## coefficients
## (Intercept) -0.0877179
## h(0.4422-x1) 1.3398597
## h(x1-0.4422) 2.2128781
## h(x1-0.4422) * h(x2-0.48129) -4.6053999
##
## Selected 4 of 15 terms, and 2 of 2 predictors (nprune=5)
## Termination condition: Reached nk 21
## Importance: x1, x2
## Number of terms at each degree of interaction: 1 2 1
## GCV 0.0943028 RSS 7.895266 GRSq 0.4496379 RSq 0.5298675
The true function is