We have 1,599 observations of Portuguese red wines. The dataset was originally used to predict the quality of the wine based on 11 physicochemical tests.
The quality is a score between 0 and 10.
Since we want a continuous response, we will use alcohol as the output and exclude quality from the model.
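A minimal sketch of how the data might be loaded and the model fitted. The file name and the standardisation step are assumptions (the coefficient scales in the summary below suggest the predictors were centred and scaled); only the model formula is taken from the output.

```r
# Assumed data source: the UCI "winequality-red" file (semicolon-separated)
wine <- read.csv("winequality-red.csv", sep = ";")

# Centre and scale everything except the response (alcohol) and quality
num_cols <- !(names(wine) %in% c("alcohol", "quality"))
wine[num_cols] <- scale(wine[num_cols])

# Full linear model: alcohol as the response, quality excluded
fit_lm <- lm(alcohol ~ . - quality, data = wine)
summary(fit_lm)
```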
## 
## Call:
## lm(formula = alcohol ~ . - quality, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07175 -0.39267 -0.04056  0.35396  2.44365 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.42298    0.01536 678.791  < 2e-16 ***
## fixed.acidity         0.92702    0.03594  25.796  < 2e-16 ***
## volatile.acidity      0.06461    0.02048   3.154 0.001638 ** 
## citric.acid           0.16181    0.02686   6.024 2.11e-09 ***
## residual.sugar        0.40100    0.01733  23.135  < 2e-16 ***
## chlorides            -0.06881    0.01862  -3.696 0.000227 ***
## free.sulfur.dioxide  -0.02242    0.02151  -1.042 0.297517    
## total.sulfur.dioxide -0.07552    0.02264  -3.336 0.000868 ***
## density              -1.16521    0.02533 -45.998  < 2e-16 ***
## pH                    0.58085    0.02394  24.263  < 2e-16 ***
## sulphates             0.21134    0.01758  12.020  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.614 on 1588 degrees of freedom
## Multiple R-squared:  0.6701, Adjusted R-squared:  0.668 
## F-statistic: 322.5 on 10 and 1588 DF,  p-value: < 2.2e-16
There are issues of dependence (collinearity) between the predictor variables, as the correlation matrix below shows.
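The pairwise correlations can be computed directly (a one-liner, assuming the standardised `wine` data frame from the sketch above):

```r
# Pairwise correlations between the predictors, rounded to two decimals
round(cor(wine[, !(names(wine) %in% c("alcohol", "quality"))]), 2)
```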
##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity                 1.00            -0.26        0.67           0.11
## volatile.acidity             -0.26             1.00       -0.55           0.00
## citric.acid                   0.67            -0.55        1.00           0.14
## residual.sugar                0.11             0.00        0.14           1.00
## chlorides                     0.09             0.06        0.20           0.06
## free.sulfur.dioxide          -0.15            -0.01       -0.06           0.19
## total.sulfur.dioxide         -0.11             0.08        0.04           0.20
## density                       0.67             0.02        0.36           0.36
## pH                           -0.68             0.23       -0.54          -0.09
## sulphates                     0.18            -0.26        0.31           0.01
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity             0.09               -0.15                -0.11    0.67
## volatile.acidity          0.06               -0.01                 0.08    0.02
## citric.acid               0.20               -0.06                 0.04    0.36
## residual.sugar            0.06                0.19                 0.20    0.36
## chlorides                 1.00                0.01                 0.05    0.20
## free.sulfur.dioxide       0.01                1.00                 0.67   -0.02
## total.sulfur.dioxide      0.05                0.67                 1.00    0.07
## density                   0.20               -0.02                 0.07    1.00
## pH                       -0.27                0.07                -0.07   -0.34
## sulphates                 0.37                0.05                 0.04    0.15
##                         pH sulphates
## fixed.acidity        -0.68      0.18
## volatile.acidity      0.23     -0.26
## citric.acid          -0.54      0.31
## residual.sugar       -0.09      0.01
## chlorides            -0.27      0.37
## free.sulfur.dioxide   0.07      0.05
## total.sulfur.dioxide -0.07      0.04
## density              -0.34      0.15
## pH                    1.00     -0.20
## sulphates            -0.20      1.00
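The summary below comes from a small simulated example, presumably illustrating the effect of correlated predictors. A minimal sketch of how such data might be generated; the sample size matches the 97 residual degrees of freedom, but the seed, coefficients, and noise level are assumptions.

```r
set.seed(1)                       # seed is an assumption
n  <- 100
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # x2 is nearly a copy of x1
y  <- 0.3 * x1 + 0.1 * x2 + rnorm(n, sd = 0.4)

# With highly correlated predictors the individual coefficients are poorly
# determined, even when the overall fit is reasonable
summary(lm(y ~ x1 + x2))
```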
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27171 -0.18006  0.02867  0.25684  0.74620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.1490     0.1167   1.277   0.2047   
## x1            0.3952     0.1410   2.804   0.0061 **
## x2           -0.1196     0.1523  -0.786   0.4340   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3982 on 97 degrees of freedom
## Multiple R-squared:  0.0843, Adjusted R-squared:  0.06542 
## F-statistic: 4.465 on 2 and 97 DF,  p-value: 0.01396
As we heard earlier in the course, regularisation is a way to prevent overfitting by adding a penalty term to the loss function.
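For a linear model the penalised least-squares criteria are (standard definitions, with \(\lambda \ge 0\) controlling the strength of the penalty):

\[
\text{lasso: } \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,
\qquad
\text{ridge: } \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.
\]

The lasso penalty can shrink coefficients exactly to zero, so it also performs variable selection; the ridge penalty only shrinks them towards zero.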
See notes
We can use the caret package to fit a lasso model. I will hold out a test set to evaluate the model.
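A sketch of the caret call. The train/test split and seed are assumptions; the \(\lambda\) grid and alpha = 1 are read off the output below, and `wine` is the standardised data frame from earlier.

```r
library(caret)

set.seed(1)                                   # seed is an assumption
in_train   <- createDataPartition(wine$alcohol, p = 0.8, list = FALSE)
wine_train <- wine[in_train, ]
wine_test  <- wine[-in_train, ]

# Lasso: alpha = 1, lambda tuned over a small grid by 10-fold cross-validation
lasso_fit <- train(
  alcohol ~ . - quality, data = wine_train,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(alpha = 1, lambda = seq(0.001, 0.451, by = 0.05))
)
lasso_fit

# Evaluate on the held-out test set
postResample(predict(lasso_fit, newdata = wine_test), wine_test$alcohol)
```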
## glmnet 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1152, 1151, 1152, 1150, 1151, 1150, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.001   0.6171850  0.6738015  0.4709744
##   0.051   0.6587054  0.6580512  0.5174123
##   0.101   0.7668547  0.5735796  0.6181435
##   0.151   0.8899249  0.3813854  0.7262289
##   0.201   0.9424549  0.2922005  0.7760478
##   0.251   0.9697889  0.2512246  0.8034673
##   0.301   0.9840139  0.2507113  0.8170962
##   0.351   1.0003742  0.2507113  0.8312198
##   0.401   1.0189397  0.2507113  0.8467446
##   0.451   1.0395921  0.2507113  0.8636891
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.001.
We can use the caret package to fit a ridge regression
model.
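For ridge regression the setup is the same but with alpha = 0 (a sketch under the same assumptions as above):

```r
# Ridge: alpha = 0, same lambda grid and 10-fold cross-validation
ridge_fit <- train(
  alcohol ~ . - quality, data = wine_train,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(alpha = 0, lambda = seq(0.001, 0.451, by = 0.05))
)
ridge_fit
```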
## glmnet 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1152, 1150, 1151, 1149, 1151, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.001   0.6338490  0.6666922  0.4933181
##   0.051   0.6338490  0.6666922  0.4933181
##   0.101   0.6588921  0.6525803  0.5168654
##   0.151   0.6835236  0.6374974  0.5392489
##   0.201   0.7051758  0.6235013  0.5584318
##   0.251   0.7240786  0.6108218  0.5750330
##   0.301   0.7407173  0.5993372  0.5895403
##   0.351   0.7555061  0.5889190  0.6025471
##   0.401   0.7686677  0.5795473  0.6143427
##   0.451   0.7806062  0.5709360  0.6250983
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.051.
Perhaps we have to accept that the relationship between the variables is too complex for a linear model.
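The next example fits a degree-5 (quintic) polynomial to a single simulated predictor, first by ordinary least squares and then with a lasso penalty on the same polynomial basis. A minimal sketch, assuming simulated vectors `x` and `y` as in the output below; the exact caret setup is an assumption.

```r
# Ordinary least squares on an orthogonal quintic polynomial basis
quintic_lm <- lm(y ~ poly(x, 5))
summary(quintic_lm)

# Lasso on the same basis via caret/glmnet
sim <- data.frame(x = x, y = y)
quintic_lasso <- train(
  y ~ poly(x, 5), data = sim,
  method    = "glmnet",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(alpha = 1, lambda = seq(0.001, 0.451, by = 0.05))
)
quintic_lasso
```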
## 
## Call:
## lm(formula = y ~ poly(x, 5))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.198 -1.587 -0.007  1.274  5.481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.9501     0.2084  47.739  < 2e-16 ***
## poly(x, 5)1  58.0370     2.0947  27.707  < 2e-16 ***
## poly(x, 5)2   4.4511     2.0947   2.125 0.036186 *  
## poly(x, 5)3  -2.4853     2.0947  -1.187 0.238384    
## poly(x, 5)4   5.1229     2.0947   2.446 0.016300 *  
## poly(x, 5)5   7.6516     2.0947   3.653 0.000425 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.095 on 95 degrees of freedom
## Multiple R-squared:  0.893,  Adjusted R-squared:  0.8874 
## F-statistic: 158.6 on 5 and 95 DF,  p-value: < 2.2e-16
## glmnet 
## 
## 101 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 90, 91, 93, 90, 91, 93, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.001   2.084721  0.8894493  1.796939
##   0.051   2.083582  0.8897003  1.797469
##   0.101   2.087353  0.8898525  1.799477
##   0.151   2.098398  0.8896972  1.811938
##   0.201   2.116231  0.8892630  1.827760
##   0.251   2.137840  0.8888947  1.845880
##   0.301   2.163765  0.8884723  1.865730
##   0.351   2.195501  0.8876170  1.888634
##   0.401   2.224451  0.8869069  1.911221
##   0.451   2.250752  0.8862820  1.930449
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.051.
Here’s a table of the fitted coefficients for the quintic model.
| Variable | Standard | Lasso |
|---|---|---|
| (Intercept) | 9.95 | 9.95 |
| poly(x, 5)1 | 58.04 | 57.86 |
| poly(x, 5)2 | 4.45 | 4.27 |
| poly(x, 5)3 | -2.49 | -2.3 |
| poly(x, 5)4 | 5.12 | 4.94 |
| poly(x, 5)5 | 7.65 | 7.47 |
Here’s the corresponding table of fitted coefficients for a second quintic fit.
| Variable | Standard | Lasso |
|---|---|---|
| (Intercept) | 9.46 | 9.46 |
| poly(x, 5)1 | 21.52 | 21.45 |
| poly(x, 5)2 | -2.82 | -2.76 |
| poly(x, 5)3 | -3.32 | -3.25 |
| poly(x, 5)4 | 4.3 | 4.24 |
| poly(x, 5)5 | 1.27 | 1.2 |
The \(k\)-nearest neighbours algorithm is still a lazy algorithm: it assigns a value to a new data point by averaging the responses of its \(k\) nearest neighbours in the training data.
The regression variant involves the same choices as the classification variant: the number of neighbours \(k\), the distance metric, and how the variables are scaled. A sketch of the caret call follows.
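The grid of \(k\) values matches the output below; the rest of the setup reuses the assumed split from earlier.

```r
# k-NN regression, tuning k over 1..10 by 10-fold cross-validation
knn_fit <- train(
  alcohol ~ . - quality, data = wine_train,
  method    = "knn",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = data.frame(k = 1:10)
)
knn_fit
```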
## k-Nearest Neighbors 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1152, 1151, 1151, 1150, 1151, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.6640768  0.6532562  0.4004811
##    2  0.6288326  0.6702197  0.4368942
##    3  0.6219524  0.6718558  0.4475138
##    4  0.6244981  0.6683918  0.4590103
##    5  0.6241265  0.6695391  0.4632466
##    6  0.6228080  0.6717532  0.4640830
##    7  0.6264861  0.6683710  0.4700554
##    8  0.6252424  0.6722159  0.4720161
##    9  0.6326032  0.6659768  0.4807979
##   10  0.6382291  0.6610378  0.4889308
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
## k-Nearest Neighbors 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 89, 91, 88, 90, 90, 90, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.2440621  0.6845630  0.1522891
##    2  0.2249974  0.7055742  0.1437213
##    3  0.2181852  0.6967848  0.1454025
##    4  0.2189395  0.7194136  0.1521513
##    5  0.2121521  0.7379648  0.1471030
##    6  0.2176583  0.7114566  0.1536681
##    7  0.2204738  0.7081920  0.1574887
##    8  0.2276576  0.6983430  0.1639295
##    9  0.2327024  0.6819850  0.1670122
##   10  0.2350292  0.6824917  0.1706130
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
An interesting variant is the weighted \(k\)-nearest neighbours algorithm, where closer neighbours contribute more to the prediction (use `method = "kknn"` in caret).
See notes
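A minimal sketch of how the weighted variant might be requested; the tuning grid is left to caret's defaults.

```r
# Weighted k-NN: closer neighbours get more weight in the average
kknn_fit <- train(
  alcohol ~ . - quality, data = wine_train,
  method     = "kknn",
  trControl  = trainControl(method = "cv", number = 10),
  tuneLength = 5
)
```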
Just like in the classification case, we have a tree structure where each internal node is a decision based on a variable, but at the leaves we have a continuous value (typically the mean of the training responses falling in that leaf).
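A sketch of the caret call for a regression tree; the grid of complexity-parameter values is read off the output below.

```r
# CART regression tree, tuning the complexity parameter cp
tree_fit <- train(
  alcohol ~ . - quality, data = wine_train,
  method    = "rpart",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = data.frame(cp = seq(0.01, 0.10, by = 0.01))
)
tree_fit
```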
## CART 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1151, 1151, 1152, 1151, 1152, ... 
## Resampling results across tuning parameters:
## 
##   cp    RMSE       Rsquared   MAE      
##   0.01  0.7685858  0.5051059  0.5945301
##   0.02  0.8019660  0.4590454  0.6310750
##   0.03  0.8234011  0.4278642  0.6526073
##   0.04  0.8570786  0.3775874  0.6725195
##   0.05  0.8738034  0.3519282  0.6979339
##   0.06  0.9054071  0.3029838  0.7261190
##   0.07  0.9099455  0.2961153  0.7277403
##   0.08  0.9099455  0.2961153  0.7277403
##   0.09  0.9099455  0.2961153  0.7277403
##   0.10  0.9099455  0.2961153  0.7277403
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01.
## CART 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 90, 90, 90, 91, 90, 90, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE       Rsquared   MAE      
##   0.001  0.3312531  0.4661436  0.2275115
##   0.011  0.3312531  0.4661436  0.2275115
##   0.021  0.3350106  0.4602345  0.2339854
##   0.031  0.3333469  0.4477083  0.2429868
##   0.041  0.3351225  0.4413603  0.2494965
##   0.051  0.3259153  0.4395412  0.2477909
##   0.061  0.3384965  0.4366370  0.2568656
##   0.071  0.3363805  0.4728246  0.2554663
##   0.081  0.3396735  0.4647555  0.2601945
##   0.091  0.3401932  0.4732604  0.2592124
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.051.
In the final decision tree example, it would have been nice to allow something more flexible than a constant prediction within each region of the data.
Fitting straight lines to data is easy but, as we saw with regularised regression, the linear assumption is often too restrictive.
See notes
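Multivariate adaptive regression splines (MARS) address both points by building the fit from hinge functions \(h(x - c) = \max(0, x - c)\) and their products; these appear as the h(...) terms in the earth summary further down. A sketch of the caret call, with the degree/nprune grid read off the output below and the rest of the setup assumed as before:

```r
library(earth)   # caret's "earth" method wraps the earth package

# MARS: tune the interaction degree and the number of retained terms (nprune)
mars_fit <- train(
  alcohol ~ . - quality, data = wine_train,
  method    = "earth",
  trControl = trainControl(method = "cv", number = 10),
  tuneGrid  = expand.grid(degree = 1:2, nprune = 5:15)
)
mars_fit
summary(mars_fit$finalModel)   # prints the selected hinge terms
```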
## Multivariate Adaptive Regression Spline 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1150, 1150, 1152, 1151, 1150, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        5      0.6882569  0.5988978  0.5417483
##   1        6      0.6165495  0.6790580  0.4781320
##   1        7      0.5714513  0.7231695  0.4329544
##   1        8      0.5611583  0.7338352  0.4276597
##   1        9      0.5557677  0.7381726  0.4220416
##   1       10      0.5561513  0.7369281  0.4218393
##   1       11      0.5573855  0.7357573  0.4236253
##   1       12      0.5560040  0.7362815  0.4228172
##   1       13      0.5525987  0.7394891  0.4205849
##   1       14      0.5500273  0.7417924  0.4183758
##   1       15      0.5473462  0.7439594  0.4174334
##   2        5      0.7041022  0.5794912  0.5504140
##   2        6      0.6343698  0.6590481  0.4913669
##   2        7      0.5787956  0.7161153  0.4420777
##   2        8      0.5592512  0.7355229  0.4297027
##   2        9      0.5502236  0.7432360  0.4224084
##   2       10      0.5457894  0.7465072  0.4191210
##   2       11      0.5394854  0.7522509  0.4152162
##   2       12      0.5362836  0.7546599  0.4149909
##   2       13      0.5343549  0.7562030  0.4136871
##   2       14      0.5313146  0.7589206  0.4118940
##   2       15      0.5313511  0.7591761  0.4125838
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
## Call: earth(x=matrix[1279,10], y=c(9.5,11.8,9.6,...), keepxy=TRUE, degree=2,
##             nprune=14)
## 
##                                                         coefficients
## (Intercept)                                                9.7356673
## h(-0.643065-fixed.acidity)                                -0.3428044
## h(fixed.acidity- -0.643065)                                1.1524476
## h(citric.acid-1.22702)                                     0.3906266
## h(0.327105-residual.sugar)                                -0.5031012
## h(0.0282519-density)                                       1.3645855
## h(density-0.0282519)                                      -1.0505630
## h(pH- -1.82084)                                            0.4315024
## h(0.187905-sulphates)                                     -0.7474732
## h(fixed.acidity- -0.643065) * h(2.45487-residual.sugar)   -0.1750150
## h(1.19486-fixed.acidity) * h(0.327105-residual.sugar)     -0.3948844
## h(fixed.acidity-1.19486) * h(0.327105-residual.sugar)     -0.5944470
## h(1.17136-total.sulfur.dioxide) * h(0.187905-sulphates)    0.2032019
## h(density-1.03496) * h(pH- -1.82084)                       0.3379802
## 
## Selected 14 of 21 terms, and 7 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: density, fixed.acidity, residual.sugar, pH, sulphates, ...
## Number of terms at each degree of interaction: 1 8 5
## GCV 0.2775733    RSS 336.6625    GRSq 0.7625749    RSq 0.774497
## Multivariate Adaptive Regression Spline 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 89, 91, 88, 90, 90, 90, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1       2       0.3556969  0.3532962  0.2577542
##   1       3       0.3233225  0.4408176  0.2416776
##   1       4       0.3358005  0.4073014  0.2483367
##   1       5       0.3373541  0.4059216  0.2483129
##   1       6       0.3146914  0.4518349  0.2356478
##   1       7       0.3229163  0.4290682  0.2460686
##   2       2       0.3556969  0.3532962  0.2577542
##   2       3       0.3186444  0.4351783  0.2363528
##   2       4       0.2845204  0.5428754  0.2139715
##   2       5       0.2534200  0.6307710  0.1933713
##   2       6       0.2588088  0.6263873  0.1915131
##   2       7       0.2591623  0.6249842  0.1959579
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 2.
## Call: earth(x=matrix[100,2], y=c(0.2447,0.6471...), keepxy=TRUE, degree=2,
##             nprune=5)
## 
##                              coefficients
## (Intercept)                    -0.0877179
## h(0.4422-x1)                    1.3398597
## h(x1-0.4422)                    2.2128781
## h(x1-0.4422) * h(x2-0.48129)   -4.6053999
## 
## Selected 4 of 15 terms, and 2 of 2 predictors (nprune=5)
## Termination condition: Reached nk 21
## Importance: x1, x2
## Number of terms at each degree of interaction: 1 2 1
## GCV 0.0943028    RSS 7.895266    GRSq 0.4496379    RSq 0.5298675
The true function is