MLNN regression

JP Gosling

2024-11-12

Running examples


Source: Created using the Image Creator in Bing

Wine data (1)


We have 1,599 observations of Portuguese red wines. The dataset was originally used to predict the quality of the wine based on 11 physicochemical tests.


The quality is a score between 0 and 10.


We want a continuous output, so we will use alcohol as the output variable and ignore quality.

Wine data (2)

# Load the data
wine <- read.csv("winequality-red.csv")

# Standardise the explanatory variables (columns 11 and 12 are alcohol and quality)
wine[, -c(11,12)] <- scale(wine[, -c(11,12)])

Wine data (3)

summary(lm(alcohol ~ . - quality,
           data = wine))
## 
## Call:
## lm(formula = alcohol ~ . - quality, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07175 -0.39267 -0.04056  0.35396  2.44365 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          10.42298    0.01536 678.791  < 2e-16 ***
## fixed.acidity         0.92702    0.03594  25.796  < 2e-16 ***
## volatile.acidity      0.06461    0.02048   3.154 0.001638 ** 
## citric.acid           0.16181    0.02686   6.024 2.11e-09 ***
## residual.sugar        0.40100    0.01733  23.135  < 2e-16 ***
## chlorides            -0.06881    0.01862  -3.696 0.000227 ***
## free.sulfur.dioxide  -0.02242    0.02151  -1.042 0.297517    
## total.sulfur.dioxide -0.07552    0.02264  -3.336 0.000868 ***
## density              -1.16521    0.02533 -45.998  < 2e-16 ***
## pH                    0.58085    0.02394  24.263  < 2e-16 ***
## sulphates             0.21134    0.01758  12.020  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.614 on 1588 degrees of freedom
## Multiple R-squared:  0.6701, Adjusted R-squared:  0.668 
## F-statistic: 322.5 on 10 and 1588 DF,  p-value: < 2.2e-16

Wine data (4)

There are issues of dependence (multicollinearity) between the explanatory variables.
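
The matrix below can be reproduced with a call along these lines (rounding to two decimal places is assumed):

# Correlation matrix of the ten explanatory variables (alcohol and quality excluded)
round(cor(wine[, -c(11, 12)]), 2)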


##                      fixed.acidity volatile.acidity citric.acid residual.sugar
## fixed.acidity                 1.00            -0.26        0.67           0.11
## volatile.acidity             -0.26             1.00       -0.55           0.00
## citric.acid                   0.67            -0.55        1.00           0.14
## residual.sugar                0.11             0.00        0.14           1.00
## chlorides                     0.09             0.06        0.20           0.06
## free.sulfur.dioxide          -0.15            -0.01       -0.06           0.19
## total.sulfur.dioxide         -0.11             0.08        0.04           0.20
## density                       0.67             0.02        0.36           0.36
## pH                           -0.68             0.23       -0.54          -0.09
## sulphates                     0.18            -0.26        0.31           0.01
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide density
## fixed.acidity             0.09               -0.15                -0.11    0.67
## volatile.acidity          0.06               -0.01                 0.08    0.02
## citric.acid               0.20               -0.06                 0.04    0.36
## residual.sugar            0.06                0.19                 0.20    0.36
## chlorides                 1.00                0.01                 0.05    0.20
## free.sulfur.dioxide       0.01                1.00                 0.67   -0.02
## total.sulfur.dioxide      0.05                0.67                 1.00    0.07
## density                   0.20               -0.02                 0.07    1.00
## pH                       -0.27                0.07                -0.07   -0.34
## sulphates                 0.37                0.05                 0.04    0.15
##                         pH sulphates
## fixed.acidity        -0.68      0.18
## volatile.acidity      0.23     -0.26
## citric.acid          -0.54      0.31
## residual.sugar       -0.09      0.01
## chlorides            -0.27      0.37
## free.sulfur.dioxide   0.07      0.05
## total.sulfur.dioxide -0.07      0.04
## density              -0.34      0.15
## pH                    1.00     -0.20
## sulphates            -0.20      1.00

Wine data (5)

Wine data (6)


What did you expect? (1)

What did you expect? (2)

## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27171 -0.18006  0.02867  0.25684  0.74620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.1490     0.1167   1.277   0.2047   
## x1            0.3952     0.1410   2.804   0.0061 **
## x2           -0.1196     0.1523  -0.786   0.4340   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3982 on 97 degrees of freedom
## Multiple R-squared:  0.0843, Adjusted R-squared:  0.06542 
## F-statistic: 4.465 on 2 and 97 DF,  p-value: 0.01396

Going beyond the data


Source: xkcd.com/605

End of section

Regularised regression


Source: Created using the Image Creator in Bing

The basic idea


As we heard earlier in the course, regularisation is a way to prevent overfitting by adding a penalty term to the loss function.


See notes
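
In symbols, the two penalised fits used below minimise the residual sum of squares plus a penalty on the coefficients \(\beta\):

\[
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
\qquad
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.
\]

The tuning parameter \(\lambda \ge 0\) controls the strength of the penalty. In the glmnet calls that follow, alpha = 1 selects the lasso penalty and alpha = 0 selects the ridge penalty.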

Wine data (1)

We can use the caret package to fit a lasso model. I will take out a test set to evaluate the model.


# Retain 20% of the data for testing
set.seed(123)
train_indices <- sample(1:nrow(wine), 0.8*nrow(wine))
wine_train <- wine[train_indices, ]
wine_test <- wine[-train_indices, ]

Wine data (1)

Having set aside the test set, we can now fit the lasso model, with cross-validation used to choose the penalty \(\lambda\).


library(caret)
# Lasso: alpha = 1 gives the lasso penalty in glmnet; lambda is tuned by cross-validation
wine_lasso <- train(alcohol ~ . - quality,
                    data = wine_train,
                    method = "glmnet",
                    trControl = trainControl(method = "cv"),
                    tuneGrid = expand.grid(alpha = 1,
                                           lambda = seq(0.001, 0.5, 0.05)))

Wine data (2)

## glmnet 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1152, 1151, 1152, 1150, 1151, 1150, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.001   0.6171850  0.6738015  0.4709744
##   0.051   0.6587054  0.6580512  0.5174123
##   0.101   0.7668547  0.5735796  0.6181435
##   0.151   0.8899249  0.3813854  0.7262289
##   0.201   0.9424549  0.2922005  0.7760478
##   0.251   0.9697889  0.2512246  0.8034673
##   0.301   0.9840139  0.2507113  0.8170962
##   0.351   1.0003742  0.2507113  0.8312198
##   0.401   1.0189397  0.2507113  0.8467446
##   0.451   1.0395921  0.2507113  0.8636891
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.001.

Wine data (3)

We can use the caret package to fit a ridge regression model.


library(caret)
# Ridge: alpha = 0 gives the ridge penalty in glmnet
wine_ridge <- train(alcohol ~ . - quality,
                    data = wine_train,
                    method = "glmnet",
                    trControl = trainControl(method = "cv"),
                    tuneGrid = expand.grid(alpha = 0,
                                           lambda = seq(0.001, 0.5, 0.05)))

Wine data (4)

## glmnet 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1152, 1150, 1151, 1149, 1151, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE       Rsquared   MAE      
##   0.001   0.6338490  0.6666922  0.4933181
##   0.051   0.6338490  0.6666922  0.4933181
##   0.101   0.6588921  0.6525803  0.5168654
##   0.151   0.6835236  0.6374974  0.5392489
##   0.201   0.7051758  0.6235013  0.5584318
##   0.251   0.7240786  0.6108218  0.5750330
##   0.301   0.7407173  0.5993372  0.5895403
##   0.351   0.7555061  0.5889190  0.6025471
##   0.401   0.7686677  0.5795473  0.6143427
##   0.451   0.7806062  0.5709360  0.6250983
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.051.
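
Since a test set was held out earlier, the two fits can be compared on unseen data; a minimal sketch using the objects defined above:

# Test-set root-mean-square error for the lasso and ridge fits
lasso_pred <- predict(wine_lasso, newdata = wine_test)
ridge_pred <- predict(wine_ridge, newdata = wine_test)

RMSE(lasso_pred, wine_test$alcohol)
RMSE(ridge_pred, wine_test$alcohol)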

Wine data (5)


Perhaps we have to accept that the relationship between the variables is too complex for a linear model.

Old example (1)

Old example (2)


# Fit a quintic model
quintic_model <- lm(y ~ poly(x, 5))

summary(quintic_model)
## 
## Call:
## lm(formula = y ~ poly(x, 5))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.198 -1.587 -0.007  1.274  5.481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.9501     0.2084  47.739  < 2e-16 ***
## poly(x, 5)1  58.0370     2.0947  27.707  < 2e-16 ***
## poly(x, 5)2   4.4511     2.0947   2.125 0.036186 *  
## poly(x, 5)3  -2.4853     2.0947  -1.187 0.238384    
## poly(x, 5)4   5.1229     2.0947   2.446 0.016300 *  
## poly(x, 5)5   7.6516     2.0947   3.653 0.000425 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.095 on 95 degrees of freedom
## Multiple R-squared:  0.893,  Adjusted R-squared:  0.8874 
## F-statistic: 158.6 on 5 and 95 DF,  p-value: < 2.2e-16

Old example (3)


# Quintic fit using caret
library(caret)

# Create a data frame
data <- data.frame(x = x, y = y)

# Fit the model
quintic_lasso <- train(y ~ poly(x, 5),
                       data = data,
                       method = "glmnet",
                       trControl = trainControl(method = "cv"),
                       tuneGrid = expand.grid(alpha = 1,
                                              lambda = seq(0.001, 0.5, 0.05)))

Old example (4)


## glmnet 
## 
## 101 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 90, 91, 93, 90, 91, 93, ... 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE      Rsquared   MAE     
##   0.001   2.084721  0.8894493  1.796939
##   0.051   2.083582  0.8897003  1.797469
##   0.101   2.087353  0.8898525  1.799477
##   0.151   2.098398  0.8896972  1.811938
##   0.201   2.116231  0.8892630  1.827760
##   0.251   2.137840  0.8888947  1.845880
##   0.301   2.163765  0.8884723  1.865730
##   0.351   2.195501  0.8876170  1.888634
##   0.401   2.224451  0.8869069  1.911221
##   0.451   2.250752  0.8862820  1.930449
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.051.

Old example (5)


Here’s a table of the fitted coefficients for the quintic model.


Variable       Standard    Lasso
(Intercept)        9.95     9.95
poly(x, 5)1       58.04    57.86
poly(x, 5)2        4.45     4.27
poly(x, 5)3       -2.49    -2.30
poly(x, 5)4        5.12     4.94
poly(x, 5)5        7.65     7.47
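
A comparison like this can be read straight from the fitted objects; a hedged sketch:

# Least-squares coefficients
coef(quintic_model)

# Lasso coefficients at the cross-validated choice of lambda
coef(quintic_lasso$finalModel, s = quintic_lasso$bestTune$lambda)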

Old example (6)

Old example (7)

Old example (8)


Here’s a table of the fitted coefficients for the quintic model.


Variable       Standard    Lasso
(Intercept)        9.46     9.46
poly(x, 5)1       21.52    21.45
poly(x, 5)2       -2.82    -2.76
poly(x, 5)3       -3.32    -3.25
poly(x, 5)4        4.30     4.24
poly(x, 5)5        1.27     1.20

End of section

\(k\)-nearest neighbours


Source: Created using the Image Creator in Bing

The algorithm

The \(k\)-nearest neighbours algorithm is still a lazy algorithm: no model is fitted in advance, and values for new data points are computed directly from the training data.


  1. Compute the distance between the new input and each training input.
  2. Sort the distances in ascending order.
  3. Select the \(k\) nearest neighbours of the new input.
  4. Assign the mean of the \(k\) nearest neighbours to the new input.
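
A minimal by-hand sketch of these four steps, assuming a numeric training matrix X, a response vector y, a single new input x_new and Euclidean distance:

# Predict a response for one new input by averaging its k nearest neighbours
knn_predict <- function(X, y, x_new, k = 3) {
  # 1. Distance from the new input to every training input
  dists <- sqrt(rowSums(sweep(X, 2, x_new)^2))
  # 2. and 3. Order the distances and keep the k nearest neighbours
  nearest <- order(dists)[seq_len(k)]
  # 4. Assign the mean response of those neighbours
  mean(y[nearest])
}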

Choices

The \(k\)-nearest neighbours algorithm has the same choices as the classification variant:


  • The distance metric to use.


  • The number of neighbours to consider.


  • (Plus a technicality about distance ties.)

Wine data (1)

library(caret)

# Fit a k-nearest neighbours model
wine_knn <- train(alcohol ~ . - quality,
                  data = wine_train,
                  method = "knn",
                  trControl = trainControl(method = "cv"),
                  tuneGrid = expand.grid(k = 1:10))

Wine data (2)

## k-Nearest Neighbors 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1152, 1151, 1151, 1150, 1151, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.6640768  0.6532562  0.4004811
##    2  0.6288326  0.6702197  0.4368942
##    3  0.6219524  0.6718558  0.4475138
##    4  0.6244981  0.6683918  0.4590103
##    5  0.6241265  0.6695391  0.4632466
##    6  0.6228080  0.6717532  0.4640830
##    7  0.6264861  0.6683710  0.4700554
##    8  0.6252424  0.6722159  0.4720161
##    9  0.6326032  0.6659768  0.4807979
##   10  0.6382291  0.6610378  0.4889308
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.

Wine data (3)

What did you expect? (1)

library(caret)

# Fit a k-nearest neighbours model
wdye_knn <- train(y ~ .,
                  data = wdye,
                  method = "knn",
                  trControl = trainControl(method = "cv"),
                  tuneGrid = expand.grid(k = 1:10))

What did you expect? (2)

## k-Nearest Neighbors 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 89, 91, 88, 90, 90, 90, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE       Rsquared   MAE      
##    1  0.2440621  0.6845630  0.1522891
##    2  0.2249974  0.7055742  0.1437213
##    3  0.2181852  0.6967848  0.1454025
##    4  0.2189395  0.7194136  0.1521513
##    5  0.2121521  0.7379648  0.1471030
##    6  0.2176583  0.7114566  0.1536681
##    7  0.2204738  0.7081920  0.1574887
##    8  0.2276576  0.6983430  0.1639295
##    9  0.2327024  0.6819850  0.1670122
##   10  0.2350292  0.6824917  0.1706130
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.

What did you expect? (3)

What did you expect? (4)

Variant


An interesting variant is the weighted \(k\)-nearest neighbours algorithm (method = "kknn" in caret).


See notes
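
A hedged sketch of how the weighted version could be fitted through caret (the tuning grid is illustrative rather than taken from the notes):

library(caret)

# Weighted k-nearest neighbours via the kknn package
wine_wknn <- train(alcohol ~ . - quality,
                   data = wine_train,
                   method = "kknn",
                   trControl = trainControl(method = "cv"),
                   tuneGrid = expand.grid(kmax = 3:10,
                                          distance = 2,
                                          kernel = "triangular"))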

What did you expect?

End of section

Decision trees


Source: xkcd.com/518

The algorithm


Just like in the classification case, we have a tree structure where each internal node is a decision based on a single variable, but, at the leaves, we have a continuous value (usually the mean of the training responses in that leaf).


  1. Find the best orthogonal split of the data based on some criterion.
  2. Recursively apply this to the resulting subsets.
  3. Stop when some stopping criterion is met.


  • See notes
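
For regression trees, the usual splitting criterion (and the default for rpart's anova method used below) is the within-child sum of squared errors: a split sending observations to left and right children \(L\) and \(R\) is scored by

\[
\text{SSE}(\text{split}) = \sum_{i \in L} \left( y_i - \bar{y}_L \right)^2 + \sum_{i \in R} \left( y_i - \bar{y}_R \right)^2,
\]

and the split with the smallest score is chosen at each step.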

Wine data (1)

library(caret)

# Fit a regression tree: cp is tuned by cross-validation, and maxdepth is passed through to rpart
wine_tree <- train(alcohol ~ . - quality,
                   data = wine_train,
                   method = "rpart",
                   maxdepth = 3,
                   trControl = trainControl(method = "cv"),
                   tuneGrid = expand.grid(cp = seq(0.01, 0.1, 0.01)))

Wine data (2)

## CART 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1151, 1151, 1152, 1151, 1152, ... 
## Resampling results across tuning parameters:
## 
##   cp    RMSE       Rsquared   MAE      
##   0.01  0.7685858  0.5051059  0.5945301
##   0.02  0.8019660  0.4590454  0.6310750
##   0.03  0.8234011  0.4278642  0.6526073
##   0.04  0.8570786  0.3775874  0.6725195
##   0.05  0.8738034  0.3519282  0.6979339
##   0.06  0.9054071  0.3029838  0.7261190
##   0.07  0.9099455  0.2961153  0.7277403
##   0.08  0.9099455  0.2961153  0.7277403
##   0.09  0.9099455  0.2961153  0.7277403
##   0.10  0.9099455  0.2961153  0.7277403
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.01.
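
To inspect the selected tree itself, one option is the rpart.plot package; a minimal sketch:

library(rpart.plot)

# Draw the final tree chosen by caret (cp = 0.01)
rpart.plot(wine_tree$finalModel)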

Wine data (3)

Wine data (4)

What did you expect? (1)

What did you expect? (2)

## CART 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 90, 90, 90, 91, 90, 90, ... 
## Resampling results across tuning parameters:
## 
##   cp     RMSE       Rsquared   MAE      
##   0.001  0.3312531  0.4661436  0.2275115
##   0.011  0.3312531  0.4661436  0.2275115
##   0.021  0.3350106  0.4602345  0.2339854
##   0.031  0.3333469  0.4477083  0.2429868
##   0.041  0.3351225  0.4413603  0.2494965
##   0.051  0.3259153  0.4395412  0.2477909
##   0.061  0.3384965  0.4366370  0.2568656
##   0.071  0.3363805  0.4728246  0.2554663
##   0.081  0.3396735  0.4647555  0.2601945
##   0.091  0.3401932  0.4732604  0.2592124
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.051.

What did you expect? (3)

What did you expect? (4)

What did you expect? (5)

End of section

Multivariate adaptive regression splines


Source: Created using the Image Creator in Bing

Motivation


In the final decision tree example, it would have been nice to fit something more flexible than a constant within each division of the data.


Fitting straight lines to data is easy, but, as we saw with regularised regression, the linear assumption is often too restrictive.


See notes

Hinge functions in 1d (1)

Hinge functions in 1d (2)
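
In symbols, a pair of hinge functions at a knot \(c\) is

\[
h(x - c) = \max(0, x - c), \qquad h(c - x) = \max(0, c - x),
\]

and a MARS model is a weighted sum of such terms and of products of them. This is the \(h(\cdot)\) notation that appears in the earth output below.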

Wine data (1)

library(earth)
library(caret)

# Fit a MARS model: degree caps the order of interactions, nprune the number of retained terms
wine_mars <- train(alcohol ~ . - quality,
                   data = wine_train,
                   method = "earth",
                   trControl = trainControl(method = "cv"),
                   tuneGrid = expand.grid(degree = 1:2,
                                          nprune = 5:15))

Wine data (2)

## Multivariate Adaptive Regression Spline 
## 
## 1279 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 1151, 1150, 1150, 1152, 1151, 1150, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1        5      0.6882569  0.5988978  0.5417483
##   1        6      0.6165495  0.6790580  0.4781320
##   1        7      0.5714513  0.7231695  0.4329544
##   1        8      0.5611583  0.7338352  0.4276597
##   1        9      0.5557677  0.7381726  0.4220416
##   1       10      0.5561513  0.7369281  0.4218393
##   1       11      0.5573855  0.7357573  0.4236253
##   1       12      0.5560040  0.7362815  0.4228172
##   1       13      0.5525987  0.7394891  0.4205849
##   1       14      0.5500273  0.7417924  0.4183758
##   1       15      0.5473462  0.7439594  0.4174334
##   2        5      0.7041022  0.5794912  0.5504140
##   2        6      0.6343698  0.6590481  0.4913669
##   2        7      0.5787956  0.7161153  0.4420777
##   2        8      0.5592512  0.7355229  0.4297027
##   2        9      0.5502236  0.7432360  0.4224084
##   2       10      0.5457894  0.7465072  0.4191210
##   2       11      0.5394854  0.7522509  0.4152162
##   2       12      0.5362836  0.7546599  0.4149909
##   2       13      0.5343549  0.7562030  0.4136871
##   2       14      0.5313146  0.7589206  0.4118940
##   2       15      0.5313511  0.7591761  0.4125838
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.

Wine data (3)

Wine data (4)

# Look at the model
summary(wine_mars$finalModel)
## Call: earth(x=matrix[1279,10], y=c(9.5,11.8,9.6,...), keepxy=TRUE, degree=2,
##             nprune=14)
## 
##                                                         coefficients
## (Intercept)                                                9.7356673
## h(-0.643065-fixed.acidity)                                -0.3428044
## h(fixed.acidity- -0.643065)                                1.1524476
## h(citric.acid-1.22702)                                     0.3906266
## h(0.327105-residual.sugar)                                -0.5031012
## h(0.0282519-density)                                       1.3645855
## h(density-0.0282519)                                      -1.0505630
## h(pH- -1.82084)                                            0.4315024
## h(0.187905-sulphates)                                     -0.7474732
## h(fixed.acidity- -0.643065) * h(2.45487-residual.sugar)   -0.1750150
## h(1.19486-fixed.acidity) * h(0.327105-residual.sugar)     -0.3948844
## h(fixed.acidity-1.19486) * h(0.327105-residual.sugar)     -0.5944470
## h(1.17136-total.sulfur.dioxide) * h(0.187905-sulphates)    0.2032019
## h(density-1.03496) * h(pH- -1.82084)                       0.3379802
## 
## Selected 14 of 21 terms, and 7 of 10 predictors (nprune=14)
## Termination condition: Reached nk 21
## Importance: density, fixed.acidity, residual.sugar, pH, sulphates, ...
## Number of terms at each degree of interaction: 1 8 5
## GCV 0.2775733    RSS 336.6625    GRSq 0.7625749    RSq 0.774497

Wine data (5)

What did you expect? (1)

What did you expect? (2)


## Multivariate Adaptive Regression Spline 
## 
## 100 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 89, 91, 88, 90, 90, 90, ... 
## Resampling results across tuning parameters:
## 
##   degree  nprune  RMSE       Rsquared   MAE      
##   1       2       0.3556969  0.3532962  0.2577542
##   1       3       0.3233225  0.4408176  0.2416776
##   1       4       0.3358005  0.4073014  0.2483367
##   1       5       0.3373541  0.4059216  0.2483129
##   1       6       0.3146914  0.4518349  0.2356478
##   1       7       0.3229163  0.4290682  0.2460686
##   2       2       0.3556969  0.3532962  0.2577542
##   2       3       0.3186444  0.4351783  0.2363528
##   2       4       0.2845204  0.5428754  0.2139715
##   2       5       0.2534200  0.6307710  0.1933713
##   2       6       0.2588088  0.6263873  0.1915131
##   2       7       0.2591623  0.6249842  0.1959579
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 5 and degree = 2.

What did you expect? (3)

What did you expect? (4)

## Call: earth(x=matrix[100,2], y=c(0.2447,0.6471...), keepxy=TRUE, degree=2,
##             nprune=5)
## 
##                              coefficients
## (Intercept)                    -0.0877179
## h(0.4422-x1)                    1.3398597
## h(x1-0.4422)                    2.2128781
## h(x1-0.4422) * h(x2-0.48129)   -4.6053999
## 
## Selected 4 of 15 terms, and 2 of 2 predictors (nprune=5)
## Termination condition: Reached nk 21
## Importance: x1, x2
## Number of terms at each degree of interaction: 1 2 1
## GCV 0.0943028    RSS 7.895266    GRSq 0.4496379    RSq 0.5298675

The true data-generating process is

y <- ifelse(x1 > x2,
            x1^2 + rnorm(n, 0, 0.1),
            x2 - 2*x1 + rnorm(n, 0, 0.1))
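
For a runnable version, the sketch below assumes \(n = 100\) points with \(x_1\) and \(x_2\) uniform on \([0, 1]\); the sample size matches the caret output above, but the uniform design is an assumption.

# Hypothetical data-generation sketch: the uniform design is assumed, not given
set.seed(42)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- ifelse(x1 > x2,
            x1^2 + rnorm(n, 0, 0.1),
            x2 - 2*x1 + rnorm(n, 0, 0.1))
wdye <- data.frame(x1 = x1, x2 = x2, y = y)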

End of chapter