##
## Call:
## lm(formula = alcohol ~ pH + sulphates + pH:sulphates, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3232 -0.8016 -0.2270 0.6386 4.9341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.44185 0.02597 402.102 < 2e-16 ***
## pH 0.23441 0.02635 8.897 < 2e-16 ***
## sulphates 0.23082 0.03134 7.365 2.82e-13 ***
## pH:sulphates 0.09599 0.02009 4.777 1.95e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.026 on 1595 degrees of freedom
## Multiple R-squared: 0.07422, Adjusted R-squared: 0.07247
## F-statistic: 42.62 on 3 and 1595 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = alcohol ~ pH, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8085 -0.8378 -0.2253 0.6509 4.9470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.42298 0.02609 399.522 <2e-16 ***
## pH 0.21914 0.02610 8.397 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.043 on 1597 degrees of freedom
## Multiple R-squared: 0.04228, Adjusted R-squared: 0.04169
## F-statistic: 70.51 on 1 and 1597 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = alcohol ~ sulphates, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1006 -0.8593 -0.2535 0.6377 4.3700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.42298 0.02654 392.707 < 2e-16 ***
## sulphates 0.09974 0.02655 3.757 0.000178 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.061 on 1597 degrees of freedom
## Multiple R-squared: 0.00876, Adjusted R-squared: 0.008139
## F-statistic: 14.11 on 1 and 1597 DF, p-value: 0.0001783
We can see a clear effect of removing variables on the model fit.
| Model | Adjusted \(R^2\) | RMSE |
|---|---|---|
| Full | 0.072 | 1.025 |
| pH only | 0.042 | 1.043 |
| sulphates only | 0.008 | 1.061 |
Removing the pH variable from the model has a greater effect on the model fit than removing the sulphates variable.
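The comparison above could be reproduced along the following lines. This is a sketch, not the original code: it assumes the UCI red wine quality CSV is available, and standardises the two predictors (which the near-identical intercepts in the three summaries suggest was done).

```r
# Sketch: refit the three models and tabulate adjusted R^2 and RMSE.
wine <- read.csv("winequality-red.csv", sep = ";")
wine$pH        <- as.numeric(scale(wine$pH))
wine$sulphates <- as.numeric(scale(wine$sulphates))

fits <- list(
  full           = lm(alcohol ~ pH * sulphates, data = wine),
  pH_only        = lm(alcohol ~ pH,             data = wine),
  sulphates_only = lm(alcohol ~ sulphates,      data = wine)
)

sapply(fits, function(m) c(
  adj_r2 = summary(m)$adj.r.squared,
  rmse   = sqrt(mean(residuals(m)^2))
))
```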
Is it fair to compare models with a different number of variables? Note that dropping either variable here actually removes two parameters, because the interaction term goes with it.
Let’s permute the pH variable and see how it affects the model fit.
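Permuting a column breaks its relationship with the response while preserving its marginal distribution. A minimal sketch, assuming the `wine` data frame is already loaded:

```r
set.seed(1)                           # for a reproducible permutation
wine_perm <- wine
wine_perm$pH <- sample(wine_perm$pH)  # shuffle pH, leave everything else intact
summary(lm(alcohol ~ pH + sulphates + pH:sulphates, data = wine_perm))
```

The permuted model keeps the full set of terms, so the comparison is between models with the same number of parameters.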
##
## Call:
## lm(formula = alcohol ~ pH + sulphates + pH:sulphates, data = wine_perm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1010 -0.8641 -0.2572 0.6422 4.3734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.42279 0.02655 392.592 < 2e-16 ***
## pH -0.01956 0.02657 -0.736 0.461778
## sulphates 0.10035 0.02656 3.778 0.000164 ***
## pH:sulphates 0.02189 0.02691 0.813 0.416170
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.062 on 1595 degrees of freedom
## Multiple R-squared: 0.009532, Adjusted R-squared: 0.007669
## F-statistic: 5.117 on 3 and 1595 DF, p-value: 0.001591
##
## Call:
## lm(formula = alcohol ~ pH + sulphates + pH:sulphates, data = wine_perm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7969 -0.8466 -0.2272 0.6507 4.9025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.42308 0.02610 399.322 <2e-16 ***
## pH 0.21857 0.02613 8.363 <2e-16 ***
## sulphates 0.01521 0.02616 0.582 0.561
## pH:sulphates -0.01166 0.03031 -0.385 0.700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.044 on 1595 degrees of freedom
## Multiple R-squared: 0.04259, Adjusted R-squared: 0.04079
## F-statistic: 23.65 on 3 and 1595 DF, p-value: 5.604e-15
We can again see a clear effect of removing variables on the model fit, but permutation may be a more honest comparison: every model retains the same number of parameters.
| Model | Adjusted \(R^2\) | RMSE |
|---|---|---|
| Full | 0.072 | 1.025 |
| pH permuted | 0.008 | 1.06 |
| sulphates permuted | 0.041 | 1.042 |
One way to think about variable importance in the context of decision trees is to consider the number of times a variable is used to split the data.
However, early splits in the tree are more influential than later splits. We might therefore count the number of times a variable is used for the first split.
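For a single tree, the splits are easy to inspect directly. A sketch using rpart (the `wine` data frame and a `quality` response are assumptions here):

```r
library(rpart)

# Sketch: which variables does a single tree split on?
tree <- rpart(quality ~ ., data = wine)

splits <- tree$frame$var[tree$frame$var != "<leaf>"]
table(droplevels(splits))  # how often each variable is used to split
tree$frame$var[1]          # the variable used for the root (first) split
```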
In the randomForest package, the `importance` function can be used to calculate an importance metric for each variable in the model.
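The scores below could be produced along these lines. A sketch: the response is assumed to be `quality` here, since MeanDecreaseGini is a classification metric and `alcohol` appears among the predictors.

```r
library(randomForest)
set.seed(1)

# Sketch: Gini-based variable importance from a random forest classifier
rf <- randomForest(factor(quality) ~ ., data = wine)
importance(rf)  # one MeanDecreaseGini value per predictor
```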
## MeanDecreaseGini
## fixed.acidity 77.00334
## volatile.acidity 105.82246
## citric.acid 75.69601
## residual.sugar 71.87860
## chlorides 83.52143
## free.sulfur.dioxide 68.86417
## total.sulfur.dioxide 107.32531
## density 95.55846
## pH 78.32486
## sulphates 114.57800
## alcohol 148.13808
We had some success with the handwriting data using a k-nearest neighbours model.
## k-Nearest Neighbors
##
## 2000 samples
## 784 predictor
## 10 classes: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1600, 1601, 1599, 1600, 1600
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9065198 0.8960155
## 2 0.8910160 0.8787635
## 3 0.8995098 0.8882166
## 4 0.8890135 0.8765324
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
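A fit like the one summarised above could be set up in caret roughly as follows. This is a sketch: the `MNIST_train` object and its `label` column are assumed from the earlier handwriting example.

```r
library(caret)
set.seed(1)

# Sketch: 5-fold cross-validation over k = 1..4, as in the summary above
knn_model <- train(factor(label) ~ ., data = MNIST_train,
                   method    = "knn",
                   tuneGrid  = data.frame(k = 1:4),
                   trControl = trainControl(method = "cv", number = 5))
```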
We can use the DALEX package to interpret the model.
```r
library(DALEX)

# Create an explainer object
explainer <- explain(knn_model,
                     data = MNIST_train[, -1],
                     y = as.factor(MNIST_train$label))
```
## Preparation of a new explainer is initiated
## -> model label : train ( default )
## -> data : 2000 rows 784 cols
## -> data : tibble converted into a data.frame
## -> target variable : 2000 values
## -> predict function : yhat.train will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package caret , ver. 6.0.94 , task multiclass ( default )
## -> predicted values : predict function returns multiple columns: 10 ( default )
## -> residual function : difference between 1 and probability of true class ( default )
## -> residuals : numerical, min = 0 , mean = 0 , max = 0
## A new explainer has been created!
This could take a very long time to run…
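Presumably the slow step is permutation-based variable importance, which requires many repeated predictions; with 784 pixel predictors and a kNN model, that is expensive. A sketch of what the call might look like, using DALEX's `model_parts` function on the explainer created above:

```r
# Sketch: permutation variable importance with DALEX.
# Very slow for a kNN model with 784 predictors.
vi <- model_parts(explainer)
plot(vi)
```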
It would be nice to be able to visualise the effect of a variable on the model output.
There are many variants of this: we will concentrate on partial dependence plots (PDPs) and individual conditional expectation (ICE) plots. Both of these plot types can be produced using the iml package.
We will fit a random forest model to the wine data with the alcohol variable as the response.
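The effect of pH could then be visualised with iml along these lines. A sketch, assuming the `wine` data frame is loaded; the object names are illustrative:

```r
library(randomForest)
library(iml)
set.seed(1)

# Sketch: random forest with alcohol as the response
rf <- randomForest(alcohol ~ ., data = wine)

# Wrap the model so iml can query its predictions
pred <- Predictor$new(rf,
                      data = wine[, setdiff(names(wine), "alcohol")],
                      y = wine$alcohol)

# PDP (average effect) and ICE (one curve per observation) for pH
eff <- FeatureEffect$new(pred, feature = "pH", method = "pdp+ice")
plot(eff)
```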
What is the effect of the pH variable on the model output?
Let’s create some data with a strong interaction between two variables.
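One minimal way to do this; a sketch in which the variable names and coefficients are purely illustrative:

```r
set.seed(42)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)

# y depends on the product x1 * x2: the effect of x1 grows with x2,
# so neither main effect alone captures the relationship
y   <- 10 * x1 * x2 + rnorm(n, sd = 0.5)
sim <- data.frame(x1, x2, y)
```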