6.5 Feature Importance

A feature’s importance is the increase in the model’s prediction error after we permute the feature’s values, which breaks the relationship between the feature and the outcome.

6.5.1 The Theory

The concept is really straightforward: We measure a feature’s importance by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values leaves the model error unchanged, because the model ignored the feature for the prediction. The permutation feature importance measurement was introduced for Random Forests by Breiman (2001). Based on this idea, Fisher, Rudin, and Dominici (2018) proposed a model-agnostic version of the feature importance, which they called model reliance. They also introduced more advanced ideas about feature importance, for example a (model-specific) version that accounts for the fact that many prediction models may fit the data equally well. Their paper is worth a read.

The permutation feature importance algorithm based on Breiman (2001) and Fisher, Rudin, and Dominici (2018):

Input: Trained model \(\hat{f}\), feature matrix \(X\), target vector \(Y\), error measure \(L(Y,\hat{Y})\)

  1. Estimate the original model error \(e_{orig}(\hat{f})=L(Y,\hat{f}(X))\) (e.g. mean squared error)
  2. For each feature \(j \in \{1,\ldots,p\}\) do
    • Generate feature matrix \(X_{perm_j}\) by permuting feature \(X_j\) in \(X\). This breaks the association between \(X_j\) and \(Y\).
    • Estimate the error \(e_{perm}(\hat{f})=L(Y,\hat{f}(X_{perm_j}))\) based on the predictions on the permuted data.
    • Calculate the permutation feature importance \(FI_j=e_{perm}(\hat{f})/e_{orig}(\hat{f})\). Alternatively, the difference can be used: \(FI_j=e_{perm}(\hat{f})-e_{orig}(\hat{f})\).
  3. Sort features by descending \(FI\).
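
To make the algorithm concrete, here is a minimal sketch in Python (my illustration, not code from the cited papers). It assumes a scikit-learn-style model with a predict method and a loss function such as the mean squared error; all names are chosen for illustration. It already repeats the shuffle and averages the errors, anticipating the stability issue discussed under Disadvantages.

    import numpy as np

    def permutation_importance(model, X, y, loss, n_repeats=5, seed=None):
        # Permutation feature importance as an error ratio.
        rng = np.random.default_rng(seed)
        X = np.asarray(X)
        e_orig = loss(y, model.predict(X))                    # original model error
        importances = np.zeros(X.shape[1])
        for j in range(X.shape[1]):                           # one pass per feature
            errors = []
            for _ in range(n_repeats):                        # average over several shuffles
                X_perm = X.copy()
                X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the X_j <-> Y association
                errors.append(loss(y, model.predict(X_perm)))
            importances[j] = np.mean(errors) / e_orig         # ratio; subtract e_orig for the difference
        return importances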

In their paper, Fisher, Rudin, and Dominici (2018) propose to split the dataset in half and exchange the \(X_j\) values of the two halves instead of permuting \(X_j\). If you think about it, this is exactly the same as permuting the feature \(X_j\). If you want a more accurate estimate, you can estimate the error of permuting \(X_j\) by pairing each instance with the \(X_j\) value of each other instance (except with itself). This gives you a dataset of size \(n(n-1)\) to estimate the permutation error, at a large computational cost. I can only recommend the \(n(n-1)\)-method when you are serious about getting extremely accurate estimates; a sketch of this pairing scheme follows.
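
For completeness, here is a sketch of the exhaustive \(n(n-1)\) pairing for a single feature \(j\), reusing the numpy import from above (again an illustration under the same assumptions, not code from the paper):

    def pairwise_permutation_error(model, X, y, j, loss):
        # Pair each instance with the X_j value of every other instance,
        # giving n*(n-1) rows: more accurate, but quadratic in n.
        X, y = np.asarray(X), np.asarray(y)
        n = X.shape[0]
        i_idx, k_idx = np.where(~np.eye(n, dtype=bool))  # all pairs with i != k
        X_pairs = X[i_idx].copy()
        X_pairs[:, j] = X[k_idx, j]                      # row i receives row k's X_j value
        return loss(y[i_idx], model.predict(X_pairs))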

6.5.2 Example and Interpretation

We show examples for classification and regression.

Cervical cancer (Classification)

We fit a random forest model to predict cervical cancer. We measure the error increase as \(1-AUC\) (one minus the area under the ROC curve). Features associated with a model error increase by a factor of 1 (= no change) were not important for predicting cervical cancer.


FIGURE 6.26: The importance for each of the features in predicting cervical cancer with a random forest. The importance is the factor by which the error is increased compared to the original model error.

The feature with the highest importance was associated with an error increase by a factor of 7.8 after permutation.
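
To make the error measure concrete: with the sketch from the theory section, \(1-AUC\) can be plugged in as the loss. The following lines are an illustration on synthetic stand-in data, since the cervical cancer data is not reproduced here; the wrapper is needed because the AUC is computed from predicted probabilities.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    def one_minus_auc(y_true, y_score):
        return 1.0 - roc_auc_score(y_true, y_score)

    class ProbaWrapper:
        # Adapter so predict() returns the positive-class probability,
        # which is what the AUC-based loss expects.
        def __init__(self, clf):
            self.clf = clf
        def predict(self, X):
            return self.clf.predict_proba(X)[:, 1]

    X, y = make_classification(n_samples=500, n_features=8, random_state=0)  # stand-in data
    rf = RandomForestClassifier(random_state=0).fit(X, y)
    fi = permutation_importance(ProbaWrapper(rf), X, y, loss=one_minus_auc)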

Bike sharing (Regression)

We fit a support vector machine model to predict the number of rented bikes, given weather conditions and calendar information. As the error measure we use the mean absolute error.


FIGURE 6.27: The importance for each of the features in predicting bike counts with a support vector machine.
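
Analogously for the regression setting, the mean absolute error slots in directly as the loss; again synthetic stand-in data, since the bike sharing data is not reproduced here.

    from sklearn.datasets import make_regression
    from sklearn.metrics import mean_absolute_error
    from sklearn.svm import SVR

    X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)  # stand-in data
    svm = SVR().fit(X, y)
    fi = permutation_importance(svm, X, y, loss=mean_absolute_error)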

6.5.3 Advantages

  • Nice interpretation: Feature importance is the increase in model error when the feature’s information is destroyed.
  • Feature importance provides a highly compressed, global insight into the model’s behavior.
  • A positive aspect of using the error ratio instead of the error difference is that the feature importance measurements are comparable across different problems.
  • The importance measure automatically takes into account all interactions with other features. By permuting the feature you also destroy its interaction effects with other features. This means that the permutation feature importance takes into account both the main feature effect and the interaction effects on model performance. This is also a disadvantage, because the importance of the interaction between two features is included in the importance measures of both features. The feature importances therefore do not add up to the total drop in performance we would see if we shuffled all the features at once; their sum is greater than that. Only when there is no interaction between the features, as in a linear model, do the importances roughly add up.
  • Permutation feature importance doesn’t require retraining the model. Some other methods suggest deleting a feature, retraining the model and then comparing the model error. Since retraining a machine learning model can take a long time, ‘only’ permuting a feature can save a lot of time.
  • Importance methods that retrain the model with a subset of the features seem intuitive at first glance, but the model with the reduced data is meaningless for the feature importance. We are interested in the feature importance of a fixed model, and a model retrained on a reduced dataset is a different model from the one we are interested in. Let’s say you train a sparse linear model (with LASSO) with a fixed number of features with non-zero weights. The dataset has 100 features, and you set the number of non-zero weights to 5. You analyze the importance of one of the features that got a non-zero weight. You remove the feature and retrain the model. The model performance stays the same, because another equally good feature now gets a non-zero weight, and your conclusion would be that the feature was not important. Another example: the model is a decision tree and we analyze the importance of the feature that was chosen as the first split. We remove the feature and retrain the model. Since another feature will be chosen as the first split, the whole tree can be very different, meaning that we compare the error rates of (potentially) completely different trees to decide how important that feature is.

6.5.4 Disadvantages

  • The feature importance measure is tied to the error of the model. This is not inherently bad, but in some cases not what you need. In some cases you would rather know how much the model’s output varies for a feature, ignoring what that means for performance. For example: You want to find out how robust your model’s output is when someone manipulates the features. In this case, you wouldn’t be interested in how much the model performance drops when a feature is permuted, but in how much of the model’s output variance is explained by each feature (a sketch of such a label-free variant follows this list). Model variance (explained by the features) and feature importance correlate strongly when the model generalizes well (i.e. it doesn’t overfit).
  • You need access to the true outcome. If someone only gives you the model and unlabeled data - but not the true outcome - you can’t compute the permutation feature importance.
  • The permutation feature importance measure depends on shuffling the feature, which adds randomness to the importance measure. When the permutation is repeated, the results might differ. Repeating the permutation and averaging the importance measures over the repetitions stabilizes the measure, but increases computation time.
  • When features are correlated, the permutation feature importance measure can be biased by unrealistic data instances. The problem is the same as for partial dependence plots: The permutation of features generates unlikely data instances when two features are correlated. When they are positively correlated (like height and weight of a person) and I shuffle one of the features, I create new instances that are unlikely or even physically impossible (a 2 m person weighing 30 kg, for example), yet I use these new instances to measure the importance. In other words, for the permutation feature importance of a correlated feature, we consider how much the model performance drops when we exchange the feature with values we would never observe in reality. Check whether the features are strongly correlated and be careful with the interpretation of the feature importance if they are.
  • Another tricky thing: Adding a correlated feature can decrease the importance of the associated feature, by splitting up the importance between both features. Let me explain with an example what I mean by “splitting up” the feature importance: We want to predict the probability of rain and use the temperature at 8:00 AM of the day before as a feature together with other uncorrelated features. I fit a random forest and it turns out the temperature is the most important feature and all is good and I sleep well the next night. Now imagine another scenario in which I additionally include the temperature at 9:00 AM as a feature, which is highly correlated with the temperature at 8:00 AM. The temperature at 9:00 AM doesn’t give me much additional information if I already know the temperature at 8:00 AM. But having more features is always good, right? I fit a random forest with the two temperature features and the uncorrelated features. Some of the trees in the random forest pick up the 8:00 AM temperature, some the 9:00 AM temperature, some both and some none. The two temperature features together have a bit more importance than the single temperature feature before, but instead of being at the top of the list of the important features, each temperature is now somewhere in the middle. By introducing a correlated feature, I kicked the most important feature from the top of the importance ladder to mediocrity. On one hand, that’s okay, because it simply reflects the behavior of the underlying machine learning model, here the random forest. The 8:00 AM temperature has simply become less important, because the model can now rely on the 9:00 AM measurement as well. On the other hand, it makes the interpretation of the feature importances way more difficult. Imagine that you want to check the features for measurement errors. The check is expensive and you decide to only check the top 3 most important features. In the first case you would check the temperature, in the second case you would not include any temperature feature, simply because they now share the importance. Even though the importance measure might make sense at the level of model behavior, it’s damn confusing if you have correlated features.
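
As flagged in the first disadvantage above, here is a sketch of a label-free, prediction-based variant (my illustration, not a method from the cited papers): instead of the error, it measures how much the predictions themselves move when a feature is permuted, so no target values are needed.

    def prediction_sensitivity(model, X, n_repeats=5, seed=None):
        # How much do the predictions change when feature j is permuted?
        # Measures output sensitivity, not contribution to performance.
        rng = np.random.default_rng(seed)
        X = np.asarray(X)
        base = model.predict(X)
        sensitivity = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            deltas = []
            for _ in range(n_repeats):
                X_perm = X.copy()
                X_perm[:, j] = rng.permutation(X_perm[:, j])
                deltas.append(np.mean((model.predict(X_perm) - base) ** 2))
            sensitivity[j] = np.mean(deltas)
        return sensitivity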

  1. Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1). Springer: 5–32.

  2. Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. “Model Class Reliance: Variable Importance Measures for any Machine Learning Model Class, from the ‘Rashomon’ Perspective.” http://arxiv.org/abs/1801.01489.