5.5 RuleFit
The RuleFit algorithm (J. H. Friedman and Popescu 2008) fits sparse linear models which include automatically detected interaction effects in the form of binary decision rules.
The standard linear model doesn’t account for interactions between the features. Wouldn’t it be convenient to have a model that is as simple and interpretable as linear models, but that also integrates feature interactions? RuleFit fills this gap. RuleFit fits a sparse linear model with the original features and also a set of new features which are decision rules. These new features capture interactions between the original features. RuleFit generates these features automatically from decision trees: Each path through a tree can be turned into a decision rule by combining the split decisions into a rule. The node predictions are discarded and only the splits are used in the decision rules.
Where do the decision trees come from? These are trees that are trained to predict the outcome of interest, so that the splits are meaningful for the task at hand and not arbitrary. Any algorithm that creates a lot of trees can be used for RuleFit, like a Random Forest for example. Each tree is disassembled into decision rules, which are used as additional features in a linear Lasso model.
The RuleFit paper uses the Boston housing data for illustration: The goal is to predict the median house value of a Boston neighbourhood. One of the rules (read: features) generated by RuleFit: “if (number of rooms \(>6.64\)) and (concentration of nitric oxide \(<0.67\)) then \(1\) else \(0\)”
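To make this concrete, here is a minimal pandas sketch of how such a rule becomes a binary feature; the column names `rooms` and `nox` and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical slice of the Boston housing data; column names and values are invented.
df = pd.DataFrame({
    "rooms": [5.9, 7.2, 6.8],
    "nox":   [0.70, 0.45, 0.52],
})

# "if (rooms > 6.64) and (nox < 0.67) then 1 else 0" encoded as a new binary feature
df["rule_rooms_nox"] = ((df["rooms"] > 6.64) & (df["nox"] < 0.67)).astype(int)
print(df)
```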
RuleFit also comes with a feature importance measurement, which helps to identify linear terms and rules that are important for the prediction. The feature importance is calculated from the weights of the regression model. The importance measure can be aggregated for the original features (which appear once untransformed and possibly in many decision rules).
RuleFit also introduces partial dependence plots to show the average change of the prediction when one feature is changed. The partial dependence plot is a model-agnostic method that can be used with any model, and it has its own part in the book.
5.5.1 Interpretation and Example
Since RuleFit estimates a linear model in the end, the interpretation is equivalent to linear models. The only difference is that the model has new features that are coming from decision rules. Decision rules are binary features: A value of 1 means that all conditions of the rule are met, otherwise the value is 0. For linear terms in RuleFit, the interpretation is the same as in linear regression models: If \(x_j\) increases by one unit, the predicted outcome changes by \(\beta_j\).
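As a toy illustration (all coefficients below are invented, not taken from any fitted model): switching a rule feature from 0 to 1, with everything else fixed, changes the prediction by exactly the rule’s weight.

```python
# Toy RuleFit-style prediction with one linear term (temperature) and one rule feature.
def predict(temp, rule_active, beta0=2000.0, beta_temp=80.0, alpha_rule=590.0):
    return beta0 + beta_temp * temp + alpha_rule * rule_active

# Switching the rule from 0 to 1 changes the prediction by its weight:
print(predict(10, 1) - predict(10, 0))  # 590.0
```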
In this example, we use RuleFit to predict the number of rented bikes on a given day. The table shows five of the generated rules with their Lasso weights and importances (measured as a function of the weight, see later in the chapter) after fitting RuleFit on the bike dataset:
| Description | Weight | Importance |
|---|---:|---:|
| days_since_2011 > 428 & temp > 5.199151 | 590 | 288 |
| 37.45835 <= hum <= 90.156225 | -18 | 248 |
| days_since_2011 > 111 & weathersit in (“GOOD”, “MISTY”) | 598 | 228 |
| temp > 8.058349 & weathersit in (“GOOD”, “MISTY”) | 506 | 227 |
| temp > 13.228349 & days_since_2011 > 554 | 522 | 186 |
The most important rule was: “days_since_2011 > 428 & temp > 5.199151” and the associated weight is 590.1. The interpretation is: If days_since_2011 > 428 and temp > 5.199151, then the predicted number of bikes increases by 590.1, given all other feature values stay fixed. In total, 335 such rules were created from the original 8 features. Quite a lot! But thanks to Lasso, only 34 of the 335 got a weight different from zero.
Computing the global feature importances reveals that temperature and the time trend are the most important features:
The feature importance measurement includes the importance of the raw feature term and all the decision rules the feature appears in.
5.5.2 Guidelines
In this section we will talk about the advantages and disadvantages of RuleFit and how to interpret it.
Interpretation template
The interpretation is analogous to linear models: The predicted outcome changes by \(\beta_j\) if feature \(x_j\) changes by one unit, given all other features stay the same. The weight interpretation of a decision rule is a special case: If all conditions of a decision rule \(r_k\) apply, the predicted outcome changes by \(\alpha_k\) (the learned weight for rule \(r_k\) in the linear model). And, respectively, for classification: If all conditions of decision rule \(r_k\) apply, the odds for event vs. no-event change by a factor of \(\alpha_k\).
Advantages:
- RuleFit adds feature interactions automatically to linear models. It therefore solves the problem of linear models that interaction terms have to be added manually, and it helps a bit with the issue of modeling non-linear relationships.
- RuleFit can handle both classification and regression tasks.
- The created rules are easy to interpret, because they are binary decision rules: Either the rule applies to an instance or it does not. Good interpretability is only guaranteed as long as the number of conditions within a rule is not too big. A rule with 1 to 3 conditions seems reasonable to me. This translates into a maximum depth of 3 for the trees in the tree ensemble.
- Even if there are many rules in the model, they do not all apply to every instance, so for a single instance only a handful of rules are relevant (non-zero weights). This improves local interpretability.
- RuleFit proposes a bunch of useful diagnostic tools: feature importance, partial dependence plots and feature interactions. Since these tools are model-agnostic, you will find them in the model-agnostic section.
Disadvantages:
- Sometimes RuleFit creates many rules which get a non-zero weight in the Lasso model. The interpretability degrades as the number of features in the model increases. A promising solution is to force feature effects to be monotonic, meaning that an increase in a feature has to result in an increase of the predicted outcome.
- An anecdotal drawback: The papers claim good performance of RuleFit - often close to the predictive performance of Random Forests! - yet in the few cases where I personally tried it, the performance was disappointing.
- The end product of the RuleFit procedure is a linear model with additional fancy features (the decision rules). But since it is a linear model, the weight interpretation is still unintuitive (given all other features are fixed, increasing feature \(x_j\) by one unit increases the predicted outcome by \(\beta_j\)). It gets a bit more tricky if you have overlapping rules: For example, one decision rule (feature) for the bike prediction could be “temp > 10” and another rule could be “temp > 15 & weather=‘GOOD’”. When the weather is good and the temperature is above 15 degrees, the temperature is automatically also greater than 10, which means that whenever the second rule applies, the first rule applies as well. The interpretation of the estimated weight for the second rule is: ‘Given all other features are fixed, the predicted number of bikes increases by \(\beta_2\)’. BUT, now it becomes really clear that ‘all other features fixed’ is problematic, because if rule 2 applies, rule 1 also applies and the interpretation is nonsensical.
The RuleFit algorithm is implemented in R by Fokkema and Christoffersen (2017) and you can find a Python version on GitHub.
5.5.3 Theory
Let’s dive deeper into the technicalities of the RuleFit algorithm. RuleFit consists of two components: The first component produces “rules” from decision trees and the second component fits a linear model with the original features and the new rules as input (hence the name “RuleFit”). It enables automatic integration of interactions between features into a linear model, while having the interpretability of a sparse linear model.
Step 1: Rule generation
What does a rule look like? The rules generated by the algorithm have a simple form, for example: “if \(x_2<3\) and \(x_5<7\) then \(1\) else \(0\)”. The rules are constructed by disassembling decision trees: Each path to a node in a tree can be turned into a decision rule. The trees used for the rules are fitted to predict the target outcome, so the splits and resulting rules are optimised to predict the outcome you are interested in. You simply chain the binary decisions that lead to a certain node with a logical “AND”, and voilà, you have a rule. It is desirable to generate a lot of diverse and meaningful rules. Gradient boosting is used to fit an ensemble of decision trees (by regressing or classifying \(y\) with your original features \(X\)). Each resulting tree is turned into multiple rules. Not only boosted trees, but any type of tree ensemble can be used to generate the trees for RuleFit:
\[f(x)=a_0+\sum_{m=1}^M{}a_m{}f_m(x)\]
where \(M\) is the number of trees and \(f_m(x)\) represents the prediction function of the \(m\)-th tree. Bagged ensembles, Random Forests, AdaBoost and MART produce ensembles of trees and can be used for RuleFit.
From all of the trees of the ensemble, we produce the rules. Each rule \(r_m\) takes on the form:
\[r_m(x)=\prod_{j\in\text{T}_m}I(x_j\in{}s_{jm})\]
where \(\text{T}_{m}\) is the set of features used in the \(m\)-th tree and \(I(\cdot)\) is the indicator function, which is 1 if the feature \(x_j\) is in the subset of values \(s_{jm}\) (as specified by the tree splits) and 0 otherwise. For numerical features, \(s_{jm}\) consists of one or more intervals in the value range of \(x_j\), depending on the number of splits in that feature. In the case of a single split, \(s_{jm}\) takes one of the two forms: \(x_{s_{jm},\text{lower}}<x_j\) or \(x_j<x_{s_{jm},\text{upper}}\). Further splits in that feature create more complicated intervals. For categorical features, the subset \(s_{jm}\) contains some specific categories of \(x_j\).
A made up example for the bike rental dataset:
\[r_{17}(x)=I(x_{\text{temp}}<15)\cdot{}I(x_{\text{weather}}\in\{\text{good},\text{cloudy}\})\cdot{}I(10\leq{}x_{\text{windspeed}}<20)\]
This rule will only be equal to 1 if all of the three conditions are met, otherwise 0. RuleFit extracts all possible rules from a tree, not only from the leaf nodes. So another rule that would be created is:
\[r_{18}(x)=I(x_{\text{temp}}<15)\cdot{}I(x_{\text{weather}}\in\{\text{good},\text{cloudy}\})\]
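As a sketch, a rule like \(r_{17}\) can be evaluated on data as a product of indicator columns; the column names and values below are assumptions for illustration.

```python
import pandas as pd

# Toy bike data; column names and values are made up for illustration.
X = pd.DataFrame({
    "temp":      [12.0, 18.0, 14.0],
    "weather":   ["good", "rainy", "cloudy"],
    "windspeed": [15.0, 12.0, 25.0],
})

# r_17(x) as a product of indicator functions, mirroring the formula above.
r17 = (
    (X["temp"] < 15).astype(int)
    * X["weather"].isin(["good", "cloudy"]).astype(int)
    * ((X["windspeed"] >= 10) & (X["windspeed"] < 20)).astype(int)
)
print(r17.tolist())  # [1, 0, 0]: only the first row satisfies all three conditions
```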
In total,
\[K=\sum_{m=1}^M2(t_m-1)\]
rules are created from the ensemble of \(M\) trees, with \(t_m\) terminal nodes each. A trick that is introduced by the RuleFit authors is to fit trees with random depth, so that a lot of diverse rules are generated with different lengths. Note that we throw away the predicted value in each node and only keep the conditions that lead us to the node and create a rule from it. The weighting of the decision rules will happen in step 2 of fitting RuleFit.
Another way to see the first step: it generates a new set of features from your original features \(X\). These features are binary and can represent quite complex interactions of the original features. The rules are generated automatically from the covariate matrix \(X\) and are chosen so that they are useful for the prediction task at hand. You can see the rules simply as new features based on your original features.
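The following sketch shows one way the rule harvesting could look with scikit-learn: fit a gradient boosted ensemble and walk each tree, collecting the split conditions on every path from the root. It is a simplified illustration of the idea, not the reference RuleFit implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Small trees keep the rules short (few conditions per rule).
ens = GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

def tree_to_rules(tree):
    """Collect one rule (list of conditions) per path to each node of a fitted sklearn tree."""
    rules = []

    def recurse(node, conditions):
        if tree.children_left[node] == -1:  # leaf: no further split conditions
            return
        feat, thr = tree.feature[node], tree.threshold[node]
        left = conditions + [f"x{feat} <= {thr:.3f}"]
        right = conditions + [f"x{feat} > {thr:.3f}"]
        rules.extend([left, right])          # each visited split yields a left and a right rule
        recurse(tree.children_left[node], left)
        recurse(tree.children_right[node], right)

    recurse(0, [])
    return [" & ".join(r) for r in rules]

rules = [rule for est in ens.estimators_.ravel() for rule in tree_to_rules(est.tree_)]
print(len(rules), "rules, e.g.:", rules[0])
```

Each internal node contributes a left and a right rule, so a tree with \(t_m\) terminal nodes yields exactly \(2(t_m-1)\) rules, consistent with the count \(K\) above.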
Step 2: Sparse linear model
You will get A LOT of rules from the first step. Since the first step is only a feature transformation of your original dataset, you are still not done with fitting a model, and you also want to reduce the number of rules. Next to the rules, all your ‘raw’ features from the original dataset are also used in the Lasso linear model. Every rule and every original feature becomes a feature in the Lasso model and gets a weight estimate. The original, raw features are added because trees suck at representing simple linear relationships between y and x. Before we put everything into Lasso to get a sparse linear model, we winsorise the original features so that they are more robust against outliers:
\[l_j^*(x_j)=\min(\delta_j^+,\max(\delta_j^-,x_j))\]
where \(\delta_j^-\) and \(\delta_j^+\) are the \(\delta\) and \(1-\delta\) quantiles of the data distribution of \(x_j\). A choice of \(0.05\) for \(\delta\) means that every value of \(x_j\) that lies among the 5% lowest or 5% highest values is set to the 5% or 95% quantile, respectively. As a rule of thumb, you can choose \(\delta=0.025\). In addition, the linear terms have to be normalised so that they have the same prior influence as a typical decision rule:
\[l_j(x_j)=0.4\cdot{}l^*_j(x_j)/std(l^*_j(x_j))\]
The \(0.4\) is the average standard deviation of rules with a uniform support distribution \(s_k\sim{}U(0,1)\).
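A minimal numpy sketch of the winsorisation and scaling of a single linear term (the function name is mine):

```python
import numpy as np

def linear_term(x, delta=0.025):
    """Winsorise a numeric feature and scale it to match the prior influence of a rule."""
    lo, hi = np.quantile(x, [delta, 1 - delta])   # delta_j^- and delta_j^+
    l_star = np.clip(x, lo, hi)                   # l_j^*(x_j)
    return 0.4 * l_star / np.std(l_star)          # l_j(x_j)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
x[:5] = 100.0                                     # a few extreme outliers
print(np.std(linear_term(x)))                     # 0.4 by construction, despite the outliers
```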
We combine both types of features to generate a new feature matrix and estimate a sparse linear model with Lasso, with the following structure:
\[\hat{f}(x)=\hat{\beta}_0+\sum_{k=1}^K\hat{\alpha}_k{}r_k(x)+\sum_{j=1}^p\hat{\beta}_j{}l_j(x_j)\]
where \(\hat{\alpha}\) are the estimated weights for the rule features and \(\hat{\beta}\) for the original features. Since RuleFit uses Lasso, the loss function gets the additional constraint that forces some of the weights to get a zero estimate:
\[(\{\hat{\alpha}\}_1^K,\{\hat{\beta}\}_0^p)=\text{argmin}_{\{\alpha\}_1^K,\{\beta\}_0^p}\sum_{i=1}^n{}L(y_i,f(x_i))+\lambda\cdot\left(\sum_{k=1}^K|\alpha_k|+\sum_{j=1}^p|\beta_j|\right)\]
The outcome is a linear model that has linear effects for all of the original features and for the rules. The interpretation is the same as with linear models, the only difference is that some features are now binary rules.
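A rough end-to-end sketch of step 2 with scikit-learn, using random stand-ins for the binary rule matrix \(R\) and the scaled linear terms \(L\) (in practice these come from the previous steps); `LassoCV` picks the penalty \(\lambda\) by cross-validation. The variable names are mine.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, K = 500, 4, 50

# Stand-ins for the two feature blocks: scaled linear terms L and binary rule features R.
L = rng.normal(scale=0.4, size=(n, p))
R = rng.integers(0, 2, size=(n, K)).astype(float)
y = 2 * L[:, 0] + 3 * R[:, 0] + rng.normal(scale=0.1, size=n)

# Lasso over the combined feature matrix; most weights are shrunk to exactly zero.
Z = np.hstack([R, L])
model = LassoCV(cv=5).fit(Z, y)

alpha_hat = model.coef_[:K]   # rule weights
beta_hat = model.coef_[K:]    # linear-term weights
print(np.sum(model.coef_ != 0), "non-zero weights out of", K + p)
```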
Step 3 (optional): Feature importance

For the linear terms of the original features, the feature importance is measured with the standardised predictor:
\[I_j=|\hat{\beta}_j|\cdot{}std(l_j(x_j))\]
where \(\beta_j\) is the weight from the Lasso model and \(std(l_j(x_j))\) the standard deviation of the linear term over the data.
For the decision rule terms, the importance is calculated with:
\[I_k=|\hat{\alpha}_k|\cdot\sqrt{s_k(1-s_k)}\]
where \(\hat{\alpha}_k\) is the associated Lasso weight of the decision rule and \(s_k\) is the support of the rule in the data, which is the percentage of data points for which the decision rule applies (where \(r_k(x)=1\)):
\[s_k=\frac{1}{n}\sum_{i=1}^n{}r_k(x_i)\]
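Both importance formulas are easy to compute once the model is fitted; a small numpy sketch, assuming a binary rule matrix `R`, scaled linear terms `L`, and fitted weights `alpha_hat` and `beta_hat` as in the sketch above:

```python
import numpy as np

def linear_importance(beta_hat, L):
    """I_j = |beta_j| * std(l_j(x_j)), computed per column of the linear-term matrix L."""
    return np.abs(beta_hat) * L.std(axis=0)

def rule_importance(alpha_hat, R):
    """I_k = |alpha_k| * sqrt(s_k * (1 - s_k)), with s_k the share of instances the rule covers."""
    s = R.mean(axis=0)  # support s_k of each rule (columns of R are binary)
    return np.abs(alpha_hat) * np.sqrt(s * (1 - s))

# e.g. I_rules = rule_importance(alpha_hat, R); I_linear = linear_importance(beta_hat, L)
```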
A feature \(x_j\) occurs as a linear term and possibly also within many decision rules. How do we measure the total importance of the feature \(x_j\)? The importance \(J_j(x)\) of feature \(x_j\) can be measured at each individual prediction:
\[J_j(x)=I_l(x)+\sum_{x_j\in{}r_k}I_k(x)/m_k\]
where \(I_l\) is the importance of the linear term and \(I_k\) the importance of the decision rules in which \(x_j\) appears, and \(m_k\) is the number of features that constitute rule \(r_k\). Summing the feature importance over all instances gives us the global feature importance:
\[J_j(X)=\sum_{i=1}^n{}J_j(x_i)\]
It is possible to choose a subset of instances and calculate the feature importance for this group.
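A rough sketch of this aggregation; for simplicity it plugs in the global importances from the previous sketch instead of the per-instance ones, which changes the numbers but not the idea.

```python
def total_importance(I_linear, I_rules, rule_features):
    """Aggregate importance per original feature: its linear-term importance plus a
    share I_k / m_k of every rule it appears in; rule_features[k] lists the feature
    indices used by rule k, so m_k = len(rule_features[k])."""
    J = I_linear.copy()
    for k, feats in enumerate(rule_features):
        for j in feats:
            J[j] += I_rules[k] / len(feats)
    return J

# Example: rule 0 uses features {0, 2}, rule 1 only feature {1}:
# J = total_importance(I_linear, I_rules, [{0, 2}, {1}])
```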
Friedman, Jerome H., and Bogdan E. Popescu. 2008. “Predictive Learning via Rule Ensembles.” The Annals of Applied Statistics 2 (3): 916–54.
Fokkema, Marjolein, and Benjamin Christoffersen. 2017. Pre: Prediction Rule Ensembles. https://CRAN.R-project.org/package=pre.