3.5 Human-friendly Explanations

Let’s dig deeper and discover what we humans accept as ‘good’ explanations and what the implications for interpretable machine learning are.

Research from the humanities can help us figure that out. Miller (2017) conducted a large survey of publications on explanations, and this chapter builds on his summary.

In this chapter, I want to convince you of the following: As an explanation for an event, humans prefer short explanations (just one or two causes) that contrast the current situation with a situation in which the event would not have happened. Abnormal causes, in particular, make good explanations. Explanations are social interactions between the explainer and the explainee (the receiver of the explanation), and therefore the social context has a great influence on the actual content of the explanation.

If you need an explanation system that produces ALL the factors behind a certain prediction or behaviour, you do not want a human-style explanation, but rather a complete causal attribution. You probably want a complete causal attribution if you are legally required to state all influencing features or if you are debugging the machine learning model. In that case, ignore the following points. In all other settings, where the recipients of the explanation are mostly lay people or people with little time, follow the advice here.

3.5.1 What is an explanation?

An explanation is the answer to a why-question (Miller 2017).

  • Why did the treatment not work on the patient?
  • Why was my loan rejected?
  • Why haven’t we been contacted by alien life yet?

The first two kinds of questions can be answered with an “everyday” explanation, while the third falls into the category of more general scientific phenomena and philosophical questions. We focus on the “everyday” type of explanation, because that is what is relevant for interpretable machine learning. Questions starting with “how” can usually be rephrased as “why” questions: “How was my loan rejected?” becomes “Why was my loan rejected?”.

The term “explanation” refers to the social and cognitive process of explaining, but also to the product of these processes. The explainer can be a human or a machine.

3.5.2 What is a “good” explanation?

Now that we know what an explanation is, the question arises: what makes an explanation “good”?

“Many artificial intelligence and machine learning publications and methods claim to be about ‘interpretable’ or ‘explainable’ methods, yet often this claim is only based on the authors’ intuition instead of hard facts and research.” - Miller (2017)

Miller (2017) summarises what makes a ‘good’ explanation; this chapter replicates his summary in condensed form, adding concrete suggestions for machine learning applications.

Explanations are contrastive (Lipton 2016): Humans usually do not ask why a certain prediction was made, but why this prediction was made instead of another prediction. We tend to think in counterfactual cases, i.e. “What would the prediction have looked like if input X had been different?”. For a house value prediction, a person might be interested in why the predicted price was high compared to the lower price she expected. When my loan application is rejected, I am not interested in what generally constitutes a rejection or an approval. I am interested in the factors of my application that would need to change for it to be accepted. I want to know the contrast between my application and the would-be-accepted version of my application.

The realisation that contrastive explanations matter is an important finding for explainable machine learning. As we will see, most interpretable models allow us to extract some form of explanation that implicitly contrasts a prediction with that of an artificial data instance or an average of instances. A doctor who wonders “Why did the treatment not work on the patient?” might want an explanation that contrasts the patient with a similar patient for whom the treatment did work. Contrastive explanations are easier to understand than complete explanations. A complete answer to the doctor’s why-question (why does the treatment not work) might include: the patient has had the disease for 10 years, 11 genes are over-expressed making the disease more severe, the patient’s body breaks the medication down into ineffective chemicals very quickly, and so on. The contrastive explanation, which answers the question relative to the other patient for whom the drug worked, might be much simpler: the non-responsive patient has a certain combination of genes that makes the medication much less effective. The best explanation is the one that highlights the greatest difference between the object of interest and the reference object.

What it means for interpretable machine learning: Humans do not want a complete explanation for a prediction, but rather want to see how it differs from the prediction for another instance (which could also be an artificial one). Making explanations contrastive is application dependent, because it requires a point of reference for comparison, and this point may depend on the data point to be explained as well as on the user receiving the explanation. A user of a house price prediction website might want an explanation of a predicted house price contrastive to her own house, to another house on the website, or to an average house in the neighbourhood. One solution for creating contrastive explanations in an automated fashion is to find prototypes or archetypes in the data to contrast against.
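As a rough illustration of that last idea, the sketch below picks a reference instance to contrast against: among the instances whose prediction is close to what the user expected, it selects the one most similar to the instance being explained. This is only one possible heuristic, and the names `model`, `X` and `x` are assumptions for the sketch, not part of any particular library.

```python
# A minimal sketch of choosing a reference point for a contrastive explanation,
# assuming a fitted regression model `model`, a feature matrix `X` (numpy array)
# and an instance `x` to explain; all names here are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def contrastive_reference(x, X, model, expected_prediction, tolerance=0.1):
    """Return the instance in X whose prediction is close to the value the
    user expected and which is most similar to x."""
    preds = model.predict(X)
    close = np.abs(preds - expected_prediction) <= tolerance * abs(expected_prediction)
    candidates = X[close]
    nn = NearestNeighbors(n_neighbors=1).fit(candidates)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    return candidates[idx[0, 0]]

# The explanation can then be phrased in terms of the feature differences
# between x and the returned reference instance.
```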

Explanations are selected: People do not expect explanations to cover the actual and complete list of causes of an event. We are used to selecting one or two causes from a huge number of possible causes as THE explanation. For proof, switch on the television and watch the news: “The drop in share prices is blamed on a growing backlash against the product due to problems consumers are reporting with the latest software update.”, “Tsubasa and his team lost the match because of a weak defence: they left their opponents too much free space to play out their strategy.”, “The increased distrust in established institutions and our government are the main factors that reduced voter turnout.”

The fact that an event can be explained by different causes is called the Rashomon effect. Rashomon is a Japanese movie in which alternative, contradictory stories (explanations) of a samurai’s death are told. For machine learning models, it is an advantage when a good prediction can be made from different sets of features. Ensemble methods combine multiple models with different features (different explanations) and thrive because averaging over those “stories” makes the predictions more robust and accurate. But it also means that there is no single good, selective explanation of why a particular prediction was made.

What it means for interpretable machine learning: Make the explanation very short, giving only one to three reasons, even if the world is more complex. The LIME method does a good job of this.
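As a hedged sketch of how such a selective explanation can be produced, the snippet below asks LIME for just three features; `X_train`, `feature_names` and `model` are placeholders assumed for the example and are not given in this chapter.

```python
# Sketch: a selective explanation with LIME, limited to three reasons.
# Assumes a trained classifier `model`, training data `X_train` (numpy array)
# and a list `feature_names`; these are hypothetical placeholders.
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["rejected", "approved"],
    mode="classification",
)
explanation = explainer.explain_instance(
    X_train[0],              # the loan application to explain
    model.predict_proba,     # the black-box prediction function
    num_features=3,          # keep the explanation short: only three reasons
)
print(explanation.as_list())  # e.g. [("income <= 30000", -0.4), ...] (illustrative)
```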

Explanations are social: They are part of a conversation or interaction between the explainer and the receiver of the explanation. The social context determines the content and type of explanations. If I wanted to explain to a technical person why digital cryptocurrencies are worth so much, I would say things like: “The decentralised, distributed, blockchain-based ledger, which cannot be controlled by a central entity, resonates with people’s desire to secure their wealth, which explains the high demand and price.” But to my grandma I might say: “Look, Grandma: cryptocurrencies are a bit like computer gold. People like and pay a lot for gold, and young people like and pay a lot for computer gold.” What it means for interpretable machine learning: Be mindful of the social setting of your machine learning application and of the target audience. Getting the social part of the machine learning model right depends entirely on your specific application. Find experts from the humanities (e.g. psychologists and sociologists) to help you.

Explanations focus on the abnormal. People focus more on abnormal causes to explain events (Kahneman and Tversky 1981). These are causes that had a small probability but happened anyway, and removing them would have changed the outcome a lot (counterfactual explanation). Humans consider these kinds of “abnormal” causes to be good explanations.

An example (Štrumbelj and Kononenko 2011): Assume we have a dataset of test situations between teachers and students. Teachers can either let students pass a course directly after they have given a presentation, or they can ask additional questions to test the student’s knowledge, which then determines whether the student passes. So we have one ‘teacher’ feature, which is either 0 (teacher does not test) or 1 (teacher does test). Students have different levels of preparation (student feature), which translate into different probabilities of correctly answering the teacher’s questions (in case she decides to test the student). We want to predict whether a student will pass the course and explain our prediction. The chance of passing is 100% if the teacher does not ask additional questions; otherwise the probability of passing depends on the student’s level of preparation and the resulting probability of answering the questions correctly.

Scenario 1: The teacher asks the students additional questions most of the time (e.g. 95 out of 100 times). A student who did not study (10% chance of passing the questions part) was not among the lucky ones and gets additional questions, which he fails to answer correctly. Why did the student fail the course? We would say it was the student’s fault for not studying.

Scenario 2: The teacher rarely asks additional questions (e.g. 3 out of 100 times). For a student who did not prepare for possible questions, we would still predict a high probability of passing the course, since questions are unlikely. Of course, one of the students did not prepare (again a 10% chance of passing the questions part). He is unlucky, the teacher asks additional questions, the student cannot answer them and he fails the course. What is the reason for failing? I would argue that now the better explanation is that the teacher tested the student, because it was unlikely that the teacher would test. The teacher feature had an abnormal value.

What it means for interpretable machine learning: If one of the input features for a prediction was abnormal in any sense (like a rare category of a categorical feature) and the feature influenced the prediction, it should be included in the explanation, even if other ‘normal’ features have the same influence on the prediction as the abnormal one. An abnormal feature in our house price example might be that a rather expensive house has three balconies. Even if some attribution method finds that the three balconies contribute as much to the price difference as the above-average house size, the good neighbourhood and the recent renovation, the abnormal feature “three balconies” might be the best explanation for why the house is so expensive.
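The arithmetic behind the two scenarios can be made explicit with a tiny sketch (the probabilities are the ones given above; the function name is only for illustration):

```python
def pass_probability(p_test, p_answer):
    """Probability of passing the course: the student passes automatically if
    the teacher does not test, and otherwise passes with probability p_answer."""
    return (1 - p_test) + p_test * p_answer

# Scenario 1: teacher tests 95% of the time, unprepared student (10% answer rate)
print(pass_probability(0.95, 0.10))  # 0.145 -> failing is expected; blame the student
# Scenario 2: teacher tests only 3% of the time, same unprepared student
print(pass_probability(0.03, 0.10))  # 0.973 -> failing is surprising; the abnormal
                                     #          cause (the teacher testing) explains it
```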

Explanations are truthful. Good explanations prove to be true in reality (i.e. in other situations). Disturbingly, however, truthfulness is not the most important factor for a ‘good’ explanation; selectivity, for example, seems to matter more. An explanation that selects only one or two possible causes can never cover the complete list of causes, so selectivity necessarily omits part of the truth. It is not true that only one or two factors caused a stock market crash, for example; the truth is that millions of causes influenced millions of people to act in a way that eventually caused the crash.

What it means for interpretable machine learning: The explanation should predict the event as truthfully as possible, which in the context of machine learning is sometimes called fidelity. So when we say that three balconies increase the price of a house, this should hold for other (or at least similar) houses as well. To humans, fidelity is not as important for a good explanation as selectivity, contrast and the social aspect.
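One rough way to quantify fidelity, sketched below under my own assumptions, is to check how well a simple surrogate explanation (here a linear model) reproduces the black-box predictions in the neighbourhood of the instance being explained; `model` and `X_neighbourhood` are hypothetical placeholders.

```python
# Sketch: fidelity as the agreement between a simple surrogate explanation and
# the black-box model. `model` (the black box) and `X_neighbourhood` (instances
# around the point being explained) are assumed to exist.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

black_box_preds = model.predict(X_neighbourhood)
surrogate = LinearRegression().fit(X_neighbourhood, black_box_preds)
fidelity = r2_score(black_box_preds, surrogate.predict(X_neighbourhood))
print(f"Fidelity (R^2 of the surrogate vs. the black box): {fidelity:.2f}")
```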

Good explanations are coherent with the prior beliefs of the explainee. Humans tend to ignore information that is inconsistent with their prior beliefs. This effect is known as confirmation bias (Nickerson 1998). Explanations are not spared from this kind of bias: people tend to devalue or ignore explanations that do not agree with their beliefs. This differs from person to person, but there are also group-based prior beliefs, such as political opinions.

What it means for interpretable machine learning: Good explanations are consistent with prior beliefs. This is hard to build into machine learning and would probably compromise predictive accuracy. Suppose, for example, that house size has a negative effect on the predicted price for a few houses, which, let’s assume, improves accuracy (because of some complex interactions) but strongly contradicts prior beliefs. One thing you can do is to enforce monotonicity constraints (a feature can affect the outcome in only one direction) or use a model, such as a linear model, that has this property.
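As a small sketch of what enforcing such a constraint could look like (the feature order and data names are assumptions, not from the text), scikit-learn’s histogram-based gradient boosting accepts a per-feature monotonicity constraint:

```python
# Sketch: forcing the predicted house price to be non-decreasing in house size.
# Feature order is hypothetical: [house_size, n_balconies, distance_to_city];
# X_train and y_train are assumed placeholders.
from sklearn.ensemble import HistGradientBoostingRegressor

model = HistGradientBoostingRegressor(
    monotonic_cst=[1, 1, -1]  # +1: only increasing, -1: only decreasing, 0: unconstrained
)
model.fit(X_train, y_train)
```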

Good explanations are general and probable. A cause that can explain many events is very general and could be considered a good explanation. Note that this contradicts the claim that people prefer abnormal causes as explanations. As I see it, abnormal causes beat general causes: abnormal causes are by definition rare, so in the absence of an abnormal event, a general explanation is judged to be good. Also keep in mind that people tend to misjudge the probabilities of joint events. (Joe is a librarian. Is it more likely that he is shy, or that he is a shy person who loves reading books?) A good example of a general explanation is “the bigger a house, the more expensive it is”, which explains well why houses are expensive or cheap. What it means for interpretable machine learning: Generality can easily be measured by the explanation’s support, which is the number of instances to which the explanation applies divided by the total number of instances.
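Support is just a fraction, so a one-line sketch suffices; the column name and threshold below are hypothetical:

```python
import numpy as np

def support(applies):
    """Fraction of instances to which an explanation applies."""
    return np.mean(np.asarray(applies, dtype=bool))

# e.g. support of an explanation that applies to big houses (hypothetical data):
# print(support(house_size > 200))
```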


  1. Kahneman, Daniel, and Amos Tversky. 1981. “The Simulation Heuristic.” Stanford University, Department of Psychology.

  2. Štrumbelj, Erik, and Igor Kononenko. 2011. “A General Method for Visualizing and Explaining Black-Box Regression Models.” In International Conference on Adaptive and Natural Computing Algorithms, 21–30. Springer.

  3. Nickerson, Raymond S. 1998. “Confirmation Bias: A Ubiquitous Phenomenon in Many Guises.” Review of General Psychology 2 (2). Educational Publishing Foundation: 175.