by Veronica Scerra
I think that when I say I have a PhD in neuroscience, it conjures up in people’s minds images of white lab coats, pipettes, and cell cultures. That’s not unreasonable - many neuroscientists do that kind of bench work, and our world is better for it. In reality, my days as a neuroscientist consisted of a few hours desperately watching an oscilloscope, and many more hours cleaning, analyzing, and interpreting data, then building models to explain that data. If that were all it took to be successful in neuroscience, I would probably still be there, because I loved those hours in front of my computer with my data.

This brief jaunt down memory lane is actually relevant - what I’m writing about today are counterfactual explanations, and in essence, my work in neuroscience can be distilled down to the exploration of a counterfactual. In basic terms, it works like this: we start with a scenario in which a certain phenomenon can be reliably produced, we create a new scenario in which that phenomenon no longer holds, and then we drill down on what specifically differs between the new scenario and the well-established one. By narrowing the sources of unknown variance to as small a set as possible, we can learn something about the inner workings of a complex system and draw meaningful conclusions from the outcomes. The complex system in my work was the primate frontal eye field, a small visuomotor integration region of the prefrontal cortex, but in a broader sense, this could be any black-box model.
| Counterfactual Explanations | |
|---|---|
| What: | Show minimal changes to input features that would alter the model’s prediction |
| Use When: | You want user recourse, fairness insights, or actionable explanations |
| Assumptions: | You must define plausible alternatives and realistic constraints (the alternatives must exist in the real world) |
| Alternatives: | SHAP, ICE, LIME, contrastive explanations |
Counterfactual explanations are a remarkably versatile and accessible explainability tool because they are built around outcomes and do not require any deep understanding of the inner workings of the models producing those outcomes. It almost seems like a throwaway statement to say that you don’t need to understand a complex model’s internals to use counterfactuals, but it’s actually amazing - it means you don’t need to know anything about a proprietary model’s design, it means you don’t have to understand machine learning at all to find them useful (anyone can understand them), and it means they can give interested parties actionable targets to change in order to obtain different outcomes. What counterfactual explanations do is search the input space for the nearest possible alternate inputs that generate a different output.
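To make that concrete, here is a minimal, purely illustrative sketch of the idea: train a toy classifier, then brute-force search the neighborhood of one input for the closest point whose prediction differs. Everything here (the data, the two features, the grid of perturbations) is invented for illustration; real counterfactual libraries do this search far more cleverly.

```python
# A minimal sketch, not a production method: brute-force search a grid of
# perturbed inputs for the closest one whose prediction differs from the original.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # two hypothetical features
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # simple ground-truth rule
model = LogisticRegression().fit(X, y)

x0 = np.array([-0.4, -0.3])                    # an input we expect to be classified as 0
original = model.predict(x0.reshape(1, -1))[0]

# Candidate perturbations on a small grid around x0
deltas = np.stack(np.meshgrid(np.linspace(-2, 2, 81),
                              np.linspace(-2, 2, 81)), axis=-1).reshape(-1, 2)
candidates = x0 + deltas
preds = model.predict(candidates)

# Keep only candidates that flip the prediction, then take the closest one
flipped = candidates[preds != original]
distances = np.linalg.norm(flipped - x0, axis=1)   # L2 distance in feature space
counterfactual = flipped[np.argmin(distances)]

print("original input:", x0, "-> class", original)
print("counterfactual:", counterfactual.round(2),
      "-> class", model.predict(counterfactual.reshape(1, -1))[0])
```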
Let’s say you’re using a classifier model to determine people’s eligibility for a new drug trial, and someone is classified as a 0, or ineligible. A counterfactual exploration can tell you what that candidate might change, or do differently, to receive a different classification. This gives interested parties a genuine understanding of the model’s decisions without their having to know anything about what’s going on under the hood. Similarly, counterfactual explanations can be used to assess fairness and bias in models. If, for example, a counterfactual analysis recommends that candidates change their race or economic status to get a better outcome, the model may be racially or economically biased, which, depending on the usage, could be a problem.
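As a toy illustration of that fairness check - with entirely made-up feature names, data, and model - you can hold everything about a candidate fixed and vary only a sensitive attribute; if that alone flips the model’s decision, the model is leaning on that attribute:

```python
# A toy fairness probe under invented feature names: change only a sensitive
# attribute and see whether the prediction changes.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data for a drug-trial eligibility classifier
df = pd.DataFrame({
    "age":       [34, 51, 29, 62, 45, 38, 57, 41],
    "biomarker": [1.2, 0.4, 2.1, 0.3, 1.8, 0.9, 0.2, 1.5],
    "group":     [0, 1, 0, 1, 0, 1, 1, 0],   # sensitive attribute
    "eligible":  [1, 0, 1, 0, 1, 1, 0, 1],
})
model = RandomForestClassifier(random_state=0).fit(
    df[["age", "biomarker", "group"]], df["eligible"])

candidate = pd.DataFrame([{"age": 50, "biomarker": 0.5, "group": 1}])
flipped = candidate.copy()
flipped["group"] = 0                          # change only the sensitive attribute

print("as submitted: ", model.predict(candidate)[0])
print("group flipped:", model.predict(flipped)[0])
# If these two predictions differ, the sensitive attribute alone is enough
# to change the outcome - a signal worth investigating.
```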
While other interpretability methods can tell you how a decision was reached, counterfactuals let us explore what might have been…
Imagine you have a dial for each of your model’s input features, and you want to turn as few dials as possible - just enough to cross the model’s decision boundary. An ideal counterfactual turns those dials (alters the input features) as little as possible to obtain a new decision, representing the closest world in which things are “different”. Often you have a choice of several “alternate worlds”, each changing different features. This is a good thing, as it can illuminate various paths for targeting changes. You have to be mindful of two things: 1) how you compute the “distance” between your base feature values and the counterfactual’s, and 2) whether the new counterfactual feature values could actually exist (e.g., age can’t be negative), ensuring realistic alternatives. Several libraries help with this process: for example, the DiCE Python library will generate diverse, feasible counterfactual options to test against your model, and the Alibi library lets you try different algorithms and distance metrics to obtain optimal results.
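Here is a minimal sketch of what that might look like with DiCE (`pip install dice-ml`), using invented drug-trial-style data like the example above. The `features_to_vary` and `permitted_range` arguments are how you tell DiCE which dials it may turn and how far; the data, feature names, and classifier here are assumptions for illustration, not a recipe.

```python
# A minimal DiCE sketch with invented data: generate a few counterfactuals while
# keeping the sensitive attribute fixed and keeping "age" inside a realistic range.
import pandas as pd
import dice_ml
from sklearn.ensemble import RandomForestClassifier

train = pd.DataFrame({
    "age":       [34, 51, 29, 62, 45, 38, 57, 41, 49, 33],
    "biomarker": [1.2, 0.4, 2.1, 0.3, 1.8, 0.9, 0.2, 1.5, 0.6, 1.1],
    "group":     [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],   # treated as numeric for simplicity
    "eligible":  [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
})
model = RandomForestClassifier(random_state=0).fit(
    train.drop(columns="eligible"), train["eligible"])

data = dice_ml.Data(dataframe=train,
                    continuous_features=["age", "biomarker", "group"],
                    outcome_name="eligible")
wrapped = dice_ml.Model(model=model, backend="sklearn")
explainer = dice_ml.Dice(data, wrapped, method="random")

query = pd.DataFrame([{"age": 50, "biomarker": 0.5, "group": 1}])
cfs = explainer.generate_counterfactuals(
    query,
    total_CFs=3,
    desired_class="opposite",
    features_to_vary=["age", "biomarker"],    # never suggest changing "group"
    permitted_range={"age": [18, 80]},        # keep suggested ages realistic
)
cfs.visualize_as_dataframe(show_only_changes=True)
```

Excluding the sensitive attribute from `features_to_vary` keeps the suggestions actionable: every change DiCE proposes is something a candidate could, at least in principle, actually do.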
Simple 🙂
If PDP shows us the forest, ICE gives us the trees, and SHAP tells us the path we’ve taken to get where we are, counterfactuals let us ask what would have happened if we had turned left instead of right. Counterfactual explanations don’t just describe, they suggest. They empower. They let users and stakeholders understand the model’s decision as dynamic rather than static - an invitation to change rather than a closed door.
Stay tuned for a hands-on notebook where we generate and interpret counterfactuals using the DiCE library. You’ll see how simple tweaks can unlock powerful stories about your model - and your data.