Warning
This library is currently in beta and many functions may not work. If you use it anyway, we welcome your comments; please open an issue!
The API might change and the documentation is not up to date.
In particular, it is not yet possible to obtain interpretable concept-based explanations.
Table of contents
- Table of contents
- Quick Start
- What's Included
- Contributing
- See Also
- Acknowledgments
- Creators
- Citation
- License
Quick Start
The library should be available on PyPI soon; try `pip install interpreto` to install it. Otherwise, you can clone the repository and install it locally with `pip install -e .`.
In any case, check out the attribution walkthrough and the concept example to get started!
What's Included
Interpreto provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.
Attribution Methods
Interpreto includes both inference-based and gradient-based attribution methods (a minimal sketch of each family follows its list below):
Inference-based Methods:
- Occlusion: Zeiler and Fergus, 2014, Visualizing and Understanding Convolutional Networks.
- LIME: Ribeiro et al. 2016, "Why Should I Trust You?": Explaining the Predictions of Any Classifier.
- Kernel SHAP: Lundberg and Lee, 2017, A Unified Approach to Interpreting Model Predictions.
- Sobol Attribution: Fel et al. 2021, Look at the Variance! Efficient Black-box Explanations with Sobol-based Sensitivity Analysis.
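For intuition, here is a minimal token-occlusion sketch (plain Python with a hypothetical `predict` callable; this is not Interpreto's API): each token is masked in turn, and its attribution is the resulting drop in the target class score.

```python
def occlusion_attribution(predict, tokens, mask_token="[MASK]", target=0):
    """Token-level occlusion in the spirit of Zeiler and Fergus (2014).

    predict: placeholder callable mapping a list of tokens to a list of class scores.
    tokens: the tokenized input sentence.
    Returns one attribution per token (drop in the target score when that token is masked).
    """
    base_score = predict(tokens)[target]
    attributions = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append(base_score - predict(occluded)[target])
    return attributions
```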
Gradient-based Methods:
- Saliency: Simonyan et al. 2013, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.
- Integrated Gradients: Sundararajan et al. 2017, Axiomatic Attribution for Deep Networks.
- SmoothGrad: Smilkov et al. 2017, SmoothGrad: Removing Noise by Adding Noise.
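To illustrate the principle behind the gradient-based family, here is a minimal Integrated Gradients sketch in plain PyTorch, applied to a generic differentiable `model` over continuous inputs such as embeddings (placeholder names; this is not Interpreto's API):

```python
import torch

def integrated_gradients(model, inputs, baseline=None, target=0, steps=50):
    """Approximate Integrated Gradients (Sundararajan et al., 2017):
    average the gradients along a straight path from a baseline to the input,
    then scale by the input-baseline difference."""
    if baseline is None:
        baseline = torch.zeros_like(inputs)
    alphas = torch.linspace(0.0, 1.0, steps + 1)[1:]  # interpolation coefficients in (0, 1]
    total_grads = torch.zeros_like(inputs)
    for alpha in alphas:
        point = (baseline + alpha * (inputs - baseline)).detach().requires_grad_(True)
        score = model(point)[:, target].sum()  # assumes model returns (batch, n_classes) scores
        total_grads += torch.autograd.grad(score, point)[0]
    return (inputs - baseline) * total_grads / steps
```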
To be implemented soon:
- InputxGradient: Simonyan et al. 2013, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.
- DeepLift: Shrikumar et al. 2017, Learning Important Features Through Propagating Activation Differences.
- VarGrad: Richter et al. 2020, VarGrad: A Low-Variance Gradient Estimator for Variational Inference.
Concept-Based Methods
Concept-based explanations aim to provide high-level interpretations of latent model representations.
Interpreto generalizes these methods through three core steps:
- Concept Discovery (e.g., from latent embeddings)
- Concept Interpretation (mapping discovered concepts to human-understandable elements)
- Concept-to-Output Attribution (assessing concept relevance to model outputs) [work in progress]
Concept Discovery Techniques (via Overcomplete):
- NMF, Semi-NMF, ConvexNMF
- ICA, SVD, PCA
- SAE variants (Vanilla SAE, TopK SAE, JumpReLU SAE, BatchTopK SAE)
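As a rough illustration of the discovery step (plain scikit-learn and NumPy, not Interpreto's or Overcomplete's API), latent activations collected from an intermediate layer can be factorized with NMF; each row of the learned dictionary is then a candidate concept direction:

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder: (n_samples, hidden_dim) non-negative latent activations,
# e.g. post-activation outputs collected from an intermediate layer.
activations = np.random.rand(1000, 768)

nmf = NMF(n_components=20, init="nndsvd", max_iter=500)
concept_coeffs = nmf.fit_transform(activations)  # (n_samples, n_concepts): per-sample concept activations
concept_dictionary = nmf.components_             # (n_concepts, hidden_dim): concept directions
```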
Available Concept Interpretation Techniques:
- Top-k tokens from the tokenizer vocabulary
- Top-k tokens/words/sentences/samples from specific datasets
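The dataset-based interpretation idea can be sketched as follows (placeholder names, not the library's interface): score every token of a reference dataset on each concept and keep the k most activating ones.

```python
import numpy as np

def top_k_tokens_per_concept(concept_activations, tokens, k=10):
    """concept_activations: (n_tokens, n_concepts) activation of each dataset token on each concept.
    tokens: list of n_tokens token strings.
    Returns, for each concept, the k tokens that activate it most strongly."""
    tokens = np.asarray(tokens)
    top = {}
    for concept in range(concept_activations.shape[1]):
        order = np.argsort(concept_activations[:, concept])[::-1][:k]
        top[concept] = tokens[order].tolist()
    return top
```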
Concept Interpretation Techniques Coming Soon:
- Input-to-concept attribution from dataset examples (Jourdan et al. 2023)
- Theme prediction via LLMs from top-k tokens/sentences
Concept Interpretation Techniques Planned for Later:
- OpenAI Interpretation (Bills et al. 2023)
- Aligning concepts with human labels (Sajjad et al. 2022)
- Word cloud visualizations of concepts (Dalvi et al. 2022)
- VocabProj & TokenChange (Gur-Arieh et al. 2025)
Concept-to-Output Attribution:
This part will be implemented later; once it is, all of the attribution methods presented above will also be available at the concept level.
Note that only concept-extraction methods that provide both an encoder (input to concepts) AND a decoder (concepts to output) can use this functionality; a rough sketch of the idea follows.
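As an illustration of why both directions are needed (placeholder names, not Interpreto's interface): the decoder maps concept activations back to the latent space, the rest of the model maps that latent space to outputs, and composing the two yields a function to which any of the attribution methods above can be applied.

```python
def concept_to_output_fn(decoder, model_tail):
    """Compose a concept decoder (concepts -> latent space) with the rest of the
    model (latent space -> outputs); attribution methods, such as the
    integrated-gradients sketch above, can then be run on concept activations."""
    def forward(concept_coeffs):
        latent = decoder(concept_coeffs)  # (n, n_concepts) -> (n, hidden_dim)
        return model_tail(latent)         # (n, hidden_dim) -> (n, n_classes)
    return forward

# Hypothetical usage with the earlier sketch:
# f = concept_to_output_fn(decoder, model_tail)
# concept_attributions = integrated_gradients(f, concept_coeffs, target=predicted_class)
```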
Specific methods:
[Available later, once all parts are implemented.] Thanks to this generalization encompassing all concept-based methods, and to our highly flexible architecture, a large number of concept-based methods can be obtained easily:
- CAV and TCAV: Kim et al. 2018, Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV).
- ConceptSHAP: Yeh et al. 2020, On Completeness-aware Concept-Based Explanations in Deep Neural Networks.
- Yun et al. 2021, Transformer Visualization via Dictionary Learning: Contextualized Embedding as a Linear Superposition of Transformer Factors.
- FFN values interpretation: Geva et al. 2022, Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space.
- SparseCoding: Cunningham et al. 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models.
- Parameter Interpretation: Dar et al. 2023, Analyzing Transformers in Embedding Space.
Evaluation Metrics
Evaluation Metrics for Attribution
We don't yet have metrics implemented for attribution methods, but that's coming soon!
Evaluation Metrics for Concepts
Several properties of the concept-space are desirable.
List of properties and corresponding metrics (a small numerical sketch follows the list):
- Concept-space faithfulness: `MSE`, `FID`, or define a custom one through `ReconstructionError` by specifying a `reconstruction_space` and a `distance_function`.
- Concept-space complexity: the `Sparsity` and `SparsityRatio` metrics are available.
- Concept-space stability: you can use the `Stability` metric to compare concept-model dictionaries.
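To make faithfulness and complexity concrete, here is a small NumPy sketch (not the metric classes listed above; variable names are placeholders) computing a reconstruction MSE and a sparsity ratio from a concept encoding and its dictionary:

```python
import numpy as np

def concept_space_metrics(activations, concept_coeffs, concept_dictionary, eps=1e-8):
    """activations: (n, d) original latent activations.
    concept_coeffs: (n, k) concept activations (encoder output).
    concept_dictionary: (k, d) concept directions (decoder).
    Returns a reconstruction MSE (faithfulness) and a sparsity ratio (complexity)."""
    reconstruction = concept_coeffs @ concept_dictionary
    mse = float(np.mean((activations - reconstruction) ** 2))
    sparsity_ratio = float(np.mean(np.abs(concept_coeffs) < eps))  # fraction of (near-)zero coefficients
    return mse, sparsity_ratio
```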
Contributing
Feel free to propose your ideas or come and contribute with us to the Interpreto toolbox! We have a dedicated document that describes, in simple terms, how to make your first pull request.
See Also
More from the DEEL project:
- Xplique, a Python library dedicated to explaining neural networks (images, time series, tabular data) on TensorFlow.
- Puncc, a Python library for predictive uncertainty quantification using conformal prediction.
- oodeel, a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already-trained neural network image classifiers.
- deel-lip, a Python library for training k-Lipschitz neural networks on TensorFlow.
- deel-torchlip, a Python library for training k-Lipschitz neural networks on PyTorch.
- Influenciae, a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
- DEEL White paper, a summary by the DEEL team of the challenges of certifiable AI and the role of data quality, representativity, and explainability for this purpose.
Acknowledgments
This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.
Creators
Interpreto is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.
Citation
If you use Interpreto as part of your workflow in a scientific publication, please consider citing our paper (coming soon):
BibTeX entry coming soon
License
The package is released under the MIT license.