
Interpreto: Interpretability Toolkit for LLMs

πŸš€ Quick Start

The library is available on PyPI; install it with pip install interpreto.

Check out the tutorials to get started:

πŸ“¦ What's Included

Interpreto πŸͺ„ provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

πŸ”₯ Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods.

They all work seamlessly for both classification (...ForSequenceClassification) and generation (...ForCausalLM) models; a usage sketch follows the method lists below.

Inference-based methods:

Gradient-based methods:
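As a minimal usage sketch (the Occlusion explainer name, its import path, and the call signature below are assumptions for illustration, not a verbatim API reference), an attribution method can be applied to a HuggingFace classifier along these lines:

```python
# Minimal sketch: the import path, explainer name, and call signature are
# assumptions; check the attribution tutorials for the exact API.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from interpreto.attributions import Occlusion  # hypothetical import path

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the explainer around the model/tokenizer pair...
explainer = Occlusion(model, tokenizer, batch_size=8)

# ...and attribute the model's prediction for one or more input texts.
explanations = explainer(["The movie was surprisingly good."])
```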

πŸ’‘ Concept-Based Methods or Mechanistic Interpretability

Concept-based explanations aim to provide high-level interpretations of latent model representations.

Interpreto generalizes these methods through four core steps:

  1. Split a model in two and obtain a dataset of activations
  2. Concept Discovery (e.g., from latent embeddings)
  3. Concept Interpretation (mapping discovered concepts to human-understandable elements)
  4. Concept-to-Output Attribution (assessing concept relevance to model outputs)

1. Split a model in two and obtain a dataset of activations (mainly via nnsight):

Choose any layer of any HuggingFace language model with our nnsight-based ModelWithSplitPoints, then pass a dataset through it to obtain a dataset of activations.
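For example (the constructor arguments and the activation-extraction call below are assumptions for illustration; only ModelWithSplitPoints itself is taken from the text above):

```python
# Sketch: argument names and the activation-extraction call are assumptions.
from transformers import AutoModelForSequenceClassification

from interpreto import ModelWithSplitPoints

# Wrap a HuggingFace model and declare the layer(s) at which to split it.
model_with_split_points = ModelWithSplitPoints(
    "bert-base-uncased",
    split_points=["bert.encoder.layer.10"],        # example layer path
    automodel=AutoModelForSequenceClassification,
    batch_size=16,
)

# Pass a dataset of texts through the model to collect activations
# at the chosen split point.
texts = ["A first example sentence.", "A second example sentence."]
activations = model_with_split_points.get_activations(texts)
```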

2. Dictionary Learning for Concept Discovery (mainly via overcomplete):
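Continuing the sketch from step 1 (the NMFConcepts name, its import path, and its arguments are assumptions standing in for whichever overcomplete-based dictionary-learning explainer you pick):

```python
# Sketch: NMFConcepts, its arguments, and fit() are assumptions standing in
# for any dictionary-learning concept explainer; reuses objects from step 1.
from interpreto.concepts import NMFConcepts  # hypothetical import path

concept_explainer = NMFConcepts(
    model_with_split_points,                # wrapped model from step 1
    split_point="bert.encoder.layer.10",    # where activations were taken
    nb_concepts=20,                         # size of the concept dictionary
)

# Learn the concept dictionary from the activation dataset gathered in step 1.
concept_explainer.fit(activations)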

3. Available Concept Interpretation Techniques:

Concept interpretation techniques to be added in the future:
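As an illustrative sketch only (the interpret method, its arguments, and its return value are assumptions), interpretation typically maps each learned concept to the inputs that activate it most:

```python
# Sketch: the method name, arguments, and returned structure are assumptions.
# The idea: describe each learned concept by the inputs that activate it most.
concept_descriptions = concept_explainer.interpret(
    inputs=texts,   # texts used to build the activation dataset in step 1
    top_k=5,        # keep the 5 most activating inputs per concept
)
print(concept_descriptions)
```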

4. Concept-to-Output Attribution:

Estimate the contribution of each concept to the model output.

It can be obtained from any concept-based explainer via MethodConcepts.concept_output_gradient().
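Continuing the sketch from the previous steps (the argument passed below is an assumption; only the concept_output_gradient name comes from the text above):

```python
# Sketch: continues the concept explainer from steps 1-2; the argument shape
# is an assumption, only the method name comes from the text above.
concept_importance = concept_explainer.concept_output_gradient(
    "A first example sentence.",
)
print(concept_importance)
```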

Specific methods:

Because this generalization encompasses all concept-based methods, and thanks to the toolbox's highly flexible architecture, a large number of concept-based methods can be assembled from these building blocks.

The following methods will soon be available:

πŸ“Š Evaluation Metrics

Evaluation Metrics for Attribution

To evaluate the faithfulness of attribution methods, Interpreto provides the Insertion and Deletion metrics.
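A sketch of how such a faithfulness metric might be used (the import path, constructor, and call signature are assumptions; only the Insertion and Deletion names come from the text above):

```python
# Sketch: import path, constructor, and call signature are assumptions;
# reuses the model, tokenizer, and explanations from the attribution sketch.
from interpreto.metrics import Insertion  # hypothetical import path

# A faithfulness metric scores how well the attribution ranking predicts the
# effect of progressively inserting the highest-ranked tokens.
insertion = Insertion(model, tokenizer)
score = insertion(["The movie was surprisingly good."], explanations)
print(score)
```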

Evaluation Metrics for Concepts

Concept-based methods have several steps that can be evaluated together via ConSim.

Or independently:
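A heavily hedged sketch of the joint evaluation with ConSim mentioned above (the import path, constructor, and call signature are assumptions; only the ConSim name comes from the text above):

```python
# Sketch: import path, constructor, and call signature are assumptions;
# reuses the wrapped model and concept explainer from the earlier steps.
from interpreto.metrics import ConSim  # hypothetical import path

# ConSim evaluates the steps of a concept-based method together.
consim = ConSim(model_with_split_points, concept_explainer)
score = consim(texts)
print(score)
```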

πŸ‘ Contributing

Feel free to propose your ideas or come contribute with us to the Interpreto 🪄 toolbox! We have a dedicated document that explains, in simple terms, how to make your first pull request.

πŸ‘€ See Also

More from the DEEL project:

  • Xplique, a Python library dedicated to explaining neural networks (Images, Time Series, Tabular data) on TensorFlow.
  • Puncc, a Python library for predictive uncertainty quantification using conformal prediction.
  • oodeel, a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already trained neural network image classifiers.
  • deel-lip, a Python library for training k-Lipschitz neural networks on TensorFlow.
  • deel-torchlip, a Python library for training k-Lipschitz neural networks on PyTorch.
  • Influenciae, a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
  • DEEL White paper, a summary by the DEEL team of the challenges of certifiable AI and the role of data quality, representativity, and explainability for this purpose.

πŸ™ Acknowledgments

This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.

πŸ‘¨β€πŸŽ“ Creators

Interpreto πŸͺ„ is a project of the FOR and the DEEL teams at the IRT Saint-ExupΓ©ry in Toulouse, France.

πŸ—žοΈ Citation

If you use Interpreto πŸͺ„ as part of your workflow in a scientific publication, please consider citing πŸ—žοΈ our paper:

@article{poche2025interpreto,
    title       = {Interpreto: An Explainability Library for Transformers},
    author      = {Poch{\'e}, Antonin and Mullor, Thomas and Sarti, Gabriele and Boisnard, Fr{\'e}d{\'e}ric and Friedrich, Corentin and Claye, Charlotte and Hoofd, Fran{\c{c}}ois and Bernas, Raphael and Hudelot, C{\'e}line and Jourdan, Fanny},
    journal     = {arXiv preprint arXiv:2512.09730},
    year        = {2025}
}

πŸ“ License

The package is released under the MIT license.