Interpreto: Interpretability Toolkit for LLMs

🚀 Quick Start

The library is available on PyPI; install it with pip install interpreto.

Check out the tutorials to get started:

📦 What's Included

Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods.

They all work seamlessly for both classification (...ForSequenceClassification) and generation (...ForCausalLM) models; a from-scratch sketch of the idea follows the method lists below.

Inference-based Methods:

Gradient-based Methods:
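To give an idea of what an inference-based method computes, here is a from-scratch occlusion sketch on a Hugging Face classifier. This is generic code for the technique, not Interpreto's explainer API, and the checkpoint is an arbitrary, commonly used example:

```python
# From-scratch occlusion attribution for a sequence classifier.
# NOTE: generic sketch of the technique, not Interpreto's API.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"  # arbitrary example
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tokenizer("This library makes interpretability easy.", return_tensors="pt")

with torch.no_grad():
    probs = model(**enc).logits.softmax(-1)[0]
target = int(probs.argmax())
base = float(probs[target])

# Mask one token at a time and measure the drop in the target probability:
# the bigger the drop, the more important the token.
scores = []
for i in range(enc["input_ids"].shape[1]):
    occluded = enc["input_ids"].clone()
    occluded[0, i] = tokenizer.mask_token_id
    with torch.no_grad():
        p = model(input_ids=occluded, attention_mask=enc["attention_mask"])
        p = p.logits.softmax(-1)[0, target]
    scores.append(base - float(p))

for tok, s in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores):
    print(f"{tok:>16s} {s:+.4f}")
```

Gradient-based methods replace the occlusion loop with backward passes through the model, trading many perturbed forward passes for gradient access.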

Concept-Based Methods or Mechanistic Interpretability

Concept-based explanations aim to provide high-level interpretations of latent model representations.

Interpreto generalizes these methods through three core steps:

  1. Concept Discovery (e.g., from latent embeddings)
  2. Concept Interpretation (mapping discovered concepts to human-understandable elements)
  3. Concept-to-Output Attribution (assessing concept relevance to model outputs)
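To make the data flow concrete, here is a toy, end-to-end sketch of the three steps. Everything in it (the synthetic activations, the scikit-learn NMF factorization, the made-up linear output head) is an illustrative assumption, not Interpreto's API:

```python
# Toy walk through the three steps (generic sketch, not Interpreto's API).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((100, 64))   # latent activations: 100 samples x 64 dimensions

# 1) Concept discovery: factor A ~= Z @ D, where Z holds non-negative
#    concept activations and D holds the concept directions (the dictionary).
nmf = NMF(n_components=5, init="nndsvda", max_iter=500)
Z = nmf.fit_transform(A)    # (100, 5) concept activations per sample
D = nmf.components_         # (5, 64) concept directions in latent space

# 2) Concept interpretation: inspect the samples that activate each concept most.
for k in range(Z.shape[1]):
    top = np.argsort(Z[:, k])[::-1][:3]
    print(f"concept {k}: top-activating samples {top}")

# 3) Concept-to-output attribution: with a linear head w on the activations,
#    the output y = A @ w ~= Z @ (D @ w), so concept k contributes
#    Z[:, k] * (D @ w)[k] to the output for each sample.
w = rng.random(64)          # toy stand-in for the rest of the model
contributions = Z * (D @ w)
print("mean contribution per concept:", contributions.mean(axis=0).round(3))
```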

Dictionary Learning for Concept Discovery (mainly via Overcomplete):

Available Concept Interpretation Techniques:

Concept Interpretation Techniques planned for future releases:

Concept-to-Output Attribution:

These methods estimate the contribution of each concept to the model output.

Concept-to-output attributions can be obtained with any concept-based explainer via MethodConcepts.concept_output_gradient().
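The idea behind such a gradient is easy to state generically: if the path from concept activations to the model output is differentiable, the derivative of the output with respect to each concept measures that concept's local influence. A minimal sketch with a stand-in readout (an assumption for illustration, not the concept_output_gradient() implementation):

```python
# Generic concept-to-output gradient sketch (not Interpreto's implementation).
import torch

torch.manual_seed(0)
n_concepts = 8

z = torch.randn(n_concepts, requires_grad=True)  # concept activations (stand-in)
readout = torch.nn.Linear(n_concepts, 1)         # stand-in for the rest of the model

output = readout(z)
output.backward()

# One score per concept: positive gradients push the output up.
print(z.grad)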

Specific methods:

Because this generalization covers the whole family of concept-based methods, and the architecture is highly modular, a large number of concept-based methods can be composed with little effort.

The following list will soon be available:

Evaluation Metrics

Evaluation Metrics for Attribution

To evaluate the faithfulness of attribution methods, Interpreto provides the Insertion and Deletion metrics: tokens are progressively inserted into (or removed from) the input in order of attributed importance, and the model's output is tracked along the way.
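A minimal, generic sketch of Deletion, with a toy model standing in for a real classifier (nothing below is Interpreto's API):

```python
# Generic Deletion metric sketch: remove tokens from most to least important
# and track the target probability; a faithful attribution makes the curve
# drop quickly (low area under the curve).
import numpy as np

def deletion_curve(token_ids, scores, predict, mask_id=0):
    ids = np.array(token_ids, dtype=np.int64)
    probs = [predict(ids)]
    for i in np.argsort(scores)[::-1]:  # most important token first
        ids[i] = mask_id
        probs.append(predict(ids))
    return np.array(probs)

# Toy stand-in for a model: the "probability" is the fraction of important
# tokens (here, ids above 100) still present in the input.
def toy_predict(ids):
    return float((ids > 100).mean())

curve = deletion_curve(token_ids=[7, 150, 42, 120, 9],
                       scores=[0.1, 0.9, 0.2, 0.7, 0.0],
                       predict=toy_predict)
print(curve, f"normalized AUC = {curve.mean():.3f}")
```

Insertion is the mirror image: start from a fully masked input, re-insert tokens from most to least important, and reward attributions whose curve rises quickly (high AUC).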

Evaluation Metrics for Concepts

Concept-based methods have several steps that can be evaluated together via ConSim.

Or independently:

πŸ‘ Contributing

Feel free to propose your ideas or come and contribute to the Interpreto 🪄 toolbox! We have a dedicated document that describes, in simple terms, how to make your first pull request.

👀 See Also

More from the DEEL project:

  • Xplique a Python library dedicated to explaining neural networks (Images, Time Series, Tabular data) on TensorFlow.
  • Puncc a Python library for predictive uncertainty quantification using conformal prediction.
  • oodeel a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already trained neural network image classifiers.
  • deel-lip a Python library for training k-Lipschitz neural networks on TensorFlow.
  • deel-torchlip a Python library for training k-Lipschitz neural networks on PyTorch.
  • Influenciae a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
  • DEEL White paper a summary by the DEEL team of the challenges of certifiable AI and of the role of data quality, representativity, and explainability in addressing them.

πŸ™ Acknowledgments

This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and FOR projects.

👨‍🎓 Creators

Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.

πŸ—žοΈ Citation

If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper (coming soon):

BibTeX entry coming soon

πŸ“ License

The package is released under the MIT license.