
Interpreto: Interpretability Toolkit for LLMs



🚀 Quick Start

The library should be available on PyPI soon. Try pip install interpreto to install it.

Otherwise, clone the repository and install it locally by running pip install -e . from the repository root.

In any case, check out the attribution walkthrough and the concept example to get started!

📦 What's Included

Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods:

Inference-based Methods:

Gradient-based Methods:
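As a rough illustration of how the two families differ, here is a minimal sketch on a toy differentiable model. Occlusion (inference-based) and gradient-times-input (gradient-based) are picked as generic examples; the function names, the toy model, and the baseline value are assumptions for illustration and are not Interpreto's API.

```python
import numpy as np

def toy_score(x: np.ndarray, w: np.ndarray) -> float:
    """Hypothetical model: tanh of a weighted sum of per-token features."""
    return float(np.tanh(x @ w))

def occlusion_attribution(x: np.ndarray, w: np.ndarray, baseline: float = 0.0) -> np.ndarray:
    """Inference-based: only forward passes are needed.
    Each feature is replaced by a baseline and the drop in score is recorded."""
    full = toy_score(x, w)
    scores = np.zeros_like(x)
    for i in range(len(x)):
        occluded = x.copy()
        occluded[i] = baseline
        scores[i] = full - toy_score(occluded, w)
    return scores

def gradient_x_input_attribution(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Gradient-based: uses d(score)/dx, written analytically for the toy model."""
    grad = (1.0 - np.tanh(x @ w) ** 2) * w
    return grad * x

x = np.array([0.5, -1.0, 2.0, 0.0])
w = np.array([1.0, 0.5, 0.25, 3.0])
occ = occlusion_attribution(x, w)
gxi = gradient_x_input_attribution(x, w)
```

On this toy input both methods agree on the sign of every attribution, and both assign zero to the last feature because its value equals the baseline.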

Concept-Based Methods

Concept-based explanations aim to provide high-level interpretations of latent model representations.

Interpreto generalizes these methods through three core steps:

  1. Concept Discovery (e.g., from latent embeddings)

  2. Concept Interpretation (mapping discovered concepts to human-understandable elements)

  3. Concept-to-Output Attribution (assessing concept relevance to model outputs) [Work in progress]

Concept Discovery Techniques (via Overcomplete):

  • NMF, Semi-NMF, ConvexNMF

  • ICA, SVD, PCA, KMeans

  • SAE variants (Vanilla SAE, TopK SAE, JumpReLU SAE, BatchTopK SAE)
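To give a feel for what an SAE-style concept model computes, here is a minimal sketch of the forward pass of a TopK sparse autoencoder in NumPy. The weights are random and untrained, the parameter names are hypothetical, and applying a ReLU before the top-k selection is an assumption; this is not the Overcomplete or Interpreto implementation.

```python
import numpy as np

def topk_sae_forward(acts, w_enc, b_enc, w_dec, b_dec, k):
    """Encode latent activations into sparse concept codes, then decode them.

    acts:  (n_samples, d_model) latent activations
    codes: (n_samples, n_concepts) with at most k non-zero entries per row
    """
    # Encoder: affine map followed by ReLU (assumed variant).
    pre = np.maximum((acts - b_dec) @ w_enc + b_enc, 0.0)
    # Keep only the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    # Decoder: linear map back to the latent space.
    recon = codes @ w_dec + b_dec
    return codes, recon

rng = np.random.default_rng(0)
d_model, n_concepts, k = 16, 32, 4
acts = rng.normal(size=(8, d_model))
w_enc = rng.normal(size=(d_model, n_concepts)) * 0.1
b_enc = np.zeros(n_concepts)
w_dec = rng.normal(size=(n_concepts, d_model)) * 0.1
b_dec = np.zeros(d_model)

codes, recon = topk_sae_forward(acts, w_enc, b_enc, w_dec, b_dec, k)
```

The encoder half maps inputs to concepts and the decoder half maps concepts back to the latent space, which is the pair of components needed later for concept-to-output attribution.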

Available Concept Interpretation Techniques:

  • Top-k tokens from tokenizer vocabulary

  • Top-k tokens/words/sentences/samples from specific datasets

  • LLM Labeling (Bills et al. 2023)
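The top-k interpretation idea can be sketched in a few lines: given concept activations for a set of elements (tokens, words, sentences, or samples), each concept is described by the elements that activate it most. The function name and the toy activation values below are hypothetical and only illustrate the principle.

```python
def topk_elements_per_concept(activations, elements, k=3):
    """Return, for each concept, the k elements with the highest positive activation.

    activations: list of rows, one per element, each row holding one value per concept
    elements:    list of labels (e.g. tokens) aligned with the rows of `activations`
    """
    n_concepts = len(activations[0])
    result = {}
    for c in range(n_concepts):
        column = [row[c] for row in activations]
        ranked = sorted(zip(elements, column), key=lambda pair: pair[1], reverse=True)
        # Elements that never activate the concept are not informative, so drop them.
        result[c] = [(elem, act) for elem, act in ranked[:k] if act > 0]
    return result

tokens = ["cat", "dog", "paris", "london", "run"]
toy_activations = [
    [0.9, 0.0],
    [0.8, 0.1],
    [0.0, 0.7],
    [0.1, 0.9],
    [0.0, 0.0],
]
top = topk_elements_per_concept(toy_activations, tokens, k=2)
```

With these toy values, concept 0 is described by "cat" and "dog", and concept 1 by "london" and "paris".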

Concept Interpretation Techniques Added Soon:

  • Input-to-concept attribution from dataset examples (Jourdan et al. 2023)
  • Theme prediction via LLMs from top-k tokens/sentences

Concept Interpretation Techniques Added Later:

Concept-to-Output Attribution:

This part will be implemented later, but all the attribution methods presented above will be available here.

Note that only concept methods that provide both an encoder (input to concept) and a decoder (concept to output) can use this feature.
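The need for both directions can be seen in a small sketch: the encoder produces concept activations, and the decoder maps a modified concept vector back to the latent space so the rest of the model can be re-evaluated. Below, PCA computed with an SVD stands in for the concept model and a random linear head stands in for the rest of the network; every name here is hypothetical and ablation is just one possible attribution choice.

```python
import numpy as np

rng = np.random.default_rng(0)
latents = rng.normal(size=(50, 6))
mean = latents.mean(axis=0)

# PCA via SVD: the rows of vt are orthonormal directions; keep 3 as "concepts".
_, _, vt = np.linalg.svd(latents - mean, full_matrices=False)
components = vt[:3]

def encode(h):
    """Encoder: latent vector -> concept activations."""
    return (h - mean) @ components.T

def decode(z):
    """Decoder: concept activations -> latent vector."""
    return z @ components + mean

head = rng.normal(size=6)

def output_from_latent(h):
    """Stand-in for the remainder of the model (a linear head)."""
    return h @ head

def concept_ablation_attribution(h):
    """Score each concept by the output change when it is set to zero."""
    z = encode(h)
    base = output_from_latent(decode(z))
    scores = np.zeros(z.shape[-1])
    for c in range(z.shape[-1]):
        z_ablated = z.copy()
        z_ablated[c] = 0.0
        scores[c] = base - output_from_latent(decode(z_ablated))
    return scores

h = latents[0]
concept_scores = concept_ablation_attribution(h)
```

Because everything in this toy setup is linear, each score equals the concept activation times the projection of its direction onto the head; a method without a decoder could not produce the ablated latent in the first place.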

Specific methods:

[Available later when all parts are implemented] Thanks to this generalization encompassing all concept-based methods and our highly flexible architecture, we can easily obtain a large number of concept-based methods:

Evaluation Metrics

Evaluation Metrics for Attribution

We don't yet have metrics implemented for attribution methods, but that's coming soon!

Evaluation Metrics for Concepts

Several properties of the concept-space are desirable.

List of properties and corresponding metrics:

  • Concept-space faithfulness: use MSE or FID, or define a custom metric through ReconstructionError by specifying a reconstruction_space and a distance_function.

  • Concept-space complexity: the Sparsity and SparsityRatio metrics are available.

  • Concept-space stability: use the Stability metric to compare concept-model dictionaries.
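To make these three properties concrete, here is a minimal NumPy sketch of one possible metric for each. The function names and the exact definitions (mean number of active concepts, its ratio to the dictionary size, and best-match cosine similarity between dictionaries) are assumptions chosen for illustration; they may differ from how Interpreto defines MSE, Sparsity, SparsityRatio, and Stability.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def reconstruction_error(latents, reconstructed, distance_function=mse):
    """Faithfulness: distance between original and reconstructed latents."""
    return distance_function(latents, reconstructed)

def mean_active_concepts(codes):
    """Complexity: average number of non-zero concepts per sample."""
    return float((codes != 0).sum(axis=1).mean())

def active_ratio(codes):
    """Complexity, normalised by the number of concepts."""
    return mean_active_concepts(codes) / codes.shape[1]

def dictionary_stability(dict_a, dict_b):
    """Stability: mean cosine similarity of each concept in dict_a
    to its closest concept in dict_b (rows are concept directions)."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())

toy_codes = np.array([[1.0, 0.0, 0.0, 2.0],
                      [0.0, 0.0, 3.0, 0.0]])
rng = np.random.default_rng(0)
toy_dictionary = rng.normal(size=(4, 8))

n_active = mean_active_concepts(toy_codes)
ratio = active_ratio(toy_codes)
# A dictionary compared with a row-permuted copy of itself is perfectly stable.
stab = dictionary_stability(toy_dictionary, toy_dictionary[::-1])
```

Passing a different callable as distance_function gives a custom faithfulness measure in the same spirit as the ReconstructionError option described above.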

👍 Contributing

Feel free to propose your ideas or contribute to the Interpreto 🪄 toolbox! We have a dedicated document that explains, in simple terms, how to make your first pull request.

👀 See Also

More from the DEEL project:

  • Xplique: a Python library dedicated to explaining neural networks (Images, Time Series, Tabular data) on TensorFlow.
  • Puncc: a Python library for predictive uncertainty quantification using conformal prediction.
  • oodeel: a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already trained neural network image classifiers.
  • deel-lip: a Python library for training k-Lipschitz neural networks on TensorFlow.
  • deel-torchlip: a Python library for training k-Lipschitz neural networks on PyTorch.
  • Influenciae: a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
  • DEEL White paper: a summary by the DEEL team of the challenges of certifiable AI and the role of data quality, representativity and explainability for this purpose.

🙏 Acknowledgments

This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.

👨‍🎓 Creators

Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.

🗞️ Citation

If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper (coming soon):

BibTeX entry coming soon

📝 License

The package is released under the MIT license.