Optimization-based Dictionary Learning

Base abstract class

interpreto.concepts.methods.DictionaryLearningExplainer

DictionaryLearningExplainer(model_with_split_points, *, nb_concepts, split_point=None, device='cpu', **kwargs)

Bases: ConceptAutoEncoderExplainer[BaseOptimDictionaryLearning], Generic[_BODL_co]

Code: concepts/methods/overcomplete.py

Implementation of a concept explainer using an overcomplete.optimization.BaseOptimDictionaryLearning (NMF and PCA variants) as concept_model.
Attributes:

| Name | Type | Description |
|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. |
| split_point | str \| None | The split point used to train the concept_model. |
| concept_model | BaseOptimDictionaryLearning | An Overcomplete BaseOptimDictionaryLearning variant for concept extraction. |
| is_fitted | bool | Whether the concept_model was fit on model activations. |
| has_differentiable_concept_encoder | bool | Whether the encode_activations operation is differentiable. |
| has_differentiable_concept_decoder | bool | Whether the decode_concepts operation is differentiable. |
Examples:

>>> import datasets
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import ICAConcepts
>>> from interpreto.concepts.interpretations import TopKInputs
>>> CLS_TOKEN = ModelWithSplitPoints.activation_granularities.CLS_TOKEN
>>> WORD = ModelWithSplitPoints.activation_granularities.WORD
>>> dataset = datasets.load_dataset("stanfordnlp/imdb")["train"]["text"][:1000]
>>> repo_id = "Qwen/Qwen3-0.6B"
>>> model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained(repo_id)
>>> # 1. Split your model in two parts
>>> splitted_model = ModelWithSplitPoints(
...     model, tokenizer=tokenizer, split_points=[5],
... )
>>> # 2. Compute a dataset of activations
>>> activations = splitted_model.get_activations(
...     dataset, activation_granularity=WORD
... )
>>> # 3. Fit a concept model on the dataset
>>> explainer = ICAConcepts(splitted_model, nb_concepts=20)
>>> explainer.fit(activations)
>>> # 4. Interpret the concepts
>>> interpreter = TopKInputs(
...     concept_explainer=explainer,
...     activation_granularity=WORD,
... )
>>> interpretations = interpreter.interpret(
...     inputs=dataset, latent_activations=activations
... )
>>> # Print the interpretations
>>> for concept_id, words in interpretations.items():
...     print(f"Concept {concept_id}: {list(words.keys())}")
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | required |
| nb_concepts | int | Size of the SAE concept space. | required |
| split_point | str \| None | The split point used to train the concept_model. | None |
| device | device \| str | Device to use for the concept_model. | 'cpu' |
| **kwargs | dict | Additional keyword arguments to pass to the concept_model. | {} |
Source code in interpreto/concepts/methods/overcomplete.py
fit

fit(activations, *, overwrite=False, **kwargs)

Fit an Overcomplete OptimDictionaryLearning model on the given activations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| activations | Tensor \| dict[str, Tensor] | The activations used for fitting the concept_model. | required |
| overwrite | bool | Whether to overwrite the current model if it has already been fitted. | False |
| **kwargs | dict | Additional keyword arguments to pass to the concept_model fit method. | {} |
Source code in interpreto/concepts/methods/overcomplete.py
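A minimal sketch of a fit call, continuing the class example above; overwrite=True is only needed when refitting an explainer that has already been fitted:

>>> explainer.fit(activations)
>>> # Refitting later, e.g. on more data, must be opted into explicitly:
>>> explainer.fit(activations, overwrite=True)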
encode_activations

encode_activations(activations)

Encode the given activations using the concept_model encoder.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| activations | LatentActivations | The activations to encode. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | The encoded concept activations. |
Source code in interpreto/concepts/base.py
decode_concepts

decode_concepts(concepts)

Decode the given concepts using the concept_model decoder.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| concepts | ConceptsActivations | The concepts to decode. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | The decoded model activations. |
Source code in interpreto/concepts/base.py
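A hedged sketch of the encode/decode round trip, assuming latent_activations is the activation tensor for the fitted split point (extracted from the dict returned by get_activations if needed):

>>> concepts = explainer.encode_activations(latent_activations)
>>> reconstruction = explainer.decode_concepts(concepts)
>>> concepts.shape[-1]  # one column per concept, here nb_concepts=20
20
>>> reconstruction.shape == latent_activations.shape
True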
get_dictionary

Get the dictionary learned by the fitted concept_model.

Returns:

| Type | Description |
|---|---|
| Tensor | torch.Tensor: The learned dictionary, with one row per concept giving its direction in the model latent space. |
Source code in interpreto/concepts/base.py
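For example, with the 20-concept explainer fitted above (the row-per-concept layout is an assumption stated in the returns description):

>>> dictionary = explainer.get_dictionary()
>>> dictionary.shape[0]  # one dictionary row per concept
20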
interpret

Deprecated API for concept interpretation.

Interpretation methods should now be instantiated directly with the fitted concept explainer. For example:

TopKInputs(concept_explainer).interpret(inputs, latent_activations)

This method is kept only for backwards compatibility and always raises a NotImplementedError.
Source code in interpreto/concepts/base.py
concept_output_gradient

concept_output_gradient(inputs, targets=None, split_point=None, activation_granularity=TOKEN, aggregation_strategy=MEAN, concepts_x_gradients=True, normalization=True, tqdm_bar=False, batch_size=None)

Compute the gradients of the predictions with respect to the concepts.

To clarify what this function does, let's fix some notation. Suppose the initial model was split such that \(f = g \circ h\). The concept model was then fitted on \(A = h(X)\), with \(X\) a dataset of samples. The resulting concept encoder and decoder are noted \(t\) and \(t^{-1}\); \(t\) can be seen as a projection from the latent space to the concept space. Hence the function going from the inputs to the concepts is \(f_{ic} = t \circ h\), and the function going from the concepts to the outputs is \(f_{co} = g \circ t^{-1}\).

Given a set of samples \(X\) and the functions \((h, t, t^{-1}, g)\), this method first computes \(C = t(A) = t \circ h(X)\), then returns \(\nabla f_{co}(C)\).

In practice all computations are done by ModelWithSplitPoints._get_concept_output_gradients, which relies on NNsight. The current method only forwards \(t\) and \(t^{-1}\), i.e. the self.encode_activations and self.decode_concepts methods.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| inputs | list[str] \| Tensor \| BatchEncoding | The input data, either a list of samples, the tokenized input, or a batch of samples. | required |
| targets | list[int] \| None | Specify which outputs of the model should be used to compute the gradients. Note that \(f_{co}\) often has several outputs; by default, gradients are computed for each output. The t dimension of the returned tensors corresponds to the selected targets. | None |
| split_point | str \| None | The split point used to train the concept_model. | None |
| activation_granularity | ActivationGranularity | The granularity of the activations to use for the attribution. It is highly recommended to use the same granularity as the one used to fit the concept model. | TOKEN |
| aggregation_strategy | GranularityAggregationStrategy | Strategy to aggregate token activations into larger input granularities. Applied for granularities coarser than TOKEN, e.g. WORD or SENTENCE. | MEAN |
| concepts_x_gradients | bool | Whether the resulting gradients should be multiplied by the concept activations. True by default (as is common for attribution methods). The output is then \(C * \nabla f_{co}(C)\). | True |
| normalization | bool | Whether to normalize the gradients. Gradients are normalized on the concept (c) and sequence length (g) dimensions, such that for a given sample-target-granular triple, the sum of the absolute values of the gradients is equal to 1. (The granular elements depend on the activation_granularity argument.) | True |
| tqdm_bar | bool | Whether to display a progress bar. | False |
| batch_size | int \| None | Batch size for the model. It might be different from the one used to compute the activations. | None |

Returns:

| Type | Description |
|---|---|
| list[Float[Tensor, 't g c']] | The gradients of the model output with respect to the concept activations. List length corresponds to the number of inputs. Tensor shape: (t, g, c), with t the target dimension, g the number of granularity elements in one input, and c the number of concepts. |
Source code in interpreto/concepts/base.py
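A hedged usage sketch, continuing the class example above; argument names follow the signature, and WORD is the granularity enum defined earlier:

>>> gradients = explainer.concept_output_gradient(
...     dataset[:8],
...     activation_granularity=WORD,  # match the granularity used at fit time
... )
>>> len(gradients)  # one tensor of shape (t, g, c) per input
8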
List of available methods

interpreto.concepts.ConvexNMFConcepts

ConvexNMFConcepts(model_with_split_points, *, nb_concepts, split_point=None, device='cpu', **kwargs)

Bases: DictionaryLearningExplainer[ConvexNMF]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with the ConvexNMF from Ding et al. (2008)[^1] as concept model. ConvexNMF implementation from the overcomplete.optimization.ConvexNMF class.

[^1]: C. H. Q. Ding, T. Li and M. I. Jordan, Convex and Semi-Nonnegative Matrix Factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 2010, pp. 45-55.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | required |
| nb_concepts | int | Size of the SAE concept space. | required |
| split_point | str \| None | The split point used to train the concept_model. | None |
| device | device \| str | Device to use for the concept_model. | 'cpu' |
| **kwargs | dict | Additional keyword arguments to pass to the concept_model. | {} |
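Usage mirrors the base class example; only the explainer class changes (a sketch reusing splitted_model and activations from above):

>>> from interpreto.concepts import ConvexNMFConcepts
>>> explainer = ConvexNMFConcepts(splitted_model, nb_concepts=20)
>>> explainer.fit(activations)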
interpreto.concepts.DictionaryLearningConcepts

DictionaryLearningConcepts(model_with_split_points, *, nb_concepts, split_point=None, device='cpu', **kwargs)

Bases: DictionaryLearningExplainer[SkDictionaryLearning]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with the Dictionary Learning concepts from Mairal et al. (2009)[^2] as concept model. Dictionary Learning implementation from the overcomplete.optimization.SkDictionaryLearning class.

[^2]: J. Mairal, F. Bach, J. Ponce and G. Sapiro, Online dictionary learning for sparse coding. Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 689-696.
interpreto.concepts.ICAConcepts

Bases: SkLearnWrapperExplainer[ICAWrapper]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with the ICA from Hyvarinen and Oja (2000)[^3] as concept model.

[^3]: A. Hyvarinen and E. Oja, Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5), 2000, pp. 411-430.
interpreto.concepts.KMeansConcepts

Bases: SkLearnWrapperExplainer[KMeansWrapper]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with K-Means as concept model.
interpreto.concepts.NMFConcepts

NMFConcepts(model_with_split_points, *, nb_concepts, split_point=None, device='cpu', force_relu=False, **kwargs)

Bases: DictionaryLearningExplainer[NMF]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with the NMF from Lee and Seung (1999)[^4] as concept model. NMF implementation from the overcomplete.optimization.NMF class.

[^4]: D. Lee and H. Seung, Learning the parts of objects by non-negative matrix factorization. Nature, 401, 1999, pp. 788-791.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | required |
| nb_concepts | int | Size of the SAE concept space. | required |
| split_point | str \| None | The split point used to train the concept_model. | None |
| device | device \| str | Device to use for the concept_model. | 'cpu' |
| force_relu | bool | Whether to force the activations to be positive. | False |
| **kwargs | dict | Additional keyword arguments to pass to the concept_model. | {} |
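NMF requires non-negative inputs, while transformer activations are generally signed, hence the force_relu option; a sketch reusing splitted_model and activations from the base class example:

>>> from interpreto.concepts import NMFConcepts
>>> # force_relu=True clips negative latent activations to zero
>>> # so that the non-negative factorization is well-posed.
>>> explainer = NMFConcepts(splitted_model, nb_concepts=20, force_relu=True)
>>> explainer.fit(activations)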
encode_activations

Encode the given activations using the concept_model encoder.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| activations | LatentActivations | The activations to encode. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | The encoded concept activations. |
interpreto.concepts.PCAConcepts

Bases: SkLearnWrapperExplainer[PCAWrapper]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with the PCA from Pearson (1901)[^5] as concept model.

[^5]: K. Pearson, On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 1901, pp. 559-572.
interpreto.concepts.SemiNMFConcepts

Bases: DictionaryLearningExplainer[SemiNMF]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with the SemiNMF from Ding et al. (2008)[^1] as concept model. SemiNMF implementation from the overcomplete.optimization.SemiNMF class.
interpreto.concepts.SparsePCAConcepts

SparsePCAConcepts(model_with_split_points, *, nb_concepts, split_point=None, device='cpu', **kwargs)

Bases: DictionaryLearningExplainer[SkSparsePCA]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with SparsePCA as concept model. SparsePCA implementation from the overcomplete.optimization.SkSparsePCA class.
interpreto.concepts.SVDConcepts

Bases: SkLearnWrapperExplainer[SVDWrapper]

Code: concepts/methods/overcomplete.py

ConceptAutoEncoderExplainer with SVD as concept model.