Base Classes

interpreto.concepts.ConceptEncoderExplainer
ConceptEncoderExplainer(model_with_split_points, concept_model, split_point=None)
Bases: ABC, Generic[ConceptModel]
Code: concepts/base.py
Abstract class defining an interface for concept explanation.
Child classes should implement the fit and encode_activations methods, and only assume the presence of an
encoding step using the concept_model to convert activations to latent concepts.
Attributes:

| Name | Type | Description |
|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. |
| split_point | str | The split point used to train the concept_model. |
| concept_model | ConceptModelProtocol | The model used to extract concepts from the activations of model_with_split_points. |
| is_fitted | bool | Whether the concept_model was fitted on model activations. |
| has_differentiable_concept_encoder | bool | Whether the encode_activations operation is differentiable. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | required |
| concept_model | ConceptModelProtocol | The model used to extract concepts from the activations of model_with_split_points. | required |
| split_point | str \| None | The split point used to train the concept_model. | None |
fit abstractmethod
fit(activations, *args, **kwargs)
Fits concept_model on the given activations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| activations | Tensor \| dict[str, Tensor] | A dictionary with model paths as keys and the corresponding tensors as values. | required |
Returns:

| Type | Description |
|---|---|
| Any | |
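A hedged usage sketch: activations typically come from the ModelWithSplitPoints wrapper before being passed to fit. The `get_activations` call below is an assumption based on the description above, not a verified signature.

```python
# Sketch: fit the concept model on activations collected at the split point.
# `mwsp` is a ModelWithSplitPoints instance; `get_activations` is assumed.
activations = mwsp.get_activations(dataset)  # dict: model path -> Tensor
explainer.fit(activations)
assert explainer.is_fitted
```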
interpret
Deprecated API for concept interpretation.
Interpretation methods should now be instantiated directly with the fitted concept explainer. For example:
TopKInputs(concept_explainer).interpret(inputs, latent_activations)
This method is kept only for backwards compatibility and will always raise a NotImplementedError.
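As a concrete snippet, the migration looks like the sketch below; the import path for TopKInputs is an assumption, and the variable names are placeholders.

```python
# New-style interpretation, replacing the deprecated `interpret` method.
from interpreto.concepts.interpretations import TopKInputs  # import path assumed

interpretation = TopKInputs(concept_explainer).interpret(inputs, latent_activations)
```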
input_concept_attribution
input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)
Computes attribution scores of model inputs for a selected concept.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| inputs | ModelInputs | The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset. | required |
| concept | int | Index identifying the position of the concept of interest (score in the concept activations) to analyze. | required |
| attribution_method | type[AttributionExplainer] | The attribution method to obtain importance scores for input elements. | required |
Returns:

| Type | Description |
|---|---|
| list[float] | A list of attribution scores for each input. |
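A usage sketch follows; the `Saliency` explainer stands in for any AttributionExplainer subclass, and its import path is an assumption.

```python
# Sketch: attribute each input element for concept index 42.
from interpreto.attributions import Saliency  # import path assumed

scores = explainer.input_concept_attribution(
    inputs="The movie was surprisingly good.",
    concept=42,
    attribution_method=Saliency,
)
# `scores` is a list[float] with one attribution score per input element.
```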
interpreto.concepts.ConceptAutoEncoderExplainer
ConceptAutoEncoderExplainer(model_with_split_points, concept_model, split_point=None)
Bases: ConceptEncoderExplainer[BaseDictionaryLearning], Generic[BDL]
Code: concepts/base.py
A concept bottleneck explainer wraps a concept_model that should be able to encode activations into concepts
and decode concepts into activations.
We use the term "concept bottleneck" loosely, as the latent space can be overcomplete compared to activation space, as in the case of sparse autoencoders.
We assume that the concept model follows the structure of an overcomplete.BaseDictionaryLearning
model, which defines the encode and decode methods for encoding and decoding activations into concepts.
Attributes:

| Name | Type | Description |
|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. |
| split_point | str | The split point used to train the concept_model. |
| concept_model | [BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10) | The model used to extract concepts from the activations of model_with_split_points. |
| is_fitted | bool | Whether the concept_model was fitted on model activations. |
| has_differentiable_concept_encoder | bool | Whether the encode_activations operation is differentiable. |
| has_differentiable_concept_decoder | bool | Whether the decode_concepts operation is differentiable. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_with_split_points | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | required |
| concept_model | [BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10) | The model used to extract concepts from the activations of model_with_split_points. | required |
| split_point | str \| None | The split point used to train the concept_model. | None |
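Concretely, the concept model is usually an overcomplete dictionary-learning model such as a sparse autoencoder. A construction sketch follows; the SAE class name, its constructor arguments, and the split point string are assumptions for illustration.

```python
# Construction sketch, assuming an overcomplete SAE as the concept model.
from overcomplete.sae import SAE  # class and arguments assumed

concept_model = SAE(input_shape=768, nb_concepts=16 * 768)  # overcomplete latent space
explainer = ConceptAutoEncoderExplainer(
    model_with_split_points=mwsp,        # a ModelWithSplitPoints instance
    concept_model=concept_model,
    split_point="bert.encoder.layer.6",  # hypothetical split point name
)
```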
fit abstractmethod
fit(activations, *args, **kwargs)
Fits concept_model on the given activations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| activations | Tensor \| dict[str, Tensor] | A dictionary with model paths as keys and the corresponding tensors as values. | required |
Returns:

| Type | Description |
|---|---|
| Any | |
encode_activations
encode_activations(activations)
Encode the given activations using the concept_model encoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| activations | LatentActivations | The activations to encode. | required |
Returns:

| Type | Description |
|---|---|
| Tensor | The encoded concept activations. |
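Sketch (the variable names and the `get_activations` access are assumptions):

```python
# Project latent activations into the concept space.
latent = mwsp.get_activations(texts)[explainer.split_point]  # (n, d_model), assumed
concepts = explainer.encode_activations(latent)              # (n, n_concepts)
```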
decode_concepts
decode_concepts(concepts)
Decode the given concepts using the concept_model decoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| concepts | ConceptsActivations | The concepts to decode. | required |
Returns:

| Type | Description |
|---|---|
| Tensor | The decoded model activations. |
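Together with encode_activations, this enables a reconstruction round-trip, a quick sanity check for a fitted concept model (sketch with placeholder variables):

```python
# Round-trip: encode activations, decode them back, measure reconstruction error.
import torch

concepts = explainer.encode_activations(latent)
reconstruction = explainer.decode_concepts(concepts)
error = torch.nn.functional.mse_loss(reconstruction, latent)
print(f"reconstruction MSE: {error.item():.4f}")
```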
get_dictionary

get_dictionary()

Get the dictionary learned by the fitted concept_model.
Returns:

| Type | Description |
|---|---|
| Tensor | torch.Tensor: The dictionary learned by the concept_model, with one direction in the model activation space per concept. |
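For instance, the dictionary rows can be compared against activation vectors (sketch; the (n_concepts, d_model) shape is assumed from the dictionary-learning setting):

```python
import torch

dictionary = explainer.get_dictionary()  # assumed shape: (n_concepts, d_model)
# Find the five concept directions best aligned with one latent activation.
sims = torch.nn.functional.cosine_similarity(latent[:1], dictionary, dim=-1)
top_concepts = sims.topk(5).indices
```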
interpret
Deprecated API for concept interpretation.
Interpretation methods should now be instantiated directly with the fitted concept explainer. For example:
TopKInputs(concept_explainer).interpret(inputs, latent_activations)
This method is kept only for backwards compatibility and will always raise a NotImplementedError.
concept_output_gradient
concept_output_gradient(inputs, targets=None, split_point=None, activation_granularity=TOKEN, aggregation_strategy=MEAN, concepts_x_gradients=True, normalization=True, tqdm_bar=False, batch_size=None)
Compute the gradients of the predictions with respect to the concepts.
To clarify what this function does, let's detail some notation. Suppose the initial model was split such that \(f = g \circ h\). The concept model was then fitted on \(A = h(X)\), with \(X\) a dataset of samples. The resulting concept model encoder and decoder are noted \(t\) and \(t^{-1}\); \(t\) can be seen as a projection from the latent space to the concept space. Hence the function going from the inputs to the concepts is \(f_{ic} = t \circ h\), and the function going from the concepts to the outputs is \(f_{co} = g \circ t^{-1}\).

Given a set of samples \(X\) and the functions \((h, t, t^{-1}, g)\), this function first computes \(C = t(A) = t \circ h(X)\), then returns \(\nabla f_{co}(C)\).
In practice, all computations are done by ModelWithSplitPoints._get_concept_output_gradients, which relies on NNsight. The current method only forwards \(t\) and \(t^{-1}\), i.e. the self.encode_activations and self.decode_concepts methods, respectively.
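In pseudo-code, the decomposition reads as follows (a notational sketch, not library code):

```python
# Notation from above as pseudo-code: f = g ∘ h is the split model,
# t / t_inv are encode_activations / decode_concepts.
A = h(X)                # latent activations at the split point
C = t(A)                # concept activations, i.e. f_ic = t ∘ h

def f_co(C):            # concepts -> outputs, i.e. g ∘ t^{-1}
    return g(t_inv(C))

# concept_output_gradient returns ∇f_co(C), multiplied elementwise by C
# when concepts_x_gradients=True.
```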
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| inputs | list[str] \| Tensor \| BatchEncoding | The input data, either a list of samples, the tokenized input, or a batch of samples. | required |
| targets | list[int] \| None | Specifies which outputs of the model should be used to compute the gradients. Note that \(f_{co}\) often has several outputs; by default, gradients are computed for each output. | None |
| split_point | str \| None | The split point used to train the concept_model. | None |
| activation_granularity | ActivationGranularity | The granularity of the activations to use for the attribution. It is highly recommended to use the same granularity as the one used to fit the concept_model. | TOKEN |
| aggregation_strategy | GranularityAggregationStrategy | Strategy to aggregate token activations into larger input granularities. Applied for granularities coarser than tokens. | MEAN |
| concepts_x_gradients | bool | Whether the resulting gradients should be multiplied by the concept activations. True by default (similarly to gradient-times-input attributions). The output is then \(C * \nabla f_{co}(C)\). | True |
| normalization | bool | Whether to normalize the gradients. Gradients are normalized over the concept (c) and sequence length (g) dimensions, such that for a given sample-target-granular pair, the sum of the absolute values of the gradients is equal to 1. (The granular elements depend on the activation_granularity argument.) | True |
| tqdm_bar | bool | Whether to display a progress bar. | False |
| batch_size | int \| None | Batch size for the model. It might differ from the one used to compute the activations. | None |
Returns:

| Type | Description |
|---|---|
| list[Float[Tensor, 't g c']] | The gradients of the model output with respect to the concept activations. The list length corresponds to the number of inputs. Each tensor has shape (t, g, c), with t the target dimension, g the number of granularity elements in one input, and c the number of concepts. |
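A usage sketch with default granularity and aggregation (placeholder variable names):

```python
# Sketch: gradients of model outputs with respect to concept activations.
grads = explainer.concept_output_gradient(
    inputs=["The movie was surprisingly good."],
    targets=None,      # None: compute gradients for every model output
    batch_size=8,
)
per_input = grads[0]   # shape (t, g, c): targets x granular elements x concepts
```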