Base Classes
interpreto.concepts.ConceptEncoderExplainer
ConceptEncoderExplainer(model_with_split_points, concept_model, split_point=None)

Bases: ABC, Generic[ConceptModel]

Code: concepts/base.py
Abstract class defining an interface for concept explanation. Child classes should implement the `fit` and `encode_activations` methods, and assume only the presence of an encoding step that uses the `concept_model` to convert activations into latent concepts.
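To make the contract concrete, here is a minimal sketch of a subclass. It assumes only the interface documented on this page; `IdentityConceptModel` is a hypothetical stand-in for a real concept model, and the actual base class may require more than these two methods.

```python
import torch

from interpreto.concepts import ConceptEncoderExplainer


class IdentityConceptModel:
    """Hypothetical concept model whose encoding step is the identity:
    every neuron of the split activation is treated as one concept."""

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        return activations


class PassthroughExplainer(ConceptEncoderExplainer):
    def fit(self, activations, *args, **kwargs):
        # Nothing to learn for this toy concept model.
        return None

    def encode_activations(self, activations):
        # Delegate to the wrapped concept model, as the base class assumes.
        return self.concept_model.encode(activations)
```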
Attributes:

| Name | Type | Description |
|---|---|---|
| `model_with_split_points` | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. |
| `split_point` | str | The split point used to train the `concept_model`. |
| `concept_model` | ConceptModelProtocol | The model used to extract concepts from the activations of `model_with_split_points`. |
| `is_fitted` | bool | Whether the `concept_model` was fitted on model activations. |
| `has_differentiable_concept_encoder` | bool | Whether the `encode_activations` operation is differentiable. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_with_split_points` | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | required |
| `concept_model` | ConceptModelProtocol | The model used to extract concepts from the activations of `model_with_split_points`. | required |
| `split_point` | str \| None | The split point used to train the `concept_model`. | `None` |
Source code in interpreto/concepts/base.py
fit (abstractmethod)

fit(activations, *args, **kwargs)

Fits the `concept_model` on the given activations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `activations` | Tensor \| dict[str, Tensor] | A dictionary with model paths as keys and the corresponding tensors as values. | required |
Returns:

| Type | Description |
|---|---|
| Any | |
Source code in interpreto/concepts/base.py
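A usage sketch follows, with random tensors standing in for real split-point activations; `explainer` is assumed to be an instance of a concrete subclass, and the split-point path and shapes are placeholders.

```python
import torch

# Activations keyed by split-point path, matching the documented
# `Tensor | dict[str, Tensor]` parameter.
activations = {"bert.encoder.layer.5": torch.randn(128, 768)}
explainer.fit(activations)
assert explainer.is_fitted
```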
interpret

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interpret the concept dimensions of the latent space in a human-readable format. The interpretation is a mapping between concept indices and an object that allows interpreting them; it can be a label, a description, examples, etc.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `interpretation_method` | type[BaseConceptInterpretationMethod] | The interpretation method to use to interpret the concepts. | required |
| `concepts_indices` | int \| list[int] \| Literal['all'] | The indices of the concepts to interpret. If "all", all concepts are interpreted. | required |
| `inputs` | list[str] \| None | The inputs to use for the interpretation. Whether they are required depends on the interpretation method's source. | `None` |
| `latent_activations` | LatentActivations \| dict[str, LatentActivations] \| None | The latent activations to use for the interpretation. Whether they are required depends on the interpretation method's source. | `None` |
| `concepts_activations` | ConceptsActivations \| None | The concept activations to use for the interpretation. Whether they are required depends on the interpretation method's source. | `None` |
| `**kwargs` | | Additional keyword arguments to pass to the interpretation method. | `{}` |
Returns:

| Type | Description |
|---|---|
| Mapping[int, Any] | A mapping between the concept indices and the interpretation of the concepts. |
Source code in interpreto/concepts/base.py
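A sketch of a typical call; `SomeInterpretationMethod` is a placeholder for a concrete `BaseConceptInterpretationMethod` subclass, not a confirmed interpreto class name.

```python
interpretations = explainer.interpret(
    interpretation_method=SomeInterpretationMethod,  # placeholder class
    concepts_indices=[0, 1, 2],
    inputs=["The movie was great.", "The plot was dull."],
)
for index, interpretation in interpretations.items():
    print(index, interpretation)  # e.g. a label, description, or examples
```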
input_concept_attribution

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Attributes model inputs for a selected concept, i.e. computes an importance score for each input element with respect to that concept.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | ModelInputs | The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset. | required |
| `concept` | int | Index identifying the position of the concept of interest (score in the `concept_model` output). | required |
| `attribution_method` | type[AttributionExplainer] | The attribution method to obtain importance scores for input elements. | required |
Returns:

| Type | Description |
|---|---|
| list[float] | A list of attribution scores for each input. |
Source code in interpreto/concepts/base.py
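For example, attributing inputs to a single concept could look like the sketch below, where `SomeAttributionMethod` is a placeholder for a concrete `AttributionExplainer` subclass.

```python
# One attribution score per input element for concept 42.
scores = explainer.input_concept_attribution(
    inputs="The movie was great.",
    concept=42,
    attribution_method=SomeAttributionMethod,  # placeholder class
)
print(scores)
```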
interpreto.concepts.ConceptAutoEncoderExplainer
ConceptAutoEncoderExplainer(model_with_split_points, concept_model, split_point=None)

Bases: ConceptEncoderExplainer[BaseDictionaryLearning], Generic[BDL]

Code: concepts/base.py
A concept bottleneck explainer wraps a `concept_model` that should be able to encode activations into concepts and decode concepts into activations. We use the term "concept bottleneck" loosely, as the latent space can be overcomplete compared to the activation space, as in the case of sparse autoencoders. We assume that the concept model follows the structure of an `overcomplete.BaseDictionaryLearning` model, which defines the `encode` and `decode` methods for converting activations into concepts and back.
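The assumed contract can be summarized with the sketch below; the real `BaseDictionaryLearning` class (linked in the table that follows) has more members, and the shape conventions shown are assumptions.

```python
import torch


class DictionaryLearningLike:
    """Sketch of the encode/decode contract assumed for concept models."""

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        """Map activations, e.g. (n, d_model), to concepts, e.g. (n, nb_concepts)."""
        raise NotImplementedError

    def decode(self, concepts: torch.Tensor) -> torch.Tensor:
        """Map concepts back to reconstructed activations."""
        raise NotImplementedError
```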
Attributes:

| Name | Type | Description |
|---|---|---|
| `model_with_split_points` | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. |
| `split_point` | str | The split point used to train the `concept_model`. |
| `concept_model` | [BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10) | The model used to extract concepts from the activations of `model_with_split_points`. |
| `is_fitted` | bool | Whether the `concept_model` was fitted on model activations. |
| `has_differentiable_concept_encoder` | bool | Whether the `encode_activations` operation is differentiable. |
| `has_differentiable_concept_decoder` | bool | Whether the `decode_concepts` operation is differentiable. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_with_split_points` | ModelWithSplitPoints | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | required |
| `concept_model` | [BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10) | The model used to extract concepts from the activations of `model_with_split_points`. | required |
| `split_point` | str \| None | The split point used to train the `concept_model`. | `None` |
Source code in interpreto/concepts/base.py
fit (abstractmethod)

fit(activations, *args, **kwargs)

Fits the `concept_model` on the given activations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `activations` | Tensor \| dict[str, Tensor] | A dictionary with model paths as keys and the corresponding tensors as values. | required |
Returns:

| Type | Description |
|---|---|
| Any | |
Source code in interpreto/concepts/base.py
encode_activations

encode_activations(activations)

Encode the given activations using the `concept_model` encoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `activations` | LatentActivations | The activations to encode. | required |
Returns:

| Type | Description |
|---|---|
| Tensor | The encoded concept activations. |
Source code in interpreto/concepts/base.py
decode_concepts

decode_concepts(concepts)

Decode the given concepts using the `concept_model` decoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `concepts` | ConceptsActivations | The concepts to decode. | required |
Returns:

| Type | Description |
|---|---|
| Tensor | The decoded model activations. |
Source code in interpreto/concepts/base.py
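Together, `encode_activations` and `decode_concepts` support a reconstruction roundtrip; a sketch with illustrative shapes and a fitted explainer assumed:

```python
import torch

latent = torch.randn(128, 768)                     # (n_tokens, d_model)
concepts = explainer.encode_activations(latent)    # (n_tokens, nb_concepts)
reconstruction = explainer.decode_concepts(concepts)

# Reconstruction error is a common sanity check for concept models.
error = torch.nn.functional.mse_loss(reconstruction, latent)
print(f"reconstruction MSE: {error.item():.4f}")
```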
get_dictionary

get_dictionary()

Get the dictionary learned by the fitted `concept_model`.
Returns:

| Type | Description |
|---|---|
| Tensor | The dictionary learned by the `concept_model`. |
Source code in interpreto/concepts/base.py
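As a sketch, each entry of the dictionary can be read as a concept direction in activation space; the row-per-concept shape convention below is an assumption.

```python
import torch

dictionary = explainer.get_dictionary()        # assumed shape (nb_concepts, d_model)
direction = dictionary[42]                     # direction of concept 42
activation = torch.randn(dictionary.shape[1])  # stand-in activation vector
alignment = direction @ activation             # unnormalized alignment score
```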
interpret

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interpret the concept dimensions of the latent space in a human-readable format. The interpretation is a mapping between concept indices and an object that allows interpreting them; it can be a label, a description, examples, etc.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `interpretation_method` | type[BaseConceptInterpretationMethod] | The interpretation method to use to interpret the concepts. | required |
| `concepts_indices` | int \| list[int] \| Literal['all'] | The indices of the concepts to interpret. If "all", all concepts are interpreted. | required |
| `inputs` | list[str] \| None | The inputs to use for the interpretation. Whether they are required depends on the interpretation method's source. | `None` |
| `latent_activations` | LatentActivations \| dict[str, LatentActivations] \| None | The latent activations to use for the interpretation. Whether they are required depends on the interpretation method's source. | `None` |
| `concepts_activations` | ConceptsActivations \| None | The concept activations to use for the interpretation. Whether they are required depends on the interpretation method's source. | `None` |
| `**kwargs` | | Additional keyword arguments to pass to the interpretation method. | `{}` |
Returns:

| Type | Description |
|---|---|
| Mapping[int, Any] | A mapping between the concept indices and the interpretation of the concepts. |
Source code in interpreto/concepts/base.py
input_concept_attribution

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Attributes model inputs for a selected concept, i.e. computes an importance score for each input element with respect to that concept.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | ModelInputs | The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset. | required |
| `concept` | int | Index identifying the position of the concept of interest (score in the `concept_model` output). | required |
| `attribution_method` | type[AttributionExplainer] | The attribution method to obtain importance scores for input elements. | required |
Returns:

| Type | Description |
|---|---|
| list[float] | A list of attribution scores for each input. |
Source code in interpreto/concepts/base.py
concept_output_attribution

concept_output_attribution(inputs, concepts, target, attribution_method, **attribution_kwargs)

Computes the attribution of each concept for the logit of a target output element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | ModelInputs | An input data-point for the model. | required |
| `concepts` | Tensor | Concept activation tensor. | required |
| `target` | int | The target class for which the concept output attribution should be computed. | required |
| `attribution_method` | type[AttributionExplainer] | The attribution method to obtain importance scores for input elements. | required |
Returns:

| Type | Description |
|---|---|
| list[float] | A list of attribution scores for each concept. |
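A call could look like the sketch below, with `SomeAttributionMethod` again a placeholder for a concrete `AttributionExplainer` subclass and shapes illustrative.

```python
import torch

concepts = explainer.encode_activations(torch.randn(1, 768))
scores = explainer.concept_output_attribution(
    inputs="The movie was great.",
    concepts=concepts,
    target=1,                                  # logit index of interest
    attribution_method=SomeAttributionMethod,  # placeholder class
)
print(scores)  # one attribution score per concept
```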