Base Classes

interpreto.concepts.ConceptEncoderExplainer

ConceptEncoderExplainer(model_with_split_points, concept_model, split_point=None)

Bases: ABC, Generic[ConceptModel]

Code: concepts/base.py

Abstract class defining an interface for concept explanation. Child classes should implement the fit and encode_activations methods, and only assume the presence of an encoding step using the concept_model to convert activations to latent concepts.

Attributes:

Name	Type	Description
`model_with_split_points`	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which `concept_model` can be fitted.
`split_point`	`str`	The split point used to train the `concept_model`.
`concept_model`	`ConceptModelProtocol`	The model used to extract concepts from the activations of `model_with_split_points`. The only assumption for classes inheriting from this class is that the `concept_model` can encode activations into concepts with `encode_activations`. The `ConceptModelProtocol` is defined in `interpreto.typing`. It is basically a `torch.nn.Module` with an `encode` method.
`is_fitted`	`bool`	Whether the `concept_model` was fit on model activations.
`has_differentiable_concept_encoder`	`bool`	Whether the `encode_activations` operation is differentiable.

Parameters:

Name	Type	Description	Default
`model_with_split_points`	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained.	required
`concept_model`	`ConceptModelProtocol`	The model used to extract concepts from the activations of `model_with_split_points`. The `ConceptModelProtocol` is defined in `interpreto.typing`. It is basically a `torch.nn.Module` with an `encode` method.	required
`split_point`	`str \| None`	The split point used to train the `concept_model`. If `None`, tries to use the split point of `model_with_split_points` if a single one is defined.	`None`

Source code in interpreto/concepts/base.py

def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    concept_model: ConceptModelProtocol,
    split_point: str | None = None,
):
    """Initializes the concept explainer with a given split model.

    Args:
        model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on.
            It should have at least one split point on which a concept explainer can be trained.
        concept_model (ConceptModelProtocol): The model used to extract concepts from
            the activations of `model_with_split_points`.
            The `ConceptModelProtocol` is defined in `interpreto.typing`. It is basically a `torch.nn.Module` with an `encode` method.
        split_point (str | None): The split point used to train the `concept_model`. If None, tries to use the
            split point of `model_with_split_points` if a single one is defined.
    """
    if not isinstance(model_with_split_points, ModelWithSplitPoints):
        raise TypeError(
            f"The given model should be a ModelWithSplitPoints, but {type(model_with_split_points)} was given."
        )
    self.model_with_split_points: ModelWithSplitPoints = model_with_split_points
    self._concept_model = concept_model
    self.split_point = split_point  # Verified by `split_point.setter`
    self.__is_fitted: bool = False
    self.has_differentiable_concept_encoder = False
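
The fallback described for `split_point=None` can be pictured with the sketch below. It is only an illustration under stated assumptions: `resolve_split_point` and its `available` argument are hypothetical names, it does not reproduce the library's `split_point` setter (which is not shown on this page), and the real error types and messages may differ.

```python
from __future__ import annotations


def resolve_split_point(split_point: str | None, available: list[str]) -> str:
    """Pick the split point to train on (hypothetical helper, for illustration only)."""
    if split_point is None:
        # Fall back to the model's own split point only when the choice is unambiguous.
        if len(available) != 1:
            raise ValueError(
                f"split_point must be given explicitly: the model defines {len(available)} split points."
            )
        return available[0]
    if split_point not in available:
        raise ValueError(f"Unknown split point {split_point!r}; expected one of {available}.")
    return split_point
```

With a single split point the argument can be omitted; with several, it has to be named explicitly.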

fit `abstractmethod`

fit(activations, *args, **kwargs)

Fits concept_model on the given activations.

Parameters:

Name	Type	Description	Default
`activations`	`Tensor \| dict[str, Tensor]`	The activations to fit on: either a single tensor or a dictionary with model paths as keys and the corresponding tensors as values.	required

Returns:

Type	Description
`Any`	`None`; `concept_model` is fitted in place, `is_fitted` is set to `True`, and `split_point` is set.

Source code in interpreto/concepts/base.py

@abstractmethod
def fit(self, activations: LatentActivations | dict[str, LatentActivations], *args, **kwargs) -> Any:
    """Fits `concept_model` on the given activations.

    Args:
        activations (torch.Tensor | dict[str, torch.Tensor]): A dictionary with model paths as keys and the corresponding
            tensors as values.

    Returns:
        `None`, `concept_model` is fitted in-place, `is_fitted` is set to `True` and `split_point` is set.
    """
    pass
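
The contract between `fit` and the `@check_fitted` methods can be sketched as follows. This is a simplified stand-in, not the library's code: the `check_fitted` below is a hypothetical re-implementation of the decorator seen in the listings, `ToyEncoderExplainer` and `MeanCenteringExplainer` are invented for illustration, and activations are plain nested lists rather than tensors.

```python
from abc import ABC, abstractmethod
from functools import wraps


def check_fitted(method):
    """Hypothetical stand-in for the decorator of the same name: refuse to run before fit()."""

    @wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self.is_fitted:
            raise RuntimeError(f"Call fit() before {method.__name__}().")
        return method(self, *args, **kwargs)

    return wrapper


class ToyEncoderExplainer(ABC):
    """Toy base class mirroring the abstract interface: fit, then encode."""

    def __init__(self) -> None:
        self.is_fitted = False

    @abstractmethod
    def fit(self, activations): ...

    @abstractmethod
    def encode_activations(self, activations): ...


class MeanCenteringExplainer(ToyEncoderExplainer):
    """Toy child class whose 'concepts' are simply mean-centered activations."""

    def fit(self, activations):
        n = len(activations)
        # Fit in place: store the per-dimension mean and flip the fitted flag.
        self.mean = [sum(column) / n for column in zip(*activations)]
        self.is_fitted = True

    @check_fitted
    def encode_activations(self, activations):
        return [[x - m for x, m in zip(row, self.mean)] for row in activations]
```

Calling `encode_activations` on a fresh `MeanCenteringExplainer` raises; after `fit`, the same call succeeds.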

interpret

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interprets the concept dimensions of the latent space in a human-readable format. The interpretation is a mapping from concept indices to an object that helps interpret them, such as a label, a description, or examples.

Parameters:

Name	Type	Description	Default
`interpretation_method`	`type[BaseConceptInterpretationMethod]`	The interpretation method to use to interpret the concepts.	required
`concepts_indices`	`int \| list[int] \| Literal['all']`	The indices of the concepts to interpret. If "all", all concepts are interpreted.	required
`inputs`	`list[str] \| None`	The inputs to use for the interpretation. Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.	`None`
`latent_activations`	`LatentActivations \| dict[str, LatentActivations] \| None`	The latent activations to use for the interpretation. Necessary if the source is `LATENT_ACTIVATIONS`. Otherwise, they are computed from the inputs, or ignored if the source is `CONCEPT_ACTIVATIONS`.	`None`
`concepts_activations`	`ConceptsActivations \| None`	The concept activations to use for the interpretation. Necessary if the source is `CONCEPT_ACTIVATIONS`. Otherwise, they are computed from the latent activations.	`None`
`**kwargs`		Additional keyword arguments to pass to the interpretation method.	`{}`

Returns:

Type	Description
`Mapping[int, Any]`	A mapping between the concepts indices and the interpretation of the concepts.

Source code in interpreto/concepts/base.py

@check_fitted
def interpret(
    self,
    interpretation_method: type[BaseConceptInterpretationMethod],
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: dict[str, LatentActivations] | LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
    **kwargs,
) -> Mapping[int, Any]:
    """
    Interpret the concepts dimensions in the latent space into a human-readable format.
    The interpretation is a mapping between the concepts indices and an object allowing to interpret them.
    It can be a label, a description, examples, etc.

    Args:
        interpretation_method: The interpretation method to use to interpret the concepts.
        concepts_indices (int | list[int] | Literal["all"]): The indices of the concepts to interpret.
            If "all", all concepts are interpreted.
        inputs (list[str] | None): The inputs to use for the interpretation.
            Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.
        latent_activations (LatentActivations | dict[str, LatentActivations] | None): The latent activations to use for the interpretation.
            Necessary if the source is `LATENT_ACTIVATIONS`.
            Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.
        concepts_activations (ConceptsActivations | None): The concepts activations to use for the interpretation.
            Necessary if the source is `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.
        **kwargs: Additional keyword arguments to pass to the interpretation method.

    Returns:
        Mapping[int, Any]: A mapping between the concepts indices and the interpretation of the concepts.
    """
    if concepts_indices == "all":
        concepts_indices = list(range(self.concept_model.nb_concepts))

    # verify
    if latent_activations is not None:
        split_latent_activations = self._sanitize_activations(latent_activations)
    else:
        split_latent_activations = None

    # initialize the interpretation method
    method = interpretation_method(
        model_with_split_points=self.model_with_split_points,
        split_point=self.split_point,
        concept_model=self.concept_model,
        **kwargs,
    )

    # compute the interpretation from inputs and activations
    return method.interpret(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=split_latent_activations,
        concepts_activations=concepts_activations,
    )
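
To make the returned mapping concrete, here is a minimal sketch of one possible interpretation: for each requested concept, keep the inputs that activate it most. It is not one of the library's interpretation methods. `normalize_indices` and `top_inputs` are hypothetical helpers, activations are plain nested lists, and wrapping a single `int` index into a list is an assumption (the listing above only shows the expansion of `"all"`).

```python
from __future__ import annotations

from typing import Any, Literal, Mapping


def normalize_indices(concepts_indices: int | list[int] | Literal["all"], nb_concepts: int) -> list[int]:
    """Turn the accepted index forms into an explicit list of concept indices."""
    if concepts_indices == "all":
        return list(range(nb_concepts))
    if isinstance(concepts_indices, int):
        return [concepts_indices]  # assumption: a single index is treated as a one-element list
    return list(concepts_indices)


def top_inputs(
    inputs: list[str],
    concepts_activations: list[list[float]],  # shape (n_inputs, nb_concepts)
    concepts_indices: int | list[int] | Literal["all"],
    k: int = 2,
) -> Mapping[int, Any]:
    """Map each concept index to the k inputs with the highest activation for that concept."""
    nb_concepts = len(concepts_activations[0])
    interpretation: dict[int, list[str]] = {}
    for concept in normalize_indices(concepts_indices, nb_concepts):
        # Rank input positions by their activation on this concept, strongest first.
        ranked = sorted(range(len(inputs)), key=lambda i: concepts_activations[i][concept], reverse=True)
        interpretation[concept] = [inputs[i] for i in ranked[:k]]
    return interpretation
```

The result has the same `Mapping[int, Any]` shape as the return value documented above, with a list of example inputs as the interpreting object.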

input_concept_attribution

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Attributes model inputs for a selected concept.

Parameters:

Name	Type	Description	Default
`inputs`	`ModelInputs`	The input data, which can be a string, a list of tokens/words/clauses/sentences or a dataset.	required
`concept`	`int`	Index identifying the position of the concept of interest (score in the `ConceptsActivations` tensor) for which relevant input elements should be retrieved.	required
`attribution_method`	`type[AttributionExplainer]`	The attribution method to obtain importance scores for input elements.	required

Returns:

Type	Description
`list[float]`	A list of attribution scores for each input.

Source code in interpreto/concepts/base.py

@check_fitted
def input_concept_attribution(
    self,
    inputs: ModelInputs,
    concept: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Attributes model inputs for a selected concept.

    Args:
        inputs (ModelInputs): The input data, which can be a string, a list of tokens/words/clauses/sentences
            or a dataset.
        concept (int): Index identifying the position of the concept of interest (score in the
            `ConceptsActivations` tensor) for which relevant input elements should be retrieved.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each input.
    """
    raise NotImplementedError("Input-to-concept attribution method is not implemented yet.")

interpreto.concepts.ConceptAutoEncoderExplainer

ConceptAutoEncoderExplainer(model_with_split_points, concept_model, split_point=None)

Bases: ConceptEncoderExplainer[BaseDictionaryLearning], Generic[BDL]

Code: concepts/base.py

A concept bottleneck explainer wraps a concept_model that should be able to encode activations into concepts and decode concepts into activations.

We use the term "concept bottleneck" loosely, as the latent space can be overcomplete compared to activation space, as in the case of sparse autoencoders.

We assume that the concept model follows the structure of an overcomplete.BaseDictionaryLearning model, which defines the encode and decode methods for encoding and decoding activations into concepts.
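
The encode/decode structure, and the sense in which the concept space can be overcomplete, can be illustrated with a toy dictionary. This sketch does not use `overcomplete` or `torch`: `SignedBasisDictionary` is a hypothetical class working on nested lists, chosen because its ReLU encoder and linear decoder reconstruct activations exactly while using twice as many concepts as activation dimensions.

```python
Matrix = list[list[float]]


class SignedBasisDictionary:
    """Toy concept model (hypothetical): 2 * dim non-negative concepts for dim-dimensional activations."""

    def __init__(self, dim: int) -> None:
        identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        negated = [[-value for value in row] for row in identity]
        # One atom per signed axis, so the dictionary has shape (nb_concepts, dim).
        self.dictionary: Matrix = identity + negated
        self.nb_concepts = 2 * dim

    def encode(self, activations: Matrix) -> Matrix:
        # ReLU of the projection of each activation onto every atom.
        return [
            [max(0.0, sum(a * d for a, d in zip(x, atom))) for atom in self.dictionary]
            for x in activations
        ]

    def decode(self, concepts: Matrix) -> Matrix:
        # Linear decoder: each activation is the concept-weighted sum of the atoms.
        dim = len(self.dictionary[0])
        return [
            [sum(z_k * atom[j] for z_k, atom in zip(z, self.dictionary)) for j in range(dim)]
            for z in concepts
        ]

    def get_dictionary(self) -> Matrix:
        return self.dictionary
```

For the activation `[2.0, -3.0]`, only two of the four concepts are active, and decoding those concepts gives back the original activation.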

Attributes:

Name	Type	Description
`model_with_split_points`	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which `concept_model` can be fitted.
`split_point`	`str`	The split point used to train the `concept_model`.
`concept_model`	`BaseDictionaryLearning`	The model used to extract concepts from the activations of `model_with_split_points` (see [BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10)). The only assumption for classes inheriting from this class is that the `concept_model` can encode activations into concepts with `encode_activations`.
`is_fitted`	`bool`	Whether the `concept_model` was fit on model activations.
`has_differentiable_concept_encoder`	`bool`	Whether the `encode_activations` operation is differentiable.
`has_differentiable_concept_decoder`	`bool`	Whether the `decode_concepts` operation is differentiable.

Parameters:

Name	Type	Description	Default
`model_with_split_points`	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained.	required
`concept_model`	`BaseDictionaryLearning`	The model used to extract concepts from the activations of `model_with_split_points` (see [BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10)).	required
`split_point`	`str \| None`	The split point used to train the `concept_model`. If `None`, tries to use the split point of `model_with_split_points` if a single one is defined.	`None`

Source code in interpreto/concepts/base.py

def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    concept_model: BaseDictionaryLearning,
    split_point: str | None = None,
):
    """Initializes the concept explainer with a given split model.

    Args:
        model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on.
            It should have at least one split point on which a concept explainer can be trained.
        concept_model ([BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10)): The model used to extract concepts from
            the activations of `model_with_split_points`.
        split_point (str | None): The split point used to train the `concept_model`. If None, tries to use the
            split point of `model_with_split_points` if a single one is defined.
    """
    self.concept_model: BaseDictionaryLearning
    super().__init__(model_with_split_points, concept_model, split_point)
    self.has_differentiable_concept_decoder = False

fit `abstractmethod`

fit(activations, *args, **kwargs)

Fits concept_model on the given activations.

Parameters:

Name	Type	Description	Default
`activations`	`Tensor \| dict[str, Tensor]`	The activations to fit on: either a single tensor or a dictionary with model paths as keys and the corresponding tensors as values.	required

Returns:

Type	Description
`Any`	`None`; `concept_model` is fitted in place, `is_fitted` is set to `True`, and `split_point` is set.

Source code in interpreto/concepts/base.py

@abstractmethod
def fit(self, activations: LatentActivations | dict[str, LatentActivations], *args, **kwargs) -> Any:
    """Fits `concept_model` on the given activations.

    Args:
        activations (torch.Tensor | dict[str, torch.Tensor]): A dictionary with model paths as keys and the corresponding
            tensors as values.

    Returns:
        `None`, `concept_model` is fitted in-place, `is_fitted` is set to `True` and `split_point` is set.
    """
    pass

encode_activations

encode_activations(activations)

Encode the given activations using the concept_model encoder.

Parameters:

Name	Type	Description	Default
`activations`	`LatentActivations`	The activations to encode.	required

Returns:

Type	Description
`Tensor`	The encoded concept activations.

Source code in interpreto/concepts/base.py

@check_fitted
def encode_activations(self, activations: LatentActivations) -> torch.Tensor:  # ConceptsActivations
    """Encode the given activations using the `concept_model` encoder.

    Args:
        activations (LatentActivations): The activations to encode.

    Returns:
        The encoded concept activations.
    """
    self._sanitize_activations(activations)
    return self.concept_model.encode(activations)  # type: ignore

decode_concepts

decode_concepts(concepts)

Decode the given concepts using the concept_model decoder.

Parameters:

Name	Type	Description	Default
`concepts`	`ConceptsActivations`	The concepts to decode.	required

Returns:

Type	Description
`Tensor`	The decoded model activations.

Source code in interpreto/concepts/base.py

@check_fitted
def decode_concepts(self, concepts: ConceptsActivations) -> torch.Tensor:  # LatentActivations
    """Decode the given concepts using the `concept_model` decoder.

    Args:
        concepts (ConceptsActivations): The concepts to decode.

    Returns:
        The decoded model activations.
    """
    return self.concept_model.decode(concepts)  # type: ignore

get_dictionary

get_dictionary()

Get the dictionary learned by the fitted concept_model.

Returns:

Type	Description
`Tensor`	A `torch.Tensor` containing the learned dictionary.

Source code in interpreto/concepts/base.py

@check_fitted
def get_dictionary(self) -> torch.Tensor:  # TODO: add this to tests
    """Get the dictionary learned by the fitted `concept_model`.

    Returns:
        torch.Tensor: A `torch.Tensor` containing the learned dictionary.
    """
    return self.concept_model.get_dictionary()  # type: ignore

interpret

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interprets the concept dimensions of the latent space in a human-readable format. The interpretation is a mapping from concept indices to an object that helps interpret them, such as a label, a description, or examples.

Parameters:

Name	Type	Description	Default
`interpretation_method`	`type[BaseConceptInterpretationMethod]`	The interpretation method to use to interpret the concepts.	required
`concepts_indices`	`int \| list[int] \| Literal['all']`	The indices of the concepts to interpret. If "all", all concepts are interpreted.	required
`inputs`	`list[str] \| None`	The inputs to use for the interpretation. Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.	`None`
`latent_activations`	`LatentActivations \| dict[str, LatentActivations] \| None`	The latent activations to use for the interpretation. Necessary if the source is `LATENT_ACTIVATIONS`. Otherwise, they are computed from the inputs, or ignored if the source is `CONCEPT_ACTIVATIONS`.	`None`
`concepts_activations`	`ConceptsActivations \| None`	The concept activations to use for the interpretation. Necessary if the source is `CONCEPT_ACTIVATIONS`. Otherwise, they are computed from the latent activations.	`None`
`**kwargs`		Additional keyword arguments to pass to the interpretation method.	`{}`

Returns:

Type	Description
`Mapping[int, Any]`	A mapping between the concepts indices and the interpretation of the concepts.

Source code in interpreto/concepts/base.py

@check_fitted
def interpret(
    self,
    interpretation_method: type[BaseConceptInterpretationMethod],
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: dict[str, LatentActivations] | LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
    **kwargs,
) -> Mapping[int, Any]:
    """
    Interpret the concepts dimensions in the latent space into a human-readable format.
    The interpretation is a mapping between the concepts indices and an object allowing to interpret them.
    It can be a label, a description, examples, etc.

    Args:
        interpretation_method: The interpretation method to use to interpret the concepts.
        concepts_indices (int | list[int] | Literal["all"]): The indices of the concepts to interpret.
            If "all", all concepts are interpreted.
        inputs (list[str] | None): The inputs to use for the interpretation.
            Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.
        latent_activations (LatentActivations | dict[str, LatentActivations] | None): The latent activations to use for the interpretation.
            Necessary if the source is `LATENT_ACTIVATIONS`.
            Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.
        concepts_activations (ConceptsActivations | None): The concepts activations to use for the interpretation.
            Necessary if the source is `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.
        **kwargs: Additional keyword arguments to pass to the interpretation method.

    Returns:
        Mapping[int, Any]: A mapping between the concepts indices and the interpretation of the concepts.
    """
    if concepts_indices == "all":
        concepts_indices = list(range(self.concept_model.nb_concepts))

    # verify
    if latent_activations is not None:
        split_latent_activations = self._sanitize_activations(latent_activations)
    else:
        split_latent_activations = None

    # initialize the interpretation method
    method = interpretation_method(
        model_with_split_points=self.model_with_split_points,
        split_point=self.split_point,
        concept_model=self.concept_model,
        **kwargs,
    )

    # compute the interpretation from inputs and activations
    return method.interpret(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=split_latent_activations,
        concepts_activations=concepts_activations,
    )

input_concept_attribution

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Attributes model inputs for a selected concept.

Parameters:

Name	Type	Description	Default
`inputs`	`ModelInputs`	The input data, which can be a string, a list of tokens/words/clauses/sentences or a dataset.	required
`concept`	`int`	Index identifying the position of the concept of interest (score in the `ConceptsActivations` tensor) for which relevant input elements should be retrieved.	required
`attribution_method`	`type[AttributionExplainer]`	The attribution method to obtain importance scores for input elements.	required

Returns:

Type	Description
`list[float]`	A list of attribution scores for each input.

Source code in interpreto/concepts/base.py

@check_fitted
def input_concept_attribution(
    self,
    inputs: ModelInputs,
    concept: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Attributes model inputs for a selected concept.

    Args:
        inputs (ModelInputs): The input data, which can be a string, a list of tokens/words/clauses/sentences
            or a dataset.
        concept (int): Index identifying the position of the concept of interest (score in the
            `ConceptsActivations` tensor) for which relevant input elements should be retrieved.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each input.
    """
    raise NotImplementedError("Input-to-concept attribution method is not implemented yet.")

concept_output_attribution

concept_output_attribution(inputs, concepts, target, attribution_method, **attribution_kwargs)

Computes the attribution of each concept for the logit of a target output element.

Parameters:

Name	Type	Description	Default
`inputs`	`ModelInputs`	An input data-point for the model.	required
`concepts`	`Tensor`	Concept activation tensor.	required
`target`	`int`	The target class for which the concept output attribution should be computed.	required
`attribution_method`	`type[AttributionExplainer]`	The attribution method to obtain importance scores for input elements.	required

Returns:

Type	Description
`list[float]`	A list of attribution scores for each concept.

Source code in interpreto/concepts/base.py

@check_fitted
def concept_output_attribution(
    self,
    inputs: ModelInputs,
    concepts: ConceptsActivations,
    target: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Computes the attribution of each concept for the logit of a target output element.

    Args:
        inputs (ModelInputs): An input data-point for the model.
        concepts (torch.Tensor): Concept activation tensor.
        target (int): The target class for which the concept output attribution should be computed.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each concept.
    """
    raise NotImplementedError("Concept-to-output attribution method is not implemented yet.")
