
Base Classes

interpreto.concepts.ConceptEncoderExplainer

ConceptEncoderExplainer(splitter, concept_model)

Bases: ABC, Generic[ConceptModel]

Code: concepts/base.py

Abstract class defining an interface for concept explanation. Child classes should implement the fit and activations_to_concepts methods, and only assume the presence of an encoding step using the concept_model to convert activations to latent concepts.

Attributes:

Name Type Description
splitter BaseSplitter

The model to apply the explanation on. The split point is determined by the model's split_point attribute.

concept_model ConceptModelProtocol

The model used to extract concepts from the activations of splitter. The only assumption for classes inheriting from this class is that the concept_model can encode activations into concepts with activations_to_concepts. The ConceptModelProtocol is defined in interpreto.typing. It is basically a torch.nn.Module with an encode method.

is_fitted bool

Whether the concept_model was fit on model activations.

has_differentiable_concept_encoder bool

Whether the activations_to_concepts operation is differentiable.

Parameters:

Name Type Description Default

splitter

BaseSplitter

The model to apply the explanation on. Its split_point attribute determines where activations are extracted.

required

concept_model

ConceptModelProtocol

The model used to extract concepts from the activations of splitter. The ConceptModelProtocol is defined in interpreto.typing. It is basically a torch.nn.Module with an encode method.

required
Source code in interpreto/concepts/base.py
def __init__(
    self,
    splitter: BaseSplitter,
    concept_model: ConceptModelProtocol,
):
    """Initializes the concept explainer with a given split model.

    Args:
        splitter (BaseSplitter): The model to apply the explanation on.
            Its `split_point` attribute determines where activations are extracted.
        concept_model (ConceptModelProtocol): The model used to extract concepts from
            the activations of `splitter`.
            The `ConceptModelProtocol` is defined in `interpreto.typing`. It is basically a `torch.nn.Module` with an `encode` method.
    """
    if not isinstance(splitter, BaseSplitter):
        raise TypeError(f"The given model should be a BaseSplitter (or subclass), but {type(splitter)} was given.")
    self.splitter: BaseSplitter = splitter
    self._concept_model = concept_model
    self.__is_fitted: bool = False

fit abstractmethod

fit(activations, *args, **kwargs)

Fits concept_model on the given activations.

Parameters:

Name Type Description Default

activations

Tensor

The latent activations used to fit the concept model.

required

Returns:

Type Description
Any

None. The concept_model is fitted in place, is_fitted is set to True, and split_point is set.

Source code in interpreto/concepts/base.py
@abstractmethod
def fit(self, activations: LatentActivations, *args, **kwargs) -> Any:
    """Fits `concept_model` on the given activations.

    Args:
        activations (torch.Tensor): The latent activations used to fit the concept model.

    Returns:
        `None`. The `concept_model` is fitted in place, `is_fitted` is set to `True`, and `split_point` is set.
    """
    pass
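The contract described above (an abstract fit that updates the concept model in place and flips is_fitted) can be illustrated with a dependency-free sketch. Everything here is hypothetical: ToyConceptModel, ToyEncoderExplainer and MeanCenteringExplainer are stand-ins written for this example, plain Python lists replace torch tensors, and none of it is the library's implementation.

```python
from abc import ABC, abstractmethod
from typing import Any, List

Matrix = List[List[float]]


class ToyConceptModel:
    """Hypothetical concept model: encoding subtracts a learned mean."""

    def __init__(self) -> None:
        self.mean = None  # set by the explainer's fit

    def encode(self, activations: Matrix) -> Matrix:
        return [[x - m for x, m in zip(row, self.mean)] for row in activations]


class ToyEncoderExplainer(ABC):
    """Mirrors the documented contract: fit is abstract and works in place."""

    def __init__(self, concept_model: ToyConceptModel) -> None:
        self.concept_model = concept_model
        self.is_fitted = False

    @abstractmethod
    def fit(self, activations: Matrix, *args: Any, **kwargs: Any) -> Any:
        ...


class MeanCenteringExplainer(ToyEncoderExplainer):
    def fit(self, activations: Matrix, *args: Any, **kwargs: Any) -> None:
        n = len(activations)
        # Column-wise mean of the (n, d) activations.
        self.concept_model.mean = [sum(col) / n for col in zip(*activations)]
        self.is_fitted = True  # documented side effect of fitting


explainer = MeanCenteringExplainer(ToyConceptModel())
explainer.fit([[1.0, 2.0], [3.0, 6.0]])
centered = explainer.concept_model.encode([[2.0, 4.0]])
```

Instantiating ToyEncoderExplainer directly raises a TypeError, which is how Python's ABC machinery enforces that child classes provide fit.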

get_inputs_to_concepts_model

get_inputs_to_concepts_model()

Returns a model that maps raw inputs to concept activations.

The model can be passed to an attribution method to obtain input-to-concept attributions, which are one way to interpret the concepts.

Returns:

Name Type Description
ModelForInputsToConcepts ModelForInputsToConcepts

A model that maps raw inputs to concept activations.

Source code in interpreto/concepts/base.py
def get_inputs_to_concepts_model(self) -> ModelForInputsToConcepts:
    """Returns a model that maps raw inputs to concept activations.

    The model can be passed to an attribution method
    to obtain input-to-concept attributions,
    which are one way to interpret the concepts.

    Returns:
        ModelForInputsToConcepts: A model that maps raw inputs to concept activations.
    """
    return ModelForInputsToConcepts(self)
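Conceptually, the returned model composes the first part of the split model with the concept encoder. The sketch below shows that composition with toy functions. InputsToConcepts, toy_h and toy_t are hypothetical names invented for this example; the real ModelForInputsToConcepts wraps the explainer itself and is not reproduced here.

```python
from typing import Callable, List

Vector = List[float]


class InputsToConcepts:
    """Hypothetical wrapper composing h (inputs -> activations) with t (activations -> concepts)."""

    def __init__(self, h: Callable[[str], Vector], t: Callable[[Vector], Vector]) -> None:
        self.h = h
        self.t = t

    def __call__(self, text: str) -> Vector:
        # Inputs -> latent activations -> concept activations.
        return self.t(self.h(text))


def toy_h(text: str) -> Vector:
    # Toy "activations": character count and word count.
    return [float(len(text)), float(len(text.split()))]


def toy_t(activations: Vector) -> Vector:
    # Toy "encoder": three concepts as fixed linear combinations.
    a, b = activations
    return [a + b, a - b, 2.0 * b]


inputs_to_concepts = InputsToConcepts(toy_h, toy_t)
concepts = inputs_to_concepts("hello world")
```

Because the wrapper is a single callable from raw inputs to concept activations, an attribution method that only needs a forward function can treat each concept as an output to explain.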

interpreto.concepts.ConceptAutoEncoderExplainer

ConceptAutoEncoderExplainer(splitter, concept_model)

Bases: ConceptEncoderExplainer[BaseDictionaryLearning], Generic[BDL]

Code: concepts/base.py

A concept bottleneck explainer wraps a concept_model that should be able to encode activations into concepts and decode concepts into activations.

We use the term "concept bottleneck" loosely, as the latent space can be overcomplete compared to activation space, as in the case of sparse autoencoders.

We assume that the concept model follows the structure of an overcomplete.BaseDictionaryLearning model, which defines the encode and decode methods for encoding activations into concepts and decoding concepts back into activations.

Attributes:

Name Type Description
splitter ModelWithSplitPoints

The model to apply the explanation on. The split point is determined by the model's split_point attribute.

concept_model [BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10)

The model used to extract concepts from the activations of splitter. The only assumption for classes inheriting from this class is that the concept_model can encode activations into concepts with activations_to_concepts.

is_fitted bool

Whether the concept_model was fit on model activations.

has_differentiable_concept_encoder bool

Whether the activations_to_concepts operation is differentiable.

has_differentiable_concept_decoder bool

Whether the concepts_to_activations operation is differentiable.

Parameters:

Name Type Description Default

splitter

BaseSplitter

The model to apply the explanation on. Its split_point attribute determines where activations are extracted.

required

concept_model

[BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10)

The model used to extract concepts from the activations of splitter.

required
Source code in interpreto/concepts/base.py
def __init__(
    self,
    splitter: BaseSplitter,
    concept_model: BaseDictionaryLearning,
):
    """Initializes the concept explainer with a given split model.

    Args:
        splitter (BaseSplitter): The model to apply the explanation on.
            Its `split_point` attribute determines where activations are extracted.
        concept_model ([BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10)): The model used to extract concepts from
            the activations of `splitter`.
    """
    self.concept_model: BaseDictionaryLearning
    super().__init__(splitter, concept_model)  # type: ignore

fit abstractmethod

fit(activations, *args, **kwargs)

Fits concept_model on the given activations.

Parameters:

Name Type Description Default

activations

Tensor

The latent activations used to fit the concept model.

required

Returns:

Type Description
Any

None. The concept_model is fitted in place, is_fitted is set to True, and split_point is set.

Source code in interpreto/concepts/base.py
@abstractmethod
def fit(self, activations: LatentActivations, *args, **kwargs) -> Any:
    """Fits `concept_model` on the given activations.

    Args:
        activations (torch.Tensor): The latent activations used to fit the concept model.

    Returns:
        `None`. The `concept_model` is fitted in place, `is_fitted` is set to `True`, and `split_point` is set.
    """
    pass

activations_to_concepts

activations_to_concepts(activations)

Encode the given activations using the concept_model encoder.

Parameters:

Name Type Description Default

activations

LatentActivations

The activations to encode.

required

Returns:

Type Description
Tensor

The encoded concept activations.

Source code in interpreto/concepts/base.py
@check_fitted
def activations_to_concepts(self, activations: LatentActivations) -> torch.Tensor:  # ConceptsActivations
    """Encode the given activations using the `concept_model` encoder.

    Args:
        activations (LatentActivations): The activations to encode.

    Returns:
        The encoded concept activations.
    """
    if self.device != activations.device:
        activations = activations.to(self.device, non_blocking=True)
    return self.concept_model.encode(activations)  # type: ignore
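The check_fitted decorator seen in the listing guards methods that only make sense after fit. Its actual implementation is not shown on this page, so the following is only a sketch of how such a guard is commonly written. The exception type, the message and the ToyExplainer class are assumptions made for this example.

```python
import functools


def check_fitted(method):
    """Hypothetical guard: refuse to run `method` until `is_fitted` is true."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not getattr(self, "is_fitted", False):
            raise RuntimeError(
                f"{type(self).__name__} must be fitted before calling {method.__name__}."
            )
        return method(self, *args, **kwargs)

    return wrapper


class ToyExplainer:
    def __init__(self):
        self.is_fitted = False
        self.scale = None

    def fit(self, activations):
        # Toy "fit": remember the largest magnitude (fall back to 1.0 if all zeros).
        self.scale = max(abs(x) for x in activations) or 1.0
        self.is_fitted = True

    @check_fitted
    def activations_to_concepts(self, activations):
        return [x / self.scale for x in activations]


explainer = ToyExplainer()
try:
    explainer.activations_to_concepts([1.0, -2.0])
    raised = False
except RuntimeError:
    raised = True

explainer.fit([1.0, -4.0])
concepts = explainer.activations_to_concepts([2.0, -4.0])
```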

concepts_to_activations

concepts_to_activations(concepts)

Decode the given concepts using the concept_model decoder.

Parameters:

Name Type Description Default

concepts

ConceptsActivations

The concepts to decode.

required

Returns:

Type Description
Tensor

The decoded model activations.

Source code in interpreto/concepts/base.py
@check_fitted
def concepts_to_activations(self, concepts: ConceptsActivations) -> torch.Tensor:  # LatentActivations
    """Decode the given concepts using the `concept_model` decoder.

    Args:
        concepts (ConceptsActivations): The concepts to decode.

    Returns:
        The decoded model activations.
    """
    if self.device != concepts.device:
        concepts = concepts.to(self.device, non_blocking=True)
    return self.concept_model.decode(concepts)  # type: ignore
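To make the encode and decode roles concrete, here is a toy round trip through an overcomplete concept space (three concepts for two activation dimensions). It is a sketch under stated assumptions: both maps are linear, the matrices encoder_weights and dictionary are made up for the example, the (n_concepts, d) dictionary layout is an assumption, and plain lists stand in for torch tensors. Real concept models are generally not exact inverses of each other.

```python
def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical encoder weights, shape (d=2, c=3): concepts are a0, a1 and a0 + a1.
encoder_weights = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]

# Hypothetical dictionary, shape (c=3, d=2), chosen as a left inverse of the encoder.
dictionary = [
    [0.5, -0.5],
    [-0.5, 0.5],
    [0.5, 0.5],
]


def activations_to_concepts(activations):
    return matmul(activations, encoder_weights)


def concepts_to_activations(concepts):
    return matmul(concepts, dictionary)


activations = [[2.0, 4.0], [-1.0, 3.0]]
concepts = activations_to_concepts(activations)
reconstruction = concepts_to_activations(concepts)
```

In this toy case the reconstruction equals the original activations exactly. With a learned dictionary the round trip is usually only approximate, and the gap between the two is the reconstruction error.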

get_dictionary

get_dictionary()

Get the dictionary learned by the fitted concept_model.

Returns:

Type Description
Tensor

A torch.Tensor containing the learned dictionary.

Source code in interpreto/concepts/base.py
@check_fitted
def get_dictionary(self) -> torch.Tensor:  # TODO: add this to tests
    """Get the dictionary learned by the fitted `concept_model`.

    Returns:
        torch.Tensor: A `torch.Tensor` containing the learned dictionary.
    """
    return self.concept_model.get_dictionary()  # type: ignore

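concept_output_gradient

concept_output_gradient(inputs, targets=None, activation_granularity=TOKEN, aggregation_strategy=MEAN, concepts_x_gradients=True, normalization=True, tqdm_bar=False, batch_size=None)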

Compute the gradients of the predictions with respect to the concepts.

To clarify what this function does, let us fix some notation. Suppose the initial model was split such that \(f = g \circ h\), so that the concept model was fitted on \(A = h(X)\), with \(X\) a dataset of samples. The encoder and decoder of the resulting concept model are denoted \(t\) and \(t^{-1}\). \(t\) can be seen as a projection from the latent space to the concept space. Hence, the function going from the inputs to the concepts is \(f_{ic} = t \circ h\), and the function going from the concepts to the outputs is \(f_{co} = g \circ t^{-1}\).

Given a set of samples \(X\) and the functions \((h, t, t^{-1}, g)\), this function first computes \(C = t(A) = t \circ h(X)\), then returns \(\nabla{f_{co}}(C)\).

In practice, all computations are done by ModelWithSplitPoints._get_concept_output_gradients, which relies on NNsight. The current method only forwards \(t\) and \(t^{-1}\), that is, the self.activations_to_concepts and self.concepts_to_activations methods.
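As a numeric illustration of the gradient of the concepts-to-outputs function, the sketch below builds a toy \(f_{co} = g \circ t^{-1}\) with a linear decoder and a linear single-output head, then compares the analytic gradient with a central finite-difference estimate. The matrices, the helper names and the linearity of \(g\) are all assumptions made for the example. In the library the gradient is obtained through the split model, not by finite differences.

```python
def decode(concepts, dictionary):
    """Toy t^{-1}: concepts (c,) -> activations (d,), as a linear map."""
    d = len(dictionary[0])
    return [sum(concepts[k] * dictionary[k][j] for k in range(len(concepts))) for j in range(d)]


def head(activations, weights):
    """Toy g: activations (d,) -> one scalar output (a single target)."""
    return sum(a * w for a, w in zip(activations, weights))


def f_co(concepts, dictionary, weights):
    return head(decode(concepts, dictionary), weights)


def finite_difference_gradient(concepts, dictionary, weights, eps=1e-6):
    """Central differences: perturb one concept at a time."""
    grads = []
    for k in range(len(concepts)):
        up = list(concepts)
        down = list(concepts)
        up[k] += eps
        down[k] -= eps
        grads.append((f_co(up, dictionary, weights) - f_co(down, dictionary, weights)) / (2 * eps))
    return grads


dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical, shape (c=3, d=2)
weights = [2.0, -1.0]                               # hypothetical head, shape (d=2,)
concepts = [0.5, 1.5, -1.0]

# For linear maps the gradient with respect to the concepts is dictionary @ weights.
analytic = [sum(row[j] * weights[j] for j in range(len(weights))) for row in dictionary]
numeric = finite_difference_gradient(concepts, dictionary, weights)
```

For this linear toy the gradient does not depend on the concept values. With a nonlinear \(g\), as in a real model, it does, which is why the gradient is evaluated at \(C = t \circ h(X)\).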

Parameters:

Name Type Description Default

inputs

list[str] | Tensor | BatchEncoding

The input data, either a list of samples, the tokenized input or a batch of samples.

required

targets

list[int] | None

Specifies which outputs of the model are used to compute the gradients. Note that \(f_{co}\) often has several outputs; by default, gradients are computed for each output. The t dimension of the returned tensor is equal to the number of selected targets. (For classification, those are the class logits; for generation, those are the probabilities of the most probable tokens.)

None

activation_granularity

ActivationGranularity

The granularity of the activations to use for the attribution. It is highly recommended to use the same granularity as the one used in the fit method. Possible values are:

  • ModelWithSplitPoints.activation_granularities.CLS_TOKEN: only the first token (e.g. [CLS]) activation is returned (batch, d_model).

  • ModelWithSplitPoints.activation_granularities.ALL_TOKENS: every token activation is treated as a separate element (batch x seq_len, d_model).

  • ModelWithSplitPoints.activation_granularities.TOKEN: special tokens are removed.

  • ModelWithSplitPoints.activation_granularities.WORD: aggregate by words following the split defined by interpreto.commons.granularity.Granularity.WORD.

  • ModelWithSplitPoints.activation_granularities.SENTENCE: aggregate by sentences following the split defined by interpreto.commons.granularity.Granularity.SENTENCE.

TOKEN
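The shapes quoted for CLS_TOKEN and ALL_TOKENS can be checked on a toy batch. The sketch below uses nested lists in place of tensors. The function names and the special_tokens_mask are hypothetical, and the per-sample output for TOKEN is only one plausible reading of "special tokens are removed", not a statement about the library's actual layout.

```python
# Toy activations with shape (batch=2, seq_len=3, d_model=2).
activations = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1]],
]

# Hypothetical mask marking special tokens (here the first and last positions).
special_tokens_mask = [[True, False, True], [True, False, True]]


def cls_token(acts):
    """Keep only the first token of each sample: (batch, d_model)."""
    return [sample[0] for sample in acts]


def all_tokens(acts):
    """Treat every token as a separate element: (batch x seq_len, d_model)."""
    return [token for sample in acts for token in sample]


def token(acts, mask):
    """Assumed behaviour: drop the special tokens, keeping one list per sample."""
    return [
        [tok for tok, is_special in zip(sample, sample_mask) if not is_special]
        for sample, sample_mask in zip(acts, mask)
    ]


cls_acts = cls_token(activations)
flat_acts = all_tokens(activations)
token_acts = token(activations, special_tokens_mask)
```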

aggregation_strategy

GranularityAggregationStrategy

Strategy used to aggregate token activations into larger input granularities. Applied for the WORD and SENTENCE activation granularities. Token activations of shape n * (l, d) are aggregated along the sequence length dimension, then concatenated into (ng, d) tensors. Existing strategies are:

  • ModelWithSplitPoints.aggregation_strategies.SUM: Token activations are summed along the sequence length dimension.

  • ModelWithSplitPoints.aggregation_strategies.MEAN: Token activations are averaged along the sequence length dimension.

  • ModelWithSplitPoints.aggregation_strategies.MAX: The maximum of the token activations along the sequence length dimension is selected.

  • ModelWithSplitPoints.aggregation_strategies.SIGNED_MAX: The activation with the largest absolute value along the sequence length dimension is selected, with its original sign kept. For example, signed_max([[-1, 0, 1, 2], [-3, 1, -2, 0]]) = [-3, 1, -2, 2].

MEAN
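The four strategies can be reproduced on the signed_max example from the list above. This is a plain-Python sketch, not the library's code: the aggregate function and its string keys are hypothetical, and each call aggregates the tokens of a single word or sentence, given as an (l, d) nested list, into one (d,) vector.

```python
def aggregate(tokens, strategy):
    """Aggregate (l, d) token activations along the sequence length dimension."""
    columns = list(zip(*tokens))  # one tuple per feature dimension
    if strategy == "sum":
        return [sum(col) for col in columns]
    if strategy == "mean":
        return [sum(col) / len(col) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "signed_max":
        # Value with the largest magnitude, sign kept (first one wins on ties).
        return [max(col, key=abs) for col in columns]
    raise ValueError(f"Unknown strategy: {strategy}")


tokens = [[-1, 0, 1, 2], [-3, 1, -2, 0]]

summed = aggregate(tokens, "sum")
averaged = aggregate(tokens, "mean")
maxed = aggregate(tokens, "max")
signed = aggregate(tokens, "signed_max")
```

The difference between MAX and SIGNED_MAX shows in the first and third dimensions: MAX returns -1 and 1, while SIGNED_MAX keeps the strongly negative activations -3 and -2.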

concepts_x_gradients

bool

Whether the resulting gradients should be multiplied by the concept activations. True by default (similarly to attributions), because of mathematical properties. In that case, the output is \(C * \nabla{f_{co}}(C)\).

True
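One property often cited for gradient-times-input products is that, for a linear function without bias, the products sum to the function's output. Whether this is the property the authors have in mind is an assumption. The sketch below checks it on a hypothetical linear \(f_{co}\) with three concepts.

```python
def f_co(concepts):
    """Hypothetical linear concepts-to-output function: 2*C0 - C1 + C2."""
    return 2.0 * concepts[0] - concepts[1] + concepts[2]


concepts = [0.5, 1.5, -1.0]
gradients = [2.0, -1.0, 1.0]  # gradient of the toy f_co, constant because f_co is linear

# Element-wise product C * grad f_co(C): one contribution per concept.
contributions = [c * g for c, g in zip(concepts, gradients)]
total = sum(contributions)
output = f_co(concepts)
```

Here the contributions sum exactly to the output, so each product can be read as the share of the output carried by one concept. For a nonlinear model this holds only approximately, if at all.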

normalization

bool

Whether to normalize the gradients. Gradients are normalized on the concept (c) and sequence length (g) dimensions, such that for a given sample-target-granular pair, the sum of the absolute values of the gradients is equal to 1. (The granular elements depend on activation_granularity.)

True
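The normalization can be sketched on a toy (t, g, c) gradient block for a single input. The implementation of _normalize_gradients is not shown on this page, so the code below follows one reading of the description: for each target, divide by the sum of absolute values taken jointly over the g and c dimensions. The function name and the handling of all-zero blocks are assumptions.

```python
def normalize_gradients(gradients):
    """Assumed L1 normalization of a (t, g, c) nested list, per target over (g, c)."""
    normalized = []
    for target_block in gradients:  # each block has shape (g, c)
        total = sum(abs(value) for row in target_block for value in row)
        scale = total if total > 0 else 1.0  # leave all-zero blocks unchanged
        normalized.append([[value / scale for value in row] for row in target_block])
    return normalized


# Toy gradients with t=2 targets, g=2 granular elements, c=2 concepts.
gradients = [
    [[1.0, -3.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]

normalized = normalize_gradients(gradients)
l1_norms = [sum(abs(v) for row in block for v in row) for block in normalized]
```

After normalization the absolute values for each target sum to 1 while signs are preserved, which makes gradients comparable across inputs whose raw magnitudes differ.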

tqdm_bar

bool

Whether to display a progress bar.

False

batch_size

int | None

Batch size for the model. It may differ from the one used in ModelWithSplitPoints.get_activations because computing gradients requires much more memory.

None

Returns:

Type Description
list[Float[Tensor, 't g c']]

The gradients of the model output with respect to the concept activations. The list length corresponds to the number of inputs. Each tensor has shape (t, g, c), with t the target dimension, g the number of granularity elements in one input, and c the number of concepts.

Source code in interpreto/concepts/base.py
@check_fitted
def concept_output_gradient(
    self,
    inputs: torch.Tensor | list[str] | BatchEncoding,
    targets: list[int] | None = None,
    activation_granularity: ActivationGranularity = ActivationGranularity.TOKEN,
    aggregation_strategy: GranularityAggregationStrategy = GranularityAggregationStrategy.MEAN,
    concepts_x_gradients: bool = True,
    normalization: bool = True,
    tqdm_bar: bool = False,
    batch_size: int | None = None,
) -> list[Float[torch.Tensor, "t g c"]]:
    """
    Compute the gradients of the predictions with respect to the concepts.

    To clarify what this function does, let us fix some notation.
    Suppose the initial model was split such that $f = g \\circ h$.
    Hence the concept model was fitted on $A = h(X)$ with $X$ a dataset of samples.
    The encoder and decoder of the resulting concept model are denoted $t$ and $t^{-1}$.
    $t$ can be seen as a projection from the latent space to the concept space.
    Hence, the function going from the inputs to the concepts is $f_{ic} = t \\circ h$
    and the function going from the concepts to the outputs is $f_{co} = g \\circ t^{-1}$.

    Given a set of samples $X$ and the functions $(h, t, t^{-1}, g)$,
    this function first computes $C = t(A) = t \\circ h(X)$, then returns $\\nabla{f_{co}}(C)$.

    In practice all computations are done by `ModelWithSplitPoints._get_concept_output_gradients`,
    which relies on NNsight. The current method only forwards the $t$ and $t^{-1}$,
    respectively `self.activations_to_concepts` and `self.concepts_to_activations` methods.

    Args:
        inputs (list[str] | torch.Tensor | BatchEncoding):
            The input data, either a list of samples, the tokenized input or a batch of samples.

        targets (list[int] | None):
            Specify which outputs of the model should be used to compute the gradients.
            Note that $f_{co}$ often has several outputs, by default gradients are computed for each output.
            The `t` dimension of the returned tensor is equal to the number of selected targets.
            (For classification, those are the classes logits and for generation, those are the most probable tokens probabilities).

        activation_granularity (ActivationGranularity):
            The granularity of the activations to use for the attribution.
            It is highly recommended to use the same granularity as the one used in the `fit` method.
            Possible values are:

            - ``ModelWithSplitPoints.activation_granularities.CLS_TOKEN``:
                only the first token (e.g. ``[CLS]``) activation is returned ``(batch, d_model)``.

            - ``ModelWithSplitPoints.activation_granularities.ALL_TOKENS``:
                every token activation is treated as a separate element ``(batch x seq_len, d_model)``.

            - ``ModelWithSplitPoints.activation_granularities.TOKEN``: remove special tokens.

            - ``ModelWithSplitPoints.activation_granularities.WORD``:
                aggregate by words following the split defined by
                :class:`~interpreto.commons.granularity.Granularity.WORD`.

            - ``ModelWithSplitPoints.activation_granularities.SENTENCE``:
                aggregate by sentences following the split defined by
                :class:`~interpreto.commons.granularity.Granularity.SENTENCE`.

        aggregation_strategy (GranularityAggregationStrategy):
            Strategy used to aggregate token activations into larger input granularities.
            Applied for the `WORD` and `SENTENCE` activation granularities.
            Token activations of shape n * (l, d) are aggregated along the sequence length dimension,
            then concatenated into (ng, d) tensors.
            Existing strategies are:

            - ``ModelWithSplitPoints.aggregation_strategies.SUM``:
                Token activations are summed along the sequence length dimension.

            - ``ModelWithSplitPoints.aggregation_strategies.MEAN``:
                Token activations are averaged along the sequence length dimension.

            - ``ModelWithSplitPoints.aggregation_strategies.MAX``:
                The maximum of the token activations along the sequence length dimension is selected.

            - ``ModelWithSplitPoints.aggregation_strategies.SIGNED_MAX``:
                The activation with the largest absolute value is selected, with its original sign kept.
                signed_max([[-1, 0, 1, 2], [-3, 1, -2, 0]]) = [-3, 1, -2, 2]

        concepts_x_gradients (bool):
            Whether the resulting gradients should be multiplied by the concept activations.
            True by default (similarly to attributions), because of mathematical properties.
            In that case, the output is $C * \\nabla{f_{co}}(C)$.

        normalization (bool):
            Whether to normalize the gradients.
            Gradients are normalized on the concept (c) and sequence length (g) dimensions,
            such that for a given sample-target-granular pair,
            the sum of the absolute values of the gradients is equal to 1.
            (The granular elements depend on `activation_granularity`.)

        tqdm_bar (bool):
            Whether to display a progress bar.

        batch_size (int | None):
            Batch size for the model.
            It might be different from the one used in `ModelWithSplitPoints.get_activations`
            because gradients have a much larger impact on the memory.

    Returns:
        list[Float[torch.Tensor, "t g c"]]:
            The gradients of the model output with respect to the concept activations.
            The list length corresponds to the number of inputs.
            Each tensor has shape (t, g, c), with t the target dimension,
            g the number of granularity elements in one input, and c the number of concepts.
    """
    if not self.has_differentiable_concept_decoder:
        raise ValueError(
            "The concept decoder of this explainer is not differentiable. This is required to compute concept-to-output gradients. "
            f"Current explainer class: {self.__class__.__name__}."
        )

    # put everything on device
    self.to(self.splitter.device)  # type: ignore

    # forward all computations to the splitter
    gradients = self.splitter._get_concept_output_gradients(
        inputs=inputs,
        targets=targets,
        activations_to_concepts=self.activations_to_concepts,
        concepts_to_activations=self.concepts_to_activations,
        activation_granularity=activation_granularity,
        aggregation_strategy=aggregation_strategy,
        concepts_x_gradients=concepts_x_gradients,
        tqdm_bar=tqdm_bar,
        batch_size=batch_size,
    )

    # normalize the gradients if required
    if normalization:
        gradients = [self._normalize_gradients(g) for g in gradients]
    return gradients