Base Classes

interpreto.concepts.ConceptEncoderExplainer

ConceptEncoderExplainer(model_with_split_points, concept_model, split_point=None)

Bases: ABC, Generic[ConceptModel]

Code: concepts/base.py

Abstract class defining an interface for concept explanation. Child classes should implement the `fit` and `encode_activations` methods, and should only assume the presence of an encoding step that uses the `concept_model` to convert activations into latent concepts.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `model_with_split_points` | `ModelWithSplitPoints` | The model to apply the explanation on. It should have at least one split point on which `concept_model` can be fitted. |
| `split_point` | `str` | The split point used to train the `concept_model`. |
| `concept_model` | `ConceptModelProtocol` | The model used to extract concepts from the activations of `model_with_split_points`. The only assumption for classes inheriting from this class is that the `concept_model` can encode activations into concepts with `encode_activations`. The `ConceptModelProtocol` is defined in `interpreto.typing`. It is essentially a `torch.nn.Module` with an `encode` method. |
| `is_fitted` | `bool` | Whether the `concept_model` was fitted on model activations. |
| `has_differentiable_concept_encoder` | `bool` | Whether the `encode_activations` operation is differentiable. |

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `model_with_split_points` | `ModelWithSplitPoints` | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | *required* |
| `concept_model` | `ConceptModelProtocol` | The model used to extract concepts from the activations of `model_with_split_points`. The `ConceptModelProtocol` is defined in `interpreto.typing`. It is essentially a `torch.nn.Module` with an `encode` method. | *required* |
| `split_point` | `str \| None` | The split point used to train the `concept_model`. If `None`, tries to use the split point of `model_with_split_points` if a single one is defined. | `None` |
Source code in interpreto/concepts/base.py
def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    concept_model: ConceptModelProtocol,
    split_point: str | None = None,
):
    """Initializes the concept explainer with a given splitted model.

    Args:
        model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on.
            It should have at least one split point on which a concept explainer can be trained.
        concept_model (ConceptModelProtocol): The model used to extract concepts from
            the activations of `model_with_split_points`.
            The `ConceptModelProtocol` is defined in `interpreto.typing`. It is basically a `torch.nn.Module` with an `encode` method.
        split_point (str | None): The split point used to train the `concept_model`. If None, tries to use the
            split point of `model_with_split_points` if a single one is defined.
    """
    if not isinstance(model_with_split_points, ModelWithSplitPoints):
        raise TypeError(
            f"The given model should be a ModelWithSplitPoints, but {type(model_with_split_points)} was given."
        )
    self.model_with_split_points: ModelWithSplitPoints = model_with_split_points
    self._concept_model = concept_model
    self.split_point = split_point  # Verified by `split_point.setter`
    self.__is_fitted: bool = False
    self.has_differentiable_concept_encoder = False
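
To ground the interface, here is a hedged usage sketch. `ModelWithSplitPoints` is documented on this page; the concrete explainer subclass, the concept model, the checkpoint name, and the split point path are hypothetical stand-ins, and the `ModelWithSplitPoints` constructor arguments may differ in your version:

```python
# Hedged usage sketch: names marked "hypothetical" are not part of this API.
from interpreto import ModelWithSplitPoints  # assumed top-level export

from my_project.explainers import MyConceptExplainer, MyConceptModel  # hypothetical

model = ModelWithSplitPoints(
    "bert-base-uncased",                      # hypothetical checkpoint
    split_points=["encoder.layer.6.output"],  # hypothetical split point path
)

explainer = MyConceptExplainer(
    model_with_split_points=model,
    concept_model=MyConceptModel(nb_concepts=128),  # must expose an `encode` method
    split_point=None,  # inferred, since a single split point is defined
)
```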

fit abstractmethod

fit(activations, *args, **kwargs)

Fits concept_model on the given activations.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `activations` | `Tensor \| dict[str, Tensor]` | A single activation tensor, or a dictionary with model paths (split points) as keys and the corresponding tensors as values. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `Any` | `None`; `concept_model` is fitted in-place, `is_fitted` is set to `True`, and `split_point` is set. |

Source code in interpreto/concepts/base.py
@abstractmethod
def fit(self, activations: LatentActivations | dict[str, LatentActivations], *args, **kwargs) -> Any:
    """Fits `concept_model` on the given activations.

    Args:
        activations (torch.Tensor | dict[str, torch.Tensor]): A single activation tensor, or a dictionary
            with model paths (split points) as keys and the corresponding tensors as values.

    Returns:
        `None`, `concept_model` is fitted in-place, `is_fitted` is set to `True` and `split_point` is set.
    """
    pass
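
A hedged sketch of the expected fitting workflow, continuing the example above; `get_activations` is assumed to return a mapping from split points to activation tensors (check `ModelWithSplitPoints` for the exact method name and signature):

```python
# Hypothetical fitting workflow for a concrete explainer.
texts = ["The movie was great.", "The plot was dull."]

# Assumed API: returns {split_point: activation tensor}.
activations = model.get_activations(texts)

explainer.fit(activations)   # concept_model is fitted in-place
assert explainer.is_fitted   # is_fitted is now True
```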

interpret

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interpret the concept dimensions of the latent space in a human-readable format. The interpretation is a mapping between concept indices and an object that allows interpreting them. It can be a label, a description, examples, etc.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `interpretation_method` | `type[BaseConceptInterpretationMethod]` | The interpretation method to use to interpret the concepts. | *required* |
| `concepts_indices` | `int \| list[int] \| Literal["all"]` | The indices of the concepts to interpret. If `"all"`, all concepts are interpreted. | *required* |
| `inputs` | `list[str] \| None` | The inputs to use for the interpretation. Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs. | `None` |
| `latent_activations` | `LatentActivations \| dict[str, LatentActivations] \| None` | The latent activations to use for the interpretation. Necessary if the source is `LATENT_ACTIVATIONS`. Otherwise, it is computed from the inputs, or ignored if the source is `CONCEPT_ACTIVATIONS`. | `None` |
| `concepts_activations` | `ConceptsActivations \| None` | The concept activations to use for the interpretation. Necessary if the source is not `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations. | `None` |
| `**kwargs` | | Additional keyword arguments to pass to the interpretation method. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Mapping[int, Any]` | A mapping between concept indices and the interpretation of each concept. |

Source code in interpreto/concepts/base.py
@check_fitted
def interpret(
    self,
    interpretation_method: type[BaseConceptInterpretationMethod],
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: dict[str, LatentActivations] | LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
    **kwargs,
) -> Mapping[int, Any]:
    """
    Interpret the concept dimensions of the latent space in a human-readable format.
    The interpretation is a mapping between concept indices and an object that allows interpreting them.
    It can be a label, a description, examples, etc.

    Args:
        interpretation_method: The interpretation method to use to interpret the concepts.
        concepts_indices (int | list[int] | Literal["all"]): The indices of the concepts to interpret.
            If "all", all concepts are interpreted.
        inputs (list[str] | None): The inputs to use for the interpretation.
            Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.
        latent_activations (LatentActivations | dict[str, LatentActivations] | None): The latent activations to use for the interpretation.
            Necessary if the source is `LATENT_ACTIVATIONS`.
            Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.
        concepts_activations (ConceptsActivations | None): The concepts activations to use for the interpretation.
            Necessary if the source is not `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.
        **kwargs: Additional keyword arguments to pass to the interpretation method.

    Returns:
        Mapping[int, Any]: A mapping between concept indices and the interpretation of each concept.
    """
    if concepts_indices == "all":
        concepts_indices = list(range(self.concept_model.nb_concepts))

    # verify
    if latent_activations is not None:
        split_latent_activations = self._sanitize_activations(latent_activations)
    else:
        split_latent_activations = None

    # initialize the interpretation method
    method = interpretation_method(
        model_with_split_points=self.model_with_split_points,
        split_point=self.split_point,
        concept_model=self.concept_model,
        **kwargs,
    )

    # compute the interpretation from inputs and activations
    return method.interpret(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=split_latent_activations,
        concepts_activations=concepts_activations,
    )
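
A hedged call sketch; `TopKInputs` stands in for whichever `BaseConceptInterpretationMethod` subclass your version ships, and its import path is an assumption:

```python
# Hypothetical interpretation call.
from interpreto.concepts.interpretations import TopKInputs  # assumed path

interpretations = explainer.interpret(
    interpretation_method=TopKInputs,
    concepts_indices=[0, 3, 7],  # or "all" for every concept
    inputs=texts,                # examples are extracted from these inputs
)
for concept_index, interpretation in interpretations.items():
    print(concept_index, interpretation)
```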

input_concept_attribution

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Computes attribution scores over model inputs for a selected concept.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `inputs` | `ModelInputs` | The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset. | *required* |
| `concept` | `int` | Index identifying the position of the concept of interest (score in the `ConceptsActivations` tensor) for which relevant input elements should be retrieved. | *required* |
| `attribution_method` | `type[AttributionExplainer]` | The attribution method to obtain importance scores for input elements. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `list[float]` | A list of attribution scores for each input. |

Source code in interpreto/concepts/base.py
@check_fitted
def input_concept_attribution(
    self,
    inputs: ModelInputs,
    concept: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Attributes model inputs for a selected concept.

    Args:
        inputs (ModelInputs): The input data, which can be a string, a list of tokens/words/clauses/sentences
            or a dataset.
        concept (int): Index identifying the position of the concept of interest (score in the
            `ConceptsActivations` tensor) for which relevant input elements should be retrieved.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each input.
    """
    raise NotImplementedError("Input-to-concept attribution method is not implemented yet.")

interpreto.concepts.ConceptAutoEncoderExplainer

ConceptAutoEncoderExplainer(model_with_split_points, concept_model, split_point=None)

Bases: ConceptEncoderExplainer[BaseDictionaryLearning], Generic[BDL]

Code: concepts/base.py

A concept bottleneck explainer wraps a concept_model that should be able to encode activations into concepts and decode concepts into activations.

We use the term "concept bottleneck" loosely, as the latent space can be overcomplete compared to activation space, as in the case of sparse autoencoders.

We assume that the concept model follows the structure of an overcomplete.BaseDictionaryLearning model, which defines the encode and decode methods for encoding and decoding activations into concepts.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `model_with_split_points` | `ModelWithSplitPoints` | The model to apply the explanation on. It should have at least one split point on which `concept_model` can be fitted. |
| `split_point` | `str` | The split point used to train the `concept_model`. |
| `concept_model` | [`BaseDictionaryLearning`](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10) | The model used to extract concepts from the activations of `model_with_split_points`. The only assumption for classes inheriting from this class is that the `concept_model` can encode activations into concepts with `encode_activations`. |
| `is_fitted` | `bool` | Whether the `concept_model` was fitted on model activations. |
| `has_differentiable_concept_encoder` | `bool` | Whether the `encode_activations` operation is differentiable. |
| `has_differentiable_concept_decoder` | `bool` | Whether the `decode_concepts` operation is differentiable. |

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `model_with_split_points` | `ModelWithSplitPoints` | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | *required* |
| `concept_model` | [`BaseDictionaryLearning`](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10) | The model used to extract concepts from the activations of `model_with_split_points`. | *required* |
| `split_point` | `str \| None` | The split point used to train the `concept_model`. If `None`, tries to use the split point of `model_with_split_points` if a single one is defined. | `None` |
Source code in interpreto/concepts/base.py
def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    concept_model: BaseDictionaryLearning,
    split_point: str | None = None,
):
    """Initializes the concept explainer with a given splitted model.

    Args:
        model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on.
            It should have at least one split point on which a concept explainer can be trained.
        concept_model ([BaseDictionaryLearning](https://github.com/KempnerInstitute/overcomplete/blob/24568ba5736cbefca4b78a12246d92a1be04a1f4/overcomplete/base.py#L10)): The model used to extract concepts from
            the activations of `model_with_split_points`.
        split_point (str | None): The split point used to train the `concept_model`. If None, tries to use the
            split point of `model_with_split_points` if a single one is defined.
    """
    self.concept_model: BaseDictionaryLearning
    super().__init__(model_with_split_points, concept_model, split_point)
    self.has_differentiable_concept_decoder = False
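
A hedged construction sketch, assuming the `SAE` class from the overcomplete library (any `BaseDictionaryLearning` subclass is used the same way; the `SAE` import path and constructor signature are assumptions):

```python
# Hypothetical setup with an overcomplete dictionary-learning concept model.
from interpreto.concepts import ConceptAutoEncoderExplainer
from overcomplete.sae import SAE  # assumed import path

concept_model = SAE(input_shape=768, nb_concepts=4096)  # assumed signature
explainer = ConceptAutoEncoderExplainer(
    model_with_split_points=model,  # the ModelWithSplitPoints from earlier sketches
    concept_model=concept_model,
)
```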

fit abstractmethod

fit(activations, *args, **kwargs)

Fits concept_model on the given activations.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `activations` | `Tensor \| dict[str, Tensor]` | A single activation tensor, or a dictionary with model paths (split points) as keys and the corresponding tensors as values. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `Any` | `None`; `concept_model` is fitted in-place, `is_fitted` is set to `True`, and `split_point` is set. |

Source code in interpreto/concepts/base.py
@abstractmethod
def fit(self, activations: LatentActivations | dict[str, LatentActivations], *args, **kwargs) -> Any:
    """Fits `concept_model` on the given activations.

    Args:
        activations (torch.Tensor | dict[str, torch.Tensor]): A single activation tensor, or a dictionary
            with model paths (split points) as keys and the corresponding tensors as values.

    Returns:
        `None`, `concept_model` is fitted in-place, `is_fitted` is set to `True` and `split_point` is set.
    """
    pass

encode_activations

encode_activations(activations)

Encode the given activations using the concept_model encoder.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `activations` | `LatentActivations` | The activations to encode. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `Tensor` | The encoded concept activations. |

Source code in interpreto/concepts/base.py
@check_fitted
def encode_activations(self, activations: LatentActivations) -> torch.Tensor:  # ConceptsActivations
    """Encode the given activations using the `concept_model` encoder.

    Args:
        activations (LatentActivations): The activations to encode.

    Returns:
        The encoded concept activations.
    """
    self._sanitize_activations(activations)
    return self.concept_model.encode(activations)  # type: ignore
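
A short encoding sketch, assuming `activations` is the dictionary from the fitting sketch above; shapes are purely illustrative:

```python
# Encode latent activations into concept activations (shapes illustrative).
latent = activations[explainer.split_point]      # e.g. (n_tokens, 768)
concepts = explainer.encode_activations(latent)  # e.g. (n_tokens, 4096)
```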

decode_concepts

decode_concepts(concepts)

Decode the given concepts using the concept_model decoder.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `concepts` | `ConceptsActivations` | The concepts to decode. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `Tensor` | The decoded model activations. |

Source code in interpreto/concepts/base.py
@check_fitted
def decode_concepts(self, concepts: ConceptsActivations) -> torch.Tensor:  # LatentActivations
    """Decode the given concepts using the `concept_model` decoder.

    Args:
        concepts (ConceptsActivations): The concepts to decode.

    Returns:
        The decoded model activations.
    """
    return self.concept_model.decode(concepts)  # type: ignore
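
Decoding maps concept activations back to the model's activation space, which enables, for example, a reconstruction-error sanity check (a sketch under the same assumptions as above):

```python
import torch

# Round-trip through the concept bottleneck and measure reconstruction error.
reconstructed = explainer.decode_concepts(concepts)  # back to e.g. (n_tokens, 768)
error = torch.nn.functional.mse_loss(reconstructed, latent)
print(f"reconstruction MSE: {error.item():.4f}")
```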

get_dictionary

get_dictionary()

Get the dictionary learned by the fitted concept_model.

Returns:

| Type | Description |
|------|-------------|
| `Tensor` | A `torch.Tensor` containing the learned dictionary. |

Source code in interpreto/concepts/base.py
@check_fitted
def get_dictionary(self) -> torch.Tensor:  # TODO: add this to tests
    """Get the dictionary learned by the fitted `concept_model`.

    Returns:
        torch.Tensor: A `torch.Tensor` containing the learned dictionary.
    """
    return self.concept_model.get_dictionary()  # type: ignore
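
Each row of the dictionary is one concept direction in activation space; a minimal sketch (the exact shape convention depends on the underlying `BaseDictionaryLearning` model):

```python
# Inspect the learned concept dictionary.
dictionary = explainer.get_dictionary()  # e.g. shape (nb_concepts, activation_dim)
print(dictionary.shape)
```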

interpret

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interpret the concept dimensions of the latent space in a human-readable format. The interpretation is a mapping between concept indices and an object that allows interpreting them. It can be a label, a description, examples, etc.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `interpretation_method` | `type[BaseConceptInterpretationMethod]` | The interpretation method to use to interpret the concepts. | *required* |
| `concepts_indices` | `int \| list[int] \| Literal["all"]` | The indices of the concepts to interpret. If `"all"`, all concepts are interpreted. | *required* |
| `inputs` | `list[str] \| None` | The inputs to use for the interpretation. Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs. | `None` |
| `latent_activations` | `LatentActivations \| dict[str, LatentActivations] \| None` | The latent activations to use for the interpretation. Necessary if the source is `LATENT_ACTIVATIONS`. Otherwise, it is computed from the inputs, or ignored if the source is `CONCEPT_ACTIVATIONS`. | `None` |
| `concepts_activations` | `ConceptsActivations \| None` | The concept activations to use for the interpretation. Necessary if the source is not `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations. | `None` |
| `**kwargs` | | Additional keyword arguments to pass to the interpretation method. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Mapping[int, Any]` | A mapping between concept indices and the interpretation of each concept. |

Source code in interpreto/concepts/base.py
@check_fitted
def interpret(
    self,
    interpretation_method: type[BaseConceptInterpretationMethod],
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: dict[str, LatentActivations] | LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
    **kwargs,
) -> Mapping[int, Any]:
    """
    Interpret the concept dimensions of the latent space in a human-readable format.
    The interpretation is a mapping between concept indices and an object that allows interpreting them.
    It can be a label, a description, examples, etc.

    Args:
        interpretation_method: The interpretation method to use to interpret the concepts.
        concepts_indices (int | list[int] | Literal["all"]): The indices of the concepts to interpret.
            If "all", all concepts are interpreted.
        inputs (list[str] | None): The inputs to use for the interpretation.
            Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.
        latent_activations (LatentActivations | dict[str, LatentActivations] | None): The latent activations to use for the interpretation.
            Necessary if the source is `LATENT_ACTIVATIONS`.
            Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.
        concepts_activations (ConceptsActivations | None): The concepts activations to use for the interpretation.
            Necessary if the source is not `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.
        **kwargs: Additional keyword arguments to pass to the interpretation method.

    Returns:
        Mapping[int, Any]: A mapping between concept indices and the interpretation of each concept.
    """
    if concepts_indices == "all":
        concepts_indices = list(range(self.concept_model.nb_concepts))

    # verify
    if latent_activations is not None:
        split_latent_activations = self._sanitize_activations(latent_activations)
    else:
        split_latent_activations = None

    # initialize the interpretation method
    method = interpretation_method(
        model_with_split_points=self.model_with_split_points,
        split_point=self.split_point,
        concept_model=self.concept_model,
        **kwargs,
    )

    # compute the interpretation from inputs and activations
    return method.interpret(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=split_latent_activations,
        concepts_activations=concepts_activations,
    )

input_concept_attribution

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Computes attribution scores over model inputs for a selected concept.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `inputs` | `ModelInputs` | The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset. | *required* |
| `concept` | `int` | Index identifying the position of the concept of interest (score in the `ConceptsActivations` tensor) for which relevant input elements should be retrieved. | *required* |
| `attribution_method` | `type[AttributionExplainer]` | The attribution method to obtain importance scores for input elements. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `list[float]` | A list of attribution scores for each input. |

Source code in interpreto/concepts/base.py
@check_fitted
def input_concept_attribution(
    self,
    inputs: ModelInputs,
    concept: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Attributes model inputs for a selected concept.

    Args:
        inputs (ModelInputs): The input data, which can be a string, a list of tokens/words/clauses/sentences
            or a dataset.
        concept (int): Index identifying the position of the concept of interest (score in the
            `ConceptsActivations` tensor) for which relevant input elements should be retrieved.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each input.
    """
    raise NotImplementedError("Input-to-concept attribution method is not implemented yet.")

concept_output_attribution

concept_output_attribution(inputs, concepts, target, attribution_method, **attribution_kwargs)

Computes the attribution of each concept for the logit of a target output element.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `inputs` | `ModelInputs` | An input data-point for the model. | *required* |
| `concepts` | `Tensor` | Concept activation tensor. | *required* |
| `target` | `int` | The target class for which the concept output attribution should be computed. | *required* |
| `attribution_method` | `type[AttributionExplainer]` | The attribution method to obtain importance scores for input elements. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `list[float]` | A list of attribution scores for each concept. |

Source code in interpreto/concepts/base.py
@check_fitted
def concept_output_attribution(
    self,
    inputs: ModelInputs,
    concepts: ConceptsActivations,
    target: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Computes the attribution of each concept for the logit of a target output element.

    Args:
        inputs (ModelInputs): An input data-point for the model.
        concepts (torch.Tensor): Concept activation tensor.
        target (int): The target class for which the concept output attribution should be computed.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each concept.
    """
    raise NotImplementedError("Concept-to-output attribution method is not implemented yet.")