
Neurons as Concepts

interpreto.concepts.NeuronsAsConcepts

NeuronsAsConcepts(model_with_split_points, split_point=None)

Bases: ConceptAutoEncoderExplainer

Code: concepts/methods/neurons_as_concepts.py

Concept bottleneck explainer where the latent space is considered as the concept space.

TODO: Add doc with papers we can redo with it.
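
Since the concept model is the identity, concept activations coincide with the latent activations at the split point. A minimal conceptual sketch in plain PyTorch (not the library implementation; the tensor shape is illustrative):

import torch

# Conceptual sketch: with NeuronsAsConcepts, "encoding" latent activations into
# the concept space and "decoding" them back are identity maps, so each latent
# neuron is itself treated as a concept.
latent_activations = torch.randn(8, 768)   # activations captured at the split point (illustrative shape)
concept_activations = latent_activations   # encode: identity
reconstruction = concept_activations       # decode: identity
assert torch.equal(reconstruction, latent_activations)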

Attributes:

    model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on. It should have at least one split point on which concept_model can be fitted.
    split_point (str): The split point used to train the concept_model.
    concept_model (IdentityConceptModel): An identity concept model for harmonization.
    is_fitted (bool): Whether the concept_model was fit on model activations.
    has_differentiable_concept_encoder (bool): Whether the encode_activations operation is differentiable.
    has_differentiable_concept_decoder (bool): Whether the decode_concepts operation is differentiable.

Parameters:

    model_with_split_points (ModelWithSplitPoints, required): The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained.
    split_point (str | None, default: None): The split point used to train the concept_model. If None, tries to use the split point of model_with_split_points if a single one is defined.
Source code in interpreto/concepts/methods/neurons_as_concepts.py
def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    split_point: str | None = None,
):
    """
    Initializes the concept explainer with a given split model.

    Args:
        model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on.
            It should have at least one split point on which a concept explainer can be trained.
        split_point (str | None): The split point used to train the `concept_model`. If None, tries to use the
            split point of `model_with_split_points` if a single one is defined.
    """
    # extract the input size from the model activations
    self.model_with_split_points = model_with_split_points
    self.split_point: str = split_point  # type: ignore
    input_size = self.model_with_split_points.get_latent_shape()[self.split_point][-1]

    # initialize
    super().__init__(
        model_with_split_points=model_with_split_points,
        concept_model=IdentityConceptModel(input_size),
        split_point=self.split_point,
    )
    self.has_differentiable_concept_encoder = True
    self.has_differentiable_concept_decoder = True
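
A minimal instantiation sketch. The ModelWithSplitPoints construction below is an assumption for illustration (the model identifier, the split point path, and the keyword argument name are placeholders; refer to the ModelWithSplitPoints documentation for its actual signature), and the import paths follow the module names shown on this page.

from interpreto import ModelWithSplitPoints
from interpreto.concepts import NeuronsAsConcepts

# Illustrative assumption: wrap a model and declare the layer to split on.
# Model identifier and split point path below are hypothetical placeholders.
model_with_split_points = ModelWithSplitPoints(
    "bert-base-uncased",                     # hypothetical model identifier
    split_points=["bert.encoder.layer.6"],   # hypothetical split point path
)

# With a single split point defined, split_point can be left as None.
explainer = NeuronsAsConcepts(model_with_split_points)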

interpret

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interpret the concept dimensions of the latent space into a human-readable format. The interpretation is a mapping between concept indices and an object that allows interpreting them: a label, a description, examples, etc.

Parameters:

    interpretation_method (type[BaseConceptInterpretationMethod], required): The interpretation method to use to interpret the concepts.
    concepts_indices (int | list[int] | Literal["all"], required): The indices of the concepts to interpret. If "all", all concepts are interpreted.
    inputs (list[str] | None, default: None): The inputs to use for the interpretation. Necessary if the source is not VOCABULARY, as examples are extracted from the inputs.
    latent_activations (LatentActivations | dict[str, LatentActivations] | None, default: None): The latent activations to use for the interpretation. Necessary if the source is LATENT_ACTIVATIONS. Otherwise, it is computed from the inputs or ignored if the source is CONCEPT_ACTIVATIONS.
    concepts_activations (ConceptsActivations | None, default: None): The concepts activations to use for the interpretation. Necessary if the source is not CONCEPT_ACTIVATIONS. Otherwise, it is computed from the latent activations.
    **kwargs (default: {}): Additional keyword arguments to pass to the interpretation method.

Returns:

    Mapping[int, Any]: A mapping between the concept indices and the interpretation of the concepts.

Source code in interpreto/concepts/base.py
@check_fitted
def interpret(
    self,
    interpretation_method: type[BaseConceptInterpretationMethod],
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: dict[str, LatentActivations] | LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
    **kwargs,
) -> Mapping[int, Any]:
    """
    Interpret the concepts dimensions in the latent space into a human-readable format.
    The interpretation is a mapping between the concepts indices and an object allowing to interpret them.
    It can be a label, a description, examples, etc.

    Args:
        interpretation_method: The interpretation method to use to interpret the concepts.
        concepts_indices (int | list[int] | Literal["all"]): The indices of the concepts to interpret.
            If "all", all concepts are interpreted.
        inputs (list[str] | None): The inputs to use for the interpretation.
            Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.
        latent_activations (LatentActivations | dict[str, LatentActivations] | None): The latent activations to use for the interpretation.
            Necessary if the source is `LATENT_ACTIVATIONS`.
            Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.
        concepts_activations (ConceptsActivations | None): The concepts activations to use for the interpretation.
            Necessary if the source is not `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.
        **kwargs: Additional keyword arguments to pass to the interpretation method.

    Returns:
        Mapping[int, Any]: A mapping between the concepts indices and the interpretation of the concepts.
    """
    if concepts_indices == "all":
        concepts_indices = list(range(self.concept_model.nb_concepts))

    # verify
    if latent_activations is not None:
        split_latent_activations = self._sanitize_activations(latent_activations)
    else:
        split_latent_activations = None

    # initialize the interpretation method
    method = interpretation_method(
        model_with_split_points=self.model_with_split_points,
        split_point=self.split_point,
        concept_model=self.concept_model,
        **kwargs,
    )

    # compute the interpretation from inputs and activations
    return method.interpret(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=split_latent_activations,
        concepts_activations=concepts_activations,
    )
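
A hedged sketch of a call to interpret, continuing the instantiation example above. SomeInterpretationMethod is a hypothetical placeholder for a concrete BaseConceptInterpretationMethod subclass (it is not a real interpreto name), and the input texts are illustrative.

texts = ["The movie was great.", "The plot was dull."]

# `SomeInterpretationMethod` is a hypothetical placeholder for whichever
# BaseConceptInterpretationMethod subclass you use; replace it accordingly.
interpretations = explainer.interpret(
    interpretation_method=SomeInterpretationMethod,
    concepts_indices="all",   # or a single int, or a list of ints
    inputs=texts,             # needed when interpretations are built from input examples
)
# `interpretations` maps each concept index to its interpretation
# (a label, a description, examples, ...).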

input_concept_attribution

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Attributes model inputs for a selected concept.

Parameters:

    inputs (ModelInputs, required): The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset.
    concept (int, required): Index identifying the position of the concept of interest (score in the ConceptsActivations tensor) for which relevant input elements should be retrieved.
    attribution_method (type[AttributionExplainer], required): The attribution method to obtain importance scores for input elements.

Returns:

    list[float]: A list of attribution scores for each input.

Source code in interpreto/concepts/base.py
@check_fitted
def input_concept_attribution(
    self,
    inputs: ModelInputs,
    concept: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Attributes model inputs for a selected concept.

    Args:
        inputs (ModelInputs): The input data, which can be a string, a list of tokens/words/clauses/sentences
            or a dataset.
        concept (int): Index identifying the position of the concept of interest (score in the
            `ConceptsActivations` tensor) for which relevant input elements should be retrieved.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each input.
    """
    raise NotImplementedError("Input-to-concept attribution method is not implemented yet.")

concept_output_attribution

concept_output_attribution(inputs, concepts, target, attribution_method, **attribution_kwargs)

Computes the attribution of each concept for the logit of a target output element.

Parameters:

    inputs (ModelInputs, required): An input data-point for the model.
    concepts (Tensor, required): Concept activation tensor.
    target (int, required): The target class for which the concept output attribution should be computed.
    attribution_method (type[AttributionExplainer], required): The attribution method to obtain importance scores for input elements.

Returns:

    list[float]: A list of attribution scores for each concept.

Source code in interpreto/concepts/base.py
@check_fitted
def concept_output_attribution(
    self,
    inputs: ModelInputs,
    concepts: ConceptsActivations,
    target: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Computes the attribution of each concept for the logit of a target output element.

    Args:
        inputs (ModelInputs): An input data-point for the model.
        concepts (torch.Tensor): Concept activation tensor.
        target (int): The target class for which the concept output attribution should be computed.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each concept.
    """
    raise NotImplementedError("Concept-to-output attribution method is not implemented yet.")