Neurons as Concepts¶

interpreto.concepts.NeuronsAsConcepts ¶

NeuronsAsConcepts(model_with_split_points, split_point=None)

Bases: ConceptAutoEncoderExplainer

Code: concepts/methods/neurons_as_concepts.py

Concept bottleneck explainer in which the latent space itself is treated as the concept space: each neuron at the split point is one concept.

TODO: Add doc with papers we can redo with it.¶

Attributes:

Name	Type	Description
`model_with_split_points`	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which `concept_model` can be fitted.
`split_point`	`str`	The split point used to train the `concept_model`.
`concept_model`	`IdentityConceptModel`	An identity concept model, present so that this explainer exposes the same interface as the other concept explainers.
`is_fitted`	`bool`	Whether the `concept_model` was fit on model activations.
`has_differentiable_concept_encoder`	`bool`	Whether the `encode_activations` operation is differentiable.
`has_differentiable_concept_decoder`	`bool`	Whether the `decode_concepts` operation is differentiable.

Parameters:

Name	Type	Description	Default
`model_with_split_points` ¶	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained.	required
`split_point` ¶	`str \| None`	The split point used to train the `concept_model`. If None, tries to use the split point of `model_with_split_points` if a single one is defined.	`None`
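The fallback described for `split_point=None` can be pictured with the following minimal sketch. `resolve_split_point` is a hypothetical helper written for illustration, not part of interpreto, and raising `ValueError` in the ambiguous case is an assumption rather than documented behavior.

```python
from __future__ import annotations


def resolve_split_point(available: list[str], split_point: str | None = None) -> str:
    """Pick the split point to use (hypothetical helper, not interpreto's code)."""
    if split_point is not None:
        # An explicit choice must be one of the model's split points.
        if split_point not in available:
            raise ValueError(f"Unknown split point {split_point!r}; available: {available}")
        return split_point
    # With None, the choice is only unambiguous when exactly one split point exists.
    if len(available) != 1:
        raise ValueError("split_point=None requires exactly one split point to be defined.")
    return available[0]


single = resolve_split_point(["layer.3"])
explicit = resolve_split_point(["layer.1", "layer.3"], "layer.3")
```

The layer names above are placeholders; real split point names depend on the wrapped model.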

Source code in interpreto/concepts/methods/neurons_as_concepts.py

def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    split_point: str | None = None,
):
    """
    Initializes the concept explainer with a given split model.

    Args:
        model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on.
            It should have at least one split point on which a concept explainer can be trained.
        split_point (str | None): The split point used to train the `concept_model`. If None, tries to use the
            split point of `model_with_split_points` if a single one is defined.
    """
    # extract the input size from the model activations
    self.model_with_split_points = model_with_split_points
    self.split_point: str = split_point  # type: ignore
    input_size = self.model_with_split_points.get_latent_shape()[self.split_point][-1]

    # initialize
    super().__init__(
        model_with_split_points=model_with_split_points,
        concept_model=IdentityConceptModel(input_size),
        split_point=self.split_point,
    )
    self.has_differentiable_concept_encoder = True
    self.has_differentiable_concept_decoder = True
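To make the role of the identity concept model concrete, here is a rough pure-Python sketch. `IdentityConceptSketch` is a stand-in invented for this example and is not interpreto's `IdentityConceptModel`; the `encode`/`decode` method names, the nested-list activations, and the assumption that the number of concepts equals the last dimension of the latent shape are all illustrative.

```python
from __future__ import annotations


class IdentityConceptSketch:
    """Stand-in for an identity concept model (hypothetical, for illustration only)."""

    def __init__(self, input_size: int):
        # Assumption: one concept per neuron, so the concept count is the input size.
        self.nb_concepts = input_size

    def encode(self, activations: list[list[float]]) -> list[list[float]]:
        # Concepts are the activations themselves, copied unchanged.
        return [list(row) for row in activations]

    def decode(self, concepts: list[list[float]]) -> list[list[float]]:
        # Decoding is also the identity.
        return [list(row) for row in concepts]


latent_shape = (8, 4)  # hypothetical (n_samples, hidden_size)
model = IdentityConceptSketch(latent_shape[-1])  # mirrors taking the last dimension
activations = [[0.1, -0.3, 0.0, 2.5]]
concepts = model.encode(activations)
```

Because both directions are the identity, concept index `i` refers to neuron `i` at the split point.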

interpret ¶

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interpret the concept dimensions of the latent space in a human-readable format. The interpretation is a mapping from concept indices to an object that helps interpret each concept, such as a label, a description, or examples.

Parameters:

Name	Type	Description	Default
`interpretation_method` ¶	`type[BaseConceptInterpretationMethod]`	The interpretation method to use to interpret the concepts.	required
`concepts_indices` ¶	`int \| list[int] \| Literal['all']`	The indices of the concepts to interpret. If "all", all concepts are interpreted.	required
`inputs` ¶	`list[str] \| None`	The inputs to use for the interpretation. Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.	`None`
`latent_activations` ¶	`LatentActivations \| dict[str, LatentActivations] \| None`	The latent activations to use for the interpretation. Necessary if the source is `LATENT_ACTIVATIONS`. Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.	`None`
`concepts_activations` ¶	`ConceptsActivations \| None`	The concepts activations to use for the interpretation. Necessary if the source is `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.	`None`
`**kwargs` ¶		Additional keyword arguments to pass to the interpretation method.	`{}`

Returns:

Type	Description
`Mapping[int, Any]`	A mapping between the concepts indices and the interpretation of the concepts.

Source code in interpreto/concepts/base.py

@check_fitted
def interpret(
    self,
    interpretation_method: type[BaseConceptInterpretationMethod],
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: dict[str, LatentActivations] | LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
    **kwargs,
) -> Mapping[int, Any]:
    """
    Interpret the concepts dimensions in the latent space into a human-readable format.
    The interpretation is a mapping between the concepts indices and an object allowing to interpret them.
    It can be a label, a description, examples, etc.

    Args:
        interpretation_method: The interpretation method to use to interpret the concepts.
        concepts_indices (int | list[int] | Literal["all"]): The indices of the concepts to interpret.
            If "all", all concepts are interpreted.
        inputs (list[str] | None): The inputs to use for the interpretation.
            Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.
        latent_activations (LatentActivations | dict[str, LatentActivations] | None): The latent activations to use for the interpretation.
            Necessary if the source is `LATENT_ACTIVATIONS`.
            Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.
        concepts_activations (ConceptsActivations | None): The concepts activations to use for the interpretation.
            Necessary if the source is `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.
        **kwargs: Additional keyword arguments to pass to the interpretation method.

    Returns:
        Mapping[int, Any]: A mapping between the concepts indices and the interpretation of the concepts.
    """
    if concepts_indices == "all":
        concepts_indices = list(range(self.concept_model.nb_concepts))

    # verify
    if latent_activations is not None:
        split_latent_activations = self._sanitize_activations(latent_activations)
    else:
        split_latent_activations = None

    # initialize the interpretation method
    method = interpretation_method(
        model_with_split_points=self.model_with_split_points,
        split_point=self.split_point,
        concept_model=self.concept_model,
        **kwargs,
    )

    # compute the interpretation from inputs and activations
    return method.interpret(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=split_latent_activations,
        concepts_activations=concepts_activations,
    )
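The listing above receives the interpretation method as a class, instantiates it with the explainer's own state plus `**kwargs`, and then delegates to its `interpret` method. The sketch below reproduces that flow with stand-ins: `IndexLabelMethod` and `interpret_sketch` are hypothetical names, and the `None` and `"layer"` placeholders replace the real model objects.

```python
from __future__ import annotations

from typing import Any, Mapping


class IndexLabelMethod:
    """Hypothetical interpretation method: labels each concept by its index."""

    def __init__(self, model_with_split_points, split_point, concept_model, prefix="neuron"):
        self.prefix = prefix  # extra option, received through **kwargs

    def interpret(self, concepts_indices, inputs=None, latent_activations=None,
                  concepts_activations=None) -> Mapping[int, Any]:
        return {i: f"{self.prefix}_{i}" for i in concepts_indices}


def interpret_sketch(method_cls, concepts_indices, nb_concepts, **kwargs):
    # "all" expands to every concept index, as in the listing above.
    if concepts_indices == "all":
        concepts_indices = list(range(nb_concepts))
    # The class (not an instance) is passed in and instantiated here.
    method = method_cls(model_with_split_points=None, split_point="layer",
                        concept_model=None, **kwargs)
    return method.interpret(concepts_indices=concepts_indices)


labels = interpret_sketch(IndexLabelMethod, "all", nb_concepts=3, prefix="unit")
subset = interpret_sketch(IndexLabelMethod, [2], nb_concepts=3)
```

Passing the class rather than an instance lets the explainer supply its own model, split point, and concept model, so callers only provide method-specific options.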

input_concept_attribution ¶

input_concept_attribution(inputs, concept, attribution_method, **attribution_kwargs)

Attributes model inputs for a selected concept. Not implemented yet: the method body raises `NotImplementedError`.

Parameters:

Name	Type	Description	Default
`inputs` ¶	`ModelInputs`	The input data, which can be a string, a list of tokens/words/clauses/sentences or a dataset.	required
`concept` ¶	`int`	Index identifying the position of the concept of interest (score in the `ConceptsActivations` tensor) for which relevant input elements should be retrieved.	required
`attribution_method` ¶	`type[AttributionExplainer]`	The attribution method to obtain importance scores for input elements.	required

Returns:

Type	Description
`list[float]`	A list of attribution scores for each input.

Source code in interpreto/concepts/base.py

@check_fitted
def input_concept_attribution(
    self,
    inputs: ModelInputs,
    concept: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Attributes model inputs for a selected concept.

    Args:
        inputs (ModelInputs): The input data, which can be a string, a list of tokens/words/clauses/sentences
            or a dataset.
        concept (int): Index identifying the position of the concept of interest (score in the
            `ConceptsActivations` tensor) for which relevant input elements should be retrieved.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each input.
    """
    raise NotImplementedError("Input-to-concept attribution method is not implemented yet.")

concept_output_attribution ¶

concept_output_attribution(inputs, concepts, target, attribution_method, **attribution_kwargs)

Computes the attribution of each concept for the logit of a target output element. Not implemented yet: the method body raises `NotImplementedError`.

Parameters:

Name	Type	Description	Default
`inputs` ¶	`ModelInputs`	An input data-point for the model.	required
`concepts` ¶	`Tensor`	Concept activation tensor.	required
`target` ¶	`int`	The target class for which the concept output attribution should be computed.	required
`attribution_method` ¶	`type[AttributionExplainer]`	The attribution method to obtain importance scores for input elements.	required

Returns:

Type	Description
`list[float]`	A list of attribution scores for each concept.

Source code in interpreto/concepts/base.py

@check_fitted
def concept_output_attribution(
    self,
    inputs: ModelInputs,
    concepts: ConceptsActivations,
    target: int,
    attribution_method: type[AttributionExplainer],
    **attribution_kwargs,
) -> list[float]:
    """Computes the attribution of each concept for the logit of a target output element.

    Args:
        inputs (ModelInputs): An input data-point for the model.
        concepts (torch.Tensor): Concept activation tensor.
        target (int): The target class for which the concept output attribution should be computed.
        attribution_method: The attribution method to obtain importance scores for input elements.

    Returns:
        A list of attribution scores for each concept.
    """
    raise NotImplementedError("Concept-to-output attribution method is not implemented yet.")
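Since both attribution methods currently end in `raise NotImplementedError`, calling code may want to guard against that. The following is a small sketch of such a guard; `ExplainerSketch` and `try_input_attribution` are hypothetical stand-ins, and returning `None` as a fallback is just one possible design.

```python
from __future__ import annotations


class ExplainerSketch:
    """Stand-in explainer whose attribution method is not implemented (hypothetical)."""

    def input_concept_attribution(self, inputs, concept, attribution_method, **attribution_kwargs):
        raise NotImplementedError("Input-to-concept attribution method is not implemented yet.")


def try_input_attribution(explainer, inputs, concept, attribution_method=None):
    """Return attribution scores, or None when the method is unavailable."""
    try:
        return explainer.input_concept_attribution(inputs, concept, attribution_method)
    except NotImplementedError:
        # Fall back gracefully instead of crashing the surrounding pipeline.
        return None


scores = try_input_attribution(ExplainerSketch(), ["an example sentence"], concept=0)
```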
