Cockatiel¶

Implementation of the COCKATIEL framework introduced in "COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP" by Jourdan et al. (2023).

interpreto.concepts.Cockatiel ¶

Cockatiel(model_with_split_points, *, nb_concepts, split_point=None, device='cpu', force_relu=False, **kwargs)

Bases: NMFConcepts

Code: concepts/methods/cockatiel.py

Implementation of the Cockatiel concept explainer by Jourdan et al. (2023)¹.

¹ Jourdan F., Picard A., Fel T., Risser L., Loubes J.-M., and Asher N. COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP. Findings of the Association for Computational Linguistics (ACL 2023), pp. 5120–5136, 2023.

Attributes:

Name	Type	Description
`model_with_split_points`	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which `concept_model` can be fitted.
`split_point`	`str \| None`	The split point used to train the `concept_model`. Default: `None`, set only when the concept explainer is fitted.
`concept_model`	`SemiNMF`	An Overcomplete NMF encoder-decoder.
`force_relu`	`bool`	Whether to apply a ReLU to the activations so that they are non-negative, as NMF requires.
`is_fitted`	`bool`	Whether the `concept_model` was fit on model activations.
`has_differentiable_concept_encoder`	`bool`	Whether the `encode_activations` operation is differentiable.
`has_differentiable_concept_decoder`	`bool`	Whether the `decode_concepts` operation is differentiable.
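
For intuition about what the `concept_model` learns: a non-negative matrix factorization approximates a non-negative activation matrix A (n × d) by a product U·W, where U (n × k) holds the concept activations and W (k × d) is the dictionary. The sketch below uses the classic multiplicative update rules on plain Python lists. It is an illustration under stated assumptions, not the Overcomplete solver; the helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`) and the toy data are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(a, u, w):
    """Squared reconstruction error ||A - U W||_F^2."""
    approx = matmul(u, w)
    return sum((x - y) ** 2 for ra, rb in zip(a, approx) for x, y in zip(ra, rb))


def nmf(a, nb_concepts, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix A (n x d) into U (n x k) and W (k x d)."""
    rng = random.Random(seed)
    n, d = len(a), len(a[0])
    # Strictly positive initialization keeps every later update non-negative.
    u = [[rng.random() + 0.1 for _ in range(nb_concepts)] for _ in range(n)]
    w = [[rng.random() + 0.1 for _ in range(d)] for _ in range(nb_concepts)]
    for _ in range(n_iter):
        # W <- W * (U^T A) / (U^T U W), element-wise
        ut = transpose(u)
        num, den = matmul(ut, a), matmul(matmul(ut, u), w)
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(d)]
             for i in range(nb_concepts)]
        # U <- U * (A W^T) / (U W W^T), element-wise
        wt = transpose(w)
        num, den = matmul(a, wt), matmul(u, matmul(w, wt))
        u = [[u[i][j] * num[i][j] / (den[i][j] + eps) for j in range(nb_concepts)]
             for i in range(n)]
    return u, w


# Toy "activations": 4 samples, 3 features, non-negative and of rank 2.
activations = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 3.0, 5.0], [1.0, 6.0, 4.0]]
concepts, dictionary = nmf(activations, nb_concepts=2)
```

Both factors stay non-negative by construction, which is why the input activations must be non-negative as well.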

Parameters:

Name	Type	Description	Default
`model_with_split_points` ¶	`ModelWithSplitPoints`	The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained.	required
`nb_concepts` ¶	`int`	Number of concepts, i.e. the size of the NMF concept space.	required
`split_point` ¶	`str \| None`	The split point used to train the `concept_model`. If None, tries to use the split point of `model_with_split_points` if a single one is defined.	`None`
`device` ¶	`device \| str`	Device to use for the `concept_model`.	`'cpu'`
`force_relu` ¶	`bool`	Whether to apply a ReLU to the activations so that they are non-negative, as NMF requires.	`False`
`**kwargs` ¶	`dict`	Additional keyword arguments to pass to the `concept_model`. See the Overcomplete documentation of the provided `concept_model_class` for more details.	`{}`

Source code in interpreto/concepts/methods/overcomplete.py

def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    *,
    nb_concepts: int,
    split_point: str | None = None,
    device: torch.device | str = "cpu",
    force_relu: bool = False,
    **kwargs,
):
    """
    Initialize the concept bottleneck explainer based on the Overcomplete BaseOptimDictionaryLearning framework.

    Args:
        model_with_split_points (ModelWithSplitPoints): The model to apply the explanation on.
            It should have at least one split point on which a concept explainer can be trained.
        nb_concepts (int): Size of the SAE concept space.
        split_point (str | None): The split point used to train the `concept_model`. If None, tries to use the
            split point of `model_with_split_points` if a single one is defined.
        device (torch.device | str): Device to use for the `concept_module`.
        force_relu (bool): Whether to force the activations to be positive.
        **kwargs (dict): Additional keyword arguments to pass to the `concept_module`.
            See the Overcomplete documentation of the provided `concept_model_class` for more details.
    """
    super().__init__(
        model_with_split_points,
        nb_concepts=nb_concepts,
        split_point=split_point,
        device=device,
        **kwargs,
    )
    self.force_relu = force_relu
    self.has_differentiable_concept_encoder = False
    self.has_differentiable_concept_decoder = True

fit ¶

fit(activations, *, overwrite=False, **kwargs)

Fit an Overcomplete OptimDictionaryLearning model on the given activations.

Parameters:

Name	Type	Description	Default
`activations` ¶	`Tensor \| dict[str, Tensor]`	The activations used for fitting the `concept_model`. If a dictionary is provided, the activation corresponding to `split_point` will be used.	required
`overwrite` ¶	`bool`	Whether to overwrite the current model if it has already been fitted.	`False`
`**kwargs` ¶	`dict`	Additional keyword arguments to pass to the `concept_model`. See the Overcomplete documentation of the provided `concept_model` for more details.	`{}`

Source code in interpreto/concepts/methods/overcomplete.py

def fit(self, activations: LatentActivations | dict[str, LatentActivations], *, overwrite: bool = False, **kwargs):
    """Fit an Overcomplete OptimDictionaryLearning model on the given activations.

    Args:
        activations (torch.Tensor | dict[str, torch.Tensor]): The activations used for fitting the `concept_model`.
            If a dictionary is provided, the activation corresponding to `split_point` will be used.
        overwrite (bool): Whether to overwrite the current model if it has already been fitted.
            Default: False.
        **kwargs (dict): Additional keyword arguments to pass to the `concept_model`.
            See the Overcomplete documentation of the provided `concept_model` for more details.
    """
    split_activations = self._prepare_fit(activations, overwrite=overwrite)
    if (split_activations < 0).any():
        if self.force_relu:
            split_activations = torch.nn.functional.relu(split_activations)
        else:
            raise ValueError(
                "The activations should be positive. If you want to force the activations to be positive, "
                "use the `NMFConcepts(..., force_relu=True)`."
            )
    self.concept_model.fit(split_activations, **kwargs)
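
The non-negativity guard in `fit` can be summarized with a small standalone sketch. Plain nested lists stand in for tensors here, and `ensure_non_negative` is a hypothetical helper name, not part of the library.

```python
def ensure_non_negative(activations, force_relu=False):
    """Return activations usable by NMF, mirroring the check done in `fit`."""
    has_negative = any(value < 0 for row in activations for value in row)
    if not has_negative:
        return activations
    if not force_relu:
        raise ValueError("The activations should be positive.")
    # ReLU: clamp negative entries to zero and keep the rest unchanged.
    return [[max(value, 0.0) for value in row] for row in activations]


raw = [[0.5, -1.2], [2.0, 0.0]]
clipped = ensure_non_negative(raw, force_relu=True)
# clipped == [[0.5, 0.0], [2.0, 0.0]]
```

Clamping discards whatever information the negative entries carried, so `force_relu=True` is a trade-off rather than a free fix.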

encode_activations ¶

encode_activations(activations)

Encode the given activations using the concept_model encoder.

Parameters:

Name	Type	Description	Default
`activations` ¶	`LatentActivations`	The activations to encode.	required

Returns:

Type	Description
`Tensor`	The encoded concept activations.

Source code in interpreto/concepts/methods/overcomplete.py

@check_fitted
def encode_activations(self, activations: LatentActivations) -> torch.Tensor:  # ConceptsActivations
    """Encode the given activations using the `concept_model` encoder.

    Args:
        activations (LatentActivations): The activations to encode.

    Returns:
        The encoded concept activations.
    """
    self._sanitize_activations(activations)
    if (activations < 0).any():
        if self.force_relu:
            activations = torch.nn.functional.relu(activations)
        else:
            raise ValueError(
                "The activations should be positive. If you want to force the activations to be positive, "
                "use the `NMFConcepts(..., force_relu=True)`."
            )
    return self.concept_model.encode(activations)  # type: ignore
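
The constructor flags the encoder as non-differentiable (`has_differentiable_concept_encoder = False`). One plausible reason, stated here as an assumption rather than a description of Overcomplete's internals, is that NMF encoding solves a non-negative least-squares problem for each new activation instead of applying a closed-form layer. A minimal sketch of such an encoder, using multiplicative updates with the dictionary held fixed (the function name and the toy dictionary are hypothetical):

```python
def encode(activation, dictionary, n_iter=500, eps=1e-12):
    """Find u >= 0 that minimizes ||activation - u @ dictionary||^2 for one sample."""
    k, d = len(dictionary), len(activation)
    u = [1.0] * k
    # a W^T (length k) and W W^T (k x k) do not change across iterations.
    a_wt = [sum(activation[j] * dictionary[i][j] for j in range(d)) for i in range(k)]
    w_wt = [[sum(dictionary[i][j] * dictionary[m][j] for j in range(d)) for m in range(k)]
            for i in range(k)]
    for _ in range(n_iter):
        u_wwt = [sum(u[m] * w_wt[m][i] for m in range(k)) for i in range(k)]
        # Multiplicative update: u <- u * (a W^T) / (u W W^T), element-wise.
        u = [u[i] * a_wt[i] / (u_wwt[i] + eps) for i in range(k)]
    return u


# Hypothetical dictionary with two concepts over three features.
dictionary = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
# This activation is exactly 2 * concept 0 + 3 * concept 1.
codes = encode([2.0, 3.0, 2.0], dictionary)
```

On this toy input the codes converge to approximately 2 and 3, the coefficients used to build the activation.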

decode_concepts ¶

decode_concepts(concepts)

Decode the given concepts using the concept_model decoder.

Parameters:

Name	Type	Description	Default
`concepts` ¶	`ConceptsActivations`	The concepts to decode.	required

Returns:

Type	Description
`Tensor`	The decoded model activations.

Source code in interpreto/concepts/base.py

@check_fitted
def decode_concepts(self, concepts: ConceptsActivations) -> torch.Tensor:  # LatentActivations
    """Decode the given concepts using the `concept_model` decoder.

    Args:
        concepts (ConceptsActivations): The concepts to decode.

    Returns:
        The decoded model activations.
    """
    return self.concept_model.decode(concepts)  # type: ignore
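
Assuming the decoder is the usual linear NMF reconstruction (concept activations multiplied by the dictionary), decoding reduces to a matrix product. The sketch below shows that product on nested lists; `decode` and the toy dictionary are illustrative names and values, not the library's code.

```python
def decode(concepts, dictionary):
    """Map concept activations (n x k) back to the latent space (n x d)."""
    nb_features = len(dictionary[0])
    return [
        [sum(c * atom[j] for c, atom in zip(sample, dictionary)) for j in range(nb_features)]
        for sample in concepts
    ]


# Hypothetical dictionary: k = 2 concepts over d = 3 latent features.
dictionary = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
reconstruction = decode([[2.0, 3.0], [0.0, 1.0]], dictionary)
# reconstruction == [[2.0, 3.0, 2.0], [0.0, 1.0, 0.0]]
```

A plain product like this is differentiable with respect to the concept activations, which is consistent with `has_differentiable_concept_decoder = True` in the constructor.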

get_dictionary ¶

get_dictionary()

Get the dictionary learned by the fitted concept_model.

Returns:

Type	Description
`Tensor`	The learned dictionary, as a `torch.Tensor`.

Source code in interpreto/concepts/base.py

@check_fitted
def get_dictionary(self) -> torch.Tensor:  # TODO: add this to tests
    """Get the dictionary learned by the fitted `concept_model`.

    Returns:
        torch.Tensor: A `torch.Tensor` containing the learned dictionary.
    """
    return self.concept_model.get_dictionary()  # type: ignore

interpret ¶

interpret(interpretation_method, concepts_indices, inputs=None, latent_activations=None, concepts_activations=None, **kwargs)

Interpret the concept dimensions of the latent space in a human-readable format. The interpretation maps each concept index to an object that helps interpret it, such as a label, a description, or examples.

Parameters:

Name	Type	Description	Default
`interpretation_method` ¶	`type[BaseConceptInterpretationMethod]`	The interpretation method to use to interpret the concepts.	required
`concepts_indices` ¶	`int \| list[int] \| Literal['all']`	The indices of the concepts to interpret. If "all", all concepts are interpreted.	required
`inputs` ¶	`list[str] \| None`	The inputs to use for the interpretation. Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.	`None`
`latent_activations` ¶	`LatentActivations \| dict[str, LatentActivations] \| None`	The latent activations to use for the interpretation. Necessary if the source is `LATENT_ACTIVATIONS`. Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.	`None`
`concepts_activations` ¶	`ConceptsActivations \| None`	The concepts activations to use for the interpretation. Necessary if the source is `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.	`None`
`**kwargs` ¶		Additional keyword arguments to pass to the interpretation method.	`{}`

Returns:

Type	Description
`Mapping[int, Any]`	A mapping between the concept indices and their interpretations.

Source code in interpreto/concepts/base.py

@check_fitted
def interpret(
    self,
    interpretation_method: type[BaseConceptInterpretationMethod],
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: dict[str, LatentActivations] | LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
    **kwargs,
) -> Mapping[int, Any]:
    """
    Interpret the concepts dimensions in the latent space into a human-readable format.
    The interpretation is a mapping between the concepts indices and an object allowing to interpret them.
    It can be a label, a description, examples, etc.

    Args:
        interpretation_method: The interpretation method to use to interpret the concepts.
        concepts_indices (int | list[int] | Literal["all"]): The indices of the concepts to interpret.
            If "all", all concepts are interpreted.
        inputs (list[str] | None): The inputs to use for the interpretation.
            Necessary if the source is not `VOCABULARY`, as examples are extracted from the inputs.
        latent_activations (LatentActivations | dict[str, LatentActivations] | None): The latent activations to use for the interpretation.
            Necessary if the source is `LATENT_ACTIVATIONS`.
            Otherwise, it is computed from the inputs or ignored if the source is `CONCEPT_ACTIVATIONS`.
        concepts_activations (ConceptsActivations | None): The concepts activations to use for the interpretation.
            Necessary if the source is not `CONCEPT_ACTIVATIONS`. Otherwise, it is computed from the latent activations.
        **kwargs: Additional keyword arguments to pass to the interpretation method.

    Returns:
        Mapping[int, Any]: A mapping between the concepts indices and the interpretation of the concepts.
    """
    if concepts_indices == "all":
        concepts_indices = list(range(self.concept_model.nb_concepts))

    # verify
    if latent_activations is not None:
        split_latent_activations = self._sanitize_activations(latent_activations)
    else:
        split_latent_activations = None

    # initialize the interpretation method
    method = interpretation_method(
        model_with_split_points=self.model_with_split_points,
        split_point=self.split_point,
        concept_model=self.concept_model,
        **kwargs,
    )

    # compute the interpretation from inputs and activations
    return method.interpret(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=split_latent_activations,
        concepts_activations=concepts_activations,
    )
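
To make the returned `Mapping[int, Any]` concrete, here is a minimal sketch of one simple interpretation strategy: labeling each concept with the inputs that activate it most. The function `top_k_inputs`, its arguments, and the toy data are hypothetical and do not reproduce any interpretation method shipped with the library. Wrapping a single `int` index in a list is also an assumption of this sketch.

```python
def top_k_inputs(concepts_indices, inputs, concepts_activations, k=2):
    """Map each concept index to the k inputs that activate it most."""
    nb_concepts = len(concepts_activations[0])
    if concepts_indices == "all":
        concepts_indices = list(range(nb_concepts))
    elif isinstance(concepts_indices, int):
        # Assumption: a single index is treated as a one-element list.
        concepts_indices = [concepts_indices]
    interpretation = {}
    for concept in concepts_indices:
        # Sort input positions by decreasing activation of this concept.
        ranked = sorted(
            range(len(inputs)),
            key=lambda i: concepts_activations[i][concept],
            reverse=True,
        )
        interpretation[concept] = [inputs[i] for i in ranked[:k]]
    return interpretation


inputs = ["great movie", "terrible plot", "loved the cast", "boring and slow"]
# Hypothetical concept activations: one row per input, one column per concept.
activations = [[0.9, 0.0], [0.1, 0.8], [0.7, 0.1], [0.0, 0.6]]
labels = top_k_inputs("all", inputs, activations)
# labels == {0: ["great movie", "loved the cast"], 1: ["terrible plot", "boring and slow"]}
```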

input_concept_attribution ¶

input_concept_attribution(inputs, concept, **attribution_kwargs)

Computes the attribution of each input to a given concept.

Parameters:

Name	Type	Description	Default
`inputs` ¶	`ModelInputs`	The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset.	required
`concept` ¶	`int \| list[int]`	The concept index (or list of concepts indices) to analyze.	required

Returns:

Type	Description
`list[float]`	A list of attribution scores for each input.

Source code in interpreto/concepts/methods/cockatiel.py

def input_concept_attribution(
    self,
    inputs: ModelInputs,
    concept: int,
    **attribution_kwargs,
) -> list[float]:
    """
    Computes the attribution of each input to a given concept.

    Args:
        inputs (ModelInputs): The input data, which can be a string, a list of tokens/words/clauses/sentences, or a dataset.
        concept (int | list[int]): The concept index (or list of concepts indices) to analyze.

    Returns:
        A list of attribution scores for each input.
    """
    return super().input_concept_attribution(
        inputs, concept, "Occlusion", **attribution_kwargs
    )  # TODO: add occlusion class when it exists
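
The method delegates to the parent class with the `"Occlusion"` attribution method. As a rough illustration of the idea, occlusion scores each input element by how much the quantity of interest (here, a concept activation) drops when that element is removed. In the sketch below, `toy_concept_activation` is a hypothetical stand-in for running the model up to the split point and encoding the result; it is not how the library computes activations.

```python
def occlusion_attribution(words, concept_activation):
    """Score each word by the drop in concept activation when it is removed."""
    baseline = concept_activation(words)
    scores = []
    for i in range(len(words)):
        occluded = words[:i] + words[i + 1:]  # the sentence without word i
        scores.append(baseline - concept_activation(occluded))
    return scores


def toy_concept_activation(words):
    """Hypothetical 'positive sentiment' concept: a weighted count of cue words."""
    weights = {"great": 2.0, "loved": 1.5}
    return sum(weights.get(word, 0.0) for word in words)


scores = occlusion_attribution("i loved this great film".split(), toy_concept_activation)
# scores == [0.0, 1.5, 0.0, 2.0, 0.0]
```

Words the toy concept ignores get a score of zero, while the two cue words receive their full weight.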

concept_output_attribution ¶

concept_output_attribution(inputs, concepts, target, **attribution_kwargs)

Computes the attribution of each concept for the logit of a target output element.

Parameters:

Name	Type	Description	Default
`inputs` ¶	`ModelInputs`	An input datapoint for the model.	required
`concepts` ¶	`Tensor`	Concept activation tensor.	required
`target` ¶	`int`	The target class for which the concept output attribution should be computed.	required

Returns:

Type	Description
`list[float]`	A list of attribution scores for each concept.

Source code in interpreto/concepts/methods/cockatiel.py

def concept_output_attribution(
    self, inputs: ModelInputs, concepts: ConceptsActivations, target: int, **attribution_kwargs
) -> list[float]:
    """Computes the attribution of each concept for the logit of a target output element.

    Args:
        inputs (ModelInputs): An input datapoint for the model.
        concepts (torch.Tensor): Concept activation tensor.
        target (int): The target class for which the concept output attribution should be computed.

    Returns:
        A list of attribution scores for each concept.
    """
    return super().concept_output_attribution(
        inputs, concepts, target, attribution_method="Sobol", **attribution_kwargs
    )  # TODO: add sobol class when it exists
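
The method delegates to the parent class with `attribution_method="Sobol"`. As a sketch of the underlying idea, total-order Sobol indices measure how much of the output variance is due to each input when the inputs are randomly perturbed. Below, each concept activation is scaled by a random mask in [0, 1] and the indices are estimated with the Jansen estimator. The linear head, its weights, and all function names are hypothetical; a real classifier head is not linear, and this is not the library's implementation.

```python
import random


def total_sobol_indices(f, nb_inputs, nb_samples=2000, seed=0):
    """Estimate total-order Sobol indices of f on [0, 1]^nb_inputs (Jansen estimator)."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(nb_inputs)] for _ in range(nb_samples)]
    b = [[rng.random() for _ in range(nb_inputs)] for _ in range(nb_samples)]
    f_a = [f(row) for row in a]
    mean = sum(f_a) / nb_samples
    variance = sum((y - mean) ** 2 for y in f_a) / (nb_samples - 1)
    indices = []
    for i in range(nb_inputs):
        total = 0.0
        for row_a, row_b, y in zip(a, b, f_a):
            # Re-sample only coordinate i (taken from b); keep the others from a.
            perturbed = row_a[:i] + [row_b[i]] + row_a[i + 1:]
            total += (y - f(perturbed)) ** 2
        indices.append(total / (2 * nb_samples * variance))
    return indices


# Hypothetical setup: concept activations of one input and a linear head for the target logit.
concept_activations = [1.5, 0.5, 2.0]
head_weights = [2.0, 2.0, 0.0]


def masked_logit(masks):
    """Target logit when each concept activation is scaled by a mask in [0, 1]."""
    return sum(w * u * m for w, u, m in zip(head_weights, concept_activations, masks))


importances = total_sobol_indices(masked_logit, nb_inputs=3)
```

In this toy setup the first concept dominates, the second contributes a little, and the third, whose head weight is zero, receives no importance.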
