Concept Dictionary Metrics

Concept dictionary metrics evaluate the concept space via its dictionary. They can either be computed on a single dictionary or compare two or more dictionaries.

from interpreto.concepts.metrics import MetricClass

metric = MetricClass(concept_explainer1, concept_explainer2, ...)
score = metric.compute()

Stability

interpreto.concepts.metrics.Stability

Stability(*concept_explainers, matching_algorithm=COSINE_HUNGARIAN)

Source code in interpreto/concepts/metrics/dictionary_metrics.py

Stability metric between sets of dictionaries, introduced by Fel et al. (2023)[1]. Also called Consistency by Paulo and Belrose (2025)[2].

  • If only one dictionary is provided, the metric is a self-comparison of the dictionary.
  • If two dictionaries are provided, the metric is a comparison between the two dictionaries.
  • If more than two dictionaries are provided, the metric is the mean of the pairwise comparisons.

  1. Fel, T., Boutin, V., Béthune, L., Cadène, R., Moayeri, M., Andéol, L., Chavidal, M., & Serre, T. A holistic approach to unifying automatic concept extraction and concept importance estimation. Advances in Neural Information Processing Systems. 2023. 

  2. Paulo, G., & Belrose, N. Sparse Autoencoders Trained on the Same Data Learn Different Features. 2025.
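For illustration, a minimal sketch of the three cases with raw dictionary tensors (the shapes and values here are hypothetical; any 2D (cpt, d) tensors of matching shape work):

>>> import torch
>>> from interpreto.concepts.metrics import Stability
>>> dict_a, dict_b, dict_c = (torch.randn(20, 64) for _ in range(3))
>>> Stability(dict_a).compute()  # self-comparison of a single dictionary
>>> Stability(dict_a, dict_b).compute()  # one pairwise comparison
>>> Stability(dict_a, dict_b, dict_c).compute()  # mean of the 3 pairwise comparisons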

Parameters:

  • concept_explainers (ConceptAutoEncoderExplainer | Float[Tensor, 'cpt d'], default: ()): The ConceptAutoEncoderExplainers or dictionaries to compare. Both types are supported and can be mixed.

  • matching_algorithm (DistanceFunctionProtocol, default: COSINE_HUNGARIAN): The algorithm used to match concepts between dictionaries. Defaults to ConceptMatchingAlgorithm.COSINE_HUNGARIAN.
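To make the default concrete, here is a minimal sketch of what a cosine Hungarian matching distance can look like, assuming scipy is available. This is an illustration, not the library's implementation: concepts from the two dictionaries are matched one-to-one so as to maximize total cosine similarity, and the distance is one minus the mean similarity of matched pairs.

import torch
from scipy.optimize import linear_sum_assignment


def cosine_hungarian_distance(dict_1: torch.Tensor, dict_2: torch.Tensor) -> float:
    # Row-normalize so that the matrix product gives pairwise cosine similarities
    d1 = torch.nn.functional.normalize(dict_1, dim=-1)
    d2 = torch.nn.functional.normalize(dict_2, dim=-1)
    similarity = d1 @ d2.T  # shape: (cpt, cpt)
    # Hungarian algorithm: one-to-one matching that maximizes total similarity
    rows, cols = linear_sum_assignment(similarity.detach().cpu().numpy(), maximize=True)
    matched = similarity[torch.as_tensor(rows), torch.as_tensor(cols)]
    # Distance is 0 when matched concepts are identical (cosine similarity 1)
    return 1.0 - matched.mean().item()

Under this convention, the stability score 1 - distance is the mean cosine similarity of optimally matched concept pairs.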

Examples:

>>> import torch
>>> from interpreto.concepts import NMFConcepts
>>> from interpreto.concepts.metrics import Stability
>>> # Iterate on random seeds
>>> concept_explainers = []
>>> for seed in range(10):
...     # set seed
...     torch.manual_seed(seed)
...     # Create a concept model
...     nmf_explainer = NMFConcepts(model_with_split_points, nb_concepts=20, device="cuda", force_relu=True)
...     # Fit the concept model
...     nmf_explainer.fit(activations)
...     concept_explainers.append(nmf_explainer)
>>> # Compute the stability metric
>>> stability = Stability(*concept_explainers)
>>> score = stability.compute()

Raises:

  • ValueError: If no ConceptAutoEncoderExplainers or dictionaries are provided.
  • ValueError: If the matching algorithm is not supported.
  • ValueError: If the dictionaries are not torch.Tensor.
  • ValueError: If the dictionaries are not 2D tensors.
  • ValueError: If the dictionaries have different shapes.
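For instance, constructing the metric with dictionaries of mismatched shapes fails immediately (illustrative values):

>>> import torch
>>> from interpreto.concepts.metrics import Stability
>>> Stability(torch.randn(20, 64), torch.randn(10, 64))
Traceback (most recent call last):
    ...
ValueError: Dictionary 1 or dictionary extracted from concept explainer 1 has a different shape from the first dictionary. ...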

Source code in interpreto/concepts/metrics/dictionary_metrics.py
def __init__(
    self,
    *concept_explainers: ConceptAutoEncoderExplainer | Float[torch.Tensor, "cpt d"],
    matching_algorithm: DistanceFunctionProtocol = ConceptMatchingAlgorithm.COSINE_HUNGARIAN,
):
    if len(concept_explainers) < 1:
        raise ValueError("At least one `ConceptAutoEncoderExplainer`s or `torch.Tensor`s must be provided.")

    # if only one explainer is provided, duplicate it for self comparison
    if len(concept_explainers) == 1:
        concept_explainers = concept_explainers * 2

    # extract dictionaries from concept explainers
    self.dictionaries: list[Float[torch.Tensor, "cpt d"]] = [
        ce.get_dictionary() if isinstance(ce, ConceptAutoEncoderExplainer) else ce for ce in concept_explainers
    ]

    expected_shape = None
    for i, dictionary in enumerate(self.dictionaries):
        if not isinstance(dictionary, torch.Tensor):
            raise ValueError(
                f"Dictionary {i} or dictionary extracted from concept explainer {i} is not a torch.Tensor."
            )

        if len(dictionary.shape) != 2:
            raise ValueError(
                f"Dictionary {i} or dictionary extracted from concept explainer {i} is not a 2D tensor."
            )

        expected_shape = dictionary.shape if expected_shape is None else expected_shape
        if dictionary.shape != expected_shape:
            raise ValueError(
                f"Dictionary {i} or dictionary extracted from concept explainer {i} has a different shape from the first dictionary."
                f"Expected shape: {expected_shape}, got shape: {dictionary.shape}."
            )

    self.distance_function = matching_algorithm

compute

compute()

Compute the mean score over pairwise comparison scores between dictionaries.

Returns:

  • float: The stability score.

Source code in interpreto/concepts/metrics/dictionary_metrics.py
def compute(self) -> float:
    """Compute the mean score over pairwise comparison scores between dictionaries.

    Returns:
        float: The stability score.
    """
    comparisons = []
    for dict_1, dict_2 in combinations(self.dictionaries, 2):
        # compute pairwise comparison
        comparison = 1 - self.distance_function(dict_1, dict_2)
        comparisons.append(comparison)

    return torch.stack(comparisons).mean().item()
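To see how many pairwise scores go into the mean: with k dictionaries there are k·(k-1)/2 pairs, so the 10 seeds in the example above contribute 45 comparisons:

>>> from itertools import combinations
>>> len(list(combinations(range(10), 2)))
45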