Concept Dictionary Metrics

Concept dictionary metrics evaluate the concept space via its dictionary. They can either be computed on a single dictionary or compare two or more dictionaries.

from interpreto.concepts.metrics import MetricClass

metric = MetricClass(concept_explainer1, concept_explainer2, ...)
score = metric.compute()

Stability

interpreto.concepts.metrics.Stability

Stability(*concept_explainers, matching_algorithm=COSINE_HUNGARIAN)

Source code in interpreto/concepts/metrics/dictionary_metrics.py

Stability metric between sets of dictionaries, introduced by Fel et al. (2023)[1]. Also called Consistency by Paulo and Belrose (2025)[2].

  • If only one dictionary is provided, the metric is a self-comparison of the dictionary.
  • If two dictionaries are provided, the metric is a comparison between the two dictionaries.
  • If more than two dictionaries are provided, the metric is the mean of the pairwise comparisons.

  1. Fel, T., Boutin, V., Béthune, L., Cadène, R., Moayeri, M., Andéol, L., Chavidal, M., & Serre, T. A holistic approach to unifying automatic concept extraction and concept importance estimation. Advances in Neural Information Processing Systems. 2023. 

  2. Paulo, G., & Belrose, N. Sparse Autoencoders Trained on the Same Data Learn Different Features. 2025.
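For illustration, a minimal sketch of the three cases with raw dictionary tensors (the shapes and values here are hypothetical; any 2D (cpt, d) tensors of matching shape work):

>>> import torch
>>> from interpreto.concepts.metrics import Stability
>>> dict_a, dict_b, dict_c = (torch.randn(20, 64) for _ in range(3))
>>> Stability(dict_a).compute()  # self-comparison of a single dictionary
>>> Stability(dict_a, dict_b).compute()  # one pairwise comparison
>>> Stability(dict_a, dict_b, dict_c).compute()  # mean of the 3 pairwise comparisons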

Parameters:

  • concept_explainers (ConceptAutoEncoderExplainer | Float[Tensor, 'cpt d'], default: ()): The ConceptAutoEncoderExplainers or dictionaries to compare. Both types are supported and can be mixed.

  • matching_algorithm (DistanceFunctionProtocol, default: COSINE_HUNGARIAN): The algorithm used to match concepts between dictionaries. Defaults to ConceptMatchingAlgorithm.COSINE_HUNGARIAN.
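To make the default concrete, here is a minimal sketch of what a cosine Hungarian matching distance can look like, assuming scipy is available. This is an illustration, not the library's implementation: concepts from the two dictionaries are matched one-to-one so as to maximize total cosine similarity, and the distance is one minus the mean similarity of matched pairs.

import torch
from scipy.optimize import linear_sum_assignment


def cosine_hungarian_distance(dict_1: torch.Tensor, dict_2: torch.Tensor) -> float:
    # Row-normalize so that the matrix product gives pairwise cosine similarities
    d1 = torch.nn.functional.normalize(dict_1, dim=-1)
    d2 = torch.nn.functional.normalize(dict_2, dim=-1)
    similarity = d1 @ d2.T  # shape: (cpt, cpt)
    # Hungarian algorithm: one-to-one matching that maximizes total similarity
    rows, cols = linear_sum_assignment(similarity.detach().cpu().numpy(), maximize=True)
    matched = similarity[torch.as_tensor(rows), torch.as_tensor(cols)]
    # Distance is 0 when matched concepts are identical (cosine similarity 1)
    return 1.0 - matched.mean().item()

Under this convention, the stability score 1 - distance is the mean cosine similarity of optimally matched concept pairs.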

Examples:

>>> import torch
>>> from interpreto.concepts import NMFConcepts
>>> from interpreto.concepts.metrics import Stability
>>> # Iterate on random seeds
>>> concept_explainers = []
>>> for seed in range(10):
...     # set seed
...     torch.manual_seed(seed)
...     # Create a concept model
...     nmf_explainer = NMFConcepts(model_with_split_points, nb_concepts=20, device="cuda", force_relu=True)
...     # Fit the concept model
...     nmf_explainer.fit(activations)
...     concept_explainers.append(nmf_explainer)
>>> # Compute the stability metric
>>> stability = Stability(*concept_explainers)
>>> score = stability.compute()

Raises:

  • ValueError: If no ConceptAutoEncoderExplainers or dictionaries are provided.
  • ValueError: If the matching algorithm is not supported.
  • ValueError: If the dictionaries are not torch.Tensor.
  • ValueError: If the dictionaries are not 2D tensors.
  • ValueError: If the dictionaries have different shapes.
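For instance, constructing the metric with dictionaries of mismatched shapes fails immediately (illustrative values):

>>> import torch
>>> from interpreto.concepts.metrics import Stability
>>> Stability(torch.randn(20, 64), torch.randn(10, 64))
Traceback (most recent call last):
    ...
ValueError: Dictionary 1 or dictionary extracted from concept explainer 1 has a different shape from the first dictionary. ...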

Source code in interpreto/concepts/metrics/dictionary_metrics.py
def __init__(
    self,
    *concept_explainers: ConceptAutoEncoderExplainer | Float[torch.Tensor, "cpt d"],
    matching_algorithm: DistanceFunctionProtocol = ConceptMatchingAlgorithm.COSINE_HUNGARIAN,
):
    if len(concept_explainers) < 1:
        raise ValueError("At least one `ConceptAutoEncoderExplainer`s or `torch.Tensor`s must be provided.")

    # if only one explainer is provided, duplicate it for self comparison
    if len(concept_explainers) == 1:
        concept_explainers = concept_explainers * 2

    # extract dictionaries from concept explainers
    self.dictionaries: list[Float[torch.Tensor, "cpt d"]] = [
        ce.get_dictionary() if isinstance(ce, ConceptAutoEncoderExplainer) else ce for ce in concept_explainers
    ]

    expected_shape = None
    for i, dictionary in enumerate(self.dictionaries):
        if not isinstance(dictionary, torch.Tensor):
            raise ValueError(
                f"Dictionary {i} or dictionary extracted from concept explainer {i} is not a torch.Tensor."
            )

        if len(dictionary.shape) != 2:
            raise ValueError(
                f"Dictionary {i} or dictionary extracted from concept explainer {i} is not a 2D tensor."
            )

        expected_shape = dictionary.shape if expected_shape is None else expected_shape
        if dictionary.shape != expected_shape:
            raise ValueError(
                f"Dictionary {i} or dictionary extracted from concept explainer {i} has a different shape from the first dictionary."
                f"Expected shape: {expected_shape}, got shape: {dictionary.shape}."
            )

    self.distance_function = matching_algorithm

compute

compute()

Compute the mean score over pairwise comparison scores between dictionaries.

Returns:

  • float: The stability score.

Source code in interpreto/concepts/metrics/dictionary_metrics.py
def compute(self) -> float:
    """Compute the mean score over pairwise comparison scores between dictionaries.

    Returns:
        float: The stability score.
    """
    comparisons = []
    for dict_1, dict_2 in combinations(self.dictionaries, 2):
        # compute pairwise comparison
        comparison = 1 - self.distance_function(dict_1, dict_2)
        comparisons.append(comparison)

    return torch.stack(comparisons).mean().item()
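To see how many pairwise scores go into the mean: with k dictionaries there are k·(k-1)/2 pairs, so the 10 seeds in the example above contribute 45 comparisons:

>>> from itertools import combinations
>>> len(list(combinations(range(10), 2)))
45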