
ConSim

interpreto.concepts.metrics.ConSim

Code: concepts/metrics/consim.py

ConSim stands for Concept-based Simulatability. It was introduced by Poché et al. in 2025[^1].

It evaluates all three components of a concept-based explanation:

  • the concept space

  • the concept interpretation

  • the concept importance

To evaluate explanations of a given model \(f\), ConSim measures the extent to which explanations help a meta-predictor \(\Psi\) simulate the predictions of \(f\).

In our case, the role of the meta-predictor is played by user_llm, an interface calling a model either locally or through a remote API, such as OpenAI or HuggingFace. Therefore, most of the code corresponds to building the prompts for the LLM.

There are three steps to ConSim:

  • Step 0: Instantiate the ConSim metric with the model_with_split_points (\(f\)) and the user_llm (\(\Psi\)).

  • Step 1: Select interesting examples for ConSim with the select_examples method. Samples are selected to see how well \(\Psi\) can simulate \(f\): they cover every class and include many initial errors from \(f\).

  • Step 2: Evaluate the ConSim score with the evaluate method. It is an accuracy score between the \(\Psi\) and \(f\) predictions. Because the examples were deliberately selected, it cannot be compared to the natural accuracy on the dataset; it must instead be compared to a baseline (see the baseline prompt types below).

Tip

We highly recommend running steps 1 and 2 several times with different seeds to get more statistically significant results; the initial paper[^1] used five different seeds. A sketch of such a loop is given after the examples below.


  1. A. Poché, A. Jacovi, A. M. Picard, V. Boutin, and F. Jourdan. ConSim: Measuring Concept-Based Explanations' Effectiveness with Automated Simulatability. In Proceedings of the Association for Computational Linguistics (ACL), 2025.

Parameters:

  • model_with_split_points (ModelWithSplitPoints, required)
    The model to explain. It is a wrapper around a model and a tokenizer to easily get activations.

  • user_llm (LLMInterface | None, required)
    The LLM interface that will serve as the meta-predictor. If not provided, the user will have to call the ConSim prompts manually. If your preferred LLM API is not supported, you can implement your own LLM interface: you only have to implement the generate method (a sketch is given after this list).

    The format of the prompt is:

    [(Role.SYSTEM, "system prompt"), (Role.USER, "user prompt"), (Role.ASSISTANT, "assistant prompt")]

  • activation_granularity (ActivationGranularity, required)
    The granularity of the activations to use for the explanations.

  • classes (list[str] | None, default: None)
    The names of the classes of the dataset.

  • split_point (str | None, default: None)
    Where to split the model to explain.
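A minimal sketch of such a custom interface is shown below. Only the generate method and the (Role, str) prompt format come from this page; the import path, the client object, and its chat method are assumptions made for illustration.

# NOTE: the import path below is an assumption; adapt it to where LLMInterface and Role
# live in your installation of interpreto.
from interpreto.concepts.metrics.consim import LLMInterface, Role


class MyCustomLLM(LLMInterface):
    """Wraps a hypothetical chat client exposing a `chat(messages) -> str` method."""

    def __init__(self, client):
        self.client = client

    def generate(self, prompt: list[tuple[Role, str]]) -> str:
        # Convert the (Role, str) pairs into the message format expected by the API,
        # call the model, and return the raw text answer (assumed to be a string).
        messages = [{"role": role.name.lower(), "content": text} for role, text in prompt]
        return self.client.chat(messages)  # hypothetical client call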

Attributes:

  • classes (list[str] | None)
    The names of the classes of the dataset.

  • prompt_types (type[PromptTypes])
    Enum of the possible prompt types to use.

  • model_with_split_points (ModelWithSplitPoints)
    The model to explain. It is a wrapper around a model and a tokenizer to easily get activations.

  • split_point (str)
    Where to split the model to explain.

  • user_llm (LLMInterface | None)
    The LLM interface that will serve as the meta-predictor. If your preferred LLM API is not supported, you can implement your own LLM interface: you only have to implement the generate method.

    The format of the prompt is:

    [(Role.SYSTEM, "system prompt"), (Role.USER, "user prompt"), (Role.ASSISTANT, "assistant prompt")]

TODO

validate example in practice

Examples:

Preamble to a metric, fit a concept explainer:

>>> import datasets
>>> from transformers import AutoModelForSequenceClassification
>>> from interpreto import ConSim, ModelWithSplitPoints, ICAConcepts, OpenAILLM
>>> from interpreto.concepts.metrics.consim import PromptTypes
>>>
>>> # ------------------------
>>> # Load a model and wrap it
>>> model_with_split_points = ModelWithSplitPoints(
...     "textattack/bert-base-uncased-ag-news",
...     split_points=["bert.encoder.layer.10.output"],
...     model_autoclass=AutoModelForSequenceClassification,  # type: ignore
...     batch_size=4,
... )
>>>
>>> # --------------------------------------
>>> # Load a dataset and compute activations
>>> dataset = datasets.load_dataset("fancyzhx/ag_news")
>>> classes = ["World", "Sports", "Business", "Sci/Tech"]
>>> activations = model_with_split_points.get_activations(dataset["train"]["text"])
>>>
>>> # -------------------------
>>> # Fit the concept explainer
>>> concept_explainer_1 = ICAConcepts(model_with_split_points, nb_concepts=50)
>>> concept_explainer_1.fit(activations)

The three steps of ConSim:

>>> # ------------------------------------------------------------------
>>> # Step 0: Define the User-LLM and instantiate the ConSim metric
>>> user_llm = OpenAILLM(api_key="YOUR_OPENAI_API_KEY", model="gpt-4.1-nano")
>>> consim = ConSim(
...     model_with_split_points,
...     user_llm,
...     activation_granularity=ModelWithSplitPoints.activation_granularities.TOKEN,
...     classes=classes,
... )
>>>
>>> # ----------------------------------------------
>>> # Step 1: Select interesting examples for ConSim
>>> samples, labels, predictions = consim.select_examples(
...     dataset["train"]["text"], dataset["train"]["label"],
... )
>>>
>>> # -------------------------------------------------------------
>>> # Step 2: Evaluate the ConSim score, do not forget the baseline
>>> baseline = consim.evaluate(samples, predictions, prompt_type=PromptTypes.L2_baseline_with_lp)
>>> consim_score = consim.evaluate(samples, predictions, concept_explainer_1, prompt_type=PromptTypes.E3_global_and_local_concepts_with_lp)
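
As recommended in the tip above, repeat steps 1 and 2 with several seeds and aggregate the results. A minimal sketch, assuming the objects defined in the examples above; evaluate may return None when the user-LLM answer cannot be parsed, so those runs are skipped:

>>> # Repeat steps 1 and 2 with different seeds to get more robust estimates
>>> baselines, scores = [], []
>>> for seed in range(5):
...     samples, labels, predictions = consim.select_examples(
...         dataset["train"]["text"], dataset["train"]["label"], seed=seed,
...     )
...     baseline = consim.evaluate(samples, predictions, prompt_type=PromptTypes.L2_baseline_with_lp)
...     score = consim.evaluate(samples, predictions, concept_explainer_1, prompt_type=PromptTypes.E3_global_and_local_concepts_with_lp)
...     if baseline is not None and score is not None:  # skip runs with unparsable LLM answers
...         baselines.append(baseline)
...         scores.append(score)
>>> mean_gain = sum(scores) / len(scores) - sum(baselines) / len(baselines)  # gain over the baseline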
Source code in interpreto/concepts/metrics/consim.py
def __init__(
    self,
    model_with_split_points: ModelWithSplitPoints,
    user_llm: LLMInterface | None,
    activation_granularity: ActivationGranularity,
    classes: list[str] | None = None,
    split_point: str | None = None,
):
    """
    Initialize the ConSim metric.
    """
    self.model_with_split_points = model_with_split_points
    if split_point is None:
        if len(self.model_with_split_points.split_points) > 1:
            raise ValueError(
                "If the model has more than one split point, a split point for fitting the concept model should "
                f"be specified. Got split point: '{split_point}' with model split points: "
                f"{', '.join(self.model_with_split_points.split_points)}."
            )
        split_point = self.model_with_split_points.split_points[0]

    if split_point not in self.model_with_split_points.split_points:
        raise ValueError(
            f"Split point '{split_point}' not found in model split points: {', '.join(self.model_with_split_points.split_points)}."
        )

    self.split_point: str = split_point
    self.activation_granularity: ActivationGranularity = activation_granularity
    self.user_llm: LLMInterface | None = user_llm
    self.classes: list[str] | None = classes

select_examples

select_examples(inputs, labels, nb_lp_samples=20, nb_ep_samples=20, seed=0, batch_size=64, device=None)

Select examples for the ConSim metric. It first computes the model's predictions on the inputs. Then, it selects nb_lp_samples + nb_ep_samples samples from the inputs. The selection is uniform across classes (with respect to the labels), with as many samples where the initial model prediction is correct as samples where it is incorrect. The samples are then randomly shuffled.

The first nb_lp_samples samples are selected for the learning phase. The last nb_ep_samples samples are selected for the evaluation phase.

Therefore, there is no guarantee on the class or correctness balance within the learning and evaluation phases taken separately.
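
As a quick sanity check of this balance (a sketch, assuming the objects defined in the examples above and that labels and predictions are returned as tensors of class indices):

>>> samples, labels, predictions = consim.select_examples(
...     dataset["train"]["text"], dataset["train"]["label"], nb_lp_samples=20, nb_ep_samples=20, seed=0,
... )
>>> n_selected = len(samples)  # nb_lp_samples + nb_ep_samples interesting samples
>>> model_accuracy_on_selection = (labels == predictions).float().mean()  # close to 0.5 by construction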

Parameters:

  • inputs (list[str], required)
    The inputs to predict.

  • labels (torch.Tensor, required)
    The labels of the inputs.

  • nb_lp_samples (int, default: 20)
    The number of samples to select for the learning phase.

  • nb_ep_samples (int, default: 20)
    The number of samples to select for the evaluation phase.

  • seed (int, default: 0)
    The seed to use for the random selection.

  • batch_size (int, default: 64)
    The batch size to use for the predictions.

  • device (torch.device | str | None, default: None)
    The device to use for the predictions.

Returns:

  • interesting_samples (list[str])
    The interesting samples.

  • labels (torch.Tensor)
    The labels of the interesting samples.

  • predictions (torch.Tensor)
    The predictions of the model on the interesting samples.

Source code in interpreto/concepts/metrics/consim.py
def select_examples(
    self,
    inputs: list[str],
    labels: torch.Tensor,
    nb_lp_samples: int = 20,
    nb_ep_samples: int = 20,
    seed: int = 0,
    batch_size: int = 64,
    device: torch.device | str | None = None,
) -> tuple[list[str], torch.Tensor, torch.Tensor]:
    """
    Select examples for the ConSim metric. It first computes the models' predictions on the inputs.
    Then, it selects `nb_lp_samples` + `nb_ep_samples` samples from the inputs.
    The goal is to select uniformly between each class (with respect to the labels).
    There should be as many samples where the initial model prediction are correct as incorrect.
    The samples are then randomly shuffled.

    The first `nb_lp_samples` samples are selected for the learning phase.
    The last `nb_ep_samples` samples are selected for the evaluation phase.

    Therefore, there is no guarantee on the repartition inside learning and evaluation phase.

    Arguments:
        inputs: list[str]
            The inputs to predict.
        labels: torch.Tensor
            The labels of the inputs.
        nb_lp_samples: int
            The number of samples to select for the learning phase.
        nb_ep_samples: int
            The number of samples to select for the evaluation phase.
        seed: int
            The seed to use for the random selection.
        batch_size: int
            The batch size to use for the predictions.
        device: torch.device | str | None
            The device to use for the predictions.

    Returns:
        interesting_samples: list[str]
            The interesting samples.
        labels: torch.Tensor
            The labels of the interesting samples.
        predictions: torch.Tensor
            The predictions of the model on the interesting samples.
    """
    predictions = self._get_predictions(inputs, batch_size=batch_size, device=device)
    return self._extract_interesting_elements(
        inputs=inputs,
        labels=labels,
        predictions=predictions,
        nb_lp_samples=nb_lp_samples,
        nb_ep_samples=nb_ep_samples,
        seed=seed,
    )

evaluate

Evaluate the ConSim metric, i.e. the accuracy of the user_llm predictions with respect to the model predictions.

First, local concept importances are computed via the concept_explainer. Then a prompt is constructed by combining the different elements according to the prompt_type. The prompt is sent to the user_llm and the simulated model predictions are extracted from its response. Finally, the score is computed by comparing the model predictions with the user_llm predictions.

The prompts have five parts:

  • Initial Phase (IP.1): the task description, a list of questions to ask the LLM.

  • Initial Phase (IP.2): a global concept explanation of \(f\), listing the important concepts for each class.

  • Learning Phase (LP.1): examples of samples and predictions from the model \(f\).

  • Learning Phase (LP.2): a local concept explanation of \(f\), listing the important concepts in each example.

  • Evaluation Phase (EP.1): a list of samples on which the meta-predictor \(\Psi\) is asked to predict the model \(f\) predictions.

The answer of the LLM is a list of predictions, one per sample. ConSim compares these predictions to the model \(f\) predictions and computes the accuracy of the explanations.
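
When no user_llm is provided, evaluate returns the prompts and the literal model predictions instead of a score, which makes it possible to inspect the prompt parts or to query an LLM manually. A minimal sketch, assuming the objects defined in the examples above:

>>> consim_no_llm = ConSim(
...     model_with_split_points,
...     user_llm=None,
...     activation_granularity=ModelWithSplitPoints.activation_granularities.TOKEN,
...     classes=classes,
... )
>>> prompts, literal_model_predictions = consim_no_llm.evaluate(
...     samples, predictions, concept_explainer_1,
...     prompt_type=PromptTypes.E3_global_and_local_concepts_with_lp,
... )
>>> for role, text in prompts:  # one (Role, str) pair per prompt message
...     print(role, text[:100])
>>> # Send `prompts` to your own LLM and compare its answer to `literal_model_predictions`.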

Parameters:

  • interesting_samples (list[str], required)
    The interesting samples.

  • predictions (torch.Tensor, required)
    The predictions of the model on the interesting samples.

  • concept_explainer (ConceptAutoEncoderExplainer | None, default: None)
    The concept explainer. Can be None for the baseline.

  • concepts_interpretation (dict[int, str] | None, default: None)
    The words that activate the concepts the most and the least. A dictionary with the concepts as keys and another dictionary as values; the inner dictionary has the words as keys and the activations as values. Can be None for the baseline.

  • global_importances (dict[str, dict[int, float]] | None, default: None)
    The importance of the concepts for each class. A dictionary with the classes as keys and another dictionary as values; the inner dictionary has the concepts as keys and the importances as values. Can be None for the baseline.

  • prompt_type (PromptTypes, default: E3_global_and_local_concepts_with_lp)
    The type of prompt to use. Possible values are:

      • PromptTypes.L1_baseline_without_lp: baseline without learning phase.

      • PromptTypes.E1_global_concepts_without_lp: global concepts without learning phase.

      • PromptTypes.L2_baseline_with_lp: baseline with learning phase.

      • PromptTypes.E2_global_concepts_with_lp: global concepts with learning phase.

      • PromptTypes.E3_global_and_local_concepts_with_lp: global and local concepts with learning phase.

      • PromptTypes.U1_upper_bound_concepts_at_ep: upper bound, concepts at evaluation phase.

  • anonymize_classes (bool, default: False)
    Whether to anonymize the classes. Class names are replaced by "Class_i" where i is the index of the class. It prevents the user-LLM from solving the task by simply guessing the class.

  • importance_threshold (float, default: 0.05)
    The threshold to select the most important concepts for each class. The threshold corresponds to the cumulative importance of the concepts to keep.
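
For illustration, the optional interpretations and global importances can be passed explicitly, following the declared types. The values below are hypothetical and only sketch the expected structure (concept indices as keys, class names matching the classes attribute), assuming the objects from the examples above:

>>> concepts_interpretation = {0: "sports vocabulary", 1: "financial terms"}  # hypothetical: concept index -> description
>>> global_importances = {"Sports": {0: 0.8, 1: 0.1}, "Business": {0: 0.05, 1: 0.9}}  # hypothetical: class -> {concept: importance}
>>> score = consim.evaluate(
...     samples, predictions, concept_explainer_1,
...     concepts_interpretation=concepts_interpretation,
...     global_importances=global_importances,
...     prompt_type=PromptTypes.E3_global_and_local_concepts_with_lp,
... )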

Returns:

score, or prompts and model predictions (float | None | tuple[list[tuple[Role, str]], list[str]])
    Possible outputs:

      • score (float): the score of the ConSim metric (the nominal behavior).

      • None: if the model predictions or the user-LLM predictions are empty. Returning None was chosen over raising because ConSim should be called many times for statistically significant results; an occasional None score is better than the script crashing.

      • prompts and model predictions (tuple[list[tuple[Role, str]], list[str]]): if no user_llm is provided, the prompts (first element, list[tuple[Role, str]]) and the literal model predictions (second element, list[str]) are returned. The user then has to call the ConSim prompts manually and compare the LLM response to the model predictions.
Source code in interpreto/concepts/metrics/consim.py
def evaluate(
    self,
    interesting_samples: list[str],
    predictions: torch.Tensor,
    concept_explainer: ConceptAutoEncoderExplainer | None = None,
    concepts_interpretation: dict[int, str] | None = None,
    global_importances: dict[str, dict[int, float]] | None = None,
    prompt_type: PromptTypes = PromptTypes.E3_global_and_local_concepts_with_lp,
    anonymize_classes: bool = False,
    importance_threshold: float = 0.05,
) -> float | None | tuple[list[tuple[Role, str]], list[str]]:
    """
    Evaluate the ConSim metric, thus the accuracy of the `user_llm` predictions with respect to the model predictions.

    First local concepts importances are computed via the `concept_explainer`.
    Then a prompt is constructed by integrating all the different elements and following the `prompt_type`.
    The prompt is sent to the `user_llm` and the model predictions are extracted from the response.
    Finally, the score is computed by comparing the model predictions with the `user_llm` predictions.

    The prompts have five parts:

    - Initial Phase (IP.1) the first part is the task description, which is a list of questions to ask the LLM.

    - Initial Phase (IP.2) the second is a global concepts explanation on $f$. Listing the important concepts for each class.

    - Learning Phase (LP.1) the third gives examples of samples and predictions from the model $f$.

    - Learning Phase (LP.2) the fourth is a local concepts explanation on $f$. Listing the important concepts in each example.

    - Evaluation Phase (EP.1) the fifth is a list of samples on which the meta-predictor $\\Psi$ will be asked to predict the model $f$ predictions.

    The answer of the LLM will be a list of predictions for each sample. ConSim compares these predictions to the
    model $f$ predictions and computes the accuracy of the explanations.

    Arguments:
        interesting_samples: list[str]
            The interesting samples.

        predictions: torch.Tensor
            The predictions of the model on the interesting samples.

        concept_explainer: ConceptAutoEncoderExplainer | None
            The concept explainer. Can be None for the baseline.

        concepts_interpretation: dict[int, str] | None
            The words that activate the concepts the most and the least.
            A dictionary with the concepts as keys and another dictionary as values.
            The inner dictionary has the words as keys and the activations as values.
            Can be None for the baseline.

        global_importances: dict[str, dict[int, float]] | None
            The importance of the concepts for each class.
            A dictionary with the classes as keys and another dictionary as values.
            The inner dictionary has the concepts as keys and the importance as values.
            Can be None for the baseline.

        prompt_type: PromptTypes
            The type of prompt to use. Possible values are:

            - `PromptTypes.L1_baseline_without_lp`: baseline without learning phase.

            - `PromptTypes.E1_global_concepts_without_lp`: global concepts without learning phase.

            - `PromptTypes.L2_baseline_with_lp`: baseline with learning phase.

            - `PromptTypes.E2_global_concepts_with_lp`: global concepts with learning phase.

            - `PromptTypes.E3_global_and_local_concepts_with_lp`: global and local concepts with learning phase.

            - `PromptTypes.U1_upper_bound_concepts_at_ep`: upper bound - concepts at evaluation phase.

        anonymize_classes: bool
            Whether to anonymize the classes. Class names will be replaced by "Class_i" where i is the index of the class.
            It prevents the user-llm to solve the task by simply guessing the class.

        importance_threshold: float
            The threshold to select the most important concepts for each class.
            The threshold correspond to the cumulative importance of the concepts to keep.

    Returns:
        score or prompts and model predictions: float | None | tuple[list[tuple[Role, str]], list[str]]
            Possible outputs:

            - score (float): The score of the ConSim metric. (The nominal behavior)
            - None: If the model predictions are empty or the user-llm predictions are empty.
                It was chosen to return None,
                because ConSim should be called a lot of times for statistically significant results.
                Therefore, having a None score once in a while is better than the script crashing.
            - prompts and model predictions (tuple[list[tuple[Role, str]], list[str]]):
                If no user_llm is provided, returns the prompts and the model predictions.
                The prompt is the first element of the tuple (list[tuple[Role, str]]).
                The predictions are the second element of the tuple (list[str]).
                The user will have to call the ConSim prompts manually.
                The response of the LLM on the prompts should be compared to the model predictions.

    Raises:
        ValueError
            If the model predictions and the user-llm predictions have different lengths.
        Warnings
            If the user-llm response is empty or the format is not respected.
    """
    local_importances: torch.Tensor | None = None
    if concept_explainer is not None:
        # Ensure the mwsp of the explainer is the same as the one used in the provided concept_explainer
        if concept_explainer.split_point not in self.model_with_split_points.split_points:
            raise ValueError(
                "The split point used in the provided `concept_explainer` should be one of the `model_with_split_points` ones."
                f"Got split point: '{concept_explainer.split_point}' with model split points: "
                f"{', '.join(self.model_with_split_points.split_points)}."
            )
        if (
            concept_explainer.model_with_split_points._model.config.name_or_path
            != self.model_with_split_points._model.config.name_or_path
        ):
            raise ValueError(
                "The model used in the provided `concept_explainer` should be the same as the one used in the `model_with_split_points`."
                f"Got (concept_explainer) model name or path: '{concept_explainer.model_with_split_points._model.config.name_or_path}'"
                f"and (model_with_split_points) model name or path: '{self.model_with_split_points._model.config.name_or_path}'."
            )

        # compute concepts importance  # TODO: when first layers can be skipped pass the concept activations
        # For now we force gradient-input
        # TODO: precise shapes with jaxtyping
        if prompt_type in [
            PromptTypes.E3_global_and_local_concepts_with_lp,
            PromptTypes.U1_upper_bound_concepts_at_ep,
        ]:
            if prompt_type is PromptTypes.E3_global_and_local_concepts_with_lp:
                samples_to_explain = interesting_samples[: len(interesting_samples) // 2]
            else:
                samples_to_explain = interesting_samples
            local_importances_list = concept_explainer.concept_output_gradient(
                inputs=samples_to_explain,
                split_point=self.split_point,
                activation_granularity=self.activation_granularity,
                concepts_x_gradients=True,
                tqdm_bar=False,
            )
            local_importances = torch.stack(local_importances_list)

    # generate the prompt
    prompts, literal_model_predictions = ConSim._generate_prompt(
        sentences=interesting_samples,
        predictions=predictions,
        classes=self.classes,
        concepts_interpretation=concepts_interpretation,
        global_importances=global_importances,
        local_importances=local_importances,
        prompt_type=prompt_type,
        anonymize_classes=anonymize_classes,
        importance_threshold=importance_threshold,
    )

    # if no user_llm is provided, we return the prompts and the model predictions
    if self.user_llm is None:
        return prompts, literal_model_predictions

    user_llm_response = self.user_llm.generate(prompts)

    # raise warnings if the response is empty or the format is not respected
    return self._compute_score(
        user_llm_response=user_llm_response,
        literal_model_predictions=literal_model_predictions,
    )

interpreto.concepts.metrics.consim.PromptTypes

Bases: Enum

There are six types of prompts, including two baselines and an upper bound:

Attributes:

  • `L1_baseline_without_lp`
    IP.1 and EP.1 are included in the prompt. Only the task description, with no explanations and no learning phase.

  • `E1_global_concepts_without_lp`
    IP.1, IP.2, and EP.1 are included in the prompt. Task description and global concepts explanation, but no learning phase.

  • `L2_baseline_with_lp`
    IP.1, LP.1, and EP.1 are included in the prompt. Task description and learning phase, but no explanations.

  • `E2_global_concepts_with_lp`
    IP.1, IP.2, LP.1, and EP.1 are included in the prompt. Task description, global concepts explanation, and learning phase, but no local concepts explanation.

  • `E3_global_and_local_concepts_with_lp`
    IP.1, IP.2, LP.1, LP.2, and EP.1 are included in the prompt. Task description, learning phase, and both global and local concepts explanations.

  • `U1_upper_bound_concepts_at_ep`
    IP.1, IP.2, LP.1, LP.2, EP.1, and EP.2 are included in the prompt. Same as E3_global_and_local_concepts_with_lp, but local explanations are also given at the evaluation phase. This has a very high probability of leaking the initial model predictions via the EP local explanations. Warning: this should not be considered a ConSim score, but it gives an upper bound on ConSim scores.
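
To compare these variants in practice, one can loop over several prompt types and evaluate each. A minimal sketch, assuming the objects defined in the ConSim examples above:

>>> results = {}
>>> for pt in [
...     PromptTypes.L2_baseline_with_lp,
...     PromptTypes.E2_global_concepts_with_lp,
...     PromptTypes.E3_global_and_local_concepts_with_lp,
... ]:
...     # the concept explainer is not needed for the baseline prompt
...     explainer = None if pt is PromptTypes.L2_baseline_with_lp else concept_explainer_1
...     results[pt.name] = consim.evaluate(samples, predictions, explainer, prompt_type=pt)
>>> # Explanations are useful if the E2/E3 scores beat the L2 baseline.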