LLM Labels

LLMLabels uses a large language model to generate a natural-language label for each concept from a sample of the inputs that activate it (by default, the top-k activating inputs). This gives a human-readable summary of what each concept represents.
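
To illustrate the default selection step, the sketch below picks the k inputs with the highest activation for a single concept. It uses plain Python lists instead of the library's tensors, and `top_k_examples` is a hypothetical helper, not part of interpreto.

```python
def top_k_examples(inputs: list[str], activations: list[float], k: int) -> list[tuple[str, float]]:
    """Return the k (input, activation) pairs with the highest activation, highest first."""
    if len(inputs) != len(activations):
        raise ValueError("inputs and activations must have the same length")
    # Rank indices by activation, descending; sorted() stays stable with reverse=True,
    # so ties keep their original input order.
    order = sorted(range(len(inputs)), key=lambda i: activations[i], reverse=True)
    return [(inputs[i], activations[i]) for i in order[:k]]


inputs = ["the cat sat", "stock prices fell", "a kitten purred", "rain tomorrow"]
activations = [0.9, 0.1, 0.8, 0.0]  # toy activations of one concept
print(top_k_examples(inputs, activations, k=2))
```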

API Reference

interpreto.concepts.interpretations.LLMLabels

LLMLabels(*, concept_explainer, activation_granularity=None, aggregation_strategy=MEAN, llm_interface, concept_encoding_batch_size=1024, sampling_method=TOP, k_examples=30, k_context=0, use_vocab=False, use_unique_words=0, unique_words_kwargs={}, k_quantile=5, system_prompt=None)

Bases: BaseConceptInterpretationMethod

Code: concepts/interpretations/llm_labels.py

Implements automatic labeling with a large language model (LLM), which writes a short textual description of a concept from examples of inputs that activate it. The method was introduced by Bills et al.¹; only step 1 of that method (generating an explanation from activating examples) is implemented here.

¹ Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, William Saunders. Language models can explain neurons in language models. 2023.
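
Step 1 amounts to showing the LLM a handful of activating examples and asking for a short description. The sketch below assembles a prompt in the same shape that `interpret` passes to `LLMInterface.generate`: a list of `(role, text)` pairs ending with an empty assistant turn. The `Role` enum, the system prompt text, and the example formatting are stand-ins written for this illustration; the library's default prompts and its `_build_example_prompt` helper may format things differently.

```python
from enum import Enum


class Role(Enum):
    # Stand-in for interpreto's Role; only the member names are taken from `interpret`.
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"


# Hypothetical system prompt, not the library's default.
SYSTEM_PROMPT = "You label concepts. Given examples that activate a concept, reply with a short description."


def build_prompt(examples: list[tuple[str, float]]) -> list[tuple[Role, str]]:
    """Build a role-tagged prompt from (example, activation) pairs."""
    lines = [
        f"Example {i}: {text} (activation: {act:.2f})"
        for i, (text, act) in enumerate(examples, start=1)
    ]
    return [
        (Role.SYSTEM, SYSTEM_PROMPT),
        (Role.USER, "\n".join(lines)),
        (Role.ASSISTANT, ""),  # empty assistant turn for the LLM to complete
    ]


prompt = build_prompt([("the cat sat", 0.9), ("a kitten purred", 0.8)])
for role, text in prompt:
    print(role.name, "|", text)
```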

Parameters:

Name	Type	Description	Default
`concept_explainer`	`ConceptEncoderExplainer`	The fitted concept explainer used for encoding activations.	required
`activation_granularity`	`ActivationGranularity | None`	The granularity of the activations to use for the interpretation. See `interpreto.concepts.splitters.model_with_split_points.ModelWithSplitPoints.get_activations` for more details.	`None`
`aggregation_strategy`	`GranularityAggregationStrategy`	The aggregation strategy to use for the activations. See `interpreto.concepts.splitters.model_with_split_points.ModelWithSplitPoints.get_activations` for more details.	`MEAN`
`llm_interface`	`LLMInterface`	The LLM interface used to generate the labels.	required
`concept_encoding_batch_size`	`int`	The batch size to use for the concept encoding.	`1024`
`sampling_method`	`SamplingMethod`	The method used to sample, for each concept, the examples shown to the LLM (for example `TOP` or `QUANTILE`).	`TOP`
`k_examples`	`int`	The number of examples shown to the LLM for each concept.	`30`
`k_context`	`int`	The number of context tokens to show around each activating token: in every example of the prompt, the `k_context` tokens before and after the activating token are included. Values between 5 and 10 are recommended for the TOKEN and WORD granularities. If the granularity is CLS_TOKEN or SAMPLE, or if `use_unique_words` or `use_vocab` is enabled, it is forced to 0 (with a warning), because surrounding context is not meaningful in these cases.	`0`
`use_vocab`	`bool`	If True, the interpretation is computed from the vocabulary of the model.	`False`
`use_unique_words`	`bool | int`	If True, the interpretation is computed from the unique words of the inputs. Incompatible with `use_vocab=True`. By default, every distinct word of the inputs is used; the selection can be tuned through `unique_words_kwargs`.	`0`
`unique_words_kwargs`	`dict`	Keyword arguments passed to the `extract_ngrams` function: `count_min_threshold`, `lemmatize`, or `words_to_ignore`. See `extract_ngrams` for more details.	`{}`
`k_quantile`	`int`	The number of quantiles used to sample the examples when `sampling_method` is `QUANTILE`.	`5`
`system_prompt`	`str | None`	The system prompt to use for the LLM. If None, a default prompt is used, chosen according to whether `k_context` is greater than 0.	`None`
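
This page does not spell out how `QUANTILE` sampling picks examples. One plausible reading of `k_quantile`, sketched below with a hypothetical `quantile_sample` helper, is to rank the examples by activation, cut the ranking into `k_quantile` bins of roughly equal size, and take the strongest examples of each bin, so that the LLM also sees weakly activating inputs. The library's `SamplingMethod.QUANTILE` may bin or sample differently (for example, at random within each bin).

```python
def quantile_sample(activations: list[float], k_examples: int, k_quantile: int) -> list[int]:
    """Pick example indices spread across activation quantiles (illustrative only)."""
    # Rank indices from highest to lowest activation.
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    n = len(order)
    per_bin = max(1, k_examples // k_quantile)
    picked: list[int] = []
    for q in range(k_quantile):
        # Contiguous slice of the ranking that covers the q-th quantile.
        start, stop = q * n // k_quantile, (q + 1) * n // k_quantile
        picked.extend(order[start:stop][:per_bin])
    return picked[:k_examples]


acts = [float(x) for x in range(10)]  # index i has activation i
print(quantile_sample(acts, k_examples=5, k_quantile=5))
```

Top-k sampling on the same activations would keep only the five highest-ranked indices; the quantile variant trades some peak examples for coverage of the whole activation range.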

Source code in interpreto/concepts/interpretations/llm_labels.py

def __init__(
    self,
    *,
    concept_explainer: ConceptEncoderExplainer,
    activation_granularity: ActivationGranularity | None = None,
    aggregation_strategy: GranularityAggregationStrategy = GranularityAggregationStrategy.MEAN,
    llm_interface: LLMInterface,
    concept_encoding_batch_size: int = 1024,
    sampling_method: SamplingMethod = SamplingMethod.TOP,
    k_examples: int = 30,
    k_context: int = 0,
    use_vocab: bool = False,
    use_unique_words: bool | int = 0,
    unique_words_kwargs: dict = {},
    k_quantile: int = 5,
    system_prompt: str | None = None,
):
    super().__init__(
        concept_explainer=concept_explainer,
        activation_granularity=activation_granularity,
        aggregation_strategy=aggregation_strategy,
        concept_encoding_batch_size=concept_encoding_batch_size,
        use_vocab=use_vocab,
        use_unique_words=use_unique_words,
        unique_words_kwargs=unique_words_kwargs,
    )

    if k_context > 0 and (
        use_vocab
        or use_unique_words
        or self.activation_granularity
        in [
            ActivationGranularity.SAMPLE,
            ActivationGranularity.CLS_TOKEN,
        ]
    ):
        k_context = 0
        warnings.warn(
            "k_context is set to 0 because use_vocab or use_unique_words is enabled or activation_granularity is SAMPLE or CLS_TOKEN. "
            "With these settings, it is not possible to provide context around the granular inputs.",
            stacklevel=2,
        )

    self.llm_interface = llm_interface
    self.sampling_method = sampling_method
    self.k_examples = k_examples
    self.k_context = k_context
    self.k_quantile = k_quantile

    if system_prompt is None:
        if self.k_context > 0:
            self.system_prompt = SYSTEM_PROMPT_WITH_CONTEXT
        else:
            self.system_prompt = SYSTEM_PROMPT_WITHOUT_CONTEXT
    else:
        self.system_prompt = system_prompt
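
For TOKEN and WORD granularities, `k_context` adds neighboring tokens to each example. The helper that does this (`_format_examples`) is not shown on this page, so the following is only a sketch: `context_window` is hypothetical, and it assumes that granular tokens are stored flat alongside their sample ids (as the `granular_sample_ids` argument passed to `_format_examples` in `interpret` suggests) and that context should not cross sample boundaries.

```python
def context_window(tokens: list[str], sample_ids: list[int], idx: int, k_context: int) -> str:
    """Return tokens[idx] with up to k_context neighbors on each side from the same sample."""
    sid = sample_ids[idx]
    left = [j for j in range(max(0, idx - k_context), idx) if sample_ids[j] == sid]
    right = [j for j in range(idx + 1, min(len(tokens), idx + k_context + 1)) if sample_ids[j] == sid]
    before = " ".join(tokens[j] for j in left)
    after = " ".join(tokens[j] for j in right)
    # Bracket the activating token (an arbitrary marker chosen for this sketch).
    return " ".join(part for part in (before, f"[{tokens[idx]}]", after) if part)


tokens = ["the", "cat", "sat", "on", "the", "mat", "dogs", "bark"]
sample_ids = [0, 0, 0, 0, 0, 0, 1, 1]  # two samples, stored back to back
print(context_window(tokens, sample_ids, idx=2, k_context=2))
print(context_window(tokens, sample_ids, idx=5, k_context=2))
```

With `k_context=0` each example reduces to the granular input alone, which is what the check above enforces for SAMPLE and CLS_TOKEN granularities and for the vocabulary and unique-word modes.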

interpret

interpret(concepts_indices, inputs=None, latent_activations=None, concepts_activations=None)

Interpret the selected concept dimensions of the latent space in a human-readable format: the result maps each concept index to a short textual description. The granularity of the input examples is determined by the `activation_granularity` attribute.

Parameters:

Name	Type	Description	Default
`concepts_indices`	`int | list[int] | Literal['all']`	The indices of the concepts to interpret. If "all", all concepts are interpreted.	required
`inputs`	`list[str] | None`	The inputs to use for the interpretation. Required unless `use_vocab=True`, as examples are extracted from the inputs.	`None`
`latent_activations`	`Float[Tensor, 'nl d'] | None`	The latent activations matching the inputs. If not provided, they are computed from the inputs.	`None`
`concepts_activations`	`Float[Tensor, 'nl cpt'] | None`	The concept activations matching the inputs. If not provided, they are computed from the inputs or the latent activations.	`None`

Returns:

Type	Description
`Mapping[int, str | None]`	The textual label of each concept index, or None when the LLM interface returns no label.

Source code in interpreto/concepts/interpretations/llm_labels.py

def interpret(
    self,
    concepts_indices: int | list[int] | Literal["all"],
    inputs: list[str] | None = None,
    latent_activations: LatentActivations | None = None,
    concepts_activations: ConceptsActivations | None = None,
) -> Mapping[int, str | None]:
    """
    Give the interpretation of the concepts dimensions in the latent space into a human-readable format.
    The interpretation is a mapping between the concepts indices and a short textual description.
    The granularity of input examples is determined by the `activation_granularity` class attribute.


    Args:
        concepts_indices (int | list[int] | Literal["all"]):
            The indices of the concepts to interpret. If "all", all concepts are interpreted.

        inputs (list[str] | None):
            The inputs to use for the interpretation.
            Necessary if not `use_vocab`, as examples are extracted from the inputs.

        latent_activations (Float[torch.Tensor, "nl d"] | None):
            The latent activations matching the inputs. If not provided,
            it is computed from the inputs.

        concepts_activations (Float[torch.Tensor, "nl cpt"] | None):
            The concepts activations matching the inputs. If not provided,
            it is computed from the inputs or latent activations.

    Returns:
        Mapping[int, str | None]: The textual labels of the concepts indices.
    """
    sure_concepts_indices: list[int]
    granular_inputs: list[str]
    sure_concepts_activations: Float[torch.Tensor, "nl cpt"]
    granular_sample_ids: list[int]
    (
        sure_concepts_indices,
        granular_inputs,
        sure_concepts_activations,
        granular_sample_ids,
    ) = self.get_granular_inputs_and_concept_activations(
        concepts_indices=concepts_indices,
        inputs=inputs,
        latent_activations=latent_activations,
        concepts_activations=concepts_activations,
    )

    labels: dict[int, str | None] = {}  # mutable dict; returned as a Mapping
    for concept_idx in sure_concepts_indices:
        example_idx = self.sampling_method.sample_examples(
            concept_activations=sure_concepts_activations[:, concept_idx],
            k_examples=self.k_examples,
            k_quantile=self.k_quantile,
        )
        examples = _format_examples(
            example_ids=example_idx,
            inputs=granular_inputs,
            concept_activations=sure_concepts_activations[:, concept_idx],
            sample_ids=granular_sample_ids,
            k_context=self.k_context,
        )
        example_prompt = _build_example_prompt(examples)
        prompt: list[tuple[Role, str]] = [
            (Role.SYSTEM, self.system_prompt),
            (Role.USER, example_prompt),
            (Role.ASSISTANT, ""),
        ]
        label = self.llm_interface.generate(prompt)
        labels[concept_idx] = label
    return labels
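
Stripped of tensors and helper functions, the loop above does three things per concept: sample examples, format them into a prompt, and call the LLM, storing whatever it returns (possibly None). The sketch below reproduces that flow with plain lists, top-k sampling only, and a caller-supplied `generate` function that takes a plain string; `label_concepts` and `fake_generate` are hypothetical stand-ins for `interpret` and `llm_interface.generate`.

```python
from __future__ import annotations

from collections.abc import Callable


def label_concepts(
    concept_activations: dict[int, list[float]],
    inputs: list[str],
    generate: Callable[[str], str | None],
    k_examples: int = 3,
) -> dict[int, str | None]:
    """Label each concept from its top-k activating inputs (simplified loop)."""
    labels: dict[int, str | None] = {}
    for concept_idx, acts in concept_activations.items():
        # 1. sample: indices of the k highest activations for this concept
        top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k_examples]
        # 2. format: one line per example, with its activation
        prompt = "\n".join(f"{inputs[i]} ({acts[i]:.2f})" for i in top)
        # 3. generate: a None result is stored as-is
        labels[concept_idx] = generate(prompt)
    return labels


def fake_generate(prompt: str) -> str | None:
    """Stand-in for an LLM: echo the first prompt line, or None for an empty prompt."""
    return prompt.splitlines()[0] if prompt else None


inputs = ["the cat sat", "stock prices fell", "a kitten purred"]
acts = {0: [0.9, 0.1, 0.8], 1: [0.0, 0.7, 0.2]}
print(label_concepts(acts, inputs, fake_generate, k_examples=2))
```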

interpreto.commons.llm_interface.LLMInterface

Bases: ABC

generate `abstractmethod`

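generate(prompt)

Generate a response to `prompt`, a list of `(Role, str)` messages. Returns the generated text, or None.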

Source code in interpreto/commons/llm_interface.py

@abstractmethod
def generate(self, prompt: list[tuple[Role, str]]) -> str | None:
    pass
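
A concrete interface only has to implement `generate`, which receives the role-tagged messages built by `LLMLabels.interpret` and returns a label or None. An offline stub like the one below can be handy for exercising the labeling pipeline without calling a real model. So that the sketch runs without interpreto installed, it redefines `Role` and `LLMInterface` as stand-ins; in practice you would subclass `interpreto.commons.llm_interface.LLMInterface` directly. The module and member values of the real `Role` enum are not shown on this page; only the member names SYSTEM, USER and ASSISTANT appear in the `interpret` source. `FirstLineLLM` is hypothetical.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from enum import Enum


class Role(Enum):  # stand-in for interpreto's Role
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"


class LLMInterface(ABC):  # stand-in mirroring the abstract class documented above
    @abstractmethod
    def generate(self, prompt: list[tuple[Role, str]]) -> str | None:
        pass


class FirstLineLLM(LLMInterface):
    """Offline stub: return the first line of the last non-empty user message, else None."""

    def generate(self, prompt: list[tuple[Role, str]]) -> str | None:
        user_texts = [text for role, text in prompt if role is Role.USER and text]
        if not user_texts:
            return None
        return user_texts[-1].splitlines()[0]


llm = FirstLineLLM()
print(llm.generate([(Role.SYSTEM, "Label the concept."), (Role.USER, "cats\nkittens"), (Role.ASSISTANT, "")]))
```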

interpreto.concepts.interpretations.extract_ngrams

extract_ngrams(inputs, n=1, count_min_threshold=1, return_counts=False, lemmatize=False, words_to_ignore=None)

Extract n-grams (from 1-gram up to n-gram of words) from a list of texts.

If n=3, it extracts 1-grams, 2-grams, and 3-grams.
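
To make the counting concrete, here is a dependency-free sketch of the same idea. It swaps nltk's `word_tokenize` for `str.split` and omits lemmatization and `words_to_ignore`, so its tokenization of punctuation differs from the real function; `ngrams_sketch` is a hypothetical name.

```python
from __future__ import annotations

from collections import Counter


def ngrams_sketch(inputs: list[str], n: int = 1, count_min_threshold: int = 1) -> Counter[str]:
    """Count every 1..n-gram across inputs and drop those seen fewer than count_min_threshold times."""
    counts: Counter[tuple[str, ...]] = Counter()
    for text in inputs:
        tokens = text.split()  # the real function uses nltk.word_tokenize
        for size in range(1, n + 1):
            for i in range(len(tokens) - size + 1):
                counts[tuple(tokens[i : i + size])] += 1
    return Counter({" ".join(k): c for k, c in counts.items() if c >= count_min_threshold})


print(ngrams_sketch(["the cat sat", "the cat ran"], n=2, count_min_threshold=2))
```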

Parameters:

Name	Type	Description	Default
`inputs`	`Iterable[str]`	The texts to extract n-grams from.	required
`n`	`int`	The maximum n-gram size. All sizes from 1 to n are extracted.	`1`
`count_min_threshold`	`int`	The minimum total number of occurrences of an n-gram in the whole `inputs`.	`1`
`return_counts`	`bool`	Whether to return the counts of each n-gram.	`False`
`lemmatize`	`bool`	Whether to lemmatize words before counting.	`False`
`words_to_ignore`	`list[str] | None`	A list of words to ignore (applied to individual tokens before forming n-grams).	`None`

Returns:

Type	Description
`list[str] | Counter[str]`	The list of unique n-grams, or, if `return_counts=True`, a `Counter` mapping each n-gram to its count.

Source code in interpreto/concepts/interpretations/base.py

@jaxtyped(typechecker=beartype)
def extract_ngrams(
    inputs: Iterable[str],
    n: int = 1,
    count_min_threshold: int = 1,
    return_counts: bool = False,
    lemmatize: bool = False,
    words_to_ignore: list[str] | None = None,
) -> list[str] | Counter[str]:
    """
    Extract n-grams (from 1-gram up to n-gram of words) from a list of texts.

    If n=3, it extracts 1-grams, 2-grams, and 3-grams.

    Args:
        inputs (Iterable[str]):
            The texts to extract n-grams from.

        n (int):
            The maximum n-gram size. All sizes from 1 to n are extracted.

        count_min_threshold (int, optional):
            The minimum total number of occurrences of an n-gram in the whole `inputs`.

        return_counts (bool, optional):
            Whether to return the counts of each n-gram.
            Defaults to False.

        lemmatize (bool, optional):
            Whether to lemmatize words before counting.

        words_to_ignore (list[str] | None, optional):
            A list of words to ignore (applied to individual tokens before forming n-grams).

    Returns:
        list[str] | Counter[str]:
            The list of unique n-grams or the counts of each n-gram.
    """
    _ensure_nltk_resources(lemmatize=lemmatize)

    if lemmatize:
        lemmatizer = WordNetLemmatizer()

    tuple_ngram_counts: Counter[tuple[str, ...]] = Counter()  # keys are variable-length word tuples

    for text in inputs:
        tokens = word_tokenize(text)

        # preprocess tokens
        processed = []
        for word in tokens:
            if lemmatize:
                word = lemmatizer.lemmatize(word.lower())  # noqa: PLW2901  # type: ignore  (ignore possibly unbound)
            if words_to_ignore is not None and word in words_to_ignore:
                continue
            processed.append(word)
            tuple_ngram_counts[(word,)] += 1  # unigram tuple

        for size in range(2, n + 1):  # size 1 was already counted in the loop above
            for i in range(len(processed) - size + 1):
                tuple_ngram_counts[tuple(processed[i : i + size])] += 1  # >1-gram tuples

    str_ngram_counts: Counter[str] = Counter(
        {
            " ".join(key): count  # convert ngram tuples to strings
            for key, count in tuple_ngram_counts.items()
            if count >= count_min_threshold  # filter too rare n-grams
        }
    )

    if return_counts:
        return str_ngram_counts

    return list(str_ngram_counts.keys())
