Concept Interpretations
interpreto.concepts.interpretations.TopKInputs
TopKInputs(*, concept_explainer, activation_granularity=WORD, aggregation_strategy=MEAN, concept_encoding_batch_size=1024, k=5, use_vocab=False, use_unique_words=0, unique_words_kwargs={}, concept_model_device=None)
Bases: BaseConceptInterpretationMethod
Code concepts/interpretations/topk_inputs.py
Implementation of the Top-K Inputs concept interpretation method, also called MaxAct or CMAW. It associates each concept with the inputs that activate it the most. This is the most natural way to interpret a concept, so several papers have used it without describing it. Nonetheless, we can reference Bricken et al. (2023) [1], Anthropic's post on transformer-circuits.
[1] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits, 2023.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concept_explainer` | `ConceptEncoderExplainer` | The concept explainer built on top of a … | required |
| `activation_granularity` | `ActivationGranularity` | The granularity of the activations to use for the interpretation. See :method: … | `WORD` |
| `aggregation_strategy` | `GranularityAggregationStrategy` | The aggregation strategy to use for the activations. See :method: … | `MEAN` |
| `concept_encoding_batch_size` | `int` | The batch size to use for the concept encoding. | `1024` |
| `k` | `int` | The number of inputs to use for the interpretation. | `5` |
| `use_vocab` | `bool` | If True, the interpretation will be computed from the vocabulary of the model. | `False` |
| `use_unique_words` | `bool` | If True, the interpretation will be computed from the unique words of the inputs. Incompatible with … | `0` |
| `unique_words_kwargs` | `dict` | The kwargs to pass to the … | `{}` |
| `concept_model_device` | `device \| str \| None` | The device to use for the concept model forward pass. If None, does not change the device. | `None` |
Examples:
Minimal example: finding the top-k tokens that activate a neuron:
>>> from transformers import AutoModelForCausalLM
>>>
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import NeuronsAsConcepts, TopKInputs
>>>
>>> # load and split the GPT2 model
>>> mwsp = ModelWithSplitPoints(
... "gpt2",
... split_points=[11], # split at the 12th layer
... automodel=AutoModelForCausalLM,
... device_map="auto",
... batch_size=2048,
... )
>>>
>>> # Use `NeuronsAsConcepts` to use the concept-based pipeline with neurons
>>> concept_explainer = NeuronsAsConcepts(mwsp)
>>>
>>> method = TopKInputs(
... concept_explainer=concept_explainer,
... use_vocab=True, # use the vocabulary of the model and test all tokens (50257 with GPT2)
... k=10, # get the top 10 tokens for each neuron
... )
>>>
>>> topk_tokens = method.interpret(
... concepts_indices="all", # interpret all neurons at the split point
... )
>>>
>>> print(list(topk_tokens[1].keys()))
['hostages', 'choke', 'infring', 'herpes', 'nuns', 'phylogen', 'watched', 'alitarian', 'tattoos', 'fisher']
>>> # The results are hard to interpret, in part because of superposition.
>>> # This is why we use dictionary learning to find concept directions!
Classification example: fit concepts on the [CLS] token activations,
then use TopKInputs with use_unique_words=True and activation_granularity=CLS_TOKEN:
>>> from datasets import load_dataset
>>> from transformers import AutoModelForSequenceClassification
>>>
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import ICAConcepts, TopKInputs
>>>
>>> CLS_TOKEN = ModelWithSplitPoints.activation_granularities.CLS_TOKEN
>>>
>>> # load and split an IMDB classification model
>>> mwsp = ModelWithSplitPoints(
... "textattack/bert-base-uncased-imdb",
... split_points=[11], # split at the last layer
... automodel=AutoModelForSequenceClassification,
... device_map="cuda",
... batch_size=64,
... )
>>>
>>> # load the IMDB dataset and compute a dataset of [CLS] token activations
>>> imdb = load_dataset("stanfordnlp/imdb", split="train")["text"][:1000]
>>> activations = mwsp.get_activations(imdb, activation_granularity=CLS_TOKEN)
>>>
>>> # Load and fit a concept-based explainer
>>> concept_explainer = ICAConcepts(mwsp, nb_concepts=20)
>>> concept_explainer.fit(activations)
>>>
>>> method = TopKInputs(
... concept_explainer=concept_explainer,
... activation_granularity=CLS_TOKEN,
... k=5, # get the top 5 words for each concept
... use_unique_words=True, # necessary to get topk words on the [CLS] token
... unique_words_kwargs={
... "count_min_threshold": 5, # only consider words that appear at least 5 times in the dataset
... "lemmatize": True, # group words by their lemma (e.g., "bad" and "badly" are grouped together)
... }
... )
>>>
>>> topk_words = method.interpret(
... inputs=imdb,
... concepts_indices="all", # interpret all concepts
... )
>>>
>>> print(list(topk_words[1].keys()))
['bad', 'bad.', 'hackneyed', 'clichéd', 'cannibal']
Generation example: use either TOKEN or WORD granularity for activations.
WORD granularity lets you select the top-k words for each concept without recomputing the activations.
>>> from datasets import load_dataset
>>> from transformers import AutoModelForCausalLM
>>>
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import ICAConcepts, TopKInputs
>>>
>>> WORD = ModelWithSplitPoints.activation_granularities.WORD
>>>
>>> # load and split the Qwen3 model
>>> mwsp = ModelWithSplitPoints(
... "Qwen/Qwen3-0.6B",
... split_points=[9], # split at the 10th layer
... automodel=AutoModelForCausalLM,
... device_map="auto",
... batch_size=16,
... )
>>>
>>> # load the IMDB dataset and compute a dataset of words activations
>>> imdb = load_dataset("stanfordnlp/imdb", split="train")["text"][:1000]
>>> activations = mwsp.get_activations(imdb, activation_granularity=WORD)
>>>
>>> # Load and fit a concept-based explainer
>>> concept_explainer = ICAConcepts(mwsp, nb_concepts=10)
>>> concept_explainer.fit(activations)
>>>
>>> method = TopKInputs(
... concept_explainer=concept_explainer,
... activation_granularity=WORD, # we want the topk words for each concept
... k=10, # get the top 10 words for each concept
... concept_model_device="cuda",
... )
>>>
>>> topk_tokens = method.interpret(
... concepts_indices="all", # interpret all concepts
... inputs=imdb,
... latent_activations=activations, # use previously computed activations (same granularity)
... )
interpret
interpret(concepts_indices='all', inputs=None, latent_activations=None, concepts_activations=None)
Interpret the concept dimensions of the latent space in a human-readable format.
The interpretation is a mapping from concept indices to the inputs that best explain them.
The granularity of the input examples is determined by the activation_granularity class attribute.
The returned inputs are the most activating inputs for each concept.
If all activations for a concept are zero, its interpretation is set to None.
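The selection rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `topk_inputs_sketch` is hypothetical, and the per-concept result is assumed to be a dict mapping each input to its activation (the examples only show that `.keys()` yields the inputs).

```python
def topk_inputs_sketch(inputs, concept_activations, k=5):
    """Map each concept index to its k most activating inputs (illustrative only)."""
    n_concepts = len(concept_activations[0])
    interpretations = {}
    for concept_idx in range(n_concepts):
        # Activations of every input for this single concept.
        column = [row[concept_idx] for row in concept_activations]
        if all(a == 0 for a in column):
            # A concept that never activates gets no interpretation.
            interpretations[concept_idx] = None
            continue
        # Input positions ranked from highest to lowest activation.
        ranked = sorted(range(len(column)), key=lambda i: column[i], reverse=True)
        interpretations[concept_idx] = {inputs[i]: column[i] for i in ranked[:k]}
    return interpretations


words = ["good", "bad", "awful", "fine"]
# Rows are inputs, columns are concepts; the last concept never activates.
acts = [
    [0.9, 0.0, 0.0],
    [0.1, 0.8, 0.0],
    [0.0, 0.7, 0.0],
    [0.5, 0.2, 0.0],
]
result = topk_inputs_sketch(words, acts, k=2)
print(result)
```

With these toy activations, concept 0 is explained by "good" and "fine", concept 1 by "bad" and "awful", and concept 2 maps to None.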
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concepts_indices` | `int \| list[int] \| Literal['all']` | The indices of the concepts to interpret. If "all", all concepts are interpreted. | `'all'` |
| `inputs` | `list[str] \| None` | The inputs to use for the interpretation. Necessary if not … | `None` |
| `latent_activations` | `dict[str, Tensor] \| Float[Tensor, 'nl d'] \| None` | The latent activations matching the inputs. If not provided, it is computed from the inputs. | `None` |
| `concepts_activations` | `Float[Tensor, 'nl cpt'] \| None` | The concepts activations matching the inputs. If not provided, it is computed from the inputs or latent activations. | `None` |
Returns:
| Type | Description |
|---|---|
| `Mapping[int, Any]` | The interpretation of the concepts indices. |
interpreto.concepts.interpretations.LLMLabels
LLMLabels(*, concept_explainer, activation_granularity=TOKEN, aggregation_strategy=MEAN, llm_interface, concept_encoding_batch_size=1024, sampling_method=TOP, k_examples=30, k_context=0, use_vocab=False, use_unique_words=0, unique_words_kwargs={}, k_quantile=5, system_prompt=None, concept_model_device=None)
Bases: BaseConceptInterpretationMethod
Code concepts/interpretations/llm_labels.py
Implements the automatic labeling method, which uses a large language model (LLM) to provide a short textual description of a concept given some examples of what activates it. This method was first introduced by Bills et al. (2023) [1]; we implement step 1 of the method here.
[1] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, William Saunders*. Language models can explain neurons in language models. 2023.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concept_explainer` | `ConceptEncoderExplainer` | The fitted concept explainer used for encoding activations. | required |
| `activation_granularity` | `ActivationGranularity` | The granularity of the activations to use for the interpretation. See :method: … | `TOKEN` |
| `aggregation_strategy` | `GranularityAggregationStrategy` | The aggregation strategy to use for the activations. See :method: … | `MEAN` |
| `llm_interface` | `LLMInterface` | The LLM interface to use for the interpretation. | required |
| `concept_encoding_batch_size` | `int` | The batch size to use for the concept encoding. | `1024` |
| `sampling_method` | `SAMPLING_METHOD` | The method to use for sampling the inputs provided to the LLM. | `TOP` |
| `k_examples` | `int` | The number of inputs to use for the interpretation. | `30` |
| `k_context` | `int` | The number of context tokens to use around the concept tokens. In the prompt, in the examples, the k context tokens before and after the concept token are selected. It is recommended to set it to between 5 and 10 for TOKEN and WORD granularities. However, if the granularity is CLS_TOKEN or SAMPLE, or … | `0` |
| `use_vocab` | `bool` | If True, the interpretation will be computed from the vocabulary of the model. | `False` |
| `use_unique_words` | `bool` | If True, the interpretation will be computed from the unique words of the inputs. Incompatible with … | `0` |
| `unique_words_kwargs` | `dict` | The kwargs to pass to the … | `{}` |
| `k_quantile` | `int` | The number of quantiles to use for sampling the inputs, if … | `5` |
| `system_prompt` | `str \| None` | The system prompt to use for the LLM. If None, a default prompt is used. | `None` |
| `concept_model_device` | `device \| str \| None` | The device to use for the concept model forward pass. If None, does not change the device. | `None` |
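To make the `k_context` description more concrete, here is a small sketch of selecting the k tokens before and after the most activating token. It assumes a symmetric window clipped at the sequence boundaries; the helper name `context_window` and the toy activations are hypothetical, and how LLMLabels actually formats examples in its prompt is not shown here.

```python
def context_window(tokens, index, k_context=0):
    """Return the token at `index` with up to k_context tokens on each side."""
    start = max(0, index - k_context)                # clip at the start of the sequence
    end = min(len(tokens), index + k_context + 1)    # clip at the end of the sequence
    return tokens[start:end]


tokens = ["The", "plot", "was", "painfully", "bad", "and", "slow"]
# Hypothetical activations of one concept on each token.
activations = [0.0, 0.1, 0.0, 0.4, 0.9, 0.0, 0.2]

# Position of the most activating token ("bad").
peak = max(range(len(tokens)), key=lambda i: activations[i])
example = " ".join(context_window(tokens, peak, k_context=2))
print(example)  # was painfully bad and slow
```

With `k_context=0`, the window reduces to the concept token alone, which matches the default value in the table.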
interpret
interpret(concepts_indices, inputs=None, latent_activations=None, concepts_activations=None)
Interpret the concept dimensions of the latent space in a human-readable format.
The interpretation is a mapping from concept indices to short textual descriptions.
The granularity of the input examples is determined by the activation_granularity class attribute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concepts_indices` | `int \| list[int] \| Literal['all']` | The indices of the concepts to interpret. If "all", all concepts are interpreted. | required |
| `inputs` | `list[str] \| None` | The inputs to use for the interpretation. Necessary if not … | `None` |
| `latent_activations` | `dict[str, Tensor] \| Float[Tensor, 'nl d'] \| None` | The latent activations matching the inputs. If not provided, it is computed from the inputs. | `None` |
| `concepts_activations` | `Float[Tensor, 'nl cpt'] \| None` | The concepts activations matching the inputs. If not provided, it is computed from the inputs or latent activations. | `None` |
Returns:
| Type | Description |
|---|---|
| `Mapping[int, str \| None]` | The textual labels of the concepts indices. |
interpreto.model_wrapping.llm_interface.LLMInterface
interpreto.concepts.interpretations.extract_ngrams
extract_ngrams(inputs, n=1, count_min_threshold=1, return_counts=False, lemmatize=False, words_to_ignore=None)
Extract n-grams (from 1-gram up to n-gram of words) from a list of texts.
If n=3, it extracts 1-grams, 2-grams, and 3-grams.
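The behaviour described above can be approximated with a simplified stand-in. This sketch is not the library's code: the name `extract_ngrams_sketch` is hypothetical, it assumes plain whitespace tokenization, and it leaves out lemmatization entirely.

```python
from collections import Counter


def extract_ngrams_sketch(texts, n=1, count_min_threshold=1,
                          return_counts=False, words_to_ignore=None):
    """Count all 1-grams up to n-grams of words across a list of texts."""
    ignored = set(words_to_ignore or [])
    counts = Counter()
    for text in texts:
        # Assumption: naive whitespace split; ignored words are dropped
        # before n-grams are formed.
        words = [w for w in text.split() if w not in ignored]
        for size in range(1, n + 1):
            for start in range(len(words) - size + 1):
                counts[" ".join(words[start:start + size])] += 1
    # Keep only n-grams that occur often enough over the whole list of texts.
    kept = Counter({g: c for g, c in counts.items() if c >= count_min_threshold})
    return kept if return_counts else list(kept)


texts = ["the movie was bad", "the movie was great"]
ngrams = extract_ngrams_sketch(texts, n=2, count_min_threshold=2, return_counts=True)
print(ngrams)
```

Here only the unigrams "the", "movie", "was" and the bigrams "the movie", "movie was" appear twice, so they are the only n-grams that pass the threshold.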
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `list[str]` | The texts to extract n-grams from. | required |
| `n` | `int` | The maximum n-gram size. All sizes from 1 to n are extracted. | `1` |
| `count_min_threshold` | `int` | The minimum total number of occurrences of an n-gram in the whole … | `1` |
| `return_counts` | `bool` | Whether to return the counts of each n-gram. Defaults to False. | `False` |
| `lemmatize` | `bool` | Whether to lemmatize words before counting. | `False` |
| `words_to_ignore` | `list[str] \| None` | A list of words to ignore (applied to individual tokens before forming n-grams). | `None` |
Returns:
| Type | Description |
|---|---|
| `list[str] \| Counter[str]` | The list of unique n-grams or the counts of each n-gram. |