Concept Interpretations
interpreto.concepts.interpretations.TopKInputs
¶
TopKInputs(*, concept_explainer, activation_granularity=WORD, concept_encoding_batch_size=1024, k=5, use_vocab=False, use_unique_words=False, unique_words_kwargs={}, concept_model_device=None)
Bases: BaseConceptInterpretationMethod
Code concepts/interpretations/topk_inputs.py
Implementation of the Top-K Inputs concept interpretation method, also called MaxAct or CMAW. It associates to each concept the inputs that activate it the most. This is the most natural way to interpret a concept, hence several papers have used it without describing it. Nonetheless, we can reference Bricken et al. (2023) 1 from Anthropic and their post on transformer-circuits. A minimal sketch of the underlying selection is given after the reference below.
-
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah Towards Monosemanticity: Decomposing Language Models With Dictionary Learning Transformer Circuits, 2023. ↩
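At its core, the method reduces to a top-k selection over a concepts-activation matrix. A minimal illustrative sketch (the matrix here is random; in practice it comes from the concept explainer):
>>> import torch
>>>
>>> # hypothetical activations: one row per input, one column per concept
>>> concepts_activations = torch.rand(1000, 20)  # (n_inputs, n_concepts)
>>>
>>> # for each concept, the indices of the k most activating inputs
>>> top_values, top_indices = torch.topk(concepts_activations, k=5, dim=0)
>>> top_indices.shape
torch.Size([5, 20])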
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concept_explainer` | `ConceptEncoderExplainer` | The concept explainer built on top of a `ModelWithSplitPoints`. | required |
| `activation_granularity` | `ActivationGranularity` | The granularity at which the interpretation is computed. Allowed values are defined in `ModelWithSplitPoints.activation_granularities` (e.g. `TOKEN`, `WORD`, `CLS_TOKEN`). | `WORD` |
| `concept_encoding_batch_size` | `int` | The batch size to use for the concept encoding. | `1024` |
| `k` | `int` | The number of inputs to use for the interpretation. | `5` |
| `use_vocab` | `bool` | If True, the interpretation will be computed from the vocabulary of the model. | `False` |
| `use_unique_words` | `bool` | If True, the interpretation will be computed from the unique words of the inputs. Incompatible with `use_vocab`. | `False` |
| `unique_words_kwargs` | `dict` | The kwargs to pass to the `extract_unique_words` function. | `{}` |
| `concept_model_device` | `device \| str \| None` | The device to use for the concept model forward pass. If None, does not change the device. | `None` |
Examples:
Minimal example, finding the top-k tokens activating a neuron:
>>> from transformers import AutoModelForCausalLM
>>>
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import NeuronsAsConcepts, TopKInputs
>>>
>>> # load and split the GPT2 model
>>> mwsp = ModelWithSplitPoints(
... "gpt2",
... split_points=[11], # split at the 12th layer
... automodel=AutoModelForCausalLM,
... device_map="auto",
... batch_size=2048,
... )
>>>
>>> # Use `NeuronsAsConcepts` to use the concept-based pipeline with neurons
>>> concept_explainer = NeuronsAsConcepts(mwsp)
>>>
>>> method = TopKInputs(
... concept_explainer=concept_explainer,
... use_vocab=True, # use the vocabulary of the model and test all tokens (50257 with GPT2)
... k=10, # get the top 10 tokens for each neuron
... )
>>>
>>> topk_tokens = method.interpret(
... concepts_indices="all", # interpret the three first neurons of the 7th layer
... )
>>>
>>> print(list(topk_tokens[1].keys()))
['hostages', 'choke', 'infring', 'herpes', 'nuns', 'phylogen', 'watched', 'alitarian', 'tattoos', 'fisher']
>>> # Results are not interpretable, due to superposition and similar effects.
>>> # This is why we use dictionary learning to find concept directions!
Classification example: fit concepts on the [CLS] token activations,
then use TopKInputs with use_unique_words=True and activation_granularity=CLS_TOKEN:
>>> from datasets import load_dataset
>>> from transformers import AutoModelForSequenceClassification
>>>
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import ICAConcepts, TopKInputs
>>>
>>> CLS_TOKEN = ModelWithSplitPoints.activation_granularities.CLS_TOKEN
>>>
>>> # load and split an IMDB classification model
>>> mwsp = ModelWithSplitPoints(
... "textattack/bert-base-uncased-imdb",
... split_points=[11], # split at the last layer
... automodel=AutoModelForSequenceClassification,
... device_map="cuda",
... batch_size=64,
... )
>>>
>>> # load the IMDB dataset and compute a dataset of [CLS] token activations
>>> imdb = load_dataset("stanfordnlp/imdb", split="train")["text"][:1000]
>>> activations = mwsp.get_activations(imdb, activation_granularity=CLS_TOKEN)
>>>
>>> # Load and fit a concept-based explainer
>>> concept_explainer = ICAConcepts(mwsp, nb_concepts=20)
>>> concept_explainer.fit(activations)
>>>
>>> method = TopKInputs(
... concept_explainer=concept_explainer,
... activation_granularity=CLS_TOKEN,
... k=5, # get the top 5 words for each concept
... use_unique_words=True, # necessary to get topk words on the [CLS] token
... unique_words_kwargs={
... "count_min_threshold": 5, # only consider words that appear at least 5 times in the dataset
... "lemmatize": True, # group words by their lemma (e.g., "bad" and "badly" are grouped together)
... }
... )
>>>
>>> topk_words = method.interpret(
... inputs=imdb,
... concepts_indices="all", # interpret the three first neurons of the 7th layer
... )
>>>
>>> print(list(topk_words[1].keys()))
['bad', 'bad.', 'hackneyed', 'clichéd', 'cannibal']
Generation example: use either TOKEN or WORD granularity for activations.
WORD makes it possible to select the top-k words for each concept without recomputing the activations.
>>> from datasets import load_dataset
>>> from transformers import AutoModelForCausalLM
>>>
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import ICAConcepts, TopKInputs
>>>
>>> WORD = ModelWithSplitPoints.activation_granularities.WORD
>>>
>>> # load and split the Qwen3 model
>>> mwsp = ModelWithSplitPoints(
... "Qwen/Qwen3-0.6B",
... split_points=[9], # split at the 10th layer
... automodel=AutoModelForCausalLM,
... device_map="auto",
... batch_size=16,
... )
>>>
>>> # load the IMDB dataset and compute a dataset of word activations
>>> imdb = load_dataset("stanfordnlp/imdb", split="train")["text"][:1000]
>>> activations = mwsp.get_activations(imdb, activation_granularity=WORD)
>>>
>>> # Load and fit a concept-based explainer
>>> concept_explainer = ICAConcepts(mwsp, nb_concepts=10)
>>> concept_explainer.fit(activations)
>>>
>>> method = TopKInputs(
... concept_explainer=concept_explainer,
... activation_granularity=WORD, # we want the topk words for each concept
... k=10, # get the top 10 words for each concept
... device="cuda",
... )
>>>
>>> topk_tokens = method.interpret(
... concepts_indices="all", # interpret the three first neurons of the 7th layer
... inputs=imdb,
... latent_activations=activations, # use previously computed activations (same granularity)
... )
Source code in interpreto/concepts/interpretations/topk_inputs.py
interpret
¶
interpret(concepts_indices, inputs=None, latent_activations=None, concepts_activations=None)
Gives the interpretation of the concept dimensions in the latent space in a human-readable format.
The interpretation is a mapping between concept indices and a list of inputs that allow interpreting them.
The granularity of input examples is determined by the `activation_granularity` class attribute.
The returned inputs are the most activating inputs for the concepts.
If all activations are zero, the corresponding concept interpretation is set to None.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concepts_indices` | `int \| list[int] \| Literal['all']` | The indices of the concepts to interpret. If "all", all concepts are interpreted. | required |
| `inputs` | `list[str] \| None` | The inputs to use for the interpretation. Necessary if `latent_activations` is not provided. | `None` |
| `latent_activations` | `dict[str, Tensor] \| Float[Tensor, 'nl d'] \| None` | The latent activations matching the inputs. If not provided, it is computed from the inputs. | `None` |
| `concepts_activations` | `Float[Tensor, 'nl cpt'] \| None` | The concepts activations matching the inputs. If not provided, it is computed from the inputs or latent activations. | `None` |
Returns:
| Type | Description |
|---|---|
| `Mapping[int, Any]` | The interpretation of the concepts indices. |
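Since dead concepts are mapped to None, callers may want to filter them out. An illustrative snippet, reusing the names from the examples above:
>>> interpretations = method.interpret(concepts_indices="all", inputs=imdb)
>>> # concepts whose activations are all zero map to None; keep the rest
>>> alive = {i: v for i, v in interpretations.items() if v is not None}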
Source code in interpreto/concepts/interpretations/topk_inputs.py
interpreto.concepts.interpretations.LLMLabels
¶
LLMLabels(*, concept_explainer, activation_granularity=TOKEN, llm_interface, concept_encoding_batch_size=1024, sampling_method=TOP, k_examples=30, k_context=0, use_vocab=False, use_unique_words=False, unique_words_kwargs={}, k_quantile=5, system_prompt=None, concept_model_device=None)
Bases: BaseConceptInterpretationMethod
Code concepts/interpretations/llm_labels.py
Implements the automatic labeling method, which uses a language model (LLM) to provide a short textual description of a concept given examples of what activates it. This method was first introduced by Bills et al. (2023) 1; we implement step 1 of the method here. A usage sketch is given after the parameter table below.
-
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, William Saunders* Language models can explain neurons in language models 2023. ↩
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concept_explainer` | `ConceptEncoderExplainer` | The fitted concept explainer used for encoding activations. | required |
| `activation_granularity` | `ActivationGranularity` | The granularity at which the interpretation is computed. Allowed values are defined in `ModelWithSplitPoints.activation_granularities` (e.g. `TOKEN`, `WORD`, `CLS_TOKEN`). | `TOKEN` |
| `llm_interface` | `LLMInterface` | The LLM interface to use for the interpretation. | required |
| `concept_encoding_batch_size` | `int` | The batch size to use for the concept encoding. | `1024` |
| `sampling_method` | `SAMPLING_METHOD` | The method to use for sampling the inputs provided to the LLM. | `TOP` |
| `k_examples` | `int` | The number of inputs to use for the interpretation. | `30` |
| `k_context` | `int` | The number of context tokens to use around the concept tokens. | `0` |
| `use_vocab` | `bool` | If True, the interpretation will be computed from the vocabulary of the model. | `False` |
| `use_unique_words` | `bool` | If True, the interpretation will be computed from the unique words of the inputs. Incompatible with `use_vocab`. | `False` |
| `unique_words_kwargs` | `dict` | The kwargs to pass to the `extract_unique_words` function. | `{}` |
| `k_quantile` | `int` | The number of quantiles to use for sampling the inputs, if a quantile-based `sampling_method` is used. | `5` |
| `system_prompt` | `str \| None` | The system prompt to use for the LLM. If None, a default prompt is used. | `None` |
| `concept_model_device` | `device \| str \| None` | The device to use for the concept model forward pass. If None, does not change the device. | `None` |
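A hedged usage sketch (the construction of the `LLMInterface` depends on the chosen backend and is elided here; keyword arguments follow the signature above, and `concept_explainer` and `imdb` are assumed to be fitted and loaded as in the TopKInputs examples):
>>> from interpreto.concepts.interpretations import LLMLabels
>>>
>>> # llm = ...  # an LLMInterface instance, see interpreto.model_wrapping.llm_interface
>>>
>>> method = LLMLabels(
...     concept_explainer=concept_explainer,  # a fitted ConceptEncoderExplainer
...     llm_interface=llm,
...     k_examples=30,  # number of examples shown to the LLM per concept
...     k_context=2,  # number of context tokens around each concept token
... )
>>> labels = method.interpret(concepts_indices="all", inputs=imdb)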
Source code in interpreto/concepts/interpretations/llm_labels.py
interpret
¶
interpret(concepts_indices, inputs=None, latent_activations=None, concepts_activations=None)
Gives the interpretation of the concept dimensions in the latent space in a human-readable format.
The interpretation is a mapping between concept indices and a short textual description.
The granularity of input examples is determined by the `activation_granularity` class attribute.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `concepts_indices` | `int \| list[int] \| Literal['all']` | The indices of the concepts to interpret. If "all", all concepts are interpreted. | required |
| `inputs` | `list[str] \| None` | The inputs to use for the interpretation. Necessary if `latent_activations` is not provided. | `None` |
| `latent_activations` | `dict[str, Tensor] \| Float[Tensor, 'nl d'] \| None` | The latent activations matching the inputs. If not provided, it is computed from the inputs. | `None` |
| `concepts_activations` | `Float[Tensor, 'nl cpt'] \| None` | The concepts activations matching the inputs. If not provided, it is computed from the inputs or latent activations. | `None` |
Returns:
| Type | Description |
|---|---|
| `Mapping[int, str \| None]` | The textual labels of the concepts indices. |
Source code in interpreto/concepts/interpretations/llm_labels.py
interpreto.model_wrapping.llm_interface.LLMInterface
¶
interpreto.concepts.interpretations.extract_unique_words
¶
extract_unique_words(inputs, count_min_threshold=1, return_counts=False, lemmatize=False, words_to_ignore=None)
Extracts unique words from a collection of texts.
Depending on the parameters, it may select a subset of the words or return the counts of each word.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `list[str]` | The texts to extract words from. | required |
| `count_min_threshold` | `float` | The minimum total number of occurrences of a word across the whole input for it to be kept. | `1` |
| `return_counts` | `bool` | Whether to return the counts of each word. Defaults to False. | `False` |
| `lemmatize` | `bool` | Whether to group words by their lemma (e.g., "bad" and "badly"). Defaults to False. | `False` |
| `words_to_ignore` | `list[str] \| None` | A list of words to ignore. | `None` |
Examples:
Fastest version, as used in TopKInputs:
>>> extract_unique_words(["Interpreto is the latin for 'to interpret'.", "interpreto is magic"])
['interpreto', 'is', 'the', 'latin', 'for', 'to', "'", 'interpret', '.', 'magic']
More complex use:
>>> import nltk
>>> from datasets import load_dataset
>>> from nltk.corpus import stopwords
>>>
>>> from interpreto.concepts.interpretations import extract_unique_words
>>>
>>> nltk.download("stopwords")
>>>
>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")["train"]["text"]
>>> extract_unique_words(
... inputs=dataset,
... count_min_threshold=20,
... return_counts=True,
... lemmatize=True,
... words_to_ignore=stopwords.words("english") + [".", ",", "'s", "n't", "--", "``", "'"],
... )
Counter({'film': 1402,
'movie': 1243,
'one': 594,
'like': 574,
'ha': 563,
'make': 437,
'story': 417,
...
'pop': 20,
'college': 20,
'bear': 20,
'plain': 20,
'generic': 20})
Returns:
| Type | Description |
|---|---|
| `list[str] \| Counter[str]` | The list of unique words or the counts of each word. |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the input is not a list of strings. |
Source code in interpreto/concepts/interpretations/base.py