TopKInputs or MaxAct
A generalization of the maximally activating inputs used in "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" by Bricken et al. (2023).
interpreto.concepts.interpretations.TopKInputs
TopKInputs(*, model_with_split_points, concept_model, activation_granularity=WORD, split_point=None, k=5, use_vocab=False)
Bases: BaseConceptInterpretationMethod
Code concepts/interpretations/topk_inputs.py
Implementation of the Top-K Inputs concept interpretation method, also called MaxAct. It associates each concept with the inputs that activate it the most. This is the most natural way to interpret a concept, so several papers have used it without describing it explicitly. Nonetheless, we can reference Bricken et al. (2023) [1] from Anthropic for their post on Transformer Circuits.

[1] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits, 2023.
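The selection rule described above can be illustrated with a minimal, library-free sketch. It is not the interpreto implementation: the function name `top_k_inputs_sketch`, the plain-list activation layout (one row per input, one column per concept), and the `(input, activation)` return format are all assumptions made for illustration.

```python
from typing import Optional


def top_k_inputs_sketch(
    inputs: list[str],
    activations: list[list[float]],
    k: int = 5,
) -> dict[int, Optional[list[tuple[str, float]]]]:
    """Hypothetical helper: map each concept index to its k most activating inputs.

    `activations[i][c]` is assumed to hold the activation of concept `c` on `inputs[i]`.
    """
    n_concepts = len(activations[0]) if activations else 0
    result: dict[int, Optional[list[tuple[str, float]]]] = {}
    for c in range(n_concepts):
        scored = [(text, row[c]) for text, row in zip(inputs, activations)]
        # A concept that never activates has no meaningful top inputs.
        if all(score == 0 for _, score in scored):
            result[c] = None
            continue
        # Sort by activation, highest first, and keep the first k entries.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[c] = scored[:k]
    return result


example = top_k_inputs_sketch(
    ["good", "bad", "movie"],
    [[0.9, 0.0], [0.1, 0.0], [0.5, 0.0]],
    k=2,
)
```

Here concept 0 is described by its two strongest inputs, while concept 1, which is zero everywhere, maps to `None`.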
Attributes:
Name | Type | Description |
---|---|---|
model_with_split_points | ModelWithSplitPoints | The model with split points to use for the interpretation. |
split_point | str | The split point to use for the interpretation. |
concept_model | ConceptModelProtocol | The concept model to use for the interpretation. |
activation_granularity | ActivationGranularity | The granularity at which the interpretation is computed. Allowed values are |
k | int | The number of inputs to use for the interpretation. |
use_vocab | bool | If True, the interpretation will be computed from the vocabulary of the model. |
Examples:
>>> from datasets import load_dataset
>>> from transformers import AutoModelForMaskedLM
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import NeuronsAsConcepts
>>> from interpreto.concepts.interpretations import TopKInputs
>>> # load and split the model
>>> split = "bert.encoder.layer.1.output"
>>> model_with_split_points = ModelWithSplitPoints(
... "hf-internal-testing/tiny-random-bert",
... split_points=[split],
... model_autoclass=AutoModelForMaskedLM,
... batch_size=4,
... )
>>> # NeuronsAsConcepts do not need to be fitted
>>> concept_model = NeuronsAsConcepts(model_with_split_points=model_with_split_points, split_point=split)
>>> # extracting concept interpretations
>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")["train"]["text"]
>>> # latent activations are not passed here, so they are computed from the inputs
>>> all_top_k_words = concept_model.interpret(
... interpretation_method=TopKInputs,
... activation_granularity=TopKInputs.activation_granularities.WORD,
... k=2,
... concepts_indices="all",
... inputs=dataset,
... )
interpret
interpret(concepts_indices, inputs=None, latent_activations=None, concepts_activations=None)
Interprets the concept dimensions of the latent space in a human-readable format.
The interpretation is a mapping from concept indices to a list of inputs that help interpret them.
The granularity of the input examples is determined by the activation_granularity class attribute.
The returned inputs are those that activate each concept the most.
If all activations for a concept are zero, the corresponding concept interpretation is set to None.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
concepts_indices | int \| list[int] | The indices of the concepts to interpret. | required |
inputs | list[str] \| None | The inputs to use for the interpretation. Necessary if not | None |
latent_activations | Float[Tensor, 'nl d'] \| None | The latent activations matching the inputs. If not provided, they are computed from the inputs. | None |
concepts_activations | Float[Tensor, 'nl cpt'] \| None | The concept activations matching the inputs. If not provided, they are computed from the inputs or the latent activations. | None |
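The defaults above suggest a fallback chain: concept activations are used when given, otherwise they are derived from latent activations, which are themselves derived from the inputs when missing. The sketch below illustrates that chain under stated assumptions. The function `resolve_concept_activations` and the two callables `compute_latents` and `encode_concepts` are hypothetical names, not part of the interpreto API, and the precedence order is inferred from the parameter descriptions rather than confirmed by the source.

```python
from typing import Any, Callable, Optional, Sequence


def resolve_concept_activations(
    inputs: Optional[Sequence[str]] = None,
    latent_activations: Optional[Any] = None,
    concepts_activations: Optional[Any] = None,
    *,
    compute_latents: Callable[[Sequence[str]], Any],
    encode_concepts: Callable[[Any], Any],
) -> Any:
    """Hypothetical helper: return concept activations, computing only what is missing."""
    # Assumed precedence: the most processed argument wins.
    if concepts_activations is not None:
        return concepts_activations
    if latent_activations is None:
        if inputs is None:
            raise ValueError("Provide inputs, latent_activations or concepts_activations.")
        # Stand-in for running the model up to the split point.
        latent_activations = compute_latents(inputs)
    # Stand-in for projecting latent activations into the concept space.
    return encode_concepts(latent_activations)


# Toy stand-ins: "latents" are word lengths, "concepts" double them.
resolved = resolve_concept_activations(
    inputs=["good", "bad"],
    compute_latents=lambda texts: [len(t) for t in texts],
    encode_concepts=lambda latents: [2 * x for x in latents],
)
```

Passing precomputed activations skips the corresponding step, which is mainly useful when the same activations are reused across several calls.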
Returns:
Type | Description |
---|---|
Mapping[int, Any] | The interpretation of the concepts indices. |
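Because a concept with all-zero activations maps to None, code that consumes the returned mapping should handle that case explicitly. The helper below is a small sketch of doing so. The name `format_interpretations` is hypothetical, and it assumes each non-None value can be iterated to yield the top inputs, which the `Any` value type does not guarantee.

```python
from typing import Any, Mapping


def format_interpretations(interpretations: Mapping[int, Any]) -> list[str]:
    """Hypothetical helper: render one readable line per concept."""
    lines = []
    for concept_index in sorted(interpretations):
        top_inputs = interpretations[concept_index]
        if top_inputs is None:
            # The concept never activated on the provided inputs.
            lines.append(f"concept {concept_index}: (no activating input)")
        else:
            # Iterating a dict yields its keys, iterating a list yields its items.
            joined = ", ".join(str(item) for item in top_inputs)
            lines.append(f"concept {concept_index}: {joined}")
    return lines


report = format_interpretations({1: None, 0: ["good", "great"]})
```

Each entry of `report` can then be printed or logged alongside the concept index.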