TopKInputs or MaxAct¶
A generalization of the maximally activating inputs used in Towards Monosemanticity: Decomposing Language Models With Dictionary Learning by Bricken et al. (2023).
interpreto.concepts.interpretations.TopKInputs
¶
TopKInputs(*, model_with_split_points, concept_model, activation_granularity=WORD, source, split_point=None, k=5)
Bases: BaseConceptInterpretationMethod
Code concepts/interpretations/topk_inputs.py
Implementation of the Top-K Inputs concept interpretation method, also called MaxAct. It associates each concept with the inputs that activate it most strongly. This is the most natural way to interpret a concept, hence several papers have used it without describing it. Nonetheless, we can reference Bricken et al. (2023) 1 from Anthropic and their post on transformer-circuits.
- Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn*, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits, 2023. ↩
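The core idea behind Top-K Inputs / MaxAct can be sketched in a few lines. This is a minimal illustration only, not interpreto's actual implementation; the `top_k_inputs` helper, its signature, and the toy data are assumptions made for the example:

```python
# Illustrative sketch of MaxAct (NOT interpreto's implementation):
# for each concept (column), keep the k inputs with the highest activation.
def top_k_inputs(inputs, concept_activations, k=5):
    """Map each concept index to its k most activating inputs.

    concept_activations: one row per input, one activation value per concept.
    """
    n_concepts = len(concept_activations[0])
    interpretations = {}
    for concept_idx in range(n_concepts):
        # Rank input indices by their activation on this concept, highest first.
        ranked = sorted(
            range(len(inputs)),
            key=lambda i: concept_activations[i][concept_idx],
            reverse=True,
        )
        interpretations[concept_idx] = [inputs[i] for i in ranked[:k]]
    return interpretations

inputs = ["great movie", "terrible plot", "fine acting", "awful pacing"]
activations = [
    [0.9, 0.0],
    [0.1, 0.8],
    [0.7, 0.1],
    [0.0, 0.6],
]
print(top_k_inputs(inputs, activations, k=2))
# {0: ['great movie', 'fine acting'], 1: ['terrible plot', 'awful pacing']}
```

In interpreto itself the activations come from the concept model applied at a split point, but the ranking step is the same in spirit.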
Attributes:

| Name | Type | Description |
|---|---|---|
| `model_with_split_points` | `ModelWithSplitPoints` | The model with split points to use for the interpretation. |
| `split_point` | `str` | The split point to use for the interpretation. |
| `concept_model` | `ConceptModelProtocol` | The concept model to use for the interpretation. |
| `activation_granularity` | `ActivationGranularity` | The granularity at which the interpretation is computed. Allowed values are |
| `source` | `InterpretationSources` | TopKInputs always requires inputs and concept activations, but depending on which variables are already available, some of these activations may not need to be recomputed. The source specifies which activations the computation should start from. Supported sources are |
| `k` | `int` | The number of inputs to use for the interpretation. |
Examples:
>>> from datasets import load_dataset
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import NeuronsAsConcepts
>>> from interpreto.concepts.interpretations import TopKInputs
>>> from transformers import AutoModelForMaskedLM
>>> # load and split the model
>>> split = "bert.encoder.layer.1.output"
>>> model_with_split_points = ModelWithSplitPoints(
... "hf-internal-testing/tiny-random-bert",
... split_points=[split],
... model_autoclass=AutoModelForMaskedLM,
... batch_size=4,
... )
>>> # NeuronsAsConcepts do not need to be fitted
>>> concept_model = NeuronsAsConcepts(model_with_split_points=model_with_split_points, split_point=split)
>>> # extracting concept interpretations
>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")["train"]["text"]
>>> all_top_k_words = concept_model.interpret(
... interpretation_method=TopKInputs,
... activation_granularity=TopKInputs.activation_granularities.WORD,
... source=TopKInputs.sources.INPUTS,
... k=2,
... concepts_indices="all",
... inputs=dataset,
... )
Source code in interpreto/concepts/interpretations/topk_inputs.py
interpret
¶
interpret(concepts_indices, inputs=None, latent_activations=None, concepts_activations=None)
Give the interpretation of the concept dimensions in the latent space in a human-readable format. The interpretation is a mapping between concept indices and a list of inputs that allow interpreting them. The granularity of the input examples is determined by the `activation_granularity` class attribute. The returned inputs are the most activating inputs for each concept. The required arguments depend on the `source` class attribute. If all activations are zero, the corresponding concept interpretation is set to `None`.
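The zero-activation edge case can be illustrated with a small sketch. This is a hedged illustration of the documented behavior, not interpreto's code; the `interpret_concepts` helper and toy data are assumptions:

```python
# Illustrative sketch (NOT interpreto's implementation) of interpret():
# concepts whose activations are all zero map to None, others map to
# their k most activating inputs.
def interpret_concepts(concepts_indices, inputs, concepts_activations, k=5):
    out = {}
    for idx in concepts_indices:
        scores = [row[idx] for row in concepts_activations]
        if all(s == 0 for s in scores):
            out[idx] = None  # concept never activates: no interpretation
            continue
        ranked = sorted(range(len(inputs)), key=lambda i: scores[i], reverse=True)
        out[idx] = [inputs[i] for i in ranked[:k]]
    return out

inputs = ["a", "b", "c"]
acts = [[0.0, 1.0], [0.0, 0.5], [0.0, 2.0]]
print(interpret_concepts([0, 1], inputs, acts, k=2))
# {0: None, 1: ['c', 'a']}
```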
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `concepts_indices` | `int \| list[int]` | The indices of the concepts to interpret. | required |
| `inputs` | `list[str] \| None` | The inputs to use for the interpretation. Necessary if the source is not | `None` |
| `latent_activations` | `Float[Tensor, 'nl d'] \| None` | The latent activations to use for the interpretation. Necessary if the source is | `None` |
| `concepts_activations` | `Float[Tensor, 'nl cpt'] \| None` | The concepts activations to use for the interpretation. Necessary if the source is not | `None` |
Returns:

| Type | Description |
|---|---|
| `Mapping[int, Any]` | The interpretation of the concepts indices. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the arguments do not correspond to the specified source. |
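The source-dependent argument check can be sketched as follows. The source names and the `check_arguments` helper here are purely illustrative assumptions, not interpreto's actual enum values or validation code:

```python
# Illustrative sketch of source-dependent validation (hypothetical source
# names, NOT interpreto's actual values): each source requires the
# corresponding argument to be provided, otherwise a ValueError is raised.
def check_arguments(source, inputs=None, latent_activations=None, concepts_activations=None):
    if source == "inputs" and inputs is None:
        raise ValueError("source 'inputs' requires the `inputs` argument")
    if source == "latent_activations" and latent_activations is None:
        raise ValueError("source 'latent_activations' requires `latent_activations`")
    if source == "concepts_activations" and concepts_activations is None:
        raise ValueError("source 'concepts_activations' requires `concepts_activations`")

try:
    check_arguments("inputs")  # missing `inputs` for this source
except ValueError as e:
    print(e)
```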