Sparse Autoencoders (SAEs)¶
Abstract base class¶
interpreto.concepts.methods.SAEExplainer
¶
SAEExplainer(model_with_split_points, *, nb_concepts, split_point=None, encoder_module=None, dictionary_params=None, device='cpu', **kwargs)
Bases: ConceptAutoEncoderExplainer[SAE], Generic[_SAE_co]
Code: concepts/methods/overcomplete.py
Implementation of a concept explainer using an
overcomplete.sae.SAE variant as concept_model.
Attributes:

| Name | Type | Description |
|---|---|---|
| `model_with_split_points` | `ModelWithSplitPoints` | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. |
| `split_point` | `str \| None` | The split point used to train the `concept_model`. |
| `concept_model` | `SAE` | An Overcomplete SAE variant for concept extraction. |
| `is_fitted` | `bool` | Whether the `concept_model` was fitted on model activations. |
| `has_differentiable_concept_encoder` | `bool` | Whether the `encode_activations` operation is differentiable. |
| `has_differentiable_concept_decoder` | `bool` | Whether the `decode_concepts` operation is differentiable. |
Examples:
>>> import datasets
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> from interpreto import ModelWithSplitPoints
>>> from interpreto.concepts import VanillaSAE
>>> from interpreto.concepts.interpretations import TopKInputs
>>> CLS_TOKEN = ModelWithSplitPoints.activation_granularities.CLS_TOKEN
>>> WORD = ModelWithSplitPoints.activation_granularities.WORD
...
>>> dataset = datasets.load_dataset("stanfordnlp/imdb")["train"]["text"][:1000]
>>> repo_id = "Qwen/Qwen3-0.6B"
>>> model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained(repo_id)
...
>>> # 1. Split your model in two parts
>>> splitted_model = ModelWithSplitPoints(
...     model, tokenizer=tokenizer, split_points=[5],
... )
...
>>> # 2. Compute a dataset of activations
>>> activations = splitted_model.get_activations(
...     dataset, activation_granularity=WORD
... )
...
>>> # 3. Fit a concept model on the dataset
>>> explainer = VanillaSAE(splitted_model, nb_concepts=100, device="cuda")
>>> explainer.fit(activations, lr=1e-3, nb_epochs=20, batch_size=1024)
...
>>> # 4. Interpret the concepts
>>> interpreter = TopKInputs(
...     concept_explainer=explainer,
...     activation_granularity=WORD,
... )
>>> interpretations = interpreter.interpret(
...     inputs=dataset, latent_activations=activations
... )
...
>>> # Print the interpretations
>>> for concept_id, words in interpretations.items():
...     print(f"Concept {concept_id}: {list(words.keys()) if words else None}")
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_with_split_points` | `ModelWithSplitPoints` | The model to apply the explanation on. It should have at least one split point on which a concept explainer can be trained. | *required* |
| `nb_concepts` | `int` | Size of the SAE concept space. | *required* |
| `split_point` | `str \| None` | The split point used to train the `concept_model`. | `None` |
| `encoder_module` | `Module \| str \| None` | Encoder module to use to construct the SAE, see the Overcomplete SAE documentation. | `None` |
| `dictionary_params` | `dict \| None` | Dictionary parameters to use to construct the SAE, see the Overcomplete SAE documentation. | `None` |
| `device` | `device \| str` | Device to use for the `concept_model`. | `'cpu'` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the SAE constructor. | `{}` |
Source code in interpreto/concepts/methods/overcomplete.py
fit
¶
fit(activations, *, use_amp=False, batch_size=1024, criterion=MSELoss, optimizer_class=Adam, optimizer_kwargs={}, scheduler_class=None, scheduler_kwargs={}, lr=0.001, nb_epochs=20, clip_grad=None, monitoring=None, device=None, max_nan_fallbacks=5, overwrite=False)
Fit an Overcomplete SAE model on the given activations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `activations` | `Tensor \| dict[str, Tensor]` | The activations used for fitting the `concept_model`. | *required* |
| `use_amp` | `bool` | Whether to use automatic mixed precision for fitting. | `False` |
| `batch_size` | `int` | Batch size for the training of the `concept_model`. | `1024` |
| `criterion` | `SAELoss` | Loss criterion for the training of the `concept_model`. | `MSELoss` |
| `optimizer_class` | `type[Optimizer]` | Optimizer for the training of the `concept_model`. | `Adam` |
| `optimizer_kwargs` | `dict` | Keyword arguments to pass to the optimizer. | `{}` |
| `scheduler_class` | `type[LRScheduler] \| None` | Learning rate scheduler for the training of the `concept_model`. | `None` |
| `scheduler_kwargs` | `dict` | Keyword arguments to pass to the scheduler. | `{}` |
| `lr` | `float` | Learning rate for the training of the `concept_model`. | `0.001` |
| `nb_epochs` | `int` | Number of epochs for the training of the `concept_model`. | `20` |
| `clip_grad` | `float \| None` | Gradient clipping value for the training of the `concept_model`. | `None` |
| `monitoring` | `int \| None` | Monitoring frequency for the training of the `concept_model`. | `None` |
| `device` | `device \| str` | Device to use for the training of the `concept_model`. | `None` |
| `max_nan_fallbacks` | `int \| None` | Maximum number of fallbacks to use when NaNs are encountered during training. Ignored if `use_amp` is `False`. | `5` |
| `overwrite` | `bool` | Whether to overwrite the current model if it has already been fitted. | `False` |
Returns:

| Type | Description |
|---|---|
| `dict` | A dictionary with training history logs. |
Source code in interpreto/concepts/methods/overcomplete.py
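For reference, here is a hedged sketch of a customized training run, reusing the `explainer` and `activations` objects from the class-level example above; the optimizer and clipping settings shown are illustrative choices, not defaults:

>>> from torch.optim import Adam
>>> history = explainer.fit(
...     activations,
...     optimizer_class=Adam,
...     optimizer_kwargs={"weight_decay": 1e-5},  # illustrative extra optimizer setting
...     lr=1e-3,
...     nb_epochs=20,
...     batch_size=1024,
...     clip_grad=1.0,  # clip gradients during training
...     device="cuda",
... )
>>> # `history` is a dictionary of training history logs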
encode_activations
¶
encode_activations(activations)
Encode the given activations using the concept_model encoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `activations` | `Tensor` | The activations to encode. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | The encoded concept activations. |
Source code in interpreto/concepts/methods/overcomplete.py
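A short sketch, assuming `latent_activations` is an `(n, d_model)` activation tensor extracted at the explainer's split point (for example, one entry of the dictionary returned by `ModelWithSplitPoints.get_activations`):

>>> concept_activations = explainer.encode_activations(latent_activations)
>>> # concept_activations has one column per concept, i.e. shape (n, nb_concepts)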
decode_concepts
¶
decode_concepts(concepts)
Decode the given concepts using the concept_model decoder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `concepts` | `Tensor` | The concepts to decode. | *required* |

Returns:

| Type | Description |
|---|---|
| `Tensor` | The decoded concept activations. |
Source code in interpreto/concepts/methods/overcomplete.py
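Continuing the encoding sketch above, decoding maps concept activations back to the model's latent space, yielding the SAE reconstruction of the original activations:

>>> reconstructed = explainer.decode_concepts(concept_activations)
>>> # reconstructed has the same shape as the activations passed to encode_activations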
get_dictionary
¶
Get the dictionary learned by the fitted concept_model.
Returns:

| Type | Description |
|---|---|
| `Tensor` | The dictionary learned by the fitted `concept_model`, with one row per concept. |
Source code in interpreto/concepts/base.py
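A minimal sketch; the orientation of the returned tensor (one concept direction per row) follows the usual Overcomplete dictionary convention and is an assumption here:

>>> dictionary = explainer.get_dictionary()
>>> # assumed shape: (nb_concepts, d_model), one dictionary row per concept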
interpret
¶
Deprecated API for concept interpretation.
Interpretation methods should now be instantiated directly with the fitted concept explainer. For example:
TopKInputs(concept_explainer).interpret(inputs, latent_activations)
This method is kept only for backwards compatibility and will always
raise a NotImplementedError.
Source code in interpreto/concepts/base.py
concept_output_gradient
¶
concept_output_gradient(inputs, targets=None, split_point=None, activation_granularity=TOKEN, aggregation_strategy=MEAN, concepts_x_gradients=True, normalization=True, tqdm_bar=False, batch_size=None)
Compute the gradients of the predictions with respect to the concepts.
To clarify what this function does, let us fix some notation. Suppose the initial model was split such that \(f = g \circ h\). The concept model was therefore fitted on \(A = h(X)\), with \(X\) a dataset of samples. The resulting concept model encoder and decoder are denoted \(t\) and \(t^{-1}\); \(t\) can be seen as a projection from the latent space to the concept space. Hence, the function going from the inputs to the concepts is \(f_{ic} = t \circ h\), and the function going from the concepts to the outputs is \(f_{co} = g \circ t^{-1}\).
Given a set of samples \(X\) and the functions \((h, t, t^{-1}, g)\), this method first computes \(C = t(A) = t \circ h(X)\), then returns \(\nabla f_{co}(C)\).
In practice, all computations are done by ModelWithSplitPoints._get_concept_output_gradients,
which relies on NNsight. The current method only forwards \(t\) and \(t^{-1}\),
i.e. the self.encode_activations and self.decode_concepts methods, respectively.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `list[str] \| Tensor \| BatchEncoding` | The input data, either a list of samples, the tokenized input, or a batch of samples. | *required* |
| `targets` | `list[int] \| None` | Specify which outputs of the model should be used to compute the gradients. Note that \(f_{co}\) often has several outputs; by default, gradients are computed for each output. | `None` |
| `split_point` | `str \| None` | The split point used to train the `concept_model`. | `None` |
| `activation_granularity` | `ActivationGranularity` | The granularity of the activations to use for the attribution. It is highly recommended to use the same granularity as the one used to compute the fitting activations. | `TOKEN` |
| `aggregation_strategy` | `GranularityAggregationStrategy` | Strategy to aggregate token activations into larger input granularities. Applied for granularities coarser than tokens. | `MEAN` |
| `concepts_x_gradients` | `bool` | Whether the resulting gradients should be multiplied by the concept activations. `True` by default (similarly to attributions), because of its mathematical properties; the output is then \(C * \nabla f_{co}(C)\). | `True` |
| `normalization` | `bool` | Whether to normalize the gradients. Gradients are normalized over the concept (c) and sequence-length (g) dimensions, so that for a given sample-target-granularity triple, the sum of the absolute values of the gradients equals 1. The granular elements depend on the `activation_granularity` argument. | `True` |
| `tqdm_bar` | `bool` | Whether to display a progress bar. | `False` |
| `batch_size` | `int \| None` | Batch size for the model. It may differ from the batch size used for fitting. | `None` |
Returns:

| Type | Description |
|---|---|
| `list[Float[Tensor, 't g c']]` | The gradients of the model output with respect to the concept activations. The list length corresponds to the number of inputs. Each tensor has shape (t, g, c), with t the target dimension, g the number of granularity elements in one input, and c the number of concepts. |
Source code in interpreto/concepts/base.py
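An illustrative sketch, reusing `explainer`, `dataset`, and `WORD` from the class-level example above (the slice size is arbitrary):

>>> gradients = explainer.concept_output_gradient(
...     inputs=dataset[:10],
...     activation_granularity=WORD,  # match the granularity of the fitting activations
...     tqdm_bar=True,
... )
>>> len(gradients)  # one (t, g, c) tensor per input
10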
List of available SAEs¶
interpreto.concepts.methods.BatchTopKSAEConcepts
¶
BatchTopKSAEConcepts(model_with_split_points, *, nb_concepts, split_point=None, encoder_module=None, dictionary_params=None, device='cpu', **kwargs)
Bases: SAEExplainer[BatchTopKSAE]
Code: concepts/methods/overcomplete.py
ConceptAutoEncoderExplainer with the BatchTopK SAE from Bussmann et al. (2024)1 as concept model.
BatchTopK SAE implementation from overcomplete.sae.BatchTopKSAE class.
1. Bussmann, B., Leask, P., Nanda, N. BatchTopK Sparse Autoencoders. arXiv preprint, 2024. ↩
interpreto.concepts.methods.JumpReLUSAEConcepts
¶
JumpReLUSAEConcepts(model_with_split_points, *, nb_concepts, split_point=None, encoder_module=None, dictionary_params=None, device='cpu', **kwargs)
Bases: SAEExplainer[JumpSAE]
Code: concepts/methods/overcomplete.py
ConceptAutoEncoderExplainer with the JumpReLU SAE from Rajamanoharan et al. (2024)1 as concept model.
JumpReLU SAE implementation from the overcomplete.sae.JumpSAE class.
1. Rajamanoharan, S. et al. Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders. arXiv preprint, 2024. ↩
interpreto.concepts.methods.MpSAEConcepts
¶
MpSAEConcepts(model_with_split_points, *, nb_concepts, split_point=None, encoder_module=None, dictionary_params=None, device='cpu', **kwargs)
Bases: SAEExplainer[MpSAE]
Code: concepts/methods/overcomplete.py
ConceptAutoEncoderExplainer with the MpSAE from Costa et al. (2025)1 as concept model.
Matching Pursuit SAE implementation from overcomplete.sae.MpSAE class.
1. Costa, V., Fel, T., Lubana, E. S., Tolooshams, B., Ba, D. From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit. arXiv preprint arXiv:2506.03093, 2025. ↩
interpreto.concepts.methods.TopKSAEConcepts
¶
TopKSAEConcepts(model_with_split_points, *, nb_concepts, split_point=None, encoder_module=None, dictionary_params=None, device='cpu', **kwargs)
Bases: SAEExplainer[TopKSAE]
Code: concepts/methods/overcomplete.py
ConceptAutoEncoderExplainer with the TopK SAE from Gao et al. (2024)1 as concept model.
TopK SAE implementation from overcomplete.sae.TopKSAE class.
1. Gao, L. et al. Scaling and evaluating sparse autoencoders. The Thirteenth International Conference on Learning Representations, 2025. ↩
interpreto.concepts.methods.VanillaSAEConcepts
¶
VanillaSAEConcepts(model_with_split_points, *, nb_concepts, split_point=None, encoder_module=None, dictionary_params=None, device='cpu', **kwargs)
Bases: SAEExplainer[SAE]
Code: concepts/methods/overcomplete.py
ConceptAutoEncoderExplainer with the Vanilla SAE from Cunningham et al. (2023)1 and Bricken et al. (2023)2 as concept model.
Vanilla SAE implementation from overcomplete.sae.SAE class.
1. Huben, R., Cunningham, H., Smith, L. R., Ewart, A., Sharkey, L. Sparse Autoencoders Find Highly Interpretable Features in Language Models. The Twelfth International Conference on Learning Representations, 2024. ↩
2. Bricken, T. et al. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread, 2023. ↩
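All of the variants above share the constructor signature of SAEExplainer, so swapping one SAE for another is a one-line change. A minimal sketch, reusing splitted_model and activations from the class-level example:

>>> from interpreto.concepts.methods import TopKSAEConcepts
>>> explainer = TopKSAEConcepts(splitted_model, nb_concepts=100, device="cuda")
>>> explainer.fit(activations)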
Loss Functions¶
These functions can be passed as the criterion argument in the fit method of the SAEExplainer class. MSELoss is the default loss function.
interpreto.concepts.methods.SAELossClasses
¶
Bases: Enum
Enumeration of possible loss functions for SAEs.
To pass as the criterion parameter of SAEExplainer.fit().
Attributes:

| Name | Type | Description |
|---|---|---|
| `MSE` | `type[SAELoss]` | Mean Squared Error loss. |
| `DeadNeuronsReanimation` | `type[SAELoss]` | Loss function promoting the reanimation of dead neurons. |
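For instance, a minimal sketch of fitting with the dead-neurons reanimation loss, reusing explainer and activations from the class-level example; the sketch passes the enum member directly, as suggested above (if fit expects the underlying loss class instead, pass SAELossClasses.DeadNeuronsReanimation.value):

>>> from interpreto.concepts.methods import SAELossClasses
>>> explainer.fit(
...     activations,
...     criterion=SAELossClasses.DeadNeuronsReanimation,  # promotes reanimation of dead neurons
... )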