Integrated Gradients

Bases: MultitaskExplainerMixin, AttributionExplainer

Integrated Gradients (IG) is a gradient-based interpretability method that attributes importance scores to input features (e.g., tokens) by integrating the model’s gradients along a path from a baseline input to the actual input.

The method is designed to address some of the limitations of standard gradients, such as saturation and noise, by averaging gradients over interpolated inputs rather than relying on a single local gradient.
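Formally (from the cited paper), the attribution for the i-th feature of an input x, relative to a baseline x′, is the gradient of the model F integrated along the straight line from x′ to x:

$$
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i} \, d\alpha
$$

In practice, this integral is approximated by a Riemann sum over `n_interpolations` evenly spaced steps.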

Reference: Sundararajan et al. (2017). Axiomatic Attribution for Deep Networks. https://arxiv.org/abs/1703.01365

Examples:

>>> from interpreto import IntegratedGradients
>>> method = IntegratedGradients(model=model, tokenizer=tokenizer,
...                              batch_size=4, n_interpolations=50)
>>> explanations = method.explain(model_inputs=text)
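To make the underlying computation concrete, here is a minimal, self-contained sketch of the integrated-gradients estimate in plain PyTorch. It is independent of interpreto's API; `forward_fn`, the linear toy model, and all shapes are illustrative assumptions, not the library's implementation.

```python
import torch

def integrated_gradients(forward_fn, inputs, baseline, n_steps=50):
    """Riemann-sum approximation of the IG path integral (illustrative sketch)."""
    # Interpolate along the straight line from the baseline to the input.
    alphas = torch.linspace(0.0, 1.0, n_steps).view(-1, *([1] * inputs.dim()))
    path = (baseline + alphas * (inputs - baseline)).requires_grad_(True)
    # Gradient of the model's scalar score at every interpolation step.
    grads = torch.autograd.grad(forward_fn(path).sum(), path)[0]
    # Average the gradients and rescale by the input-baseline difference.
    return (inputs - baseline) * grads.mean(dim=0)

# Toy check: for a linear score, IG recovers w * x exactly (completeness axiom).
torch.manual_seed(0)
w, x = torch.randn(5), torch.randn(5)
attributions = integrated_gradients(lambda z: z @ w, x, baseline=torch.zeros(5))
assert torch.allclose(attributions, w * x, atol=1e-5)
```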

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `PreTrainedModel` | Model to explain. | *required* |
| `tokenizer` | `PreTrainedTokenizer` | Hugging Face tokenizer associated with the model. | *required* |
| `batch_size` | `int` | Batch size for the attribution method. | `4` |
| `device` | `device` | Device on which the attribution method will be run. | `None` |
| `inference_mode` | `Callable[[Tensor], Tensor]` | The mode used for inference: one of `LOGITS`, `SOFTMAX`, or `LOG_SOFTMAX`. Use `InferenceModes` to choose the appropriate mode. | `LOGITS` |
| `n_interpolations` | `int` | Number of interpolated inputs to generate between the baseline and the input. | `10` |
| `baseline` | `Tensor \| float \| None` | Baseline to use for the interpolations. | `None` |
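As a usage sketch combining these options: the constructor arguments below are taken from the table above, but the import path for `InferenceModes` and the exact semantics of a float baseline (e.g., whether `0.0` denotes zero embeddings) are assumptions, not confirmed by this page.

>>> from interpreto import IntegratedGradients
>>> from interpreto import InferenceModes  # assumed import path
>>> method = IntegratedGradients(model=model, tokenizer=tokenizer,
...                              inference_mode=InferenceModes.SOFTMAX,
...                              n_interpolations=20, baseline=0.0)
>>> explanations = method.explain(model_inputs=text)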