Integrated Gradients
Bases: MultitaskExplainerMixin, AttributionExplainer
Integrated Gradients (IG) is a gradient-based interpretability method that attributes importance scores to input features (e.g., tokens) by integrating the model’s gradients along a path from a baseline input to the actual input.
The method is designed to address some of the limitations of standard gradients, such as saturation and noise, by averaging gradients over interpolated inputs rather than relying on a single local gradient.
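Formally, for a model F, input x, baseline x′, and feature i, IG integrates the gradient of F along the straight-line path from x′ to x; in practice the integral is approximated by a Riemann sum over n interpolation steps:

$$
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\bigl(x' + \alpha (x - x')\bigr)}{\partial x_i}\, d\alpha
\;\approx\; \frac{x_i - x'_i}{n} \sum_{k=1}^{n} \frac{\partial F\bigl(x' + \tfrac{k}{n}(x - x')\bigr)}{\partial x_i}
$$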
Reference: Sundararajan et al. (2017). Axiomatic Attribution for Deep Networks. [Paper](https://arxiv.org/abs/1703.01365)
Examples:

```python
>>> from interpreto import IntegratedGradients
>>> method = IntegratedGradients(
...     model=model, tokenizer=tokenizer, batch_size=4, n_perturbations=50
... )
>>> explanations = method.explain(model_inputs=text)
```
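The example assumes `model`, `tokenizer`, and `text` are already defined. A minimal setup could look like the following sketch; the checkpoint name is an arbitrary example, and any Hugging Face sequence-classification model works:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any sequence-classification checkpoint works; this one is just an example.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "Integrated Gradients attributes this prediction to each input token."
```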
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model` | `PreTrainedModel` | model to explain | *required* |
| `tokenizer` | `PreTrainedTokenizer` | Hugging Face tokenizer associated with the model | *required* |
| `batch_size` | `int` | batch size for the attribution method | `4` |
| `granularity` | `Granularity` | the level of granularity for the explanation; see the `Granularity` enum for the available options | `WORD` |
| `granularity_aggregation_strategy` | `GranularityAggregationStrategy` | how to aggregate token-level attributions into granularity scores; options are `MEAN`, `MAX`, `MIN`, `SUM`, and `SIGNED_MAX`; ignored for token-level granularity, where no aggregation is needed | `MEAN` |
| `device` | `device` | device on which the attribution method will be run | `None` |
| `inference_mode` | `Callable[[Tensor], Tensor]` | the mode used for inference; one of `LOGITS`, `SOFTMAX`, or `LOG_SOFTMAX`; use `InferenceModes` to choose the appropriate mode | `LOGITS` |
| `input_x_gradient` | `bool` | if `True`, multiplies the input embeddings with their gradients before aggregation | `True` |
| `n_interpolations` | `int` | the number of interpolations to generate | `10` |
| `baseline` | `Tensor \| float \| None` | the baseline to use for the interpolations | `None` |