
Probing a Sentiment Classifier for Linguistic Properties
In this tutorial we demonstrate supervised probing with interpreto.concepts.probes.
The idea is simple: a model trained for sentiment analysis might also encode other linguistic properties in its hidden representations, such as sentence length, presence of negation, or dominant tense. Probes let us test this hypothesis by training lightweight classifiers on the model's internal activations.
This approach is inspired by the probing literature (Conneau et al. 2018, "What you can cram into a single $&!#* vector").
Steps:
- Setup: Load the model and split it
- Data: Load IMDB and compute linguistic labels
- Activations: Extract CLS-token representations
- Fit probes: Train several probe types
- Evaluate: Compare probe performance
- Discussion
Author: Antonin Poché
1. Setup: Load the model and split it
We use a DistilBERT model fine-tuned for binary sentiment classification on IMDB.
We wrap it with SplitSequenceClassification, which splits the model at the classification head and extracts [CLS] token activations.
> Note
>
> The model was trained to predict sentiment. We will probe whether its internal representations also encode other linguistic properties that were never part of its training objective.
2. Data: Load IMDB and compute linguistic labels
We load a subset of the IMDB test set (1000 samples), the same domain the model was trained on.
We then compute three binary linguistic labels from the text itself:
| Concept | Definition |
|---|---|
| Length | Is the review at least as long as the median length (in whitespace-separated words)? |
| Negation | Does the text contain negation words (not, never, no, don't, ...)? |
| Past tense | Do past-tense marker words (was, were, had, ...) outnumber present-tense ones (is, are, has, ...)? |
> Tip
>
> These labels are computed from the raw text, not from the model's predictions. The point of probing is to test what other information the model's representations encode beyond its training objective.
import re

import numpy as np
import torch

# `texts` is assumed to be the list of raw review strings loaded earlier.

# --- Concept 1: Text length (long vs short) ---
lengths = [len(text.split()) for text in texts]
median_length = np.median(lengths)
is_long = [1.0 if n_words >= median_length else 0.0 for n_words in lengths]

# --- Concept 2: Contains negation ---
NEGATION_WORDS = {
    "not",
    "no",
    "never",
    "neither",
    "nobody",
    "nothing",
    "nowhere",
    "nor",
    "don't",
    "doesn't",
    "didn't",
    "won't",
    "wouldn't",
    "shouldn't",
    "couldn't",
    "isn't",
    "aren't",
    "wasn't",
    "weren't",
    "hasn't",
    "haven't",
    "hadn't",
}


def has_negation(text: str) -> float:
    # Keep apostrophes inside tokens so contractions such as "don't" match.
    words = set(re.findall(r"\b\w[\w']*\b", text.lower()))
    return 1.0 if words & NEGATION_WORDS else 0.0


contains_negation = [has_negation(text) for text in texts]

# --- Concept 3: Past tense dominant ---
PAST_MARKERS = {"was", "were", "had", "did", "been", "went", "said", "told", "made", "got"}
PRESENT_MARKERS = {"is", "are", "has", "does", "do", "goes", "says", "tells", "makes", "gets"}


def is_past_tense(text: str) -> float:
    # Count marker words for each tense; ties are labeled as not past.
    words = re.findall(r"\b\w+\b", text.lower())
    past_count = sum(1 for w in words if w in PAST_MARKERS)
    present_count = sum(1 for w in words if w in PRESENT_MARKERS)
    return 1.0 if past_count > present_count else 0.0


past_tense = [is_past_tense(text) for text in texts]

# --- Combine into a multi-label tensor (n, 3) ---
labels = torch.tensor(
    list(zip(is_long, contains_negation, past_tense, strict=True)),
    dtype=torch.float32,
)
CONCEPT_NAMES = ["long_text", "contains_negation", "past_tense"]

print(f"Labels shape: {labels.shape}")
print(f"Label prevalence: {labels.mean(dim=0).tolist()}")
print(f" - Long text: {labels[:, 0].mean():.2%}")
print(f" - Contains negation: {labels[:, 1].mean():.2%}")
print(f" - Past tense: {labels[:, 2].mean():.2%}")
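To get a feel for what these heuristics mark as positive, the sketch below applies simplified copies of the negation and tense rules to a few made-up sentences. The sentences and the shortened word lists are illustrative assumptions, not part of the IMDB data.

```python
import re

# Shortened word lists for illustration only; the tutorial's full sets are larger.
NEGATION_WORDS = {"not", "no", "never", "don't", "didn't", "wasn't"}
PAST_MARKERS = {"was", "were", "had", "did"}
PRESENT_MARKERS = {"is", "are", "has", "does"}


def has_negation(text: str) -> float:
    # Keep apostrophes inside tokens so contractions such as "didn't" survive.
    words = set(re.findall(r"\b\w[\w']*\b", text.lower()))
    return 1.0 if words & NEGATION_WORDS else 0.0


def is_past_tense(text: str) -> float:
    # Compare counts of past and present marker words.
    words = re.findall(r"\b\w+\b", text.lower())
    past = sum(w in PAST_MARKERS for w in words)
    present = sum(w in PRESENT_MARKERS for w in words)
    return 1.0 if past > present else 0.0


# Hypothetical example sentences.
toy_texts = [
    "The plot was thin and the actors were bored.",
    "This film is a delight and the cast is superb.",
    "I didn't expect much, but it was great.",
]

toy_labels = [(has_negation(t), is_past_tense(t)) for t in toy_texts]
for text, (neg, past) in zip(toy_texts, toy_labels):
    print(f"negation={neg:.0f} past={past:.0f} | {text}")
```

One detail worth noticing: the tense rule tokenizes without apostrophes, so "didn't" is split into "didn" and "t" and does not count as a past marker. The third sentence is labeled past only because of "was".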
3. Activations: Extract CLS-token representations
We extract the [CLS] token activation for each sample. This is the vector that feeds the classification head: a single 768-dimensional vector summarizing the entire input.
4. Fit probes: Train several probe types
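As a rough picture of what "CLS-token activation" means, the sketch below selects position 0 from a batch of per-token hidden states. The random array is a stand-in for real DistilBERT outputs, and the batch size and sequence length are arbitrary assumptions; in the tutorial itself this extraction is handled by the split model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for last-layer hidden states: (batch, sequence_length, hidden_size).
# Real values would come from the model; these are random placeholders.
batch_size, seq_len, hidden_size = 4, 12, 768
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# The [CLS] token is the first token of every sequence, so its
# representation is the slice at position 0 along the sequence axis.
cls_activations = hidden_states[:, 0, :]

print(cls_activations.shape)  # (4, 768)
```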
We split the data into train/test, then train three different probes:
| Probe | Description |
|---|---|
| LinearRegressionProbe | Ridge regression (closed-form) |
| CosineCentroidProbe | Cosine similarity to class centroids |
| LogisticRegressionProbe | Gradient-descent logistic regression |
All probes learn a mapping from activations to concept scores: (n, d) → (n, c), where c = 3 (our three linguistic concepts).
> Note
>
> We wrap each probe in a ProbeExplainer which connects it to the split model. The explainer handles activation format validation and device management.
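To make the first two probe types concrete, here is a minimal NumPy sketch of a closed-form ridge probe and a cosine-centroid score on synthetic data. It is not interpreto's implementation: the function names `ridge_probe_fit` and `cosine_centroid_scores`, the synthetic activations, and the exact scoring rule (cosine to the positive centroid minus cosine to the negative one) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 400 "activations" of dimension 16 and one binary
# concept that depends on the first coordinate only.
n, d, n_train = 400, 16, 300
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)


def ridge_probe_fit(X, Y, l2=1e-3):
    """Closed-form ridge regression: W = (X^T X + l2 * I)^-1 X^T Y."""
    gram = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def cosine_centroid_scores(X_train, y_train, X_test):
    """Cosine to the positive centroid minus cosine to the negative centroid."""

    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    pos = unit(X_train[y_train == 1].mean(axis=0))
    neg = unit(X_train[y_train == 0].mean(axis=0))
    X_unit = unit(X_test)
    return X_unit @ pos - X_unit @ neg


# Center the targets so no intercept is needed for this zero-mean toy data.
W = ridge_probe_fit(X[:n_train], y[:n_train, None] - 0.5)
ridge_scores = (X[n_train:] @ W).ravel()
centroid_scores = cosine_centroid_scores(X[:n_train], y[:n_train], X[n_train:])

# Held-out accuracy when thresholding each score at zero.
ridge_acc = np.mean((ridge_scores > 0) == (y[n_train:] == 1))
centroid_acc = np.mean((centroid_scores > 0) == (y[n_train:] == 1))
print(f"ridge accuracy: {ridge_acc:.2f}, centroid accuracy: {centroid_acc:.2f}")
```

Both probes are linear in the activations; the ridge probe solves a regularized least-squares problem in one step, while the centroid probe only needs the two class means.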
from sklearn.model_selection import train_test_split
# Train/test split (80/20)
indices = list(range(len(texts)))
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=42)
train_activations = activations[train_idx]
test_activations = activations[test_idx]
train_labels = labels[train_idx]
test_labels = labels[test_idx]
print(f"Train: {train_activations.shape[0]} samples")
print(f"Test: {test_activations.shape[0]} samples")
from interpreto.concepts.probes import (
    CosineCentroidProbe,
    LinearRegressionProbe,
    LogisticRegressionProbe,
    ProbeExplainer,
    Standardization,
)

# Define probe configurations
probe_configs = {
    "LinearRegression (ridge)": LinearRegressionProbe(l2=1e-3),
    "CosineCentroid": CosineCentroidProbe(normalization=Standardization()),
    "LogisticRegression": LogisticRegressionProbe(lr=1e-2, max_iter=200),
}

# Train each probe
explainers = {}
for name, probe in probe_configs.items():
    explainer = ProbeExplainer(splitter, probe)
    explainer.fit(train_activations, train_labels)
    explainers[name] = explainer
    print(f"{name} fitted (is_fitted={explainer.is_fitted})")
5. Evaluate: Compare probe performance
We evaluate each probe on the held-out test set using AUROC (Area Under the ROC Curve), which measures how well the probe's scores separate positive from negative samples for each concept.
An AUROC well above 0.5 (chance level) indicates that the probe can read that linguistic property out of the model's representations.
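AUROC can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on four made-up scores; `pairwise_auroc` is a hypothetical helper written for illustration, and the tutorial itself relies on scikit-learn's `roc_auc_score`.

```python
def pairwise_auroc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return float("nan")  # undefined when only one class is present
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: three of the four positive/negative pairs
# are ordered correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auroc(y_true, y_score))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for a demonstration but is one reason to prefer a library implementation on real data.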
from sklearn.metrics import roc_auc_score

results = {}
for name, explainer in explainers.items():
    # Get concept scores on the test set
    scores = explainer.activations_to_concepts(test_activations)  # (n_test, 3)

    # Compute AUROC per concept
    aurocs = []
    for c in range(labels.shape[1]):
        y_true = test_labels[:, c].numpy()
        y_score = scores[:, c].detach().cpu().numpy()
        # Only compute if both classes are present
        if len(set(y_true)) > 1:
            aurocs.append(roc_auc_score(y_true, y_score))
        else:
            aurocs.append(float("nan"))
    results[name] = aurocs

# Display results table
print(f"{'Probe':<30} {'Long text':<12} {'Negation':<12} {'Past tense':<12} {'Mean':<8}")
print("-" * 74)
for name, aurocs in results.items():
    mean_auroc = np.nanmean(aurocs)
    print(f"{name:<30} {aurocs[0]:<12.3f} {aurocs[1]:<12.3f} {aurocs[2]:<12.3f} {mean_auroc:<8.3f}")
import matplotlib.pyplot as plt
# Visualize results
fig, ax = plt.subplots(figsize=(8, 4))
x = np.arange(len(CONCEPT_NAMES))
width = 0.25
for i, (name, aurocs) in enumerate(results.items()):
    ax.bar(x + i * width, aurocs, width, label=name)
ax.axhline(y=0.5, color="gray", linestyle="--", label="Chance level")
ax.set_xlabel("Concept")
ax.set_ylabel("AUROC")
ax.set_title("Probe Performance: Linguistic Properties in Sentiment Representations")
ax.set_xticks(x + width)
ax.set_xticklabels(CONCEPT_NAMES)
ax.legend(loc="lower right")
ax.set_ylim(0.0, 1.0)
plt.tight_layout()
plt.show()
6. Discussion
What do the results tell us?
- High AUROC for negation: The sentiment model strongly encodes negation. This makes sense, as negation flips sentiment polarity ("good" vs "not good").
- Moderate AUROC for length: Text length is partially encoded, possibly because longer reviews tend to be more nuanced.
- Variable AUROC for tense: Past tense encoding depends on the model and dataset. Movie reviews are typically written in the past tense ("the movie was..."), so this label may be less discriminative.
Probing caveats
A high probe score does not prove the model uses that information for its task, only that the information is recoverable from the representations. See Hewitt & Liang (2019) for a discussion of probe expressivity and selectivity.
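A simple sanity check in this spirit is to refit the same probe on shuffled labels and compare held-out performance. Note that this is a much cruder stand-in than the control tasks of Hewitt & Liang, which tie control labels to word identity; a plain permutation only gives a chance-level reference. The sketch below uses synthetic activations, and `linear_probe_accuracy` is a hypothetical helper, not part of interpreto.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations where the concept is linearly readable from one coordinate.
n_train, n_test, d = 400, 200, 16
X = rng.normal(size=(n_train + n_test, d))
y_real = (X[:, 0] > 0).astype(float)

# Shuffled labels: same class balance, but unrelated to the activations.
y_control = rng.permutation(y_real)


def linear_probe_accuracy(X, y, n_train, l2=1e-3):
    """Fit a ridge probe on the first n_train rows, report held-out accuracy."""
    X_tr, X_te = X[:n_train], X[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    gram = X_tr.T @ X_tr + l2 * np.eye(X.shape[1])
    w = np.linalg.solve(gram, X_tr.T @ (y_tr - 0.5))
    return float(np.mean((X_te @ w > 0) == (y_te == 1)))


real_acc = linear_probe_accuracy(X, y_real, n_train)
control_acc = linear_probe_accuracy(X, y_control, n_train)
gap = real_acc - control_acc
print(f"real={real_acc:.2f} shuffled={control_acc:.2f} gap={gap:.2f}")
```

A large gap says the probe's score on the real labels is not an artifact of the evaluation setup; it still says nothing about whether the model relies on the property.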
Next steps
- Try probing at different layers to see where each property is encoded (earlier layers for syntax, later for semantics).
- Use MeansDiffProbe as a minimal baseline (Fisher's discriminant direction).
- Combine probes with concept attribution to see which input activates the concept.
References
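For intuition about a difference-of-means baseline, the sketch below builds the direction between two class means on synthetic data and classifies by projecting onto it. This is an assumption about what such a probe computes, not a description of interpreto's MeansDiffProbe. The plain mean difference coincides with Fisher's discriminant direction when the within-class covariance is isotropic, as it is in this toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes whose means differ along the first coordinate.
d = 8
shift = np.zeros(d)
shift[0] = 2.0
X_pos = rng.normal(size=(100, d)) + shift
X_neg = rng.normal(size=(100, d))

# Difference-of-means direction, normalized to unit length.
direction = X_pos.mean(axis=0) - X_neg.mean(axis=0)
direction /= np.linalg.norm(direction)

# Project onto the direction and threshold at the midpoint between class means.
midpoint = 0.5 * (X_pos.mean(axis=0) + X_neg.mean(axis=0)) @ direction
accuracy = 0.5 * (
    np.mean(X_pos @ direction > midpoint) + np.mean(X_neg @ direction <= midpoint)
)
print(f"accuracy on the synthetic data: {accuracy:.2f}")
```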
- Conneau et al. (2018). What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. ACL.
- Hewitt & Liang (2019). Designing and interpreting probes with control tasks. EMNLP.