Probing Classifiers: Promises, Shortcomings, and Advances

🎯 Pitch

This paper provides a clear, unified formalization and critical survey of the probing-classifier paradigm, identifying key failure modes (missing controls, probe-expressivity confounds, and correlation-vs-causation gaps) and summarizing methodological fixes and advances. By showing when probe accuracy is and isn’t evidence that a model encodes or uses a linguistic property, it equips researchers to draw more reliable interpretations from model analyses and to design probes that yield meaningful, actionable insights.

1. Executive Summary

This paper clarifies what “probing classifiers” actually measure when we train a classifier to predict a linguistic property from a neural model’s internal representations, and why naive interpretations of probe accuracy are often invalid. It contributes a unified formalization of the probing setup (Figure 1; Section 2) and surveys key methodological failure modes—especially missing baselines/controls, probe expressivity confounds, and correlation-vs-causation gaps—along with advances that partially address them (Section 4).

2. Context and Motivation

What specific problem or gap does this paper address?
Probing is widely used to “interpret” NLP models by testing whether internal representations encode linguistic properties (Sections 1–3).
The gap is that probe results are easy to misinterpret, because “high probe accuracy” can arise from factors unrelated to the original model truly representing/using the probed property (Section 4).
Why is this problem important (real-world impact, theoretical significance, or both)?
Practically: researchers use probes to justify claims like “model layer $l$ learns syntax” or to guide architecture decisions; misleading probe conclusions can misdirect model design and scientific understanding (Section 5).
Scientifically: probing is used as evidence about what neural representations contain and how models behave; flawed inference undermines interpretability claims (Section 4.3).
What prior approaches existed, and where do they fall short?
Prior work used probe accuracy (sometimes with simple baselines) as a proxy for representation “quality,” “readability,” or “extractability” (Section 3), but these terms are often left informal and conflate multiple notions.
Critiques highlighted shortcomings in:
- Comparative baselines and controls (Section 4.1),
- Probe choice and expressivity (Section 4.2),
- Correlation vs. causal use by the model (Section 4.3),
- Dataset-vs-task confounds (Section 4.4),
- Dependence on pre-defined properties (Section 4.5).
How does this paper position itself relative to existing work?
It is a critical review / roadmap (“Squib”) rather than a new probing method or large empirical study.
It formalizes components in a single framework (Figure 1a–b; Sections 2 and 4) and organizes existing critiques and proposed fixes into a coherent set of design questions for probing experiments.

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The “system” is a measurement protocol: train an auxiliary classifier (“probe”) on a model’s internal vectors to predict a linguistic label, and interpret the probe’s performance as evidence about the representation.
It aims to diagnose what information is present and/or easily extractable from internal representations, and (with additional machinery) whether that information is used for the model’s original task.

3.2 Big-picture architecture (diagram in words)

Original pipeline: input text $x$ → original model f → intermediate representation $f_l(x)$ → original prediction $\hat{y}$.
Probing pipeline: same $x$ → representation $f_l(x)$ → probe g → predicted property $\hat{z}$.
Interpretation layer: compare probing performance against baselines/controls (e.g., randomization, selectivity), optionally add information-theoretic measures (mutual information, MDL) and causal interventions (Section 4; Figure 1b).

3.3 Roadmap for the deep dive

Define the core objects (model, datasets, representations, probe, metrics) because conclusions depend on each (Section 2; Figure 1a).
Explain why raw probe accuracy is underdetermined without comparisons (Section 4.1).
Detail how probe expressivity changes what can be inferred (Section 4.2).
Separate correlation (decodability) from causation (usage), and how interventions try to bridge this (Section 4.3).
Cover dataset/task and property-definition constraints that limit generalization and discovery (Sections 4.4–4.5).

3.4 Detailed, sentence-based technical breakdown

This is a methodological framework + critical survey paper whose core idea is: probing is not a single accuracy number; it is an experiment defined by multiple interacting components, and valid interpretation requires explicit controls/metrics and (for causal claims) interventions (Sections 2–4; Figure 1).

3.4.1 Core formalization (what gets defined)

The paper defines an original model as a function $$ f : x \mapsto \hat{y}, $$ trained on an original dataset $$ D_O = \{(x^{(i)}, y^{(i)})\}, $$ and evaluated by some task metric written as PERF(f, D_O) (Section 2; Figure 1a).
It defines an intermediate representation at layer $l$ as $f_l(x)$, meaning “some internal output of $f$ when processing $x$” (Section 2). The paper notes $f_l(x)$ can generalize beyond standard hidden states to other internals such as attention weights (footnote in Section 3; tied to Clark et al. 2019).
It defines a probing classifier (probe) as a function $$ g : f_l(x) \mapsto \hat{z}, $$ trained to predict a probing property $z$ (often a linguistic annotation such as POS tags) on a probing dataset $$ D_P = \{(x^{(i)}, z^{(i)})\}, $$ with a reported performance PERF(g, f, D_O, D_P) (Section 2; Figure 1a).
The key technical point is that probing performance is explicitly written as depending on all of:
- the probe g,
- the original model f,
- the original dataset D_O,
- the probing dataset D_P (Section 2).

This is meant to prevent casual readings like “the architecture encodes $z$,” when the result may be specific to training data, probe capacity, or dataset artifacts.

3.4.2 Information-theoretic reinterpretation (what probing estimates)

The paper frames training a probe as estimating how much information about $z$ is in representations $h$ (where $h$ ranges over $f_l(x)$) using mutual information: $$ I(z; h). $$ Here, $z$ is a random variable over probing labels and $h$ is a random variable over representations (Section 2).
Interpretation: if $I(z;h)$ is high, then $z$ is statistically related to $h$ (decodable in principle), but this still does not automatically imply that the original model’s decision process uses $z$ (this becomes central in Section 4.3).
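As a hedged aside that goes beyond what this summary states, a standard information-theoretic identity shows why a trained probe only bounds this quantity. Here $H_g(z \mid h)$ denotes the cross-entropy of a probabilistic probe $g$; this notation is introduced for illustration and is not taken from the paper.

$$ I(z;h) = H(z) - H(z \mid h), \qquad H(z \mid h) \le H_g(z \mid h) = -\mathbb{E}_{z,h}\left[\log g(z \mid h)\right], $$

so that

$$ I(z;h) \ge H(z) - H_g(z \mid h). $$

Under this reading, a better-fitting probe tightens the lower bound on $I(z;h)$, but the bound says nothing about whether f actually uses $z$.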

3.4.3 “System / data pipeline diagram in words” (explicit sequence)

A probing experiment, as formalized here, proceeds as follows (Sections 2 and 4; Figure 1a–b):

1. Train or select an original model f on an original dataset $D_O$ for an original task $x\mapsto y$, and record PERF(f, D_O).
2. Freeze f (typical probing assumption in this framework) and run it on inputs $x$ to extract representations $f_l(x)$ at some chosen layer/component.
3. Assemble a probing dataset $D_P$ containing pairs $(x^{(i)}, z^{(i)})$ for the property of interest $z$.
4. Train a probe g that maps $f_l(x)$ to $\hat{z}$ using $D_P$.
5. Evaluate probe performance as PERF(g, f, D_O, D_P) (e.g., accuracy).
6. Add interpretation scaffolding (Section 4; Figure 1b), which may include:
   - baseline/skyline comparisons,
   - control tasks/datasets,
   - selectivity,
   - information-gain under control functions,
   - minimum description length,
   - interventions that create modified representations $\tilde{f}_l(x)$ and test downstream effects.

This explicit sequencing matters because many shortcomings arise from steps 3–6 (dataset artifacts, probe memorization, inappropriate baselines, and invalid causal inferences).
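The freeze, extract, train, and evaluate sequence above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's setup: a fixed random projection with `tanh` stands in for a trained model, the data are synthetic, the probe is a least-squares linear classifier, and the names `f_l`, `fit_linear_probe`, and `probe_predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen original model f: only its hidden layer f_l is needed here.
# Assumption: a fixed random projection followed by tanh stands in for a trained network.
W1 = rng.normal(size=(10, 32)) / np.sqrt(10)

def f_l(x):
    """Intermediate representation f_l(x) of the frozen model."""
    return np.tanh(x @ W1)

# Synthetic probing dataset D_P = {(x, z)}; z is a binary property of the input.
x = rng.normal(size=(2000, 10))
z = (x[:, 0] > 0).astype(int)
x_train, x_test = x[:1500], x[1500:]
z_train, z_test = z[:1500], z[1500:]

def add_bias(h):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([h, np.ones((len(h), 1))])

def fit_linear_probe(h, labels, n_classes):
    """Least-squares linear probe g on one-hot targets."""
    targets = np.eye(n_classes)[labels]
    weights, *_ = np.linalg.lstsq(add_bias(h), targets, rcond=None)
    return weights

def probe_predict(weights, h):
    """Predicted property: the class with the highest linear score."""
    return (add_bias(h) @ weights).argmax(axis=1)

# f is never updated: representations are extracted and only g is trained.
g = fit_linear_probe(f_l(x_train), z_train, n_classes=2)
perf = float((probe_predict(g, f_l(x_test)) == z_test).mean())

# A majority-class baseline gives the raw number a minimal reference point.
majority = float(max(z_test.mean(), 1 - z_test.mean()))
print(f"PERF(g, f, D_O, D_P) = {perf:.3f}; majority baseline = {majority:.3f}")
```

Even in this toy setting the probe accuracy only means something next to the baseline, which is the point the interpretation scaffolding makes.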

3.4.4 Comparisons and controls (how the paper strengthens interpretation)

Problem: A single accuracy number (e.g., 87.8) is uninterpretable without a reference (Section 4.1).

The paper organizes comparison strategies (Section 4.1):

Lower-bound baselines:
Majority baselines, or probes trained on simpler representations like static embeddings.
Random baselines: train probes on representations from randomized versions of f, showing even random features can be surprisingly decodable (Section 4.1).
Upper bounds (“skylines”) using $\bar{f}$ (Section 4.1; Figure 1b):
Human performance estimates, literature-reported performance, or a dedicated model trained directly on $x\mapsto z$ (without being constrained to frozen $f_l(x)$).
Control tasks to diagnose probe memorization (Hewitt & Liang 2019; Section 4.1; Figure 1b):
Construct $D_{P,\text{Rand}}$ by randomizing labels in $D_P$.
Define selectivity as: $$ \text{SEL}(g, f, D_O, D_P, D_{P,\text{Rand}})= \text{PERF}(g, f, D_O, D_P)-\text{PERF}(g, f, D_O, D_{P,\text{Rand}}). $$
Intended meaning: a probe that truly exploits structure in representations should perform well on $D_P$ but poorly on randomized labels; high accuracy with low selectivity suggests the probe can memorize dataset quirks (Section 4.1).
Control functions (Pimentel et al. 2020b; Section 4.1; Figure 1b):
Apply a transformation $c$ to representations: $$ c: f_l(x)\mapsto c(f_l(x)). $$
Define information gain relative to that control: $$ G(z,h,c)=I(z;h)-I(z;c(h)). $$
Subsequent work (Zhu & Rudzicz 2020, as summarized here) finds control tasks and control functions are “almost equivalent” theoretically and empirically (Section 4.1).
Control datasets targeting task relevance (Ravichander, Belinkov, & Hovy 2021; Section 4.1; Figure 1b):
Modify the original dataset to create $D_{O,z}$ where all examples share the same value of property $z$, so $z$ is not discriminative for predicting $y$.
Intuition: if $z$ cannot help solve the original task, f should not need it; yet probes may still decode $z$ “incidentally,” undermining naive claims that decodability implies relevance (Section 4.1).
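The selectivity definition above can be illustrated with a small synthetic sketch. Everything here is an assumption made for illustration: the representations are random vectors, $z$ is constructed to be linearly encoded, and $D_{P,\text{Rand}}$ is built by permuting labels across examples. Hewitt & Liang's control tasks instead assign random labels per word type, so a memorizing probe can still score well on them; the simple permutation used here cannot reproduce that effect, and selectivity therefore comes out high.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_bias(h):
    return np.hstack([h, np.ones((len(h), 1))])

def fit_probe(h, labels, n_classes):
    """Least-squares linear probe on one-hot targets (a stand-in for any probe g)."""
    weights, *_ = np.linalg.lstsq(add_bias(h), np.eye(n_classes)[labels], rcond=None)
    return weights

def perf(weights, h, labels):
    """PERF as plain accuracy."""
    return float(((add_bias(h) @ weights).argmax(axis=1) == labels).mean())

# Synthetic representations h = f_l(x) and a property z that is linearly encoded in h.
n, d, n_classes = 3000, 16, 3
h = rng.normal(size=(n, d))
z = h[:, :n_classes].argmax(axis=1)

# D_{P,Rand}: same representations, labels randomized (here by permutation).
z_rand = rng.permutation(z)

train, test = slice(0, 2250), slice(2250, n)
perf_real = perf(fit_probe(h[train], z[train], n_classes), h[test], z[test])
perf_rand = perf(fit_probe(h[train], z_rand[train], n_classes), h[test], z_rand[test])

# SEL = PERF on D_P minus PERF on D_{P,Rand}
selectivity = perf_real - perf_rand
print(f"PERF={perf_real:.3f}  PERF_rand={perf_rand:.3f}  SEL={selectivity:.3f}")
```

The same probe class and training procedure are used for both label sets, so the difference reflects the labels rather than the probe.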

3.4.5 Which probe classifier to use? (expressivity and complexity)

The paper frames probe choice as a central confound (Section 4.2):

Simple probes (e.g., linear) are often motivated as measuring “readability” or features likely used by the model, because they cannot invent complex decision boundaries (Section 4.2).
Complex probes (e.g., non-linear MLPs) can yield higher accuracy but may learn to exploit dataset artifacts or memorize, inflating decodability without reflecting representation structure relevant to the model’s computations (Section 4.2; ties back to selectivity in Section 4.1).

It then summarizes approaches to explicitly incorporate complexity:

“Use the most complex probe” viewpoint: to best estimate how much information is available about $z$, allow a very expressive decoder (Pimentel et al. 2020b, as described in Section 4.2).
Minimum Description Length (MDL) as an accuracy–complexity measure (Voita & Titov 2020; Section 4.2; Figure 1b):
Denoted MDL(g, f, D_O, D_P).
Intuition in plain language: rather than only asking “can the probe predict $z$?”, MDL asks “how compactly can $z$ be encoded given the representation?”, which reflects both performance and complexity.
Accuracy–complexity trade-off reporting (“Pareto probing”; Pimentel et al. 2020a; Section 4.2):
The recommendation is to report trade-offs across a family of probes rather than a single probe choice.
Alternative probe classes (e.g., pruning-based probes that learn masks over f’s weights) are mentioned as targeting better accuracy–complexity trade-offs than standard non-linear probes (Cao, Sanh, & Rush 2021; Section 4.2).
Parameter-less probing (Section 4.2):
Instead of learning a classifier g, infer structure directly from internal signals such as attention weights or distances between word representations, possibly feeding them into an external algorithm (e.g., Chu–Liu/Edmonds parsing).
The paper’s framing: this can be viewed as a probe with no learned parameters, potentially avoiding “what did the probe learn?” confounds, though algorithmic complexity still exists (Section 4.2).
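The idea of reporting an accuracy–complexity trade-off rather than a single probe can be sketched as a Pareto-front computation. The probe names and numbers below are made up for illustration, and parameter count is used as a stand-in complexity measure, which is an assumption of this sketch rather than the measure used in the cited work.

```python
def pareto_front(results):
    """Return the names of probes that no other probe dominates.

    A probe is dominated if another one is at least as simple AND at least
    as accurate, and strictly better on one of the two.
    """
    front = []
    for name, complexity, accuracy in results:
        dominated = any(
            (c2 <= complexity and a2 >= accuracy)
            and (c2 < complexity or a2 > accuracy)
            for _, c2, a2 in results
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (name, parameter count, accuracy) triples; not real results.
results = [
    ("linear", 1_000, 0.80),
    ("mlp-small", 10_000, 0.78),
    ("mlp-1", 50_000, 0.88),
    ("mlp-2", 500_000, 0.88),
]

front = pareto_front(results)
print(front)
```

With these invented numbers, `mlp-small` is dominated by `linear` and `mlp-2` by `mlp-1`, so only two probes would need to be reported.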

3.4.6 Correlation vs. causation (when probing can claim “used by the model”)

A central critique is that standard probing is two-stage and decoupled: f is trained for $y$, g is trained for $z$, and nothing forces f to use $z$ even if $z$ is decodable from $f_l(x)$ (Section 4.3).

The paper summarizes intervention-based strategies that attempt causal evidence by creating modified representations $\tilde{f}_l(x)$ (Figure 1b; Section 4.3):

Gradient-based interventions (Giulianelli et al. 2018; Section 4.3):
Use gradients from the probe to modify representations inside f.
Measure how changes affect both probing performance and original-task performance.
Reported outcome (as summarized): probing can identify features that are used, at least in specific settings (Section 4.3).
Counterfactual representations (Tucker, Qian, & Levy 2021; Section 4.3):
Update $f_l(x)$ with respect to $z$ to create “counterfactual” embeddings and observe effects on other properties.
Property removal (“amnesic probing”) (Elazar et al. 2021; Section 4.3):
Iteratively train linear probes and project out the probed directions to reduce information about $z$ in the representation, yielding $\tilde{f}_l(x)$.
Compare probe performance vs. original-task performance after removal.
Reported outcome (as summarized): high decodability does not necessarily imply the model’s original performance depends strongly on that information (Section 4.3), in tension with the earlier gradient-intervention conclusions.
Adversarial removal with controls and continued updating of f (Feder et al. 2021; Section 4.3):
Train adversarially to remove $z$ while controlling for other properties $z_C$ that should remain.
Key difference from standard probing: f keeps being updated, and the method aims to estimate effects on downstream tasks when f is fine-tuned (Section 4.3).
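The projection step behind property removal can be sketched as follows. This is a simplified illustration, not the method of Elazar et al.: it performs a single projection where the approach described above iterates, it uses a least-squares probe, and it builds a toy representation in which $z$ lives in one dimension while a hypothetical frozen task head reads a different dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_direction(h, z):
    """Unit-norm weight vector of a least-squares linear probe for binary z."""
    hc = h - h.mean(axis=0)
    w, *_ = np.linalg.lstsq(hc, z - z.mean(), rcond=None)
    return w / np.linalg.norm(w)

def project_out(h, w):
    """Remove the component of every row of h along the unit vector w."""
    return h - np.outer(h @ w, w)

def probe_accuracy(h, z):
    """In-sample accuracy of a freshly trained linear probe for z."""
    hc = h - h.mean(axis=0)
    w, *_ = np.linalg.lstsq(hc, z - z.mean(), rcond=None)
    return float(((hc @ w > 0) == (z > 0.5)).mean())

# Toy representations: z is encoded in dimension 0, while the (hypothetical)
# original task y is read from dimension 1 by a fixed head.
n, d = 2000, 8
h = rng.normal(size=(n, d))
z = (h[:, 0] > 0).astype(float)
y = h[:, 1] > 0

def task_accuracy(h):
    """PERF(f, D_O) for the frozen head that thresholds dimension 1."""
    return float(((h[:, 1] > 0) == y).mean())

w = probe_direction(h, z)
h_tilde = project_out(h, w)  # modified representation with z linearly removed

print(f"probe accuracy before/after: {probe_accuracy(h, z):.3f} / {probe_accuracy(h_tilde, z):.3f}")
print(f"task accuracy before/after:  {task_accuracy(h):.3f} / {task_accuracy(h_tilde):.3f}")
```

In this constructed case the property is highly decodable before removal, yet removing it barely changes the toy task, which mirrors the kind of outcome that separates decodability from usage.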

3.4.7 Datasets vs. tasks and property pre-definition (structural constraints)

Datasets are imperfect proxies for tasks (Section 4.4):
Claims about “what a model learns” are confounded because different models f are trained on different $D_O$, and probe results also depend on the size and composition of $D_P$, an effect the paper describes as “not well studied” (Section 4.4).
Probed properties must be pre-defined (Section 4.5):
Probing requires choosing a property $z$ and typically requires annotated datasets.
This constrains probing to available annotations (often English-centric) and to what researchers already think might matter.
The paper mentions one attempt to discover latent clusters that correspond to known and potentially novel categories (Michael, Botha, & Tenney 2020; Section 4.5), but emphasizes probing remains mostly hypothesis-driven.

3.4.8 Worked micro-example (single input → output walk-through)

Consider the sentiment model example the paper uses (Section 2):

Let $x$ be a sentence like “The movie was surprisingly good.”
The original model computes $\hat{y}=f(x)$, e.g., positive or negative.
At some layer $l$, we extract $h=f_l(x)$, a vector representation of the sentence.
We define a probing property $z$, e.g., POS tags for each word or a sentence-level syntactic label, and create $D_P$ with pairs $(x,z)$.
We train a probe $g : h \mapsto \hat{z}$ and evaluate PERF(g, f, D_O, D_P).
Interpretation depends on additional steps:
Compare to a random baseline or skyline (Section 4.1),
Compute selectivity using randomized labels (Section 4.1),
Optionally intervene on $h$ to see if changing/removing $z$ affects PERF(f, D_O) (Section 4.3).

This example illustrates the paper’s main message: decoding $z$ from $h$ is easy to do, but hard to interpret without controls and (for usage claims) interventions.

3.4.9 “Core configurations and hyperparameters” (what is and isn’t specified)

This paper does not introduce a new trained model with optimizer settings, learning rates, batch sizes, hardware, or token counts.
Instead, the “configuration space” it emphasizes is methodological:
choice of f, layer $l$, g (linear vs non-linear vs parameter-less),
datasets $D_O$, $D_P$ and their control variants ($D_{P,\text{Rand}}$, $D_{O,z}$),
metrics (PERF, SEL, $I(z;h)$, $G(z,h,c)$, MDL),
and whether one performs interventions to produce $\tilde{f}_l(x)$ (Figure 1b; Section 4).
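One way to keep this configuration space explicit is to record it alongside every reported number. The dataclass below is a hypothetical bookkeeping sketch: the class name, field names, and example values are all invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ProbingSetup:
    model: str                        # f
    layer: int                        # l
    probe: str                        # g: "linear", "mlp", or "parameter-less"
    original_dataset: str             # D_O
    probing_dataset: str              # D_P
    controls: Tuple[str, ...] = ()    # e.g. "D_P_Rand", "D_O_z"
    metrics: Tuple[str, ...] = ("PERF",)
    intervention: Optional[str] = None

    def claim_scope(self) -> str:
        """Spell out everything a reported number is conditional on."""
        scope = (
            f"PERF(g={self.probe}, f={self.model}[layer {self.layer}], "
            f"D_O={self.original_dataset}, D_P={self.probing_dataset})"
        )
        if self.intervention is None:
            return scope + " supports decodability claims only"
        return scope + f" with intervention '{self.intervention}'"

# Hypothetical example values.
setup = ProbingSetup(
    model="toy-encoder",
    layer=6,
    probe="linear",
    original_dataset="toy-sentiment",
    probing_dataset="toy-pos",
    controls=("D_P_Rand",),
    metrics=("PERF", "SEL"),
)
print(setup.claim_scope())
```

Making the intervention field explicit keeps usage claims from being attached to a setup that only measured decodability.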

4. Key Insights and Innovations

(1) Probing is a multi-component experiment, not a single metric (Section 2; Figure 1a–b).
Novelty/significance: The explicit dependence PERF(g, f, D_O, D_P) forces correct attribution and discourages overgeneral claims like “the architecture encodes syntax,” when results may hinge on datasets or probe choice.
(2) Controls are essential because probes can succeed for the “wrong reasons” (Section 4.1).
What’s different: The review emphasizes control tasks ($D_{P,\text{Rand}}$ and selectivity) and control datasets ($D_{O,z}$) as principled ways to detect memorization and to challenge “task relevance” interpretations.
Significance: This directly targets a common failure mode: overinterpreting probe accuracy as evidence of meaningful internal linguistic structure.
(3) Probe expressivity creates an accuracy–interpretability trade-off (Section 4.2).
What’s different: Rather than arguing “linear is best” or “complex is best,” the paper organizes multiple positions and highlights metrics like MDL and Pareto frontiers as ways to quantify the trade-off.
Significance: It reframes probe selection as an experimental variable that must be justified, not a default.
(4) Decodability is correlational; causal claims require interventions (Section 4.3).
What’s different: The paper treats the disconnect between g and f as fundamental, and summarizes intervention-based approaches that test whether modifying/removing $z$ in representations changes original-task behavior.
Significance: This is the bridge from “information is present” to “information is used,” which is what many interpretability questions actually ask.
(5) Probing is limited by dataset proxies and pre-defined properties (Sections 4.4–4.5).
Significance: It explains why probing can systematically miss properties that matter but are not annotated or hypothesized in advance, and why cross-paper comparisons are often confounded.

5. Experimental Analysis

Evaluation methodology in this paper
This article is primarily a conceptual review and framework formalization; it does not present a new benchmark suite or a new set of controlled experiments with reported tables of results.
Instead, it synthesizes empirical findings from prior work to support methodological claims (Sections 4.1–4.3).
Main quantitative results
The only number quoted is an illustrative probe performance value (“87.8” in Section 4.1), used as an example of an uninterpretable standalone number, not as an experimental result reported by this paper.
Other performance-related statements are qualitative summaries of prior work (e.g., probes can achieve high accuracy but low selectivity; linear probes tend to have higher selectivity than non-linear probes in Hewitt & Liang 2019, as described in Section 4.1); no specific numeric results are given in this summary.
Baselines and controls discussed
Lower bounds: majority, simple representations, random baselines (Section 4.1).
Upper bounds: skylines $\bar{f}$ via human estimates, literature SOTA, or dedicated $x\mapsto z$ models (Section 4.1).
Controls: randomized-label tasks ($D_{P,\text{Rand}}$) and selectivity; control functions and information gain; control datasets ($D_{O,z}$) (Section 4.1; Figure 1b).
Do the experiments (as surveyed) convincingly support the claims?
The survey evidence supports the paper’s central caution: probe accuracy alone is ambiguous because random features and expressive probes can yield strong decodability, and because properties can be incidentally decodable even when not discriminative for the original task (Section 4.1).
For the causality question, the surveyed intervention papers reach mixed conclusions (Giulianelli et al. 2018 vs. Elazar et al. 2021 as summarized in Section 4.3), which the paper uses to argue that “used by the model” is not settled by standard probing and requires careful intervention design.
Ablations, failure cases, robustness checks
Not presented as new experiments here, but the paper highlights methodological “robustness checks” conceptually:
- control tasks/datasets (Section 4.1),
- reporting accuracy–complexity trade-offs and MDL (Section 4.2),
- interventions that test downstream impact (Section 4.3).

6. Limitations and Trade-offs

Scope limitation (review nature)
Because the paper is a short critical review, it does not experimentally validate a single “best practice” pipeline end-to-end; it instead catalogs options and pitfalls (Sections 4–5).
Control-task applicability
The selectivity control tasks based on randomized labels are described as especially suitable for word-level properties and less clearly applicable to sentence-level properties (Section 4.1).
Causality remains difficult
Interventions can suggest causality, but different intervention methods yield different conclusions (Section 4.3). This implies:
- causal claims depend strongly on the intervention type (gradient nudging vs. projection-out vs. adversarial removal),
- and “property removal” may not map cleanly to “model no longer uses it,” especially if information is redundantly encoded.
Dataset confounding is not resolved
The paper emphasizes that both $D_O$ and $D_P$ can drive results (Section 4.4), but also notes that systematic studies varying these datasets are largely missing.
Property pre-definition biases what you can find
Probing mostly answers questions of the form “is this property $z$ present?” and struggles to discover unexpected properties without additional machinery (Section 4.5).

7. Implications and Future Directions

How this work changes the landscape
It pushes probing from an informal practice (“train a probe, report accuracy”) toward a controlled experimental methodology with explicit components, controls, and interpretive boundaries (Figure 1; Sections 2 and 4).
Follow-up research it enables or suggests
More systematic studies disentangling:
- architecture vs. training data effects (vary f while holding $D_O$ fixed, or vice versa),
- probing dataset effects (size/composition of $D_P$) (Section 4.4).
Better general-purpose controls for sentence-level properties (building on the limitations noted for control tasks; Section 4.1).
Stronger causal frameworks connecting representation properties to original-task decisions, likely via interventions and counterfactuals (Section 4.3).
Practical applications / downstream use cases
Using probes to characterize what is extractable from a model’s representations (e.g., whether representations contain POS, syntax, semantics) remains valuable for diagnostics and for anticipating transfer behavior (Sections 3 and 5).
Probing (plus other analyses) can sometimes motivate architectural changes or efficiency methods, and the paper lists a few examples where analyses informed later modeling choices (Section 5).
Repro/Integration Guidance: when to prefer what (based on this paper)
If your goal is relative comparisons (e.g., “does model A encode $z$ more than model B?”), prefer:
- explicit baselines and skylines (Section 4.1),
- controls like selectivity or information-gain-style comparisons (Section 4.1),
- and consider reporting accuracy–complexity trade-offs or MDL if “ease of extraction” matters (Section 4.2).
If your goal is causal/usage claims (“does the model use $z$ to predict $y$?”), standard probing is insufficient; prefer interventions that alter $f_l(x)$ (or remove $z$) and measure impact on PERF(f, D_O) or downstream behavior (Section 4.3).
If you want to minimize “probe learned the trick” concerns, consider parameter-less approaches that infer properties directly from internal signals, while being mindful that external algorithms still embody assumptions and complexity (Section 4.2).