Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process¶

🎯 Pitch¶

This paper introduces RISE, an unsupervised method that trains sparse auto-encoders on sentence-level chain-of-thought activations to discover linear "reasoning vectors"—decoder directions that disentangle behaviors like reflection and backtracking. These vectors both reveal the geometric organization of reasoning in LLMs and can be causally injected at inference to controllably amplify or suppress behaviors, enabling interpretable, test-time steering of reasoning trajectories without retraining.

1. Executive Summary¶

This paper introduces RISE (Reasoning behavior Interpretability via Sparse auto-Encoder), an unsupervised framework that discovers reasoning vectors: linear directions in an LLM’s internal activation space that correspond to distinct reasoning behaviors (e.g., reflection, backtracking). By training sparse autoencoders (SAEs) on sentence-level (“step-level”) chain-of-thought activations, it finds disentangled decoder directions that both organize reasoning behaviors geometrically and can be causally used at inference time to amplify/suppress those behaviors without retraining the base model.

2. Context and Motivation¶

Problem / gap addressed
Reasoning-capable LLMs often produce long chain-of-thought (CoT) traces, but the internal mechanisms shaping these trajectories are not well characterized.
Prior mechanistic/activation-steering methods typically require human-defined concepts and supervised contrastive data, which is a poor fit for reasoning because reasoning behaviors are:
- fluid and overlapping,
- hard to define precisely “in token space,” and
- difficult to annotate exhaustively at scale.
Why this matters
Understanding which internal features correspond to reasoning behaviors is useful for:
- interpretability (mapping internal “cognitive” structure),
- controllability (steering inference trajectories),
- efficiency (reducing unnecessary verbosity / “overthinking”),
- and possibly performance (if certain behaviors are helpful/harmful depending on the instance).
Prior approaches and shortfalls (as framed here)
Activation steering via supervised concept vectors, especially DiffMean (Difference-of-Means):
- Build two labeled groups (e.g., “happy” vs “sad”), average activations per group, subtract averages to get a steering direction.
- Limitation for reasoning: requires clean oppositional labels and predefined concepts; reasoning behaviors do not naturally form such clean binary partitions and are not easily labeled.
Focusing on a small set of predefined behaviors (e.g., reflection/backtracking) provides only a narrow view of the possible behavior space.
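For contrast with the unsupervised approach, the DiffMean recipe described above can be sketched in a few lines. This is a minimal illustration only: the helper names `mean_vector` and `diff_mean` and the toy 2-D activations are hypothetical, not taken from the paper.

```python
def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_mean(positive, negative):
    """DiffMean steering direction: mean(positive) - mean(negative)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Toy 2-D activations for two hand-labeled groups (made-up numbers).
happy = [[1.0, 0.0], [3.0, 2.0]]
sad = [[0.0, 1.0], [0.0, 3.0]]
direction = diff_mean(happy, sad)
print(direction)
```

The sketch makes the dependency explicit: both `happy` and `sad` must be labeled in advance, which is exactly the requirement that is hard to meet for reasoning behaviors.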
How this work positions itself
It aims to uncover the geometry of reasoning behaviors without supervision, using SAEs trained on step-level activations, and then validates interpretability and causality via visualization, clustering metrics, and intervention experiments.

3. Technical Approach¶

3.1 Reader orientation (approachable technical breakdown)¶

The system is an unsupervised interpretability + steering pipeline that learns a dictionary of sparse latent features from an LLM’s internal activations during reasoning.
It solves the problem by (i) extracting sentence-level reasoning-step activations, (ii) training a sparse autoencoder on those activations, and (iii) using SAE decoder directions to analyze geometry and intervene at inference time.

3.2 Big-picture architecture (diagram in words)¶

(A) Data/activation collection
Input questions → LLM generates CoT responses → responses are split into steps → a specific per-step hidden activation is extracted.
(B) SAE training
Step activations from one chosen LLM layer → SAE encoder produces sparse codes → SAE decoder reconstructs activations using a sparse combination of decoder directions.
(C) Analysis + steering
Decoder directions are visualized/clustered (e.g., UMAP + silhouette) and associated with behaviors.
Selected directions (or their centroid) are injected/removed from LLM activations at inference time to modulate behaviors (reflection/backtracking/confidence/length structure).

3.3 Roadmap for the deep dive¶

I will explain, in order:
How the paper defines “reasoning vectors” and why SAEs are a natural tool here.
How “thought representations” (step-level activations) are constructed from CoT traces.
The SAE objective and what the decoder columns/rows represent mechanistically.
The theoretical identifiability claim (Theorem 1) and what assumptions it needs.
How behavior geometry is validated (UMAP + silhouette) and how human labels are used only for evaluation.
How inference-time interventions are implemented (projection / scaling) and what causal effects are measured.
How “new behaviors” are discovered via an entropy-based optimization to find confidence-related vectors.

3.4 Detailed, sentence-based technical breakdown¶

Framing: This is primarily an empirical mechanistic interpretability + intervention paper with an algorithmic component (SAE-based latent discovery) and a theoretical justification (dictionary recovery conditions) for why SAE decoder directions can align with “true” underlying behavior directions.

3.4.1 Core concept: “reasoning vectors” as linear directions¶

The paper adopts the linear representation hypothesis, i.e., the idea that “atomic” concepts/behaviors correspond to directions in activation space.
A reasoning vector is defined as a direction in activation space that encodes a distinct reasoning behavior (e.g., reflection/backtracking), such that modifying activations along that direction predictably changes the behavior.

3.4.2 Thought representation construction (step-level activations)¶

The pipeline to create SAE training inputs is:

Collect model responses
For each question in a selected dataset, run the target model to generate a chain-of-thought response.
Split responses into sentence-level “steps”
Each response is divided using the delimiter symbol <\n\n>.
If a response splits into k steps, k varies by example.
Embed each step by extracting a delimiter-token activation
The paper re-runs inference by feeding both the question and the full response back into the model.
For each step boundary, it extracts the hidden representation at the token <\n\n>.
That hidden vector is treated as the activation representing the corresponding reasoning step.
Choose which internal representation to use
The extracted vectors are specifically taken from the residual stream representations after each transformer layer.
The SAE is trained on activations from a single chosen layer l at a time (later analysis compares across layers).

Important design choice: They operate at the sentence/step level rather than token level because the same token can serve different roles across contexts; a step-level representation is intended to better capture the structure of reasoning.
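The segmentation step can be sketched as follows. This is a minimal text-level illustration, assuming the delimiter is the literal two-newline string; the function names are hypothetical. Mapping the character offsets to token positions (and reading the hidden state there) depends on the model's tokenizer and is not shown.

```python
def split_into_steps(response, delimiter="\n\n"):
    """Split a chain-of-thought response into non-empty reasoning steps."""
    return [s.strip() for s in response.split(delimiter) if s.strip()]


def delimiter_offsets(response, delimiter="\n\n"):
    """Character offsets where each delimiter starts (one per step boundary)."""
    offsets, start = [], 0
    while True:
        idx = response.find(delimiter, start)
        if idx == -1:
            return offsets
        offsets.append(idx)
        start = idx + len(delimiter)


response = "Let x = 2.\n\nWait, check the sign.\n\nSo the answer is 4."
steps = split_into_steps(response)
boundaries = delimiter_offsets(response)
print(len(steps), boundaries)
```

Each entry in `boundaries` marks one delimiter, so a response with k steps yields k - 1 internal boundaries in this sketch; how the final step is handled is not stated in the excerpt.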

3.4.3 Sparse Autoencoder (SAE) model and objective¶

What an SAE is (in this paper’s usage):
An SAE learns a dictionary of latent feature directions that can reconstruct the original activation vectors while using only a small number of active features per input.
The goal is that sparsity promotes disentanglement and interpretability of individual features.
Forward equations (Eq. (1))
Let the original activation be \(h \in \mathbb{R}^d\).
The SAE computes a latent code \(z\) and a reconstruction \(\hat{h}\):
- \(z = \sigma(W^\top_{\text{encoder}} h + b_{\text{encoder}})\)
- \(\hat{h} = W^\top_{\text{decoder}} z + b_{\text{decoder}}\)
The nonlinearity \(\sigma\) is ReLU for the standard SAE used here.
Interpretation of decoder parameters
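The decoder matrix \(W_{\text{decoder}} \in \mathbb{R}^{D \times d}\) contains D learned vectors (its D rows under this shape convention; elsewhere in this summary they are referred to as decoder columns).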
The paper treats each learned decoder direction as an “atomic” vector; these are the candidate reasoning vectors.
Mechanistically, \(\hat{h}\) becomes a linear combination of a small subset of decoder vectors, gated by sparse activations in \(z\).
Training objective (Eq. (2))
\(L = \|\hat{h} - h\|_2^2 + \lambda \|z\|_0\)
The reconstruction term keeps \(\hat{h}\) close to \(h\), while the \(\ell_0\)-style sparsity penalty encourages few active features.
The intended effect is: each step activation is explained by a small number of feature directions, making individual directions more interpretable.
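To make the shapes in Eq. (1) and Eq. (2) concrete, here is a toy forward pass and loss in plain Python. It is a sketch under stated assumptions: the weights are made up, `W_enc` is stored as d rows of length D and `W_dec` as D rows of length d, and the sparsity term simply counts active latents. How the sparsity penalty is optimized during training is not specified in the excerpt and is not shown.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec_t(W, x):
    """Compute W^T x for W stored as len(x) rows."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]


def sae_forward(h, W_enc, b_enc, W_dec, b_dec):
    """z = ReLU(W_enc^T h + b_enc); h_hat = W_dec^T z + b_dec."""
    z = relu([a + b for a, b in zip(matvec_t(W_enc, h), b_enc)])
    h_hat = [a + b for a, b in zip(matvec_t(W_dec, z), b_dec)]
    return z, h_hat


def sae_loss(h, h_hat, z, lam):
    """Squared reconstruction error plus lam times the number of active latents."""
    recon = sum((a - b) ** 2 for a, b in zip(h_hat, h))
    l0 = sum(1 for x in z if x > 0.0)
    return recon + lam * l0


# Toy sizes: d = 2 activation dims, D = 3 latent features (hypothetical weights).
W_enc = [[1.0, 0.0, -1.0],
         [0.0, 1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0],      # each row is one decoder direction
         [0.0, 1.0],
         [-1.0, -1.0]]
b_dec = [0.0, 0.0]

h = [2.0, -1.0]
z, h_hat = sae_forward(h, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(h, h_hat, z, lam=2e-3)
print(z, h_hat, loss)
```

In this toy case only the first latent fires, so the reconstruction is a scaled copy of the first decoder direction, which is the "small subset of decoder vectors" picture described above.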

3.4.4 Theoretical justification: when SAEs recover a “true dictionary” (Theorem 1)¶

The paper posits a generative model for delimiter-token activations (Eq. (3)):
\(h = W a + \varepsilon\)
Here:
- \(W = [w_1,\dots,w_m] \in \mathbb{R}^{d \times m}\) is a (hypothetical) ground-truth dictionary of latent behavior directions,
- \(a\) is a k-sparse coefficient vector (only k behaviors active per step),
- \(\varepsilon\) is bounded noise.
Theorem 1 provides conditions under which SAE training recovers the dictionary (up to permutation and scaling):
Incoherence: different dictionary vectors are not too aligned (bounded cosine similarity by \(\mu < 1\)).
Sparsity: \(k\) is small relative to \(1/\mu\) (stated as \(k < c/\mu\) for a universal constant \(c\)).
Activation function: ReLU is used.
Separation: nonzero coefficients satisfy \(|a_i| \ge \alpha > 0\).
Conclusion of Theorem 1 (Eq. (5))
As sample size \(N \to \infty\), local optima of the SAE objective recover a decoder matrix aligned with the true dictionary up to permutation \(\Pi\) and diagonal scaling \(D\):
- \(W_{\text{dec}} \approx W \Pi D\)
Plain-language meaning
Under idealized assumptions, the learned decoder directions can correspond to underlying “atomic” behavioral directions, making it plausible that decoder columns/rows can map to interpretable reasoning behaviors.

(The paper also includes empirical checks of sparsity/incoherence assumptions in Appendix C / Figure 9, but the excerpt does not include exact numeric bounds; it qualitatively reports they are “clearly bounded” and reconstruction error is consistently low.)
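The incoherence and sparsity quantities in Theorem 1 are straightforward to measure on a candidate dictionary. The sketch below is an illustration only, with a hypothetical three-vector dictionary; it is not the paper's Appendix C procedure, whose details are not in the excerpt.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mutual_coherence(dictionary):
    """Largest absolute cosine similarity between distinct dictionary vectors."""
    return max(abs(cosine(dictionary[i], dictionary[j]))
               for i in range(len(dictionary))
               for j in range(i + 1, len(dictionary)))


def sparsity(coefficients, tol=0.0):
    """Number of coefficients with magnitude above tol (the k in 'k-sparse')."""
    return sum(1 for x in coefficients if abs(x) > tol)


# Hypothetical dictionary of three directions in 2-D.
dictionary = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]
mu = mutual_coherence(dictionary)
k = sparsity([0.0, 1.5, 0.0])
print(mu, k)
```

A value of `mu` close to 1 would mean two directions are nearly parallel, which is the situation the incoherence condition rules out.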

3.4.5 Experimental setup details (what is specified vs not specified)¶

Target model(s)
Main: DeepSeek-R1-Distill-Qwen-1.5B (R1-1.5B) (Section 4.2).
Extension: Qwen3-8B (reported in Figure 5).
SAE training data
MATH dataset, with 500 randomly sampled training examples (Section 4.2).
The SAE is trained on step-level delimiter activations constructed as in Section 3.2.
SAE hyperparameters (explicitly given)
Hidden dimension: \(D = 2048\).
Batch size: 1024.
Learning rate: \(1 \times 10^{-4}\), with warm-up over the first 10% of training.
Optimizer: Adam.
LR schedule: cosine annealing decay.
Sparsity strength: \(\lambda = 2 \times 10^{-3}\).
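The learning-rate settings above can be sketched as a schedule function. The excerpt gives the peak rate, the 10% warm-up, and cosine annealing, but not the warm-up shape or the final rate, so the linear warm-up and the decay to zero below are assumptions, as is the function name `learning_rate`.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-4, warmup_frac=0.10):
    """Linear warm-up to peak_lr (assumed), then cosine annealing to zero (assumed)."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(s, total_steps=1000) for s in range(1000)]
print(schedule[0], schedule[99], schedule[-1])
```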
LLM architecture/training config fields that are not provided in the excerpt
The paper does not specify (in the provided content) the base LLM’s:
- context window length,
- tokenizer details,
- number of transformer layers / hidden size / heads (beyond showing layer indices in plots),
- or any pretraining compute/hardware.
Since RISE is an analysis/steering method, the focus is on SAE training and inference-time intervention rather than training the base model.

3.4.6 Geometry analysis: visualizing and quantifying separability of behaviors¶

Behavior labels (human-defined categories assigned by an LLM judge) are used for evaluation, not for SAE training
They label each step representation into one of three classes: reflection, backtracking, others (Section 4.3).
Labeling method: LLM-as-a-judge using GPT-5 with definitions:
- reflection = re-examining earlier steps / uncertainty,
- backtracking = retract/pivot to a new approach,
- others = neither.
Appendix D also reports consistency checks across judges and keyword matching:
- Pairwise agreement ratios exceed 85%.
- GPT-5 vs GPT-4o agreement is 0.94 (Figure 10).
Mapping labels to SAE features
For each labeled step, they map the step’s activation into SAE latent feature space and examine which channels are most active.
“Channel activity” is measured by the largest magnitude of its latent feature for that step.
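One way to picture the label-to-channel mapping is to record, for every labeled step, which latent channel has the largest magnitude. The sketch below is a simplified illustration with made-up codes; the helper names are hypothetical, and the "specific to one label" rule is a strict stand-in for the paper's filtering of channels that are strongly active across multiple behaviors, whose exact threshold is not given in the excerpt.

```python
from collections import Counter, defaultdict


def top_channel(z):
    """Index of the latent channel with the largest activation magnitude."""
    return max(range(len(z)), key=lambda i: abs(z[i]))


def channels_by_label(codes, labels):
    """Count, per label, how often each channel is the most active one."""
    counts = defaultdict(Counter)
    for z, label in zip(codes, labels):
        counts[label][top_channel(z)] += 1
    return counts


def behavior_specific(counts, label):
    """Channels that are top-active for `label` and for no other label."""
    others = set()
    for other, counter in counts.items():
        if other != label:
            others.update(counter)
    return sorted(ch for ch in counts[label] if ch not in others)


# Made-up SAE codes (D = 4) for four labeled steps.
codes = [[0.1, 2.0, 0.0, 0.0],
         [0.0, 0.2, 0.0, 1.5],
         [3.0, 0.0, 0.1, 0.0],
         [0.0, 0.1, 0.0, 0.9]]
labels = ["reflection", "reflection", "backtracking", "others"]

counts = channels_by_label(codes, labels)
reflection_channels = behavior_specific(counts, "reflection")
backtracking_channels = behavior_specific(counts, "backtracking")
print(reflection_channels, backtracking_channels)
```

Channel 3 is dropped from the reflection set here because it is also the top channel for an "others" step.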
UMAP visualization (Figure 2)
They embed SAE decoder vectors into 2D using UMAP.
They choose UMAP with a cosine-similarity metric, which emphasizes direction over magnitude (important since the method interprets behaviors as directions).
Observed qualitative structure:
- Reflection- and backtracking-associated vectors cluster more tightly in localized regions.
- “Other” behaviors appear more dispersed, consistent with being a heterogeneous category.
Layerwise separability quantified by silhouette score (Figure 3)
They compute (normalized) silhouette scores across layers for:
- overall clustering,
- reflection vs backtracking,
- others vs reflection,
- others vs backtracking.
Reported pattern:
- Later layers generally have higher separability than early layers.
- There is a slight decline near the final layers (except the very last), suggesting mid-to-late layers encode the clearest behavioral structure.
- Reflection vs backtracking separation is more modest than separating either from “others,” suggesting partial overlap between reflection/backtracking subspaces.
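For readers unfamiliar with the metric, the silhouette coefficient compares each point's mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes it with cosine distance on made-up 2-D directions. It is an illustration only: the paper reports a normalized variant whose normalization is not described in the excerpt, and the function names here are hypothetical.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(cosine_distance(p, q))
        if own not in by_cluster:          # singleton cluster scores 0
            scores.append(0.0)
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up decoder directions: two tight groups pointing along different axes.
directions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = silhouette_score(directions, ["reflection", "reflection",
                                     "backtracking", "backtracking"])
mixed = silhouette_score(directions, ["reflection", "backtracking",
                                      "reflection", "backtracking"])
print(good, mixed)
```

Labels that follow the geometry give a score near 1, while labels that cut across the two groups give a negative score, which is the sense in which higher values indicate cleaner behavioral separation.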

3.4.7 Causal intervention: using decoder directions during inference (Eq. (6))¶

Goal of the intervention
Show that SAE decoder directions are not only correlated with behaviors but are causally involved: editing activations along these directions changes reasoning behavior during generation.
How they select a behavior vector
They filter out decoder columns that are strongly activated across multiple behaviors (to improve specificity).
For reflection:
- Take reflection-specific columns and compute their average to get a single reflection vector (a centroid in decoder space).
Decoder columns are normalized to unit length: \(\|w_i\|_2 = 1\).
Where the intervention is applied
They intervene at the hidden representation of the last token at each reasoning step (Figure 4).
They focus on decoder columns from the final layer for the main causal study (Section 4.4.1).
Intervention equation (Eq. (6))
\(h' = h - w_i (w_i^\top h)\)
This subtracts the projection of \(h\) onto direction \(w_i\), i.e., “project out” that component.
Controlling strength
They also use a scalar \(\alpha\) to control strength:
- \(h' = h - \alpha \cdot w_i (w_i^\top h)\), with \(\alpha \in \{-1.5, -1, 0, 1, 1.5\}\).
Interpretation:
- \(\alpha > 0\): remove/suppress that direction.
- \(\alpha < 0\): add/amplify that direction (since subtracting a negative adds).
Worked micro-example (toy illustration of Eq. (6))
Suppose an activation \(h = (2, 1)\) and a unit behavior direction \(w = (1, 0)\).
Then \(w^\top h = 2\), and the projection is \(w(w^\top h) = (2, 0)\).
The edited activation is \(h' = (2, 1) - (2, 0) = (0, 1)\).
In words: the component along the behavior direction is removed, leaving only the orthogonal component.
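The same edit, including the strength parameter, can be written as a short function that reproduces the toy numbers above. This is a sketch using the subtractive form stated in this section; the function names are hypothetical, and renormalizing the averaged direction to unit length in `centroid_direction` is an assumption rather than something the excerpt confirms.

```python
import math


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid_direction(columns):
    """Average several decoder directions, then renormalize (assumed step)."""
    mean = [sum(c) / len(columns) for c in zip(*columns)]
    return unit(mean)


def steer(h, w, alpha=1.0):
    """h' = h - alpha * w (w^T h), for a unit-norm direction w."""
    coeff = sum(a * b for a, b in zip(w, h))
    return [a - alpha * coeff * b for a, b in zip(h, w)]


h = [2.0, 1.0]
w = [1.0, 0.0]
removed = steer(h, w, alpha=1.0)      # projects the direction out
amplified = steer(h, w, alpha=-1.0)   # adds the projection instead
unchanged = steer(h, w, alpha=0.0)    # no edit
centroid = centroid_direction([[1.0, 0.0], [0.0, 1.0]])
print(removed, amplified, unchanged, centroid)
```

Because the edit scales with the existing projection \(w^\top h\), a step whose activation has no component along \(w\) is left untouched for any \(\alpha\).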

3.4.8 Discovering new behaviors: confidence via entropy-minimization over decoder columns (Eq. (7))¶

Motivation
Confidence is described as hard to specify at the word level, so it serves as a test case for unsupervised discovery beyond human-defined behaviors.
They use entropy of the output distribution as a proxy for confidence: lower entropy ≈ more confident.
Model split around the SAE layer
Choose the layer \(l\) where the SAE is trained.
Split the model into:
- \(h = f_{1 \to l}(x)\) (activation up to layer \(l\)),
- \(y = f_{l \to L}(h)\) (rest of the model to produce outputs).
Optimization objective (Eq. (7))
They learn a score vector \(S \in \mathbb{R}^D\) over decoder columns by minimizing expected entropy when adding a decoder-space perturbation \(S W_{\text{decoder}}\) to activations:
- \(p_k = \text{softmax}( f_{l \to L}(h + S W_{\text{decoder}}) )_k\)
- Minimize: \(-\sum_{k=1}^{|V|} p_k \log p_k\) (entropy), averaged over activation samples.
Training details for this S optimization:
- Optimizer: Adam
- Learning rate: 0.01
- Iterations: 1000
- Batch size: 256
- Cosine annealing LR decay schedule
From scores to a “confidence vector”
Select columns with the highest entries in \(S\).
Combine/inject them (the excerpt refers to injecting the corresponding vector, termed the confidence reasoning vector) using the same inference intervention mechanism as Section 4.4.
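The quantity being minimized in Eq. (7) can be illustrated on a toy example. In the sketch below, `toy_head` is a hypothetical stand-in for \(f_{l \to L}\) that simply returns the activation as logits, and `S` is set by hand rather than learned; the actual procedure optimizes `S` with Adam through the frozen model, which is not reproduced here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def perturb(h, S, W_dec):
    """h + S W_dec, with W_dec stored as D rows of length d."""
    delta = [sum(S[i] * W_dec[i][j] for i in range(len(S))) for j in range(len(h))]
    return [a + b for a, b in zip(h, delta)]


def toy_head(h):
    """Hypothetical stand-in for the rest of the model: logits equal h."""
    return list(h)


W_dec = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]            # D = 2 directions, d = 3 = toy vocabulary size
h = [0.0, 0.0, 0.0]                  # uniform output distribution
S = [3.0, 0.0]                       # hand-picked scores, not learned

base = entropy(softmax(toy_head(h)))
steered = entropy(softmax(toy_head(perturb(h, S, W_dec))))
print(base, steered)
```

Pushing the activation along the first decoder direction concentrates probability on one token, so the entropy falls below its uniform value of \(\log 3\); a learned `S` would put its largest entries on directions with this effect.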

4. Key Insights and Innovations¶

(1) Step-level SAE training as an unsupervised route to reasoning-behavior features
Novelty: Instead of token-level or supervised concept labeling, the method trains SAEs on sentence-level reasoning-step activations (delimiter token <\n\n>).
Significance: This targets reasoning structure at a more semantically meaningful granularity and avoids requiring human-defined behavior taxonomies during training.
(2) Geometry of behaviors in decoder space + layerwise separability
Novelty: The work treats the SAE decoder directions as points in a “behavior space” and shows (via UMAP + silhouette) that reflection/backtracking align with separable regions (Figure 2) and become more separable in mid-to-late layers (Figure 3).
Significance: This provides a concrete, testable claim: reasoning behaviors correspond to clusters of directions in a learned dictionary, not just vague correlations.
(3) Causal controllability without retraining by projection-based intervention
Novelty: The intervention method uses a clean geometric edit—subtracting (or amplifying) the projection along a decoder direction (Eq. (6))—applied per reasoning step.
Significance: It demonstrates mechanistic leverage: you can change reflection/backtracking frequency and response length characteristics while preserving the final answer in at least the showcased example (Figure 6).
(4) Unsupervised discovery of a hard-to-define behavior: confidence
Novelty: The paper proposes a specific optimization (Eq. (7)) that identifies decoder directions associated with lower output entropy.
Significance: This moves beyond “discover features then label them,” showing a way to search the learned feature dictionary for directions that causally control an objective proxy (entropy/confidence).

5. Experimental Analysis¶

Evaluation methodology
Base models
- Main: DeepSeek-R1-Distill-Qwen-1.5B (R1-1.5B).
- Additional: Qwen3-8B (Figure 5).
SAE training data
- MATH dataset, 500 training examples (Section 4.2).
Behaviors evaluated
- Reflection, backtracking, and “others” (for geometry/visualization evaluation).
- Response length structure (Appendix E; Figures 11–12).
- Confidence via entropy proxy (Section 5).
Metrics
- Geometry/separability: UMAP visualization (Figure 2) and normalized silhouette score across layers (Figure 3; and length-based Figure 12).
- Causal behavior change: number of reflection/backtracking “steps” under interventions (Figure 5; Table 1; Section 5.1).
- For confidence: entropy-related token changes (Figure 7) and behavior-frequency changes; limited performance measurement on AIME25 (Section 5.1).
- For steering comparisons: accuracy and token usage (Figure 8) versus TIP and SEAL.
Main quantitative results (with numbers explicitly shown)
Intervention strength affects reflection steps (R1-1.5B, AIME25)
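- With \(\alpha \in \{-1.5, -1, 0, 1, 1.5\}\), the reported reflection step counts are {58.6, 73.6, 90.5, 131.0, 166.9}, respectively (Section 4.4.1).
- The counts rise monotonically with \(\alpha\), and \(\alpha = 0\) matches the unsteered count of about 90.5, which is consistent with a monotonic behavior-control effect. Note that larger \(\alpha\) here means more reflection, the opposite of the subtractive sign convention given for Eq. (6) in Section 3.4.7, so the sign of \(\alpha\) should be checked against the paper before reuse.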
Cross-domain generalization of reflection/backtracking steering (Table 1)
- SAE trained on MATH500, applied to new domains with R1-1.5B:
- GPQA-Diamond:
  - Reflection: Vanilla 53.23, Positive 62.77, Negative 45.42
  - Backtracking: Vanilla 11.83, Positive 20.31, Negative 6.47
- KnowLogic:
  - Reflection: Vanilla 35.56, Positive 51.00, Negative 25.99
  - Backtracking: Vanilla 5.42, Positive 9.38, Negative 2.33
Confidence vector reduces reflection/backtracking (AIME25) with small reported accuracy change
- Reflection steps reduced 90.53 → 33.77.
- Backtracking steps reduced 35.50 → 5.93.
- Accuracy changes 23.33% → 20.00% (difference of 1 question out of 30), and the paper argues this is not statistically significant (Section 5.1).
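As a sanity check on the "1 question out of 30" remark, the two accuracies correspond to 7 and 6 correct answers out of 30. The sketch below runs a two-sided Fisher exact test on those counts. This is an illustrative check added here, not an analysis from the paper, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x successes in the first row.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, col1 - row2), min(col1, row1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


correct_vanilla = round(0.2333 * 30)   # 7 of 30
correct_steered = round(0.2000 * 30)   # 6 of 30
p_value = fisher_exact_two_sided(correct_vanilla, 30 - correct_vanilla,
                                 correct_steered, 30 - correct_steered)
print(correct_vanilla, correct_steered, p_value)
```

With a one-question difference on 30 items the test cannot distinguish the two conditions, which supports reading the accuracy change as noise at this sample size rather than as evidence of harm or benefit.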
Confidence vector generalization across domains (Table 2)
- GPQA-Diamond:
  - Reflection: Vanilla 53.23, Positive 61.37, Negative 37.25
  - Backtracking: Vanilla 11.83, Positive 16.18, Negative 6.91
- KnowLogic:
  - Reflection: Vanilla 35.56, Positive 48.16, Negative 20.04
  - Backtracking: Vanilla 5.42, Positive 5.98, Negative 3.51
Steering method comparison (Figure 8)
- A “careful steering strategy” using the top-3 confidence vectors and test-time learned coefficients improves reasoning accuracy by up to 4.66 points and reduces token usage by 13.69% (relative to the compared baselines shown: TIP, SEAL, and Vanilla).
- The excerpt does not provide all absolute accuracy/token numbers in text, so only the stated deltas can be safely repeated.
Do the experiments support the claims?
Interpretability/geometry claim: Supported by Figure 2’s clustering visualization and Figure 3’s silhouette trends, with stronger separability in later layers.
Causality/controllability claim: Supported by intervention results:
- monotonic change with \(\alpha\) (Section 4.4.1),
- consistent behavior shifts across tasks and models (Figure 5),
- cross-domain behavior modulation (Table 1).
Novel behavior discovery claim (confidence): Supported by:
- concentrated high-scoring columns in a region of decoder space (Figure 7a),
- changes in frequent step-initial tokens away from “Wait/Alternatively” toward calculation tokens (Figure 7b–c),
- large reductions in reflection/backtracking step counts (Section 5.1) and domain generalization (Table 2).
Performance impact: The confidence intervention shows a small drop on AIME25 (23.33%→20.00%), while the more careful “top-3 with learned coefficients” method claims accuracy improvements and token reduction (Figure 8). The excerpt does not include statistical testing beyond the “1/30” remark.
Ablations / robustness checks present in the excerpt
Annotation robustness: multiple judges + keyword method agreement (Figure 10).
Domain shift: GPQA-Diamond and KnowLogic for both behavior vectors (Table 1) and confidence vectors (Table 2).
Layerwise analysis: silhouette across layers (Figure 3) and length structure (Figures 11–12).
Model generalization: results also shown for Qwen3-8B in Figure 5.

6. Limitations and Trade-offs¶

Dependence on a specific step segmentation scheme
Steps are defined by the delimiter <\n\n>. If a model formats reasoning differently (or if outputs are not neatly segmented), the “step activation” construction may be brittle or inconsistent.
SAE is trained on a single layer at a time
While the paper analyzes layerwise separability, the core SAE representation is layer-specific, which may limit a unified multi-layer interpretation unless multiple SAEs are trained.
Behavior labeling is still needed for evaluation and for selecting “behavior-specific” columns
The method is unsupervised in training, but identifying “reflection-specific” columns relies on human/LLM-labeled steps and filtering columns that activate across multiple behaviors (Section 4.4.1).
This introduces an implicit dependence on the chosen taxonomy (reflection, backtracking, others) for certain steering experiments.
Objective proxy risk for “confidence”
Confidence is operationalized via entropy minimization (Eq. (7)). Lower entropy is a proxy and may not always correspond to calibrated correctness; indeed, the excerpt reports a slight accuracy drop on AIME25 under the confidence vector intervention.
Scope of behaviors explored
The paper highlights reflection/backtracking and confidence and notes “others” is diffuse; it does not provide a comprehensive catalog of newly discovered behavior types beyond the confidence case study in the provided content.
Potential entanglement and oversmoothing effects
The silhouette analysis notes a decline near final layers (Figure 3 discussion), attributed to an oversmoothing phenomenon where representations become too similar. This could make late-layer features less cleanly separable.
Missing practical details for full reproduction (from the provided excerpt)
The excerpt does not specify several base-model inference/training parameters (e.g., context length/tokenizer, full architecture details), so exact reproduction depends on external model documentation not included here.

7. Implications and Future Directions¶

How this changes the landscape
It suggests a concrete mechanism for reasoning interpretability: reasoning behaviors can be treated as a sparse, linearly addressable dictionary in activation space, discoverable without supervised concept datasets.
It reframes reasoning “styles” (reflection/backtracking/length/confidence) as geometric structure that can be visualized, clustered, and intervened upon.
Follow-up research enabled
Richer unsupervised behavior discovery: Use alternative objectives (beyond entropy) to search decoder space for directions controlling other hard-to-define properties (e.g., persistence, exploration, error correction).
Multi-layer or temporal feature tracking: Since separability varies by layer (Figure 3), future work could connect features across layers or identify where in the network behaviors “emerge.”
Improved selection and disentanglement: Develop principled ways to choose behavior-specific columns without relying on labeled steps, or to quantify and mitigate cross-behavior entanglement.
Practical applications / downstream use cases
Test-time steering of reasoning style: Without retraining, practitioners can suppress or amplify specific behaviors by projecting out or adding components along SAE directions (Eq. (6)), potentially trading off verbosity vs self-checking.
Efficiency control: The paper’s length-structure finding (Appendix E; Figures 11–12) and the token-usage reduction reported for the confidence-based steering (Figure 8) point to using latent directions to reduce unnecessary token generation.
Domain-robust behavior modulation: Tables 1–2 indicate learned directions generalize from MATH-trained SAE vectors to GPQA/KnowLogic, suggesting portability of behavior controls.
Repro/Integration Guidance (based on provided paper content)
Prefer this method when:
- you want unsupervised discovery of internal behavior directions (no contrastive labeled dataset like DiffMean),
- you can obtain chain-of-thought traces and have a stable step delimiter (here <\n\n>),
- and you want inference-time control without fine-tuning or retraining the base model.
A minimal integration loop implied by the paper:
1. Generate CoT outputs on an unlabeled set (e.g., 500 MATH samples).
2. Extract per-step delimiter activations at a chosen layer (residual stream after the layer).
3. Train an SAE (e.g., \(D=2048\), Adam, lr \(10^{-4}\), \(\lambda=2\times10^{-3}\)).
4. Identify decoder directions of interest:
   - via evaluation labels for known behaviors (reflection/backtracking), and/or
   - via objective-based search such as entropy minimization to find confidence-related directions (Eq. (7)).
5. Apply steering by projection editing at each reasoning step during inference (Eq. (6)) and tune strength \(\alpha\).