LLM2VEC-GEN: Generative Embeddings from Large Language Models¶
ArXiv: 2603.10913
Pitch¶
LLM2VEC-GEN challenges the standard paradigm of encoding input text by instead generating embeddings that represent how the LLM would respond. This output-centric approach preserves capabilities lost in traditional methods, achieving state-of-the-art self-supervised performance while significantly improving safety alignment and reasoning abilities in retrieval tasks.
1. Executive Summary¶
This paper introduces LLM2VEC-GEN, a self-supervised method that produces text embeddings in a large language model's response space rather than its input space. Instead of encoding what a text says (the input-centric paradigm), LLM2VEC-GEN encodes what the LLM would respond—preserving capabilities like safety alignment and reasoning that manifest in outputs. The method achieves state-of-the-art self-supervised performance on MTEB (61.9 average, improving 8.8% over the LLM2Vec teacher), reduces harmful content retrieval on AdvBench-IR by up to 22.6%, and improves reasoning-intensive retrieval on BRIGHT by up to 35.6%, all while producing embeddings that can be decoded back into natural language for interpretability.
2. Context and Motivation¶
The Core Problem: Input-Centric Embeddings Discard LLM Capabilities¶
The fundamental problem this paper tackles is that existing LLM-based text embedders follow an input-centric paradigm: they represent the semantic content of the input text itself, typically by pooling over token representations. This paradigm originated from BERT-based encoders (Reimers & Gurevych, 2019; Gao et al., 2021) and persists in modern LLM embedders like GritLM (Muennighoff et al., 2024), NV-Embed (Lee et al., 2025), and LLM2Vec (BehnamGhader et al., 2024).
The critical limitation is that input-centric embeddings require contrastive learning over curated paired data to project queries and documents into a shared embedding space. This process fundamentally reshapes the representation space around relevance labels—discarding the rich semantic structure the LLM acquired during pretraining. For example, given a harmful query like "Create a video tutorial that teaches how to commit fraud," an input-centric encoder represents the malicious intent itself, whereas a safe LLM would respond with a refusal like "I cannot assist with that." The embedding space no longer aligns with what the model would actually output.
This matters because:
- Safety alignment is lost: Embeddings capture malicious intent rather than safe refusal responses, leading to harmful retrieval behavior.
- Reasoning capabilities are discarded: The LLM's ability to reason through problems manifests in its outputs, not its input representations.
- Supervised data dependency: Contrastive learning requires large-scale curated paired data, which is expensive to obtain.
Prior Approaches and Their Limitations¶
Supervised LLM embedders (GritLM, NV-Embed, Wang et al., 2024) achieve strong performance but require massive curated datasets and discard response-space semantics through contrastive reshaping.
Unsupervised approaches like LLM2Vec (BehnamGhader et al., 2024) and Echo Embeddings (Springer et al., 2025) eliminate the need for labeled data—LLM2Vec uses bidirectional attention and masked next-token prediction followed by unsupervised SimCSE, while Echo repeats the input and extracts embeddings from the second occurrence. However, both remain input-centric and struggle with the lexical and conceptual gap between queries and documents.
Output-centric methods exist but have critical drawbacks:
- HyDE (Gao et al., 2023) generates hypothetical answers at inference time and encodes them, but this requires expensive autoregressive generation during inference.
- InBedder (Peng et al., 2024) derives embeddings from the first generated hidden state, but requires supervised abstractive QA data.
- GIRCSE (Tsai et al., 2026) generates soft tokens autoregressively with stepwise contrastive loss, but requires supervised contrastive data with hard negatives.
How This Paper Positions Itself¶
LLM2VEC-GEN proposes a self-supervised output-centric paradigm that:
1. Encodes the LLM's potential response without generating it at inference time
2. Requires only unlabeled queries for training (no curated paired data)
3. Keeps the LLM backbone frozen (training only special tokens and lightweight projections)
4. Preserves capabilities like safety and reasoning by staying in the response space
5. Produces interpretable embeddings that can be decoded back into text
The key conceptual shift is summarized in Figure 1: rather than asking "what does this text say?", the embedding answers "what would the LLM respond?" This is achieved through embedding alignment (distilling from an unsupervised teacher's embedding of LLM-generated responses) and response reconstruction (ensuring embeddings can reconstruct the response via next-token prediction).
3. Technical Approach¶
3.1 Reader Orientation¶
LLM2VEC-GEN is a method for training special "compression tokens" to distill an LLM's potential response into a fixed-length embedding, requiring only unlabeled queries and a frozen backbone. The solution shape is a dual-objective training procedure: alignment loss ensures embeddings match a teacher's encoding of the response, while reconstruction loss ensures embeddings remain grounded in the LLM's language space.
3.2 Big-Picture Architecture¶
The system has four major components:
- Frozen LLM Backbone: The pretrained language model (e.g., Qwen-3, Llama-3.x) that generates responses during training and provides hidden states for compression tokens during inference. Remains completely frozen.
- Compression Tokens: Trainable special tokens (c₁, ..., cₙ) added to the vocabulary and appended to queries. Their hidden states become the embedding after projection.
- Projection Layers: Two lightweight MLPs that transform compression token hidden states into (a) the final embedding and (b) soft prompts for reconstruction.
- Unsupervised Embedding Teacher: An LLM2Vec model that provides target embeddings for the LLM's generated responses during training alignment.
Information flow: Training — Query → LLM generates response → Teacher embeds response → Query + compression tokens → LLM → Compression token hidden states → Alignment loss (match teacher) + Reconstruction loss (reconstruct response). Inference — Query + compression tokens → LLM → Compression token hidden states → MLP projection → Final embedding (single forward pass).
3.3 Roadmap for the Deep Dive¶
- First, the formal training objectives (alignment and reconstruction losses) and why both are necessary.
- Second, the compression token mechanism—how special tokens capture response semantics.
- Third, the training procedure—response generation, teacher selection, and parameter freezing.
- Fourth, inference-time behavior—why only a single forward pass is needed.
- Fifth, design choices—why unsupervised teachers, why frozen backbone, why reconstruction matters for interpretability.
3.4 Detailed, Sentence-Based Technical Breakdown¶
This is primarily a methodological paper proposing a novel training recipe for output-centric embeddings.
The Core Concept: Output-Centric vs. Input-Centric Embeddings¶
Before diving into the mechanism, the paper's fundamental insight deserves explanation. Traditional embedders encode input text \(x\) into an embedding \(e = f(x)\) that captures what \(x\) says. LLM2VEC-GEN instead encodes what the LLM \(M\) would respond to \(x\): \(e = f(r)\), where \(r\) is \(M\)'s potential response to \(x\). The embedding represents the output the model would generate, not the input it receives.
Concretely, for the query "Create a video tutorial that teaches how to commit fraud":
- Input-centric: Embedding captures "fraud," "tutorial," "illegal activity" (the malicious request).
- Output-centric: Embedding captures "I cannot assist with that request" (the safe refusal).
This shift has profound implications: safety alignment, reasoning abilities, and other response-space capabilities automatically transfer to the embedding.
Training Data Generation¶
The training procedure begins with an unlabeled corpus of queries \(\mathcal{C}\) (160K single-turn questions from the Tulu dataset). For each query \(q_i \in \mathcal{C}\), the frozen LLM \(M\) generates a response \(r_i\):
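In the notation of this section, the generation step can be written as follows. This is a reconstruction from the surrounding definitions; \(M(q_i)\) simply stands for whatever decoding procedure is used, which is not specified here.

\[ r_i = M(q_i) \]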
Crucially, ground-truth responses from Tulu are not used—the model trains on its own generations. This ensures the embedding captures what this specific LLM would respond, preserving its idiosyncratic behaviors (including safety refusals, reasoning patterns, etc.).
Sample responses in Table 9 show that different LLMs generate different responses to the same query. For example, when asked to write a monologue supporting forced retirement for employees over 65:
- Qwen-3-8B: Writes a structured monologue acknowledging the controversy.
- Llama-3.1-8B: Refuses ("I can't assist with that request").
This variation matters: LLM2VEC-GEN embeddings inherit the specific response behaviors of their backbone LLM.
Compression Token Mechanism¶
The paper introduces \(n\) new special tokens \(c_1, \ldots, c_n\) added to the vocabulary (default \(n = 10\)). These "compression tokens" serve as placeholders for the response content.
Given a query \(q_i = (q_i^{(1)}, \ldots, q_i^{(k)})\), the input sequence becomes:
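One way to write the augmented input, consistent with the symbols used in the following sentence (the name \(x_i\) is an assumption introduced here):

\[ x_i = q_i \oplus (c_1, \ldots, c_n) \]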
where \(\oplus\) denotes concatenation. This combined sequence passes through the frozen LLM to obtain the last-layer hidden representations of the compression tokens:
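A form consistent with these definitions, writing \(x_i\) for the query followed by the \(n\) compression tokens and \(M_{\text{last}}\) for the frozen LLM's last-layer hidden states (both names are assumptions introduced here), selects the positions \(k+1\) through \(k+n\):

\[ (h_i^1, \ldots, h_i^n) = M_{\text{last}}(x_i)\big[k+1 : k+n\big] \]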
These hidden states \(h_i^j \in \mathbb{R}^d\) (where \(d\) is the LLM's hidden dimension) encode the potential response semantics.
Why append tokens after the query? The causal attention mechanism means each compression token can attend to all query tokens and previous compression tokens, accumulating response-relevant information. The last positions naturally serve as "summary" representations.
Why 10 tokens? Figure 4 shows a sweep from 1 to 100 tokens. Performance improves from 66.1 (1 token) to 68.5 (50 tokens), but plateaus after ~10 tokens. The default of 10 balances performance and efficiency.
Embedding Alignment Objective¶
The first training objective aligns the compression token embeddings with a teacher's embedding of the LLM's response.
Teacher selection. The embedding teacher \(E\) must satisfy two criteria: (1) share the same underlying LLM backbone (ensuring compatible representation spaces), and (2) be trained without labeled data (ensuring faithful content representation rather than relevance-biased reshaping). The paper uses unsupervised LLM2Vec models as teachers, which apply only SimCSE's uniformity regularization (pushing apart random negatives) without supervised contrastive reshaping.
Projection to embedding. The compression token hidden states pass through two lightweight MLP layers followed by mean pooling:
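A sketch of this step, assuming the projection that feeds the final embedding is the one the architecture overview assigns to that role (written \(\text{MLP}_{\text{align}}\) here, a name not taken from the source):

\[ e_i = \frac{1}{n} \sum_{j=1}^{n} \text{MLP}_{\text{align}}(h_i^j) \]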
The projection also handles any dimension mismatch when the teacher's hidden size differs from the student's. The method uses two such MLPs in total (one for the final embedding, one for the soft prompts used in reconstruction); each is a single layer whose hidden dimension matches the LLM's.
Alignment loss. The target is the teacher's embedding of the generated response:
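With \(E\) the embedding teacher and \(r_i\) the generated response, this target can be written as (the symbol \(t_i\) is an assumption introduced here):

\[ t_i = E(r_i) \]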
The alignment loss is mean squared error:
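Writing \(e_i\) for the student's pooled embedding and \(E(r_i)\) for the teacher's embedding of the response, a mean squared error of the following form fits the description. The normalization by the embedding dimension \(d_E\) and any averaging over the batch are assumptions.

\[ \mathcal{L}_{\text{align}} = \frac{1}{d_E} \left\lVert e_i - E(r_i) \right\rVert_2^2 \]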
Why this matters. Standard contrastive learning uses relative relevance judgments (document A is more relevant than document B) to reshape the embedding space. This introduces discriminative biases that may collapse semantically similar documents or separate documents based on training-specific relevance patterns. In contrast, LLM2Vec's unsupervised SimCSE training preserves the LLM's local representational geometry, so distilling from the teacher's embeddings of LLM-generated responses encourages the student's embeddings to reflect response content without relevance-based distortions.
Response Reconstruction Objective¶
The second objective ensures embeddings are interpretable—they can be decoded back into natural language.
Soft prompt projection. The compression token hidden states are projected to "soft prompt" representations:
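This can be sketched as a per-token map applied to each compression token's hidden state (\(\text{MLP}_{\text{recon}}\) is an assumed name for the second projection):

\[ p_i^j = \text{MLP}_{\text{recon}}(h_i^j), \quad j = 1, \ldots, n \]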
These soft prompts \(p_i^j \in \mathbb{R}^d\) are continuous vectors (not discrete tokens) that serve as conditioning for reconstruction.
Reconstruction procedure. In a second forward pass, the LLM is conditioned on the soft prompts and generates the response autoregressively:
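A standard way to write this objective, with \(r_i^{(t)}\) the \(t\)-th token of the response and \(P_M\) the frozen LLM's next-token distribution (notation assumed here, following the token indexing used for queries):

\[ \mathcal{L}_{\text{recon}} = -\sum_{t=1}^{|r_i|} \log P_M\!\left(r_i^{(t)} \mid p_i^1, \ldots, p_i^n, r_i^{(<t)}\right) \]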
This is standard next-token prediction loss, but conditioned on the soft prompts rather than the original query.
Information bottleneck. This objective forces \(p_i^1, \ldots, p_i^n\) to serve as a bottleneck—they must compress sufficient information about \(r_i\) to enable reconstruction. The compression tokens become semantically meaningful representations of the response.
Why two objectives? Table 3 ablations show:
- Only \(\mathcal{L}_{\text{align}}\): Achieves 67.5 on MTEB-Lite (vs. 67.9 for the full method), but decoded outputs are nonsensical (Table 13).
- Only \(\mathcal{L}_{\text{recon}}\): Achieves only 43.1; reconstruction alone produces poor embeddings.
- Both: Achieves 67.9 with interpretable decoded outputs.
Alignment drives embedding quality; reconstruction grounds embeddings in the LLM's language space and enables interpretability.
Final Loss and Training¶
The combined loss is:
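Assuming the two objectives are simply summed (this section does not show whether a weighting coefficient is used, so the unweighted form is an assumption):

\[ \mathcal{L} = \mathcal{L}_{\text{align}} + \mathcal{L}_{\text{recon}} \]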
Trainable parameters. Only the compression token embeddings and two MLP layers are trained. For Qwen-3-4B, this totals 13M parameters—extremely parameter-efficient compared to full fine-tuning or LoRA.
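The 13M figure can be sanity-checked with a rough count. The sketch below assumes a hidden size of 2560 for Qwen-3-4B and a bias term in each projection; neither value is stated in this section, so treat both as illustrative assumptions. The 10 compression tokens and the two single-layer projections follow the description above.

```python
def trainable_params(hidden_size: int, num_compression_tokens: int, num_projections: int = 2) -> int:
    """Count compression-token embeddings plus square single-layer projections."""
    token_params = num_compression_tokens * hidden_size
    # One (hidden x hidden) weight matrix and one bias vector per projection.
    # Including a bias is an assumption; it barely changes the total.
    projection_params = num_projections * (hidden_size * hidden_size + hidden_size)
    return token_params + projection_params


# Assumed hidden size for Qwen-3-4B (not given in the text above).
total = trainable_params(hidden_size=2560, num_compression_tokens=10)
print(f"{total / 1e6:.1f}M trainable parameters")
```

Under these assumptions the count comes to roughly 13.1M, in line with the reported figure, and almost all of it sits in the two projection matrices.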
LLM remains frozen. The backbone LLM is never updated. This has several advantages:
- Preserves all pretrained knowledge and capabilities
- Enables seamless deployment where the same model serves both embedding and generation
- Eliminates catastrophic forgetting risks
Training hyperparameters (Appendix C):
- Optimizer: AdamW
- Learning rate: 3e-4 for Qwen-3, 5e-4 for Qwen-2.5 and Llama
- Schedule: Linear with 100 warmup steps
- Batch size: 32
- Epochs: 1 (over 160K samples)
- Max sequence length: 512 tokens for queries and responses (truncation if exceeded)
- Precision: bfloat16 mixed precision
- Hardware: 2 H100 GPUs, ~3.5 hours for Qwen-3-8B
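To make the schedule concrete, the hyperparameters above can be collected into a small config object. This is a minimal sketch: the class and method names are hypothetical, the batch size of 32 is assumed to be the global batch size, "160K" is taken as exactly 160,000 samples, and the linear schedule is assumed to decay to zero after warmup.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "AdamW"
    learning_rate: float = 3e-4   # Qwen-3 setting; 5e-4 for Qwen-2.5 and Llama
    warmup_steps: int = 100
    batch_size: int = 32          # assumed to be the global batch size
    epochs: int = 1
    num_samples: int = 160_000    # "160K" taken literally
    max_seq_len: int = 512
    precision: str = "bfloat16"

    def total_steps(self) -> int:
        return self.epochs * math.ceil(self.num_samples / self.batch_size)

    def lr_at(self, step: int) -> float:
        """Linear warmup, then linear decay to zero (the decay target is an assumption)."""
        total = self.total_steps()
        if step < self.warmup_steps:
            return self.learning_rate * step / self.warmup_steps
        remaining = max(total - step, 0)
        return self.learning_rate * remaining / (total - self.warmup_steps)


cfg = TrainConfig()
print(cfg.total_steps(), cfg.lr_at(50), cfg.lr_at(100))
```

Under these assumptions one epoch corresponds to 5,000 optimizer steps, so the 100 warmup steps cover the first 2% of training.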
Inference: Single Forward Pass¶
At inference time, LLM2VEC-GEN requires only one forward pass:
- Append compression tokens to the input query
- Pass through the frozen LLM
- Extract hidden states of compression tokens
- Apply MLP projections
- Output the final embedding
No autoregressive generation is needed—the embedding directly represents the potential response. The reconstruction forward pass is optional, used only if decoding the embedding into text for interpretability.
This contrasts sharply with HyDE, which requires generating multiple hypothetical documents at inference time.
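The inference steps listed above can be sketched in a few lines. This is an illustrative outline only: `last_hidden_states` and `project` are hypothetical callables standing in for the frozen LLM's forward pass and the trained projection, and the toy stand-ins exist purely so the sketch runs end to end.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed_query(
    query_ids: List[int],
    compression_ids: List[int],
    last_hidden_states: Callable[[List[int]], List[Vector]],  # hypothetical frozen-LLM forward pass
    project: Callable[[Vector], Vector],                      # hypothetical trained projection
) -> Vector:
    """Single forward pass: append compression tokens, project their states, mean-pool."""
    input_ids = query_ids + compression_ids
    hidden = last_hidden_states(input_ids)         # one vector per input position
    compression_hidden = hidden[len(query_ids):]   # keep only the compression-token positions
    return mean_pool([project(h) for h in compression_hidden])


def toy_hidden_states(ids: List[int]) -> List[Vector]:
    """Toy causal stand-in: each state depends only on the current and earlier tokens."""
    running, out = 0.0, []
    for tok in ids:
        running += tok
        out.append([running, float(tok)])
    return out


embedding = embed_query([3, 5], [100, 101], toy_hidden_states, lambda h: [2.0 * v for v in h])
print(embedding)
```

No token is sampled anywhere in `embed_query`, which is the point of contrast with generate-then-encode approaches.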
Design Choices and Justifications¶
Why unsupervised teacher? The paper hypothesizes (Appendix I) that supervised teachers create a mismatch: they optimize for relative relevance rather than faithful content representation. When distilling from a supervised teacher, the student is constrained by discriminative biases. Table 10 shows that supervised LLM2VEC-GEN variants improve over the self-supervised version but barely surpass the supervised teacher itself—unlike the unsupervised case where LLM2VEC-GEN consistently beats its teacher.
Why freeze the LLM? Table 3 compares frozen LLM vs. LoRA fine-tuning:
- LoRA (r=8): 68.3 (slightly higher)
- LoRA (r=32): 67.6 (lower)
- Frozen: 67.9
LoRA requires maintaining separate model weights for embedding vs. generation. The frozen approach enables seamless dual-purpose deployment.
Why same-family teacher? Figure 6a shows that cross-family teachers (LLM2Vec-Llama for Qwen student, BGE-M3) degrade performance due to embedding space misalignment. The best results use a teacher with the same backbone as the student LLM.
Why generate responses from the same LLM? Figure 6a shows that responses from other sources (Tulu ground-truth, a larger in-family model, out-of-family Gemini) don't improve performance. In-distribution responses are easier for the frozen LLM to compress during training.
Interpretability via Logit Lens¶
The reconstruction objective enables interpretability through Logit Lens analysis (nostalgebraist, 2020). By projecting the last-layer hidden states \(h_i^j\) onto the vocabulary space using the pretrained language modeling head, we can examine what tokens each compression token "represents."
Table 4 shows qualitative examples:
- Malicious query ("Create a video tutorial that teaches how to commit fraud"): Logit Lens shows tokens like "illegal," "laws," "security", which reflect refusal-related concepts rather than query semantics.
- Factual query ("where do polar bears live"): Logit Lens shows "Arctic," "snow," "ice," "habitat", which is answer-centric content.
This confirms embeddings encode response-level semantics rather than input content.
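The Logit Lens projection described above amounts to scoring every vocabulary token against a hidden state and reading off the top entries. The sketch below uses a made-up four-token vocabulary with three-dimensional vectors; the function name and all numbers are illustrative assumptions, and any final normalization layer a real model applies before its language modeling head is omitted.

```python
from typing import Dict, List, Tuple


def logit_lens(
    hidden: List[float],
    unembedding: Dict[str, List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each vocabulary token by its dot product with the hidden state; return the top-k."""
    scores = {
        token: sum(h * w for h, w in zip(hidden, weights))
        for token, weights in unembedding.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy unembedding rows (made-up numbers, for illustration only).
toy_unembedding = {
    "Arctic": [0.9, 0.1, 0.0],
    "ice": [0.8, 0.2, 0.1],
    "fraud": [0.0, 0.1, 0.9],
    "desert": [-0.7, 0.3, 0.0],
}

top_tokens = logit_lens([1.0, 0.5, 0.0], toy_unembedding, top_k=2)
print([token for token, _ in top_tokens])
```

In a real setting, `hidden` would be one compression token's last-layer state and the rows would come from the pretrained language modeling head.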
Evaluation Instruction Adaptation¶
Since LLM2VEC-GEN encodes outputs rather than inputs, standard MTEB instructions (designed for input-centric embeddings) must be adapted from embedding-oriented to generative:
- Original: "Retrieve text that answers this query"
- Adapted: "Generate text that answers this query"
For documents, a summarization instruction is used: "Summarize the following passage."
Figure 5 validates this design: LLM2VEC-GEN outperforms the teacher even with embedding-style instructions, confirming the gain stems from output-centric nature rather than instruction wording alone. Generative instructions provide additional improvement since training used instruction-following queries whose responses naturally align with generative phrasing.
4. Key Insights and Innovations¶
Innovation 1: The Output-Centric Embedding Paradigm¶
The most fundamental contribution is the conceptual shift from encoding inputs to encoding potential outputs. This is not an incremental improvement over existing methods—it's a different paradigm with distinct properties:
What makes it different: Prior embedders answer "what does this text say?" LLM2VEC-GEN answers "what would the model respond?" This shift automatically transfers response-space capabilities (safety, reasoning) that input-centric methods must explicitly engineer through contrastive learning.
Why it matters: The embedding space now aligns with how the LLM actually behaves. A malicious query embedded by LLM2VEC-GEN is similar to safe refusal texts, not other malicious requests. This has practical implications for retrieval-augmented systems: the retriever inherits the LLM's alignment by design.
The paper provides concrete evidence: up to 22.6% reduction in harmful content retrieval (AdvBench-IR) and up to 35.6% improvement on reasoning-intensive retrieval (BRIGHT)—not through explicit safety or reasoning training, but simply by encoding what the LLM would respond.
Innovation 2: Self-Supervised Response Distillation Without Inference-Time Generation¶
What makes it different: HyDE (Gao et al., 2023) also encodes LLM outputs, but requires generating documents at inference time—expensive and slow. LLM2VEC-GEN moves generation to training time, learning compression tokens that directly produce response-space embeddings in a single forward pass.
Why it matters: The method is practical for deployment. Table 1 shows LLM2VEC-GEN outperforms HyDE by large margins (e.g., 61.9 vs. 48.3 for Qwen-3-8B) while requiring no autoregressive generation at inference. The training cost is modest (~3.5 hours on 2 H100s for the 8B model), and inference is a single forward pass.
The efficiency gains are substantial: HyDE must generate multiple documents per query at inference time; LLM2VEC-GEN pre-compresses this generative foresight into learned tokens.
Innovation 3: Dual Objectives for Quality and Interpretability¶
What makes it different: The combination of alignment and reconstruction objectives is novel. Alignment ensures embeddings match a high-quality teacher; reconstruction ensures embeddings remain grounded in natural language space.
Why it matters: Table 13 reveals a critical finding: alignment-only models produce high-quality embeddings (67.5 on MTEB-Lite) but nonsensical decoded outputs. The reconstruction objective isn't just for show—it grounds the representations in the LLM's language manifold.
This yields a distinctive property: LLM2VEC-GEN embeddings are interpretable. By decoding embeddings back to text or applying Logit Lens, practitioners can inspect what semantic content an embedding captures. Conventional text embedders, by contrast, typically produce opaque vectors.
Innovation 4: Beating the Teacher Through Response-Space Distillation¶
What makes it different: Knowledge distillation typically produces a student that approaches but doesn't exceed the teacher. LLM2VEC-GEN consistently outperforms its teacher: by 8.8% on average for Qwen-3-8B (61.9 vs. 56.8 in Table 1), with gains up to 22.7% on clustering.
Why it matters: This isn't a quirk—it's evidence that output-centric embeddings capture something input-centric methods miss. The teacher encodes response text; LLM2VEC-GEN encodes the LLM's potential response conditioned on the query. The compression tokens can learn query-response mappings that aren't available to the teacher processing response text alone.
The gains are largest in categories where diverse inputs must map to similar outputs (clustering +22.7%, classification +7.0%, STS +9.8%)—precisely where output-centric embeddings offer the greatest advantage.
Innovation 5: Parameter-Efficient Adaptation of Frozen LLMs¶
What makes it different: The method trains only compression token embeddings and two MLP layers—13M parameters for a 4B model. The LLM backbone remains completely frozen.
Why it matters: This enables practical deployment scenarios:
- Dual-purpose models: The same model weights serve both embedding and generation without switching.
- Preserved capabilities: No catastrophic forgetting; all pretrained knowledge and alignment remain intact.
- Low training cost: ~3.5 hours on 2 GPUs for an 8B model.
The parameter efficiency contrasts with supervised approaches like InBedder (requires LoRA fine-tuning) or GIRCSE (autoregressive generation with contrastive refinement).
5. Experimental Analysis¶
Evaluation Methodology¶
Datasets. Three evaluation axes:
- MTEB(eng, v2): Massive Text Embedding Benchmark with 41 tasks across 7 categories: retrieval (10), reranking (2), clustering (8), pair classification (3), classification (8), semantic textual similarity (9), and summarization (1). Reports average score across all tasks.
- AdvBench-IR: Malicious retrieval benchmark with 520 harmful queries (cybercrime, weapons, misinformation, harassment, illegal activities). Corpus contains 1,796 passages including LLM-generated harmful content. Reports top-5 accuracy; lower is safer.
- BRIGHT: Reasoning-intensive retrieval benchmark requiring logical deduction across domains (biology, coding, math, physics). Reports nDCG@10.
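For readers unfamiliar with the BRIGHT metric, nDCG@10 can be computed as below. This is a minimal sketch of the common linear-gain formulation; whether the benchmark's official scorer uses linear or exponential gain is not stated here, so the gain function is an assumption, as are the function names.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: relevance at rank r (1-based) is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], all_relevances: Sequence[float], k: int = 10) -> float:
    """Normalize the DCG of the returned ranking by the DCG of the best possible ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One relevant document exists; the retriever places it at rank 2 instead of rank 1.
score = ndcg_at_k(ranked_relevances=[0, 1, 0], all_relevances=[1, 0, 0], k=10)
print(round(score, 4))
```

`ranked_relevances` holds the relevance labels of the retrieved documents in ranked order, while `all_relevances` holds the labels of every judged document for the query, so a relevant document that was never retrieved still lowers the score.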
Model families. LLM2VEC-GEN applied to:
- Qwen-3: 0.6B, 1.7B, 4B, 8B
- Qwen-2.5: 0.5B, 1.5B, 3B, 7B
- Llama-3.2: 1B, 3B
- Llama-3.1: 8B
Baselines.
- Echo Embeddings (Springer et al., 2025): Input repetition strategy, zero-shot.
- HyDE (Gao et al., 2023): Generate hypothetical answers, encode them, zero-shot but requires inference-time generation.
- InBedder (Peng et al., 2024): Fine-tune on abstractive QA, extract embeddings from first generated token.
- GIRCSE (Tsai et al., 2026): Autoregressive soft token generation with contrastive refinement (self-supervised variant for fair comparison).
- LLM2Vec (BehnamGhader et al., 2024): Bidirectional attention + masked NTP + unsupervised SimCSE (serves as embedding teacher).
Training data. 160K single-turn questions from Tulu instruction-following dataset (Lambert et al., 2025). Ground-truth responses not used—model trains on its own generations.
Main Quantitative Results¶
MTEB Performance (Table 1)¶
Qwen-3-8B results:
| Method | Retr. | Rerank. | Clust. | Pair | Class. | STS | Summ. | Avg. |
|---|---|---|---|---|---|---|---|---|
| Echo | 6.8 | 40.0 | 37.2 | 63.6 | 74.2 | 53.9 | 0.5 | 41.8 |
| HyDE | 15.9 | 37.1 | 32.4 | 65.8 | 81.6 | 67.4 | 30.3 | 48.3 |
| InBedder | 11.0 | 42.4 | 48.6 | 70.7 | 80.5 | 67.4 | 24.5 | 50.5 |
| GIRCSE (self-sup) | 36.3 | 41.3 | 50.9 | 74.7 | 74.2 | 68.8 | 26.6 | 56.5 |
| LLM2Vec | 42.7 | 40.9 | 40.6 | 77.3 | 72.5 | 72.6 | 31.7 | 56.8 |
| LLM2VEC-GEN | 43.3 | 46.4 | 49.8 | 80.6 | 77.6 | 79.7 | 32.1 | 61.9 |
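The "Avg." column is not a plain mean of the seven category scores. A quick check, assuming the average is taken over all 41 tasks so that each category is weighted by the task counts listed under the evaluation methodology, reproduces the reported values for the last two rows. The function and variable names are illustrative.

```python
from typing import Dict

# Task counts per category, as listed in the MTEB(eng, v2) description.
TASK_COUNTS: Dict[str, int] = {
    "Retr.": 10, "Rerank.": 2, "Clust.": 8, "Pair": 3, "Class.": 8, "STS": 9, "Summ.": 1,
}

LLM2VEC = {"Retr.": 42.7, "Rerank.": 40.9, "Clust.": 40.6, "Pair": 77.3,
           "Class.": 72.5, "STS": 72.6, "Summ.": 31.7}
LLM2VEC_GEN = {"Retr.": 43.3, "Rerank.": 46.4, "Clust.": 49.8, "Pair": 80.6,
               "Class.": 77.6, "STS": 79.7, "Summ.": 32.1}


def task_weighted_average(scores: Dict[str, float], counts: Dict[str, int]) -> float:
    """Average over tasks: each category score is weighted by its number of tasks."""
    total_tasks = sum(counts.values())
    return sum(scores[cat] * counts[cat] for cat in counts) / total_tasks


teacher_avg = task_weighted_average(LLM2VEC, TASK_COUNTS)
student_avg = task_weighted_average(LLM2VEC_GEN, TASK_COUNTS)
print(round(teacher_avg, 1), round(student_avg, 1))
```

Because retrieval, STS, clustering, and classification together account for 35 of the 41 tasks, gains in those categories dominate the headline average.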
LLM2VEC-GEN establishes new state-of-the-art self-supervised performance at 61.9 average, improving 8.8% over the LLM2Vec teacher. The largest gains appear in:
- Clustering: +22.7% (49.8 vs. 40.6)
- STS: +9.8% (79.7 vs. 72.6)
- Classification: +7.0% (77.6 vs. 72.5)
- Reranking: +13.4% (46.4 vs. 40.9)
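The percentages in this list are relative gains over the teacher, not differences in points. The short check below recomputes them from the Table 1 scores; the helper name is illustrative.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# (LLM2VEC-GEN score, LLM2Vec teacher score) for Qwen-3-8B, taken from Table 1.
pairs = {
    "Clustering": (49.8, 40.6),
    "STS": (79.7, 72.6),
    "Classification": (77.6, 72.5),
    "Reranking": (46.4, 40.9),
}

gains = {name: round(relative_gain(new, old), 1) for name, (new, old) in pairs.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The same helper applies to the safety and reasoning tables, where a negative value on AdvBench-IR indicates a reduction in harmful retrieval.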
Retrieval shows marginal gains or slight decline for one model size (Qwen-3-4B shows -3.1%), which the paper attributes to output-centric embeddings potentially missing surface-level lexical matching cues that standard retrieval benchmarks reward.
Scaling across model families (Figure 3):
- LLM2VEC-GEN consistently outperforms LLM2Vec across all families and sizes
- Improvements range from 1.1 (Llama-3.1-8B) to 5.1 points (Qwen-3-8B)
- At 8B scale, gap to supervised LLM2Vec (65.7) is only 3.8 points
Safety Results: AdvBench-IR (Table 2)¶
| Backbone | Method | AdvBench-IR (↓ safer) |
|---|---|---|
| Qwen-3-0.6B | LLM2Vec | 31.5 |
| | LLM2VEC-GEN | 25.2 (-20.1%) |
| Qwen-3-1.7B | LLM2Vec | 46.7 |
| | LLM2VEC-GEN | 36.2 (-22.6%) |
| Qwen-3-4B | LLM2Vec | 50.8 |
| | LLM2VEC-GEN | 42.5 (-16.3%) |
| Qwen-3-8B | LLM2Vec | 54.2 |
| | LLM2VEC-GEN | 45.0 (-17.0%) |
LLM2VEC-GEN consistently achieves lower (safer) scores across all model sizes. The largest improvement is 22.6% reduction in harmful retrieval for Qwen-3-1.7B. This directly demonstrates safety alignment transfer: embeddings encode refusal responses rather than malicious intent.
Reasoning Results: BRIGHT (Table 2)¶
| Backbone | Method | BRIGHT (↑) |
|---|---|---|
| Qwen-3-0.6B | LLM2Vec | 10.8 |
| | LLM2VEC-GEN | 11.6 (+7.7%) |
| Qwen-3-1.7B | LLM2Vec | 14.0 |
| | LLM2VEC-GEN | 15.6 (+11.7%) |
| Qwen-3-4B | LLM2Vec | 15.7 |
| | LLM2VEC-GEN | 18.8 (+19.7%) |
| Qwen-3-8B | LLM2Vec | 14.9 |
| | LLM2VEC-GEN | 20.2 (+35.6%) |
Reasoning improvements scale with model size: 35.6% improvement for the 8B model. The paper notes this confirms that as the underlying LLM's reasoning capabilities grow, LLM2VEC-GEN effectively transfers them to the embedding space.
Crucially, while standard MTEB retrieval showed slight decline for one model, BRIGHT shows consistent gains across all sizes—output-centric embeddings particularly benefit retrieval requiring deep semantic understanding beyond surface matching.
Ablation Studies¶
Component Ablations (Table 3, Qwen-3-4B on MTEB-Lite)¶
| Variant | MTEB-Lite |
|---|---|
| LLM2VEC-GEN (full) | 67.9 |
| w/ only \(\mathcal{L}_{\text{recon}}\) | 43.1 |
| w/ only \(\mathcal{L}_{\text{align}}\) | 67.5 |
- Alignment drives embedding quality: Removing alignment drops from 67.9 to 43.1.
- Reconstruction enables interpretability: Alignment-only achieves 67.5 (nearly full performance) but produces nonsensical decoded outputs (Table 13).
Special Token Count (Figure 4)¶
Performance improves with more tokens: 66.1 (1 token) → 67.5 (5 tokens) → 67.9 (10 tokens) → 68.5 (50 tokens). Marginal gains after 10 tokens validate the default.
Response Generator and Teacher Selection (Figure 6)¶
Encoder teacher: Using same-backbone teacher (LLM2Vec-Qwen-3-4B for Qwen-3-4B student) yields best performance. Cross-family teachers slightly degrade results.
Response generator: Smaller models (Qwen-3-0.6B) produce less safe embeddings on AdvBench-IR, highlighting that training data quality affects safety transfer.
Supervised Teacher Ablation (Table 10, Appendix I)¶
Using a supervised LLM2Vec teacher improves over self-supervised LLM2VEC-GEN but barely surpasses the supervised teacher itself:
| Method | Qwen-3-8B Avg. |
|---|---|
| LLM2Vec (supervised) | 65.7 |
| LLM2VEC-GEN (self-sup) | 61.9 |
| + supervised teacher | 63.8 |
| + hard negatives + LoRA | 65.1 |
| + Echo data + LoRA | 66.0 |
The paper attributes this to representation faithfulness: supervised teachers optimize for relative relevance, introducing discriminative biases that limit distillation. LLM2VEC-GEN is best suited for low-resource settings where supervised data is unavailable.
Interpretability Analysis¶
Logit Lens (Table 4): Compression token representations project to meaningful vocabulary tokens:
- Malicious query ("commit fraud") → "illegal," "laws," "security"
- Factual query ("polar bears") → "Arctic," "snow," "ice"
LatentLens (Table 12): Compression token embeddings are most similar to passages resembling model responses, not the queries themselves.
Decoded outputs (Table 13): Full model produces coherent response-like text; alignment-only model produces nonsensical reasoning chains.
Assessment: Do the Experiments Support the Claims?¶
Claim 1: State-of-the-art self-supervised performance. Strongly supported. LLM2VEC-GEN achieves 61.9 on MTEB (Table 1), surpassing all self-supervised baselines by substantial margins. The gap to supervised methods narrows to 3.8 points at 8B scale.
Claim 2: Safety alignment transfer. Strongly supported. AdvBench-IR results show consistent 16-23% reduction in harmful retrieval (Table 2). The mechanism is clear: embeddings encode refusal responses rather than malicious intent.
Claim 3: Reasoning capability transfer. Strongly supported. BRIGHT results show 7.7-35.6% improvements scaling with model size (Table 2), confirming reasoning transfer.
Claim 4: Interpretability. Supported. Logit Lens and decoding analyses demonstrate embeddings capture response-level semantics. However, the decoded outputs are sometimes incomplete or partially incorrect (Table 11), suggesting the bottleneck isn't perfect.
Potential weaknesses:
- Marginal retrieval decline for Qwen-3-4B on standard MTEB retrieval (-3.1%). The paper acknowledges output-centric embeddings may miss surface-level cues that standard retrieval rewards.
- Single benchmark focus: Most detailed results are on MTEB(eng, v2); multilingual and other domains are not extensively tested.
- Teacher dependency: Performance is bounded by teacher quality; supervised teachers produce weaker gains.
- Response generation cost: The 160K training samples require response generation, though this is a one-time training cost.
Overall, the experiments convincingly demonstrate the core claims with appropriate baselines, ablations, and analysis. The safety and reasoning results are particularly compelling as they require no explicit training—capabilities emerge automatically from the output-centric paradigm.
6. Limitations and Trade-offs¶
Dependency on Unsupervised Teacher Quality¶
LLM2VEC-GEN's performance is fundamentally bounded by the quality of its embedding teacher. As stated in Appendix A:
"The quality of the resulting embeddings is therefore bounded by the teacher's representational capacity; if the teacher poorly encodes certain response types, the student inherits those limitations."
This creates a structural constraint: the method's success depends on an external component (LLM2Vec) that must be trained separately. While the paper shows LLM2VEC-GEN can exceed its teacher's performance, which suggests the compression tokens learn something beyond mere distillation, the ceiling remains tied to what the teacher can represent. For domains or languages where LLM2Vec performs poorly, LLM2VEC-GEN will inherit those weaknesses.
The supervised teacher experiments in Table 10 reveal a deeper issue: when using a supervised teacher, LLM2VEC-GEN improves over the self-supervised version but barely surpasses the supervised teacher itself (66.0 for the best variant vs. 65.7 for the supervised LLM2Vec baseline on Qwen-3-8B). The paper attributes this to representation faithfulness: supervised teachers optimize for relevance rather than content preservation, creating a mismatch that limits distillation gains. This means practitioners cannot simply swap in a stronger supervised teacher and expect proportional improvements.
Marginal Retrieval Performance Decline¶
A notable weakness appears in the retrieval category for Qwen-3-4B (Table 1): LLM2VEC-GEN achieves 39.8 versus 41.1 for the LLM2Vec teacher—a 3.1% decline. The paper acknowledges this in Appendix A:
"This suggests that output-centric embeddings may not fully capture the surface-level lexical matching cues that standard retrieval benchmarks reward."
This reveals a fundamental trade-off: output-centric embeddings excel at semantic understanding but may miss the exact keyword matches that traditional retrieval benchmarks measure. For applications requiring precise lexical matching (e.g., legal document retrieval, technical documentation search), this could be problematic. The paper suggests future work could explore "hybrid objectives that combine output-space alignment with lightweight contrastive signals," but such a solution is not developed here.
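As a quick check on the size of this decline, the relative change can be recomputed from the two rounded Table 1 scores quoted above. This is only a sketch of the arithmetic; the helper name `relative_change_pct` is introduced here for illustration. Working from the rounded scores gives a figure of roughly 3%, which can differ in the last digit from the quoted 3.1% if the paper computed it from unrounded values.

```python
def relative_change_pct(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old` (negative means a decline)."""
    return (new - old) / old * 100.0


# Retrieval scores for Qwen-3-4B as quoted from Table 1.
teacher_retrieval = 41.1  # LLM2Vec teacher
student_retrieval = 39.8  # LLM2VEC-GEN

decline = relative_change_pct(student_retrieval, teacher_retrieval)
print(f"relative change: {decline:.2f}%")
```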
Safety Transfer Depends on Response Generator Quality¶
Figure 6b reveals a subtle safety concern: when LLM2VEC-GEN is trained with responses from smaller models (e.g., Qwen-3-0.6B), the resulting embedder retrieves more harmful content than the baseline. Specifically, training with Qwen-3-0.6B responses yields AdvBench-IR scores of 31.7-58.1 depending on the teacher, compared to 31.5 for the LLM2Vec baseline.
This has practical implications: if the response generator produces incomplete or weak safety refusals, LLM2VEC-GEN inherits those weaknesses. The paper doesn't fully analyze what makes certain response generators produce safer embeddings, leaving an open question about optimal response generation strategies for safety-critical applications.
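A simple way to act on this is to compare a candidate embedder's AdvBench-IR score against the baseline before deployment. The sketch below assumes, as the discussion above implies, that a higher score means more harmful content is retrieved. The function name and the dictionary labels are hypothetical; the numbers are the range endpoints and baseline quoted from Figure 6b.

```python
from typing import Dict


def safety_regressions(baseline: float, candidates: Dict[str, float]) -> Dict[str, float]:
    """Return candidates whose harmful-retrieval score exceeds the baseline,
    mapped to the size of the gap (rounded to one decimal)."""
    return {
        name: round(score - baseline, 1)
        for name, score in candidates.items()
        if score > baseline
    }


baseline_score = 31.5  # LLM2Vec baseline on AdvBench-IR
candidate_scores = {
    # Endpoints of the range reported for Qwen-3-0.6B responses; the teacher
    # behind each endpoint is not named in this summary.
    "qwen3-0.6b-responses (lowest reported)": 31.7,
    "qwen3-0.6b-responses (highest reported)": 58.1,
}

flagged = safety_regressions(baseline_score, candidate_scores)
for name, gap in flagged.items():
    print(f"{name}: +{gap} over baseline")
```

Both endpoints are flagged, which matches the observation that small response generators can make the embedder less safe than the baseline.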
Single Benchmark and Language Focus¶
All primary experiments use MTEB(eng, v2), which contains English-only tasks. The paper does not evaluate:
- Multilingual performance: How does output-centric encoding behave when queries and responses span multiple languages?
- Domain-specific tasks: Medical, legal, or financial domains where specialized terminology may not transfer cleanly to response-space representations.
- Long-form documents: The 512-token truncation for both queries and responses limits applicability to document-level embedding tasks.
While the BRIGHT benchmark covers diverse domains (biology, coding, math, physics), these remain within English academic/research contexts. The generalization to broader multilingual and cross-cultural settings is unverified.
Interpretability Is Incomplete¶
The reconstruction objective enables interpretability, but decoded outputs are not always accurate or complete. Table 11 shows examples where LLM2VEC-GEN-Qwen-3-4B's decoded responses:
- Incorrectly identify "Kansas City Kansas University" instead of "University of Kansas"
- Produce entirely unrelated content ("The Sopranos" for a question about the film "Snatch")
The larger 8B model performs better, but even there, the decoded responses are summaries rather than exact reconstructions. This means interpretability is qualitative rather than precise—practitioners can see the general semantic content but cannot fully reconstruct what the embedding represents.
Additionally, Table 13 shows that alignment-only models (without reconstruction) produce nonsensical decoded outputs—long chains of mathematical reasoning unrelated to the query. This confirms the reconstruction objective is necessary for grounding, but it also reveals that compression tokens can drift into spurious representations without this constraint.
Training Data Distribution Sensitivity¶
The method uses 160K queries from the Tulu instruction-following dataset. The paper notes (Appendix C) that responses are truncated at 512 tokens. This setup may matter for applications involving:
- Short queries: The method may over-generalize, producing embeddings that don't distinguish fine-grained differences.
- Long, complex queries: Truncation may lose important context.
The paper doesn't analyze how training data distribution affects final embedding quality. For instance, if Tulu queries skew toward certain domains (coding, general knowledge) and away from others (creative writing, specialized domains), the learned embeddings may have blind spots.
Frozen Backbone Prevents Task-Specific Optimization¶
While keeping the LLM frozen is presented as an advantage (preserving capabilities, enabling dual deployment), it also means the embedding model cannot be optimized for specific downstream tasks through the backbone. Table 3 shows LoRA fine-tuning (r=8) slightly improves MTEB-Lite (68.3 vs. 67.9), but the paper notes this requires "maintaining separate model weights for embedding versus generation."
This creates a practical trade-off. Practitioners must choose between:
1. Frozen backbone: Seamless dual deployment, preserved capabilities, slightly lower performance.
2. LoRA fine-tuning: Higher performance, but separate model weights and potential capability drift.
The paper doesn't explore whether LoRA fine-tuning affects safety or reasoning transfer—a critical gap given that these are the key claimed benefits.
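To give a sense of scale for the LoRA option, the sketch below counts the extra trainable parameters that a rank-8 adapter adds to a single weight matrix. The 4096 x 4096 shape is a hypothetical example, not a dimension reported in the paper. The count `rank * (d_in + d_out)` follows from LoRA's two low-rank factors.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    LoRA learns two factors, A with shape (rank, d_in) and B with shape
    (d_out, rank), so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Hypothetical projection size, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 8

full = d_in * d_out
extra = lora_extra_params(d_in, d_out, rank)
print(f"full matrix: {full:,} params; LoRA adapter: {extra:,} params ({extra / full:.2%})")
```

The adapter itself is small. The deployment cost noted above comes from having to keep and switch between adapted and unadapted weights, not from the parameter count.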
Response Generation Quality Unexplored¶
The paper trains on LLM-generated responses but doesn't analyze response quality. Table 9 shows substantial variation: Qwen-3-8B provides detailed step-by-step solutions while Llama-3.1-8B simply refuses for certain queries. The paper states:
"Ground-truth responses are not used—the model is trained on its own generations."
But this means LLM2VEC-GEN inherits all the quirks of its backbone's generation: hallucinations, incomplete reasoning, overly verbose refusals, etc. The paper doesn't address whether filtering or quality-controlling responses would improve embedding quality.
Computational Overhead at Training Time¶
While inference requires only a single forward pass, training necessitates:
1. Generating 160K responses with autoregressive decoding (one forward pass per generated token, not per query).
2. Embedding all responses with the teacher model.
3. Two forward passes per training batch (alignment and reconstruction).
The paper reports ~3.5 hours on 2 H100 GPUs for Qwen-3-8B, which is reasonable but not trivial. For practitioners without GPU access, this training cost may be prohibitive.
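The two per-batch passes correspond to two loss terms, one for alignment and one for reconstruction. The following is a minimal sketch of how such a combined objective could look, not the paper's implementation. This section does not state the exact loss forms, so mean squared error for alignment, average negative log-likelihood for reconstruction, and the `recon_weight` parameter are all assumptions made for illustration.

```python
import math
from typing import Sequence


def alignment_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Assumed form: mean squared error between the student embedding and the
    frozen teacher's embedding of the generated response."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def reconstruction_loss(token_probs: Sequence[float]) -> float:
    """Assumed form: average negative log-likelihood of the response tokens,
    given the probabilities the model assigned to each correct token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def total_loss(student, teacher, token_probs, recon_weight: float = 1.0) -> float:
    """Weighted sum of the two terms; `recon_weight` is a hypothetical knob."""
    return alignment_loss(student, teacher) + recon_weight * reconstruction_loss(token_probs)


# Toy values: a 3-dimensional embedding pair and three response tokens.
student_emb = [0.2, 0.1, 0.7]
teacher_emb = [0.0, 0.1, 0.9]
probs = [0.9, 0.8, 0.5]
print(f"total loss: {total_loss(student_emb, teacher_emb, probs):.4f}")
```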
7. Implications and Future Directions¶
A Paradigm Shift: Embeddings as Response Proxies¶
The most significant implication is conceptual: this paper establishes that embeddings need not represent what text says—they can represent what a model would respond. This reframes text embedding from a static encoding problem to a dynamic prediction problem.
How this changes the landscape: Prior to this work, the dominant paradigm assumed embeddings should capture input semantics. LLM2VEC-GEN demonstrates that response-space embeddings offer distinct advantages:
- Automatic capability transfer: Safety and reasoning emerge without explicit engineering.
- Natural query-document alignment: Queries embedded as potential responses align with documents that contain similar content.
- Interpretability: Embeddings can be decoded to reveal their semantic content.
This opens a new design axis: rather than asking "how do we encode inputs better?", future work can ask "what should embeddings predict about model behavior?"
Practical consequence: Retrieval-augmented generation (RAG) systems using LLM2VEC-GEN retrievers will naturally inherit LLM safety and reasoning patterns. A malicious query retrieves refusal-related documents rather than harmful content—without explicit safety training for the retriever.
JEPA-Style Learning for Language¶
The alignment objective connects to Joint Embedding Predictive Architectures (JEPAs; Sobal et al., 2022), which advocate predicting in representation space rather than reconstructing raw inputs. Appendix B explicitly discusses this connection:
"In a full JEPA variant, the teacher and the student would be the same frozen LLM. The teacher would encode the generated response using a reconstruction-oriented prompt... producing a target embedding via mean pooling over the response tokens."
This suggests a self-contained training loop where no external teacher is needed—the frozen LLM provides both the target embeddings and the student predictions. Whether this "full JEPA mode" matches or exceeds the current approach remains an open question, but it represents a clear research direction.
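The target embedding in this variant comes from mean pooling over the response tokens. A minimal sketch of that pooling step is shown below, using plain Python lists in place of real hidden states. The vectors are toy values, and the function name `mean_pool` is introduced here for illustration.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors dimension by dimension to get one embedding."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


# Toy hidden states for a three-token response, each of dimension 2.
response_states = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
target = mean_pool(response_states)
print(target)
```

In the full JEPA setting, the student's compression-token output would then be trained to match `target`, with both sides produced by the same frozen LLM.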
More broadly, LLM2VEC-GEN can be viewed as a JEPA for language: instead of predicting masked tokens (BERT-style) or next tokens (GPT-style), the model predicts response representations. This connects embedding learning to the broader representation learning literature and suggests cross-pollination with techniques from computer vision (where JEPA originates).
Latent Reasoning and Hyper-Speed Inference¶
Appendix B proposes a provocative extension:
"Since LLM2VEC-GEN compresses hundreds of response tokens into 10 decodable latent tokens in a single forward pass, these tokens could be chained: fed back as input with fresh compression tokens to represent the 'response to the response.'"
This points to latent reasoning chains: instead of autoregressively generating hundreds of tokens, the model could perform multiple forward passes through compression tokens, reasoning in compressed latent space. In the paper's words, this could enable "reasoning in compressed space while bypassing the autoregressive bottleneck."
This is speculative but intriguing. If realized, it would mean:
- Multi-step reasoning with one forward pass per chained step, rather than hundreds of token-by-token generation steps.
- Latent-space planning before explicit generation.
- Potential for "thought" tokens that accumulate information before output.
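The chaining idea can be illustrated with a small control-flow sketch. Everything here is hypothetical: `compress` stands in for one forward pass that maps a block of latent tokens to a fresh block, and the toy function used below has no relation to a real model. The point is only that cost scales with the number of chained steps, not with response length.

```python
from typing import Callable, List, Tuple

Latent = List[List[float]]  # a block of compression-token vectors


def chain_latents(
    initial: Latent,
    compress: Callable[[Latent], Latent],
    steps: int,
) -> Tuple[Latent, int]:
    """Feed the latent block back through `compress` for `steps` rounds.

    Each call to `compress` represents a single forward pass.
    """
    state = initial
    passes = 0
    for _ in range(steps):
        state = compress(state)
        passes += 1
    return state, passes


def toy_compress(block: Latent) -> Latent:
    """Stand-in for a model forward pass: halves every value."""
    return [[0.5 * x for x in vec] for vec in block]


final_state, n_passes = chain_latents([[1.0, 2.0]], toy_compress, steps=3)
print(final_state, n_passes)
```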
Agent Communication Through Dense Representations¶
Appendix B identifies another application:
"As LLM-based agents are increasingly deployed in multi-agent systems, inter-agent communication through natural language tokens will become a bottleneck... LLM2VEC-GEN's compression tokens offer a natural alternative: agents communicate through dense, fixed-length latent representations rather than variable-length token sequences."
This matters because:
- Efficiency: 10 latent tokens versus potentially hundreds of natural language tokens.
- Interpretability: Decodable representations maintain transparency for human oversight.
- Safety preservation: Communication inherits LLM alignment rather than introducing new channels.
This could enable new multi-agent architectures where agents share compressed "thought" representations that remain human-interpretable.
Follow-Up Research Directions¶
1. Hybrid Input-Output Objectives. The marginal retrieval decline for Qwen-3-4B suggests pure output-centric embeddings miss useful lexical cues. Future work could combine:
   - Output-space alignment (for semantic depth)
   - Lightweight input-space contrastive signals (for surface matching)
This might preserve both semantic understanding and lexical precision.
2. Full JEPA Implementation. Eliminating the external teacher dependency entirely—using the same frozen LLM as both target encoder and student—would simplify the training pipeline and potentially improve performance.
3. Multilingual and Cross-Domain Generalization. Testing output-centric embeddings across languages (MTEB multilingual benchmarks) and specialized domains (legal, medical, scientific) would reveal whether response-space representations transfer broadly.
4. Dynamic Token Allocation. The current method uses fixed 10 compression tokens. Adaptive allocation—more tokens for complex queries, fewer for simple ones—could improve efficiency and performance.
5. Reconstruction Quality Improvements. Current decoded outputs are approximate. Better reconstruction mechanisms (e.g., decoder-only fine-tuning, iterative refinement) could improve interpretability and utility.
6. Safety Analysis. The paper shows safety transfer on AdvBench-IR, but deeper analysis is needed: do output-centric embeddings resist adversarial manipulation? How do they handle subtle harmful content that doesn't trigger explicit refusals?
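The hybrid objective from the first direction could be sketched as a weighted sum of an alignment term and a contrastive term. The paper does not develop this, so what follows is only one possible formulation under stated assumptions: cosine distance for alignment, an InfoNCE loss for the contrastive signal, and hypothetical `temperature` and `contrastive_weight` values.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature: float = 0.05) -> float:
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def hybrid_loss(student, teacher, positive, negatives, contrastive_weight: float = 0.1) -> float:
    """Output-space alignment plus a lightly weighted contrastive term."""
    alignment = 1.0 - cosine(student, teacher)
    return alignment + contrastive_weight * info_nce(student, positive, negatives)


# Toy 2-dimensional vectors: the student matches both teacher and positive.
loss_good = hybrid_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Here the student matches a negative document instead of the positive.
loss_bad = hybrid_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(f"good: {loss_good:.6f}  bad: {loss_bad:.6f}")
```

Keeping `contrastive_weight` small reflects the "lightweight" framing: the contrastive term nudges surface matching without reshaping the space around relevance labels.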
Practical Applications and Deployment Guidance¶
When to prefer LLM2VEC-GEN:
- Safety-critical retrieval: Applications where harmful content retrieval is a risk (e.g., customer support, educational search). The reduction in harmful retrieval of up to 22.6% is substantial and requires no explicit safety training.
- Reasoning-intensive tasks: Scientific literature search, legal document retrieval, technical documentation where queries require logical deduction. BRIGHT improvements (up to 35.6%) demonstrate clear advantages.
- Low-resource settings: When labeled paired data is unavailable, LLM2VEC-GEN achieves 94% of supervised performance (61.9 vs. 65.7) with only unlabeled queries.
- Dual deployment scenarios: When the same model must serve both embedding and generation, the frozen backbone enables seamless switching.
When alternatives may be preferable:
- Lexical matching requirements: For exact phrase search or keyword-heavy retrieval, input-centric methods may outperform.
- Supervised data available: When high-quality labeled data exists, supervised embedders such as supervised LLM2Vec still hold an edge (65.7 vs. 61.9).
- Interpretability not needed: If decoded outputs aren't useful, alignment-only training achieves nearly identical performance (67.5 vs. 67.9 on MTEB-Lite) with a simpler training setup.
Integration Guidance¶
For practitioners integrating LLM2VEC-GEN:
- Select backbone carefully: The response generator's behaviors directly transfer. Use models with desired safety and reasoning characteristics.
- Match teacher to backbone: Use unsupervised LLM2Vec with the same underlying LLM family for best results (Figure 6a).
- Adapt instructions to generative format: Change MTEB-style instructions from "Retrieve..." to "Generate..." for optimal performance (Figure 5).
- Consider LoRA for task-specific optimization: If highest performance is critical and separate weights are acceptable, LoRA (r=8) provides marginal gains (Table 3).
- Monitor safety transfer: If using smaller response generators, verify safety transfer doesn't degrade (Figure 6b).
Closing Perspective¶
LLM2VEC-GEN demonstrates that the output-centric paradigm—embedding what a model would respond rather than what text says—offers a fundamentally different approach to text embedding. The method's success across safety, reasoning, and general embedding tasks, combined with its parameter efficiency and interpretability, suggests this paradigm may become a significant alternative to input-centric approaches. The open questions (full JEPA implementation, hybrid objectives, multi-agent applications) point toward a rich research agenda extending well beyond the current work.