TranslateGemma Technical Report

🎯 Pitch

TranslateGemma is a suite of open machine translation models obtained by post-training Gemma 3 (4B/12B/27B) with a two-stage pipeline: supervised fine-tuning on high-quality human and synthetic parallel data, followed by reinforcement learning that directly optimizes translation-quality reward models. The report shows consistent quality gains across the 55 WMT24++ language pairs, with smaller models often matching larger baselines while retaining Gemma 3’s multimodal and instruction-following capabilities, which makes high-quality, efficient, and transparent MT tools broadly available to the research community.

1. Executive Summary

TranslateGemma is a suite of open machine translation models built by post-training the multilingual Gemma 3 (4B/12B/27B parameters) specifically for translation, using supervised fine-tuning followed by reinforcement learning that directly optimizes translation-quality rewards. The work matters because it provides strong open MT models that improve translation quality and efficiency (smaller models approach larger baselines) while preserving Gemma 3’s multimodal (image+text) capability (Introduction; Tables 1–3).

2. Context and Motivation

Problem / gap addressed
Large multilingual foundation models can translate, but the report targets the gap between “general multilingual capability” and “consistently high translation quality across many language pairs,” especially with an open model release (Introduction).
The work also emphasizes maintaining general instruction-following and multimodal abilities while specializing for MT, rather than creating a narrowly specialized system (Introduction; Section 2.4; Section 5.3).
Why this problem is important
Machine translation is positioned as critical infrastructure for cross-language communication and access to information (Introduction).
The report highlights the practical research value of open models: transparency, reproducibility, and enabling community iteration (Introduction).
Prior approaches and where they fall short (as framed here)
The report treats baseline Gemma 3 as already “potent multilingual,” but not optimized specifically enough for translation quality, motivating targeted post-training (Introduction).
It also implicitly critiques relying on a single training stage by adopting a two-stage process: supervised fine-tuning (parallel data) then reinforcement learning (quality-optimized rewards) (Introduction; Sections 3–4).
Positioning relative to existing work
TranslateGemma is positioned as an open translation-enhanced variant of Gemma 3, improving translation via:
- High-quality parallel data (human + synthetic generated with strong models).
- RL with an ensemble of reward models including quality estimation (MetricX-QE) and fine-grained error-based judging (AutoMQM) (Abstract; Sections 2–4).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a set of Gemma 3 checkpoints further trained to produce better translations while keeping general and multimodal capabilities.
It targets translation quality and efficiency with a two-stage post-training pipeline: supervised fine-tuning on curated parallel data, then reinforcement learning that optimizes translation-quality reward signals (Introduction; Sections 3–4).

3.2 Big-picture architecture (diagram in words)

(A) Data creation & curation
Monolingual text → synthetic translations (via Gemini) → filtering/selection.
Human parallel datasets → added for diversity and low-resource/script coverage.
A slice of generic instruction-following data → mixed in to preserve generality (Section 2).
(B) Supervised Fine-Tuning (SFT)
Start from released Gemma 3 checkpoints → fine-tune on the mixed dataset with specified optimizer settings; freeze embeddings (Section 3).
(C) Reinforcement Learning (RL)
Start from the SFT checkpoint → optimize using an ensemble of reward models (MetricX-QE, AutoMQM-QE, ChrF, naturalness judge, generalist reward), combining sequence-level and token-/span-level signals (Section 4; Figure 2).
(D) Evaluation
Text MT on WMT24++ (55 pairs) with automatic metrics; human MQM on WMT25 (10 pairs); multimodal image translation on Vistra (Sections 5–6; Tables 1–3).

3.3 Roadmap for the deep dive

Explain training data first, because both SFT and RL depend on how parallel data is produced and mixed (Section 2).
Then explain SFT, because RL starts from the SFT checkpoint and inherits its capabilities (Section 3).
Then explain RL rewards and credit assignment, because the key methodological novelty is how multiple reward signals (including span-level) are combined (Section 4; Figure 2).
Finally, explain evaluation protocols and prompts, because the reported improvements depend on consistent prompting and specific benchmarks (Section 5.2; Tables 1–3).

3.4 Detailed, sentence-based technical breakdown

This is an empirical post-training recipe paper: it builds translation-specialized checkpoints from a general foundation model using a two-stage SFT→RL pipeline, and validates the improvements with automatic and human evaluations (Introduction; Sections 3–6).

3.4.1 System/data pipeline diagram in words (explicit sequence)

Start with base checkpoints.
Training begins from released Gemma 3 checkpoints at 27B, 12B, and 4B parameters (Section 3).
Assemble training data (parallel + general instruction).
The training mixture draws from:
- Synthetic parallel translation data generated using Gemini 2.5 Flash from monolingual text (Section 2.1).
- Human-generated parallel data from SMOL and GATITOS to improve diversity and script coverage, especially for lower-resource languages (Section 2.2).
- Generic instruction-following data making up 30% of the SFT mixture, taken from the original Gemma 3 mixture, intended to reduce overfitting to translation and retain general instruction-following (Section 2.4).
The language distribution differs between SFT and RL, and is measured in model tokens (Figure 1). RL uses “the same translation data as for SFT, except for GATITOS and SMOL,” which are SFT-only (Section 2.3).
Generate and filter synthetic parallel data (high-level to specific).
Monolingual source: MADLAD-400 is used as the monolingual corpus from which source sentences are drawn for synthetic generation (Section 2.1).
Target quantity: up to 10K synthetic examples per language pair are produced (Section 2.1).
Source selection strategy (to focus on “most benefit” cases):
- Source segments are bucketed by length.
- From these buckets, the pipeline samples 1,000,000 source segments per language pair (Section 2.1).
- A cheap pre-filter then compares two translations from Gemini 2.5 Flash:
- One generated with greedy decoding.
- One generated by sampling with temperature 1.0.
- Both are scored using MetricX 24-QE, and the system selects sources where the sampled output shows the largest improvement over greedy (Section 2.1).
- The rationale is to identify inputs that will benefit most from expensive multi-sample decoding, using a 2-sample proxy (Section 2.1).
High-quality generation (multi-sample) and selection:
- For each selected source, generate 128 samples from Gemini 2.5 Flash.
- Apply a MetricX 24-QE filter to select the best-performing example(s) (Section 2.1).
Length regimes:
- Synthetic translations are created both as individual sentences and as text blobs up to 512 tokens, aiming to support segment-level and longer-text translation (Section 2.1).
Formatting cleanup:
- An additional formatting filter (again based on Gemini 2.5 Flash) is applied to avoid formatting issues or erroneous translations (Section 2.1).
Coverage:
- This synthetic methodology is applied to all WMT24++ language pairs plus an additional set of 30 language pairs listed in Appendix B (Section 2.1; Appendix B).
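The two-step selection described above (a cheap greedy-vs-sampled pre-filter, then best-of-N generation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the report's implementation: `translate` and `qe` are hypothetical callables standing in for Gemini 2.5 Flash decoding and MetricX 24-QE scoring, and a temperature of 0.0 stands in for greedy decoding.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not real APIs):
#   translate(source, temperature) -> translation
#   qe(source, translation) -> quality-estimation score, lower is better
Translate = Callable[[str, float], str]
QEScore = Callable[[str, str], float]


def select_sources(sources: List[str], translate: Translate,
                   qe: QEScore, keep: int) -> List[str]:
    """Keep the sources where a sampled translation most improves on greedy."""
    gains = []
    for src in sources:
        greedy_score = qe(src, translate(src, 0.0))   # greedy stand-in
        sampled_score = qe(src, translate(src, 1.0))  # one sample at T=1.0
        # Lower QE score is better, so a positive gain means sampling helped.
        gains.append((greedy_score - sampled_score, src))
    gains.sort(key=lambda g: g[0], reverse=True)
    return [src for _, src in gains[:keep]]


def best_of_n(src: str, translate: Translate, qe: QEScore, n: int = 128) -> str:
    """Draw n samples and return the one with the best (lowest) QE score."""
    candidates = [translate(src, 1.0) for _ in range(n)]
    return min(candidates, key=lambda hyp: qe(src, hyp))
```

The pre-filter costs two decodes per source, so it can be run over a large pool before the 128-sample step is spent only on the selected sources.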
Supervised Fine-Tuning (SFT) on the mixture.
Tooling: SFT uses Kauldron SFT tooling (Section 3).
Optimizer and core hyperparameters (explicitly provided):
- Optimizer: AdaFactor
- Learning rate: 0.0001
- Batch size: 64
- Steps: 200k
- Parameters updated: all parameters except the embeddings, which are frozen (Section 3).
Reason for freezing embeddings:
- Preliminary experiments indicate freezing embeddings helps translation performance for languages/scripts not covered in the SFT data mix (Section 3).
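The reported SFT settings can be collected into a small configuration sketch. The hyperparameter values come from the report; everything else is an assumption for illustration, including the `SFTConfig` class, the `"embedder"` parameter-name prefix, and the `split_trainable` helper. None of this is Kauldron's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SFTConfig:
    optimizer: str = "adafactor"
    learning_rate: float = 1e-4
    batch_size: int = 64
    train_steps: int = 200_000
    # Hypothetical name prefix for embedding parameters (assumption).
    frozen_prefixes: Tuple[str, ...] = ("embedder",)


def split_trainable(param_names: List[str],
                    config: SFTConfig) -> Tuple[List[str], List[str]]:
    """Partition parameter names into (trainable, frozen) by name prefix."""
    trainable, frozen = [], []
    for name in param_names:
        if name.startswith(config.frozen_prefixes):
            frozen.append(name)
        else:
            trainable.append(name)
    return trainable, frozen
```

In a real training loop the frozen list would typically be turned into an optimizer mask so that embedding parameters receive no updates.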

Notably missing details (cannot be inferred from the provided text): the report excerpt does not specify base-model architecture details (e.g., number of layers, hidden size, attention heads, context length, tokenizer specifics), nor the training hardware, total training tokens, or compute budget for SFT.

Reinforcement Learning (RL) from translation-quality rewards.
RL is run on top of the SFT checkpoint (Section 4).
The key idea is to optimize translation quality using an ensemble of reward models/metrics, rather than a single metric (Section 4).
Reward models used (with how they are applied):
- MetricX-24-XXL-QE:
- A regression-based metric producing a score from 0 (best) to 25 (worst) aligned to MQM score ranges (Section 4).
- Used as QE (quality estimation) by passing an empty reference (i.e., no reference translation) (Section 4).
- For rewards, scores are linearly rescaled using \(5.0 - \text{score}\) so that larger reward means better translation (Section 4).
- Gemma-AutoMQM-QE:
- A fine-tuned AutoMQM model initialized from Gemma 3-27B-IT and trained on MQM ratings from WMT 2020–WMT 2023 (Section 4).
- Uses default MQM weights to compute token-level rewards from AutoMQM outputs (Section 4).
- Also ignores references (used in QE mode) (Section 4).
- ChrF:
- A character n-gram overlap metric (Section 4).
- This is the only reward model in the RL ensemble that uses (synthetic) references (Section 4).
- Scaled by a factor of two to roughly match the scale of other rewards (Section 4).
- Naturalness Autorater (in-house):
- Implemented as a prompted LLM-as-a-judge using the base RL policy model (Section 4).
- Produces span-level annotations, penalizing output spans that do not sound native when the unnaturalness is not due to an unnatural source (Section 4).
- A Generalist reward model:
- Covers many tasks (reasoning, instruction following, multilingual abilities) and is adapted from the general Gemma 3 post-training setup (Section 4).
- This component is a mechanism for capability preservation, counteracting over-specialization on translation.
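The two explicit rescalings mentioned above are simple enough to write down. This is a sketch only: the function names are hypothetical, the input range of ChrF (0 to 1 versus 0 to 100) is not stated in the excerpt, and the weights used to combine rewards across the ensemble are not given, so no combination is attempted here.

```python
def metricx_reward(score: float) -> float:
    """Map a MetricX-QE score (0 = best, 25 = worst) to a reward, higher is better."""
    return 5.0 - score


def chrf_reward(chrf: float) -> float:
    """Scale ChrF by two, as described; the input range is an assumption."""
    return 2.0 * chrf
```

Under this mapping, any translation with a MetricX-QE score worse than 5 receives a negative reward, for example `metricx_reward(25.0)` is `-20.0`.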
Credit assignment: combining sequence-level and token-/span-level signals
- The RL setup uses algorithms “extended to support token-level advantages,” which are added to advantages computed from sequence-level rewards (Section 4).
- This allows direct use of fine-grained span-level reward signals (from AutoMQM and the Naturalness Autorater) for improved training efficiency and credit assignment (Section 4).
- Figure 2 illustrates the additive combination and notes that sequence-level reward is treated as reward-to-go, broadcast uniformly to tokens, and then combined with token-level signals; combined advantages are then batch-normalized (Figure 2).
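The additive combination described for Figure 2 can be sketched in plain Python. Several simplifications are assumptions made for illustration: the sequence-level advantage is taken to be the raw sequence reward (no baseline), token-level advantages are taken to be the per-token rewards themselves, and span penalties are spread uniformly over the tokens in each span. The helper names are hypothetical.

```python
from math import sqrt
from typing import List, Tuple


def spans_to_token_rewards(num_tokens: int,
                           spans: List[Tuple[int, int, float]]) -> List[float]:
    """Turn (start, end, penalty) spans into per-token negative rewards."""
    rewards = [0.0] * num_tokens
    for start, end, penalty in spans:
        for i in range(start, min(end, num_tokens)):
            rewards[i] -= penalty
    return rewards


def combine_advantages(seq_rewards: List[float],
                       token_rewards: List[List[float]],
                       eps: float = 1e-8) -> List[List[float]]:
    """Broadcast each sequence reward to its tokens, add token-level
    signals, then normalize over every token in the batch."""
    combined = [
        [seq_r + tok_r for tok_r in toks]
        for seq_r, toks in zip(seq_rewards, token_rewards)
    ]
    flat = [a for row in combined for a in row]
    mean = sum(flat) / len(flat)
    std = sqrt(sum((a - mean) ** 2 for a in flat) / len(flat))
    return [[(a - mean) / (std + eps) for a in row] for row in combined]
```

The effect is that tokens inside a flagged span end up with a lower advantage than other tokens of the same sequence, even though they share the same sequence-level reward.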

Again, key details are missing (and cannot be inferred from the provided text): the exact RL algorithm name (e.g., a PPO variant), KL regularization choices, sampling temperature during RL rollouts, reward weights across the ensemble, number of RL steps, batch sizes, and compute/hardware are not specified in the provided excerpt.

Inference prompting format (operational interface).
The model is trained and evaluated using a specific translation prompt, and the report recommends using the same prompt for new translations (Section 5.2; Figure 3).
The prompt frames the model as a professional translator and instructs it to output only the target-language translation, with explicit source/target language names and locale-like language codes (Figure 3).
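A prompt of the shape described above could be assembled as shown below. The wording here is illustrative only and is not the text of Figure 3; the function name and arguments are hypothetical. In practice the report's own prompt should be used verbatim, since the models were trained and evaluated with it.

```python
def build_translation_prompt(source_lang: str, source_code: str,
                             target_lang: str, target_code: str,
                             text: str) -> str:
    """Assemble an illustrative translator-style prompt (not the official one)."""
    return (
        f"You are a professional {source_lang} ({source_code}) to "
        f"{target_lang} ({target_code}) translator. "
        f"Produce only the {target_lang} translation, with no explanations.\n\n"
        f"{text}"
    )


prompt = build_translation_prompt("English", "en-US", "German", "de-DE",
                                  "Hello, world.")
```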

4. Key Insights and Innovations

(1) Two-stage post-training specialized for translation (SFT → RL)
What’s different: Instead of only supervised fine-tuning, the pipeline adds RL that optimizes translation quality signals directly (Introduction; Sections 3–4).
Why it matters: The reported gains appear across model sizes and across many language pairs, suggesting the two-stage approach improves quality beyond baseline multilingual capability (Table 1; Appendix A mentions consistency across 55 pairs).
(2) Synthetic parallel data generation with QE-based source selection + 128-sample filtering
What’s different: Synthetic data is not generated uniformly; it uses a two-step selection: 1) find source segments where sampling beats greedy under QE, then 2) do expensive 128-sample generation and pick best via MetricX 24-QE (Section 2.1).
Why it matters: This is an explicit attempt to spend generation budget where it yields the most improvement and to push synthetic data quality upward (Section 2.1).
(3) RL reward ensemble combining QE, error-based judging, lexical overlap, naturalness, and general capabilities
What’s different: Reward is not a single scalar from one metric; it is an ensemble including MetricX-QE, AutoMQM-QE (fine-grained error spans), ChrF with references, plus naturalness and a generalist reward (Section 4).
Why it matters: The ensemble targets multiple properties—adequacy/error severity, fluency/naturalness, and capability retention—rather than optimizing only overlap or only a learned QE metric.
(4) Token-/span-level credit assignment in RL for translation
What’s different: The RL process incorporates token-level advantages derived from span-level annotations (AutoMQM and Naturalness Autorater), combined additively with sequence-level reward-to-go (Section 4; Figure 2).
Why it matters: Fine-grained feedback can, in principle, reduce wasted learning signal by telling the model where the translation is wrong, not only that it is wrong (Section 4).
(5) Retaining multimodal capability without multimodal post-training
What’s different: The SFT/RL steps use no multimodal training data, yet the report evaluates image translation and finds capabilities largely retained, with some improvements (Section 5.3; Table 2).
Why it matters: It suggests translation-focused post-training does not necessarily destroy image-conditioned translation behavior, at least under the tested setting (Section 5.3).

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, baselines, setup)

Automatic text MT evaluation
Benchmark: WMT24++ across 55 language pairs (Abstract; Section 5.1; Table 1; Appendix A/Table 4 provides per-language MetricX values).
Metrics:
- MetricX 24 (lower is better; reported with ↓ in Table 1).
- Comet22 (higher is better; reported with ↑ in Table 1).
Baselines: corresponding-size Gemma 3 models (4B/12B/27B) vs TranslateGemma models of the same sizes (Table 1).
Automatic image translation evaluation
Benchmark: Vistra image translation benchmark (Section 5.3).
Data handling:
- The evaluation selects only images with a single text instance per reference, resulting in 264 images (Section 5.3).
- The model input is the image plus a translation prompt; no text box location info and no OCR preprocessing is provided (Section 5.3).
Reported metric aggregation:
- Scores average translation from English into German, Spanish, Russian, and Chinese (Table 2).
Human evaluation
Framework: MQM (Multidimensional Quality Metrics), where professional translators mark error spans with severity/category, and a weighted score is computed (Section 6).
Tooling: annotations collected with Anthea (Section 6).
Language pairs: 10 directions from 3 sources (English, Czech, Japanese) (Section 6 list).
Systems compared:
- TranslateGemma 12B and 27B, and baseline Gemma 3 27B (Section 6).
Rater assignment:
- Documents truncated to ≤12 sentences (with specific handling for literary chapters via chunking), and “pseudo-SxS” assignment where the same rater evaluates all system outputs for a given source document (Section 6).
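The weighted MQM score mentioned under the human evaluation can be illustrated with a short sketch. The severity weights below (minor = 1, major = 5) are the ones commonly used in WMT-style MQM scoring; that this report uses exactly these weights is an assumption, and full MQM schemes include further special cases that are omitted here.

```python
from typing import Dict, List

# Commonly used MQM severity weights (assumed, not confirmed by the excerpt).
MQM_WEIGHTS: Dict[str, float] = {"minor": 1.0, "major": 5.0}


def mqm_segment_score(severities: List[str]) -> float:
    """Sum the weights of all errors annotated in one segment (lower is better)."""
    return float(sum(MQM_WEIGHTS[s] for s in severities))


def mqm_system_score(segments: List[List[str]]) -> float:
    """Average the segment scores over all rated segments of a system."""
    return sum(mqm_segment_score(seg) for seg in segments) / len(segments)
```

For example, a system with one major error, one clean segment, and one minor error over three segments scores (5 + 0 + 1) / 3 = 2.0.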

5.2 Main quantitative results (with specific numbers)

Text MT quality on WMT24++ (55 pairs) improves at every model size (Table 1; values are Gemma 3 → TranslateGemma):
27B: MetricX 4.04 → 3.09; Comet22 83.1 → 84.4.
12B: MetricX 4.86 → 3.60; Comet22 81.6 → 83.5.
4B: MetricX 6.97 → 5.32; Comet22 77.2 → 80.1.
Efficiency observation (quality vs size)
The report highlights that TranslateGemma 12B exceeds the baseline Gemma 3 27B on the WMT24++ aggregate numbers (Table 1: 12B TranslateGemma MetricX 3.60 vs 27B Gemma 3 MetricX 4.04; Comet22 83.5 vs 83.1).
Similarly, TranslateGemma 4B approaches baseline Gemma 3 12B (Table 1: MetricX 5.32 vs 4.86; Comet22 80.1 vs 81.6), indicating partial “size compression” of translation quality.
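To put the Table 1 MetricX numbers on a common scale, the relative error reduction can be computed per model size. The values are the ones quoted above; the helper name is hypothetical.

```python
# (Gemma 3 baseline, TranslateGemma) MetricX on WMT24++, lower is better.
TABLE1_METRICX = {"27B": (4.04, 3.09), "12B": (4.86, 3.60), "4B": (6.97, 5.32)}


def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of a lower-is-better score relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline


for size, (base, tuned) in TABLE1_METRICX.items():
    print(f"{size}: {relative_reduction(base, tuned):.1f}% lower MetricX")
```

This gives roughly 23.5%, 25.9%, and 23.7% for 27B, 12B, and 4B respectively, so the relative gain is similar at every size.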
Per-language consistency (examples given; full table in Appendix A/Table 4)
Examples of MetricX improvements cited in Section 5.1 include:
- English→German: 1.63 → 1.19
- English→Spanish: 2.54 → 1.88
- English→Hebrew: 3.90 → 2.72
- English→Swahili: 5.92 → 4.45
- English→Lithuanian: 6.01 → 4.39
- English→Estonian: 6.40 → 4.61
- English→Icelandic: 8.31 → 5.69
Appendix A/Table 4 lists all 55 language pairs’ MetricX values for 27B/12B/4B TranslateGemma vs 27B/12B/4B Gemma 3, and the narrative claims improvements are consistent across all 55 pairs.
Image translation on Vistra (264 images subset)
Table 2 shows mixed outcomes depending on model size and metric:
- 27B: MetricX 2.03 → 1.58; Comet22 76.1 → 77.7.
- 12B: MetricX 2.33 → 2.08; Comet22 74.9 → 72.8 (Comet22 decreases).
- 4B: MetricX 2.60 → 2.58; Comet22 69.1 → 70.7 (small gain).
The report attributes small 4B gains to limited capacity (Section 5.3).
Human MQM evaluation on WMT25 (10 language pairs)
Table 3 reports MQM scores (lower is better). Selected examples:
- English→Italian: TranslateGemma 27B 1.8 vs Gemma 3 27B 2.5.
- English→Marathi: TranslateGemma 27B 3.1 vs Gemma 3 27B 4.7.
- English→Swahili: TranslateGemma 27B 4.2 vs Gemma 3 27B 5.2.
Notable non-uniform outcomes:
- English→German: TranslateGemma 27B 2.3 vs Gemma 3 27B 2.2 (roughly on par / slightly worse).
- Japanese→English: TranslateGemma 27B 13.4 vs Gemma 3 27B 11.6 (regression), with error analysis pointing to named entity mistranslation as the cause while other categories improve (Section 6).

5.3 Do the experiments support the claims?

Supportive evidence
The aggregate WMT24++ results show consistent improvements in both MetricX and Comet22 across sizes (Table 1), strengthening the claim that post-training improves translation quality beyond the base model.
Human MQM evaluation matches the automatic trend for most tested directions (Section 6; Table 3), with larger gains on some low-resource directions (e.g., English→Marathi, English→Swahili).
Image translation evaluation demonstrates multimodal retention without multimodal post-training data, at least for the selected Vistra subset and prompting setup (Section 5.3; Table 2).
Caveats visible in the results
The human evaluation includes clear exceptions: German-target directions are roughly on par with the baseline, and Japanese→English regresses (Section 6; Table 3).
The image translation metrics are not uniformly improved (notably 12B Comet22 drops) (Table 2).
The report excerpt does not provide ablation studies separating the effects of:
- synthetic vs human parallel data,
- SFT vs RL,
- reward ensemble components,
- freezing embeddings vs. not.
Causal attribution within the pipeline therefore remains underdetermined from the provided text.

6. Limitations and Trade-offs

Dependence on strong proprietary generators and filters for training data
Synthetic parallel data generation relies on Gemini 2.5 Flash plus MetricX 24-QE scoring and additional formatting filtering (Section 2.1). This raises a reproducibility trade-off for fully open replication of the data pipeline, even if the final models are open.
Incomplete training specification (from the provided excerpt)
Many implementation-critical details are not present here: base-model architecture hyperparameters, tokenizer and context window, total tokens, hardware, compute budget, RL step count, RL algorithm specifics, and reward weighting. This limits exact reproducibility based solely on the excerpt.
Potential over-optimization or uneven generalization across directions
Human evaluation shows at least one meaningful regression (Japanese→English) linked to named entity handling (Section 6).
German target directions are reported as roughly on par in MQM (Section 6; Table 3), suggesting improvements are not uniform across all domains/directions under human judgment.
Metric mismatch and multi-objective tension
Image translation shows a metric disagreement for 12B (MetricX improves while Comet22 decreases) (Table 2), illustrating that optimizing for certain reward signals and text MT objectives may not transfer cleanly to all evaluation metrics or modalities.
Capacity constraints
The 4B model shows smaller gains on image translation, attributed to limited capacity (Section 5.3). More generally, smaller models may not fully absorb the benefits of the pipeline across tasks.
Embedding freezing trade-off
Freezing embeddings is motivated as helping languages/scripts not covered in the SFT mix (Section 3), but freezing can also limit adaptation in principle; the excerpt does not provide a systematic analysis of when freezing helps or hurts.

7. Implications and Future Directions

How this changes the landscape (within the report’s scope)
The reported results suggest that targeted translation post-training can significantly lift MT quality of a multilingual foundation model while enabling efficiency gains, where a smaller specialized model can rival or exceed a larger general baseline (Table 1 discussion).
The work also suggests that translation-focused post-training need not eliminate multimodal translation ability, at least under the tested Vistra protocol (Section 5.3; Table 2).
Follow-up research enabled or suggested by the presented results
Reward ensemble ablations: quantify the marginal value of each reward component (MetricX-QE vs AutoMQM-QE vs naturalness vs generalist) and how token-level advantages affect training efficiency/quality (Section 4; Figure 2 motivates this).
Named entity robustness: the Japanese→English regression attributed to named entity mistranslation suggests a targeted direction for data augmentation, decoding constraints, or specialized reward shaping (Section 6).
Cross-modal transfer analysis: given no multimodal data was used, understanding why text-only post-training improves (some) image translation metrics would be a useful mechanistic follow-up (Section 5.3).
Practical applications / downstream use
General MT across many language pairs (WMT24++ coverage) with improved quality and potential serving efficiency gains (Abstract; Table 1; Appendix A/Table 4).
Image-conditioned translation of text in natural images (Vistra-style use cases) without requiring separate multimodal fine-tuning in this recipe (Section 5.3).
Repro/Integration Guidance (based on provided details)
Use the recommended translation prompt for best behavior consistency, as both training and evaluation use it (Section 5.2; Figure 3).
Model size choice:
- Prefer 27B if maximum quality is needed and compute allows (Tables 1–3 generally show best performance).
- Prefer 12B if balancing quality and cost; it outperforms the Gemma 3 27B baseline on WMT24++ aggregates (Table 1), but note the image-translation Comet22 drop and the direction-specific MQM variability (Table 2; Table 3).
- Use 4B for constrained deployments; it improves over baseline on WMT24++ (Table 1) but shows limited gains on image translation (Table 2).
If your use case is sensitive to named entities, the reported Japanese→English regression suggests validating entity fidelity explicitly during evaluation (Section 6).