LLM Pruning and Distillation in Practice: The Minitron Approach
ArXiv: 2408.11796
🎯 Pitch
This paper adapts the Minitron pruning+distillation pipeline to work without access to the original pretraining data by introducing a lightweight "teacher correction" fine-tuning step so the unpruned teacher better matches the distillation distribution. The result is a practical, resource-efficient recipe that compresses state-of-the-art models (e.g., Mistral NeMo 12B→8B, Llama 3.1 8B→~4–4.5B) with strong benchmark accuracy and substantial inference throughput and token-efficiency gains, enabling lower-cost deployment of capable small language models.
1. Executive Summary
This paper develops a practical recipe to compress large language models into smaller ones using structured pruning plus knowledge distillation even when the original pretraining dataset is unavailable. The central addition is a teacher correction phase: a lightweight fine-tuning of the unpruned teacher on the distillation dataset so the teacher’s guidance becomes well-matched to the student’s training distribution (Figure 2, Figure 4). Using this pipeline, the authors compress Mistral NeMo 12B → 8B and Llama 3.1 8B → ~4.5B, reporting strong benchmark performance and inference throughput gains (Tables 1–2, Figure 10).
2. Context and Motivation
- Problem / gap addressed
    - Pruning + distillation can produce a family of smaller models from one large pretrained model with far fewer tokens/compute than training each size from scratch (Introduction).
    - Prior pruning+distillation recipes often assume access to the original pretraining dataset for distillation (Introduction).
    - For many frontier/open models, pretraining data is proprietary/private, so the original dataset is not available to downstream practitioners (Introduction).
- Why this matters
    - Training multiple multi-billion-parameter models from scratch is “extremely time-, data- and resource-intensive” (Introduction).
    - If practitioners can reliably compress a model without the original data, they can:
        - build deployable small language models (SLMs) at lower cost,
        - obtain better latency/throughput and lower memory footprint for inference (Runtime Performance Analysis; Figure 10).
- Prior approaches and shortcomings (as described in-paper)
    - The paper builds directly on the Minitron pruning+distillation recipe (Figure 3; references [2], [3]).
    - Shortcoming highlighted here: distilling on a dataset different from the teacher’s pretraining distribution can yield “sub-optimal guidance” (Teacher Correction section).
- How the paper positions its contribution
    - It adapts the original Minitron compression strategy in two ways (Introduction; Methodology):
        - Add teacher correction so the teacher adapts to the new distillation dataset (Figure 2).
        - Add a more effective downstream task-based saliency criterion for depth pruning (Pruning → Layer Importance; Analysis → Depth Pruning Metrics; Figures 8–9).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a pipeline that takes an existing pretrained LLM (the teacher) and produces a smaller pruned model (the student) that recovers performance via distillation.
- It solves the problem of compressing models when the original pretraining data is unavailable by first adapting the teacher to the available distillation dataset, then pruning, then distilling (Figure 1, Figure 2).
3.2 Big-picture architecture (diagram in words)
- Input: a pretrained base model (e.g., Mistral NeMo 12B, Llama 3.1 8B) + a replacement dataset for distillation (Nemotron-4 curated continued training (CT) dataset) (Training Details → Dataset).
- Components:
    - Teacher correction: lightweight fine-tuning of the teacher on the distillation dataset (Figure 2; Training Details → Teacher Correction).
    - Structured pruning: remove parameters by trimming structured parts of the network (layers for depth pruning; channels/neurons/embedding dimensions for width pruning) using activation-based importance estimation (Pruning section; Figure 3).
    - Distillation retraining: train the pruned student to match teacher logits using forward KL divergence (“logit-only distillation”) (Retraining with Distillation; Distillation section; Figure 2).
    - (For instruction models) Alignment: supervised fine-tuning + preference optimization with NeMo-Aligner (Instruction Tuning section).
3.3 Roadmap for the deep dive
- I first explain teacher correction, because it is the key new step enabling distillation without original pretraining data (Figure 2, Figure 4).
- I then detail pruning, separating width pruning (joint hidden/attention/MLP-related axes) vs. depth pruning (removing whole transformer layers) (Pruning section; Table 3).
- Next I explain distillation retraining (objective and hyperparameters) since it is what restores accuracy after trimming (Distillation; Table 4).
- Finally I summarize alignment and the evaluation setup used to validate capabilities (Instruction Tuning; Evaluation; Tables 1–2).
3.4 Detailed, sentence-based technical breakdown
This is an empirical systems/recipe paper: it assembles a practical sequence of steps (teacher adaptation, structured pruning, and logit distillation) and validates the resulting compressed models on benchmarks, ablations, and runtime throughput (Figures 1–7, Tables 1–4, Figure 10).
3.4.1 End-to-end “what happens first, second, third” pipeline (system/data flow diagram in words)
- Start with a pretrained base teacher model such as Mistral NeMo 12B or Llama 3.1 8B (Training Details → Pre-training).
- Select a distillation dataset you do have access to, here the Nemotron-4 curated continued training (CT) dataset (Training Details → Dataset).
- Teacher correction (new step):
    - You fine-tune the teacher on the distillation dataset for ~100B tokens to better match the dataset distribution used later for distillation (Teacher Correction section; Training Details → Teacher Correction; Figure 2).
    - The output is a corrected teacher that (empirically) provides better distillation guidance than the uncorrected teacher (Analysis → Teacher Correction; Figure 4).
- Compute pruning importance scores using a small calibration dataset of 1024 samples randomly drawn from the full dataset and forward passes only (Pruning → Importance Estimation).
- Prune the teacher into a student architecture by trimming structured components based on importance rankings:
    - For width pruning, prune along width-related axes (hidden size / MLP dimension / embedding channels / etc., as used in their recipe) without changing depth (Pruning → Model Trimming; “pruning recipes” bullets).
    - For depth pruning, remove transformer layers, using a task-based criterion (Winogrande) to choose a contiguous block of layers to drop (Pruning → Layer Importance; Analysis → Depth Pruning Metrics; Figures 8–9).
- Retrain the pruned student with logit-only distillation:
    - The student is trained to match the corrected teacher’s output probability distribution over next-token predictions by minimizing forward KL divergence between teacher and student logits (Distillation section; Figure 2).
    - The paper explicitly says it ignores the LM cross-entropy loss altogether during distillation retraining (Distillation section).
- (Optional) Align / instruction-tune the distilled base model:
    - The authors apply math+code supervised fine-tuning (SFT), then instruction SFT, then two rounds of Reward-aware Preference Optimization (RPO) using NeMo-Aligner (Instruction Tuning section).
- Evaluate base and aligned models on standard benchmarks and report throughput via TensorRT-LLM (Evaluation section; Tables 1–2; Runtime Performance Analysis; Figure 10).
3.4.2 Teacher correction (key new mechanism)
- Motivation and hypothesis
    - Distillation typically uses the same dataset the teacher trained on, but when that dataset is unavailable, distillation on a different dataset can degrade teacher guidance (Teacher Correction section).
    - The paper hypothesizes the cause is a shift in the “distribution of sub-word tokens” between the teacher’s original pretraining set and the distillation dataset (Teacher Correction section).
- Mechanism
    - Teacher correction is implemented as a lightweight fine-tuning of the teacher on the distillation dataset before pruning+distillation (Figure 2).
    - Training details provided:
        - Uses ~100B tokens for correction (Teacher Correction section; Figure 1 shows Teacher Correction (127B) as a token figure in the diagram).
        - Uses 120 steps of warm-up and “low learning rates: one-fifth the peak learning rate,” while keeping “identical batch size, minimum learning rate and decay schedule the original model was trained on” (Training Details → Teacher Correction).
    - Reported behavior:
        - Teacher correction has “minor effect” on the teacher’s downstream accuracy with some tasks improving and some degrading (Training Details → Teacher Correction; Table 1 referenced there).
        - It “significantly improves” guidance for the student during distillation (Analysis → Teacher Correction; Figure 4).
- Parallel correction variant
    - The authors also test distilling from a “continuously corrected teacher” while pruning the original teacher, and find it performs “on par” with distilling from a fully corrected teacher (Analysis → Teacher Correction; Figure 5).
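The teacher-correction settings above are stated relative to the teacher's original training schedule, so they can be sketched as a small transformation of that schedule. This is a minimal illustration, not the authors' code: the `Schedule` class, the function name, and every numeric value of the "original" schedule are hypothetical placeholders, and reading "one-fifth the peak learning rate" as one-fifth of the original model's peak is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Schedule:
    """Hypothetical container for the schedule fields named in the text."""
    peak_lr: float
    min_lr: float
    warmup_steps: int
    global_batch_size: int
    decay: str


def teacher_correction_schedule(original: Schedule) -> Schedule:
    """Derive a correction schedule from the teacher's original one.

    Per the summary: 120 warm-up steps and one-fifth of the peak learning
    rate; batch size, minimum learning rate and decay schedule are kept.
    """
    return replace(original, peak_lr=original.peak_lr / 5, warmup_steps=120)


# Placeholder values for illustration only; the teachers' real
# pretraining schedules are not given in this summary.
original = Schedule(peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000,
                    global_batch_size=1024, decay="cosine")
correction = teacher_correction_schedule(original)
print(correction)
```

Only the two fields the text calls out change; everything else is copied from the original schedule, which mirrors the "identical batch size, minimum learning rate and decay schedule" wording.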
3.4.3 Structured pruning: what is pruned and how importance is estimated
- Structured pruning (definition in this paper’s context)
    - Instead of deleting individual weights, the method removes structured blocks like whole layers, attention heads, neurons, or embedding channels/dimensions (Pruning section).
- Importance estimation (activation-based, forward-only)
    - The paper uses a “purely activation-based importance estimation strategy” that computes sensitivity for multiple axes (depth, neuron, head, and embedding channel) using only forward passes (Pruning → Importance Estimation).
    - It uses a small calibration dataset of 1024 randomly drawn samples (Pruning → Importance Estimation).
    - For head/neuron/embedding channel importance, it examines activations from:
        - multi-head attention (MHA) for head importance,
        - multi-layer perceptron (MLP) for neuron importance,
        - LayerNorm layers for embedding channel importance (Pruning → Importance Estimation).
- Depth pruning is treated as a special case
    - The paper states it does not combine depth pruning with compressing other dimensions (“We consider depth pruning as a special case and do not combine it with compressing other dimensions.”) (Pruning → Importance Estimation).
- Depth pruning saliency / selection metric
    - Two layer-importance metrics are considered (Pruning → Layer Importance):
        - LM validation loss / perplexity impact.
        - Downstream task accuracy impact.
    - They explicitly avoid the Block Importance (BI) metric because prior work showed it underperforms validation loss/PPL for this purpose (Pruning → Layer Importance).
    - Their operational depth-pruning procedure is:
        - Remove one layer or a contiguous block of layers.
        - Measure effect on metric (loss/PPL or downstream accuracy).
        - Treat that as “importance/sensitivity” and choose what to drop accordingly (Pruning → Layer Importance).
    - Based on empirical analysis, they choose Winogrande as the downstream metric to decide which contiguous block of layers to remove (Pruning → Layer Importance; Analysis → Depth Pruning Metrics; Figures 8–9).
- Contiguous vs non-contiguous layer dropping
    - Figures 8–9 analyze removing layers:
        - Figure 8 indicates non-contiguous removal can yield better LM validation loss (dashed line).
        - Figure 9 shows this does not translate to downstream performance: removing 16 layers chosen by per-layer importance can produce “random” Winogrande accuracy 0.5, whereas removing layers 16 to 31 contiguously yields 0.595 (Analysis → Depth Pruning Metrics; Figure 9 text).
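The two selection mechanisms described above (activation-based scoring for width axes, and a metric-driven search over contiguous layer blocks for depth) can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, taking absolute values and aggregating with a mean over the sequence before an L2 norm over the batch is one plausible reading of the aggregation the paper mentions, and `toy_accuracy` is a made-up stand-in for a real downstream evaluation such as Winogrande.

```python
import math


def neuron_importance(activations):
    """Score each neuron from activations[b][t][i] (sample b, token t, neuron i).

    Assumed aggregation: mean of |activation| over the sequence,
    then L2 norm over the batch.
    """
    num_neurons = len(activations[0][0])
    scores = []
    for i in range(num_neurons):
        per_sample = [
            sum(abs(token[i]) for token in sample) / len(sample)
            for sample in activations
        ]
        scores.append(math.sqrt(sum(s * s for s in per_sample)))
    return scores


def keep_top(scores, keep):
    """Indices of the `keep` highest-scoring units, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def best_contiguous_block(num_layers, block_size, score_without):
    """Try every contiguous block of `block_size` layers and return
    (start, end_exclusive, score) for the block whose removal leaves the
    highest score. `score_without(dropped_layers)` is a hypothetical callback
    that evaluates the model with those layers removed."""
    best = None
    for start in range(num_layers - block_size + 1):
        score = score_without(range(start, start + block_size))
        if best is None or score > best[2]:
            best = (start, start + block_size, score)
    return best


# Toy calibration batch: 2 samples x 2 tokens x 3 neurons.
activations = [
    [[1.0, 0.0, -2.0], [3.0, 0.0, 2.0]],
    [[0.0, 0.5, 4.0], [0.0, 0.5, -4.0]],
]
scores = neuron_importance(activations)
kept = keep_top(scores, keep=2)

# Toy stand-in for a downstream evaluation (not a real benchmark run).
# Real models are not additive like this; only the search loop matters here.
layer_cost = [0.10, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01, 0.04]


def toy_accuracy(dropped):
    return 0.80 - sum(layer_cost[i] for i in dropped)


block = best_contiguous_block(len(layer_cost), block_size=3,
                              score_without=toy_accuracy)
print(scores, kept, block)
```

The depth search evaluates whole blocks rather than summing per-layer scores, which is the distinction Figures 8–9 draw: layers that look individually unimportant under LM loss can still be harmful to remove together.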
3.4.4 Model trimming: the actual architectural changes used
The paper uses manual architectures (skipping the original Minitron NAS step) and lists the final configurations (Pruning section; Table 3).
- Llama-3.1-Minitron-4B-Width (width pruning only) (Pruning recipes; Table 3)
    - Starts from: Llama 3.1 8B
    - Changes:
        - hidden size: 4096 → 3072
        - MLP hidden dim: 14336 → 9216
        - depth: unchanged (32 per Table 3)
        - query heads: 32 (unchanged)
        - attention groups: 8 (unchanged)
        - head dimension: 128 (unchanged)
    - Parameters (Table 3):
        - total params: 4.5B
        - non-emb params: 3.7B
        - vocabulary: 128256
- Llama-3.1-Minitron-4B-Depth (depth pruning only) (Pruning recipes; Table 3)
    - Starts from: Llama 3.1 8B
    - Changes:
        - depth: 32 → 16
        - Keeps hidden size = 4096, MLP hidden dim = 14336, heads unchanged
    - Parameters (Table 3):
        - total params: 4.5B
        - non-emb params: 3.5B
        - vocabulary: 128256
- MN-Minitron-8B (from Mistral NeMo 12B; width changes) (Pruning recipes; Table 3)
    - Starts from: Mistral NeMo 12B
    - Changes:
        - hidden size: 5120 → 4096
        - MLP hidden dim: 14336 → 11520
        - depth: unchanged (40 per Table 3)
        - query heads: 32, attention groups: 8, head dimension: 128
    - Parameters (Table 3):
        - total params: 8.4B
        - non-emb params: 7.3B
        - vocabulary: 131072
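As a sanity check on the configurations listed above, the parameter counts can be approximately re-derived from the architecture dimensions. The sketch below is not from the paper; it assumes a standard decoder layout with grouped-query attention, a gated MLP with three weight matrices, no bias terms, two norm weight vectors per layer plus a final norm, and untied input/output embeddings. The function name `count_params` is illustrative.

```python
def count_params(hidden, mlp_hidden, depth, q_heads, kv_groups, head_dim, vocab):
    """Return (non_embedding_params, total_params) under the stated assumptions."""
    q_dim = q_heads * head_dim        # width of the query/output projections
    kv_dim = kv_groups * head_dim     # width of each key/value projection
    attention = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden  # Q, K, V, O
    mlp = 3 * hidden * mlp_hidden     # gate, up and down projections
    norms = 2 * hidden                # two norm weight vectors per layer
    non_emb = depth * (attention + mlp + norms) + hidden  # + final norm
    embeddings = 2 * vocab * hidden   # untied input and output embeddings
    return non_emb, non_emb + embeddings


CONFIGS = {
    "Llama-3.1-Minitron-4B-Width": dict(hidden=3072, mlp_hidden=9216, depth=32,
                                        q_heads=32, kv_groups=8, head_dim=128,
                                        vocab=128256),
    "Llama-3.1-Minitron-4B-Depth": dict(hidden=4096, mlp_hidden=14336, depth=16,
                                        q_heads=32, kv_groups=8, head_dim=128,
                                        vocab=128256),
    "MN-Minitron-8B": dict(hidden=4096, mlp_hidden=11520, depth=40,
                           q_heads=32, kv_groups=8, head_dim=128,
                           vocab=131072),
}

counts = {name: count_params(**cfg) for name, cfg in CONFIGS.items()}
for name, (non_emb, total) in counts.items():
    print(f"{name}: non-emb {non_emb / 1e9:.2f}B, total {total / 1e9:.2f}B")
```

Under these assumptions the counts come out at roughly 3.72B / 4.51B, 3.49B / 4.54B, and 7.34B / 8.41B (non-embedding / total), in line with the rounded Table 3 figures. It also shows why the two 4.5B Llama variants differ in non-embedding size: the depth-pruned model keeps the full 4096-wide embeddings.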
3.4.5 Distillation retraining objective (with equation + micro-example)
- Plain-language explanation first
    - The student is trained so that, for each position in the input sequence, it produces nearly the same next-token probability distribution as the teacher.
    - This is done by penalizing differences between the teacher’s and student’s output distributions (Figure 2; Distillation section).
- Objective used
    - They use logit-only distillation with the forward KL divergence between teacher and student probabilities over the vocabulary (Distillation section; Figure 2).
    - They explicitly state they “ignore the LM cross-entropy loss altogether” (Distillation section).
- Notation (minimal)
    - Let p_T(y | x) be the teacher’s next-token distribution and p_S(y | x) be the student’s distribution for context x.
    - The forward KL term is: KL(p_T || p_S) = Σ_y p_T(y|x) * log( p_T(y|x) / p_S(y|x) )
    - Minimizing this encourages p_S to put probability mass where p_T does.
- Worked micro-example (single next-token step)
    - Suppose the vocabulary has just three candidate next tokens {A, B, C} for illustration.
    - If the teacher assigns probabilities p_T = [0.70, 0.20, 0.10] and the student assigns p_S = [0.40, 0.40, 0.20], the KL loss is positive (about 0.18 nats) because the student underestimates A relative to the teacher (0.40 vs 0.70) and overestimates B and C.
    - Gradient-based training updates student parameters to increase p_S(A|x) and decrease mismatched probabilities, pushing the student closer to teacher behavior, even without using ground-truth next tokens.
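The micro-example above can be reproduced numerically. The following is a small self-contained sketch of forward KL for a single token position, using the natural logarithm (so the result is in nats); the helper names are illustrative, and the logit values in the second half are made up purely to show the logits-to-loss path.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (max-subtracted for stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p_teacher, p_student):
    """KL(p_T || p_S) = sum_y p_T(y) * log(p_T(y) / p_S(y))."""
    return sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)


# The three-token example from the text.
p_T = [0.70, 0.20, 0.10]
p_S = [0.40, 0.40, 0.20]
loss = forward_kl(p_T, p_S)
print(f"KL(p_T || p_S) = {loss:.4f} nats")

# Same computation starting from (made-up) logits, as a training loop would.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.0, 1.0, 0.0]
loss_from_logits = forward_kl(softmax(teacher_logits), softmax(student_logits))
print(f"from logits: {loss_from_logits:.4f} nats")
```

In a real training loop this quantity would be averaged over all token positions in the batch; no ground-truth labels appear anywhere in the loss, which is what "logit-only" means here.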
3.4.6 Distillation training configuration and compute (what is specified vs missing)
- Dataset
    - All pruning/distillation experiments use the Nemotron-4 curated continued training (CT) dataset (Training Details → Dataset).
    - A separate calibration subset of 1024 samples is used for importance estimation (Pruning → Importance Estimation).
- Training hyperparameters (Table 4)
    - Shared:
        - peak learning rate: 1e-4 (both Llama and MN-Minitron)
        - LR decay schedule: cosine
        - context length: 8192
    - Llama-3.1-Minitron distillation:
        - min learning rate: 1e-5
        - warm-up steps: 40
        - global batch size: 1152
        - total tokens: 94B
    - MN-Minitron distillation:
        - min learning rate: 4.5e-7
        - warm-up steps: 60
        - global batch size: 768
        - total tokens: 380B
- Hardware
    - Distillation training uses 32 NVIDIA DGX H100 nodes (Distillation section).
- Important missing details (not provided in the supplied paper excerpt)
    - The paper excerpt does not specify the optimizer name (e.g., AdamW) or its settings (betas, eps, weight decay), gradient clipping, parallelism strategy (tensor/pipeline parallel), or exact tokenization method beyond vocabulary sizes (Table 3).
    - Because these details are not stated in the provided content, the training setup cannot be reproduced exactly from this excerpt.
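To make the Table 4 values concrete, the sketch below turns them into an approximate step count and a learning-rate curve. Several things here are assumptions rather than facts from the paper: that warm-up is linear, that the cosine decay runs over all remaining steps, and that every sequence is packed to the full 8192-token context (so tokens per step = batch size × context length). Function and variable names are illustrative.

```python
import math


def num_steps(total_tokens, global_batch_size, context_length):
    """Optimizer steps needed if every sequence is packed to full context."""
    tokens_per_step = global_batch_size * context_length
    return math.ceil(total_tokens / tokens_per_step)


def lr_at(step, total_steps, peak_lr, min_lr, warmup_steps):
    """Linear warm-up to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


CONTEXT_LENGTH = 8192
RECIPES = {
    "Llama-3.1-Minitron": dict(peak_lr=1e-4, min_lr=1e-5, warmup_steps=40,
                               global_batch_size=1152, total_tokens=94 * 10**9),
    "MN-Minitron": dict(peak_lr=1e-4, min_lr=4.5e-7, warmup_steps=60,
                        global_batch_size=768, total_tokens=380 * 10**9),
}

steps = {
    name: num_steps(r["total_tokens"], r["global_batch_size"], CONTEXT_LENGTH)
    for name, r in RECIPES.items()
}
for name, r in RECIPES.items():
    total = steps[name]
    mid = lr_at(total // 2, total, r["peak_lr"], r["min_lr"], r["warmup_steps"])
    print(f"{name}: ~{total} steps, LR at midpoint ~{mid:.2e}")
```

Under these assumptions the two recipes come to roughly 10k and 60k optimizer steps respectively, which gives a feel for how short the warm-up phases (40 and 60 steps) are relative to the full run.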
3.4.7 Alignment / instruction tuning recipe (as stated)
- The aligned/instruction-tuned models are produced using NeMo-Aligner (Instruction Tuning section).
- The sequence is:
    - Math and code supervised fine-tuning (SFT)
    - Instruction SFT
    - Two rounds of Reward-aware Preference Optimization (RPO) (Instruction Tuning section)
4. Key Insights and Innovations
- (1) Teacher correction enables distillation without original pretraining data
    - What is new: a dedicated fine-tuning phase that adapts the teacher to the distillation dataset (Figure 2).
    - Why it matters: the ablations show distilling from a corrected teacher reduces validation loss and improves convergence relative to an uncorrected teacher (Analysis → Teacher Correction; Figure 4).
    - The paper’s “Insights” section quantifies this as “over a 6% reduction in LM validation loss” (Insights → General #1).
- (2) Task-based depth-pruning saliency using Winogrande + contiguity constraint
    - What is new relative to simple loss-based pruning: choosing which layers to drop based on downstream task impact (Winogrande) rather than only LM validation loss/PPL (Pruning → Layer Importance).
    - Why it matters: Figures 8–9 show that the layer choices that best preserve LM loss do not necessarily preserve downstream performance, and contiguous dropping performs substantially better on Winogrande than selecting non-contiguous “least-loss-increasing” layers (Analysis → Depth Pruning Metrics; Figure 9).
- (3) Empirical comparison of width vs depth pruning at the same parameter budget
    - The paper produces two ~4.5B variants from the same teacher (Llama 3.1 8B) (Table 3).
    - It shows width pruning has better loss curves and benchmark accuracy than depth pruning at equal parameter counts (Figure 7; Table 1), while depth pruning yields higher throughput gains (Runtime Performance Analysis; Insights → Llama 3.1 bullets).
- (4) Logit-only distillation (no cross-entropy) still recovers strong performance
    - The retraining ignores LM cross-entropy and uses only forward KL on logits (Distillation section).
    - Despite that, the MN-Minitron-8B model improves over its teacher on some tasks (e.g., GSM8K and HumanEval as reported in the Insights section and visible in Table 1).
5. Experimental Analysis
5.1 Evaluation methodology (datasets, metrics, setup)
- Benchmarks (base models) (Evaluation section)
    - Knowledge / reasoning: MMLU
    - Coding: HumanEval (Python generation), MBPP
    - Math: GSM8K
    - Commonsense QA: ARC-Challenge, HellaSwag, TruthfulQA, WinoGrande
    - Summarization: XL-Sum English (evaluated on “20% of XL-Sum”)
- Benchmarks (instruction-tuned models) (Evaluation section)
    - Multi-turn conversation: MT-Bench (GPT4-Turbo) (they note it is a corrected version)
    - Instruction following: IFEval
    - Function calling: BFCLv2 (Live)
    - Hard QA: GPQA
    - Plus MMLU, GSM8K, HumanEval, MBPP
- Shot settings and decoding (Evaluation section)
    - Base models:
        - MMLU: 5-shot
        - Winogrande: 5-shot
        - ARC-Challenge: 25-shot
        - HellaSwag: 10-shot
        - TruthfulQA: 0-shot (as shown in Table 1 header)
        - XL-Sum: 0-shot on 20% (the text says “0-shot on 20% of XL-Sum”, while the Table 1 header reads “XLSum en(20%) (3)”, which implies a shot configuration that may differ from the text)
        - Code pass@1:
            - temperature = 0.2
            - nucleus sampling with top-p = 0.95
    - Aligned models: 0-shot and greedy sampling if applicable (Evaluation section).
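For the code benchmarks, the settings above mention pass@1 computed from sampled generations (Table 1 labels HumanEval with n=20). A common way to compute this is the unbiased pass@k estimator used for HumanEval-style evaluation; whether the paper uses exactly this estimator is an assumption, and the per-problem pass counts below are invented for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: number that pass the tests,
    k: budget being scored. For k = 1 this reduces to c / n.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Decoding settings quoted in the text for base-model code evaluation.
SAMPLING = {"temperature": 0.2, "top_p": 0.95, "n": 20}

# Made-up pass counts for four hypothetical problems (out of n = 20 each).
passes_per_problem = [5, 0, 20, 10]
per_problem = [pass_at_k(SAMPLING["n"], c, 1) for c in passes_per_problem]
benchmark_pass_at_1 = sum(per_problem) / len(per_problem)
print(per_problem, benchmark_pass_at_1)
```

The benchmark-level score is the mean of the per-problem estimates, so with n=20 each problem contributes in steps of 0.05.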
5.2 Main quantitative results (with numbers)
Base models (Table 1)
- MN-Minitron-8B vs Mistral NeMo 12B (Base and FT)
    - MMLU (5-shot): 69.5 (MN-Minitron-8B) vs 69.0 (12B-Base) vs 70.1 (12B-FT)
    - Winogrande (5-shot): 80.4 vs 82.2 vs 82.7
    - HellaSwag (10-shot): 83.0 vs 85.2 vs 85.3
    - GSM8K (5-shot): 58.5 vs 56.4 vs 55.7
    - HumanEval (n=20, 0-shot): 36.2 vs 23.8 vs 23.8
    - A point the paper itself emphasizes: MN-Minitron improves over the teacher on GSM8K and HumanEval (Insights → Mistral NeMo 12B to MN-Minitron-8B; Table 1).
- Llama-3.1-Minitron-4B (Depth/Width) vs Llama 3.1 8B
    - MMLU (5-shot): 58.7 (4B-Depth), 60.5 (4B-Width), vs 65.3 (8B)
    - GSM8K (5-shot): 16.8 (4B-Depth), 41.2 (4B-Width), vs 48.6 (8B)
    - HellaSwag (10-shot): 73.2 (Depth), 76.1 (Width), vs 81.8 (8B)
    - Winogrande (5-shot): 72.1 (Depth), 73.5 (Width), vs 77.3 (8B)
    - The paper highlights that width-pruned is generally stronger than depth-pruned at the same parameter count (Table 1; Figure 7; Insights → Llama 3.1 bullets).
- Token-efficiency comparisons explicitly claimed
    - The paper compares distillation token counts to the reported pretraining tokens of Llama 3.1 8B:
        - MN-Minitron-8B uses 380B tokens vs Llama 3.1 8B pretraining 15T tokens (they describe this as “40× fewer” in Analysis and Base Models text; Table 1 lists “Training Tokens 380B” for MN-Minitron and “15T” for Llama 3.1 8B).
        - Llama-3.1-Minitron-4B uses 94B vs 15T (described as “150× fewer” in Base Models text; Table 1 lists 94B vs 15T).
    - Note: These ratios are presented in the paper narrative; they hinge on the 15T figure from the Llama 3.1 tech report (Training Details → Pre-training; [1]).
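The token-efficiency ratios quoted above can be checked with simple arithmetic. This short sketch only divides the token counts listed in the text; it does not account for teacher-correction tokens or any other cost, and the function name is illustrative.

```python
def token_ratio(reference_tokens, distillation_tokens):
    """How many times fewer tokens distillation used than the reference run."""
    return reference_tokens / distillation_tokens


TRILLION = 10**12
BILLION = 10**9

mn_ratio = token_ratio(15 * TRILLION, 380 * BILLION)    # MN-Minitron-8B
llama_ratio = token_ratio(15 * TRILLION, 94 * BILLION)  # Llama-3.1-Minitron-4B
print(f"MN-Minitron-8B: {mn_ratio:.1f}x fewer tokens")
print(f"Llama-3.1-Minitron-4B: {llama_ratio:.1f}x fewer tokens")
```

The results are about 39.5× and 159.6×, so the paper's “40×” and “150×” are best read as rounded figures.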
Instruction-tuned models (Table 2)
- MN-Minitron-8B vs Llama 3.1 8B
    - MT-Bench: 7.86 (MN-Minitron) vs 7.78 (Llama 3.1 8B)
    - MMLU (5-shot): 70.4 vs 69.4*
    - GSM8K (0-shot): 87.1 vs 83.8
    - IFEval: 84.4 vs 80.4*
    - BFCLv2 (Live): 67.6 vs 44.3
    - HumanEval (0-shot): 71.3 vs 72.6 (MN-Minitron slightly lower)
    - MBPP (0-shot): 72.5 vs 72.8* (slightly lower)
- Llama-3.1-Minitron-4B variants
    - IFEval: 66.77 (Depth) vs 79.54 (Width)
    - GSM8K (0-shot): 71.11 (Depth) vs 79.76 (Width)
    - BFCLv2 (Live): 55.89 (Depth) vs 55.0 (Width) (Depth slightly higher here)
    - The paper notes Llama-3.1-Minitron-4B lags Gemma2 on MT-Bench, and that HumanEval/MBPP are exceptions where they don’t lead similarly-sized models (Instruct Models text).
5.3 Do the experiments support the claims?
- Teacher correction helps distillation: Supported by Figure 4 (better convergence/validation loss when distilling from corrected teacher) and Figure 5 (parallel correction performs comparably).
- Pruning + distillation beats weaker baselines: Figure 6 compares the following setups and shows the combined pipeline converges best (Analysis → Pruning and Distillation; Figure 6):
    - random init + distillation,
    - random pruning + distillation,
    - pruning + LM loss,
    - pruning + distillation.
- Width vs depth trade-off:
    - Accuracy: width better than depth (Figure 7; Table 1; Insights).
    - Throughput: depth faster (Runtime Performance Analysis; Insights; Figure 10 summary).
- “State-of-the-art” claim caveat
    - The paper claims “state-of-the-art” for MN-Minitron-8B among similarly-sized models (Abstract; surrounding discussion).
    - Based strictly on the provided content, the evidence shown is comparisons in Table 1 against a selected set of “similarly-sized SoTA open models” (Table 1 caption). The claim is therefore best interpreted as: best within the compared set and their evaluation setup, rather than a universal SOTA proof.
5.4 Ablations, failure cases, robustness checks
- Ablations included (Analysis section)
    - Teacher correction vs none (Figure 4).
    - Serial correction vs parallel correction (Figure 5).
    - Pruning+distillation vs random init / random pruning / LM-loss training (Figure 6).
    - Width vs depth convergence (Figure 7).
    - Depth pruning selection strategies and metrics (Figures 8–9).
- Failure/weakness signals explicitly noted
    - Teacher correction can slightly degrade some downstream tasks (Training Details → Teacher Correction).
    - In instruction-tuned evaluation, they do not outperform similarly sized variants on HumanEval and MBPP, and Llama-3.1-Minitron-4B lags Gemma2 on MT-Bench (Instruct Models text).
6. Limitations and Trade-offs
- Teacher correction cost and uncertainty
    - Teacher correction uses on the order of 100B tokens (Teacher Correction section; Figure 1 suggests 127B), which is “lightweight” relative to trillions of tokens but still substantial.
    - The correction step can shift downstream performance in mixed ways (“some tasks improving and some degrading”) and the paper attributes this possibly to the fine-tuning dataset (Training Details → Teacher Correction).
- Depth-pruning criterion may be task-biased
    - The depth-pruning strategy optimizes layer dropping based on Winogrande accuracy (Pruning → Layer Importance; Analysis → Depth Pruning Metrics).
    - This could bias pruning decisions toward the chosen downstream proxy task; the excerpt does not report trying multiple downstream tasks as pruning criteria beyond the stated consideration.
- Manual architecture selection (NAS skipped)
    - They skip the original Minitron “lightweight NAS” stage and instead pick architectures manually “inspired” by earlier Minitron models (Pruning section).
    - This may limit optimality or generality: performance might depend on good manual choices.
- Reproducibility gaps in the provided content
    - Key training specifics are not included in the excerpt: optimizer type/settings, weight decay, gradient clipping, exact data filtering/deduplication/sampling for the Nemotron-4 CT dataset, and details of distributed training strategy.
    - The paper does mention activation aggregation functions for width pruning (l2-norm across batch and mean across sequence) and single-shot pruning (Pruning section), which helps, but full replication would require the missing details.
- Depth vs width trade-off
    - Depth pruning yields higher throughput (2.7×) but can significantly harm reasoning performance in the base model (e.g., GSM8K 16.8% depth vs 41.2% width in Table 1) (Insights → Llama 3.1; Table 1).
    - Width pruning yields better accuracy but less throughput gain (~1.8×) (Runtime Performance Analysis; Insights).
7. Implications and Future Directions
- How this changes practice
    - The paper provides a concrete, empirically tested recipe for compressing LLMs into smaller deployable models without needing access to the original pretraining corpus (Figures 1–2).
    - The key practical takeaway is: if you must distill on a new dataset, first adapt the teacher to that dataset, otherwise the teacher’s probabilities may be misaligned with the student’s training distribution (Teacher Correction; Figure 4; Insights).
- Research directions suggested by the paper (explicit)
    - The authors suggest optimizing teacher correction to be lighter than ~100B tokens, potentially using:
        - LoRA fine-tuning, or
        - tuning LayerNorm parameters alone (Training Details → Teacher Correction; references [15], [16]).
- Practical applications / downstream use cases implied
    - Producing smaller models with competitive benchmark performance at much lower training token budgets is relevant for deployment constraints (cost, latency) (Introduction; Tables 1–2; Runtime Performance Analysis).
- Repro/Integration Guidance (when to prefer what, based on this paper)
    - Prefer this pipeline when:
        - you have a strong pretrained base model but cannot access its original pretraining data, and
        - you can afford continued training on an accessible dataset (Nemotron-4 CT in this work).
    - Choose width pruning when:
        - you want better accuracy at a fixed parameter budget (Figure 7; Table 1; Insights).
    - Choose depth pruning when:
        - throughput/latency is the priority and you can tolerate more accuracy loss (Runtime Performance Analysis; Insights).
    - If you implement teacher correction:
        - consider that it may slightly alter teacher downstream performance, so validate on your target tasks (Training Details → Teacher Correction; Table 1 reference).
    - For distillation retraining:
        - the paper’s recipe is logit-only forward KL with cosine LR schedule and long context (8192) (Distillation; Table 4), but the optimizer and some training mechanics are not specified in the provided excerpt and would need clarification from the full paper or code release.