Towards Scalable Pre-training of Visual Tokenizers for Generation

🎯 Pitch

The paper introduces VTP, a unified pre-training framework that jointly optimizes reconstruction with representation learning (image–text contrastive and self-supervised objectives) to produce latents that encode high-level semantics rather than only low-level pixels. This matters because it resolves the “pre-training scaling problem”: unlike reconstruction-only tokenizers that stagnate or degrade downstream generation as compute scales, VTP’s semantically structured latents scale with FLOPs, model size, and data to substantially improve and accelerate generative performance.

1. Executive Summary

This paper addresses a mismatch in latent diffusion pipelines: visual tokenizers (autoencoders/VAEs) trained only for pixel reconstruction can keep improving reconstruction with more compute while failing to improve, or even hurting, downstream image generation. It introduces VTP, a unified visual-tokenizer pre-training framework that jointly optimizes reconstruction with representation learning (CLIP-style image–text contrastive learning + DINOv2-style self-supervision), producing a latent space whose downstream generative quality keeps improving as tokenizer compute, model size, and data scale up.

2. Context and Motivation

Problem/gap addressed: the “pre-training scaling problem.”
In latent diffusion models (LDMs), generation is performed in a compressed latent space produced by a visual tokenizer (e.g., a VAE/autoencoder) that is typically pre-trained separately using a reconstruction objective (Introduction).
A paradox appears: better reconstruction does not guarantee better generation, and a trade-off is widely observed (Introduction; Figure 1(c)).
The paper highlights a scaling-specific failure mode: pouring more FLOPs into reconstruction-only tokenizer pre-training improves reconstruction metrics but does not translate into better downstream generation, and can degrade it (Section 4.3; Figure 4).
Why it matters
Modern generative pipelines (e.g., DiT diffusion transformers trained on tokenizer latents) depend heavily on the tokenizer’s latent space quality (Abstract; Introduction).
If tokenizer pre-training does not scale “in the right direction,” additional compute spent on tokenizers becomes inefficient for improving generation (Abstract; Introduction).
Prior approaches and their shortcomings (as positioned here)
Reconstruction-only tokenizers bias latents toward low-level details and can move away from the “structured” latent space useful for generation as training scales (Introduction).
Methods that inject semantics exist in two broad patterns (Introduction; Related Work 2.2):
- Explicitly enriching latents (e.g., adding specific semantic signals or leveraging powerful pre-trained features).
- Regularizing/aligning latent features using representational priors or distillation from foundation models (e.g., VA-VAE, REPA-E).
The paper claims these are “preliminary” in the sense that broader scaling properties (compute/model/data) are generally not verified and some are limited by a ceiling imposed by fixed existing foundation representations (Related Work 2.2; Section 5).
How this paper positions itself
It reframes tokenizer pre-training as representation learning rather than purely reconstruction (Figure 1; Figure 3).
It proposes a joint, multi-objective pre-training approach (reconstruction + self-supervised + image–text contrastive) and studies how this affects scaling behavior for downstream generation (Sections 3–4; Figures 5–7).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a Vision Transformer (ViT)-based autoencoder visual tokenizer trained with multiple objectives so that its latent codes are useful for both reconstructing images and training latent-space diffusion models.
It solves the problem of tokenizer pre-training not scaling for generation by shaping the latent space with semantic representation learning losses while still keeping reconstruction in the loop, and then using the resulting latents to train a standard DiT with unchanged training specs.

3.2 Big-picture architecture (diagram in words)

Inputs: image I, and (for contrastive learning) a paired caption/text T.
Core components:
ViT encoder + bottleneck (“visual tokenizer”): maps I into a latent tensor Z ∈ R^{d×H/16×W/16} (Section 3.2).
Pixel decoder (ViT-based): reconstructs I' from Z for reconstruction losses (Section 3.2).
Text encoder: encodes T into text features for image–text contrastive loss (Section 3.4; Figure 3).
EMA teacher: an exponential-moving-average copy used for self-distillation / MIM targets (Section 3.3; Figure 3).
Training losses (combined):
Reconstruction loss L_rec (L1 + perceptual).
Self-supervised loss L_ssl (masked image modeling + self-distillation).
Contrastive loss L_clip (CLIP-style image–text alignment).
Downstream: freeze/use the trained tokenizer to encode ImageNet images into latents and train a fixed LightningDiT-B diffusion model for generation evaluation (Section 4.1; Figure 3).

3.3 Roadmap for the deep dive

Explain (1) what the tokenizer outputs and why that matters for latent diffusion.
Then cover (2) reconstruction training (including the two-stage stabilization strategy).
Then cover (3) self-supervised learning signals (MIM + self-distillation via EMA teacher).
Then cover (4) image–text contrastive learning and how it encourages semantic structure.
Finally cover (5) how losses are combined + batch sampling, and (6) how downstream DiT evaluation is done under fixed specs.

3.4 Detailed, sentence-based technical breakdown

This is an empirical paper about a training framework. Its core idea is that a visual tokenizer’s latent space becomes generation-scalable when it is trained not only to reconstruct pixels but also to learn semantically meaningful representations via contrastive and self-supervised objectives.

What the tokenizer is and what it outputs

The visual tokenizer is an autoencoder whose encoder is a Vision Transformer and whose output latent representation is a spatial latent tensor Z ∈ R^{d×H/16×W/16} (Section 3.2).
Here, H×W is the input image resolution and /16 is the spatial downsampling ratio, which for a ViT equals the patch size. Section 4.2 writes this as f16d64, where f is the downsample ratio/patch size and d is the bottleneck dimension.
d is the bottleneck/latent channel dimension, primarily d=64 (Section 4.1), with an ablation at d=256.
The downstream generative model (DiT) operates in this latent space: it learns to denoise noisy versions of Z (Figure 1(b); Figure 3).
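The shape bookkeeping above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the 3-channel RGB input is an assumption rather than something stated in the summary.

```python
def latent_shape(height, width, downsample=16, channels=64):
    """Return (d, H/f, W/f) for an input image of size height x width."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample ratio")
    return (channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=16, channels=64, image_channels=3):
    """Ratio of input values (assumed RGB) to latent values."""
    d, h, w = latent_shape(height, width, downsample, channels)
    return (image_channels * height * width) / (d * h * w)


# f16d64 at the 256x256 evaluation resolution mentioned in Section 4.1.
shape = latent_shape(256, 256)
spatial_positions = shape[1] * shape[2]
print("latent shape:", shape)
print("spatial positions seen by the DiT:", spatial_positions)
print("values in / values out:", compression_ratio(256, 256))
print("same ratio at d=256:", compression_ratio(256, 256, channels=256))
```

Under these assumptions, widening the bottleneck from d=64 to d=256 leaves the spatial grid unchanged and only reduces the overall compression.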

Reconstruction training (and why it is two-stage)

Goal in plain language: ensure the latents preserve enough visual detail so decoded samples look like valid images, while avoiding unstable training.
The reconstruction pipeline is: encode image I → get latent Z → pixel decoder reconstructs image I' (Section 3.2).
The pixel decoder “lifts latents back to the feature space,” refines them with N ViT blocks, and outputs pixels via a final pixel-shuffle layer (Section 3.2).
The paper reports a specific stability problem: GAN loss is poorly compatible with ViT architectures because it causes large gradient norms and instability (Section 3.2).
To address this, it uses a two-stage strategy (Section 3.2):
Pre-training stage: jointly train all parameters using a composite reconstruction loss consisting of L1 plus a perceptual loss L_perceptual between I and I'.
- The reconstruction loss is L_rec = L1 + L_perceptual (Section 3.2).
- Plain-language paraphrase: “penalize pixel differences and also penalize differences in perceptual feature space.”
Second stage: freeze the visual tokenizer and fine-tune only the pixel decoder with a GAN objective to improve fidelity (Section 3.2).
- The paper does not provide the GAN objective’s exact formula in the provided excerpt.
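As a rough illustration of the stage-one objective L_rec = L1 + L_perceptual, the sketch below works on flat lists of pixel values. The `toy_features` extractor is a hypothetical stand-in for a pretrained perceptual network, and the mean-squared feature distance is an assumption; the summary does not say which perceptual loss the paper uses.

```python
def l1_loss(x, y):
    """Mean absolute difference between two equally sized flat pixel lists."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def perceptual_loss(x, y, features):
    """Mean squared difference measured in the space of a feature extractor."""
    fx, fy = features(x), features(y)
    return sum((a - b) ** 2 for a, b in zip(fx, fy)) / len(fx)


def reconstruction_loss(image, recon, features):
    """L_rec = L1 + L_perceptual, with no extra weighting between the terms."""
    return l1_loss(image, recon) + perceptual_loss(image, recon, features)


def toy_features(pixels):
    """Hypothetical feature extractor: differences between neighbouring pixels."""
    return [b - a for a, b in zip(pixels, pixels[1:])]


image = [0.0, 0.5, 1.0, 0.5]
recon = [0.0, 0.5, 0.5, 0.5]  # the reconstruction flattens the peak
loss = reconstruction_loss(image, recon, toy_features)
print("L_rec:", loss)
```

The perceptual term penalizes the lost edge structure in addition to the raw pixel error, which is the intuition behind pairing the two terms.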

Self-supervised learning (SSL): masked modeling + self-distillation

Goal in plain language: force the tokenizer latents to capture more than low-level pixel detail, including spatial structure and semantics, by learning from augmentations and masked inputs without labels.
The SSL framework follows DINOv2 and has two parts (Section 3.3):
Masked image modeling (MIM)
- An image I is augmented into a global view I_global and a local view I_local (Section 3.3).
- For MIM, I_global is patch-embedded and sent to the EMA teacher, while a masked version is processed by the visual tokenizer, producing a masking loss L_mim (Section 3.3; Figure 3).
- Intuition: the tokenizer must infer missing patches, which encourages learning contextual/spatial representations rather than copying pixels.
Self-distillation (DINO)
- I_global and I_local are passed to the visual tokenizer, and I_global is also passed to the EMA teacher, with cross-entropy loss L_dino computed between student predictions and teacher pseudo-labels (Section 3.3; Figure 3).
- Intuition: different crops/views should map to consistent semantic predictions, encouraging invariances and higher-level structure.
The combined SSL loss is L_ssl = L_mim + L_dino (Section 3.3, Eq. (1)).
Plain-language paraphrase: “sum the masked modeling loss and the self-distillation consistency loss.”
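The self-distillation term and the EMA teacher can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: the temperatures and momentum are assumed values in the range commonly used for DINO-style training, teacher centering is omitted, and all function names are illustrative.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_z for v in scaled]


def dino_loss(student_logits, teacher_logits, tau_student=0.1, tau_teacher=0.04):
    """Cross-entropy between teacher pseudo-labels and student predictions."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, tau_teacher)]
    student_logp = log_softmax(student_logits, tau_student)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_logp))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Teacher weights track the student as an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def ssl_loss(l_mim, l_dino):
    """L_ssl = L_mim + L_dino (Eq. (1))."""
    return l_mim + l_dino


teacher = [2.0, 0.0, 0.0]
agree = dino_loss([2.0, 0.0, 0.0], teacher)     # student matches the teacher
disagree = dino_loss([0.0, 2.0, 0.0], teacher)  # student prefers another class
print("agree:", agree, "disagree:", disagree)
print("teacher after one EMA step:", ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

The teacher receives no gradients in this scheme; it only changes through `ema_update`, which is what makes its outputs usable as slowly moving targets.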

Image–text contrastive learning (CLIP-style)

Goal in plain language: explicitly align the visual latent features with text so the latent space organizes around semantics that matter for descriptions and categories.
Given image–text pairs (I, T) (Section 3.4):
The visual tokenizer encodes I into visual features.
A text encoder encodes T into textual features.
A CLIP-style contrastive objective encourages matched image/text pairs to be similar and mismatched pairs to be dissimilar, producing L_clip (Section 3.4).
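A CLIP-style objective is usually written as a symmetric cross-entropy over a similarity matrix. The sketch below follows that standard form under stated assumptions: features are L2-normalized, the temperature is fixed at 0.07 (the paper's value is not given in this summary), and feature vectors are non-zero.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def clip_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    image_to_text = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print("matched:", matched, "shuffled:", shuffled)
```

Every other item in the batch acts as a negative, so the number of negatives grows with batch size, which is one reason contrastive training favors large batches.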

Joint multi-objective training: the VTP objective

The paper’s central training mechanism is to optimize reconstruction + self-supervision + contrastive alignment together.
The overall loss is (Section 3.5, Eq. (2)):
- L_total = λ_rec L_rec + λ_ssl L_ssl + λ_clip L_clip,
- with λ_rec > 0, λ_ssl ≥ 0, and λ_clip ≥ 0.
Plain-language paraphrase: “a weighted sum that keeps reconstruction active but allows semantic objectives to dominate as needed.”
Specific weights used in experiments (when stated):
λ_rec = 0.1, and λ_clip and λ_ssl are set to either 0 or 1 depending on the configuration being tested (Section 4.1).
The paper notes that a smaller reconstruction weight improves generative performance (Section 4.1).
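The weighted sum and its constraints are simple to express in code. In this minimal sketch the weights follow the values stated above (λ_rec = 0.1, the other two toggled between 0 and 1), while the per-objective loss values are made-up placeholders used only to show the arithmetic.

```python
def total_loss(l_rec, l_ssl, l_clip, lam_rec=0.1, lam_ssl=1.0, lam_clip=1.0):
    """L_total = lam_rec * L_rec + lam_ssl * L_ssl + lam_clip * L_clip."""
    if lam_rec <= 0:
        raise ValueError("lam_rec must be > 0 so reconstruction stays active")
    if lam_ssl < 0 or lam_clip < 0:
        raise ValueError("lam_ssl and lam_clip must be >= 0")
    return lam_rec * l_rec + lam_ssl * l_ssl + lam_clip * l_clip


# The four objective configurations compared in the ablations.
CONFIGS = {
    "AE only": {"lam_ssl": 0.0, "lam_clip": 0.0},
    "CLIP+AE": {"lam_ssl": 0.0, "lam_clip": 1.0},
    "SSL+AE": {"lam_ssl": 1.0, "lam_clip": 0.0},
    "CLIP+SSL+AE": {"lam_ssl": 1.0, "lam_clip": 1.0},
}

# Placeholder loss values, purely illustrative (not from the paper).
example = {"l_rec": 1.0, "l_ssl": 2.0, "l_clip": 3.0}
for name, weights in CONFIGS.items():
    print(name, total_loss(**example, **weights))
```

Setting a weight to 0 removes that objective entirely, which is how a single code path can cover all four ablation settings.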

Batch sampling strategy (to reconcile very different batch-size needs)

Contrastive learning benefits from very large batch sizes, while SSL and reconstruction typically work with smaller ones (Section 3.6).
Given an input batch of B image–caption pairs (Section 3.6):
Use all B for CLIP: B_clip = B (e.g., 16k).
Randomly sample subsets for SSL and reconstruction: B_ssl and B_rec.
In implementation (Section 4.1), they set:
B_clip = 16k, B_ssl = 4k, B_rec = 2k.
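One way to realize this sampling scheme is to draw index subsets from the full batch. The sketch below is an assumption-laden illustration: it samples the SSL and reconstruction subsets independently and without replacement, which the summary does not confirm, and it uses a scaled-down batch that keeps the stated 8:2:1 ratio of 16k:4k:2k.

```python
import random


def sample_task_batches(batch_size, ssl_size, rec_size, seed=None):
    """Return per-objective index lists drawn from one batch of image-caption pairs."""
    if ssl_size > batch_size or rec_size > batch_size:
        raise ValueError("subset sizes cannot exceed the full batch size")
    rng = random.Random(seed)
    indices = range(batch_size)
    return {
        "clip": list(indices),                # CLIP uses every pair in the batch
        "ssl": rng.sample(indices, ssl_size),  # random subset for MIM + DINO
        "rec": rng.sample(indices, rec_size),  # random subset for reconstruction
    }


batches = sample_task_batches(batch_size=16, ssl_size=4, rec_size=2, seed=0)
for task, idx in batches.items():
    print(task, len(idx), sorted(idx))
```

Because all three index lists come from the same batch, the images only need to be loaded once per step even though each objective sees a different number of them.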

Implementation details the paper provides (and what it does not)

Provided (Section 4.1):
Uses a ViT implementation from [26].
Incorporates QKNorm to improve training stability.
Text encoder: 12-layer transformer, hidden dimension 768.
Pixel decoder for fast experiments: 4-layer ViT-Large.
Bottleneck dimension: primarily d=64, with ablation at d=256.
Pre-training data: internally filtered DataComp-1B with 277M samples.
Downstream generation training: ImageNet; trains LightningDiT-B for 80 epochs under a fixed protocol.
Reconstruction evaluation: ImageNet validation set at 256 resolution; uses rFID as primary reconstruction metric.
Not specified in the provided content:
Optimizer type and settings (e.g., AdamW params), learning rate schedule, weight decay, gradient clipping, training hardware, total training tokens/steps in the pretraining runs (beyond “FLOPs” plots), attention head counts, and tokenizer patch embedding details beyond the f16 notation.

A concrete “one batch” walkthrough (micro-example)

For one training step on a batch of image–caption pairs:

Start with B image–caption pairs (I, T).
CLIP path (semantic alignment):
Encode every I with the tokenizer encoder to get image features.
Encode every T with the text encoder to get text features.
Compute L_clip by pushing matched pairs together and unmatched pairs apart (Section 3.4).
SSL path (semantic + spatial structure):
Sample B_ssl pairs.
For each sampled image, create global/local crops; feed crops through student tokenizer and global crops through EMA teacher.
Compute L_mim from masked global-view reconstruction targets and L_dino from teacher–student pseudo-label consistency (Section 3.3).
Reconstruction path (detail preservation):
Sample B_rec images.
Encode to Z, decode to I', compute L1 + L_perceptual = L_rec (Section 3.2).
Combine losses with λ_rec, λ_ssl, λ_clip into L_total and update model parameters (Section 3.5).

4. Key Insights and Innovations

(1) Formulation of the “pre-training scaling problem” for visual tokenizers
The paper crystallizes a specific failure mode: scaling reconstruction-only tokenizer pre-training can improve reconstruction while degrading downstream generation (Introduction; Figure 4).
This matters because it challenges an implicit assumption that “better tokenizer reconstruction” is a monotonic proxy for “better generator performance.”
(2) Joint multi-objective pre-training for tokenizers (CLIP + SSL + reconstruction)
The core innovation is not using a single semantic auxiliary loss, but a unified framework that jointly optimizes:
- cross-modal alignment (L_clip),
- self-supervised representation learning (L_ssl = L_mim + L_dino),
- and reconstruction (L_rec) (Section 3; Figure 3; Eq. (2)).
The paper claims (and provides evidence) that this combination is feasible/stable and produces better generative scaling (Section 4.3; Figure 6).
(3) Empirical claim: “understanding drives generation” in tokenizer latents
The paper uses linear probing accuracy on the bottleneck features as a proxy for “understanding,” and shows a strong positive relationship between understanding metrics and generation metrics across scaling experiments (Introduction; Figure 2; Figure 5; Figure 6).
(4) Demonstration of tokenizer scalability along multiple axes
Unlike reconstruction-only autoencoders that saturate early for downstream generation, VTP shows improved generation as you scale:
- tokenizer training compute (FLOPs) (Figures 5–6),
- tokenizer model size (encoder and decoder scaling) (Figure 7(b,c)),
- tokenizer data size (Figure 7(a)).

5. Experimental Analysis

Evaluation methodology (datasets, metrics, setup)

Tokenizer pre-training data: internally filtered DataComp-1B subset with 277M samples (Section 4.1).
Downstream generation data: ImageNet (Section 4.1).
Downstream generator: LightningDiT-B trained for 80 epochs under a fixed configuration (Section 4.1).
Metrics:
Reconstruction: rFID (reconstruction FID) reported on ImageNet validation at 256×256 (Section 4.1; Figure 4; Table 2). Table 1 also reports rPSNR.
Understanding: ImageNet linear probing Top-1 accuracy using only bottleneck features (no multi-layer feature enhancements) (Section 4.1). Table 2 also reports zero-shot accuracy for some models.
Generation: FID measured from the diffusion model trained on tokenizer latents.
- Section 4.1 mentions reporting FID-10k for most cases in the scaling studies.
- Table 2 reports FID-50k, with LightningDiT on ImageNet for 80 epochs, evaluated without CFG (classifier-free guidance).

Main quantitative results (with anchored references)

A. ViT tokenizers are competitive as autoencoders (Table 1).
- Under the reported setup:
  - ViT-L autoencoder: rPSNR = 31.28, gFID = 53.51.
  - CNN tokenizer baseline (“CNN [25]”): rPSNR = 30.63, gFID = 59.53.
- ViT-L uses more parameters (607.2M) but fewer FLOPs than the CNN entry (Table 1).
- Interpretation within the paper: ViT is an effective and efficient substitute architecture for tokenizers (Section 4.2).

B. Scaling reconstruction-only pre-training improves reconstruction but hurts generation (Figure 4).
- As tokenizer training FLOPs increase:
  - rFID improves from 2.070 → 0.510 (better reconstruction).
  - gFID worsens from 55.04 → 58.56 (worse generation).
- This is the empirical signature of the “scaling paradox” the paper emphasizes (Section 4.3; Figure 4).

C. Adding semantic objectives (CLIP or SSL) makes generation scale with compute and correlates with “understanding” (Figure 5).
- The paper compares:
  - AE only vs CLIP+AE vs SSL+AE,
  - at latent dimensions d=64 and d=256,
  - across compute scales (Figure 5).
- Qualitative takeaway anchored to the plotted relationships:
  - AE only sits in a regime where reconstruction improves but understanding/generation stagnate or degrade.
  - CLIP+AE and SSL+AE show concurrent improvement in understanding + generation as compute increases (Figure 5(a,d)).

D. Combining both CLIP and SSL with reconstruction performs best under a fixed compute budget (Figure 6).
- In the f16d64 setting, the joint objective CLIP+SSL+AE achieves:
  - Generation: gFID = 27.8,
  - Understanding: 74.9% linear probing accuracy (Figure 6).
- Figure 6 positions this as a stronger upper bound than “best CLIP+AE” or “best SSL+AE” under the same compute budget.

E. Scaling behavior across data and parameters (Figure 7).
- Data scaling (Figure 7(a)):
  - For AE-B, gFID barely changes with more data: 58.37 → 56.71 from 100K → 100M pretraining data size.
  - For VTP-B, gFID improves dramatically: 47.59 → 27.45 across the same data scaling (Figure 7(a)).
  - The paper emphasizes that downstream DiT training FLOPs are kept identical, isolating tokenizer pretraining as the driver (Section 4.5).
- Encoder scaling (Figure 7(b)):
  - AE variants remain around ~57 gFID across sizes (Figure 7(b)).
  - VTP improves as encoder size increases, with gFID improving from 31.28 → 26.12 (Figure 7(b); Section 4.4).
- Decoder scaling (Figure 7(c)):
  - Scaling decoder depth from 4 to 24 layers improves VTP gFID from 26.12 → 24.08.
  - The analogous AE curves do not improve and slightly worsen (Figure 7(c)).

F. Final comparison profile (Table 2 + Figures 8–9).
- VTP-L-d64 reports (Table 2):
  - Understanding: 78.2 zero-shot accuracy, 85.7 linear probe.
  - Reconstruction: rFID = 0.36.
  - Generation: FID-50k = 2.81 (80 epochs, without CFG).
- The paper also claims a 4.1× faster convergence on generation compared to prior distillation-based methods (VA-VAE), illustrated by convergence curves (Figure 9) and discussed in Section 5.
- Reconstruction visual comparison indicates VTP reduces color/texture artifacts compared to a fixed-representation approach (RAE) (Figure 8; Section 5).
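To put the before/after pairs quoted in items B and E on a common scale, the short script below converts them to relative changes. The endpoint values are copied from the text above; the helper name and row labels are illustrative, and since lower is better for both rFID and gFID, a negative percentage is an improvement.

```python
def relative_change(before, after):
    """Percentage change from before to after (negative means the metric dropped)."""
    return (after - before) / before * 100.0


# (label, value at the small end of the sweep, value at the large end)
REPORTED = [
    ("AE rFID vs. FLOPs (Fig. 4)", 2.070, 0.510),
    ("AE gFID vs. FLOPs (Fig. 4)", 55.04, 58.56),
    ("AE-B gFID vs. data (Fig. 7a)", 58.37, 56.71),
    ("VTP-B gFID vs. data (Fig. 7a)", 47.59, 27.45),
    ("VTP gFID vs. encoder size (Fig. 7b)", 31.28, 26.12),
    ("VTP gFID vs. decoder depth (Fig. 7c)", 26.12, 24.08),
]

for label, before, after in REPORTED:
    print(f"{label}: {before} -> {after} ({relative_change(before, after):+.1f}%)")
```

Read this way, the reconstruction-only run cuts rFID by roughly three quarters while gFID drifts upward by a few percent, whereas VTP-B's gFID falls by more than 40% across the data sweep.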

Do experiments support the claims?

Support for “reconstruction-only scaling fails for generation”: directly shown by opposite trends in Figure 4 with explicit numbers.
Support for “understanding drives generation”:
The paper operationalizes “understanding” as linear probe accuracy on bottleneck features and shows strong correlation plots (Figure 2; Figure 5(a,d); Figure 6(a)).
This is correlational evidence, not necessarily causal isolation, but the systematic comparisons (AE-only vs +CLIP vs +SSL vs +both) strengthen the argument that semantic objectives are a key factor.
Support for “scaling laws”:
The paper provides separate scaling axes (compute, data, model size) and contrasts against reconstruction-only baselines (Figures 5–7).
The strongest evidence is that AE curves saturate while VTP curves improve consistently under controlled downstream training (Figure 7).

Ablations / robustness checks mentioned

Latent dimension ablation: d=64 vs d=256 (Section 4.1; Figure 5).
Objective ablations: AE only vs CLIP+AE vs SSL+AE vs CLIP+SSL+AE (Figures 5–6).
Parameter scaling: encoder size and decoder depth scaling (Figure 7(b,c)).
Data scaling: 100K to 100M subsets, with equal “1.1 billion samples each” training budget (Section 4.5; Figure 7(a)).
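The equal-budget control in the data-scaling ablation implies very different amounts of repetition at the two ends of the sweep. The arithmetic below assumes the 1.1 billion samples are drawn by repeated passes over each subset, which is an interpretation of the summary rather than a stated detail; only the two endpoint subset sizes are taken from the text.

```python
SAMPLES_SEEN = 1_100_000_000  # "1.1 billion samples each" (Section 4.5)


def passes_over_subset(samples_seen, subset_size):
    """Average number of times each image is seen under a fixed sample budget."""
    return samples_seen / subset_size


for name, size in [("100K", 100_000), ("100M", 100_000_000)]:
    passes = passes_over_subset(SAMPLES_SEEN, size)
    print(f"{name} subset: about {passes:,.0f} passes")
```

Under that assumption the smallest subset is revisited thousands of times while the largest is seen only about a dozen times, so the sweep varies data diversity at a fixed number of training samples.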

6. Limitations and Trade-offs

Missing training details for full reproducibility (in the provided content)
The paper does not specify key optimizer/hyperparameter details (optimizer type, learning rate schedule, weight decay, training steps, hardware), which are important for faithfully reproducing large-scale pretraining behavior.
Dependence on paired image–text data
VTP’s best configuration uses CLIP-style image–text contrastive learning, which requires image–caption pairs (Section 3.4; Figure 3; Section 4.1’s DataComp usage).
The framework includes SSL-only variants (SSL+AE), but the top reported joint setup uses both.
Two-stage training complexity
Reconstruction uses a two-stage approach where the pixel decoder is later GAN-finetuned while freezing the tokenizer (Section 3.2).
This adds training complexity and introduces an additional “hand-off” point; the paper does not quantify how sensitive results are to the second-stage GAN procedure in the provided text.
Evaluation scope
Downstream generation evaluation is centered on ImageNet with a fixed LightningDiT-B protocol (Section 4.1).
The provided content does not show how well scaling conclusions transfer to other datasets, resolutions, or other generator architectures beyond the stated DiT setup.
“Understanding” proxy is narrow
Understanding is measured via ImageNet linear probing on bottleneck features (Section 4.1), which is a useful but specific proxy.
The paper does not, in the provided excerpt, test other semantic evaluations (e.g., retrieval, dense prediction) to triangulate what semantic aspects matter most.
Data filtering is internal
The tokenizer pretraining data is an “internally filtered” DataComp-1B subset (Section 4.1), but filtering criteria and contamination checks are not described here.

7. Implications and Future Directions

How this changes the landscape (within the paper’s framing)
The work suggests that tokenizer pre-training should be treated as scalable representation learning, not just reconstruction, if the goal is to improve latent diffusion generation at scale (Abstract; Introduction; Figure 1).
It provides empirical evidence that tokenizer scaling can be made productive: generation improves with tokenizer compute/model/data when semantic objectives are included (Figures 5–7), in contrast to reconstruction-only saturation.
Follow-up research directions explicitly suggested
Broader multi-objective pretraining: the paper asks what other perceptual/representation tasks could be integrated into multi-objective tokenizer pretraining beyond CLIP and SSL (Conclusion).
Data distribution as a lever: it suggests not only scaling data size but also tailoring data distributions (e.g., text-rendering-heavy data) to unlock specialized downstream generative capabilities (Conclusion).
Practical applications / downstream use cases
Any setting using a two-stage latent generative pipeline (tokenizer → latent diffusion/DiT) could benefit if improved generation can be achieved by scaling tokenizer pretraining alone while keeping downstream generator training fixed (Figure 1; Sections 4–5).
Repro/Integration Guidance (based on the provided paper)
Prefer VTP-style multi-objective tokenizer pretraining when:
- you are training latent diffusion/DiT models and observe that improving tokenizer reconstruction does not improve generation (the failure mode in Figure 4),
- you want improvements without changing downstream DiT training specs (a core experimental control emphasized in Figure 1 and Section 4).
Minimal integration steps implied by the pipeline (Figure 3):
1. Pretrain a ViT-based tokenizer with L_total = λ_rec L_rec + λ_ssl L_ssl + λ_clip L_clip, using large batches for CLIP (B_clip=16k) and smaller sampled batches for SSL/reconstruction (Section 3.6; Section 4.1).
2. Use the pretrained tokenizer to encode images into latents Z and train a standard DiT in latent space under an unchanged training recipe (Section 4.1; Figure 3).