An Empirical Study of Autoregressive Pre-training from Videos

🎯 Pitch

This paper introduces Toto, a family of decoder-only autoregressive transformers that treat images and videos as sequences of visual tokens and are pre-trained on ~1 trillion visual tokens with a next-token prediction objective. It shows that simple next-token prediction on large, mixed image/video data yields competitive, general-purpose visual representations across recognition, tracking, forecasting, and robotics, and that video models follow compute–loss scaling behavior analogous to language models, which supports a scalable, unified approach to visual foundation models.

1. Executive Summary

This paper studies whether a simple next-token prediction objective, applied to videos tokenized into sequences of visual tokens, can produce strong, general-purpose visual representations. It introduces a family of decoder-only autoregressive transformers called Toto, pre-trained on a mixture of images and internet videos totaling ~1T training tokens (from a larger pool of ~2.5T tokens / ~100k hours), and evaluates the learned representations across recognition, forecasting, tracking, object permanence, and robotics tasks. The main takeaway is that autoregressive pre-training from videos is competitive despite minimal inductive bias, and that its compute–loss scaling follows the same power-law form as language models but with a smaller exponent in magnitude, i.e., slower loss improvement per unit of added compute (Section 4.9, Figure 9).

2. Context and Motivation

What specific problem or gap does this paper address?
Large Language Models demonstrate that autoregressive next-token prediction can produce broadly useful representations, but the analogous question for video—the dominant “big visual data”—is “underexplored” (Introduction).
The paper targets the gap: Can we learn strong, transferable visual/video representations by training a decoder-only model to predict the next visual token in sequences derived from images and videos? (Introduction, Section 3).
Why is this problem important?
Videos contain rich temporal dynamics and diverse real-world signals (egocentric, instructional, action-centric), and the paper argues that video data is still early in large-scale exploitation compared to internet text (Introduction).
A unified autoregressive objective that works across images + videos could simplify pre-training and produce representations reusable across many tasks (Figure 1, Sections 3.3–3.5).
What prior approaches existed, and where do they fall short (as positioned here)?
Discriminative self-supervised methods (e.g., instance discrimination, contrastive, DINO-like) are strong for recognition but are not generative (Section 2; comparisons in Tables 7–8).
Masked modeling (e.g., MAE/VideoMAE) learns representations by reconstructing masked tokens with a lightweight decoder; it is not pure next-token generation (Section 2).
Earlier autoregressive vision work (PixelCNN/Image Transformer/iGPT) mostly focused on images or generation quality; iGPT showed representation learning from pixels, but this paper explores videos at larger token scale and different design choices (Section 2; Appendix A.4; Table 14).
How this paper positions itself relative to existing work
It emphasizes: no supervision during pre-training, joint training on images+videos via a unified token format, and a broad multi-task evaluation beyond standard recognition (Introduction; Section 4).
It frames itself as an empirical study: exploring tokenizers, architectures, resolution strategies, probing choices, and scaling (Section 4.1; Section 4.9).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a decoder-only transformer trained like a language model, except its “words” are visual tokens produced from video frames or images.
It solves representation learning by doing autoregressive next-token prediction over token sequences derived from raster-scanned patches from images and from sequences of video frames (Figure 1; Section 3.1).

3.2 Big-picture architecture (diagram in words)

(1) Data sources (images + videos) →
(2) Frame preprocessing (resize/crop/sample frames) →
(3) Tokenizer (e.g., dVAE, optionally VQGAN or continuous patch tokens) converts each frame/image into a grid of tokens →
(4) Sequence construction (raster scan + temporal concatenation + special start/end tokens) →
(5) Causal transformer (Toto) predicts the next token at each position (Sections 3.1–3.4) →
(6) Representation extraction from intermediate layers + attention pooling or average pooling →
(7) Downstream evaluation via linear/attention probing or task-specific fine-tuning depending on the task (Section 3.5; Section 4).

3.3 Roadmap for the deep dive

I will explain:
How images/videos become token sequences (tokenization + ordering), because this defines the learning problem (Section 3.4).
The autoregressive objective and what the model is optimizing (Section 3.1, Eq. (1)–(2)).
The transformer architecture and training setup (Section 3.2, Table 1).
The data mixture and scale (Section 3.3, Table 2).
How representations are extracted and probed (Section 3.5; Tables 5–6; Figures 4, 8).
Key design-choice findings (tokenizer/resolution/architecture) that affect performance (Section 4.1; Tables 3–6).

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems + algorithmic study: it builds a family of autoregressive video token models (Toto) and evaluates how design decisions affect representation quality and scaling, with the core idea being “train a causal transformer to predict the next visual token” (Sections 3–4).
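The objective cited as Eq. (1)–(2) is referenced in this summary but not reproduced. A standard way to write an autoregressive factorization and its negative log-likelihood, consistent with the plain-language description used here, is sketched below. The parameter symbol \theta and the dataset symbol \mathcal{D} are notational assumptions and may differ from the paper's own notation.

```latex
% Eq. (1), sketch: factorize the probability of a token sequence
% x^j = (x^j_1, ..., x^j_n) into next-token conditionals.
p(x^j) = \prod_{i=1}^{n} p\left(x^j_i \mid x^j_1, \ldots, x^j_{i-1};\, \theta\right)

% Eq. (2), sketch: minimize the negative log-likelihood over the training set.
\mathcal{L}(\theta) = \mathbb{E}_{x^j \sim \mathcal{D}}\left[-\log p(x^j)\right]
```

Substituting the first expression into the second turns the loss into a sum of per-token cross-entropy terms, which is why the same training loop used for language models applies unchanged to visual tokens.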

3.4.1 System/data pipeline diagram in words (explicit step-by-step flow)

Collect training data from a mixture of image and video datasets (Section 3.3, Table 2):
ImageNet: 13.9M instances, 3.6B tokens.
Kinetics-600: 0.53M instances, 41.3B tokens, 1,496 hours.
Ego4D: 52.1K instances, 103B tokens, 3,750 hours.
HowTo100m: 1.172M instances, 2,560B tokens, 92,627 hours.
Combined pool: ~100,000 hours and ~2.5T tokens, while the reported full training uses ~1T tokens (Section 3.3).
Preprocess each video by resizing and cropping:
The video is resized so its shortest side is R pixels, then a random crop of size R × R × T is taken (Section 3.4).
Frames are sampled every 4 frames (stride 4), and T is the number of frames per training sample (Section 3.4).
Tokenize each frame independently into a grid of discrete tokens:
Default tokenizer: dVAE (from DALL·E) with vocabulary size 8k (Section 3.4).
With R = 128, each frame becomes a 16 × 16 token grid, i.e., 256 tokens per frame (Section 3.4).
With T = 16 frames, each training sequence is T × 256 = 4096 tokens (Section 3.4), matching the stated context length of 4k tokens (Figure 1; Section 3.4).
Convert the 2D token grids into a 1D sequence using raster scan ordering:
Raster scan means reading tokens row-by-row (top-left to bottom-right) so spatial data becomes a single sequence (Section 3.1; Appendix A.1).
Unify images and videos into the same training format:
For videos: 16 frames are tokenized and concatenated to form a 4k-token sequence (Section 3.4).
For images: 16 images are sampled and concatenated as if they were “frames” to form a 4k-token sequence (Section 3.4).
Special tokens: the sequence includes start and end markers, with different start tokens for video vs image:
- Video start token: [1]
- Image start token: [3]
- End token: [2] (Section 3.4)
Ambiguity to note: the paper states a 4096-token context length (Section 3.4; Figure 1) and also adds start/end tokens; it is not fully specified whether these special tokens are included within the 4096 count or appended beyond it.
Train a causal transformer to predict the next token:
The model defines the probability of a token sequence x^j = (x^j_1, …, x^j_n) as a product of conditional next-token probabilities (Section 3.1, Eq. (1)):
- In plain language: the model predicts each token given all previous tokens in the sequence.
Training minimizes negative log-likelihood (Section 3.1, Eq. (2)):
- In plain language: penalize the model when it assigns low probability to the true next token.
Extract intermediate representations for downstream tasks:
Let H^l be the hidden states after layer l (Section 3.2).
For downstream use, the paper probes intermediate layers rather than only the final layer, because performance often peaks around the model’s middle depth (Section 4.1 “Probing Layer”; Figure 4; Figure 8).
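The sequence-construction steps above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the random integers stand in for dVAE codes, the helper name `build_sequence` is hypothetical, and placing the start/end markers outside the 4096 visual tokens is an assumption, since the summary notes that this detail is not fully specified. How the special ids coexist with the 8k tokenizer vocabulary is also not stated, so the sketch ignores possible id collisions.

```python
import numpy as np

VIDEO_START, END, IMAGE_START = 1, 2, 3   # special token ids listed in the summary
GRID = 16                                  # 128-pixel crop -> 16 x 16 tokens per frame
FRAMES = 16                                # T video frames (or 16 images) per sequence


def build_sequence(token_grids: np.ndarray, is_video: bool) -> np.ndarray:
    """Flatten (T, H, W) token grids in raster order and add start/end markers."""
    t, h, w = token_grids.shape
    # A C-order reshape reads each grid row by row, then moves to the next frame.
    body = token_grids.reshape(t * h * w)
    start = VIDEO_START if is_video else IMAGE_START
    # Assumption: markers are added outside the 4096 visual tokens.
    return np.concatenate(([start], body, [END]))


# Hypothetical tokenizer output: random ids standing in for dVAE codes.
rng = np.random.default_rng(0)
grids = rng.integers(0, 8192, size=(FRAMES, GRID, GRID))
seq = build_sequence(grids, is_video=True)
print(seq.shape)  # (4098,) = 1 start + 16 * 256 visual tokens + 1 end
```

With this layout, the first 16 visual tokens of the sequence are the top row of the first frame, and an image batch differs from a video batch only in its start token.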

3.4.2 Model architecture (what is built)

Toto uses a decoder-only transformer with causal attention, aligned with LLaMA-style design (Section 3.2):
RMSNorm (pre-norm),
SwiGLU activation in the MLP,
RoPE positional embeddings.
The layer update equations define a residual attention block then a residual MLP block (Section 3.2, Eq. (3)–(4)):
In plain language:
- Normalize the hidden states, apply multi-head self-attention, and add it back (residual).
- Normalize again, apply an MLP, and add it back.
Model sizes and core architectural hyperparameters are given in Table 1:
Toto-base: 120M params, hidden dim 768, 12 heads, 12 layers.
Toto-large: 280M params, hidden dim 1024, 16 heads, 16 layers.
Toto-1b: 1.1B params, hidden dim 2048, 16 heads, 22 layers.
(Table 1)
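The pre-norm residual block described in plain language above can be sketched with NumPy. This is an illustrative reconstruction under simplifying assumptions, not the paper's implementation: it uses a single attention head, omits RoPE, and the helper names, initialization scale, and dimensions are hypothetical.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    # Scale each token vector by the inverse of its root-mean-square.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight


def causal_attention(x, wq, wk, wv, wo):
    # Single-head attention for brevity; RoPE is omitted in this sketch.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)    # position i cannot see j > i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ wo


def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))         # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ w_up)) @ w_down


def block(h, p):
    # Normalize, attend, add back; then normalize, apply the MLP, add back.
    h = h + causal_attention(rms_norm(h, p["n1"]), p["wq"], p["wk"], p["wv"], p["wo"])
    h = h + swiglu_mlp(rms_norm(h, p["n2"]), p["w_gate"], p["w_up"], p["w_down"])
    return h


def init_params(d, hidden, rng, scale=0.02):
    # Hypothetical initialization, chosen only so the sketch runs.
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
              "w_gate": (d, hidden), "w_up": (d, hidden), "w_down": (hidden, d)}
    p = {name: rng.normal(0.0, scale, shape) for name, shape in shapes.items()}
    p["n1"], p["n2"] = np.ones(d), np.ones(d)
    return p


rng = np.random.default_rng(0)
params = init_params(d=32, hidden=64, rng=rng)
hidden_states = rng.normal(size=(8, 32))        # 8 tokens, width 32
out = block(hidden_states, params)
```

Because of the causal mask, changing the last token leaves the outputs at all earlier positions unchanged, which is the property that makes next-token training well defined.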

3.4.3 Training configuration and hyperparameters (what is explicitly specified)

Batch size: 1M tokens per batch (Section 3.2).
Optimizer: AdamW (Section 3.2).
Max learning rate: 3e-4 (Section 3.2).
Betas: β1 = 0.9, β2 = 0.95 (Section 3.2).
Warmup: 2000 steps (Section 3.2).
Learning rate schedule: cosine decay after warmup (Section 3.2).
Not specified in provided text: weight decay value, gradient clipping, EMA, dropout rates, precision (fp16/bf16), or exact hardware; the acknowledgments mention TPU setup help, but no hardware details are provided.
Training scale: over one trillion visual tokens for pre-training (Section 3.1), and “full training utilized about 1 trillion tokens” (Section 3.3).
Context length: 4096 tokens (Section 3.4; Figure 1).
Tokenizer setup (default):
dVAE, vocab size 8k, tokenize each frame independently (Section 3.4).
Training crop resolution: R = 128 → 16×16 tokens per frame (Section 3.4).
Temporal length: T = 16 frames, sampled every 4 frames (Section 3.4).
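The stated schedule (2000 warmup steps, peak learning rate 3e-4, cosine decay) can be sketched as a function of the step index. Several pieces are assumptions rather than reported facts: the linear shape of the warmup, a final learning rate of zero, and the total step count, which is derived here by dividing ~1T training tokens by the 1M-token batch size.

```python
import math

MAX_LR = 3e-4
WARMUP_STEPS = 2000
TOKENS_PER_BATCH = 1_000_000
TOTAL_TOKENS = 1_000_000_000_000
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_BATCH   # derived: 1,000,000 steps
MIN_LR = 0.0   # assumption: the final learning rate is not given in the summary


def learning_rate(step: int) -> float:
    """Linear warmup (assumed shape) followed by cosine decay."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the warmup covers only 0.2% of training, so almost the entire run is spent on the cosine decay.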

3.4.4 Representation extraction and probing (how they measure “representation quality”)

The paper uses linear probing and attention probing/pooling (Section 3.5):
Average pooling probe: global average pool token embeddings from a chosen layer, then train a linear classifier on top.
Attention pooling probe: a learned query token q cross-attends to intermediate tokens using learned projection matrices (Wk, Wv) to compute a weighted sum, producing a single representation vector (Section 3.5).
- Motivation: in causal (decoder-only) models, later tokens “see” more previous context than earlier tokens, so uniform averaging gives early tokens, which have attended to very little context, the same weight as later, better-informed tokens (Section 3.5).
Layer choice matters: best downstream performance often comes from middle layers (~50% depth) rather than the final layer (Section 4.1; Figure 4; Figure 8). The paper attributes worse performance in last layers to their role in reconstructing/predicting tokens (Section 4.1 “Probing Layer”).
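The two pooling probes can be contrasted in a short NumPy sketch. It is a single-head illustration under assumptions: the scaling by the square root of the key dimension and the function names are not taken from the paper, and in a real probe `q`, `w_k`, and `w_v` would be trained together with the linear classifier.

```python
import numpy as np


def average_pool(tokens: np.ndarray) -> np.ndarray:
    """Uniform mean over token embeddings of shape (n, d)."""
    return tokens.mean(axis=0)


def attention_pool(tokens, q, w_k, w_v):
    """tokens: (n, d); q: (d_k,); w_k: (d, d_k); w_v: (d, d_v)."""
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = keys @ q / np.sqrt(q.shape[0])     # one score per token
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                     # weighted sum, shape (d_v,)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4096, 64))            # hypothetical layer activations
q = rng.normal(size=16)
w_k = rng.normal(size=(64, 16))
w_v = rng.normal(size=(64, 32))
pooled = attention_pool(tokens, q, w_k, w_v)
```

A zero query makes every score equal, so attention pooling reduces to average pooling of the projected values. The learned query is what lets the probe down-weight early tokens that have seen little context.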

3.4.5 Design choices explored (what they vary and why)

Tokenizer type (Section 4.1; Table 3; Figure 3):
Discrete tokenizers: dVAE, VQGAN (multiple vocab sizes).
Continuous tokens: patch embeddings with normalized patch regression targets.
Key nuance: the paper argues VQGAN tokenizers may be indirectly contaminated with ImageNet label information due to perceptual loss using a VGG network (Section 3.4; Section 4.1).
Token resolution strategy (Section 4.1; Table 4):
Train cheaper at low token resolution (16×16 tokens) then fine-tune at higher resolution (32×32 tokens), leveraging RoPE to adapt positional embeddings.
Architecture family (Section 4.1; Table 6):
Compare LLaMA-style decoder-only transformer vs GPT-2-style vs Mamba (state-space model).
Pooling/probing method (Section 4.1; Table 5):
Average pooling vs attention pooling, showing large gains for attention pooling on ImageNet linear probing.
Scaling behavior (Section 4.9; Figure 9; Appendix A.5; Table 15):
Use µ-parameterization to tune learning rates across widths and analyze compute-optimal scaling.
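The resolution strategy above relies on RoPE, whose rotation frequencies depend on a base constant. The sketch below shows generic 1D RoPE with interleaved feature pairs and how the base changes the rotation angles. It rests on assumptions: the default base of 10,000 is the common convention, not a value confirmed by this summary, and the way the paper maps spatial and temporal positions onto RoPE indices is not described here.

```python
import numpy as np


def rope_angles(num_positions: int, head_dim: int, base: float) -> np.ndarray:
    """Rotation angle for every (position, frequency pair)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(np.arange(num_positions), inv_freq)


def apply_rope(x: np.ndarray, base: float = 10_000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of x, shape (n, head_dim)."""
    n, d = x.shape
    ang = rope_angles(n, d, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


# A larger base gives smaller angles per position for every pair but the first,
# so the same rotation range is spread over a longer sequence.
angles_default = rope_angles(1024, 64, 10_000.0)
angles_large = rope_angles(1024, 64, 50_000.0)
```

Since RoPE encodes position through rotations instead of a learned table, a model trained on 16×16 token grids has no fixed-size position parameters to resize when fine-tuned on 32×32 grids. That is one plausible reading of why the summary describes RoPE as aiding the transfer.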

4. Key Insights and Innovations

Autoregressive pre-training from mixed image+video tokens yields broadly transferable representations
The model is trained with a single objective (next-token prediction) and then evaluated across diverse tasks: ImageNet classification, Kinetics action recognition, Ego4D anticipation, DAVIS tracking, CATER object permanence, and robotics (Section 4.2–4.7).
Significance: it suggests that even without strong vision-specific inductive biases, decoder-only autoregressive modeling can be a viable representation learning strategy (Conclusion; also reflected in task results across Tables 7–12).
Attention pooling is materially better than average pooling for decoder-only visual transformers
Table 5 shows a large gap on ImageNet probing at the same layer and tokenization setup:
- Average pooling: 53.2% top-1
- Attention pooling: 61.1% top-1
  (Table 5)
Why it matters: it addresses a decoder-only-specific issue—the receptive field is skewed because later tokens attend to more context (Section 3.5).
Efficient resolution training strategy: low-res pre-train → high-res fine-tune works well, aided by RoPE
Table 4 shows:
- dVAE/32 (full higher-res training) achieves 61.2% top-1 at compute 5.68×10^17.
- dVAE/16→32 achieves 63.2% top-1 at compute 2.13×10^17.
- dVAE/16→32† (RoPE base 50,000) achieves 64.4% top-1 at the same compute.
  (Table 4)
Significance: this is a practical recipe for reducing compute while improving downstream accuracy, and it ties directly to positional embedding adaptability (Section 4.1 “Resolution”).
Intermediate layers are best for semantic tasks; last layers may specialize toward reconstruction
Figure 4 shows ImageNet probing peaking around ~50% depth across model sizes, and Figure 8 generalizes this to multiple tasks (ImageNet, Kinetics, DAVIS), with robotics showing a slightly different pattern where later layers can also be strong (Section 4.8; Figure 8).
This observation guides how to use decoder-only models as feature extractors (probe mid-layers, not the final layer).
Compute-optimal scaling follows a power law, but with a smaller exponent (in magnitude) than a referenced language-model scaling curve
The paper fits a compute–loss power law for Toto (Section 4.9; Figure 9):
- L(C) = 7.32 · C^{-0.0378}
It compares to a GPT-3 relation L(C) = 2.57 · C^{-0.048} and notes they are not directly comparable, but the smaller magnitude exponent suggests slower improvement per additional compute for visual next-token prediction (Section 4.9).
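To make the exponent difference concrete, the two fitted relations can be evaluated for a tenfold increase in compute. This is a small worked example using only the coefficients quoted above. The unit of C is not stated in this summary, but the ratio for a 10x increase does not depend on it. The two losses are measured over different token vocabularies, so only the rates, not the absolute values, are meaningfully compared.

```python
TOTO_COEF, TOTO_EXP = 7.32, -0.0378
GPT3_COEF, GPT3_EXP = 2.57, -0.048


def power_law_loss(compute: float, coef: float, exponent: float) -> float:
    """L(C) = coef * C^exponent."""
    return coef * compute ** exponent


def loss_ratio_per_10x(exponent: float) -> float:
    """Factor by which the loss shrinks when compute grows tenfold."""
    return 10.0 ** exponent


print(round(loss_ratio_per_10x(TOTO_EXP), 4))  # 0.9166
print(round(loss_ratio_per_10x(GPT3_EXP), 4))  # 0.8954
```

Under these fits, each 10x of compute lowers the Toto loss by about 8.3%, versus about 10.5% for the referenced GPT-3 relation.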

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, setup)

Pre-training data mixture (Section 3.3; Table 2):
Mixed mini-batches are sampled approximately:
- 20% ImageNet images,
- 10% Ego4D videos,
- 10% Kinetics videos,
- 60% HowTo100m videos (Section 3.3).
Total pool ~2.5T tokens; training uses ~1T tokens (Section 3.3).
Core probing approach:
For ImageNet and Kinetics: probe each layer using attention pooling and select the best-performing layer (“optimal layer”) (Section 4.2; Section 4.3).
For tracking (DAVIS): use features directly for nearest-neighbor label propagation without fine-tuning/probing (Section 4.5).
Metrics:
ImageNet: Top-1 accuracy (Tables 3–7; Tables 13–14).
Kinetics-400: Top-1 accuracy (Table 8; Appendix Table 16).
Ego4D anticipation: mean average precision for multiple targets (noun, noun+verb, noun+TTC, overall) (Table 9).
DAVIS tracking: J, F, and J&F scores (Table 10).
CATER object permanence localization: accuracy (%) at temporal resolutions 16 and 32 frames (Table 12).
Robotics: success rate and learning curves (Figure 6; Table 11).
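The mixture proportions listed at the start of this subsection imply very different numbers of passes over each source. The back-of-the-envelope calculation below is a derived estimate, not a figure from the paper. It assumes the mixture fractions apply to tokens (plausible because image and video sequences have the same length) and that the per-source token counts are the ones quoted earlier in this summary from Table 2.

```python
POOL_TOKENS = {            # tokens available per source, as summarized from Table 2
    "ImageNet": 3.6e9,
    "Kinetics-600": 41.3e9,
    "Ego4D": 103e9,
    "HowTo100m": 2560e9,
}
MIX = {"ImageNet": 0.20, "Kinetics-600": 0.10, "Ego4D": 0.10, "HowTo100m": 0.60}
TRAIN_TOKENS = 1e12        # approximate total training tokens


def effective_passes(source: str) -> float:
    """Approximate number of times a source is traversed during training."""
    return MIX[source] * TRAIN_TOKENS / POOL_TOKENS[source]


for name in POOL_TOKENS:
    print(f"{name}: {effective_passes(name):.2f} passes")
```

Under these assumptions the model sees ImageNet roughly 56 times and Kinetics-600 about 2.4 times, makes just under one pass over Ego4D, and covers only about a quarter of HowTo100m.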

5.2 Main quantitative results (with numbers and comparisons)

ImageNet-1k recognition

Toto performance scales with model size (Table 7):
Toto-base (120M): 64.7% top-1
Toto-large (280M): 71.1% top-1
Toto-1b (1.1B): 75.3% top-1
(Table 7)
Comparison context in Table 7:
Strong discriminative/self-supervised baselines are higher (e.g., DINO 80.1%, DINOv2 86.4%), while generative approaches vary (e.g., MAE 80.9%, AIM 80.6%).
Within autoregressive generative baselines, Table 7 includes iGPT numbers (65.2% for iGPT-L, 72.0% for iGPT-XL) but notes iGPT entries are evaluated with linear probing (Table 7 footnote †).
Fair linear-probe comparison to iGPT (Appendix A.4; Table 14):
Toto-1b: 66.2%
iGPT-L (1.386B): 65.2%
(Table 14)
Full fine-tuning (Appendix A.3; Table 13):
Toto achieves 82.6% on ImageNet-1K under the paper’s full fine-tuning setting, compared against several methods listed there (Table 13).
Caveat: the appendix states fine-tuning uses full attention (Appendix A.3), whereas pre-training uses causal attention (Section 3.2).

Kinetics-400 action recognition

Scaling with model size is strong (Table 8):
Toto-base: 59.3%
Toto-large: 65.3%
Toto-1b: 74.4%
(Table 8)
With a higher-capacity attention classifier head variant (Appendix A.7; Table 16):
Toto-1b: 74.8%
Toto-large: 65.8%
Toto-base: 61.2%
(Table 16)

Ego4D short-term action anticipation

Toto-large achieves an overall score of 2.70 (Table 9), with sub-metrics:
Noun: 15.20
N+V: 6.75
N+TTC: 5.41
Overall: 2.70
(Table 9)

DAVIS semi-supervised tracking (label propagation)

Table 10 reports J&F improvements with scale and (especially) resolution:
Toto-base (256/8): 42.0 J&F
Toto-large (256/8): 44.8
Toto-1b (256/8): 46.1
Toto-large (512/8): 62.4
(Table 10)
The paper notes that at large resolution (512), Toto-large outperforms listed baselines in that table (Table 10).
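Nearest-neighbor label propagation, the protocol named for this benchmark, can be sketched generically as follows. This is a common formulation, not the paper's exact procedure: the neighbor count `k`, the softmax `temperature`, and the function name are hypothetical, and practical pipelines usually also restrict neighbors to a local spatial window and draw on several past frames.

```python
import numpy as np


def propagate_labels(ref_feats, ref_labels, tgt_feats, k=5, temperature=0.1):
    """
    ref_feats:  (n_ref, d) features of patches with known labels.
    ref_labels: (n_ref, c) one-hot or soft label per reference patch.
    tgt_feats:  (n_tgt, d) features of patches in the frame to label.
    Returns (n_tgt, c) soft labels for the target patches.
    """
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = tgt @ ref.T                                    # cosine similarity
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]    # k most similar references
    top = np.take_along_axis(sim, idx, axis=1) / temperature
    w = np.exp(top - top.max(axis=1, keepdims=True))     # softmax over neighbors
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("tk,tkc->tc", w, ref_labels[idx])   # weighted label vote


# Toy example: two reference patches per class, two target patches.
ref_feats = np.array([[1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0]])
ref_labels = np.eye(2)[[0, 0, 1, 1]]
tgt_feats = np.array([[0.9, 0.0], [0.0, 0.8]])
soft = propagate_labels(ref_feats, ref_labels, tgt_feats, k=2)
print(soft.argmax(axis=1))  # [0 1]
```

No parameters are trained in this procedure, so the resulting J&F score reflects how well the frozen features support correspondence.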

Object permanence (CATER localization)

Table 12 reports Toto-large outperforming two baselines:
Toto-large: 62.8 (16 frames), 72.9 (32 frames) (Table 12).
Internal inconsistency to note: the paragraph in Section 4.7 states Toto-large achieves 62.8% and 70.9% for 16 and 32 frames, but Table 12 shows 72.9 for 32 frames. Based on the provided content, the table value (72.9) is the only explicit tabulated number.

Robotics

Simulation: Figure 6 shows Toto-base learning faster than MAE-base across multiple tasks (Franka/Kuka pick & cabinet), but exact scalar values are not provided in the visible figure text; only qualitative curve comparisons are described (Section 4.6; Figure 6).
Real-world: Table 11 reports success rates with the same number of demonstrations (# Traj = 240):
MVP: 75%
Toto-base: 63%
(Table 11)
The text says success is reported over 16 trials with object variations (Section 4.6).

5.3 Design-choice experiments (what supports which claims)

Tokenizer choice has limited effect on ImageNet linear probing in their setup (Section 4.1; Table 3):
Examples from Table 3 (all Toto-large trained 400 epochs on ImageNet-1k):
- VQGAN-VQGAN 16×16 (16k vocab): 61.3
- dVAE-dVAE 32×32 (8k vocab): 61.2
- patch-patch 16×16: 60.6
But dVAE-dVAE 16×16 drops to 53.2, indicating resolution/token count matters a lot (Table 3; also discussed in Section 4.1 “Resolution”).
Attention pooling is critical for decoder-only probing (Table 5; Section 3.5 rationale).
Architecture matters: LLaMA-style > GPT2-style > Mamba (in this ImageNet linear probing study) (Table 6):
LLaMA (280M): 53.2
GPT2 (280M): 48.5
Mamba (~290M): 40.7
(Table 6)
Resolution training strategy with RoPE improves accuracy-per-compute (Table 4; Section 4.1).

5.4 Do experiments convincingly support the paper’s claims?

Claim: “competitive performance across all benchmarks” with minimal inductive bias (Abstract/Conclusion).
Supported in the sense that Toto is evaluated across many tasks (Sections 4.2–4.7) and achieves:
- strong scaling trends (Tables 7–8),
- competitive anticipation results (Table 9),
- strong tracking at higher resolution (Table 10),
- strong CATER localization (Table 12),
- usable robotics performance (Table 11; Figure 6).
However, for recognition tasks the paper itself acknowledges discriminative methods often perform better (Sections 4.2–4.3; Tables 7–8).
Claim: “scaling curves similar to language models, albeit with a different rate” (Abstract; Section 4.9).
Directly supported by the fitted power law in Section 4.9 (Figure 9), with an explicit exponent comparison to a GPT-3 scaling relation (Section 4.9).
The paper explicitly cautions they are “not comparable directly” (Section 4.9), which appropriately limits overclaiming.
Ablations / robustness
The paper includes multiple ablations (tokenizer, resolution, pooling, architecture, probing layer; Section 4.1) and cross-task layer behavior (Figure 8).
Some potentially important controls are not fully specified in the provided text (e.g., exact pre-training compute budget in PF-days, hardware, data filtering/deduplication, contamination checks beyond the tokenizer note about VQGAN perceptual loss).

6. Limitations and Trade-offs

Based on the paper’s own limitations section plus implications of the methodology:

Internet video quality variance can reduce performance compared to curated data, and introduces diversity/quality challenges (Section 5).
Tokenizer dependency (non end-to-end learning):
The model’s learning and generation quality are bounded by tokenizer quality; quantization limits fidelity, and the paper calls for a “universal visual tokenizer” (Section 5).
The paper also flags that some tokenizers (e.g., VQGAN trained with perceptual loss) may indirectly ingest label information (Section 3.4; Section 4.1), complicating “pure self-supervised” claims if such tokenizers are used.
Video redundancy may make next-token prediction “too easy” temporally
Appendix A.1 and Figure 10 show the first frame has higher loss while later frames have lower average loss, interpreted as redundancy that can hinder learning efficient representations.
Design choices optimized on ImageNet may not be optimal elsewhere
The paper’s design-choice conclusions are largely drawn from ImageNet classification experiments (Section 5), and transfer is not guaranteed for dense prediction, fine-grained recognition, or long-horizon temporal reasoning (Section 5).
Long-range temporal dynamics are not fully assessed
Training uses T = 16 frames with stride 4 (Section 3.4), which constrains temporal coverage per example; the paper notes extended temporal dynamics remain open (Section 5).
Missing methodological details (in provided content)
The paper provides many key training hyperparameters (Section 3.2) but not others (weight decay value, dropout, augmentation specifics, dedup/contamination checks for the full video mixture, or hardware), limiting strict reproducibility from the excerpt alone.

7. Implications and Future Directions

Field-level implication: decoder-only autoregressive modeling is a viable vision/video pre-training paradigm
Toto demonstrates that a causal transformer trained purely on next-token prediction can yield representations that work across semantics (ImageNet/Kinetics), temporal reasoning (Ego4D anticipation), correspondence (DAVIS tracking), object permanence (CATER), and embodied control contexts (robotics) (Sections 4.2–4.7).
This broad evaluation supports the idea that “language-model-style” training objectives can generalize beyond text, though recognition still trails top discriminative approaches in many settings (Tables 7–8).
Scaling implication: compute helps, but gains per compute may be smaller than text
The compute-optimal scaling exponent -0.0378 (Section 4.9) suggests progress with scale but slower loss improvement than the referenced GPT-3 curve exponent -0.048 (Section 4.9). This motivates research into objectives or tokenizations that reduce redundancy and increase predictive difficulty (Appendix A.1).
Practical applications / downstream use cases suggested by the evaluations
Feature backbone for diverse tasks: The probing and fine-tuning protocols show Toto can serve as a general representation provider (Section 3.5; Sections 4.2–4.8).
Tracking via correspondence: The direct use of intermediate features for nearest-neighbor label propagation (Section 4.5) indicates usefulness without task training.
Robotics perception: According to the paper's narrative, even Toto-base runs in real time, and it achieves 63% success in the reported cube-picking setup (Section 4.6; Table 11), suggesting applicability when compute is constrained.
Repro/Integration Guidance (when to prefer this method)
Prefer Toto-style autoregressive pre-training when:
- You want a single pre-training objective for both images and videos (Section 3.4).
- You plan to use a decoder-only transformer and can exploit its generative nature (e.g., robotics tasks where later layers can remain useful; Figure 8).
Key integration choices supported by the paper:
- Use attention pooling, not average pooling, when probing decoder-only representations (Table 5; Section 3.5).
- Probe middle layers (~half depth) for semantic tasks (Figure 4; Figure 8).
- Consider low-res pre-train → high-res fine-tune with RoPE adaptation for better accuracy-per-compute (Table 4; Section 4.1).
If the goal is peak recognition accuracy alone, the paper’s own comparisons show discriminative approaches can outperform generative autoregressive ones (Tables 7–8), so Toto may be more attractive when multi-task transfer and generative/sequence modeling benefits matter.