An Empirical Study of Autoregressive Pre-training from Videos
ArXiv: 2501.05453
🎯 Pitch
This paper introduces Toto, a family of decoder-only autoregressive transformers that treat images and videos as sequences of visual tokens and are pre-trained on ~1 trillion visual tokens to predict next tokens. It shows that simple next-token prediction on large, mixed image/video data yields competitive, general-purpose visual representations across recognition, tracking, forecasting, and robotics, and demonstrates compute–loss scaling behavior for video models analogous to language models—supporting a scalable, unified approach to visual foundation models.
1. Executive Summary (2-3 sentences)
This paper studies whether a simple next-token prediction objective—applied to videos tokenized into sequences of visual tokens—can produce strong, general-purpose visual representations. It introduces a family of decoder-only autoregressive transformers called Toto, pre-trained on a mixture of images and internet videos totaling ~1T training tokens (from a larger pool of ~2.5T tokens / ~100k hours), and evaluates the learned representations across recognition, forecasting, tracking, object permanence, and robotics tasks. The main takeaway is that autoregressive pre-training from videos works competitively with minimal inductive bias, and exhibits compute–loss scaling similar in form (power law) to language models but with a slower exponent (Section 4.9, Figure 9).
2. Context and Motivation
- What specific problem or gap does this paper address?
  - Large language models demonstrate that autoregressive next-token prediction can produce broadly useful representations, but the analogous question for video (the dominant “big visual data”) is “underexplored” (Introduction).
  - The paper targets this gap: can strong, transferable visual/video representations be learned by training a decoder-only model to predict the next visual token in sequences derived from images and videos? (Introduction, Section 3)
- Why is this problem important?
  - Videos contain rich temporal dynamics and diverse real-world signals (egocentric, instructional, action-centric), and the paper argues that video data is still early in large-scale exploitation compared to internet text (Introduction).
  - A unified autoregressive objective that works across images + videos could simplify pre-training and produce representations reusable across many tasks (Figure 1, Sections 3.3–3.5).
- What prior approaches existed, and where do they fall short (as positioned here)?
  - Discriminative self-supervised methods (e.g., instance discrimination, contrastive, DINO-like) are strong for recognition but are not generative (Section 2; comparisons in Tables 7–8).
  - Masked modeling (e.g., MAE/VideoMAE) learns representations by reconstructing masked tokens with a lightweight decoder; it is not pure next-token generation (Section 2).
  - Earlier autoregressive vision work (PixelCNN/Image Transformer/iGPT) mostly focused on images or generation quality; iGPT showed representation learning from pixels, but this paper explores videos at larger token scale and different design choices (Section 2; Appendix A.4; Table 14).
- How this paper positions itself relative to existing work
  - It emphasizes: no supervision during pre-training, joint training on images+videos via a unified token format, and a broad multi-task evaluation beyond standard recognition (Introduction; Section 4).
  - It frames itself as an empirical study: exploring tokenizers, architectures, resolution strategies, probing choices, and scaling (Section 4.1; Section 4.9).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a decoder-only transformer trained like a language model, except its “words” are visual tokens produced from video frames or images.
- It solves representation learning by doing autoregressive next-token prediction over token sequences derived from raster-scanned patches from images and from sequences of video frames (Figure 1; Section 3.1).
3.2 Big-picture architecture (diagram in words)
- (1) Data sources (images + videos) →
- (2) Frame preprocessing (resize/crop/sample frames) →
- (3) Tokenizer (e.g., `dVAE`, optionally `VQGAN` or continuous patch tokens) converts each frame/image into a grid of tokens →
- (4) Sequence construction (raster scan + temporal concatenation + special start/end tokens) →
- (5) Causal transformer (`Toto`) predicts the next token at each position (Sections 3.1–3.4) →
- (6) Representation extraction from intermediate layers + attention pooling or average pooling →
- (7) Downstream evaluation via linear/attention probing or task-specific fine-tuning depending on the task (Section 3.5; Section 4).
3.3 Roadmap for the deep dive
- I will explain:
- How images/videos become token sequences (tokenization + ordering), because this defines the learning problem (Section 3.4).
- The autoregressive objective and what the model is optimizing (Section 3.1, Eq. (1)–(2)).
- The transformer architecture and training setup (Section 3.2, Table 1).
- The data mixture and scale (Section 3.3, Table 2).
- How representations are extracted and probed (Section 3.5; Tables 5–6; Figures 4, 8).
- Key design-choice findings (tokenizer/resolution/architecture) that affect performance (Section 4.1; Tables 3–6).
3.4 Detailed, sentence-based technical breakdown
This is an empirical systems + algorithmic study: it builds a family of autoregressive video token models (Toto) and evaluates how design decisions affect representation quality and scaling, with the core idea being “train a causal transformer to predict the next visual token” (Sections 3–4).
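Written out, the core idea takes the standard autoregressive form. The block below is a reconstruction from the prose descriptions of Eq. (1)–(2) in Section 3.1, not a verbatim copy of the paper's equations: `x^j = (x^j_1, …, x^j_n)` is one token sequence, while the symbols θ (model parameters) and 𝒳 (the set of training sequences) are notation assumed here.

```latex
\begin{aligned}
p(x^j) &= \prod_{i=1}^{n} p\left(x^j_i \mid x^j_1, \ldots, x^j_{i-1}; \theta\right) \\
\mathcal{L}(\theta) &= \mathbb{E}_{x^j \sim \mathcal{X}}\left[-\log p(x^j)\right]
\end{aligned}
```

The first line factorizes the sequence probability into next-token conditionals; the second is the negative log-likelihood that training minimizes.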
3.4.1 System/data pipeline diagram in words (explicit step-by-step flow)
- Collect training data from a mixture of image and video datasets (Section 3.3, Table 2):
  - ImageNet: 13.9M instances, 3.6B tokens.
  - Kinetics-600: 0.53M instances, 41.3B tokens, 1,496 hours.
  - Ego4D: 52.1K instances, 103B tokens, 3,750 hours.
  - HowTo100m: 1.172M instances, 2,560B tokens, 92,627 hours.
  - Combined pool: ~100,000 hours and ~2.5T tokens, while the reported full training uses ~1T tokens (Section 3.3).
- Preprocess each video by resizing and cropping:
  - The video is resized so its shortest side is `R` pixels, then a random crop of size `R × R × T` is taken (Section 3.4).
  - Frames are sampled every 4 frames (stride 4), and `T` is the number of frames per training sample (Section 3.4).
- Tokenize each frame independently into a grid of discrete tokens:
  - Default tokenizer: `dVAE` (from DALL·E) with vocabulary size 8k (Section 3.4).
  - With `R = 128`, each frame becomes a `16 × 16` token grid, i.e., 256 tokens per frame (Section 3.4).
  - With `T = 16` frames, each training sequence is `T × 256 = 4096` tokens (Section 3.4), matching the stated context length of 4k tokens (Figure 1; Section 3.4).
- Convert the 2D token grids into a 1D sequence using raster scan ordering:
  - Raster scan means reading tokens row-by-row (top-left to bottom-right) so spatial data becomes a single sequence (Section 3.1; Appendix A.1).
- Unify images and videos into the same training format:
  - For videos: 16 frames are tokenized and concatenated to form a 4k-token sequence (Section 3.4).
  - For images: 16 images are sampled and concatenated as if they were “frames” to form a 4k-token sequence (Section 3.4).
  - Special tokens: the sequence includes start and end markers, with different start tokens for video vs image (Section 3.4):
    - Video start token: `[1]`
    - Image start token: `[3]`
    - End token: `[2]`
  - Ambiguity to note: the paper states a 4096-token context length (Section 3.4; Figure 1) and also adds start/end tokens; it is not fully specified whether these special tokens are included within the 4096 count or appended beyond it.
- Train a causal transformer to predict the next token:
  - The model defines the probability of a token sequence `x^j = (x^j_1, …, x^j_n)` as a product of conditional next-token probabilities (Section 3.1, Eq. (1)).
    - In plain language: the model predicts each token given all previous tokens in the sequence.
  - Training minimizes negative log-likelihood (Section 3.1, Eq. (2)).
    - In plain language: penalize the model when it assigns low probability to the true next token.
- Extract intermediate representations for downstream tasks:
  - Let `H^l` be the hidden states after layer `l` (Section 3.2).
  - For downstream use, the paper probes intermediate layers rather than only the final layer, because performance often peaks around the model’s middle depth (Section 4.1 “Probing Layer”; Figure 4; Figure 8).
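The sequence-construction steps above (raster scan, temporal concatenation, start/end markers) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's code: random integers stand in for real tokenizer output, "8k" is read as 8192, and `TOKEN_OFFSET` is a hypothetical shift that keeps visual ids from colliding with the special ids, which the source does not describe. Here the special tokens are appended beyond the 4096 visual tokens, which is only one of the two readings of the ambiguity noted above.

```python
import numpy as np

# Values quoted in the text above.
T = 16                 # frames (or images) per training sequence
GRID = 16              # 16 x 16 tokens per frame at R = 128
VOCAB_SIZE = 8192      # assumption: "8k" codebook means 8192 entries
VIDEO_START, END, IMAGE_START = 1, 2, 3

# Assumption: visual token ids are shifted so they cannot collide with the
# special ids. The source does not say how this is handled.
TOKEN_OFFSET = 4


def build_sequence(token_grids, is_video):
    """Flatten T token grids into one 1D sequence with start/end markers."""
    grids = np.asarray(token_grids)
    if grids.shape != (T, GRID, GRID):
        raise ValueError(f"expected shape {(T, GRID, GRID)}, got {grids.shape}")
    # C-order reshape reads each grid row by row (raster scan), one frame
    # after another, which is the temporal concatenation step.
    body = grids.reshape(-1) + TOKEN_OFFSET
    start = VIDEO_START if is_video else IMAGE_START
    return np.concatenate(([start], body, [END]))
```

With these settings the result has `16 × 256 = 4096` visual tokens plus the two markers.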
3.4.2 Model architecture (what is built)
- `Toto` uses a decoder-only transformer with causal attention, aligned with `LLaMA`-style design (Section 3.2):
  - `RMSNorm` (pre-norm),
  - `SwiGLU` activation in the MLP,
  - `RoPE` positional embeddings.
- The layer update equations define a residual attention block followed by a residual MLP block (Section 3.2, Eq. (3)–(4)).
- In plain language:
  - Normalize the hidden states, apply multi-head self-attention, and add the result back (residual).
  - Normalize again, apply an MLP, and add the result back.
- Model sizes and core architectural hyperparameters are given in Table 1:
  - `Toto-base`: 120M params, hidden dim 768, 12 heads, 12 layers.
  - `Toto-large`: 280M params, hidden dim 1024, 16 heads, 16 layers.
  - `Toto-1b`: 1.1B params, hidden dim 2048, 16 heads, 22 layers.
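For reference, the two residual steps described in plain language above are commonly written as follows for a pre-norm block. This is a sketch consistent with that description rather than a transcription of Eq. (3)–(4); the intermediate symbol and the indexing from layer `l-1` to layer `l` are assumptions.

```latex
\begin{aligned}
\tilde{H}^{l} &= H^{l-1} + \mathrm{MHSA}\left(\mathrm{RMSNorm}\left(H^{l-1}\right)\right) \\
H^{l} &= \tilde{H}^{l} + \mathrm{MLP}\left(\mathrm{RMSNorm}\left(\tilde{H}^{l}\right)\right)
\end{aligned}
```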
3.4.3 Training configuration and hyperparameters (what is explicitly specified)
- Batch size: `1M tokens` per batch (Section 3.2).
- Optimizer: `AdamW` (Section 3.2).
- Max learning rate: `3e-4` (Section 3.2).
- Betas: `β1 = 0.9`, `β2 = 0.95` (Section 3.2).
- Warmup: `2000` steps (Section 3.2).
- Learning rate schedule: `cosine decay` after warmup (Section 3.2).
- Not specified in provided text: weight decay value, gradient clipping, EMA, dropout rates, precision (fp16/bf16), or exact hardware; the acknowledgments mention TPU setup help, but no hardware details are provided.
- Training scale: over one trillion visual tokens for pre-training (Section 3.1), and “full training utilized about 1 trillion tokens” (Section 3.3).
- Context length: `4096` tokens (Section 3.4; Figure 1).
- Tokenizer setup (default): `dVAE`, vocab size `8k`, tokenize each frame independently (Section 3.4).
- Training crop resolution: `R = 128` → `16×16` tokens per frame (Section 3.4).
- Temporal length: `T = 16` frames, sampled every 4 frames (Section 3.4).
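The warmup and cosine-decay settings listed above can be turned into a schedule function. The sketch below is illustrative only: the linear warmup shape, the final learning rate of zero, and the total step count are assumptions. The step count is derived as roughly 1T tokens divided by 1M tokens per batch, which gives about one million steps.

```python
import math

MAX_LR = 3e-4             # from the text
WARMUP_STEPS = 2000       # from the text
TOTAL_STEPS = 1_000_000   # assumption: ~1T tokens / ~1M tokens per batch


def learning_rate(step, total_steps=TOTAL_STEPS, min_lr=0.0):
    """Linear warmup to MAX_LR, then cosine decay down to min_lr."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp; the source only states the warmup length.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    span = max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, (step - WARMUP_STEPS) / span)
    return min_lr + 0.5 * (MAX_LR - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Calling `learning_rate(step)` inside a training loop would give the value to hand to an AdamW optimizer configured with the betas listed above.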
3.4.4 Representation extraction and probing (how they measure “representation quality”)
- The paper uses linear probing and attention probing/pooling (Section 3.5):
  - Average pooling probe: global average pool token embeddings from a chosen layer, then train a linear classifier on top.
  - Attention pooling probe: a learned query token `q` cross-attends to intermediate tokens using learned projection matrices (`Wk`, `Wv`) to compute a weighted sum, producing a single representation vector (Section 3.5).
    - Motivation: in causal (decoder-only) models, later tokens “see” more previous context than earlier tokens, so uniform averaging gives early tokens, which have small receptive fields, the same weight as later, better-informed ones (Section 3.5).
- Layer choice matters: best downstream performance often comes from middle layers (~50% depth) rather than the final layer (Section 4.1; Figure 4; Figure 8). The paper attributes worse performance in last layers to their role in reconstructing/predicting tokens (Section 4.1 “Probing Layer”).
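The difference between the two probes can be made concrete with a small NumPy sketch. It is a simplified single-head version under assumed conventions (scaled dot-product scores, no bias terms, randomly initialized parameters in place of trained ones); the paper's probe may differ in head count and normalization.

```python
import numpy as np


def average_pool(tokens):
    """Uniform mean over token embeddings of shape (n, d)."""
    return tokens.mean(axis=0)


def attention_pool(tokens, q, w_k, w_v):
    """One learned query attends over all tokens.

    tokens: (n, d), q: (d_k,), w_k: (d, d_k), w_v: (d, d_v).
    Returns the pooled vector (d_v,) and the attention weights (n,).
    """
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = keys @ q / np.sqrt(q.shape[0])   # assumed scaled dot product
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ values, weights
```

With a zero query the weights are uniform and the result reduces to average pooling of the value vectors, so attention pooling can be read as a learned, non-uniform generalization of the average probe.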
3.4.5 Design choices explored (what they vary and why)
- Tokenizer type (Section 4.1; Table 3; Figure 3):
  - Discrete tokenizers: `dVAE`, `VQGAN` (multiple vocab sizes).
  - Continuous tokens: patch embeddings with normalized patch regression targets.
  - Key nuance: the paper argues `VQGAN` tokenizers may be indirectly contaminated with ImageNet label information due to perceptual loss using a VGG network (Section 3.4; Section 4.1).
- Token resolution strategy (Section 4.1; Table 4):
  - Train cheaper at low token resolution (`16×16` tokens) then fine-tune at higher resolution (`32×32` tokens), leveraging `RoPE` to adapt positional embeddings.
- Architecture family (Section 4.1; Table 6):
  - Compare `LLaMA`-style decoder-only transformer vs `GPT-2`-style vs `Mamba` (state-space model).
- Pooling/probing method (Section 4.1; Table 5):
  - Average pooling vs attention pooling, showing large gains for attention pooling on ImageNet linear probing.
- Scaling behavior (Section 4.9; Figure 9; Appendix A.5; Table 15):
  - Use `µ-parameterization` to tune learning rates across widths and analyze compute-optimal scaling.
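To illustrate why RoPE lends itself to the low-to-high resolution switch, the sketch below implements the standard rotary embedding for a single head and exposes the base as a parameter. It is a generic illustration, not the paper's implementation: the interleaved even/odd pairing is one of several equivalent layouts, and the default base of 10,000 is the common RoPE default rather than a value stated here. A head dimension of 64 corresponds to `Toto-base` (768 / 12) per Table 1.

```python
import numpy as np


def rope_inv_freq(head_dim, base):
    """Per-pair rotation frequencies: base ** (-2i / head_dim)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)


def rope_rotate(x, positions, base=10_000.0):
    """Rotate x of shape (n, head_dim) by angles set by integer positions (n,)."""
    inv_freq = rope_inv_freq(x.shape[-1], base)
    angles = positions[:, None] * inv_freq[None, :]   # (n, head_dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because query-key dot products depend only on the position difference, no learned position table has to be resized when the sequence grows. Raising the base lowers the slowest frequencies, which stretches the longest wavelengths over more positions.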
4. Key Insights and Innovations
- Autoregressive pre-training from mixed image+video tokens yields broadly transferable representations
  - The model is trained with a single objective (next-token prediction) and then evaluated across diverse tasks: ImageNet classification, Kinetics action recognition, Ego4D anticipation, DAVIS tracking, CATER object permanence, and robotics (Sections 4.2–4.7).
  - Significance: it suggests that even without strong vision-specific inductive biases, decoder-only autoregressive modeling can be a viable representation learning strategy (Conclusion; also reflected in task results across Tables 7–12).
- Attention pooling is materially better than average pooling for decoder-only visual transformers
  - Table 5 shows a large gap on ImageNet probing at the same layer and tokenization setup:
    - Average pooling: 53.2% top-1
    - Attention pooling: 61.1% top-1
  - Why it matters: it addresses a decoder-only-specific issue, namely that the receptive field is skewed because later tokens attend to more context (Section 3.5).
- Efficient resolution training strategy: low-res pre-train → high-res fine-tune works well, aided by RoPE
  - Table 4 shows:
    - `dVAE/32` (full higher-res training) achieves 61.2% top-1 at compute `5.68×10^17`.
    - `dVAE/16→32` achieves 63.2% top-1 at compute `2.13×10^17`.
    - `dVAE/16→32†` (RoPE base `50,000`) achieves 64.4% top-1 at the same compute.
  - Significance: this is a practical recipe for reducing compute while improving downstream accuracy, and it ties directly to positional embedding adaptability (Section 4.1 “Resolution”).
- Intermediate layers are best for semantic tasks; last layers may specialize toward reconstruction
  - Figure 4 shows ImageNet probing peaking around ~50% depth across model sizes, and Figure 8 generalizes this to multiple tasks (ImageNet, Kinetics, DAVIS), with robotics showing a slightly different pattern where later layers can also be strong (Section 4.8; Figure 8).
  - This observation guides how to use decoder-only models as feature extractors (probe mid-layers, not the final layer).
- Compute-optimal scaling exhibits a power law but with a slower exponent than a referenced language-model scaling curve
  - The paper fits a compute–loss power law for Toto (Section 4.9; Figure 9): `L(C) = 7.32 · C^{-0.0378}`
  - It compares to a GPT-3 relation `L(C) = 2.57 · C^{-0.048}` and notes they are not directly comparable, but the smaller magnitude exponent suggests slower improvement per additional compute for visual next-token prediction (Section 4.9).
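A short calculation shows what the two exponents imply in practice. The sketch below only evaluates the fitted formulas quoted above; the units of `C` are those of the respective fits and are not restated here, and the helper names are hypothetical. As the paper cautions, the two curves are not directly comparable, so the comparison is about the rate of improvement only.

```python
TOTO_COEFF, TOTO_EXPONENT = 7.32, -0.0378   # fitted values quoted above
GPT3_COEFF, GPT3_EXPONENT = 2.57, -0.048    # reference relation quoted above


def toto_loss(compute):
    """Evaluate L(C) = 7.32 * C^-0.0378."""
    return TOTO_COEFF * compute ** TOTO_EXPONENT


def gpt3_loss(compute):
    """Evaluate L(C) = 2.57 * C^-0.048."""
    return GPT3_COEFF * compute ** GPT3_EXPONENT


def compute_multiplier(loss_ratio, exponent):
    """Factor by which compute must grow so the loss is scaled by loss_ratio.

    For L = a * C^exponent, (k * C)^exponent = loss_ratio * C^exponent
    gives k = loss_ratio ** (1 / exponent), independent of a and C.
    """
    return loss_ratio ** (1.0 / exponent)
```

Under these formulas, a 10x increase in compute lowers the Toto loss by a bit more than 8% versus roughly 10% for the GPT-3 relation, and a 10% loss reduction requires about 16x more compute for Toto versus about 9x for the GPT-3 relation.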
5. Experimental Analysis
5.1 Evaluation methodology (datasets, metrics, setup)
- Pre-training data mixture (Section 3.3; Table 2):
  - Mixed mini-batches are sampled approximately:
    - 20% ImageNet images,
    - 10% Ego4D videos,
    - 10% Kinetics videos,
    - 60% HowTo100m videos (Section 3.3).
  - Total pool ~2.5T tokens; training uses ~1T tokens (Section 3.3).
- Core probing approach:
  - For ImageNet and Kinetics: probe each layer using attention pooling and select the best-performing layer (“optimal layer”) (Section 4.2; Section 4.3).
  - For tracking (DAVIS): use features directly for nearest-neighbor label propagation without fine-tuning/probing (Section 4.5).
- Metrics:
  - ImageNet: Top-1 accuracy (Tables 3–7; Tables 13–14).
  - Kinetics-400: Top-1 accuracy (Table 8; Appendix Table 16).
  - Ego4D anticipation: mean average precision for multiple targets (noun, noun+verb, noun+TTC, overall) (Table 9).
  - DAVIS tracking: `J`, `F`, and `J&F` scores (Table 10).
  - CATER object permanence localization: accuracy (%) at temporal resolutions 16 and 32 frames (Table 12).
  - Robotics: success rate and learning curves (Figure 6; Table 11).
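The nearest-neighbor label propagation mentioned above for DAVIS can be illustrated with a generic sketch. The exact protocol (neighborhood size, temperature, spatial restriction, number of context frames) is not described here, so `k`, `temperature`, and the cosine-similarity weighting below are assumptions drawn from common practice rather than the paper's settings.

```python
import numpy as np


def propagate_labels(ref_feats, ref_labels, tgt_feats, k=5, temperature=0.1):
    """Copy soft labels from reference patches to target patches.

    ref_feats: (n, d), ref_labels: (n, c) one-hot or soft, tgt_feats: (m, d).
    Returns (m, c) soft labels built from each target's k nearest references.
    """
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = tgt @ ref.T                                   # cosine similarity (m, n)
    top_idx = np.argsort(-sim, axis=1)[:, :k]           # k most similar references
    top_sim = np.take_along_axis(sim, top_idx, axis=1)
    w = np.exp((top_sim - top_sim.max(axis=1, keepdims=True)) / temperature)
    w /= w.sum(axis=1, keepdims=True)                   # softmax over neighbors
    return np.einsum("mk,mkc->mc", w, ref_labels[top_idx])
```

No parameters are trained in this procedure, which is why tracking quality directly reflects how well the frozen features encode correspondence.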
5.2 Main quantitative results (with numbers and comparisons)
ImageNet-1k recognition
- Toto performance scales with model size (Table 7):
  - Toto-base (120M): 64.7% top-1
  - Toto-large (280M): 71.1% top-1
  - Toto-1b (1.1B): 75.3% top-1
- Comparison context in Table 7:
  - Strong discriminative/self-supervised baselines are higher (e.g., `DINO` 80.1%, `DINOv2` 86.4%), while generative approaches vary (e.g., `MAE` 80.9%, `AIM` 80.6%).
  - Within autoregressive generative baselines, Table 7 includes iGPT numbers (65.2% for iGPT-L, 72.0% for iGPT-XL) but notes iGPT entries are evaluated with linear probing (Table 7 footnote †).
- Fair linear-probe comparison to iGPT (Appendix A.4; Table 14):
  - Toto-1b: 66.2%
  - iGPT-L (1.386B): 65.2%
- Full fine-tuning (Appendix A.3; Table 13):
  - Toto achieves 82.6% on ImageNet-1K under the paper’s full fine-tuning setting, compared against several methods listed there (Table 13).
  - Caveat: the appendix states fine-tuning uses full attention (Appendix A.3), whereas pre-training uses causal attention (Section 3.2).
Kinetics-400 action recognition
- Scaling with model size is strong (Table 8):
  - Toto-base: 59.3%
  - Toto-large: 65.3%
  - Toto-1b: 74.4%
- With a higher-capacity attention classifier head variant (Appendix A.7; Table 16):
  - Toto-base: 61.2%
  - Toto-large: 65.8%
  - Toto-1b: 74.8%
Ego4D short-term action anticipation
- `Toto-large` achieves an overall score of 2.70 (Table 9), with sub-metrics:
  - Noun: 15.20
  - N+V: 6.75
  - N+TTC: 5.41
  - Overall: 2.70
DAVIS semi-supervised tracking (label propagation)
- Table 10 reports `J&F` improvements with scale and (especially) resolution:
  - Toto-base (256/8): 42.0 J&F
  - Toto-large (256/8): 44.8
  - Toto-1b (256/8): 46.1
  - Toto-large (512/8): 62.4
- The paper notes that at the larger resolution (512), Toto-large outperforms the baselines listed in Table 10.
Object permanence (CATER localization)
- Table 12 reports Toto-large outperforming two baselines:
  - Toto-large: 62.8 (16 frames), 72.9 (32 frames) (Table 12).
- Internal inconsistency to note: the paragraph in Section 4.7 states Toto-large achieves 62.8% and 70.9% for 16 and 32 frames, but Table 12 shows 72.9 for 32 frames. Based on the provided content, the table value (72.9) is the only explicit tabulated number.
Robotics
- Simulation: Figure 6 shows Toto-base learning faster than MAE-base across multiple tasks (Franka/Kuka pick & cabinet), but exact scalar values are not provided in the visible figure text; only qualitative curve comparisons are described (Section 4.6; Figure 6).
- Real-world: Table 11 reports success rates with the same number of demonstrations (`# Traj = 240`):
  - MVP: 75%
  - Toto-base: 63%
- The text says success is reported over 16 trials with object variations (Section 4.6).
5.3 Design-choice experiments (what supports which claims)
- Tokenizer choice has limited effect on ImageNet linear probing in their setup (Section 4.1; Table 3):
  - Examples from Table 3 (all Toto-large trained 400 epochs on ImageNet-1k):
    - VQGAN-VQGAN 16×16 (16k vocab): 61.3
    - dVAE-dVAE 32×32 (8k vocab): 61.2
    - patch-patch 16×16: 60.6
  - But `dVAE-dVAE 16×16` drops to 53.2, indicating resolution/token count matters a lot (Table 3; also discussed in Section 4.1 “Resolution”).
- Attention pooling is critical for decoder-only probing (Table 5; Section 3.5 rationale).
- Architecture matters: LLaMA-style > GPT2-style > Mamba (in this ImageNet linear probing study) (Table 6):
  - LLaMA (280M): 53.2
  - GPT2 (280M): 48.5
  - Mamba (~290M): 40.7
- Resolution training strategy with RoPE improves accuracy-per-compute (Table 4; Section 4.1).
5.4 Do experiments convincingly support the paper’s claims?
- Claim: “competitive performance across all benchmarks” with minimal inductive bias (Abstract/Conclusion).
  - Supported in the sense that Toto is evaluated across many tasks (Sections 4.2–4.7) and achieves:
    - strong scaling trends (Tables 7–8),
    - competitive anticipation results (Table 9),
    - strong tracking at higher resolution (Table 10),
    - strong CATER localization (Table 12),
    - usable robotics performance (Table 11; Figure 6).
  - However, for recognition tasks the paper itself acknowledges discriminative methods often perform better (Sections 4.2–4.3; Tables 7–8).
- Claim: “scaling curves similar to language models, albeit with a different rate” (Abstract; Section 4.9).
  - Directly supported by the fitted power law in Section 4.9 (Figure 9), with an explicit exponent comparison to a GPT-3 scaling relation (Section 4.9).
  - The paper explicitly cautions they are “not comparable directly” (Section 4.9), which appropriately limits overclaiming.
- Ablations / robustness
  - The paper includes multiple ablations (tokenizer, resolution, pooling, architecture, probing layer; Section 4.1) and cross-task layer behavior (Figure 8).
  - Some potentially important controls are not fully specified in the provided text (e.g., exact pre-training compute budget in PF-days, hardware, data filtering/deduplication, contamination checks beyond the tokenizer note about VQGAN perceptual loss).
6. Limitations and Trade-offs
Based on the paper’s own limitations section plus implications of the methodology:
- Internet video quality variance can reduce performance compared to curated data, and introduces diversity/quality challenges (Section 5).
- Tokenizer dependency (non end-to-end learning):
  - The model’s learning and generation quality are bounded by tokenizer quality; quantization limits fidelity, and the paper calls for a “universal visual tokenizer” (Section 5).
  - The paper also flags that some tokenizers (e.g., VQGAN trained with perceptual loss) may indirectly ingest label information (Section 3.4; Section 4.1), complicating “pure self-supervised” claims if such tokenizers are used.
- Video redundancy may make next-token prediction “too easy” temporally
  - Appendix A.1 and Figure 10 show the first frame has higher loss while later frames have lower average loss, interpreted as redundancy that can hinder learning efficient representations.
- Design choices optimized on ImageNet may not be optimal elsewhere
  - The paper’s design-choice conclusions are largely drawn from ImageNet classification experiments (Section 5), and transfer is not guaranteed for dense prediction, fine-grained recognition, or long-horizon temporal reasoning (Section 5).
- Long-range temporal dynamics are not fully assessed
  - Training uses `T = 16` frames with stride 4 (Section 3.4), which constrains temporal coverage per example; the paper notes extended temporal dynamics remain open (Section 5).
- Missing methodological details (in provided content)
  - The paper provides many key training hyperparameters (Section 3.2) but not others (weight decay value, dropout, augmentation specifics, dedup/contamination checks for the full video mixture, or hardware), limiting strict reproducibility from the excerpt alone.
7. Implications and Future Directions
- Field-level implication: decoder-only autoregressive modeling is a viable vision/video pre-training paradigm
  - Toto demonstrates that a causal transformer trained purely on next-token prediction can yield representations that work across semantics (ImageNet/Kinetics), temporal reasoning (Ego4D anticipation), correspondence (DAVIS tracking), object permanence (CATER), and embodied control contexts (robotics) (Sections 4.2–4.7).
  - This broad evaluation supports the idea that “language-model-style” training objectives can generalize beyond text, though recognition still trails top discriminative approaches in many settings (Tables 7–8).
- Scaling implication: compute helps, but gains per compute may be smaller than text
  - The compute-optimal scaling exponent `-0.0378` (Section 4.9) suggests progress with scale but slower loss improvement than the referenced GPT-3 curve exponent `-0.048` (Section 4.9). This motivates research into objectives or tokenizations that reduce redundancy and increase predictive difficulty (Appendix A.1).
- Practical applications / downstream use cases suggested by the evaluations
  - Feature backbone for diverse tasks: The probing and fine-tuning protocols show Toto can serve as a general representation provider (Section 3.5; Sections 4.2–4.8).
  - Tracking via correspondence: The direct use of intermediate features for nearest-neighbor label propagation (Section 4.5) indicates usefulness without task training.
  - Robotics perception: Even `Toto-base` runs in real time per the narrative and achieves 63% success in the reported cube-picking setup (Section 4.6; Table 11), suggesting applicability when compute is constrained.
- Repro/Integration Guidance (when to prefer this method)
  - Prefer Toto-style autoregressive pre-training when:
    - You want a single pre-training objective for both images and videos (Section 3.4).
    - You plan to use a decoder-only transformer and can exploit its generative nature (e.g., robotics tasks where later layers can remain useful; Figure 8).
  - Key integration choices supported by the paper:
    - Use attention pooling, not average pooling, when probing decoder-only representations (Table 5; Section 3.5).
    - Probe middle layers (~half depth) for semantic tasks (Figure 4; Figure 8).
    - Consider low-res pre-train → high-res fine-tune with `RoPE` adaptation for better accuracy-per-compute (Table 4; Section 4.1).
  - If the goal is peak recognition accuracy alone, the paper’s own comparisons show discriminative approaches can outperform generative autoregressive ones (Tables 7–8), so Toto may be more attractive when multi-task transfer and generative/sequence modeling benefits matter.