An Empirical Study of Autoregressive Pre-training from Videos

ArXiv: 2501.05453

🎯 Pitch

This paper introduces Toto, a family of decoder-only autoregressive transformers that treat images and videos as sequences of visual tokens and are pre-trained on ~1 trillion visual tokens to predict next tokens. It shows that simple next-token prediction on large, mixed image/video data yields competitive, general-purpose visual representations across recognition, tracking, forecasting, and robotics, and demonstrates compute–loss scaling behavior for video models analogous to language models—supporting a scalable, unified approach to visual foundation models.


1. Executive Summary

This paper studies whether a simple next-token prediction objective, applied to videos tokenized into sequences of visual tokens, can produce strong, general-purpose visual representations. It introduces a family of decoder-only autoregressive transformers called Toto, pre-trained on a mixture of images and internet videos totaling ~1T training tokens (from a larger pool of ~2.5T tokens / ~100k hours), and evaluates the learned representations across recognition, forecasting, tracking, object permanence, and robotics tasks. The main takeaway is that autoregressive pre-training from videos performs competitively with minimal inductive bias and exhibits compute–loss scaling of the same power-law form as language models, but with a smaller-magnitude exponent, i.e., slower loss improvement per unit of added compute (Section 4.9, Figure 9).

2. Context and Motivation

  • What specific problem or gap does this paper address?
  • Large Language Models demonstrate that autoregressive next-token prediction can produce broadly useful representations, but the analogous question for video—the dominant “big visual data”—is “underexplored” (Introduction).
  • The paper targets the gap: Can we learn strong, transferable visual/video representations by training a decoder-only model to predict the next visual token in sequences derived from images and videos? (Introduction, Section 3).

  • Why is this problem important?

  • Videos contain rich temporal dynamics and diverse real-world signals (egocentric, instructional, action-centric), and the paper argues that video data is still early in large-scale exploitation compared to internet text (Introduction).
  • A unified autoregressive objective that works across images + videos could simplify pre-training and produce representations reusable across many tasks (Figure 1, Sections 3.3–3.5).

  • What prior approaches existed, and where do they fall short (as positioned here)?

  • Discriminative self-supervised methods (e.g., instance discrimination, contrastive, DINO-like) are strong for recognition but are not generative (Section 2; comparisons in Tables 7–8).
  • Masked modeling (e.g., MAE/VideoMAE) learns representations by reconstructing masked tokens with a lightweight decoder; it is not pure next-token generation (Section 2).
  • Earlier autoregressive vision work (PixelCNN/Image Transformer/iGPT) mostly focused on images or generation quality; iGPT showed representation learning from pixels, but this paper explores videos at larger token scale and different design choices (Section 2; Appendix A.4; Table 14).

  • How this paper positions itself relative to existing work

  • It emphasizes: no supervision during pre-training, joint training on images+videos via a unified token format, and a broad multi-task evaluation beyond standard recognition (Introduction; Section 4).
  • It frames itself as an empirical study: exploring tokenizers, architectures, resolution strategies, probing choices, and scaling (Section 4.1; Section 4.9).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a decoder-only transformer trained like a language model, except its “words” are visual tokens produced from video frames or images.
  • It solves representation learning by doing autoregressive next-token prediction over token sequences derived from raster-scanned patches from images and from sequences of video frames (Figure 1; Section 3.1).

3.2 Big-picture architecture (diagram in words)

  • (1) Data sources (images + videos) →
  • (2) Frame preprocessing (resize/crop/sample frames) →
  • (3) Tokenizer (e.g., dVAE, optionally VQGAN or continuous patch tokens) converts each frame/image into a grid of tokens →
  • (4) Sequence construction (raster scan + temporal concatenation + special start/end tokens) →
  • (5) Causal transformer (Toto) predicts the next token at each position (Sections 3.1–3.4) →
  • (6) Representation extraction from intermediate layers + attention pooling or average pooling →
  • (7) Downstream evaluation via linear/attention probing or task-specific fine-tuning depending on the task (Section 3.5; Section 4).

3.3 Roadmap for the deep dive

  • I will explain:
  • How images/videos become token sequences (tokenization + ordering), because this defines the learning problem (Section 3.4).
  • The autoregressive objective and what the model is optimizing (Section 3.1, Eq. (1)–(2)).
  • The transformer architecture and training setup (Section 3.2, Table 1).
  • The data mixture and scale (Section 3.3, Table 2).
  • How representations are extracted and probed (Section 3.5; Tables 5–6; Figures 4, 8).
  • Key design-choice findings (tokenizer/resolution/architecture) that affect performance (Section 4.1; Tables 3–6).

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems + algorithmic study: it builds a family of autoregressive video token models (Toto) and evaluates how design decisions affect representation quality and scaling, with the core idea being “train a causal transformer to predict the next visual token” (Sections 3–4).

3.4.1 System/data pipeline diagram in words (explicit step-by-step flow)

  1. Collect training data from a mixture of image and video datasets (Section 3.3, Table 2):
    • ImageNet: 13.9M instances, 3.6B tokens.
    • Kinetics-600: 0.53M instances, 41.3B tokens, 1,496 hours.
    • Ego4D: 52.1K instances, 103B tokens, 3,750 hours.
    • HowTo100M: 1.172M instances, 2,560B tokens, 92,627 hours.
    • Combined pool: ~100,000 hours and ~2.5T tokens, while the reported full training uses ~1T tokens (Section 3.3).

  2. Preprocess each video by resizing and cropping:
    • The video is resized so its shortest side is R pixels, then a random crop of size R × R × T is taken (Section 3.4).
    • Every 4th frame is sampled (stride 4), and T is the number of frames per training sample (Section 3.4).

  3. Tokenize each frame independently into a grid of discrete tokens:
    • Default tokenizer: dVAE (from DALL·E) with vocabulary size 8k (Section 3.4).
    • With R = 128, each frame becomes a 16 × 16 token grid, i.e., 256 tokens per frame (Section 3.4).
    • With T = 16 frames, each training sequence is T × 256 = 4096 tokens (Section 3.4), matching the stated context length of 4k tokens (Figure 1; Section 3.4).

  4. Convert the 2D token grids into a 1D sequence using raster scan ordering:
    • Raster scan means reading tokens row-by-row (top-left to bottom-right) so spatial data becomes a single sequence (Section 3.1; Appendix A.1).

  5. Unify images and videos into the same training format:
    • For videos: 16 frames are tokenized and concatenated to form a 4k-token sequence (Section 3.4).
    • For images: 16 images are sampled and concatenated as if they were “frames” to form a 4k-token sequence (Section 3.4).
    • Special tokens: the sequence includes start and end markers, with different start tokens for video vs image:
      • Video start token: [1]
      • Image start token: [3]
      • End token: [2] (Section 3.4)
    • Ambiguity to note: the paper states a 4096-token context length (Section 3.4; Figure 1) and also adds start/end tokens; it is not fully specified whether these special tokens are included within the 4096 count or appended beyond it.

  6. Train a causal transformer to predict the next token:
    • The model defines the probability of a token sequence x^j = (x^j_1, …, x^j_n) as a product of conditional next-token probabilities (Section 3.1, Eq. (1)), where θ denotes the model parameters:
      p(x^j) = ∏_{i=1}^{n} p(x^j_i | x^j_1, …, x^j_{i-1}; θ)
      In plain language: the model predicts each token given all previous tokens in the sequence.
    • Training minimizes negative log-likelihood over the set X of training sequences (Section 3.1, Eq. (2)):
      L_pre-train = E_{x^j ∼ X} [ −log p(x^j) ]
      In plain language: penalize the model when it assigns low probability to the true next token.

  7. Extract intermediate representations for downstream tasks:
    • Let H^l be the hidden states after layer l (Section 3.2).
    • For downstream use, the paper probes intermediate layers rather than only the final layer, because performance often peaks around the model’s middle depth (Section 4.1 “Probing Layer”; Figure 4; Figure 8).
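
The sequence construction and objective described in the steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's code: the tokenizer is replaced by random ids, the special tokens are placed outside the 4096 visual tokens (the paper leaves this open), and the `SHIFT` offset that keeps visual ids from colliding with the special ids is hypothetical.

```python
import numpy as np

VOCAB = 8192                 # DALL-E dVAE codebook size (the "8k" vocabulary)
GRID = 16                    # 16 x 16 tokens per frame at R = 128
T = 16                       # frames (or images) per sequence
VIDEO_START, END, IMAGE_START = 1, 2, 3
SHIFT = 4                    # assumption: shift visual ids so they never clash with special ids


def build_sequence(token_grids, is_video):
    """Flatten (T, GRID, GRID) tokenizer ids into one 1D training sequence."""
    # C-order reshape reads each grid row by row (raster scan), frame after frame.
    body = token_grids.reshape(-1) + SHIFT
    start = VIDEO_START if is_video else IMAGE_START
    return np.concatenate([[start], body, [END]])


def next_token_nll(logits, sequence):
    """Mean of -log p(x_i | x_<i); logits[i] scores the token at sequence[i + 1]."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    targets = sequence[1:]
    return -log_probs[np.arange(len(targets)), targets].mean()


rng = np.random.default_rng(0)
grids = rng.integers(0, VOCAB, size=(T, GRID, GRID))   # stand-in for tokenizer output
seq = build_sequence(grids, is_video=True)

# Score only a short prefix to keep the demo small; a model with uniform
# logits pays log(vocabulary size) per token.
prefix = seq[:65]
uniform_logits = np.zeros((len(prefix) - 1, VOCAB + SHIFT))
loss = next_token_nll(uniform_logits, prefix)
```

With these choices the sequence holds 16 × 256 = 4096 visual tokens plus the two markers, and the uniform-logit loss equals the log of the vocabulary size, which is a useful sanity baseline before any training.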

3.4.2 Model architecture (what is built)

  • Toto uses a decoder-only transformer with causal attention, aligned with LLaMA-style design (Section 3.2):
  • RMSNorm (pre-norm),
  • SwiGLU activation in the MLP,
  • RoPE positional embeddings.

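  • The layer update equations define a residual attention block followed by a residual MLP block (Section 3.2, Eq. (3)–(4)), where Norm is the pre-normalization listed above and Ĥ^l is the intermediate state after attention:
    • Ĥ^l = H^l + MHSA(Norm(H^l))
    • H^{l+1} = Ĥ^l + MLP(Norm(Ĥ^l))

  • In plain language:

    • Normalize the hidden states, apply multi-head self-attention, and add it back (residual).
    • Normalize again, apply an MLP, and add it back.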
  • Model sizes and core architectural hyperparameters are given in Table 1:

  • Toto-base: 120M params, hidden dim 768, 12 heads, 12 layers.
  • Toto-large: 280M params, hidden dim 1024, 16 heads, 16 layers.
  • Toto-1b: 1.1B params, hidden dim 2048, 16 heads, 22 layers.
    (Table 1)
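
To make the block structure concrete, here is a minimal NumPy sketch of one pre-norm decoder layer with RMSNorm, causal multi-head attention, and a SwiGLU MLP. It is an illustration only: RoPE is omitted for brevity, the weight shapes, initialization, and the `params` dictionary are hypothetical, and the toy sizes are far smaller than those in Table 1.

```python
import numpy as np


def rms_norm(x, gain, eps=1e-6):
    """Scale each token vector by its root-mean-square, then apply a learned gain."""
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps) * gain


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(x, wq, wk, wv, wo, n_heads):
    n, d = x.shape
    hd = d // n_heads
    split = lambda m: m.reshape(n, n_heads, hd).transpose(1, 0, 2)   # (heads, n, hd)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)                  # (heads, n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)               # positions j > i
    scores = np.where(future, -np.inf, scores)                       # hide the future
    out = softmax(scores) @ v
    return out.transpose(1, 0, 2).reshape(n, d) @ wo


def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))          # SiLU(gate) = gate * sigmoid(gate)
    return (silu * (x @ w_up)) @ w_down


def decoder_block(h, p, n_heads):
    """Residual attention sub-block followed by a residual MLP sub-block."""
    h = h + causal_attention(rms_norm(h, p["g1"]), p["wq"], p["wk"], p["wv"], p["wo"], n_heads)
    return h + swiglu_mlp(rms_norm(h, p["g2"]), p["w_gate"], p["w_up"], p["w_down"])


rng = np.random.default_rng(0)
d, n_heads, hidden, n = 32, 4, 64, 6             # toy sizes, not Table 1 values
params = {
    "g1": np.ones(d), "g2": np.ones(d),
    "wq": rng.normal(0, 0.1, (d, d)), "wk": rng.normal(0, 0.1, (d, d)),
    "wv": rng.normal(0, 0.1, (d, d)), "wo": rng.normal(0, 0.1, (d, d)),
    "w_gate": rng.normal(0, 0.1, (d, hidden)), "w_up": rng.normal(0, 0.1, (d, hidden)),
    "w_down": rng.normal(0, 0.1, (hidden, d)),
}
h = rng.normal(size=(n, d))
out = decoder_block(h, params, n_heads)

# Causality check: changing the last token must not affect earlier positions.
h_changed = h.copy()
h_changed[-1] += 1.0
out_changed = decoder_block(h_changed, params, n_heads)
```

The causality check at the end is the property that makes next-token training valid: each position's output depends only on itself and earlier tokens.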

3.4.3 Training configuration and hyperparameters (what is explicitly specified)

  • Batch size: 1M tokens per batch (Section 3.2).
  • Optimizer: AdamW (Section 3.2).
  • Max learning rate: 3e-4 (Section 3.2).
  • Betas: β1 = 0.9, β2 = 0.95 (Section 3.2).
  • Warmup: 2000 steps (Section 3.2).
  • Learning rate schedule: cosine decay after warmup (Section 3.2).
  • Not specified in provided text: weight decay value, gradient clipping, EMA, dropout rates, precision (fp16/bf16), or exact hardware; the acknowledgments mention TPU setup help, but no hardware details are provided.

  • Training scale: “over 1 trillion visual tokens” for pre-training (Section 3.1) and “full training utilized about 1 trillion tokens” (Section 3.3).

  • Context length: 4096 tokens (Section 3.4; Figure 1).

  • Tokenizer setup (default):

  • dVAE, vocab size 8k, tokenize each frame independently (Section 3.4).
  • Training crop resolution: R = 128, giving 16×16 tokens per frame (Section 3.4).
  • Temporal length: T = 16 frames, sampled every 4 frames (Section 3.4).
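
The schedule implied by these hyperparameters can be sketched as follows. The peak rate, warmup length, batch size, and token budget come from the list above; the linear warmup shape, the decay floor `MIN_LR = 0`, and treating "1M" and "~1T" as exact round numbers are assumptions made only for illustration.

```python
import math

MAX_LR = 3e-4
WARMUP_STEPS = 2000
TOKENS_PER_BATCH = 1_000_000           # "1M tokens per batch", taken as a round number
TOTAL_TOKENS = 1_000_000_000_000       # "~1T tokens", taken as a round number
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_BATCH
MIN_LR = 0.0                           # assumption: the decay floor is not stated


def learning_rate(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


sequences_per_batch = TOKENS_PER_BATCH / 4096   # 4096-token context length
```

Under these round numbers the run is on the order of one million optimizer steps, each covering roughly 244 sequences of 4096 tokens, so the 2000-step warmup is a very small fraction of training.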

3.4.4 Representation extraction and probing (how they measure “representation quality”)

  • The paper uses linear probing and attention probing/pooling (Section 3.5):
  • Average pooling probe: global average pool token embeddings from a chosen layer, then train a linear classifier on top.
  • Attention pooling probe: a learned query token q cross-attends to intermediate tokens using learned projection matrices (Wk, Wv) to compute a weighted sum, producing a single representation vector (Section 3.5).

    • Motivation: in causal (decoder-only) models, later tokens “see” more previous context than earlier tokens, so uniform averaging can overweight tokens with smaller receptive fields (Section 3.5).
  • Layer choice matters: best downstream performance often comes from middle layers (~50% depth) rather than the final layer (Section 4.1; Figure 4; Figure 8). The paper attributes worse performance in last layers to their role in reconstructing/predicting tokens (Section 4.1 “Probing Layer”).
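
The two pooling probes can be contrasted with a small NumPy sketch. It assumes a single attention head and a 1/sqrt(d) score scaling, neither of which is confirmed by the text above; `query`, `w_k`, and `w_v` stand for the learned probe parameters and are random placeholders here.

```python
import numpy as np


def average_pool(tokens):
    """Uniform mean over token embeddings of shape (n_tokens, d)."""
    return tokens.mean(axis=0)


def attention_pool(tokens, query, w_k, w_v):
    """A learned query cross-attends to the tokens and returns a weighted sum of values."""
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights


rng = np.random.default_rng(0)
n_tokens, d = 256, 32
tokens = rng.normal(size=(n_tokens, d))          # stand-in for one layer's hidden states
query = rng.normal(size=d)
w_k = rng.normal(0, 0.2, (d, d))
w_v = rng.normal(0, 0.2, (d, d))

pooled, weights = attention_pool(tokens, query, w_k, w_v)

# With a zero query and identity value projection, every token gets the same
# weight, so attention pooling reduces to average pooling.
uniform_pooled, uniform_weights = attention_pool(tokens, np.zeros(d), w_k, np.eye(d))
```

The last two lines show that average pooling is a special case of attention pooling; the learned query is what lets the probe emphasize tokens that have seen more context.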

3.4.5 Design choices explored (what they vary and why)

  1. Tokenizer type (Section 4.1; Table 3; Figure 3):
    • Discrete tokenizers: dVAE, VQGAN (multiple vocab sizes).
    • Continuous tokens: patch embeddings with normalized patch regression targets.
    • Key nuance: the paper argues VQGAN tokenizers may be indirectly contaminated with ImageNet label information due to perceptual loss using a VGG network (Section 3.4; Section 4.1).

  2. Token resolution strategy (Section 4.1; Table 4):
    • Train more cheaply at low token resolution (16×16 tokens), then fine-tune at higher resolution (32×32 tokens), leveraging RoPE to adapt positional embeddings.

  3. Architecture family (Section 4.1; Table 6):
    • Compare LLaMA-style decoder-only transformer vs GPT-2-style vs Mamba (state-space model).

  4. Pooling/probing method (Section 4.1; Table 5):
    • Average pooling vs attention pooling, showing large gains for attention pooling on ImageNet linear probing.

  5. Scaling behavior (Section 4.9; Figure 9; Appendix A.5; Table 15):
    • Use µ-parameterization to tune learning rates across widths and analyze compute-optimal scaling.
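
The resolution strategy leans on a property of RoPE that a short sketch can make visible: positions enter only through rotations, so attention scores depend on relative offsets and longer sequences need no new parameters. The code below is a generic RoPE illustration, not the paper's implementation; the base of 10,000 is the common default from the original RoPE formulation and is an assumption here, while 50,000 is the variant reported in Table 4.

```python
import numpy as np


def rope_angles(positions, dim, base=10_000.0):
    """Rotation angle for every (position, frequency) pair; shape (n, dim / 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)


def apply_rope(x, positions, base=10_000.0):
    """Rotate consecutive feature pairs of x (shape (n, dim)) by position-dependent angles."""
    ang = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))
k = rng.normal(size=(1, 64))


def score(q_pos, k_pos, base=10_000.0):
    rotated_q = apply_rope(q, np.array([q_pos]), base)
    rotated_k = apply_rope(k, np.array([k_pos]), base)
    return (rotated_q @ rotated_k.T).item()


# Same relative offset (3), very different absolute positions.
score_short = score(5, 2)
score_long = score(3005, 3002)

# A larger base stretches the slowest rotation, i.e. the longest wavelength.
longest_wavelength_10k = 2 * np.pi / rope_angles(np.array([1]), 64, 10_000.0).min()
longest_wavelength_50k = 2 * np.pi / rope_angles(np.array([1]), 64, 50_000.0).min()
```

The two scores match because only the offset between positions matters, and raising the base lengthens the slowest wavelength, which is one plausible reading of why a larger base helps when sequences grow at higher token resolution.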

4. Key Insights and Innovations

  1. Autoregressive pre-training from mixed image+video tokens yields broadly transferable representations
    • The model is trained with a single objective (next-token prediction) and then evaluated across diverse tasks: ImageNet classification, Kinetics action recognition, Ego4D anticipation, DAVIS tracking, CATER object permanence, and robotics (Sections 4.2–4.7).
    • Significance: it suggests that even without strong vision-specific inductive biases, decoder-only autoregressive modeling can be a viable representation learning strategy (Conclusion; also reflected in task results across Tables 7–12).

  2. Attention pooling is materially better than average pooling for decoder-only visual transformers
    • Table 5 shows a large gap on ImageNet probing at the same layer and tokenization setup:
      • Average pooling: 53.2% top-1
      • Attention pooling: 61.1% top-1
        (Table 5)
    • Why it matters: it addresses an issue specific to decoder-only models, namely that the receptive field is skewed because later tokens attend to more context (Section 3.5).

  3. Efficient resolution training strategy: low-res pre-train → high-res fine-tune works well, aided by RoPE
    • Table 4 shows:
      • dVAE/32 (full higher-res training) achieves 61.2% top-1 at compute 5.68×10^17.
      • dVAE/16→32 achieves 63.2% top-1 at compute 2.13×10^17.
      • dVAE/16→32† (RoPE base 50,000) achieves 64.4% top-1 at the same compute.
        (Table 4)
    • Significance: this is a practical recipe that uses about 2.7× less compute than full high-resolution training (5.68×10^17 vs 2.13×10^17) while improving downstream accuracy, and it ties directly to positional embedding adaptability (Section 4.1 “Resolution”).

  4. Intermediate layers are best for semantic tasks; last layers may specialize toward reconstruction
    • Figure 4 shows ImageNet probing peaking around ~50% depth across model sizes, and Figure 8 generalizes this to multiple tasks (ImageNet, Kinetics, DAVIS), with robotics showing a slightly different pattern where later layers can also be strong (Section 4.8; Figure 8).
    • This observation guides how to use decoder-only models as feature extractors (probe mid-layers, not the final layer).

  5. Compute-optimal scaling follows a power law, but with a smaller-magnitude exponent than a referenced language-model scaling curve
    • The paper fits a compute–loss power law for Toto (Section 4.9; Figure 9):
      • L(C) = 7.32 · C^{-0.0378}
    • It compares to a GPT-3 relation L(C) = 2.57 · C^{-0.048} and notes they are not directly comparable, but the smaller-magnitude exponent suggests slower improvement per additional compute for visual next-token prediction (Section 4.9).
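
A short calculation shows what the two exponents imply in practice. The sketch uses only the fitted constants quoted above and works with ratios, since the absolute loss values depend on the compute unit and, as the paper cautions, the two curves are not directly comparable.

```python
TOTO_EXPONENT = -0.0378
GPT3_EXPONENT = -0.048


def toto_loss(compute):
    """Fitted Toto curve L(C) = 7.32 * C^-0.0378."""
    return 7.32 * compute ** TOTO_EXPONENT


def loss_ratio_per_10x(exponent):
    """Factor by which the fitted loss shrinks when compute grows 10x."""
    return 10.0 ** exponent


def compute_multiplier_to_halve_loss(exponent):
    """Compute increase needed for the fitted loss to drop by half."""
    return 2.0 ** (1.0 / -exponent)


toto_ratio = loss_ratio_per_10x(TOTO_EXPONENT)   # about 0.917
gpt3_ratio = loss_ratio_per_10x(GPT3_EXPONENT)   # about 0.895
toto_halving = compute_multiplier_to_halve_loss(TOTO_EXPONENT)
gpt3_halving = compute_multiplier_to_halve_loss(GPT3_EXPONENT)
```

Each 10× increase in compute lowers the fitted Toto loss by roughly 8%, versus roughly 10% for the referenced GPT-3 curve, and the compute needed to halve the loss is correspondingly much larger for Toto.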

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, setup)

  • Pre-training data mixture (Section 3.3; Table 2):
  • Mixed mini-batches are sampled approximately:
    • 20% ImageNet images,
    • 10% Ego4D videos,
    • 10% Kinetics videos,
    • 60% HowTo100M videos (Section 3.3).
  • Total pool ~2.5T tokens; training uses ~1T tokens (Section 3.3).

  • Core probing approach:

  • For ImageNet and Kinetics: probe each layer using attention pooling and select the best-performing layer (“optimal layer”) (Section 4.2; Section 4.3).
  • For tracking (DAVIS): use features directly for nearest-neighbor label propagation without fine-tuning/probing (Section 4.5).

  • Metrics:

  • ImageNet: Top-1 accuracy (Tables 3–7; Tables 13–14).
  • Kinetics-400: Top-1 accuracy (Table 8; Appendix Table 16).
  • Ego4D anticipation: mean average precision for multiple targets (noun, noun+verb, noun+TTC, overall) (Table 9).
  • DAVIS tracking: J, F, and J&F scores (Table 10).
  • CATER object permanence localization: accuracy (%) at temporal resolutions 16 and 32 frames (Table 12).
  • Robotics: success rate and learning curves (Figure 6; Table 11).
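
The mini-batch mixture listed at the start of this subsection can be sketched as weighted sampling over data sources. Whether the paper samples per sequence or per batch is not stated, so per-sequence sampling is an assumption here; the implied number of passes over each dataset further assumes the fractions hold as token shares over a round 1T-token run, using the pool sizes from Table 2.

```python
import numpy as np

MIXTURE = {"ImageNet": 0.20, "Ego4D": 0.10, "Kinetics": 0.10, "HowTo100M": 0.60}
POOL_TOKENS = {"ImageNet": 3.6e9, "Ego4D": 103e9, "Kinetics": 41.3e9, "HowTo100M": 2560e9}
TRAIN_TOKENS = 1e12          # "~1T tokens", taken as a round number


def sample_sources(num_sequences, rng):
    """Draw the data source of each sequence according to the mixture weights."""
    names = list(MIXTURE)
    probs = np.array([MIXTURE[name] for name in names])
    return rng.choice(names, size=num_sequences, p=probs)


rng = np.random.default_rng(0)
sources = sample_sources(100_000, rng)
howto_fraction = float((sources == "HowTo100M").mean())

# Passes over each dataset if its share of the token budget matched its weight.
implied_passes = {
    name: MIXTURE[name] * TRAIN_TOKENS / POOL_TOKENS[name] for name in MIXTURE
}
```

Under these assumptions ImageNet would be revisited dozens of times while less than a quarter of the HowTo100M pool would be seen once, which illustrates how heavily the small image set is upsampled relative to the large video pool.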

5.2 Main quantitative results (with numbers and comparisons)

ImageNet-1k recognition

  • Toto performance scales with model size (Table 7):
  • Toto-base (120M): 64.7% top-1
  • Toto-large (280M): 71.1% top-1
  • Toto-1b (1.1B): 75.3% top-1
    (Table 7)

  • Comparison context in Table 7:

  • Strong discriminative/self-supervised baselines are higher (e.g., DINO 80.1%, DINOv2 86.4%), while generative approaches vary (e.g., MAE 80.9%, AIM 80.6%).
  • Within autoregressive generative baselines, Table 7 includes iGPT numbers (65.2% for iGPT-L, 72.0% for iGPT-XL) but notes iGPT entries are evaluated with linear probing (Table 7 footnote †).

  • Fair linear-probe comparison to iGPT (Appendix A.4; Table 14):

  • Toto-1b: 66.2%
  • iGPT-L (1.386B): 65.2%
    (Table 14)

  • Full fine-tuning (Appendix A.3; Table 13):

  • Toto achieves 82.6% on ImageNet-1K under the paper’s full fine-tuning setting, compared against several methods listed there (Table 13).
  • Caveat: the appendix states fine-tuning uses full attention (Appendix A.3), whereas pre-training uses causal attention (Section 3.2).

Kinetics-400 action recognition

  • Scaling with model size is strong (Table 8):
  • Toto-base: 59.3%
  • Toto-large: 65.3%
  • Toto-1b: 74.4%
    (Table 8)

  • With a higher-capacity attention classifier head variant (Appendix A.7; Table 16):

  • Toto-1b: 74.8%
  • Toto-large: 65.8%
  • Toto-base: 61.2%
    (Table 16)

Ego4D short-term action anticipation

  • Toto-large achieves an overall score of 2.70 (Table 9), with sub-metrics:
  • Noun: 15.20
  • N+V: 6.75
  • N+TTC: 5.41
  • Overall: 2.70
    (Table 9)

DAVIS semi-supervised tracking (label propagation)

  • Table 10 reports J&F improvements with scale and (especially) resolution:
  • Toto-base (256/8): 42.0 J&F
  • Toto-large (256/8): 44.8
  • Toto-1b (256/8): 46.1
  • Toto-large (512/8): 62.4
    (Table 10)
  • The paper notes that at large resolution (512), Toto-large outperforms listed baselines in that table (Table 10).
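
Nearest-neighbor label propagation, the protocol used for these tracking numbers, can be illustrated with a generic sketch. The neighborhood size `k`, the softmax `temperature`, cosine similarity, and the use of a single reference frame are all assumptions chosen for illustration; the text above does not give the paper's exact settings, and the features here are synthetic.

```python
import numpy as np


def propagate_labels(ref_feats, ref_labels, tgt_feats, k=5, temperature=0.1):
    """Give each target token a similarity-weighted mix of its k nearest reference labels.

    ref_feats: (N, D), ref_labels: (N, C) one-hot or soft, tgt_feats: (M, D) -> (M, C)
    """
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = tgt @ ref.T                                        # cosine similarity, (M, N)
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]       # k most similar references
    topk_sim = np.take_along_axis(sim, topk, axis=1)
    weights = np.exp(topk_sim / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights[:, :, None] * ref_labels[topk]).sum(axis=1)


rng = np.random.default_rng(0)
centers = np.eye(16)[:2]                         # two well-separated toy "objects"
ref_ids = np.arange(40) % 2
tgt_ids = (np.arange(30) // 3) % 2
ref_feats = centers[ref_ids] + 0.05 * rng.normal(size=(40, 16))
tgt_feats = centers[tgt_ids] + 0.05 * rng.normal(size=(30, 16))
ref_labels = np.eye(2)[ref_ids]                  # one-hot masks for the reference frame

soft_labels = propagate_labels(ref_feats, ref_labels, tgt_feats)
pred_ids = soft_labels.argmax(axis=1)
```

No parameters are trained in this procedure, so the tracking score reflects only how well the frozen features preserve correspondence between frames.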

Object permanence (CATER localization)

  • Table 12 reports Toto-large outperforming two baselines:
  • Toto-large: 62.8 (16 frames), 72.9 (32 frames) (Table 12).
  • Internal inconsistency to note: the paragraph in Section 4.7 states Toto-large achieves 62.8% and 70.9% for 16 and 32 frames, but Table 12 shows 72.9 for 32 frames. Based on the provided content, the table value (72.9) is the only explicit tabulated number.

Robotics

  • Simulation: Figure 6 shows Toto-base learning faster than MAE-base across multiple tasks (Franka/Kuka pick & cabinet), but exact scalar values are not provided in the visible figure text; only qualitative curve comparisons are described (Section 4.6; Figure 6).
  • Real-world: Table 11 reports success rates with the same number of demonstrations (# Traj = 240):
  • MVP: 75%
  • Toto-base: 63%
    (Table 11)
  • The text says success is reported over 16 trials with object variations (Section 4.6).

5.3 Design-choice experiments (what supports which claims)

  • Tokenizer choice has limited effect on ImageNet linear probing in their setup (Section 4.1; Table 3):
  • Examples from Table 3 (all Toto-large trained 400 epochs on ImageNet-1k):
    • VQGAN-VQGAN 16×16 (16k vocab): 61.3
    • dVAE-dVAE 32×32 (8k vocab): 61.2
    • patch-patch 16×16: 60.6
  • But dVAE-dVAE 16×16 drops to 53.2, indicating resolution/token count matters a lot (Table 3; also discussed in Section 4.1 “Resolution”).

  • Attention pooling is critical for decoder-only probing (Table 5; Section 3.5 rationale).

  • Architecture matters: LLaMA-style > GPT2-style > Mamba (in this ImageNet linear probing study) (Table 6):

  • LLaMA (280M): 53.2
  • GPT2 (280M): 48.5
  • Mamba (~290M): 40.7
    (Table 6)

  • Resolution training strategy with RoPE improves accuracy-per-compute (Table 4; Section 4.1).

5.4 Do experiments convincingly support the paper’s claims?

  • Claim: “competitive performance across all benchmarks” with minimal inductive bias (Abstract/Conclusion).
  • Supported in the sense that Toto is evaluated across many tasks (Sections 4.2–4.7) and achieves:
    • strong scaling trends (Tables 7–8),
    • competitive anticipation results (Table 9),
    • strong tracking at higher resolution (Table 10),
    • strong CATER localization (Table 12),
    • usable robotics performance (Table 11; Figure 6).
  • However, for recognition tasks the paper itself acknowledges discriminative methods often perform better (Sections 4.2–4.3; Tables 7–8).

  • Claim: “scaling curves similar to language models, albeit with a different rate” (Abstract; Section 4.9).

  • Directly supported by the fitted power law in Section 4.9 (Figure 9), with an explicit exponent comparison to a GPT-3 scaling relation (Section 4.9).
  • The paper explicitly cautions they are “not comparable directly” (Section 4.9), which appropriately limits overclaiming.

  • Ablations / robustness

  • The paper includes multiple ablations (tokenizer, resolution, pooling, architecture, probing layer; Section 4.1) and cross-task layer behavior (Figure 8).
  • Some potentially important controls are not fully specified in the provided text (e.g., exact pre-training compute budget in PF-days, hardware, data filtering/deduplication, contamination checks beyond the tokenizer note about VQGAN perceptual loss).

6. Limitations and Trade-offs

Based on the paper’s own limitations section plus implications of the methodology:

  • Internet video quality variance can reduce performance compared to curated data, and introduces diversity/quality challenges (Section 5).
  • Tokenizer dependency (non end-to-end learning):
  • The model’s learning and generation quality are bounded by tokenizer quality; quantization limits fidelity, and the paper calls for a “universal visual tokenizer” (Section 5).
  • The paper also flags that some tokenizers (e.g., VQGAN trained with perceptual loss) may indirectly ingest label information (Section 3.4; Section 4.1), complicating “pure self-supervised” claims if such tokenizers are used.

  • Video redundancy may make next-token prediction “too easy” temporally

  • Appendix A.1 and Figure 10 show the first frame has higher loss while later frames have lower average loss, interpreted as redundancy that can hinder learning efficient representations.

  • Design choices optimized on ImageNet may not be optimal elsewhere

  • The paper’s design-choice conclusions are largely drawn from ImageNet classification experiments (Section 5), and transfer is not guaranteed for dense prediction, fine-grained recognition, or long-horizon temporal reasoning (Section 5).

  • Long-range temporal dynamics are not fully assessed

  • Training uses T = 16 frames with stride 4 (Section 3.4), which constrains temporal coverage per example; the paper notes extended temporal dynamics remain open (Section 5).

  • Missing methodological details (in provided content)

  • The paper provides many key training hyperparameters (Section 3.2) but not others (weight decay value, dropout, augmentation specifics, dedup/contamination checks for the full video mixture, or hardware), limiting strict reproducibility from the excerpt alone.

7. Implications and Future Directions

  • Field-level implication: decoder-only autoregressive modeling is a viable vision/video pre-training paradigm
  • Toto demonstrates that a causal transformer trained purely on next-token prediction can yield representations that work across semantics (ImageNet/Kinetics), temporal reasoning (Ego4D anticipation), correspondence (DAVIS tracking), object permanence (CATER), and embodied control contexts (robotics) (Sections 4.2–4.7).
  • This broad evaluation supports the idea that “language-model-style” training objectives can generalize beyond text, though recognition still trails top discriminative approaches in many settings (Tables 7–8).

  • Scaling implication: compute helps, but gains per compute may be smaller than text

  • The compute-optimal scaling exponent -0.0378 (Section 4.9) suggests progress with scale but slower loss improvement than the referenced GPT-3 curve exponent -0.048 (Section 4.9). This motivates research into objectives or tokenizations that reduce redundancy and increase predictive difficulty (Appendix A.1).

  • Practical applications / downstream use cases suggested by the evaluations

  • Feature backbone for diverse tasks: The probing and fine-tuning protocols show Toto can serve as a general representation provider (Section 3.5; Sections 4.2–4.8).
  • Tracking via correspondence: The direct use of intermediate features for nearest-neighbor label propagation (Section 4.5) indicates usefulness without task training.
  • Robotics perception: The paper reports that even Toto-base runs in real time, and it achieves 63% success in the reported cube-picking setup (Section 4.6; Table 11), suggesting applicability when compute is constrained.

  • Repro/Integration Guidance (when to prefer this method)

  • Prefer Toto-style autoregressive pre-training when:
    • You want a single pre-training objective for both images and videos (Section 3.4).
    • You plan to use a decoder-only transformer and can exploit its generative nature (e.g., robotics tasks where later layers can remain useful; Figure 8).
  • Key integration choices supported by the paper:
    • Use attention pooling, not average pooling, when probing decoder-only representations (Table 5; Section 3.5).
    • Probe middle layers (~half depth) for semantic tasks (Figure 4; Figure 8).
    • Consider low-res pre-train → high-res fine-tune with RoPE adaptation for better accuracy-per-compute (Table 4; Section 4.1).
  • If the goal is peak recognition accuracy alone, the paper’s own comparisons show discriminative approaches can outperform generative autoregressive ones (Tables 7–8), so Toto may be more attractive when multi-task transfer and generative/sequence modeling benefits matter.