Autoregressive Universal Video Segmentation Model¶

ArXiv: 2508.19242

🎯 Pitch¶

This paper introduces AUSM, a single autoregressive architecture that unifies both prompted (user-specified) and unprompted (automatic discovery) video segmentation by treating mask prediction as a sequential, language-model-like process. Its scalable state-space design enables efficient segmentation and tracking of objects in arbitrarily long video streams with constant memory, while supporting parallel training that accelerates learning up to 2.5×. By consolidating the fragmented landscape of task-specific models, AUSM sets a new foundation for practical, general-purpose video perception across interactive and autonomous scenarios.

1. Executive Summary¶

AUSM (Autoregressive Universal Segmentation Model) reframes streaming video segmentation as next-frame mask prediction, analogous to next-token prediction in language models. It introduces a single architecture that handles both prompted (user-specified targets) and unprompted (discover everything) video segmentation, scales to arbitrarily long streams with constant memory via a state-space module, and supports parallel (non-recurrent) training for significant speedups.

2. Context and Motivation¶

Problem addressed
The field is split between two regimes:
- Prompted video segmentation (e.g., VOS): given an initial cue (mask/box/point), track that object over time.
- Unprompted video segmentation (e.g., VIS): detect, segment, and track all instances without external cues.
Real deployments often demand both, but current practice uses separate task-specific systems and training protocols, each with different memory and compute characteristics (Sec. 1).
Why this matters
A unified model can amortize supervision across tasks and datasets and simplify deployment in streaming scenarios (e.g., interactive editing needs prompting; autonomous driving needs unprompted discovery) (Sec. 1).
Video is costly to collect/annotate; shared architectures increase data efficiency (Sec. 1).
Shortcomings of prior approaches
Prompted: Memory-based systems (e.g., STM/XMem/SAM2) run separate per-object pipelines with FIFO memory buffers, which are powerful but heavy and not naturally universal (Sec. 1, Related Work).
Unprompted: Many detect-then-track pipelines process frames independently and associate object identities post hoc, compressing history to a few vectors and discarding fine mask detail needed for detection quality (Sec. 1).
“Universal” attempts often retrofit VOS into a VIS-style vector token; compressing each instance to a token hurts mask quality in VOS (Sec. 1 and Sec. 2.2 “History Marker” motivation).
Training scalability: existing video segmentation frameworks typically train with recurrent, frame-by-frame propagation, which prevents LLM-style parallelism and slows training as sequence length grows (Sec. 1 and Sec. 2.3).
Positioning of this work
Recasts the task as autoregressive mask prediction with the probability factorization in Eq. (2), directly mirroring LLM next-token modeling (Sec. 2.1).
Designs an architecture with (i) a History Marker that preserves fine spatial details by “dissolving” masks into feature maps, and (ii) a History Compressor built on the Mamba state-space model to maintain a single spatial state across unlimited frames (Sec. 2.2).
Makes every module compatible with teacher forcing for fully parallel training across frames, yielding large speedups as sequences lengthen (Sec. 2.3, Fig. 1-Right, Fig. 4).

3. Technical Approach¶

The model casts streaming video segmentation as an autoregressive process where each frame’s segmentation is conditioned on past frames and past segmentations, optionally with an initial prompt (Eq. 2 in Sec. 2.1): - Plain-language view: “Predict masks at time t by looking at the current image, all earlier images, the previously predicted masks, and an optional initial mask prompt.”

Step-by-step (inference, then training):

Inference as a recurrent computation (Algorithm 1; Fig. 1-Left)
Inputs and representations
- Each frame ℐt is encoded by a shared, frame-independent backbone into features Xt (Sec. 2.2).
- The model maintains:
- A set of object queries V (size Ndet) used to discover new objects at every frame.
- A buffer of ID vectors B (size Nid) used to represent tracked instances across time.
- Two growing sets: A (allocated ID vectors) and M (their corresponding masks), kept in one-to-one order (|A| = |M|) (Sec. 2.2).
- Unification of tasks:
- Unprompted: start with A0 = ∅ and M0 = ∅.
- Prompted: sample as many ID vectors as there are prompt instances from B and set M0 to the prompt masks (Algorithm 1, lines 2–9; Sec. 2.2 “Unification of Tasks”).
History injection that preserves spatial detail
- History Marker: transforms the previous frame’s allocated ID vectors and masks (A_{t−1}, M_{t−1}) into a dense feature map S_t (same spatial shape as frame features) by softly “painting” ID vectors inside their masks (equation under “History Marker” in Sec. 2.2).
- Intuition: Instead of compressing each object to a single token, broadcast its ID vector over the pixels it occupied, keeping fine-grained location/shape cues for later retrieval.
- Combine with features from the previous frame: E_t = X_{t−1} + HistoryMarker(A_{t−1}, M_{t−1}) (Algorithm 1, line 11).
Temporal compression to a constant-size state
- History Compressor: applies a stack where a temporal Mamba layer mixes information for each pixel across time, and a spatial self-attention layer fuses information across pixels within each frame (Fig. 2; Sec. 2.2).
- Output is a single spatial state F_t passed forward in time with constant memory (Algorithm 1, line 12).
- Why Mamba: state-space models keep only a fixed-size state rather than a growing cache, making arbitrarily long streams feasible at inference (Sec. 2.2 “History Compressor”).
Fuse compressed history with the current frame
- History Decoder: a Transformer decoder that takes the current frame features X_t as queries and the compressed state F_t as keys/values to produce a spatial feature G_t that encodes both (Algorithm 1, line 13; Sec. 2.2).
Predict tracking and new detections jointly
- Pixel Decoder (Mask2Former-style): uses queries concat(A_{t−1}, V) and keys/values G_t (Algorithm 1, line 14; Sec. 2.2).
- Queries from A_{t−1} produce tracking predictions ŷ_trk_t for known objects.
- Queries from V produce detection predictions ŷ_det_t for new objects.
Update identity sets and masks
- Filter confident foreground detections D = filter_fg(ŷ_det_t) (Algorithm 1, line 15).
- Sample |D| fresh ID vectors from the buffer B_{t-1} and append to A_{t-1} to form A_t; remove those from the buffer (Algorithm 1, lines 16–18).
- Update masks M_t by concatenating masks from tracked objects and newly detected objects (Algorithm 1, line 19).
- This makes prompted and unprompted modes identical except for initialization (Sec. 2.2).
Training in parallel with teacher forcing (Algorithm 2; Fig. 1-Right; Sec. 2.3)
Key idea: Replace recurrent dependence on predicted masks with ground-truth masks in training (“teacher forcing”), so the whole time dimension can be processed in parallel.
Preprocess constructs, per instance i, a random “switch” timestep t_i^sample dividing frames into:
- Before the switch: instance appears as a detection target (y_det,i_t = y^i_t if t ≤ t_i^sample)
- After the switch: instance appears as a tracking target (y_trk,i_t = y^i_t if t > t_i^sample)
- The ground-truth masks enter history from the switch time onward (M^i_t = y^i_t if t ≥ t_i^sample) (Fig. 3; Sec. 2.3).
Using these ground-truth-based A_{t-1} and M_{t-1}, the pipeline computes E_{1:T}, F_{1:T}, G_{1:T}, and outputs for all frames at once (Algorithm 2, lines 4–7).
Losses sum per frame:
- L_trk uses the known mapping from A_{t-1} to tracked instances.
- L_det uses Hungarian matching between detection queries V and the detection ground truth (Algorithm 2, line 8; Sec. 2.3).
Model and training details (App. A)
100 detection queries (Ndet=100), 100 ID vectors (Nid=100), 6-layer History Compressor and 6-layer Transformer decoder, operating at 1/8 resolution with 256 channels (App. A).
Three-stage training: pseudo-video on COCO (3 frames), multi-source short clips (5 frames), then long-clip adaptation (16 frames, freezing backbone) (Sec. 3.2).
Inference for unprompted VIS: fixed foreground threshold 0.5 and top-k selection per frame (App. A). No post-processing for prompted VOS.
Optional test-time compute scaling (Sec. 3.4 “Scaling Inference Compute in AUSM”)
Repeat inputs (image or video) to allow iterative refinement; take the final pass as output.
Quantitatively improves AP on COCO and YTVIS19 as repetitions increase (Table 3).

4. Key Insights and Innovations¶

Autoregressive reframing that unifies prompted and unprompted segmentation (Sec. 2.1; Eq. 2)
Novelty: Explicitly formulates streaming segmentation as P(y_t | y_0, y_<t, I_≤t), paralleling decoder-only language models (Eq. 1 vs. Eq. 2).
Significance: Provides a single, consistent interface across tasks: initialize with y0=∅ for unprompted VIS or y0=m0 for prompted VOS (Algorithm 1, lines 2–9; Sec. 2.2).
This is a conceptual shift that enables shared architecture, shared training, and shared evaluation.
History Marker: mask-to-feature “dissolution” instead of object tokenization (Sec. 2.2)
What’s different: Prior “universal” online models often compressed each instance to a token, losing spatial detail and hurting VOS. History Marker spreads an ID vector over the exact pixels of its mask, preserving shape and locality (equation in “History Marker”).
Why it matters: The paper reports “nearly 10% improvement in VOS performance compared to previous unified online architectures” when using this approach (Sec. 2.2, paragraph under History Marker). This is a fundamental change to how object history is stored.
History Compressor with Mamba for constant-memory long-range temporal reasoning (Fig. 2; Sec. 2.2)
What’s different: Instead of keeping a growing key-value cache or a FIFO memory buffer, the state-space model compresses the entire temporal history into a single state per pixel.
Why it matters: Enables processing arbitrarily long video streams with fixed memory, while still retaining access to temporal context (Sec. 2.2). This tackles a central bottleneck for streaming video models.
Fully parallel training with teacher forcing across frames (Sec. 2.3; Fig. 1-Right; Algorithm 2)
What’s different: Most video segmentation pipelines need iterative training because later frames depend on earlier predictions. AUSM aligns all modules (Marker → Compressor → Decoder → Pixel Decoder) to accept ground truth as history, so the entire sequence can be trained in one pass.
Why it matters: Substantial speedups that grow with sequence length; up to 2.5× faster at 16 frames (Fig. 4). This is a systems-level innovation enabling scalable training for long videos.
Unified update mechanism for detection and tracking with an ID buffer (Algorithm 1)
What’s different: A single Pixel Decoder produces both tracking (A_{t−1} queries) and discovery (V queries), and a buffer of ID vectors provides identities for new objects.
Why it matters: Seamlessly bridges prompted and unprompted modes without separate pipelines or explicit per-object memory buffers (Sec. 2.2 “Pixel Decoder and Update Process”).

5. Experimental Analysis¶

Evaluation setup (Sec. 3.1)
Datasets
- Prompted (VOS): DAVIS 2017, YouTube-VOS 2018 & 2019, MOSE.
- Unprompted (VIS): YouTube-VIS 2019 & 2021, OVIS.
Metrics
- VOS: J&F (average of region similarity J and contour accuracy F) for DAVIS and MOSE; G (average J&F across seen/unseen categories) for YouTube-VOS.
- VIS: AP (Average Precision).
Training data mixture spans COCO (with pseudo-video), MOSE, SA-V, YouTube-VIS 2019/2021, OVIS; class-aware heads used where labels exist (Sec. 3.1; App. A).
Main results (Table 1)
AUSM (Swin-B) across tasks with a single model:
- Prompted VOS: DAVIS 81.6 J&F, MOSE 62.1 J&F, YTVOS18 80.2 G, YTVOS19 79.1 G.
- Unprompted VIS: YTVIS19 62.6 AP, YTVIS21 58.6 AP, OVIS 45.5 AP.
Comparisons
- Against universal streaming baselines:
- AUSM (Swin-B) improves OVIS to 45.5 AP, higher than UniVS Swin-L (41.7 AP) and UNINEXT ConvNeXt-L (41.1 AP) (Table 1).
- On YouTube-VOS 2018, AUSM Swin-B (80.2 G) surpasses UniVS Swin-L (71.5 G) by +8.7 points (Table 1).
- Against specialized SOTA (task-specific):
- SAM2 (with private data) achieves higher VOS (e.g., DAVIS 90.7 J&F) than AUSM’s 81.6, reflecting the trade-off between universality/efficiency and peak performance in specialized, memory-heavy systems (Table 1).
Takeaway
- AUSM is competitive to strong VIS systems and sets the best results among universal streaming models on harder datasets like OVIS, while closing much of the gap on VOS in a single architecture (Sec. 3.3).
Ablations and robustness
Training efficiency vs. sequence length (Fig. 4)
- Quote:
  
  “Whereas the iterative method grows from 1.47s to 8.75s per iteration, our parallel approach increases only from 1.47s to 3.45s at length 16… 2.5× speedup at sequence length 16.” (Sec. 3.4; Fig. 4)
Long-clip adaptation (Fig. 5)
- Moving from 5- to 16-frame training improves: MOSE +4.5 J&F (57.6→62.1), YTVOS18 +1.9 (78.3→80.2), YTVOS19 +3.1 (76.0→79.1), OVIS +5.2 AP (40.3→45.5) (Fig. 5).
- Quote:
  
  “The largest gains are on MOSE (+4.52) and OVIS (+5.2), indicating that longer temporal context improves modeling of complex dynamics.” (Sec. 3.4)
Foreground threshold sensitivity (Table 2)
- Varies threshold for selecting new detections; performance is stable from 0.3–0.7. Best overall is around 0.5 for YouTube-VIS, while OVIS benefits from higher thresholds (up to 0.7) (Table 2). The model uses a fixed 0.5 to avoid dataset-specific tuning (Sec. 3.4).
Test-time compute scaling by repetition (Table 3)
- On YTVIS19 with Swin-B: ×1→×3 repetition improves AP 62.6→63.5. On COCO, AP 34.2→35.0; further spatial traversal augmentation yields 35.9 AP (App. A; Sec. 3.4).
Qualitative evidence
- Figures 6 and 7 show side-by-side prompted and unprompted results from the same model, highlighting the unified behavior (App. B).
Do experiments support claims?
Universality: Table 1 demonstrates one model, shared weights, across VOS and VIS benchmarks.
Scalability and efficiency: Fig. 4 quantifies parallel training advantages; Sec. 2.2 explains constant-memory inference via Mamba.
Fine-grained temporal detail: The History Marker design choice (Sec. 2.2) is motivated by mask quality; the strong VOS scores relative to prior universal online models (Table 1) and the reported near-10% VOS improvement over such designs support this.
Long-context handling: Gains from long-clip adaptation (Fig. 5) and constant-memory architecture indicate robustness to longer horizons, though the Discussion notes remaining challenges at extreme lengths.

6. Limitations and Trade-offs¶

Detail vs. efficiency
The model largely operates on 1/8-resolution features to save memory; this can miss fine details needed for peak VOS performance (Discussion “Limitations”). Specialized models with finer features and per-object memory (e.g., SAM2) still outperform on VOS (Table 1).
Training horizon vs. inference horizon
Although the architecture can process arbitrarily long streams, training uses up to 16-frame clips (Sec. 3.2), and the Discussion acknowledges “performance degradation on extremely long sequences,” suggesting room for long-context techniques.
Identity capacity and selection
The number of ID vectors is fixed (Nid=100), and VIS inference limits the number of retained instances per frame (top-k) (App. A). Very crowded scenes or very long videos with many distinct instances could challenge capacity or require tuning these hyperparameters.
No vision-language pretraining
The implementation avoids CLIP or similar supervision; it instead uses dataset-specific classification heads (App. A). This preserves purity of the segmentation formulation but may limit category generalization compared to vision-language pretraining.
Teacher forcing mismatch
Parallel training relies on ground-truth masks in history (Sec. 2.3), while inference uses model predictions. Although standard, this introduces exposure bias that might affect robustness in long sequences or heavy occlusion.

7. Implications and Future Directions¶

Impact on the field
Establishes a principled LLM-style framing for streaming video segmentation, enabling unified architectures that cover both prompted and unprompted modes. This reduces the need for separate pipelines and opens the door to scalable, general-purpose video perception systems (Sec. 6 Conclusion).
Research directions
Long-sequence methods from LLMs (e.g., retrieval for long contexts, length extrapolation) adapted to video to stabilize extremely long horizons (Discussion).
Video-specialized backbones emphasizing temporal modeling and prompt conditioning to close the VOS gap without sacrificing VIS performance (Discussion “Limitations”).
Extending the unified interface to more tasks:
- Multi-object tracking by converting boxes/points/scribbles to mask prompts.
- Referring VOS by initializing the history state with text embeddings (Discussion).
Practical applications
Interactive editing and content creation (prompted mode), autonomous perception (unprompted), AR/VR object understanding, long-duration surveillance/robotics where constant-memory streaming and parallel training are decisive.
The test-time compute scaling trick (Sec. 3.4; Table 3, App. A Eq. 3) offers an easy accuracy knob—repeat frames to refine results—useful when extra latency is acceptable.

Overall, AUSM delivers a compelling synthesis: a language-model-inspired, streamable, and trainable-at-scale framework that meaningfully advances universal video segmentation. The architecture’s History Marker and History Compressor are the critical mechanisms that preserve spatial detail and compress temporal context, respectively, while the training pipeline achieves the efficiency gains needed to make long-sequence learning practical.