
REVISITING MULTIMODAL POSITIONAL ENCODING IN VISION–LANGUAGE MODELS

ArXiv: 2510.23095

🎯 Pitch

This paper gives the first systematic analysis of how rotary positional embeddings (RoPE) should be extended from text-only LLMs to mixed 1D/2D/3D vision–language inputs, decomposing multimodal RoPE into position design and frequency allocation and deriving three practical design guidelines. Guided by these insights it proposes two plug-and-play solutions (Multi-Head RoPE and MRoPE-Interleave) plus a spatial-reset tweak that preserve text RoPE compatibility and full frequency use, yielding consistent, measurable gains across image, video, and grounding benchmarks—improving spatial reasoning and long-range multimodal modeling in VLMs.


1. Executive Summary

This paper systematically analyzes how Rotary Positional Embedding (RoPE) should be extended from text-only Large Language Models (LLMs) to Vision–Language Models (VLMs) that mix 1D text with 2D/3D visual tokens. It decomposes “multimodal RoPE” into (i) position design (how to assign coordinates/IDs to text and visual tokens) and (ii) frequency allocation (how RoPE frequencies are assigned across axes), distilling three guidelines—positional coherence, full frequency utilization, and preservation of textual priors. Guided by these, it proposes two plug-and-play variants, Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), plus a spatial-reset tweak, and shows consistent improvements across image, video, and grounding benchmarks (Tables 2–4).

2. Context and Motivation

  • What specific problem/gap is addressed?
  • VLMs need positional encoding that can represent:
    • 1D order for text, and
    • 2D (image) / 3D (video: time-height-width) geometry for visual tokens.
  • Despite multimodal RoPE being widely used, the paper argues there has been little systematic investigation into which design choices matter and why (Abstract, Section 1).

  • Why is this important?

  • Self-attention is permutation-invariant, so without positional encoding, token order/structure is ambiguous (Section 1).
  • Flattening visual tokens into 1D can harm tasks requiring spatial grounding and geometry-aware reasoning (Section 1; Figure 1a discussion; Table 2 ChartQA / RefCOCO patterns).

  • What prior approaches existed, and where do they fall short?

  • 1D sequential designs (vanilla RoPE, V2PE):
    • Flatten everything into one sequence, losing native visual structure (Section 2.2.1; Figure 1a).
    • Can create very large position IDs for long sequences, hurting extrapolation (Section 2.2.1).
    • V2PE mitigates growth by scaling visual position steps (Section 2.2.1), but still ignores 3D structure.
  • Multi-dimensional designs (MRoPE, VideoRoPE/HoPE, CircleRoPE, IL-RoPE/Omni-RoPE):

    • Preserve some 3D structure, but each introduces at least one of the following issues:
      • positional ambiguity / overlap causing “modalities confusion in generation” (VideoRoPE/HoPE diagonal layout; Section 2.2.2 and Table 3),
      • a poorly chosen modality interval (CircleRoPE; Section 2.2.2),
      • frequency spectrum “chunking” that harms long-range temporal reasoning and/or multi-scale spatial reasoning (Section 2.3.2; Figures 3–4),
      • broken compatibility with the base LLM’s text RoPE (IL-RoPE/Omni-RoPE; Section 2.4; Figure 1e).
  • How does this paper position itself?

  • It aims for a holistic positional encoding strategy that supports unified image + video understanding and fine-grained grounding, rather than specialized schemes optimized for only one domain (Section 1).
  • It explicitly evaluates and compares methods along three axes: position design, frequency allocation, and compatibility with text-only RoPE (Table 1; Section 2).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a multimodal positional encoding scheme (RoPE variants) that can be dropped into a transformer-based VLM without changing the model architecture.
  • It solves the problem of giving the model unambiguous, geometry-aware positions for mixed text/image/video tokens while maintaining good long-context behavior and transfer from a pretrained text LLM (Sections 1–2).

3.2 Big-picture architecture (diagram in words)

  • Inputs: interleaved sequences of text segments and visual token blocks (images and/or video frames) (Figure 1 caption).
  • Position design module: assigns each token a position identifier:
    • text: 1D positions (like the base LLM),
    • visual: 3D tuple (t, h, w) with spatial-reset (Sections 2.2–2.5).
  • Frequency allocation module: decides how RoPE’s frequency spectrum is distributed across (t, h, w) (Section 2.3; Figures 3–4).
  • RoPE application: rotates query/key vectors using those positions so attention depends on relative positions (Section 2.1; Eq. (1)).
  • Output: attention scores and downstream VLM behavior that better supports spatial reasoning, grounding, and long video (Tables 2–7).

3.3 Roadmap for the deep dive

  • Explain vanilla RoPE and why it works for text (Eq. (1), frequency spectrum).
  • Define the two design levers for multimodal RoPE:
    • position design (how positions/coordinates are assigned),
    • frequency allocation (how frequency channels/heads map to axes).
  • Show failure modes in existing multimodal extensions (Figures 1–2; Sections 2.2–2.4).
  • Describe the proposed fixes:
    • spatial-reset,
    • MHRoPE (multi-head allocation),
    • MRoPE-I (interleaved allocation) (Sections 2.3.3–2.5).
  • Connect these choices to empirical outcomes (Tables 2–4, plus appendices Tables 5–7 and Figure 5).

3.4 Detailed, sentence-based technical breakdown

  • Framing: This is an empirical + algorithmic design paper that introduces two RoPE variants and a position-assignment tweak, motivated by an analytical decomposition of multimodal RoPE into position design and frequency allocation (Abstract; Sections 2–3).

3.4.1 Vanilla RoPE recap (what it computes and why relative position emerges)

  • RoPE encodes position by rotating query and key vectors rather than adding embeddings (Section 2.1).
  • Given a query q at position m and a key k at position n, attention uses:
  • S = (R_m q)^T (R_n k) = q^T R_{n-m} k (Eq. (1)).
  • The key property is that the score depends on relative position n − m, because R_m^T R_n = R_{n-m} (Eq. (1)).
  • Frequencies are fixed as a geometric progression:
  • θ_i = base^{-2i/d} for i ∈ [0, d/2 − 1] (Section 2.1),
  • creating a spectrum from high frequency (small i) to low frequency (large i).
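
The relative-position property in Eq. (1) can be checked numerically. The following is a minimal pure-Python sketch, not the paper's code: the function names, the adjacent-pair channel layout, and the toy vectors are assumptions made for illustration, and the base of 1,000,000 simply mirrors the rotary base reported for the experiments.

```python
import math

def rope_frequencies(dim, base=1_000_000.0):
    """theta_i = base^(-2i/d) for i in [0, d/2 - 1]."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotate(vec, pos, thetas):
    """Rotate each pair (x_{2i}, x_{2i+1}) by the angle pos * theta_i."""
    out = []
    for i, theta in enumerate(thetas):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, m, n, thetas):
    """Attention logit (R_m q)^T (R_n k)."""
    rq, rk = rotate(q, m, thetas), rotate(k, n, thetas)
    return sum(a * b for a, b in zip(rq, rk))

thetas = rope_frequencies(dim=8)
q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4]  # toy query (assumption)
k = [1.0, 0.2, -0.6, 0.8, 0.4, -0.3, 0.5, 0.7]   # toy key (assumption)

# Same offset n - m = 7 at two different absolute positions.
s_a = score(q, k, m=3, n=10, thetas=thetas)
s_b = score(q, k, m=103, n=110, thetas=thetas)
print(abs(s_a - s_b) < 1e-9)
```

Both scores agree up to floating-point error because only the offset n − m enters the product. Real implementations often pair channel i with channel i + d/2 instead of adjacent channels; the property holds either way.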

3.4.2 Multimodal RoPE: two design degrees of freedom

The paper treats multimodal RoPE as two coupled choices (Section 2):

  1. Position design: What is a token’s “position” when you have text + image/video tokens?
  2. Frequency allocation: Once position is multi-axis (e.g., (t,h,w)), which RoPE frequencies encode which axis?

A third constraint is emphasized:

  3. Compatibility with text-only RoPE: Keep text encoding identical to the pretrained LLM’s RoPE so transfer learning works well (Section 2.4).

3.4.3 Position design: what can go wrong and what “positional coherence” means

  • 1D sequential design (vanilla RoPE / V2PE):
  • All tokens share a single increasing position ID; visual tokens are flattened so they lose 2D/3D structure (Section 2.2.1; Figure 1a).
  • Very long multimodal sequences can create very large position indices, which can hurt extrapolation (Section 2.2.1; Appendix D.4 notes a sharp drop for vanilla RoPE at long contexts).
  • V2PE reduces growth by setting a modality-specific step size for visuals (s_visual ∈ {1, 1/2, …, 1/256}), but still does not encode 3D geometry (Section 2.2.1).
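
To make the position-growth argument concrete, here is a small sketch of 1D position assignment with a modality-specific step. It is an illustration under assumptions, not V2PE's actual implementation: the function name `sequential_positions`, the token counts, and the choice of step 1/16 (one of the values in the set quoted above) are hypothetical.

```python
def sequential_positions(modalities, visual_step=1.0):
    """1D position IDs: text advances by 1, visual tokens by visual_step."""
    positions, current = [], 0.0
    for kind in modalities:
        positions.append(current)
        current += 1.0 if kind == "text" else visual_step
    return positions

# 4 text tokens, a 1024-token visual block, then 4 more text tokens.
tokens = ["text"] * 4 + ["visual"] * 1024 + ["text"] * 4

vanilla = sequential_positions(tokens, visual_step=1.0)
scaled = sequential_positions(tokens, visual_step=1 / 16)

print(vanilla[-1])  # 1031.0
print(scaled[-1])   # 71.0
```

The scaled variant keeps the final text position far smaller, but every visual token still sits on a single axis, so no row, column, or frame structure is represented.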

  • Multi-dimensional design (MRoPE):

  • Visual tokens get a 3D coordinate like (t, h, w); text and modalities are laid out to avoid overlaps by “jumping” temporal position past the previous block:
    • m_next^t = max(m_prev^t, m_prev^h, m_prev^w) + 1 (Eq. (2); Section 2.2.2).
  • This preserves 3D structure and avoids overlaps, but the paper observes a visual “attention sink”: attention concentrates near the top-left corner (t,0,0) of each visual block (Section 2.2.3; Figure 2).
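
The layout rule in Eq. (2) can be sketched as follows. This is a simplified illustration, assuming text tokens share one ID across all three axes and image tokens are offset from the block start (the m = (t, t+h, t+w) form); the function name and the segment encoding are hypothetical, and empty segments are not handled.

```python
def mrope_positions(segments):
    """Assign (t, h, w) to every token of an interleaved sequence.

    segments: list of ("text", length) or ("image", rows, cols).
    """
    positions, start = [], 0
    for seg in segments:
        if seg[0] == "text":
            # Text: identical ID on all three axes, increasing by 1.
            block = [(start + i,) * 3 for i in range(seg[1])]
        else:
            _, rows, cols = seg
            # Image: t fixed at the block start, h/w offset from it.
            block = [(start, start + r, start + c)
                     for r in range(rows) for c in range(cols)]
        positions.extend(block)
        # Eq. (2): the next block starts past the largest coordinate used.
        start = max(max(p) for p in block) + 1
    return positions

pos = mrope_positions([("text", 2), ("image", 2, 3), ("text", 2)])
for p in pos:
    print(p)
```

In this toy run the 2×3 image occupies coordinates up to 4, so the following text resumes at position 5 on all axes and no visual coordinate collides with a text ID.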

  • Diagonal layout (VideoRoPE / HoPE style):

  • Shifts frames along spatial axes too (Figure 1c), intended for “inter-modal symmetry” (Section 2.2.2).
  • The paper identifies a critical flaw: for high-resolution visuals (e.g., documents), spatial coordinates can extend into the range used by generated text tokens, creating position overlap and “modalities confusion in generation” (Section 2.2.2).
  • In ablations, this manifests as repetitive nonsensical output like "1111..." even if applied only at inference (Section 3.3.1; Table 3).

  • Circle layout (CircleRoPE):

  • Places image tokens on a “ring” orthogonal to text (Figure 1d; Section 2.2.2).
  • The paper argues it can introduce:

    • a large modality interval that may hinder cross-modal interaction,
    • temporal ambiguity for video because it collapses frames onto one ring (Section 2.2.2).
  • Positional coherence (paper’s guideline):

  • A robust position design should (Section 2.2.3):
    1. preserve 3D visual structure,
    2. keep slow growth of position IDs,
    3. avoid text/visual confusion during generation,
    4. define an appropriate modality interval (gap/layout between modalities).

3.4.4 Spatial-reset: the paper’s position-design fix

  • Definition (paper-specific): spatial-reset resets the spatial coordinates (h,w) for each separate visual content block, instead of letting them be offset by time or by prior blocks (Section 2.2.3; Figure 1f discussion).
  • It is motivated by two mechanisms described in the paper:
  • Aligning attention sinks: Since LLMs often attend more to small position IDs, resetting spatial positions makes the visual block’s salient “small IDs” align with that bias, improving visual adaptation (Section 2.2.3; Figure 2).
  • Disentangling motion (video): Without spatial-reset, time and space get coupled in the absolute position definition, which entangles relative indices (Section 2.2.3).
    • The paper illustrates the entanglement:
    • Standard MRoPE positions:
      • m1 = (t1, t1+h1, t1+w1), m2 = (t2, t2+h2, t2+w2),
      • so relative position becomes (Eq. (3)):
      • m_rel = (t2−t1, (t2−t1)+(h2−h1), (t2−t1)+(w2−w1)).
    • With spatial-reset, positions become:
    • m1 = (t1, h1, w1), m2 = (t2, h2, w2),
    • yielding a clean spatio-temporal delta (Eq. (4)):
      • m_rel = (t2−t1, h2−h1, w2−w1).
  • Micro-example (single walk-through using the paper’s equations):
  • Suppose an object moves right by 2 pixels between frames: at time t1=5 it is at (h1=10,w1=10), and at time t2=6 it is at (h2=10,w2=12).
  • Under Eq. (3) entanglement, the horizontal relative component becomes (t2−t1)+(w2−w1) = 1+2 = 3, mixing time and space.
  • Under spatial-reset (Eq. (4)), the horizontal delta is w2−w1 = 2, cleanly representing “moved right by 2” independent of time-step size.
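
The micro-example can be reproduced in a few lines. This is only an arithmetic check of Eqs. (3) and (4) as stated above; the helper names are hypothetical.

```python
def relative(m1, m2):
    """Component-wise difference m2 - m1."""
    return tuple(b - a for a, b in zip(m1, m2))

def entangled(t, h, w):
    """Form used in Eq. (3): spatial axes are offset by t."""
    return (t, t + h, t + w)

def spatial_reset(t, h, w):
    """Form used in Eq. (4): spatial axes are independent of t."""
    return (t, h, w)

p1 = (5, 10, 10)  # (t1, h1, w1)
p2 = (6, 10, 12)  # (t2, h2, w2)

rel_entangled = relative(entangled(*p1), entangled(*p2))
rel_reset = relative(spatial_reset(*p1), spatial_reset(*p2))

print(rel_entangled)  # (1, 1, 3)
print(rel_reset)      # (1, 0, 2)
```

Note that the entangled form also reports a vertical offset of 1 even though the object did not move vertically, whereas spatial-reset reports 0.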

3.4.5 Frequency allocation: why “chunking” harms and what “full frequency utilization” means

  • What frequency allocation controls: how feature dimensions (and their associated θ_i) are assigned to axes t, h, w (Section 2.3).
  • Problem with standard MRoPE “chunked” allocation:
  • It splits the d channels into three contiguous blocks for t/h/w (Section 2.3.2; Figure 3).
  • Because frequencies decay with channel index, the paper argues this forces:
    • the temporal axis t into high-frequency channels only → faster long-range decay over time,
    • spatial axes into different frequency ranges → asymmetric decay between h and w (Section 2.3.2; Figure 4a).
  • The paper links frequency allocation to an attention-decay analysis:
  • It describes an upper bound whose position-dependent term can be approximated by a sum like ∑ |S_{i+1}| (Section 2.3.1; Appendix D.2 provides derivation and Eq. (6)).
  • The practical takeaway used in the paper is that frequency allocation affects how quickly attention decays with distance along each axis (Figures 4a–4b).

  • Guideline: full frequency utilization

  • Each positional axis should have access to the full frequency spectrum (high → low) to support:
    • fine-grained spatial relationships (needs high-frequency components),
    • long-range temporal dependencies (needs low-frequency components),
    • symmetric multi-scale modeling across axes (Section 2.3.2–2.3.3).
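
A rough way to explore the decay argument is to evaluate the position-dependent term for different frequency subsets. The sketch below assumes that S_j is the partial sum of complex rotations exp(i · distance · θ_k) over the first j channels, as in the original RoPE long-term-decay analysis; the paper's exact Eq. (6) may differ. The function names, the head dimension of 128, and the two bands are illustrative assumptions.

```python
import cmath

def rope_frequencies(dim, base=1_000_000.0):
    """theta_i = base^(-2i/d) for i in [0, d/2 - 1]."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def decay_term(distance, thetas):
    """Mean of |S_j|, with S_j = sum_{k<j} exp(i * distance * theta_k)."""
    partial, total = 0j, 0.0
    for theta in thetas:
        partial += cmath.exp(1j * distance * theta)
        total += abs(partial)
    return total / len(thetas)

thetas = rope_frequencies(dim=128)
high_band = thetas[:16]  # chunk of the 16 highest-frequency channels
full_band = thetas[::4]  # 16 channels spread from high to low frequency

# Print the term for a few distances to compare the two bands.
for distance in (0, 16, 256, 4096):
    print(distance,
          round(decay_term(distance, high_band), 3),
          round(decay_term(distance, full_band), 3))
```

At distance 0 both bands give the same maximum value, since every rotation is the identity; how quickly the term falls off at larger distances depends on which frequencies the axis was given, which is the point of the full-frequency guideline.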

3.4.6 The proposed methods: MHRoPE and MRoPE-I

Both methods share:

  • Position design: MRoPE-style 3D structure plus spatial-reset (Section 2.5; Figure 1f).
  • Text compatibility: keep text tokens’ RoPE identical to the base LLM (Sections 2.4–2.5).

They differ in frequency allocation:

  1. Multi-Head RoPE (MHRoPE) = multi-head allocation
    • Instead of splitting channels within a head, it dedicates different attention heads to different axes (Section 2.3.3; Figure 3).
    • Claimed advantages in the paper:
      • each axis uses the full within-head frequency spectrum (avoids per-axis “coarsening” from channel splitting),
      • may scale better as the number of axes grows (Section 2.3.3).
    • Implementation note given: for Grouped-Query Attention (GQA), the partition is on KV heads and repeated for corresponding query heads (footnote in Section 2.4).

  2. MRoPE-Interleave (MRoPE-I) = interleaved allocation
    • Distributes channels to axes in a fine-grained round-robin manner (Section 2.3.3; Figure 3).
    • Ensures each axis sees high-to-low frequencies, improving multi-scale modeling (Section 2.3.3).
    • The paper notes compatibility with extrapolation schemes that rescale frequency spectra (Appendix D.3), precisely because each axis has a full spectrum (Figure 4b discussion).
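
The difference between chunked and interleaved allocation can be seen by listing which frequency indices each axis receives. This is a sketch under assumptions: 64 frequency indices per head, a t:h:w split of 24:20:20 (the balanced ratio reported in the interleave-ratio ablation), and a simple quota-based round-robin. The paper's exact interleaving order is not given in this summary, and all function names are hypothetical.

```python
def chunked(sections):
    """Contiguous blocks: first sections[0] indices to t, then h, then w."""
    axes = []
    for axis, count in zip("thw", sections):
        axes.extend([axis] * count)
    return axes

def interleaved(sections):
    """Round-robin t, h, w, ... until every axis quota is used up."""
    remaining = dict(zip("thw", sections))
    axes = []
    while any(remaining.values()):
        for axis in "thw":
            if remaining[axis] > 0:
                axes.append(axis)
                remaining[axis] -= 1
    return axes

def index_range(axes, axis):
    """Lowest and highest frequency index assigned to one axis."""
    idx = [i for i, a in enumerate(axes) if a == axis]
    return min(idx), max(idx)

sections = (24, 20, 20)  # t:h:w quota over 64 frequency indices
chunk = chunked(sections)
inter = interleaved(sections)

for axis in "thw":
    print(axis, index_range(chunk, axis), index_range(inter, axis))
# t (0, 23) (0, 63)
# h (24, 43) (1, 58)
# w (44, 63) (2, 59)
```

Low indices are the high frequencies, so under chunking t never sees a low-frequency channel, while under interleaving each axis spans almost the whole range. A head-level split in the spirit of MHRoPE would apply the same kind of partition to attention heads instead of channels, leaving every frequency index inside a head to a single axis.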

3.4.7 System/data pipeline “diagram in words” (what happens first → second → third)

Within the VLM forward pass implied by Sections 2–3:

  1. First, tokenize/encode inputs into a single interleaved sequence containing text tokens and visual tokens (images as 2D grids; videos as stacked frames), as illustrated by the example sequence in Figure 1’s caption.
  2. Second, assign positions:
    • text tokens get standard 1D increasing positions (kept compatible with the base LLM; Section 2.4),
    • visual tokens get 3D positions (t,h,w) with spatial-reset so each visual block starts with small spatial coordinates (Section 2.2.3; Eq. (4) motivation).
  3. Third, map axes to RoPE frequencies:
    • MHRoPE: choose per-head axis responsibility so each axis is encoded with full-spectrum RoPE inside its heads (Section 2.3.3),
    • MRoPE-I: interleave channels so each axis receives high-to-low frequencies across the embedding (Section 2.3.3).
  4. Fourth, apply RoPE rotations to queries/keys using the positions, so attention depends on relative offsets (Section 2.1; Eq. (1)).
  5. Finally, compute attention and decode outputs, where improved positional encoding should yield better spatial grounding, better long-range video modeling, and fewer generation pathologies (Tables 2–4; Section 3.3.1).
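
Steps two to four can be sketched together for a single vector. The code below is an illustrative toy, assuming a plain i mod 3 round-robin from frequency index to axis and adjacent-pair rotation; every name in it is hypothetical. It also demonstrates why text stays compatible with the base LLM: when a token carries the same ID on all three axes, the multi-axis rotation is identical to vanilla 1D RoPE.

```python
import math

def rope_frequencies(dim, base=1_000_000.0):
    """theta_i = base^(-2i/d) for i in [0, d/2 - 1]."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def interleaved_axes(num_freqs):
    """Hypothetical round-robin map: frequency index -> axis (0=t, 1=h, 2=w)."""
    return [i % 3 for i in range(num_freqs)]

def rotate_multi_axis(vec, position, thetas, axes):
    """Rotate pair i by the angle position[axes[i]] * theta_i."""
    out = []
    for i, theta in enumerate(thetas):
        angle = position[axes[i]] * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rotate_1d(vec, pos, thetas):
    """Vanilla RoPE: a single position drives every frequency."""
    return rotate_multi_axis(vec, (pos,), thetas, [0] * len(thetas))

dim = 12
thetas = rope_frequencies(dim)
axes = interleaved_axes(len(thetas))
vec = [0.1 * (i + 1) for i in range(dim)]  # toy query/key vector

text_rot = rotate_multi_axis(vec, (7, 7, 7), thetas, axes)    # text token
vanilla_rot = rotate_1d(vec, 7, thetas)                       # base-LLM RoPE
visual_rot = rotate_multi_axis(vec, (7, 0, 3), thetas, axes)  # visual token

print(text_rot == vanilla_rot)    # True
print(visual_rot == vanilla_rot)  # False
```

The same check holds for any axis map, which is why the frequency-allocation choice only changes how visual tokens are encoded.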

3.4.8 Core configurations / hyperparameters (as provided)

From Section 3.1 (and appendices where relevant):

  • Backbone / components:
  • Vision encoder: QwenViT (from Qwen2.5VL) (Section 3.1).
  • Connector: from Qwen2.5VL (Section 3.1).
  • LLM backbone initialization: Qwen2.5 7B (Section 3.1).

  • Trainable vs frozen:

  • Freeze ViT (visual encoder).
  • Unfreeze connector and LLM backbone (Section 3.1).
  • The paper states this is to isolate RoPE effects while following typical adaptation patterns (Section 3.1).

  • Optimization/training settings (reported):

  • Batch size: 128 (Section 3.1).
  • Learning rate: 1 × 10^-5 with cosine decay (Section 3.1).
  • Training compute per experiment: ~512 Nvidia A100 GPU hours (Section 3.1).
  • Training context length: 32K tokens (Section 3.1).
  • Rotary base: 1,000,000 (Section 3.1).

  • Not specified in the provided content (cannot be filled without guessing):

  • Optimizer type (e.g., AdamW) and its parameters (betas, weight decay, eps).
  • Number of training steps/epochs.
  • Tokenizer details.
  • Exact transformer hyperparameters: number of layers, hidden size, attention heads, MLP size.
  • Visual tokenization specifics (patch size, tokens per image/frame) and any image/video resolution policies.

  • Data scale (reported):

  • Supervised fine-tuning samples: ~2M high-quality SFT samples (Section 3.1).
  • Data task coverage includes captioning, OCR, reasoning, grounding, document understanding, long video understanding (Section 3.1).

4. Key Insights and Innovations

  • (1) A clean decomposition of multimodal RoPE into position design + frequency allocation + text compatibility
  • Novelty is not a brand-new architecture, but a systematic lens and evaluation grid (Table 1; Section 2).
  • Significance: it explains why the field has “fragmented” into specialized solutions—because these axes trade off against each other (Section 1; Sections 2.2–2.4).

  • (2) Spatial-reset as a simple position-design tweak with broad impact

  • It addresses an observed “visual attention sink” (Figure 2) and produces measurable gains in ablations:
    • Adding 3D + spatial-reset improves Image/Video/Grounding over vanilla and over “3D only” (Table 3).
  • It also yields a conceptual improvement for video: disentangling temporal vs spatial relative offsets (Eq. (3) → Eq. (4)).

  • (3) Full frequency utilization as a guiding principle, and two plug-and-play mechanisms to achieve it

  • The paper argues chunked allocation restricts axes to partial frequency bands, harming long-range temporal and multi-scale spatial modeling (Section 2.3.2; Figures 3–4a).
  • It proposes two ways to give each axis a full spectrum:
    • MHRoPE: allocate by heads (Section 2.3.3),
    • MRoPE-I: interleave channels (Section 2.3.3).
  • Ablations show both outperform “partial spectrum” schemes (Table 4).

  • (4) Preservation of text RoPE as essential for transfer

  • The paper tests deviations (e.g., resetting text spatial dimensions; scaling rotary base for spatial axes) and finds they degrade performance (Section 3.3.1; Table 3; Section 2.4).
  • This is a practical insight: multimodal improvements should not come at the cost of breaking pretrained LLM positional priors.

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, baselines, setup)

  • Training setup:
  • Same architecture, same training data (~2M SFT), same hyperparameters across comparisons; only RoPE variant changes (Section 3.1).
  • ViT frozen; connector + LLM unfrozen (Section 3.1).
  • Context length 32K; rotary base 1e6 (Section 3.1).

  • Benchmarks (20+ total) grouped as:

  • Image: MMMU, MMBench, MMStar, OCRBench, AI2D, RealWorldQA, DocVQA, TextVQA, InfoVQA, ChartQA (Section 3.1).
  • Video: MVBench, STAR, VideoMME, LVBench, MLVU, Charades-STA (Section 3.1).
  • Grounding: RefCOCO series (Section 3.1).

  • Baselines compared (Table 2):

  • Vanilla RoPE, MRoPE, VideoRoPE, HoPE, CircleRoPE, plus proposed MHRoPE and MRoPE-I.

  • Metrics:

  • The paper reports benchmark scores as percentages/points per dataset in Table 2; it does not further specify metric definitions in the provided content, so they are treated here as each benchmark’s standard scalar score.

5.2 Main quantitative results (with specific numbers)

From Table 2 (selected highlights, all numbers are as reported):

  • Image benchmarks:
  • MMMU: vanilla 50.56 → MRoPE-I 53.22 (+2.66) and MHRoPE 53.00.
  • ChartQA: vanilla 56.84 → MRoPE-I 62.12 (+5.28) and MHRoPE 62.44.
  • DocVQA: vanilla 82.94 → MRoPE-I 83.72 (+0.78).
  • Some datasets show small or negative deltas for MRoPE-I vs vanilla (e.g., AI2D -0.84, RealWorldQA -0.92, InfoVQA -0.62), indicating gains are not universal across every benchmark.

  • Video benchmarks:

  • LVBench: vanilla 38.93 → MRoPE-I 40.54 (+1.61).
  • Charades-STA: vanilla 32.49 → MRoPE-I 34.36 (+1.87).
  • VideoMME: vanilla 58.63 → MRoPE-I 58.96 (+0.33).
  • MVBench: vanilla 57.05 → MRoPE-I 57.05 (+0.00).

  • Grounding (RefCOCO family):

  • RefCOCO val: vanilla 77.67 → MRoPE-I 80.94 (+3.27).
  • RefCOCO testA: vanilla 81.37 → MRoPE-I 84.55 (+3.18).
  • RefCOCO testB: vanilla 72.66 → MRoPE-I 75.05 (+2.39).
  • Similar gains appear across RefCOCO+, RefCOCOg splits (Table 2).

  • Aggregated category averages (Table 2 “Overall” rows):

  • Image avg: vanilla 65.69 → MRoPE-I 66.65 (+0.96).
  • Video avg: vanilla 51.64 → MRoPE-I 52.36 (+0.72).
  • Grounding avg: vanilla 73.48 → MRoPE-I 75.85 (+2.37).

5.3 Do experiments support the claims?

  • Support for “spatial-reset helps”: Strongly supported by Table 3 and Appendix Table 5.
  • Table 3 shows a monotonic improvement from:
    • vanilla → +3D structure → +3D + spatial-reset across Image/Grounding/Video and key doc/chart tasks.
  • Appendix Table 5 shows higher average attention mass on visual tokens with spatial-reset in deeper layers (e.g., for MRoPE-I, Layer 28: 23.23 vs 11.69 without spatial-reset).

  • Support for “full frequency utilization helps”: Supported by Table 4.

  • With position design fixed (MRoPE + spatial-reset), frequency allocations compare as:

    • VideoRoPE-like Overall 63.31,
    • IL-RoPE-like Overall 63.07,
    • Multi-Head Overall 64.63,
    • Interleave Overall 64.95 (best) (Table 4).
  • Support for “diagonal layout causes generation confusion”: Supported by Table 3 plus qualitative description.

  • The “+ diagonal layout” variant causes large drops in DocVQA (60.13), InfoVQA (37.42), and ChartQA (54.88) relative to vanilla (Table 3).
  • The paper reports repetitive output failure (“1111…”) and attributes it to positional overlap (Section 3.3.1).

5.4 Ablations, robustness checks, and long-context behavior

  • Position design ablation (Table 3):
  • Shows the impact of:
    • adding 3D structure,
    • adding spatial-reset,
    • diagonal layout,
    • enlarged modality interval,
    • text spatial-reset (breaking text compatibility),
    • scaling rotary base for spatial axes (breaking compatibility).
  • Notably, text spatial-reset drops Image to 58.27 and ChartQA to 44.33 (Table 3), aligning with the “preserve textual priors” guideline.

  • Frequency allocation ablation (Table 4):

  • Interleave and Multi-Head outperform partial spectrum schemes.

  • Interleave ratio ablation (Appendix Table 6):

  • Balanced ratio t:h:w = 24:20:20 gives best overall 64.95.
  • Increasing temporal share harms grounding (e.g., 48:8:8 grounding 72.87 vs 75.85) (Table 6).

  • Temporal stride ablation (Appendix Table 7):

  • Stride δ = 1 yields best overall on video suite (52.36) vs 0.5 (51.11) or 2 (51.10); dynamic stride gives 51.80 (Table 7).
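
For orientation, a fixed temporal stride can be pictured as the amount by which t advances per frame. The sketch below is an assumption about how such a stride would enter the position tuple, not the paper's implementation: the function name and arguments are hypothetical, and the dynamic-stride variant from the ablation is not modeled.

```python
def video_positions(num_frames, rows, cols, start=0, stride=1.0):
    """(t, h, w) per visual token.

    t advances by `stride` per frame; (h, w) restart at 0 in every
    frame, following the spatial-reset form m = (t, h, w).
    """
    return [(start + stride * f, r, c)
            for f in range(num_frames)
            for r in range(rows)
            for c in range(cols)]

# 4 frames of 2x2 tokens, block starting at t = 10, stride 2.
frames = video_positions(num_frames=4, rows=2, cols=2, start=10, stride=2.0)

print(frames[0])   # (10.0, 0, 0)
print(frames[4])   # (12.0, 0, 0)
print(frames[-1])  # (16.0, 1, 1)
```

A larger stride spreads frames further apart on the temporal axis and makes position IDs grow faster; a smaller stride does the opposite.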

  • Extrapolation / long context (Appendix D.4, Figure 5):

  • Models trained at 32K and extrapolated to 64K/128K/256K.
  • The paper reports that vanilla RoPE drops sharply at 128K/256K and attributes this to fast-growing position IDs (Appendix D.4; Figure 5).
  • It also notes VideoRoPE/HoPE have slightly better extrapolation in long-video scenarios, but MHRoPE/MRoPE-I are “most comprehensive and balanced” when considering images + grounding too (Appendix D.4).

6. Limitations and Trade-offs

  • No universal win: Even the best method (MRoPE-I) has small regressions on some image benchmarks (e.g., AI2D and RealWorldQA in Table 2), suggesting the improvements are strongest on geometry/grounding-heavy tasks rather than uniformly boosting all capabilities.

  • MHRoPE vs MRoPE-I trade-off (Appendix D.1):

  • The paper recommends MRoPE-I due to:
    • slight performance advantage,
    • simpler implementation.
  • It hypothesizes MHRoPE can underperform because head-level partitioning “prevents the integration of different positional axes within the self-attention mechanism” (Appendix D.1).

  • Engineering constraints (Appendix D.1):

  • MHRoPE is said to complicate distributed training such as tensor parallelism, whereas MRoPE-I is simpler.

  • Scope of experiments:

  • Training is performed with Qwen2.5VL components and Qwen2.5 7B initialization, with the ViT frozen (Section 3.1).
  • The paper does not provide evidence (in the provided content) that results generalize across:

    • different VLM architectures,
    • different LLM backbones,
    • different visual tokenization schemes,
    • full end-to-end finetuning where the ViT is unfrozen.
  • Missing system-performance reporting:

  • No latency/throughput/memory overhead numbers are provided for these RoPE variants in the excerpt, so practical serving cost differences are not quantified here.

  • Hyperparameter transparency gaps (from what’s provided):

  • Optimizer type and key model configuration (layers/heads/hidden size/tokenizer) are not specified in the provided content, limiting exact reproducibility from this summary alone (even though the paper includes a reproducibility statement).

7. Implications and Future Directions

  • How this changes the landscape (within the paper’s scope):
  • It reframes multimodal positional encoding as a small set of design constraints rather than a growing set of bespoke methods per domain (images vs videos vs generation).
  • The three guidelines—positional coherence, full frequency utilization, preservation of textual priors—serve as a checklist for future multimodal RoPE designs (Abstract; Section 4).

  • Follow-up research suggested by the paper’s findings:

  • Scalability to more axes: MHRoPE is motivated as more scalable when positional axes increase (Section 2.3.3).
  • Better long-context extrapolation: Appendix D.3 suggests MRoPE-I aligns well with frequency-rescaling extrapolation methods because each axis has a full spectrum, and Appendix D.4 highlights remaining extrapolation differences among methods.
  • Understanding attention sinks: The “visual attention sink” (Figure 2) and the effect of spatial-reset (Appendix Table 5) suggest further mechanistic study of how positional IDs interact with pretrained attention biases.

  • Practical applications / downstream use cases:

  • If you need one positional encoding that performs well across:
    • general image understanding,
    • long video understanding,
    • fine-grained grounding (RefCOCO),
  • the paper’s results indicate MRoPE-I is the most consistently strong single choice (Tables 2 and 4; Appendix D.1 recommendation).

  • Repro/Integration Guidance (when to prefer what, based on this paper):

  • Prefer MRoPE-I when you want:
    • the best overall balance and slightly better scores than MHRoPE (Tables 2, 4; Appendix D.1),
    • simpler implementation and fewer distributed-training complications (Appendix D.1),
    • compatibility with frequency-spectrum rescaling style extrapolation (Appendix D.3).
  • Consider MHRoPE when:
    • you anticipate needing to add more positional axes in the future and want head-based scalability (Section 2.3.3),
    • you can tolerate slightly lower performance and higher implementation complexity (Appendix D.1).
  • Avoid position designs that create overlap between visual coordinates and generated text positions:
    • the diagonal layout causes severe failures on document-style tasks and induces repetitive generation (Section 3.3.1; Table 3).
  • Keep text RoPE unchanged from the base LLM:
    • “text spatial-reset” and spatial rotary-base scaling degrade performance (Section 3.3.1; Table 3), supporting the “preservation of textual priors” guideline.