2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

🎯 Pitch

The paper introduces a large, video-centric “multimodal textbook” dataset that converts 22,000 hours of instructional videos into temporally ordered keyframes interleaved with refined ASR and OCR, producing 6.5M images and 0.75B text tokens for VLM pretraining. By prioritizing coherent image sequences and tight image–text alignment, this resource substantially improves VLM performance on knowledge- and reasoning-intensive benchmarks (e.g., ScienceQA-IMG, MathVista) and strengthens in-context, multi-image reasoning compared to webpage-crawled interleaved corpora.

1. Executive Summary

This paper builds a video-centric, image-text interleaved “multimodal textbook” dataset for vision-language model (VLM) pretraining by converting instructional videos into temporally ordered keyframes plus refined ASR (speech-to-text) and OCR (on-screen text/formulas). The core significance is that instructional videos naturally provide coherent sequences and tight text-image coupling, addressing common issues in webpage-crawled interleaved corpora (loose image-text relations, low knowledge density, incoherent image order), and the experiments show consistent gains—especially on knowledge- and reasoning-heavy benchmarks like ScienceQA-IMG and MathVista (Tables 2–3).

2. Context and Motivation

Problem/gap addressed
Interleaved image-text corpora (sequences of paragraphs with multiple images) are useful because they resemble how humans consume multimodal information and can unlock few-shot/in-context capabilities in VLMs (Introduction; Related Work §2.2).
Existing interleaved datasets are typically crawled from webpages/documents (e.g., Common Crawl) and suffer from three issues highlighted in Figure 1:
1. Loose text-image relation (irrelevant images like logos/ads).
2. Weak logical coherence between images (few images per document and unclear progression).
3. Low knowledge density (news/entertainment/ads dilute foundational knowledge).
Why it matters
If interleaved corpora are noisy or poorly aligned, VLMs may fail to learn reliable multimodal reasoning patterns from context, which directly impacts few-shot performance and knowledge-intensive tasks (Introduction; §5.2–5.3).
Prior approaches and shortcomings (as positioned by the paper)
Standard VLM pretraining often uses image-caption pairs to align vision and language, but those pairs are less “text-corpus-like” and may limit in-context learning and long-form reasoning (§2.2).
Webpage-centric interleaved datasets (e.g., MMC4, OBELICS) help but still contain the structural problems above (Figure 1; §4.2).
Paper’s positioning
The paper proposes using instructional videos (e.g., math/physics courses) as a source of textbook-level multimodal sequences: slide/board/animation visuals + spoken explanations, then converts them into a curated interleaved dataset (Introduction; §3; Figures 1–2).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a data curation pipeline that turns raw instructional videos into an interleaved multimodal training corpus (images + text in temporal order).
It solves the problem of building high-quality, coherent interleaved pretraining data by extracting keyframes and educational text (spoken narration + on-screen formulas) and filtering noise at multiple levels (Figure 2; §3).

3.2 Big-picture architecture (diagram in words)

(A) Taxonomy builder → (B) Video retriever → (C) Multi-level extraction/filtering → (D) Interleaved sample builder
A. Knowledge taxonomy (LLM-generated): defines subjects/courses/knowledge points to drive targeted collection (§3.1; Appendix §7.4).
B. Video search & metadata filtering: retrieves videos per knowledge point, deduplicates, filters via LLM on metadata (§3.1; Figure 2).
C. Video-to-textbook pipeline (3 levels):
- Video-level: extract audio → ASR → refine ASR → filter videos using ASR quality/relevance (§3.2; Appendix §7.1).
- Clip-level: segment by ASR timestamps → caption clips → filter clips by caption/ASR similarity (§3.2).
- Keyframe-level: extract keyframes using SSIM change detection → OCR extraction → filter redundant/low-quality OCR and keyframes (§3.2; Algorithm 1 in Appendix).
D. Interleaving: order keyframes and text chronologically to form training samples (§3.2; Figure 2).

3.3 Roadmap for the deep dive

I will explain:
How the paper collects instructional videos using an LLM-generated taxonomy (§3.1; Fig. 2).
How it extracts and refines ASR text and uses it for filtering (§3.2; Appendix §7.1; Table 6).
How it segments videos into clips and filters visually uninformative segments (§3.2).
How it extracts keyframes and OCR, and why SSIM is chosen (Algorithm 1; Table 6).
How the final dataset is packaged and analyzed, including the InSI-SIM coherence metric (Table 1; Appendix §7.5).
How experiments test both benchmark performance and interleaved-context attention (Tables 2–5; Figure 3).

3.4 Detailed, sentence-based technical breakdown

Framing: This is primarily a dataset + pipeline + empirical evaluation paper whose core idea is to convert instructional videos into a filtered, temporally coherent interleaved corpus (Figures 1–2; §3–§5).

3.4.1 Data sourcing via an LLM-proposed knowledge taxonomy

The pipeline begins by creating a four-level taxonomy: Subject → Course → Sub-course → Knowledge Point to systematically cover foundational topics (§3.1).
The taxonomy is generated by prompting an LLM to ensure broad coverage across educational stages and subjects; the resulting taxonomy has 6 subjects, 55 courses, and 3915 knowledge points (Appendix §7.4; Table 7 summarizes per-subject stats).
Each knowledge point is then used as a YouTube search keyword, retrieving the top 50 videos per knowledge point, followed by deduplication by video ID (§3.1).
The paper also uses an LLM to review metadata (title/description/comments) to filter irrelevant or unsafe content (e.g., pornographic/illegal) (§3.1; Figure 2; Appendix §7.1).
This stage yields 159,565 videos collected (Figure 2; §3.1 and §4.1 report 159K/159,565).
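
The collection step above (top 50 hits per knowledge point, then deduplication by video ID) can be sketched as follows. This is a minimal illustration, not the authors' code: `collect_candidates` and the toy `results` dictionary are hypothetical, and the actual search call is omitted.

```python
from typing import Dict, Iterable, List


def collect_candidates(search_results: Dict[str, Iterable[str]], top_k: int = 50) -> List[str]:
    """Keep the top_k hits per knowledge point, then deduplicate by video ID."""
    seen = set()
    unique_ids: List[str] = []
    for _knowledge_point, video_ids in search_results.items():
        for video_id in list(video_ids)[:top_k]:
            if video_id not in seen:  # the same video can match several knowledge points
                seen.add(video_id)
                unique_ids.append(video_id)
    return unique_ids


# Hypothetical search results: two knowledge points that share one video.
results = {
    "pythagorean theorem": ["vidA", "vidB", "vidC"],
    "right triangles": ["vidB", "vidD"],
}
candidates = collect_candidates(results)

# Upper bound on raw hits implied by the reported taxonomy size.
upper_bound = 3915 * 50  # 195,750
```

The reported 159,565 collected videos sits below the 195,750 upper bound, which is what one would expect once duplicate and filtered hits are removed.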

3.4.2 Video-level extraction: audio → ASR → refined tutorial text

For each retained candidate video, the pipeline extracts audio using FFmpeg (§3.2).
It transcribes speech to text using whisper-large-v3 (§3.2; Appendix §7.1).
Because raw tutorial speech ASR tends to be colloquial, fragmented, and noisy, the paper refines the ASR by rewriting it with Qwen2-72B-Instruct to improve fluency/coherence without changing meaning (§3.2).
The paper supports the need for refinement with an ablation: removing ASR refinement increases perplexity and reduces benchmark accuracy (Table 6).
In Table 6:
- Ours (ASR Refine, OCR, SSIM) has perplexity 13.92 and 1-shot average accuracy 31.1.
- w/o ASR Refine has perplexity 16.86 and 1-shot accuracy 26.2 (a ↓4.9 drop).

Important missing detail: The paper reports perplexity values for corpora (Table 6) but the excerpt does not specify which language model is used to compute perplexity; it only reports the numbers and interpretation.

3.4.3 Video-level filtering using ASR: removing non-instructional videos

After ASR extraction (and before ASR rewriting, per §3.2), the pipeline filters low-quality videos in two steps:
Rule-based filtering: removes non-English videos, videos shorter than 10 seconds, and silent videos with very few ASR tokens (§3.2).
LLM scoring of ASR transcripts: an LLM evaluates each video along:
- Relevance to the targeted knowledge point,
- Knowledge density (filters filler-heavy speech),
- Transcription quality (filters repetitive/erroneous ASR) (§3.2).
In the supplementary implementation details, this ASR scoring uses two LLMs (DeepSeek-V2 and Llama3-70B-Instruct) and discards a video if both deem the transcript insufficient (Appendix §7.1).
This filtering reduces the dataset to 75,000 videos totaling 22,697 class hours, with average duration 18 minutes (§4.1; Table 7).
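
A rough sketch of the two filtering steps, assuming the LLM judgments are already available as booleans. The 10-second cutoff and the "discard only if both judges reject" rule come from the description above; `MIN_ASR_TOKENS`, the `VideoRecord` fields, and the function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class VideoRecord:
    language: str          # detected language code, e.g. "en"
    duration_s: float      # video length in seconds
    asr_token_count: int   # number of tokens in the ASR transcript


MIN_DURATION_S = 10   # from the source: videos shorter than 10 seconds are removed
MIN_ASR_TOKENS = 20   # assumption: the source only says "very few ASR tokens"


def passes_rules(video: VideoRecord) -> bool:
    """Rule-based step: English, long enough, and not (nearly) silent."""
    return (
        video.language == "en"
        and video.duration_s >= MIN_DURATION_S
        and video.asr_token_count >= MIN_ASR_TOKENS
    )


def keep_after_llm_vote(judge_a_ok: bool, judge_b_ok: bool) -> bool:
    """LLM step: discard only when both judges deem the transcript insufficient."""
    return judge_a_ok or judge_b_ok
```

Requiring both judges to reject before discarding is the more permissive of the two possible voting rules, so it favors recall over precision at this stage.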

3.4.4 Clip-level extraction: aligning time-localized visuals with ASR segments

Instructional videos are long, so the paper splits each video into short clips (10–20 seconds) using ASR timestamps (§3.2).
Because ASR often comes in fragments, the pipeline first merges incomplete ASR segments into semantically coherent paragraphs, then segments clips accordingly (§3.2).
Each resulting unit is a pair like ⟨clip_i, asr_i⟩ (§3.2), which is intended to improve temporal alignment between the spoken explanation and the visuals shown during that time window.
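
One way to picture the merge-then-segment step is a greedy pass over timestamped ASR fragments. The sketch below is an assumption-laden stand-in: the paper merges fragments into semantically coherent paragraphs, whereas this version uses sentence-final punctuation as a cheap proxy, and `merge_segments` with its `min_len`/`max_len` parameters is hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text)


def merge_segments(segments: List[Segment], min_len: float = 10.0, max_len: float = 20.0) -> List[Segment]:
    """Greedily merge ASR fragments into clips of roughly min_len..max_len seconds."""
    clips: List[Segment] = []
    cur_start: Optional[float] = None
    cur_end = 0.0
    cur_text: List[str] = []
    for start, end, text in segments:
        # Close the running clip first if this fragment would push it past max_len.
        if cur_start is not None and end - cur_start > max_len:
            clips.append((cur_start, cur_end, " ".join(cur_text)))
            cur_start, cur_text = None, []
        if cur_start is None:
            cur_start = start
        cur_end = end
        cur_text.append(text)
        # Proxy for "semantically complete": the fragment ends a sentence.
        complete = text.rstrip().endswith((".", "?", "!"))
        if complete and cur_end - cur_start >= min_len:
            clips.append((cur_start, cur_end, " ".join(cur_text)))
            cur_start, cur_text = None, []
    if cur_start is not None:  # flush whatever is left at the end of the video
        clips.append((cur_start, cur_end, " ".join(cur_text)))
    return clips


segments = [
    (0.0, 4.0, "Today we look at"),
    (4.0, 11.0, "the chain rule."),
    (11.0, 18.0, "First, recall"),
    (18.0, 24.0, "the definition of a derivative."),
]
clips = merge_segments(segments)
```

Each returned tuple corresponds to one ⟨clip_i, asr_i⟩ pair: the time span selects the clip and the joined text is its ASR paragraph.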

3.4.5 Clip-level filtering: removing clips with weak visual knowledge

Even in instructional videos, some segments are visually uninformative (speaker-only shots, transitions, clutter) (§3.2).
To filter these, the paper:
Generates a detailed caption for each clip using VideoLlama2 (§3.2; Appendix §7.1 specifies VideoLlama2-7B).
Computes text similarity between the clip caption and the ASR text using gte-Qwen2-7B-instruct embeddings (§3.2; Appendix §7.1).
Discards clips where visuals do not match ASR (e.g., speaker-only visuals).
Crucially, if a clip is discarded visually, the pipeline may still keep the ASR text because it can contain useful instructional content (§3.2). This yields sequences where some ASR segments have no associated images:
Example format shown in §3.2: ⟨clip1, asr1⟩, asr2, asr3, ⟨clip4, asr4⟩, ...
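
The keep-the-ASR-but-drop-the-clip behavior can be illustrated with a small cosine-similarity filter. The embeddings here are toy 2-D vectors standing in for caption and ASR text embeddings, and the `threshold` value is an assumption, since the excerpt does not report the cutoff used.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


ClipItem = Tuple[str, str, Sequence[float], Sequence[float]]  # (clip_id, asr, caption_emb, asr_emb)


def filter_clips(items: List[ClipItem], threshold: float = 0.5) -> List[Tuple[Optional[str], str]]:
    """Drop the visual clip when caption and ASR disagree, but always keep the ASR text."""
    sequence: List[Tuple[Optional[str], str]] = []
    for clip_id, asr, caption_emb, asr_emb in items:
        if cosine(caption_emb, asr_emb) >= threshold:
            sequence.append((clip_id, asr))
        else:
            sequence.append((None, asr))  # text-only entry, like asr2 and asr3 in the example
    return sequence


items = [
    ("clip1", "asr1", [1.0, 0.0], [0.9, 0.1]),  # caption matches the narration
    ("clip2", "asr2", [1.0, 0.0], [0.0, 1.0]),  # e.g. speaker-only visuals
]
filtered = filter_clips(items)
```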

3.4.6 Keyframe extraction: selecting representative frames using SSIM change detection

From each retained clip, the pipeline extracts keyframes by detecting significant frame-to-frame changes (§3.2).
It uses the Structural Similarity Index (SSIM) to compare consecutive frames, selecting a new keyframe when SSIM drops below a threshold T (Appendix Algorithm 1).
Algorithm 1 (Appendix): Initialize with first frame as keyframe and as reference; iterate frames F_i, compute SSIM(reference, F_i), and if SSIM < T, add F_i as a keyframe and update reference.
The paper later justifies SSIM vs alternatives with ablations (Table 6):
Pixel-level difference extraction yields 18M keyframes (vs 6.5M baseline) and reduces 1-shot accuracy to 22.1 (↓9).
CLIP-based semantic extractor yields 1.7M keyframes (too few) and reduces 1-shot accuracy to 24.6 (↓6.5).
SSIM yields 6.5M keyframes and the best reported accuracy among these choices (Table 6).
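
The loop described for Algorithm 1 can be sketched in pure Python. Two simplifications are assumed here: `global_ssim` computes SSIM over the whole frame as a single window (standard SSIM implementations average over local windows), and the threshold value 0.8 is illustrative, since the excerpt does not give T.

```python
from typing import List, Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM over two flattened grayscale frames of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2  # usual stabilizing constants
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    )


def extract_keyframes(frames: List[Sequence[float]], threshold: float = 0.8) -> List[int]:
    """Return indices of keyframes: the first frame, plus each frame whose SSIM
    against the current reference falls below the threshold."""
    if not frames:
        return []
    keyframes = [0]
    reference = frames[0]
    for i in range(1, len(frames)):
        if global_ssim(reference, frames[i]) < threshold:
            keyframes.append(i)
            reference = frames[i]  # later frames are compared against the new keyframe
    return keyframes


# Toy 4-pixel frames: the picture flips between frame 1 and frame 2.
frame_a = [0.0, 0.0, 255.0, 255.0]
frame_b = [255.0, 255.0, 0.0, 0.0]
frames = [frame_a, frame_a, frame_b, frame_b]
keyframe_indices = extract_keyframes(frames)
```

Comparing against the last keyframe rather than the previous frame means slow, gradual changes (for example, a formula being written stroke by stroke) eventually accumulate enough difference to trigger a new keyframe.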

3.4.7 OCR extraction and filtering: capturing formulas, symbols, and on-screen text

Instructional videos frequently display formulas and bullet points (especially math/science), so the pipeline extracts OCR text from keyframes (§3.2).
It uses InternVL2-40B for OCR to extract text, symbols, and formulas (§3.2), and also uses InternVL2 to score/filter keyframes (occlusion/low information) and remove redundant OCR that is identical/similar across adjacent frames (§3.2).
The benefit of OCR is quantified by an ablation in Table 6:
Removing OCR (w/o OCR) reduces 1-shot average accuracy from 31.1 to 28.8 (↓2.3).
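
To make the "remove redundant OCR across adjacent frames" idea concrete, here is a small stand-in that uses plain string similarity. The paper performs this step with InternVL2; the `difflib`-based comparison and the `max_ratio` cutoff below are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def drop_redundant_ocr(ocr_texts: List[str], max_ratio: float = 0.9) -> List[str]:
    """Blank out OCR text that is near-identical to the most recently kept OCR text.

    The output has one entry per keyframe, so frame/text alignment is preserved.
    """
    kept: List[str] = []
    last: Optional[str] = None
    for text in ocr_texts:
        if last is not None and SequenceMatcher(None, last, text).ratio() >= max_ratio:
            kept.append("")  # redundant with the previous slide or board content
        else:
            kept.append(text)
            last = text
    return kept


ocr_per_frame = ["E = mc^2", "E = mc^2", "F = ma"]
deduplicated = drop_redundant_ocr(ocr_per_frame)
```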

3.4.8 Packaging into interleaved training samples

After extraction and filtering, the pipeline interleaves keyframes and text in chronological order to form a “textbook-like” sequence (§3.2).
The paper gives an example structure:
- {frame_{k1}^1, frame_{k2}^1, ocr_1, asr_1, asr_2, asr_3, frame_{k1}^4, ocr_4, asr_4, ...} (§3.2).
To improve training efficiency, multiple extracted fragments ⟨frames, ocr_i, asr_i⟩ are concatenated into a single sample, producing 610K interleaved samples (§4.1; Figure 2).
The supplementary discusses sample construction strategies and notes a trade-off:
Treating each video as one sample preserves semantic integrity but produces contexts that are too long (average 86 keyframes/video).
Concatenating clips may mix videos, so they insert an End of Video token to reduce harm (Appendix §7.3).
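
The concatenation strategy with an End of Video marker might look like the sketch below. The token string `EOV`, the `<image>` placeholder, and the `max_images` budget are all hypothetical; the source only says that fragments are concatenated and that an End of Video token separates different videos. OCR and ASR are folded into a single text field for brevity.

```python
from typing import List, Tuple

EOV = "<|end_of_video|>"  # hypothetical token string for the "End of Video" marker
IMAGE = "<image>"         # hypothetical placeholder for one keyframe

Fragment = Tuple[str, int, str]  # (video_id, number_of_keyframes, text)


def pack_samples(fragments: List[Fragment], max_images: int = 12) -> List[List[str]]:
    """Concatenate consecutive fragments into samples under an image budget."""
    samples: List[List[str]] = []
    current: List[str] = []
    images = 0
    prev_video = ""
    for video_id, n_frames, text in fragments:
        if current and images + n_frames > max_images:
            samples.append(current)  # budget exceeded: start a new sample
            current, images = [], 0
        if current and video_id != prev_video:
            current.append(EOV)      # mark the boundary between two videos
        current.extend([IMAGE] * n_frames)
        current.append(text)
        images += n_frames
        prev_video = video_id
    if current:
        samples.append(current)
    return samples


fragments = [("v1", 2, "asr1"), ("v1", 0, "asr2"), ("v2", 3, "asr3")]
packed = pack_samples(fragments)
```

With the default budget, the three fragments land in one sample, and the single `EOV` entry sits between the last text of `v1` and the first keyframe of `v2`.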

3.4.9 Dataset scale and composition (what comes out)

From Figure 2, §4.1, and Table 7, the resulting dataset is:

75,000 instructional videos
22,697 hours (≈ 2.5 years) of class content
4M video clips
6.58M keyframes (≈ 6.5M, reported consistently across sections)
258–259M ASR tokens (Figure 2 reports 259M; Table 7 reports 258M)
500M OCR tokens
610K interleaved samples
Per-sample averages: 10.7 images and 1,297 text tokens (§4.1; Table 1 also reports these averages)
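
As a quick consistency check, the derived figures quoted in this review can be recomputed from the rounded totals above. The variable names are illustrative, and the inputs are the rounded values listed here rather than exact counts.

```python
videos = 75_000
hours = 22_697
keyframes = 6_500_000
samples = 610_000
asr_tokens = 258_000_000
ocr_tokens = 500_000_000

years = hours / (24 * 365)                   # about 2.59, matching "2.5 years"
avg_minutes = hours * 60 / videos            # about 18.2, matching "18 minutes"
images_per_sample = keyframes / samples      # about 10.7
frames_per_video = keyframes / videos        # about 86.7, matching "86 keyframes/video"
total_text_tokens = asr_tokens + ocr_tokens  # 758M, matching "0.75B text tokens"
text_tokens_per_sample = total_text_tokens / samples  # about 1,243
```

The rounded totals give about 1,243 text tokens per sample, slightly below the reported 1,297; the excerpt does not explain the difference.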

3.4.10 Dataset coherence metric: InSI-SIM

The paper introduces InSI-SIM (In-sample Image Similarity) to quantify how related images are within a sample (§4.2).
It combines semantic similarity (CLIP score) and structural similarity (SSIM), averaging over all image pairs in the sample (Appendix §7.5, Eq. (1)).
In plain language: for each sample, compute similarity for every image pair using both CLIP and SSIM, average them, then average across samples.
In Table 1, InSI-SIM is reported for subsets of size L=4…8, and the dataset maintains high coherence across larger subsets:
Ours average InSI-SIM ≈ 0.686, vs OBELICS ≈ 0.345, and MMC4 ≈ 0.319 (Table 1).
The paper highlights that as L increases, their score stays around 0.68, while others drop more noticeably (§4.2; Table 1).
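
The plain-language description above can be turned into a short sketch. The exact form of Eq. (1) is not reproduced in the excerpt, so the equal weighting of the two similarities is an assumption, and `semantic_sim`/`structural_sim` are pluggable stand-ins for a CLIP-based score and SSIM.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def insi_sim(
    samples: List[Sequence[Image]],
    semantic_sim: Callable[[Image, Image], float],
    structural_sim: Callable[[Image, Image], float],
) -> float:
    """Average pairwise image similarity within each sample, then across samples.

    Assumption: the two similarities are weighted equally.
    """
    per_sample: List[float] = []
    for images in samples:
        pairs = list(combinations(images, 2))
        if not pairs:  # samples with fewer than two images have no pairs
            continue
        score = sum(0.5 * (semantic_sim(a, b) + structural_sim(a, b)) for a, b in pairs)
        per_sample.append(score / len(pairs))
    return sum(per_sample) / len(per_sample) if per_sample else 0.0


# Toy "images" are plain numbers so the example runs without any vision model.
toy_samples = [[0.0, 0.0, 1.0]]
toy_score = insi_sim(
    toy_samples,
    semantic_sim=lambda a, b: 1.0 if a == b else 0.5,
    structural_sim=lambda a, b: 1.0 - abs(a - b),
)
```

In the toy case the three pairs score 1.0, 0.25, and 0.25, so the sample average is 0.5.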

3.4.11 Training configuration / hyperparameters (what the excerpt provides vs omits)

The paper specifies which base models are pretrained/continually pretrained:
LLaVA-1.5-7B continual pretraining on interleaved corpora (starting from a model already aligned on 558K paired data) (§5.1; Table 2 context).
Idefics2-8B in two settings:
1. Train “from scratch” with randomly initialized projector.
2. Continual pretraining from Idefics2-8B-base already pretrained on OBELICS (§5.1; Table 3).
For fairness, they sample 610K examples from MMC4 and OBELICS and apply the same training parameters (§5.1).
However, the provided excerpt does not include core optimizer/training hyperparameters (optimizer name, learning rate schedule, batch size, context length, tokenizer, layers/hidden size/heads, compute budget, hardware), so they are reported here as unavailable rather than guessed.

4. Key Insights and Innovations

Video-centric “multimodal textbook” as an interleaved corpus (Figures 1–2; §3–§4)
Difference vs webpage interleaving: webpages often contain sparse and weakly related images; instructional videos naturally produce frame-by-frame coherent visuals aligned with an explanation.
Significance: the dataset improves performance particularly in knowledge/reasoning tasks (Tables 2–3) and improves few-shot behavior (Table 4).
LLM-driven taxonomy → scalable targeted data collection (§3.1; Appendix §7.4)
The paper operationalizes “collect educational content” by first creating a structured taxonomy (6 subjects, 55 courses, 3915 knowledge points), then querying YouTube with each knowledge point.
Significance: this reduces reliance on broad web crawl heuristics and increases knowledge density by construction (motivation in Figure 1; stats in §4.1).
Multi-level, coarse-to-fine filtering and extraction pipeline (Figure 2; §3.2; Appendix §7.1; Table 6)
Video-level filtering uses ASR-based LLM scoring for relevance/knowledge density/transcription quality.
Clip-level filtering removes segments lacking visual instructional content using caption/ASR similarity.
Keyframe-level filtering reduces redundancy and OCR noise.
Significance: ablations show that design choices (ASR refining, OCR inclusion, SSIM keyframes) materially affect downstream benchmark accuracy (Table 6).
Quantifying within-sample visual coherence with InSI-SIM (Table 1; Appendix §7.5 Eq. (1))
Novelty: not just counting images/tokens but measuring how related the multiple images are within a training sample using both structural and semantic similarity.
Significance: empirically supports the claim that video-derived interleaving yields more coherent image sequences than webpage-derived corpora (Table 1).
Evaluating “interleaved context awareness” with a Cheat Test (Table 4; §5.3)
The test checks whether the model can exploit a prompt example identical to the test sample, which probes whether the model pays attention to interleaved context rather than ignoring it.
Significance: the model pretrained on this textbook shows substantially higher “cheat” success rates, especially on math reasoning benchmarks (Table 4).

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, setup)

Models / baselines (§5.1):
LLaVA-1.5-7B continual pretraining on:
- MMC4, MMC4-Core-ff, OBELICS, and Textbook-6.5M (Table 2).
Idefics2-8B:
- Continual pretraining from Idefics2-8B-base, and training from scratch with random projector initialization (Table 3).
Training data parity (§5.1):
Uses 610K samples for each interleaved dataset (their textbook produces 610K samples; they sample equivalently from MMC4/OBELICS).
Benchmarks (§5.1; Appendix §8.1):
VQA: TextVQA, OKVQA
Knowledge-centric: ScienceQA-IMG
Visual reasoning/math: MathVista, MathVision, MathVerse (the excerpt contains a minor duplication typo in §5.1 but Tables list these three).
Few-shot protocol (§5.1; Appendix §8.1):
Accuracy is measured in 0-shot, 1-shot, 2-shot, 4-shot for LLaVA (Table 2).
For Idefics2, evaluation is extended to 8-shot (Table 3).
Uses RICES retrieval to select k similar examples based on image features (Appendix §8.1), with special handling because some math benchmarks only have test sets.
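
Similarity-based example selection of the kind described for RICES can be sketched as a nearest-neighbor lookup over image features. The feature vectors below are toy values, and `rices_select` is a hypothetical helper; the special handling for benchmarks that only have test sets is not shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rices_select(query_feat: Sequence[float], pool: List[Tuple[str, Sequence[float]]], k: int) -> List[str]:
    """Return the IDs of the k pool examples whose image features are closest to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query_feat, item[1]), reverse=True)
    return [example_id for example_id, _ in ranked[:k]]


pool = [("a", [1.0, 0.1]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
selected = rices_select([1.0, 0.0], pool, k=2)
```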

Missing detail: The excerpt does not specify exact prompt templates for all models, but Appendix §8.1 provides the LLaVA-style chat prompt skeleton used in evaluation.

5.2 Main quantitative results (with numbers)

5.2.1 LLaVA-1.5-7B continual pretraining (Table 2)

ScienceQA-IMG (accuracy, by shot):
Textbook-6.5M: 26.3 (0-shot), 29.4 (1-shot), 25.1 (2-shot), 37.3 (4-shot)
OBELICS: (0-shot not shown), 2.8 (1-shot), 3.0 (2-shot), 16.4 (4-shot)
MMC4: (0-shot not shown), 1.6 (1-shot), 3.9 (2-shot), 11.6 (4-shot)
The paper highlights >20% improvement vs MMC4 on ScienceQA in both zero- and few-shot settings (§5.2). Table 2 indeed shows a very large gap between Textbook and MMC4/OBELICS on ScienceQA-IMG.
MathVista (Table 2):
Textbook-6.5M: 24.3 (0-shot), 43.4 (1-shot), 33.2 (2-shot), 29.2 (4-shot)
OBELICS: 21.6, 28.5, 31.1, 27.6
MMC4: 20.4, 30.0, 27.9, 26.0
The largest gain appears at 1-shot: 43.4 vs 28.5 (OBELICS) and 30.0 (MMC4).
Overall average across 7 benchmarks (Table 2 “Avg.” row, by shot):
Textbook-6.5M: 15.5 (0-shot), 31.1 (1-shot), 28.8 (2-shot), 30.8 (4-shot)
OBELICS: 10.7, 22.8, 24.8, 26.2
MMC4: 10.9, 19.4, 19.5, 21.9
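The paper summarizes average gains as +3.2%, +8.3%, +4.0%, +4.6% from 0-shot to 4-shot (§5.2). The 1-, 2-, and 4-shot figures match the Textbook-6.5M margin over OBELICS in the rows above (31.1 − 22.8 = 8.3, 28.8 − 24.8 = 4.0, 30.8 − 26.2 = 4.6); the 0-shot figure matches neither the OBELICS row (15.5 − 10.7 = 4.8) nor the MMC4 row (15.5 − 10.9 = 4.6), and the excerpt does not state which baseline it uses.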

5.2.2 Idefics2-8B results (Table 3)

Continual pretraining from Idefics2-8B-base:
Textbook-6.5M improves over OBELICS and MMC4-cf (the tables' abbreviation for MMC4-Core-ff) on all listed tasks:
- OKVQA: 55.1 (Textbook) vs 54.6 (OBELICS) vs 54.1 (MMC4-cf)
- TextVQA: 58.2 vs 57.5 vs 57.7
- MathVista: 29.7 vs 27.6 vs 27.8
- MathVision: 16.2 vs 14.3 vs 14.0
- MathVerse: 19.4 vs 17.5 vs 17.3
Pretraining Idefics2-8B from scratch (random projector init):
Textbook-6.5M: TextVQA 26.8, MathVista 26.1, MathVision 14.4, MathVerse 19.8.
Compared to OBELICS-from-scratch: TextVQA 25.7, MathVista 24.2, MathVision 13.6, MathVerse 17.7.

5.2.3 In-context / interleaved attention probing: Cheat Test (Table 4)

1-shot cheat test (prompt contains one example identical to test case):
Ours vs OBELICS vs MMC4-cf:
- OKVQA: 79.2 vs 71.5 vs 69.0
- TextVQA: 51.9 vs 43.8 vs 41.0
- MathVista: 94.1 vs 67.7 vs 72.6
- MathVision: 98.4 vs 66.5 vs 69.3
- MathVerse: 76.8 vs 62.8 vs 55.7
2-shot cheat test (one identical + one random example):
OKVQA: 84.3 (Ours) vs 71.3 (OBELICS) vs 53.5 (MMC4-cf)
MathVision: 70.7 vs 39.9 vs 51.9
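Interpretation in §5.3: pretraining on the textbook makes the model better at allocating attention to longer interleaved contexts (Table 4). In the numbers quoted above, the textbook model keeps the highest absolute 2-shot accuracy on both benchmarks. Its 1-shot to 2-shot change is the most favorable on OKVQA (+5.1 vs −0.2 for OBELICS and −15.5 for MMC4-cf) but not on MathVision (−27.7 vs −26.6 and −17.4), so the "smaller drop" reading holds only for some benchmarks.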
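
The prompt construction behind the Cheat Test can be sketched as follows. The dictionary layout, the `build_cheat_prompt` name, and the placement of the identical example last among the in-context examples are assumptions; the source only states that one example is identical to the test case and, in the 2-shot setting, one more is random.

```python
import random
from typing import Dict, List


def build_cheat_prompt(test_item: Dict[str, str], pool: List[Dict[str, str]],
                       shots: int, rng: random.Random) -> Dict[str, object]:
    """Build in-context examples: (shots - 1) random items plus one copy of the test item."""
    others = [example for example in pool if example is not test_item]
    distractors = rng.sample(others, shots - 1)
    examples = distractors + [test_item]  # the "cheat" example includes the gold answer
    query = {"image": test_item["image"], "question": test_item["question"]}
    return {"examples": examples, "query": query}


item_pool = [
    {"image": "img0.png", "question": "What is x?", "answer": "3"},
    {"image": "img1.png", "question": "What is y?", "answer": "7"},
    {"image": "img2.png", "question": "What is z?", "answer": "1"},
]
one_shot = build_cheat_prompt(item_pool[0], item_pool, shots=1, rng=random.Random(0))
two_shot = build_cheat_prompt(item_pool[0], item_pool, shots=2, rng=random.Random(0))
```

A model that attends to its context should be able to copy the answer from the identical example, so accuracy on this test measures context use rather than task skill.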

5.2.4 Sensitivity to image order: shuffling ablation (Figure 3; §5.3)

The paper shuffles image order within samples for 20%, 50%, and 100% of samples and retrains.
Figure 3 shows:
MMC4 is largely unaffected by shuffling.
OBELICS drops moderately.
Textbook-6.5M drops substantially, increasingly with higher shuffle ratios.
The paper’s conclusion: unlike webpage datasets (where order is weakly meaningful), the textbook’s image order encodes meaningful sequential structure (Figure 3; §5.3).
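
A minimal sketch of the shuffling setup, assuming that only image positions are permuted while text stays in place and that images are marked with an `img:` prefix. Both choices, along with the function name and seed handling, are illustrative; the excerpt does not describe the exact procedure.

```python
import random
from typing import List


def shuffle_image_order(samples: List[List[str]], ratio: float, seed: int = 0) -> List[List[str]]:
    """Shuffle the images among their own slots in a `ratio` fraction of samples."""
    rng = random.Random(seed)
    n_shuffle = round(ratio * len(samples))
    chosen = set(rng.sample(range(len(samples)), n_shuffle))
    shuffled: List[List[str]] = []
    for index, sample in enumerate(samples):
        sample = list(sample)  # copy so the input is left untouched
        if index in chosen:
            slots = [i for i, item in enumerate(sample) if item.startswith("img:")]
            images = [sample[i] for i in slots]
            rng.shuffle(images)
            for slot, image in zip(slots, images):
                sample[slot] = image  # text entries keep their positions
        shuffled.append(sample)
    return shuffled


corpus = [["img:1", "text a", "img:2", "img:3", "text b"]]
fully_shuffled = shuffle_image_order(corpus, ratio=1.0)
untouched = shuffle_image_order(corpus, ratio=0.0)
```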

5.2.5 Downstream after instruction tuning (Table 5)

They instruction-fine-tune on LLaVA-665K after different pretraining corpora.
Table 5 shows gains (zero-shot after SFT) on:
OKVQA: from 61.1 (no additional interleaved pretraining) to 62.2 (Textbook), vs 61.8 (OBELICS), 61.5 (MMC4-Core-ff).
MathVista: from 23.2 to 28.7 (Textbook), vs 25.6 (OBELICS), 24.8 (MMC4-Core-ff).
The paper interprets this as knowledge learned during pretraining transferring positively into instruction tuning (§5.3).

5.3 Do the experiments convincingly support the claims?

Supportive evidence:
The paper’s central claim—video-centric textbook data improves knowledge/reasoning benchmarks—is directly supported by the large gaps on ScienceQA-IMG and meaningful gains on MathVista/MathVision/MathVerse in Tables 2–3.
Claims about interleaved context awareness are supported by:
- Better few-shot trends vs OBELICS on general VQA at higher shots (§5.2; Table 2 OKVQA/TextVQA patterns),
- Cheat Test results (Table 4),
- Image-order shuffling sensitivity (Figure 3).
The pipeline design choices are backed by ablations (Table 6).
What is not fully supported / unclear from the provided excerpt:
The paper attributes improved few-shot general-domain performance to coherent context (§5.2), but the excerpt does not provide deeper qualitative analyses of attention patterns—only behavioral tests (Cheat Test).
Training compute, hyperparameters, and architecture details are not included here, which limits reproducibility assessment.

6. Limitations and Trade-offs

Residual noise and redundancy
Even with multi-level filtering, the dataset may still contain redundant keyframes and low-quality texts (Appendix §9 Limitations).
OCR quality is a double-edged sword
OCR improves performance (Table 6), but the paper notes low-quality OCR can introduce noise and degrade performance, making tool choice critical (§5.4).
Dependence on multiple large external models/tools
The pipeline uses many components (Whisper-large-v3, multiple LLMs for rewriting/scoring, VideoLlama2, InternVL2-40B, embedding models), which may be expensive and complex to reproduce exactly (Appendix §7.1; §3.2).
Order sensitivity implies fragility to preprocessing
The dataset’s benefit relies on coherent temporal ordering; shuffling harms performance (Figure 3). This is good evidence of structure, but also means incorrect segmentation or ordering errors could be harmful.
Training objective scope
The limitations section states that during training “the loss is not computed for image tokens,” focusing on understanding and text generation for interleaved contexts (Appendix §9). This may limit applicability to models requiring explicit image generation or more “omni-modal” objectives (Appendix §9).
Missing reproducibility details in the provided content
Core training hyperparameters (optimizer, LR schedule, batch size, context length) and compute/hardware are not present in the excerpt; this is a practical limitation for replication under strict settings.

7. Implications and Future Directions

How this changes the landscape (based on the paper’s evidence)
The work suggests that where interleaved data comes from matters: instructional-video-derived sequences provide higher within-sample coherence (InSI-SIM, Table 1) and better downstream reasoning and in-context behavior (Tables 2–4; Figure 3).
It strengthens the argument that “textbook-level” curated corpora can improve not only language models (motivation references Phi-style work in the Introduction) but also multimodal pretraining when converted into aligned interleavings.
Follow-up research enabled or suggested by the paper
Better filtering and de-duplication: The authors explicitly plan to further improve quality and knowledge density (Appendix §9).
Objective extensions to omni-modal training: Since current training does not compute loss on image tokens, extending to objectives that do may broaden applicability (Appendix §9).
More fine-grained alignment checks: The clip filtering uses caption/ASR similarity; future work could explore tighter grounding between specific OCR spans, ASR sentences, and frames (not described here, but directly motivated by the pipeline’s goals).
Practical applications / downstream use cases
Pretraining data for VLMs intended for:
- STEM-heavy VQA and multimodal reasoning (supported by ScienceQA-IMG and Math benchmarks in Tables 2–3).
- Few-shot multimodal prompting where the model must actually use interleaved examples (supported by Cheat Test, Table 4).
Repro/Integration Guidance (based on what the paper provides)
When to prefer this dataset/approach over webpage interleaving:
- If your target tasks require multi-step reasoning over sequential visual explanations (e.g., math derivations, stepwise diagrams), the dataset’s order coherence appears central (Figure 3; §5.2–5.3).
Key pipeline choices to keep if replicating:
- ASR refinement is important (Table 6 shows a ↓4.9 accuracy drop without it).
- OCR extraction + filtering matters (Table 6 shows ↓2.3 without OCR).
- SSIM-based keyframe extraction outperforms pixel-diff and CLIP-based selection for these instructional visuals (Table 6; Algorithm 1).
What you would still need (not in the provided excerpt):
- Exact model training hyperparameters and compute setup to reproduce the reported benchmark numbers.
