YouTube Ingestion Pipeline — Design Document

Status: Draft · Author: crossmem team · Date: 2026-04-15 · Tracking: crossmem-rs#27


1. Overview

Extend crossmem capture <url> to detect youtube.com / youtu.be hosts and dispatch to a YouTube-specific pipeline that produces time-aligned wiki chunks — the video analog of the PDF chunk pipeline from #24.

The pipeline runs entirely locally on an Apple Silicon Mac mini (M2/M4). No cloud APIs.

Pipeline stages

capture (download + extract audio/subs)
  → transcribe (whisper.cpp Metal)
  → keyframes (ffmpeg scene-cut)
  → OCR + VLM caption (per keyframe)
  → compile (Ollama paraphrase/implication per chunk)
  → emit wiki markdown

2. Download Path

Decision: yt-dlp binary

| Option | Pros | Cons |
|---|---|---|
| yt-dlp binary | Battle-tested, handles every edge case, active community, --cookies-from-browser for member-only | External dep, Python-based, updates frequently |
| libyt-dlp bindings | Tighter integration | No stable C API; Python FFI is fragile |
| youtube-rs (pure Rust) | No external dep | Incomplete, breaks on YT changes, no auth, no live/shorts |

yt-dlp wins because YouTube aggressively rotates extraction logic. Maintaining a pure-Rust extractor is a full-time job. yt-dlp is the industry standard for a reason.

Edge cases handled by yt-dlp flags

| Scenario | yt-dlp flags |
|---|---|
| Age-gated | --cookies-from-browser chrome (reads real Chrome cookies) |
| Member-only | Same cookie approach; user must be logged in |
| Live streams | --live-from-start --wait-for-video 30 (wait + download from start) |
| Shorts | Works as normal URLs (youtube.com/shorts/ID → standard extraction) |
| Playlists | --yes-playlist or --no-playlist (user flag; default: single video) |
| Chapters | --embed-chapters + --write-info-json (chapter list in info JSON) |
| Auto captions | --write-auto-subs --sub-lang en |
| Human captions | --write-subs --sub-lang en (preferred over auto when available) |
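
Downstream stages key everything on the bare video ID (file naming via %(id)s, provenance links), even though yt-dlp itself accepts the raw URL. A sketch of a hypothetical extract_video_id helper covering the watch, youtu.be, and Shorts URL shapes above:

```rust
// Hypothetical helper (not yet in crossmem): pull the 11-char-style video ID
// out of the URL forms the table covers. Playlist URLs pass through yt-dlp as-is.
fn extract_video_id(url: &str) -> Option<String> {
    // IDs are alphanumeric plus '-' and '_'; stop at '&', '?', '/', etc.
    let take_id = |s: &str| {
        let id: String = s
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric() || *c == '-' || *c == '_')
            .collect();
        if id.is_empty() { None } else { Some(id) }
    };
    if let Some(rest) = url.split("watch?v=").nth(1) {
        take_id(rest)
    } else if let Some(rest) = url.split("youtu.be/").nth(1) {
        take_id(rest)
    } else if let Some(rest) = url.split("/shorts/").nth(1) {
        take_id(rest)
    } else {
        None
    }
}

fn main() {
    assert_eq!(
        extract_video_id("https://www.youtube.com/watch?v=aircAruvnKk&t=10").as_deref(),
        Some("aircAruvnKk")
    );
}
```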

Download command template

yt-dlp \
  --format "bestaudio[ext=m4a]/bestaudio/best" \
  --extract-audio --audio-format wav --audio-quality 0 \
  --write-info-json \
  --write-subs --write-auto-subs --sub-lang "en.*" --sub-format vtt \
  --embed-chapters \
  --cookies-from-browser chrome \
  --output "%(id)s.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"

For keyframe extraction we also need the video file:

yt-dlp \
  --format "bestvideo[height<=1080][ext=mp4]/bestvideo[height<=1080]/best" \
  --write-info-json \
  --output "%(id)s_video.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"
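
From Rust, both templates reduce to a std::process::Command builder, the same pattern cite.rs uses for pdftotext. A sketch for the audio invocation (function name and raw_dir parameter are illustrative):

```rust
use std::process::Command;

// Sketch: build the audio-download invocation from the template above.
// The caller spawns it and checks the exit status; nothing runs here.
fn audio_download_cmd(url: &str, raw_dir: &str) -> Command {
    let mut cmd = Command::new("yt-dlp");
    cmd.args([
        "--format", "bestaudio[ext=m4a]/bestaudio/best",
        "--extract-audio", "--audio-format", "wav", "--audio-quality", "0",
        "--write-info-json",
        "--write-subs", "--write-auto-subs", "--sub-lang", "en.*", "--sub-format", "vtt",
        "--embed-chapters",
        "--cookies-from-browser", "chrome",
        "--output", "%(id)s.%(ext)s",
        "--paths", raw_dir,
        url,
    ]);
    cmd
}

fn main() {
    let cmd = audio_download_cmd("https://youtu.be/abc", "/tmp/raw");
    // cmd.status() would spawn the real binary; here we just inspect it.
    println!("{:?}", cmd);
}
```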

3. Audio Extraction → Transcription

Decision: whisper.cpp with Metal acceleration, large-v3-turbo model

| Engine | Backend | Speed (1h audio, M2) | Accuracy | Notes |
|---|---|---|---|---|
| whisper.cpp | Metal (Apple GPU) | ~6–8 min | WER ~8% (large-v3-turbo) | C/C++, no Python, --print-timestamps for word-level |
| whisper-mlx | MLX (Apple GPU) | ~5–7 min | Same models | Python dep, MLX framework, slightly faster on M4 |
| WhisperKit | CoreML | ~5–6 min | Good | Swift-only, harder to call from Rust |
| insanely-fast-whisper | MPS (PyTorch) | ~10–15 min | Same models | Heavy Python stack, MPS less optimized than Metal |
| faster-whisper | CTranslate2 (CPU) | ~15–25 min | Same models | No Metal/MPS; CPU-only on macOS |

whisper.cpp wins because:

  1. Native Metal acceleration — no Python runtime
  2. Easily called from Rust via std::process::Command (same pattern as pdftotext in cite.rs)
  3. Outputs VTT/SRT/JSON with word-level timestamps
  4. Active project, models available via Hugging Face in ggml format

Model choice: large-v3-turbo

| Model | Params | VRAM | Disk | Speed (M2, 1h) | WER (en) |
|---|---|---|---|---|---|
| large-v3 | 1.55B | ~3 GB | 3.1 GB | ~12 min | ~7.5% |
| large-v3-turbo | 809M | ~1.6 GB | 1.6 GB | ~6 min | ~8% |
| distil-large-v3 | 756M | ~1.5 GB | 1.5 GB | ~5 min | ~9% |

large-v3-turbo is the sweet spot: half the VRAM of large-v3, nearly the same WER, 2× faster. distil-large-v3 is marginally faster but has slightly worse accuracy on non-native English speakers (common in academic talks).

Transcription command

whisper-cpp \
  --model models/ggml-large-v3-turbo.bin \
  --file "$HOME/crossmem/raw/youtube/${VIDEO_ID}.wav" \
  --output-vtt \
  --output-json \
  --print-timestamps \
  --language en \
  --threads 4

Caption priority

  1. Human-uploaded subtitles (.en.vtt from yt-dlp) — highest quality, use as-is
  2. whisper.cpp transcription — always run for timestamp alignment even if subs exist
  3. Auto-generated YouTube captions — fallback only; lower quality than whisper

When human subs exist, align them with whisper timestamps for precise time-coding.
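
The first step of that alignment is getting both sources into one time base. A sketch of a hypothetical VTT cue-timestamp parser (milliseconds out):

```rust
// Sketch of the alignment primitive: parse a VTT cue timestamp
// ("HH:MM:SS.mmm" or "MM:SS.mmm") into milliseconds so human-sub cues
// can be matched against whisper segments by nearest start time.
fn vtt_ts_to_ms(ts: &str) -> Option<u64> {
    let (clock, millis) = ts.split_once('.')?;
    let ms: u64 = millis.parse().ok()?;
    let parts = clock
        .split(':')
        .map(|p| p.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    let secs = match parts[..] {
        [h, m, s] => h * 3600 + m * 60 + s,
        [m, s] => m * 60 + s,
        _ => return None,
    };
    Some(secs * 1000 + ms)
}

fn main() {
    assert_eq!(vtt_ts_to_ms("00:01:32.500"), Some(92_500));
}
```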

Speaker diarization

Decision: Skip for P1, add in P3 if needed.

Rationale:

  • Most YouTube content crossmem targets is solo presenter (lectures, conference talks, tutorials)
  • pyannote requires Python + HF token + ~2 GB model; adds significant complexity
  • sherpa-onnx is lighter but diarization accuracy on overlapping speech is still mediocre
  • Can retrofit later: diarization produces (speaker_id, start, end) segments that merge with existing transcript chunks

If multi-speaker content becomes common, P3 can add pyannote 3.1 with speaker embedding.


4. Visual Understanding

4a. Keyframe extraction

Decision: ffmpeg scene-cut detection

ffmpeg -i "${VIDEO_ID}_video.mp4" \
  -vf "select='gt(scene,0.3)',showinfo" \
  -vsync vfr \
  -frame_pts 1 \
  "${OUTPUT_DIR}/keyframe_%04d.png" \
  2>&1 | grep "pts_time" > "${OUTPUT_DIR}/keyframe_times.txt"

| Method | Pros | Cons |
|---|---|---|
| ffmpeg scene filter | Zero extra deps, timestamp-aware, tunable threshold | May over/under-extract |
| TransNetV2 | ML-based, higher accuracy | Python + PyTorch dep, overkill for slides |
| PySceneDetect | Good API | Python dep |

ffmpeg is already a required dependency (for audio extraction). Scene threshold 0.3 works well for slide-based content; can tune per-video.

Chapter-aware extraction: If the info JSON contains chapters, also extract one keyframe per chapter boundary (seek to chapter_start + 2s). Merge with scene-cut keyframes, deduplicate within 5s window.

Target: 1 keyframe per 30–120 seconds depending on content type. Cap at 200 keyframes per video.
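
The merge, 5-second dedup, and 200-frame cap can be sketched as a small pure function (helper name illustrative):

```rust
// Sketch of the merge rule above: combine chapter-boundary seeks with
// scene-cut timestamps (seconds), then drop any frame within 5 s of the
// previously kept one, and cap the total at 200 keyframes.
fn merge_keyframe_times(scene_cuts: &[f64], chapter_starts: &[f64]) -> Vec<f64> {
    let mut times: Vec<f64> = scene_cuts
        .iter()
        .copied()
        .chain(chapter_starts.iter().map(|c| c + 2.0)) // seek chapter_start + 2s
        .collect();
    times.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mut kept: Vec<f64> = Vec::new();
    for t in times {
        if kept.last().map_or(true, |&last| t - last >= 5.0) {
            kept.push(t);
        }
    }
    kept.truncate(200); // cap at 200 keyframes per video
    kept
}

fn main() {
    // 3.0 is within 5 s of 0.0 and is dropped; chapter at 30 s becomes 32.0.
    assert_eq!(merge_keyframe_times(&[0.0, 3.0, 40.0], &[30.0]), vec![0.0, 32.0, 40.0]);
}
```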

4b. Per-keyframe VLM caption

Decision: Qwen2.5-VL-7B via Ollama (local)

Ollama already supports multimodal models. The existing Ollama integration in cite.rs targets http://localhost:11434/api/generate — the same endpoint accepts image input with "images": [base64_png].

{
  "model": "qwen2.5-vl:7b",
  "prompt": "Describe this video frame in one sentence. If it contains a slide, list the title and key bullet points.",
  "images": ["<base64_keyframe>"],
  "stream": false
}

| Model | VRAM | Speed (per frame, M2) | Quality |
|---|---|---|---|
| Qwen2.5-VL-7B (q4_K_M) | ~5 GB | ~3–5 sec | Good for slides/diagrams |
| LLaVA-1.6-7B | ~5 GB | ~3–5 sec | Slightly worse on text-heavy slides |
| Qwen2.5-VL-3B | ~2.5 GB | ~1–2 sec | Faster but misses fine text |

Qwen2.5-VL-7B is the best local VLM for slide/diagram content. 7B quantized fits comfortably alongside whisper on M2 (16 GB unified memory).

Batching: Process keyframes sequentially (VLM needs full GPU). At ~4 sec/frame × 100 frames = ~7 min. Acceptable.
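
For reference, the request body above can be assembled with std only; the real client would presumably reuse whatever JSON serialization cite.rs already has. An illustrative sketch (assumes the prompt needs no JSON escaping):

```rust
// Sketch: build the multimodal /api/generate body shown above.
// NOTE: no JSON escaping of `prompt` — a real client would serialize
// with a proper JSON library rather than string formatting.
fn vlm_request_body(model: &str, prompt: &str, b64_png: &str) -> String {
    format!(
        "{{\"model\":\"{model}\",\"prompt\":\"{prompt}\",\"images\":[\"{b64_png}\"],\"stream\":false}}"
    )
}

fn main() {
    let body = vlm_request_body("qwen2.5-vl:7b", "Describe this video frame.", "AAAA");
    assert!(body.contains("\"stream\":false"));
    println!("{body}");
}
```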

4c. OCR on slides

Decision: Apple Vision framework via swift-ffi (primary), Tesseract (fallback)

| Engine | Accuracy | Speed | Dependencies |
|---|---|---|---|
| Apple Vision (VNRecognizeTextRequest) | Excellent, especially printed text | ~0.1 s/image | macOS 13+, Swift FFI |
| PaddleOCR | Very good, multi-language | ~0.3 s/image | Python + large model |
| Tesseract | Good for English | ~0.5 s/image | brew install tesseract |

Apple Vision is the clear winner on macOS: built-in, fast, accurate, no extra deps. Access from Rust via a tiny Swift CLI helper:

// crossmem-ocr (Swift CLI, ~30 lines)
import Vision
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
// ... read image, perform request, print results as JSON

Compile as crossmem-ocr binary, call from Rust via Command::new("crossmem-ocr"). Ship as part of the crossmem install or build from source on first run.

Fallback: If the Swift helper isn’t available (Linux compat someday), fall back to tesseract --oem 1 -l eng.


5. Chunk Schema

Time-aligned chunk (parallel to CompiledChunk in cite.rs)

pub struct YouTubeChunk {
    pub start_ms: u64,
    pub end_ms: u64,
    pub speaker: Option<String>,          // None until diarization (P3)
    pub transcript: String,               // Whisper or human-sub text for this segment
    pub slide_ocr: Option<String>,        // OCR text if keyframe in this time range
    pub keyframe_path: Option<String>,    // Relative path to keyframe PNG
    pub keyframe_caption: Option<String>, // VLM description of keyframe
    pub paraphrase: String,               // LLM-generated 1-2 sentence summary
    pub implication: String,              // LLM-generated field impact
}

Chunk boundaries

Priority order for segmentation:

  1. Chapters (from info JSON) — if present, each chapter = one chunk
  2. Scene cuts — if no chapters, split at scene-cut boundaries
  3. Fixed window — fallback: 60-second segments with sentence-boundary snapping

Within a chapter, if the chapter exceeds 5 minutes, sub-split at scene cuts or 60s intervals.

Minimum chunk: 10 seconds. Maximum chunk: 5 minutes (force-split at sentence boundary).
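
The fixed-window fallback plus the minimum-length rule can be sketched as follows (helper name illustrative; sentence-boundary snapping would adjust the raw edges afterwards):

```rust
// Sketch of the fallback segmentation: split [0, duration_ms) into 60 s
// windows, then merge a trailing fragment shorter than the 10 s minimum
// into the previous chunk.
fn fixed_windows(duration_ms: u64) -> Vec<(u64, u64)> {
    const WINDOW: u64 = 60_000;
    const MIN_CHUNK: u64 = 10_000;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < duration_ms {
        let end = (start + WINDOW).min(duration_ms);
        chunks.push((start, end));
        start = end;
    }
    // Merge a too-short final fragment into its predecessor.
    if chunks.len() >= 2 {
        let (last_start, last_end) = *chunks.last().unwrap();
        if last_end - last_start < MIN_CHUNK {
            chunks.pop();
            chunks.last_mut().unwrap().1 = last_end;
        }
    }
    chunks
}

fn main() {
    // 125 s → two windows; the 5 s tail folds into the second one.
    assert_eq!(fixed_windows(125_000), vec![(0, 60_000), (60_000, 125_000)]);
}
```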

Metadata struct

pub struct YouTubeMetadata {
    pub title: String,
    pub channel: String,
    pub upload_date: String,         // YYYY-MM-DD
    pub video_id: String,
    pub duration_sec: u64,
    pub chapters: Vec<Chapter>,      // from info JSON
    pub description: String,
    pub tags: Vec<String>,
}

pub struct Chapter {
    pub title: String,
    pub start_sec: f64,
    pub end_sec: f64,
}

Cite key

{channel_slug}{year}{first_noun_of_title}

Examples:

  • 3Blue1Brown, “But what is a neural network?” (2017) → 3blue1brown2017neural
  • Andrej Karpathy, “Let’s build GPT from scratch” (2023) → karpathy2023gpt
  • Two Minute Papers, “OpenAI Sora” (2024) → twominutepapers2024sora

channel_slug = channel name lowercased, non-alphanumeric stripped, truncated to 20 chars.

Each chunk carries a provenance URL:

https://youtu.be/{VIDEO_ID}?t={floor(start_ms / 1000)}
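
Both rules are small pure functions. A sketch (names hypothetical; the "first noun of title" extraction is left out here):

```rust
// Sketch of the cite-key slug rule: lowercase, strip non-alphanumerics,
// truncate to 20 chars.
fn channel_slug(channel: &str) -> String {
    channel
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .map(|c| c.to_ascii_lowercase())
        .take(20)
        .collect()
}

// Sketch of the provenance URL: youtu.be short link with whole-second offset.
fn provenance_url(video_id: &str, start_ms: u64) -> String {
    format!("https://youtu.be/{}?t={}", video_id, start_ms / 1000)
}

fn main() {
    assert_eq!(channel_slug("3Blue1Brown"), "3blue1brown");
    assert_eq!(provenance_url("aircAruvnKk", 92_500), "https://youtu.be/aircAruvnKk?t=92");
}
```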

6. Citation Formats

APA 7th (online video)

{Channel} [{Channel}]. ({Year}, {Month} {Day}). {Title} [Video]. YouTube. https://www.youtube.com/watch?v={VIDEO_ID}

Example:

3Blue1Brown [3Blue1Brown]. (2017, October 5). But what is a neural network? [Video]. YouTube. https://www.youtube.com/watch?v=aircAruvnKk

MLA 9th

"{Title}." YouTube, uploaded by {Channel}, {Day} {Month} {Year}, www.youtube.com/watch?v={VIDEO_ID}.

Chicago 17th (note-bibliography)

{Channel}. "{Title}." {Month} {Day}, {Year}. Video, {Duration}. https://www.youtube.com/watch?v={VIDEO_ID}.

IEEE

{Channel}, "{Title}," YouTube. [Online Video]. Available: https://www.youtube.com/watch?v={VIDEO_ID}. [Accessed: {Access Date}].

BibTeX

@misc{cite_key,
  author = {{Channel}},
  title = {{Title}},
  year = {Year},
  month = {Month},
  howpublished = {\url{https://www.youtube.com/watch?v=VIDEO_ID}},
  note = {[Video]. YouTube. Accessed: YYYY-MM-DD}
}
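
Each format is a straight template fill over YouTubeMetadata fields. A sketch for the APA case (helper name illustrative; converting the YYYY-MM-DD upload date into "Month Day" is left to the caller):

```rust
// Sketch: fill the APA 7th template above from already-formatted fields.
fn apa_citation(channel: &str, year: &str, month_day: &str, title: &str, video_id: &str) -> String {
    format!(
        "{ch} [{ch}]. ({y}, {md}). {t} [Video]. YouTube. https://www.youtube.com/watch?v={id}",
        ch = channel, y = year, md = month_day, t = title, id = video_id
    )
}

fn main() {
    let c = apa_citation(
        "3Blue1Brown", "2017", "October 5",
        "But what is a neural network?", "aircAruvnKk",
    );
    println!("{c}");
}
```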

7. Wiki Markdown Output

Follows the same structure as the ArXiv wiki notes. Example:

---
cite_key: 3blue1brown2017neural
title: "But what is a neural network?"
channel: "3Blue1Brown"
upload_date: "2017-10-05"
video_id: "aircAruvnKk"
duration_sec: 1140
captured_at: "1776300000"
raw: "~/crossmem/raw/youtube/aircAruvnKk.wav"
chunks: 12
source_type: youtube
---

# But what is a neural network?

## Citations

### APA
...

## Chunks

### 00:00–01:32 — Chapter: Introduction

> [Transcript text, first 400 chars...]

**Slide OCR:** [if keyframe present]

**Keyframe:** `keyframes/aircAruvnKk_0042.png` — "A diagram showing..."

**Paraphrase:** ...

**Implication:** ...

**Source:** [00:00](https://youtu.be/aircAruvnKk?t=0)
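
The chunk headings and source links above need one small formatter: milliseconds to an MM:SS label (or H:MM:SS past one hour). A sketch:

```rust
// Sketch: time label for chunk headings like "00:00–01:32".
fn ts_label(ms: u64) -> String {
    let total = ms / 1000;
    let (h, m, s) = (total / 3600, (total % 3600) / 60, total % 60);
    if h > 0 {
        format!("{}:{:02}:{:02}", h, m, s)
    } else {
        format!("{:02}:{:02}", m, s)
    }
}

fn main() {
    assert_eq!(ts_label(92_000), "01:32");
    assert_eq!(ts_label(3_723_000), "1:02:03");
}
```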

8. Orchestration

Decision: Same binary, new module youtube.rs

The existing crossmem capture <url> dispatches on URL. Add host detection:

// main.rs capture dispatch
if url.contains("arxiv.org") {
    cite::cmd_capture(url).await
} else if url.contains("youtube.com") || url.contains("youtu.be") {
    youtube::cmd_capture(url).await
} else {
    // future: generic handler
}

Module structure

src/
  cite.rs          # existing arxiv pipeline (unchanged)
  youtube.rs       # new: YouTube capture + compile
  youtube/
    download.rs    # yt-dlp wrapper
    transcribe.rs  # whisper.cpp wrapper
    keyframe.rs    # ffmpeg scene-cut + chapter extraction
    ocr.rs         # Apple Vision / tesseract wrapper
    vlm.rs         # Ollama multimodal (Qwen2.5-VL) wrapper
    chunk.rs       # Segmentation + chunk assembly
    emit.rs        # Wiki markdown emission
  shared/
    ollama.rs      # Extract from cite.rs — shared Ollama client
    formats.rs     # Citation format builders (generalized)

Shared Ollama code: Factor compile_page_chunk and the HTTP client into shared/ollama.rs. Both cite.rs and youtube.rs call it. The prompt template differs (page text vs transcript chunk), but the HTTP plumbing is identical.

Two-stage flow (same as arxiv)

crossmem capture <youtube-url>
  → downloads audio + video + subs + info JSON
  → extracts metadata, generates cite_key
  → saves to ~/crossmem/raw/youtube/{video_id}/
  → prints cite_key for next step

crossmem compile <cite_key>
  → detects source_type (arxiv vs youtube) from meta JSON
  → runs transcription (whisper.cpp)
  → runs keyframe extraction (ffmpeg)
  → runs OCR + VLM caption per keyframe
  → runs Ollama compile per chunk (paraphrase + implication)
  → emits wiki markdown to ~/crossmem/wiki/

9. Dependency Install UX

Decision: Error with one-liner install instructions on first run

Auto-installing is tempting but violates the principle of least surprise. Instead:

$ crossmem capture https://youtube.com/watch?v=abc123

ERROR: missing required dependencies for YouTube ingestion:
  ✗ yt-dlp        — brew install yt-dlp
  ✗ ffmpeg        — brew install ffmpeg
  ✓ whisper.cpp   — found at /opt/homebrew/bin/whisper-cpp

Install all missing:
  brew install yt-dlp ffmpeg

Then retry: crossmem capture https://youtube.com/watch?v=abc123

Check order: which yt-dlp && which ffmpeg && which whisper-cpp (or whisper depending on install method).
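
The probe itself is a PATH scan, i.e. what `which` does. A sketch of a hypothetical find_in_path helper behind the error message above:

```rust
use std::{env, fs, path::PathBuf};

// Sketch: resolve a binary name against $PATH so the dependency check
// can print a ✗/✓ line per tool.
fn find_in_path(bin: &str) -> Option<PathBuf> {
    env::var_os("PATH").and_then(|paths| {
        env::split_paths(&paths)
            .map(|dir| dir.join(bin))
            .find(|p| fs::metadata(p).is_ok())
    })
}

fn main() {
    for tool in ["yt-dlp", "ffmpeg", "whisper-cpp"] {
        match find_in_path(tool) {
            Some(p) => println!("  ✓ {tool:12} — found at {}", p.display()),
            None => println!("  ✗ {tool:12} — brew install {tool}"),
        }
    }
}
```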

whisper.cpp model download: If binary exists but model is missing:

Model not found. Download large-v3-turbo (~1.6 GB):
  curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
    https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

crossmem-ocr Swift helper: Build from source on first YouTube capture:

$ swift build -c release --package-path ./tools/crossmem-ocr

Or provide pre-built binary in releases.


10. Cost Model (All Local)

Estimated wall-clock for M2 Mac mini (16 GB)

| Stage | 1h video | 30 min video | 3h video |
|---|---|---|---|
| yt-dlp download (audio + video) | ~2 min | ~1 min | ~5 min |
| whisper.cpp transcription | ~6 min | ~3 min | ~18 min |
| ffmpeg keyframe extraction | ~1 min | ~30 sec | ~3 min |
| OCR per keyframe (~80 frames) | ~8 sec | ~4 sec | ~20 sec |
| VLM caption per keyframe | ~5 min | ~2.5 min | ~15 min |
| Ollama compile per chunk (~40 chunks) | ~8 min | ~4 min | ~24 min |
| **Total** | **~22 min** | **~11 min** | **~65 min** |

Bottlenecks

  1. Ollama compile — sequential LLM calls, ~12 sec/chunk. Could batch with a larger context window.
  2. VLM caption — sequential, ~4 sec/frame. GPU contention with Ollama if run concurrently.
  3. Whisper — fast on Metal, but locks GPU for duration.

Memory pressure

| Concurrent | Peak VRAM | Safe on 16 GB? |
|---|---|---|
| Whisper alone | ~1.6 GB | Yes |
| Ollama (7B q4) alone | ~5 GB | Yes |
| Whisper + Ollama | ~6.6 GB | Yes |
| Qwen2.5-VL-7B + Ollama text | ~10 GB | Tight but OK |
| All three simultaneous | ~12 GB | Risky — run sequentially |

Strategy: Run stages sequentially. whisper → keyframes → OCR → VLM → compile. No concurrent GPU workloads.


11. Storage Layout

~/crossmem/
  raw/
    youtube/
      {video_id}/
        {video_id}.wav              # Audio (whisper input)
        {video_id}_video.mp4        # Video (keyframe source)
        {video_id}.info.json        # yt-dlp metadata
        {video_id}.en.vtt           # Human subs (if available)
        {video_id}.en.auto.vtt      # Auto subs (if available)
        {video_id}.meta.json        # crossmem metadata
        transcript.json             # Whisper output with timestamps
        keyframes/
          frame_0001.png            # Scene-cut keyframes
          frame_0002.png
          keyframe_times.json       # Timestamp → frame mapping
          ocr/
            frame_0001.txt          # OCR output per frame
          captions/
            frame_0001.txt          # VLM caption per frame
  wiki/
    {timestamp}_{cite_key}.md       # Final compiled wiki note

12. Phased Delivery

P1 — Download + Transcribe (MVP)

  • URL detection in main.rs capture dispatch
  • yt-dlp download wrapper (youtube/download.rs)
  • whisper.cpp transcription wrapper (youtube/transcribe.rs)
  • Basic chunk segmentation (chapters or 60s windows)
  • Ollama compile pass (reuse from cite.rs)
  • Wiki markdown emission (transcript-only, no visual)
  • Dependency check + error messages
  • Tests for metadata parsing, cite_key generation, chunk segmentation

P2 — Keyframes + OCR

  • ffmpeg scene-cut extraction (youtube/keyframe.rs)
  • Chapter-aware keyframe selection
  • Apple Vision OCR helper (tools/crossmem-ocr/)
  • Tesseract fallback
  • OCR text merged into chunks
  • Tests for keyframe timing, OCR integration

P3 — VLM Captions + Diarization

  • Ollama multimodal integration for keyframe captioning (youtube/vlm.rs)
  • Keyframe captions merged into chunks
  • Optional: pyannote speaker diarization
  • Tests for VLM response parsing

P4 — Polish + Chunk Emission

  • Human sub → whisper alignment
  • Playlist support (batch capture)
  • crossmem compile --source youtube flag
  • Storage cleanup (delete intermediate files after compile)
  • Integration tests with real short video
  • Performance benchmarks on M2/M4

13. Open Questions

  1. Subtitle language detection: Should we auto-detect the video language and pass --language to whisper, or always use en? For P1, assume English.

  2. Video retention: Keep the video file after keyframe extraction, or delete to save disk? A 1h 1080p video is ~1–2 GB. Suggest: keep for 7 days, then auto-prune.

  3. Ollama model for compile pass: Reuse llama3.2:3b (same as arxiv), or use a different model better suited for spoken-word paraphrasing? Suggest: same model, same env var.

  4. Playlist semantics: One wiki note per video, or one per playlist? Suggest: one per video, with a playlist index note linking them.

  5. Live stream handling: yt-dlp can download from start, but duration is unknown until stream ends. Suggest: P1 skips live, add in P2.