YouTube Ingestion Pipeline — Design Document

Status: Draft · Author: crossmem team · Date: 2026-04-15 · Tracking: crossmem-rs#27


1. Overview

Extend crossmem capture <url> to detect youtube.com / youtu.be hosts and dispatch to a YouTube-specific pipeline that produces time-aligned wiki chunks — the video analog of the PDF chunk pipeline from #24.

The pipeline runs entirely locally on an Apple Silicon Mac mini (M2/M4). No cloud APIs.

Pipeline stages

capture (download + extract audio/subs)
  → transcribe (whisper.cpp Metal)
  → keyframes (ffmpeg scene-cut)
  → OCR + VLM caption (per keyframe)
  → compile (Ollama paraphrase/implication per chunk)
  → emit wiki markdown

2. Download Path

Decision: yt-dlp binary

| Option | Pros | Cons |
|---|---|---|
| yt-dlp binary | Battle-tested, handles every edge case, active community, --cookies-from-browser for member-only | External dep, Python-based, updates frequently |
| libyt-dlp bindings | Tighter integration | No stable C API; Python FFI is fragile |
| youtube-rs (pure Rust) | No external dep | Incomplete, breaks on YT changes, no auth, no live/shorts |

yt-dlp wins because YouTube aggressively rotates extraction logic. Maintaining a pure-Rust extractor is a full-time job. yt-dlp is the industry standard for a reason.

Edge cases handled by yt-dlp flags

| Scenario | yt-dlp flags |
|---|---|
| Age-gated | --cookies-from-browser chrome (reads real Chrome cookies) |
| Member-only | Same cookie approach; user must be logged in |
| Live streams | --live-from-start --wait-for-video 30 (wait + download from start) |
| Shorts | Works as normal URLs (youtube.com/shorts/ID → standard extraction) |
| Playlists | --yes-playlist or --no-playlist (user flag; default: single video) |
| Chapters | --embed-chapters + --write-info-json (chapter list in info JSON) |
| Auto captions | --write-auto-subs --sub-lang en |
| Human captions | --write-subs --sub-lang en (preferred over auto when available) |
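
Downstream stages key everything on the bare video ID (file naming via %(id)s, provenance links), even though yt-dlp itself accepts the raw URL. A sketch of a hypothetical extract_video_id helper covering the watch, youtu.be, and Shorts URL shapes above:

```rust
// Hypothetical helper (not yet in crossmem): pull the 11-char-style video ID
// out of the URL forms the table covers. Playlist URLs pass through yt-dlp as-is.
fn extract_video_id(url: &str) -> Option<String> {
    // IDs are alphanumeric plus '-' and '_'; stop at '&', '?', '/', etc.
    let take_id = |s: &str| {
        let id: String = s
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric() || *c == '-' || *c == '_')
            .collect();
        if id.is_empty() { None } else { Some(id) }
    };
    if let Some(rest) = url.split("watch?v=").nth(1) {
        take_id(rest)
    } else if let Some(rest) = url.split("youtu.be/").nth(1) {
        take_id(rest)
    } else if let Some(rest) = url.split("/shorts/").nth(1) {
        take_id(rest)
    } else {
        None
    }
}

fn main() {
    assert_eq!(
        extract_video_id("https://www.youtube.com/watch?v=aircAruvnKk&t=10").as_deref(),
        Some("aircAruvnKk")
    );
}
```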

Download command template

yt-dlp \
  --format "bestaudio[ext=m4a]/bestaudio/best" \
  --extract-audio --audio-format wav --audio-quality 0 \
  --write-info-json \
  --write-subs --write-auto-subs --sub-lang "en.*" --sub-format vtt \
  --embed-chapters \
  --cookies-from-browser chrome \
  --output "%(id)s.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"

For keyframe extraction we also need the video file:

yt-dlp \
  --format "bestvideo[height<=1080][ext=mp4]/bestvideo[height<=1080]/best" \
  --write-info-json \
  --output "%(id)s_video.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"
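
From Rust, both templates reduce to a std::process::Command builder, the same pattern cite.rs uses for pdftotext. A sketch for the audio invocation (function name and raw_dir parameter are illustrative):

```rust
use std::process::Command;

// Sketch: build the audio-download invocation from the template above.
// The caller spawns it and checks the exit status; nothing runs here.
fn audio_download_cmd(url: &str, raw_dir: &str) -> Command {
    let mut cmd = Command::new("yt-dlp");
    cmd.args([
        "--format", "bestaudio[ext=m4a]/bestaudio/best",
        "--extract-audio", "--audio-format", "wav", "--audio-quality", "0",
        "--write-info-json",
        "--write-subs", "--write-auto-subs", "--sub-lang", "en.*", "--sub-format", "vtt",
        "--embed-chapters",
        "--cookies-from-browser", "chrome",
        "--output", "%(id)s.%(ext)s",
        "--paths", raw_dir,
        url,
    ]);
    cmd
}

fn main() {
    let cmd = audio_download_cmd("https://youtu.be/abc", "/tmp/raw");
    // cmd.status() would spawn the real binary; here we just inspect it.
    println!("{:?}", cmd);
}
```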

3. Audio Extraction → Transcription

Decision: whisper.cpp with Metal acceleration, large-v3-turbo model

| Engine | Backend | Speed (1h audio, M2) | Accuracy | Notes |
|---|---|---|---|---|
| whisper.cpp | Metal (Apple GPU) | ~6–8 min | WER ~8% (large-v3-turbo) | C/C++, no Python, --print-timestamps for word-level |
| whisper-mlx | MLX (Apple GPU) | ~5–7 min | Same models | Python dep, MLX framework, slightly faster on M4 |
| WhisperKit | CoreML | ~5–6 min | Good | Swift-only, harder to call from Rust |
| insanely-fast-whisper | MPS (PyTorch) | ~10–15 min | Same models | Heavy Python stack, MPS less optimized than Metal |
| faster-whisper | CTranslate2 (CPU) | ~15–25 min | Same models | No Metal/MPS; CPU-only on macOS |

whisper.cpp wins because:

  1. Native Metal acceleration — no Python runtime
  2. Easily called from Rust via std::process::Command (same pattern as pdftotext in cite.rs)
  3. Outputs VTT/SRT/JSON with word-level timestamps
  4. Active project, models available via Hugging Face in ggml format

Model choice: large-v3-turbo

| Model | Params | VRAM | Disk | Speed (M2, 1h) | WER (en) |
|---|---|---|---|---|---|
| large-v3 | 1.55B | ~3 GB | 3.1 GB | ~12 min | ~7.5% |
| large-v3-turbo | 809M | ~1.6 GB | 1.6 GB | ~6 min | ~8% |
| distil-large-v3 | 756M | ~1.5 GB | 1.5 GB | ~5 min | ~9% |

large-v3-turbo is the sweet spot: half the VRAM of large-v3, nearly the same WER, 2× faster. distil-large-v3 is marginally faster but has slightly worse accuracy on non-native English speakers (common in academic talks).

Transcription command

whisper-cpp \
  --model models/ggml-large-v3-turbo.bin \
  --file "$HOME/crossmem/raw/youtube/${VIDEO_ID}.wav" \
  --output-vtt \
  --output-json \
  --print-timestamps \
  --language en \
  --threads 4

Caption priority

  1. Human-uploaded subtitles (.en.vtt from yt-dlp) — highest quality, use as-is
  2. whisper.cpp transcription — always run for timestamp alignment even if subs exist
  3. Auto-generated YouTube captions — fallback only; lower quality than whisper

When human subs exist, align them with whisper timestamps for precise time-coding.
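
The first step of that alignment is getting both sources into one time base. A sketch of a hypothetical VTT cue-timestamp parser (milliseconds out):

```rust
// Sketch of the alignment primitive: parse a VTT cue timestamp
// ("HH:MM:SS.mmm" or "MM:SS.mmm") into milliseconds so human-sub cues
// can be matched against whisper segments by nearest start time.
fn vtt_ts_to_ms(ts: &str) -> Option<u64> {
    let (clock, millis) = ts.split_once('.')?;
    let ms: u64 = millis.parse().ok()?;
    let parts = clock
        .split(':')
        .map(|p| p.parse::<u64>().ok())
        .collect::<Option<Vec<u64>>>()?;
    let secs = match parts[..] {
        [h, m, s] => h * 3600 + m * 60 + s,
        [m, s] => m * 60 + s,
        _ => return None,
    };
    Some(secs * 1000 + ms)
}

fn main() {
    assert_eq!(vtt_ts_to_ms("00:01:32.500"), Some(92_500));
}
```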

Speaker diarization

Decision: Skip for P1, add in P3 if needed.

Rationale:

  • Most YouTube content crossmem targets is solo presenter (lectures, conference talks, tutorials)
  • pyannote requires Python + HF token + ~2 GB model; adds significant complexity
  • sherpa-onnx is lighter but diarization accuracy on overlapping speech is still mediocre
  • Can retrofit later: diarization produces (speaker_id, start, end) segments that merge with existing transcript chunks

If multi-speaker content becomes common, P3 can add pyannote 3.1 with speaker embedding.


4. Visual Understanding

4a. Keyframe extraction

Decision: ffmpeg scene-cut detection

ffmpeg -i "${VIDEO_ID}_video.mp4" \
  -vf "select='gt(scene,0.3)',showinfo" \
  -vsync vfr \
  -frame_pts 1 \
  "${OUTPUT_DIR}/keyframe_%04d.png" \
  2>&1 | grep "pts_time" > "${OUTPUT_DIR}/keyframe_times.txt"

| Method | Pros | Cons |
|---|---|---|
| ffmpeg scene filter | Zero extra deps, timestamp-aware, tunable threshold | May over/under-extract |
| TransNetV2 | ML-based, higher accuracy | Python + PyTorch dep, overkill for slides |
| PySceneDetect | Good API | Python dep |

ffmpeg is already a required dependency (for audio extraction). Scene threshold 0.3 works well for slide-based content; can tune per-video.

Chapter-aware extraction: If the info JSON contains chapters, also extract one keyframe per chapter boundary (seek to chapter_start + 2s). Merge with scene-cut keyframes, deduplicate within 5s window.

Target: 1 keyframe per 30–120 seconds depending on content type. Cap at 200 keyframes per video.
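
The merge, 5-second dedup, and 200-frame cap can be sketched as a small pure function (helper name illustrative):

```rust
// Sketch of the merge rule above: combine chapter-boundary seeks with
// scene-cut timestamps (seconds), then drop any frame within 5 s of the
// previously kept one, and cap the total at 200 keyframes.
fn merge_keyframe_times(scene_cuts: &[f64], chapter_starts: &[f64]) -> Vec<f64> {
    let mut times: Vec<f64> = scene_cuts
        .iter()
        .copied()
        .chain(chapter_starts.iter().map(|c| c + 2.0)) // seek chapter_start + 2s
        .collect();
    times.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mut kept: Vec<f64> = Vec::new();
    for t in times {
        if kept.last().map_or(true, |&last| t - last >= 5.0) {
            kept.push(t);
        }
    }
    kept.truncate(200); // cap at 200 keyframes per video
    kept
}

fn main() {
    // 3.0 is within 5 s of 0.0 and is dropped; chapter at 30 s becomes 32.0.
    assert_eq!(merge_keyframe_times(&[0.0, 3.0, 40.0], &[30.0]), vec![0.0, 32.0, 40.0]);
}
```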

4b. Per-keyframe VLM caption

Decision: Qwen2.5-VL-7B via Ollama (local)

Ollama already supports multimodal models. The existing Ollama integration in cite.rs targets http://localhost:11434/api/generate — the same endpoint accepts image input with "images": [base64_png].

{
  "model": "qwen2.5-vl:7b",
  "prompt": "Describe this video frame in one sentence. If it contains a slide, list the title and key bullet points.",
  "images": ["<base64_keyframe>"],
  "stream": false
}

| Model | VRAM | Speed (per frame, M2) | Quality |
|---|---|---|---|
| Qwen2.5-VL-7B (q4_K_M) | ~5 GB | ~3–5 sec | Good for slides/diagrams |
| LLaVA-1.6-7B | ~5 GB | ~3–5 sec | Slightly worse on text-heavy slides |
| Qwen2.5-VL-3B | ~2.5 GB | ~1–2 sec | Faster but misses fine text |

Qwen2.5-VL-7B is the best local VLM for slide/diagram content. 7B quantized fits comfortably alongside whisper on M2 (16 GB unified memory).

Batching: Process keyframes sequentially (VLM needs full GPU). At ~4 sec/frame × 100 frames = ~7 min. Acceptable.
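
For reference, the request body above can be assembled with std only; the real client would presumably reuse whatever JSON serialization cite.rs already has. An illustrative sketch (assumes the prompt needs no JSON escaping):

```rust
// Sketch: build the multimodal /api/generate body shown above.
// NOTE: no JSON escaping of `prompt` — a real client would serialize
// with a proper JSON library rather than string formatting.
fn vlm_request_body(model: &str, prompt: &str, b64_png: &str) -> String {
    format!(
        "{{\"model\":\"{model}\",\"prompt\":\"{prompt}\",\"images\":[\"{b64_png}\"],\"stream\":false}}"
    )
}

fn main() {
    let body = vlm_request_body("qwen2.5-vl:7b", "Describe this video frame.", "AAAA");
    assert!(body.contains("\"stream\":false"));
    println!("{body}");
}
```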

4c. OCR on slides

Decision: Apple Vision framework via swift-ffi (primary), Tesseract (fallback)

| Engine | Accuracy | Speed | Dependencies |
|---|---|---|---|
| Apple Vision (VNRecognizeTextRequest) | Excellent, especially printed text | ~0.1 s/image | macOS 13+, Swift FFI |
| PaddleOCR | Very good, multi-language | ~0.3 s/image | Python + large model |
| Tesseract | Good for English | ~0.5 s/image | brew install tesseract |

Apple Vision is the clear winner on macOS: built-in, fast, accurate, no extra deps. Access from Rust via a tiny Swift CLI helper:

// crossmem-ocr (Swift CLI, ~30 lines)
import Vision
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
// ... read image, perform request, print results as JSON

Compile as crossmem-ocr binary, call from Rust via Command::new("crossmem-ocr"). Ship as part of the crossmem install or build from source on first run.

Fallback: If the Swift helper isn’t available (Linux compat someday), fall back to tesseract --oem 1 -l eng.


5. Chunk Schema

Time-aligned chunk (parallel to CompiledChunk in cite.rs)

pub struct YouTubeChunk {
    pub start_ms: u64,
    pub end_ms: u64,
    pub speaker: Option<String>,          // None until diarization (P3)
    pub transcript: String,               // Whisper or human-sub text for this segment
    pub slide_ocr: Option<String>,        // OCR text if keyframe in this time range
    pub keyframe_path: Option<String>,    // Relative path to keyframe PNG
    pub keyframe_caption: Option<String>, // VLM description of keyframe
    pub paraphrase: String,               // LLM-generated 1-2 sentence summary
    pub implication: String,              // LLM-generated field impact
}

Chunk boundaries

Priority order for segmentation:

  1. Chapters (from info JSON) — if present, each chapter = one chunk
  2. Scene cuts — if no chapters, split at scene-cut boundaries
  3. Fixed window — fallback: 60-second segments with sentence-boundary snapping

Within a chapter, if the chapter exceeds 5 minutes, sub-split at scene cuts or 60s intervals.

Minimum chunk: 10 seconds. Maximum chunk: 5 minutes (force-split at sentence boundary).
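
The fixed-window fallback plus the minimum-length rule can be sketched as follows (helper name illustrative; sentence-boundary snapping would adjust the raw edges afterwards):

```rust
// Sketch of the fallback segmentation: split [0, duration_ms) into 60 s
// windows, then merge a trailing fragment shorter than the 10 s minimum
// into the previous chunk.
fn fixed_windows(duration_ms: u64) -> Vec<(u64, u64)> {
    const WINDOW: u64 = 60_000;
    const MIN_CHUNK: u64 = 10_000;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < duration_ms {
        let end = (start + WINDOW).min(duration_ms);
        chunks.push((start, end));
        start = end;
    }
    // Merge a too-short final fragment into its predecessor.
    if chunks.len() >= 2 {
        let (last_start, last_end) = *chunks.last().unwrap();
        if last_end - last_start < MIN_CHUNK {
            chunks.pop();
            chunks.last_mut().unwrap().1 = last_end;
        }
    }
    chunks
}

fn main() {
    // 125 s → two windows; the 5 s tail folds into the second one.
    assert_eq!(fixed_windows(125_000), vec![(0, 60_000), (60_000, 125_000)]);
}
```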

Metadata struct

pub struct YouTubeMetadata {
    pub title: String,
    pub channel: String,
    pub upload_date: String,         // YYYY-MM-DD
    pub video_id: String,
    pub duration_sec: u64,
    pub chapters: Vec<Chapter>,      // from info JSON
    pub description: String,
    pub tags: Vec<String>,
}

pub struct Chapter {
    pub title: String,
    pub start_sec: f64,
    pub end_sec: f64,
}

Cite key

{channel_slug}{year}{first_noun_of_title}

Examples:

  • 3Blue1Brown, “But what is a neural network?” (2017) → 3blue1brown2017neural
  • Andrej Karpathy, “Let’s build GPT from scratch” (2023) → karpathy2023gpt
  • Two Minute Papers, “OpenAI Sora” (2024) → twominutepapers2024sora

channel_slug = channel name lowercased, non-alphanumeric stripped, truncated to 20 chars.

Each chunk carries a provenance URL:

https://youtu.be/{VIDEO_ID}?t={floor(start_ms / 1000)}
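
Both rules are small pure functions. A sketch (names hypothetical; the "first noun of title" extraction is left out here):

```rust
// Sketch of the cite-key slug rule: lowercase, strip non-alphanumerics,
// truncate to 20 chars.
fn channel_slug(channel: &str) -> String {
    channel
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .map(|c| c.to_ascii_lowercase())
        .take(20)
        .collect()
}

// Sketch of the provenance URL: youtu.be short link with whole-second offset.
fn provenance_url(video_id: &str, start_ms: u64) -> String {
    format!("https://youtu.be/{}?t={}", video_id, start_ms / 1000)
}

fn main() {
    assert_eq!(channel_slug("3Blue1Brown"), "3blue1brown");
    assert_eq!(provenance_url("aircAruvnKk", 92_500), "https://youtu.be/aircAruvnKk?t=92");
}
```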

6. Citation Formats

APA 7th (online video)

{Channel} [{Channel}]. ({Year}, {Month} {Day}). {Title} [Video]. YouTube. https://www.youtube.com/watch?v={VIDEO_ID}

Example:

3Blue1Brown [3Blue1Brown]. (2017, October 5). But what is a neural network? [Video]. YouTube. https://www.youtube.com/watch?v=aircAruvnKk

MLA 9th

"{Title}." YouTube, uploaded by {Channel}, {Day} {Month} {Year}, www.youtube.com/watch?v={VIDEO_ID}.

Chicago 17th (note-bibliography)

{Channel}. "{Title}." {Month} {Day}, {Year}. Video, {Duration}. https://www.youtube.com/watch?v={VIDEO_ID}.

IEEE

{Channel}, "{Title}," YouTube. [Online Video]. Available: https://www.youtube.com/watch?v={VIDEO_ID}. [Accessed: {Access Date}].

BibTeX

@misc{cite_key,
  author = {{Channel}},
  title = {{Title}},
  year = {Year},
  month = {Month},
  howpublished = {\url{https://www.youtube.com/watch?v=VIDEO_ID}},
  note = {[Video]. YouTube. Accessed: YYYY-MM-DD}
}
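
Each format is a straight template fill over YouTubeMetadata fields. A sketch for the APA case (helper name illustrative; converting the YYYY-MM-DD upload date into "Month Day" is left to the caller):

```rust
// Sketch: fill the APA 7th template above from already-formatted fields.
fn apa_citation(channel: &str, year: &str, month_day: &str, title: &str, video_id: &str) -> String {
    format!(
        "{ch} [{ch}]. ({y}, {md}). {t} [Video]. YouTube. https://www.youtube.com/watch?v={id}",
        ch = channel, y = year, md = month_day, t = title, id = video_id
    )
}

fn main() {
    let c = apa_citation(
        "3Blue1Brown", "2017", "October 5",
        "But what is a neural network?", "aircAruvnKk",
    );
    println!("{c}");
}
```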

7. Wiki Markdown Output

Follows the same structure as the ArXiv wiki notes. Example:

---
cite_key: 3blue1brown2017neural
title: "But what is a neural network?"
channel: "3Blue1Brown"
upload_date: "2017-10-05"
video_id: "aircAruvnKk"
duration_sec: 1140
captured_at: "1776300000"
raw: "~/crossmem/raw/youtube/aircAruvnKk.wav"
chunks: 12
source_type: youtube
---

# But what is a neural network?

## Citations

### APA
...

## Chunks

### 00:00–01:32 — Chapter: Introduction

> [Transcript text, first 400 chars...]

**Slide OCR:** [if keyframe present]

**Keyframe:** `keyframes/aircAruvnKk_0042.png` — "A diagram showing..."

**Paraphrase:** ...

**Implication:** ...

**Source:** [00:00](https://youtu.be/aircAruvnKk?t=0)
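
The chunk headings and source links above need one small formatter: milliseconds to an MM:SS label (or H:MM:SS past one hour). A sketch:

```rust
// Sketch: time label for chunk headings like "00:00–01:32".
fn ts_label(ms: u64) -> String {
    let total = ms / 1000;
    let (h, m, s) = (total / 3600, (total % 3600) / 60, total % 60);
    if h > 0 {
        format!("{}:{:02}:{:02}", h, m, s)
    } else {
        format!("{:02}:{:02}", m, s)
    }
}

fn main() {
    assert_eq!(ts_label(92_000), "01:32");
    assert_eq!(ts_label(3_723_000), "1:02:03");
}
```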

8. Orchestration

Decision: Same binary, new module youtube.rs

The existing crossmem capture <url> dispatches on URL. Add host detection:

// main.rs capture dispatch
if url.contains("arxiv.org") {
    cite::cmd_capture(url).await
} else if url.contains("youtube.com") || url.contains("youtu.be") {
    youtube::cmd_capture(url).await
} else {
    // future: generic handler
}

Module structure

src/
  cite.rs          # existing arxiv pipeline (unchanged)
  youtube.rs       # new: YouTube capture + compile
  youtube/
    download.rs    # yt-dlp wrapper
    transcribe.rs  # whisper.cpp wrapper
    keyframe.rs    # ffmpeg scene-cut + chapter extraction
    ocr.rs         # Apple Vision / tesseract wrapper
    vlm.rs         # Ollama multimodal (Qwen2.5-VL) wrapper
    chunk.rs       # Segmentation + chunk assembly
    emit.rs        # Wiki markdown emission
  shared/
    ollama.rs      # Extract from cite.rs — shared Ollama client
    formats.rs     # Citation format builders (generalized)

Shared Ollama code: Factor compile_page_chunk and the HTTP client into shared/ollama.rs. Both cite.rs and youtube.rs call it. The prompt template differs (page text vs transcript chunk), but the HTTP plumbing is identical.

Two-stage flow (same as arxiv)

crossmem capture <youtube-url>
  → downloads audio + video + subs + info JSON
  → extracts metadata, generates cite_key
  → saves to ~/crossmem/raw/youtube/{video_id}/
  → prints cite_key for next step

crossmem compile <cite_key>
  → detects source_type (arxiv vs youtube) from meta JSON
  → runs transcription (whisper.cpp)
  → runs keyframe extraction (ffmpeg)
  → runs OCR + VLM caption per keyframe
  → runs Ollama compile per chunk (paraphrase + implication)
  → emits wiki markdown to ~/crossmem/wiki/

9. Dependency Install UX

Decision: Error with one-liner install instructions on first run

Auto-installing is tempting but violates the principle of least surprise. Instead:

$ crossmem capture https://youtube.com/watch?v=abc123

ERROR: missing required dependencies for YouTube ingestion:
  ✗ yt-dlp        — brew install yt-dlp
  ✗ ffmpeg        — brew install ffmpeg
  ✓ whisper.cpp   — found at /opt/homebrew/bin/whisper-cpp

Install all missing:
  brew install yt-dlp ffmpeg

Then retry: crossmem capture https://youtube.com/watch?v=abc123

Check order: which yt-dlp && which ffmpeg && which whisper-cpp (or whisper depending on install method).
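
The probe itself is a PATH scan, i.e. what `which` does. A sketch of a hypothetical find_in_path helper behind the error message above:

```rust
use std::{env, fs, path::PathBuf};

// Sketch: resolve a binary name against $PATH so the dependency check
// can print a ✗/✓ line per tool.
fn find_in_path(bin: &str) -> Option<PathBuf> {
    env::var_os("PATH").and_then(|paths| {
        env::split_paths(&paths)
            .map(|dir| dir.join(bin))
            .find(|p| fs::metadata(p).is_ok())
    })
}

fn main() {
    for tool in ["yt-dlp", "ffmpeg", "whisper-cpp"] {
        match find_in_path(tool) {
            Some(p) => println!("  ✓ {tool:12} — found at {}", p.display()),
            None => println!("  ✗ {tool:12} — brew install {tool}"),
        }
    }
}
```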

whisper.cpp model download: If binary exists but model is missing:

Model not found. Download large-v3-turbo (~1.6 GB):
  curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
    https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

crossmem-ocr Swift helper: Build from source on first YouTube capture:

$ swift build -c release --package-path ./tools/crossmem-ocr

Or provide pre-built binary in releases.


10. Cost Model (All Local)

Estimated wall-clock for M2 Mac mini (16 GB)

| Stage | 1h video | 30 min video | 3h video |
|---|---|---|---|
| yt-dlp download (audio + video) | ~2 min | ~1 min | ~5 min |
| whisper.cpp transcription | ~6 min | ~3 min | ~18 min |
| ffmpeg keyframe extraction | ~1 min | ~30 sec | ~3 min |
| OCR per keyframe (~80 frames) | ~8 sec | ~4 sec | ~20 sec |
| VLM caption per keyframe | ~5 min | ~2.5 min | ~15 min |
| Ollama compile per chunk (~40 chunks) | ~8 min | ~4 min | ~24 min |
| **Total** | **~22 min** | **~11 min** | **~65 min** |

Bottlenecks

  1. Ollama compile — sequential LLM calls, ~12 sec/chunk. Could batch with a larger context window.
  2. VLM caption — sequential, ~4 sec/frame. GPU contention with Ollama if run concurrently.
  3. Whisper — fast on Metal, but locks GPU for duration.

Memory pressure

| Concurrent | Peak VRAM | Safe on 16 GB? |
|---|---|---|
| Whisper alone | ~1.6 GB | Yes |
| Ollama (7B q4) alone | ~5 GB | Yes |
| Whisper + Ollama | ~6.6 GB | Yes |
| Qwen2.5-VL-7B + Ollama text | ~10 GB | Tight but OK |
| All three simultaneous | ~12 GB | Risky — run sequentially |

Strategy: Run stages sequentially. whisper → keyframes → OCR → VLM → compile. No concurrent GPU workloads.


11. Storage Layout

~/crossmem/
  raw/
    youtube/
      {video_id}/
        {video_id}.wav              # Audio (whisper input)
        {video_id}_video.mp4        # Video (keyframe source)
        {video_id}.info.json        # yt-dlp metadata
        {video_id}.en.vtt           # Human subs (if available)
        {video_id}.en.auto.vtt      # Auto subs (if available)
        {video_id}.meta.json        # crossmem metadata
        transcript.json             # Whisper output with timestamps
        keyframes/
          frame_0001.png            # Scene-cut keyframes
          frame_0002.png
          keyframe_times.json       # Timestamp → frame mapping
          ocr/
            frame_0001.txt          # OCR output per frame
          captions/
            frame_0001.txt          # VLM caption per frame
  wiki/
    {timestamp}_{cite_key}.md       # Final compiled wiki note

12. Phased Delivery

P1 — Download + Transcribe (MVP)

  • URL detection in main.rs capture dispatch
  • yt-dlp download wrapper (youtube/download.rs)
  • whisper.cpp transcription wrapper (youtube/transcribe.rs)
  • Basic chunk segmentation (chapters or 60s windows)
  • Ollama compile pass (reuse from cite.rs)
  • Wiki markdown emission (transcript-only, no visual)
  • Dependency check + error messages
  • Tests for metadata parsing, cite_key generation, chunk segmentation

P2 — Keyframes + OCR

  • ffmpeg scene-cut extraction (youtube/keyframe.rs)
  • Chapter-aware keyframe selection
  • Apple Vision OCR helper (tools/crossmem-ocr/)
  • Tesseract fallback
  • OCR text merged into chunks
  • Tests for keyframe timing, OCR integration

P3 — VLM Captions + Diarization

  • Ollama multimodal integration for keyframe captioning (youtube/vlm.rs)
  • Keyframe captions merged into chunks
  • Optional: pyannote speaker diarization
  • Tests for VLM response parsing

P4 — Polish + Chunk Emission

  • Human sub → whisper alignment
  • Playlist support (batch capture)
  • crossmem compile --source youtube flag
  • Storage cleanup (delete intermediate files after compile)
  • Integration tests with real short video
  • Performance benchmarks on M2/M4

13. Open Questions

  1. Subtitle language detection: Should we auto-detect the video language and pass --language to whisper, or always use en? For P1, assume English.

  2. Video retention: Keep the video file after keyframe extraction, or delete to save disk? A 1h 1080p video is ~1–2 GB. Suggest: keep for 7 days, then auto-prune.

  3. Ollama model for compile pass: Reuse llama3.2:3b (same as arxiv), or use a different model better suited for spoken-word paraphrasing? Suggest: same model, same env var.

  4. Playlist semantics: One wiki note per video, or one per playlist? Suggest: one per video, with a playlist index note linking them.

  5. Live stream handling: yt-dlp can download from start, but duration is unknown until stream ends. Suggest: P1 skips live, add in P2.