YouTube Ingestion Pipeline — Design Document
Status: Draft | Author: crossmem team | Date: 2026-04-15 | Tracking: crossmem-rs#27
1. Overview
Extend crossmem capture <url> to detect youtube.com / youtu.be hosts and dispatch to a YouTube-specific pipeline that produces time-aligned wiki chunks — the video analog of the PDF chunk pipeline from #24.
The pipeline runs entirely locally on an Apple Silicon Mac mini (M2/M4). No cloud APIs.
Pipeline stages
capture (download + extract audio/subs)
→ transcribe (whisper.cpp Metal)
→ keyframes (ffmpeg scene-cut)
→ OCR + VLM caption (per keyframe)
→ compile (Ollama paraphrase/implication per chunk)
→ emit wiki markdown
2. Download Path
Decision: yt-dlp binary
| Option | Pros | Cons |
|---|---|---|
| yt-dlp binary | Battle-tested, handles every edge case, active community, --cookies-from-browser for member-only | External dep, Python-based, updates frequently |
| libyt-dlp bindings | Tighter integration | No stable C API; Python FFI is fragile |
| youtube-rs (pure Rust) | No external dep | Incomplete, breaks on YT changes, no auth, no live/shorts |
yt-dlp wins because YouTube aggressively rotates extraction logic. Maintaining a pure-Rust extractor is a full-time job. yt-dlp is the industry standard for a reason.
Edge cases handled by yt-dlp flags
| Scenario | yt-dlp flags |
|---|---|
| Age-gated | --cookies-from-browser chrome (reads real Chrome cookies) |
| Member-only | Same cookie approach; user must be logged in |
| Live streams | --live-from-start --wait-for-video 30 (wait + download from start) |
| Shorts | Works as normal URLs (youtube.com/shorts/ID → standard extraction) |
| Playlists | --yes-playlist or --no-playlist (user flag; default: single video) |
| Chapters | --embed-chapters + --write-info-json (chapter list in info JSON) |
| Auto captions | --write-auto-subs --sub-lang en |
| Human captions | --write-subs --sub-lang en (preferred over auto when available) |
Download command template
yt-dlp \
--format "bestaudio[ext=m4a]/bestaudio/best" \
--extract-audio --audio-format wav --audio-quality 0 \
--write-info-json \
--write-subs --write-auto-subs --sub-lang "en.*" --sub-format vtt \
--embed-chapters \
--cookies-from-browser chrome \
--output "%(id)s.%(ext)s" \
--paths "$HOME/crossmem/raw/youtube/" \
"$URL"
For keyframe extraction we also need the video file:
yt-dlp \
--format "bestvideo[height<=1080][ext=mp4]/bestvideo[height<=1080]/best" \
--write-info-json \
--output "%(id)s_video.%(ext)s" \
--paths "$HOME/crossmem/raw/youtube/" \
"$URL"
3. Audio Extraction → Transcription
Decision: whisper.cpp with Metal acceleration, large-v3-turbo model
| Engine | Backend | Speed (1h audio, M2) | Accuracy | Notes |
|---|---|---|---|---|
| whisper.cpp | Metal (Apple GPU) | ~6–8 min | WER ~8% (large-v3-turbo) | C/C++, no Python, --print-timestamps for word-level |
| whisper-mlx | MLX (Apple GPU) | ~5–7 min | Same models | Python dep, MLX framework, slightly faster on M4 |
| WhisperKit | CoreML | ~5–6 min | Good | Swift-only, harder to call from Rust |
| insanely-fast-whisper | MPS (PyTorch) | ~10–15 min | Same models | Heavy Python stack, MPS less optimized than Metal |
| faster-whisper | CTranslate2 (CPU) | ~15–25 min | Same models | No Metal/MPS; CPU-only on macOS |
whisper.cpp wins because:
- Native Metal acceleration — no Python runtime
- Easily called from Rust via std::process::Command (same pattern as pdftotext in cite.rs)
- Outputs VTT/SRT/JSON with word-level timestamps
- Active project, models available via Hugging Face in ggml format
Model choice: large-v3-turbo
| Model | Params | VRAM | Disk | Speed (M2, 1h) | WER (en) |
|---|---|---|---|---|---|
| large-v3 | 1.55B | ~3 GB | 3.1 GB | ~12 min | ~7.5% |
| large-v3-turbo | 809M | ~1.6 GB | 1.6 GB | ~6 min | ~8% |
| distil-large-v3 | 756M | ~1.5 GB | 1.5 GB | ~5 min | ~9% |
large-v3-turbo is the sweet spot: half the VRAM of large-v3, nearly the same WER, 2× faster. distil-large-v3 is marginally faster but has slightly worse accuracy on non-native English speakers (common in academic talks).
Transcription command
whisper-cpp \
--model models/ggml-large-v3-turbo.bin \
--file "$HOME/crossmem/raw/youtube/${VIDEO_ID}.wav" \
--output-vtt \
--output-json \
--print-timestamps \
--language en \
--threads 4
Caption priority
- Human-uploaded subtitles (.en.vtt from yt-dlp) — highest quality, use as-is
- whisper.cpp transcription — always run for timestamp alignment even if subs exist
- Auto-generated YouTube captions — fallback only; lower quality than whisper
When human subs exist, align them with whisper timestamps for precise time-coding.
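The priority rule can be pinned down as a small pure function. A hedged sketch (the `CaptionSource` enum and `pick_caption_source` are illustrative, not existing code): whisper timestamps are used for alignment in every case; this only decides whose text wins.

```rust
#[derive(Debug, PartialEq)]
pub enum CaptionSource {
    HumanSubs, // {video_id}.en.vtt from yt-dlp
    Whisper,   // whisper.cpp transcript
    AutoSubs,  // {video_id}.en.auto.vtt, fallback only
}

/// Choose the text source for chunk content, in the priority
/// order described above.
pub fn pick_caption_source(has_human: bool, whisper_ok: bool, has_auto: bool) -> Option<CaptionSource> {
    if has_human {
        Some(CaptionSource::HumanSubs)
    } else if whisper_ok {
        Some(CaptionSource::Whisper)
    } else if has_auto {
        Some(CaptionSource::AutoSubs)
    } else {
        None
    }
}
```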
Speaker diarization
Decision: Skip for P1, add in P3 if needed.
Rationale:
- Most YouTube content crossmem targets is solo presenter (lectures, conference talks, tutorials)
- pyannote requires Python + HF token + ~2 GB model; adds significant complexity
- sherpa-onnx is lighter but diarization accuracy on overlapping speech is still mediocre
- Can retrofit later: diarization produces (speaker_id, start, end) segments that merge with existing transcript chunks
If multi-speaker content becomes common, P3 can add pyannote 3.1 with speaker embedding.
4. Visual Understanding
4a. Keyframe extraction
Decision: ffmpeg scene-cut detection
ffmpeg -i "${VIDEO_ID}_video.mp4" \
-vf "select='gt(scene,0.3)',showinfo" \
-vsync vfr \
-frame_pts 1 \
"${OUTPUT_DIR}/keyframe_%04d.png" \
2>&1 | grep "pts_time" > "${OUTPUT_DIR}/keyframe_times.txt"
| Method | Pros | Cons |
|---|---|---|
| ffmpeg scene filter | Zero extra deps, timestamp-aware, tunable threshold | May over/under-extract |
| TransNetV2 | ML-based, higher accuracy | Python + PyTorch dep, overkill for slides |
| PySceneDetect | Good API | Python dep |
ffmpeg is already a required dependency (for audio extraction). Scene threshold 0.3 works well for slide-based content; can tune per-video.
Chapter-aware extraction: If the info JSON contains chapters, also extract one keyframe per chapter boundary (seek to chapter_start + 2s). Merge with scene-cut keyframes, deduplicate within 5s window.
Target: 1 keyframe per 30–120 seconds depending on content type. Cap at 200 keyframes per video.
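The merge, dedup, and cap rules are mechanical enough to sketch directly. `merge_keyframes` below is a hypothetical helper for `youtube/keyframe.rs`, operating on millisecond timestamps:

```rust
/// Merge scene-cut and chapter-boundary keyframe timestamps (ms),
/// dropping any frame within the 5 s dedup window of an already-kept
/// one and capping the result at 200 frames, per the rules above.
pub fn merge_keyframes(mut scene_ms: Vec<u64>, chapter_ms: Vec<u64>) -> Vec<u64> {
    const DEDUP_WINDOW_MS: u64 = 5_000;
    const MAX_KEYFRAMES: usize = 200;

    scene_ms.extend(chapter_ms);
    scene_ms.sort_unstable();

    let mut kept: Vec<u64> = Vec::new();
    for t in scene_ms {
        let far_enough = match kept.last() {
            Some(&prev) => t - prev >= DEDUP_WINDOW_MS,
            None => true,
        };
        if far_enough {
            kept.push(t);
        }
        if kept.len() == MAX_KEYFRAMES {
            break;
        }
    }
    kept
}
```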
4b. Per-keyframe VLM caption
Decision: Qwen2.5-VL-7B via Ollama (local)
Ollama already supports multimodal models. The existing Ollama integration in cite.rs targets http://localhost:11434/api/generate — the same endpoint accepts image input with "images": [base64_png].
{
"model": "qwen2.5-vl:7b",
"prompt": "Describe this video frame in one sentence. If it contains a slide, list the title and key bullet points.",
"images": ["<base64_keyframe>"],
"stream": false
}
| Model | VRAM | Speed (per frame, M2) | Quality |
|---|---|---|---|
| Qwen2.5-VL-7B (q4_K_M) | ~5 GB | ~3–5 sec | Good for slides/diagrams |
| LLaVA-1.6-7B | ~5 GB | ~3–5 sec | Slightly worse on text-heavy slides |
| Qwen2.5-VL-3B | ~2.5 GB | ~1–2 sec | Faster but misses fine text |
Qwen2.5-VL-7B is the best local VLM for slide/diagram content. 7B quantized fits comfortably alongside whisper on M2 (16 GB unified memory).
Batching: Process keyframes sequentially (VLM needs full GPU). At ~4 sec/frame × 100 frames = ~7 min. Acceptable.
4c. OCR on slides
Decision: Apple Vision framework via swift-ffi (primary), Tesseract (fallback)
| Engine | Accuracy | Speed | Dependencies |
|---|---|---|---|
| Apple Vision (VNRecognizeTextRequest) | Excellent, especially printed text | ~0.1s/image | macOS 13+, Swift FFI |
| PaddleOCR | Very good, multi-language | ~0.3s/image | Python + large model |
| Tesseract | Good for English | ~0.5s/image | brew install tesseract |
Apple Vision is the clear winner on macOS: built-in, fast, accurate, no extra deps. Access from Rust via a tiny Swift CLI helper:
// crossmem-ocr (Swift CLI, ~30 lines)
import Vision
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
// ... read image, perform request, print results as JSON
Compile as crossmem-ocr binary, call from Rust via Command::new("crossmem-ocr"). Ship as part of the crossmem install or build from source on first run.
Fallback: If the Swift helper isn’t available (Linux compat someday), fall back to tesseract --oem 1 -l eng.
5. Chunk Schema
Time-aligned chunk (parallel to CompiledChunk in cite.rs)
pub struct YouTubeChunk {
    pub start_ms: u64,
    pub end_ms: u64,
    pub speaker: Option<String>,          // None until diarization (P3)
    pub transcript: String,               // Whisper or human-sub text for this segment
    pub slide_ocr: Option<String>,        // OCR text if keyframe in this time range
    pub keyframe_path: Option<String>,    // Relative path to keyframe PNG
    pub keyframe_caption: Option<String>, // VLM description of keyframe
    pub paraphrase: String,               // LLM-generated 1-2 sentence summary
    pub implication: String,              // LLM-generated field impact
}
Chunk boundaries
Priority order for segmentation:
- Chapters (from info JSON) — if present, each chapter = one chunk
- Scene cuts — if no chapters, split at scene-cut boundaries
- Fixed window — fallback: 60-second segments with sentence-boundary snapping
Within a chapter, if the chapter exceeds 5 minutes, sub-split at scene cuts or 60s intervals.
Minimum chunk: 10 seconds. Maximum chunk: 5 minutes (force-split at sentence boundary).
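These boundary rules can be sketched as a pure function over millisecond spans. `segment_ms` is illustrative only: it covers the chapter-first and fixed-window cases, and leaves sentence-boundary snapping and scene-cut sub-splits to the real implementation.

```rust
/// Segment a video into (start_ms, end_ms) chunks: one chunk per
/// chapter when chapters exist, 60 s windows otherwise, with long
/// chapters sub-split and trailing slivers under 10 s merged back.
pub fn segment_ms(duration_ms: u64, chapters: &[(u64, u64)]) -> Vec<(u64, u64)> {
    const MIN_MS: u64 = 10_000;    // minimum chunk
    const MAX_MS: u64 = 300_000;   // force-split above 5 minutes
    const WINDOW_MS: u64 = 60_000; // fixed fallback window

    let spans: Vec<(u64, u64)> = if chapters.is_empty() {
        vec![(0, duration_ms)]
    } else {
        chapters.to_vec()
    };

    let mut out = Vec::new();
    for (start, end) in spans {
        if end - start <= MAX_MS && !chapters.is_empty() {
            out.push((start, end)); // one chapter = one chunk
            continue;
        }
        // Sub-split long chapters (or the whole video) into 60 s windows.
        let mut t = start;
        while t < end {
            let next = (t + WINDOW_MS).min(end);
            // Merge a trailing sliver under 10 s into the previous chunk.
            if end - next < MIN_MS && next < end {
                out.push((t, end));
                break;
            }
            out.push((t, next));
            t = next;
        }
    }
    out
}
```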
Metadata struct
pub struct YouTubeMetadata {
    pub title: String,
    pub channel: String,
    pub upload_date: String,    // YYYY-MM-DD
    pub video_id: String,
    pub duration_sec: u64,
    pub chapters: Vec<Chapter>, // from info JSON
    pub description: String,
    pub tags: Vec<String>,
}

pub struct Chapter {
    pub title: String,
    pub start_sec: f64,
    pub end_sec: f64,
}
Cite key
{channel_slug}{year}{first_noun_of_title}
Examples:
- 3Blue1Brown, “But what is a neural network?” (2017) → 3blue1brown2017neural
- Andrej Karpathy, “Let’s build GPT from scratch” (2023) → karpathy2023gpt
- Two Minute Papers, “OpenAI Sora” (2024) → twominutepapers2024sora
channel_slug = channel name lowercased, non-alphanumeric stripped, truncated to 20 chars.
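A sketch of the slug and key builders (hypothetical names; the first-noun extraction is assumed to happen elsewhere, so the caller passes it in):

```rust
/// channel_slug per the rule above: lowercase, strip
/// non-alphanumeric, truncate to 20 chars.
pub fn channel_slug(channel: &str) -> String {
    channel
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .map(|c| c.to_ascii_lowercase())
        .take(20)
        .collect()
}

/// {channel_slug}{year}{first_noun_of_title}. Picking the first noun
/// needs a heuristic or POS tagger, so it is supplied by the caller.
pub fn cite_key(channel: &str, year: u32, first_noun: &str) -> String {
    format!("{}{}{}", channel_slug(channel), year, first_noun.to_ascii_lowercase())
}
```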
Time-coded deep link
Each chunk carries a provenance URL:
https://youtu.be/{VIDEO_ID}?t={floor(start_ms / 1000)}
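As a one-line helper (the name `deep_link` is illustrative), the integer division gives the floor for free:

```rust
/// Provenance URL for a chunk: https://youtu.be/{VIDEO_ID}?t={floor(start_ms / 1000)}
pub fn deep_link(video_id: &str, start_ms: u64) -> String {
    format!("https://youtu.be/{}?t={}", video_id, start_ms / 1000)
}
```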
6. Citation Formats
APA 7th (online video)
{Channel} [{Channel}]. ({Year}, {Month} {Day}). {Title} [Video]. YouTube. https://www.youtube.com/watch?v={VIDEO_ID}
Example:
3Blue1Brown [3Blue1Brown]. (2017, October 5). But what is a neural network? [Video]. YouTube. https://www.youtube.com/watch?v=aircAruvnKk
MLA 9th
"{Title}." YouTube, uploaded by {Channel}, {Day} {Month} {Year}, www.youtube.com/watch?v={VIDEO_ID}.
Chicago 17th (note-bibliography)
{Channel}. "{Title}." {Month} {Day}, {Year}. Video, {Duration}. https://www.youtube.com/watch?v={VIDEO_ID}.
IEEE
{Channel}, "{Title}," YouTube. [Online Video]. Available: https://www.youtube.com/watch?v={VIDEO_ID}. [Accessed: {Access Date}].
BibTeX
@misc{cite_key,
author = {{Channel}},
title = {{Title}},
year = {Year},
month = {Month},
howpublished = {\url{https://www.youtube.com/watch?v=VIDEO_ID}},
note = {[Video]. YouTube. Accessed: YYYY-MM-DD}
}
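The BibTeX template reduces to string assembly. A sketch (`bibtex` is a hypothetical builder; escaping of BibTeX-special characters in titles is out of scope here):

```rust
/// Render the BibTeX template above. Field values are assumed to be
/// already BibTeX-safe.
pub fn bibtex(key: &str, channel: &str, title: &str, year: u32,
              video_id: &str, accessed: &str) -> String {
    let url = format!("https://www.youtube.com/watch?v={video_id}");
    [
        format!("@misc{{{key},"),
        format!("  author = {{{{{channel}}}}},"),
        format!("  title = {{{{{title}}}}},"),
        format!("  year = {{{year}}},"),
        format!("  howpublished = {{\\url{{{url}}}}},"),
        format!("  note = {{[Video]. YouTube. Accessed: {accessed}}}"),
        "}".to_string(),
    ]
    .join("\n")
}
```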
7. Wiki Markdown Output
Follows the same structure as the ArXiv wiki notes. Example:
---
cite_key: 3blue1brown2017neural
title: "But what is a neural network?"
channel: "3Blue1Brown"
upload_date: "2017-10-05"
video_id: "aircAruvnKk"
duration_sec: 1140
captured_at: "1776300000"
raw: "~/crossmem/raw/youtube/aircAruvnKk.wav"
chunks: 12
source_type: youtube
---
# But what is a neural network?
## Citations
### APA
...
## Chunks
### 00:00–01:32 — Chapter: Introduction
> [Transcript text, first 400 chars...]
**Slide OCR:** [if keyframe present]
**Keyframe:** `keyframes/aircAruvnKk_0042.png` — "A diagram showing..."
**Paraphrase:** ...
**Implication:** ...
**Source:** [00:00](https://youtu.be/aircAruvnKk?t=0)
8. Orchestration
Decision: Same binary, new module youtube.rs
The existing crossmem capture <url> dispatches on URL. Add host detection:
// main.rs capture dispatch
if url.contains("arxiv.org") {
    cite::cmd_capture(url).await
} else if url.contains("youtube.com") || url.contains("youtu.be") {
    youtube::cmd_capture(url).await
} else {
    // future: generic handler
}
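One caveat with plain `contains`: it also matches URLs that merely mention youtube.com in a path or query string. A stricter, still dependency-free host check could look like this (an illustrative sketch, not the committed design):

```rust
/// Host-based check: parse out the host portion and compare against
/// known YouTube hosts, so URLs that only mention youtube.com in a
/// path or query parameter don't match.
pub fn is_youtube_url(url: &str) -> bool {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
        .unwrap_or(url);
    let host = rest.split(|c| c == '/' || c == '?').next().unwrap_or("");
    let host = host.strip_prefix("www.").unwrap_or(host);
    matches!(host, "youtube.com" | "m.youtube.com" | "music.youtube.com" | "youtu.be")
}
```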
Module structure
src/
cite.rs # existing arxiv pipeline (unchanged)
youtube.rs # new: YouTube capture + compile
youtube/
download.rs # yt-dlp wrapper
transcribe.rs # whisper.cpp wrapper
keyframe.rs # ffmpeg scene-cut + chapter extraction
ocr.rs # Apple Vision / tesseract wrapper
vlm.rs # Ollama multimodal (Qwen2.5-VL) wrapper
chunk.rs # Segmentation + chunk assembly
emit.rs # Wiki markdown emission
shared/
ollama.rs # Extract from cite.rs — shared Ollama client
formats.rs # Citation format builders (generalized)
Shared Ollama code: Factor compile_page_chunk and the HTTP client into shared/ollama.rs. Both cite.rs and youtube.rs call it. The prompt template differs (page text vs transcript chunk), but the HTTP plumbing is identical.
Two-stage flow (same as arxiv)
crossmem capture <youtube-url>
→ downloads audio + video + subs + info JSON
→ extracts metadata, generates cite_key
→ saves to ~/crossmem/raw/youtube/{video_id}/
→ prints cite_key for next step
crossmem compile <cite_key>
→ detects source_type (arxiv vs youtube) from meta JSON
→ runs transcription (whisper.cpp)
→ runs keyframe extraction (ffmpeg)
→ runs OCR + VLM caption per keyframe
→ runs Ollama compile per chunk (paraphrase + implication)
→ emits wiki markdown to ~/crossmem/wiki/
9. Dependency Install UX
Decision: Error with one-liner install instructions on first run
Auto-installing is tempting but violates the principle of least surprise. Instead:
$ crossmem capture https://youtube.com/watch?v=abc123
ERROR: missing required dependencies for YouTube ingestion:
✗ yt-dlp — brew install yt-dlp
✗ ffmpeg — brew install ffmpeg
✓ whisper.cpp — found at /opt/homebrew/bin/whisper-cpp
Install all missing:
brew install yt-dlp ffmpeg
Then retry: crossmem capture https://youtube.com/watch?v=abc123
Check order: which yt-dlp && which ffmpeg && which whisper-cpp (or whisper depending on install method).
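The check itself is a set difference; injecting the PATH probe keeps it testable without a real `which` lookup (`missing_deps` is a hypothetical helper, with the production probe shelling out to `which <bin>`):

```rust
/// Return the subset of `bins` that the probe cannot resolve.
/// In production the probe would run `which <bin>` (or try
/// `Command::new(bin).arg("--version")`); tests inject a stub.
pub fn missing_deps<'a>(bins: &[&'a str], probe: impl Fn(&str) -> bool) -> Vec<&'a str> {
    bins.iter().copied().filter(|b| !probe(b)).collect()
}
```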
whisper.cpp model download: If binary exists but model is missing:
Model not found. Download large-v3-turbo (~1.6 GB):
curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
crossmem-ocr Swift helper: Build from source on first YouTube capture:
$ swift build -c release --package-path ./tools/crossmem-ocr
Or provide pre-built binary in releases.
10. Cost Model (All Local)
Estimated wall-clock for M2 Mac mini (16 GB)
| Stage | 1h video | 30 min video | 3h video |
|---|---|---|---|
| yt-dlp download (audio + video) | ~2 min | ~1 min | ~5 min |
| whisper.cpp transcription | ~6 min | ~3 min | ~18 min |
| ffmpeg keyframe extraction | ~1 min | ~30 sec | ~3 min |
| OCR per keyframe (~80 frames) | ~8 sec | ~4 sec | ~20 sec |
| VLM caption per keyframe | ~5 min | ~2.5 min | ~15 min |
| Ollama compile per chunk (~40 chunks) | ~8 min | ~4 min | ~24 min |
| Total | ~22 min | ~11 min | ~65 min |
Bottlenecks
- Ollama compile — sequential LLM calls, ~12 sec/chunk. Could batch with larger context window.
- VLM caption — sequential, ~4 sec/frame. GPU contention with Ollama if run concurrently.
- Whisper — fast on Metal, but locks GPU for duration.
Memory pressure
| Concurrent | Peak VRAM | Safe on 16 GB? |
|---|---|---|
| Whisper alone | ~1.6 GB | Yes |
| Ollama (7B q4) alone | ~5 GB | Yes |
| Whisper + Ollama | ~6.6 GB | Yes |
| Qwen2.5-VL-7B + Ollama text | ~10 GB | Tight but OK |
| All three simultaneous | ~12 GB | Risky — run sequentially |
Strategy: Run stages sequentially. whisper → keyframes → OCR → VLM → compile. No concurrent GPU workloads.
11. Storage Layout
~/crossmem/
raw/
youtube/
{video_id}/
{video_id}.wav # Audio (whisper input)
{video_id}_video.mp4 # Video (keyframe source)
{video_id}.info.json # yt-dlp metadata
{video_id}.en.vtt # Human subs (if available)
{video_id}.en.auto.vtt # Auto subs (if available)
{video_id}.meta.json # crossmem metadata
transcript.json # Whisper output with timestamps
keyframes/
frame_0001.png # Scene-cut keyframes
frame_0002.png
keyframe_times.json # Timestamp → frame mapping
ocr/
frame_0001.txt # OCR output per frame
captions/
frame_0001.txt # VLM caption per frame
wiki/
{timestamp}_{cite_key}.md # Final compiled wiki note
12. Phased Delivery
P1 — Download + Transcribe (MVP)
- URL detection in main.rs capture dispatch
- yt-dlp download wrapper (youtube/download.rs)
- whisper.cpp transcription wrapper (youtube/transcribe.rs)
- Basic chunk segmentation (chapters or 60s windows)
- Ollama compile pass (reuse from cite.rs)
- Wiki markdown emission (transcript-only, no visual)
- Dependency check + error messages
- Tests for metadata parsing, cite_key generation, chunk segmentation
P2 — Keyframes + OCR
- ffmpeg scene-cut extraction (youtube/keyframe.rs)
- Chapter-aware keyframe selection
- Apple Vision OCR helper (tools/crossmem-ocr/)
- Tesseract fallback
- OCR text merged into chunks
- Tests for keyframe timing, OCR integration
P3 — VLM Captions + Diarization
- Ollama multimodal integration for keyframe captioning (youtube/vlm.rs)
- Keyframe captions merged into chunks
- Optional: pyannote speaker diarization
- Tests for VLM response parsing
P4 — Polish + Chunk Emission
- Human sub → whisper alignment
- Playlist support (batch capture)
- crossmem compile --source youtube flag
- Storage cleanup (delete intermediate files after compile)
- Integration tests with real short video
- Performance benchmarks on M2/M4
13. Open Questions
- Subtitle language detection: Should we auto-detect the video language and pass --language to whisper, or always use en? For P1, assume English.
- Video retention: Keep the video file after keyframe extraction, or delete to save disk? A 1h 1080p video is ~1–2 GB. Suggest: keep for 7 days, then auto-prune.
- Ollama model for compile pass: Reuse llama3.2:3b (same as arxiv), or use a different model better suited for spoken-word paraphrasing? Suggest: same model, same env var.
- Playlist semantics: One wiki note per video, or one per playlist? Suggest: one per video, with a playlist index note linking them.
- Live stream handling: yt-dlp can download from start, but duration is unknown until stream ends. Suggest: P1 skips live, add in P2.