
crossmem

crossmem is a local-first citation and knowledge pipeline. It captures academic papers (arXiv PDFs today, YouTube and more coming), compiles them into structured wiki notes with verbatim quotes and provenance metadata, and serves them to AI agents via MCP.

What it does

  1. Capture — downloads a paper, extracts metadata from arXiv + CrossRef + OpenAlex, generates a deterministic cite key
  2. Compile — parses the PDF (via Marker or pdftotext), splits into paragraph-level chunks with bounding-box provenance, runs a local LLM (Ollama) to add paraphrase and implication per chunk
  3. Verify — re-hashes every chunk’s text against its stored SHA-256; detects silent drift
  4. Cite & Recall — MCP tools that let Claude (or any MCP client) look up citations and search your wiki

Design principles

  • Verbatim quotes are ground truth. The LLM only touches paraphrase/implication fields, never the original text.
  • Provenance is first-class. Every chunk carries page, section, bounding box, SHA-256 hash, and byte range back to the source PDF.
  • Metadata is cross-verified. Title, authors, and year must agree across at least two canonical sources (arXiv, CrossRef, OpenAlex). Disagreements surface as warnings, not silent picks.
  • Everything runs locally. No cloud APIs. Ollama for LLM, Marker for PDF parsing, all on your Mac.

Installation

From source

cargo install --path .

Or directly from GitHub:

cargo install --git https://github.com/crossmem/crossmem-rs

Dependencies

crossmem requires the following tools for PDF capture and compilation:

| Tool | Purpose | Install |
|------|---------|---------|
| pdftotext | Fallback PDF text extraction | `brew install poppler` |
| marker | PDF parsing with bounding boxes (default) | `pip install marker-pdf` or `uvx marker-pdf` |
| Ollama | Local LLM for paraphrase/implication | ollama.com |

Ollama model setup

crossmem uses llama3.2:3b by default. Pull the model before your first compile:

ollama pull llama3.2:3b

Override the model with the CROSSMEM_OLLAMA_MODEL environment variable:

CROSSMEM_OLLAMA_MODEL=mistral crossmem compile vaswani2017attention

Verify installation

crossmem --version

Quick Start

Capture a paper, compile it, and cite it — in under 30 seconds.

1. Capture

Download an arXiv paper and extract metadata:

crossmem capture https://arxiv.org/abs/1706.03762

Output:

[capture] arxiv_id: 1706.03762
[capture] title: Attention Is All You Need
[capture] cite_key: vaswani2017attention
[capture] saved to ~/crossmem/raw/...

2. Compile

Parse the PDF into chunks and run the LLM pass:

crossmem compile vaswani2017attention

This produces a wiki note at ~/crossmem/wiki/<timestamp>_vaswani2017attention.md with:

  • YAML frontmatter (title, authors, year, DOI, cite_key)
  • Five citation formats (APA, MLA, Chicago, IEEE, BibTeX)
  • Per-chunk verbatim quotes with paraphrase, implication, and provenance metadata

3. Cite via MCP

Add crossmem to Claude Code:

claude mcp add crossmem -- crossmem mcp serve

Then ask Claude:

Cite vaswani2017attention in APA format.

Claude calls crossmem_cite and returns:

Vaswani, A., & Shazeer, N. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

4. Search your wiki

Ask Claude:

What do I have on self-attention mechanisms?

Claude calls crossmem_recall and returns matching excerpts ranked by relevance, with cite keys and deep links to your wiki files.

Writing a Paper with crossmem

An end-to-end playbook for AI agents (Claude Code, Cursor, etc.) and their human authors. You have crossmem installed and the MCP server registered. You want to cite prior work correctly and quote-faithfully.

1. One-time setup

Install crossmem and its dependencies:

# Install crossmem
cargo install --path .
# Or, from the repo directly:
# cargo install --git https://github.com/crossmem/crossmem-rs

# Local LLM for paraphrase/implication generation
ollama pull llama3.2:3b

# PDF parser (preferred — produces bounding boxes)
pip install marker-pdf
# Fallback: brew install poppler   (provides pdftotext)

Register the MCP server so your agent can call crossmem_cite and crossmem_recall:

Claude Code:

claude mcp add crossmem -- crossmem mcp serve

Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crossmem": {
      "command": "crossmem",
      "args": ["mcp", "serve"]
    }
  }
}

2. Capturing a paper

crossmem capture https://arxiv.org/abs/1706.03762

Output:

[capture] arxiv_id: 1706.03762
[capture] title: Attention Is All You Need
[capture] cite_key: vaswani2017attention
[capture] saved to ~/crossmem/raw/1776227254_vaswani2017attention.pdf

This does three things:

  1. Downloads the PDF to ~/crossmem/raw/<timestamp>_<cite_key>.pdf
  2. Fetches metadata from arXiv, CrossRef, and OpenAlex — reconciles across all three
  3. Generates a deterministic cite key via the pattern DSL

Then compile it into a wiki entry:

crossmem compile vaswani2017attention

This parses the PDF (Marker by default), splits it into chunks, runs each through Ollama for paraphrase and implication, and emits the wiki note at ~/crossmem/wiki/<timestamp>_vaswani2017attention.md.

Note: YouTube ingestion is design-only — see YouTube Ingestion Pipeline.

Capturing non-arXiv papers

Most journal papers (e.g. JCP, Nature, PRL) are not on arXiv. crossmem capture supports them through DOI lookup and local PDF import.

If you have a DOI — CrossRef metadata is fetched automatically:

# DOI URL
crossmem capture https://doi.org/10.1063/5.0012345

# Bare DOI
crossmem capture 10.1063/5.0012345

If the paper is open-access, the PDF downloads via Unpaywall. Otherwise you’ll get instructions to download it manually.

If you already have the PDF — the most common path for paywalled journals:

# With DOI (recommended — gets full CrossRef metadata)
crossmem capture ~/Downloads/smith2023.pdf --doi 10.1063/5.0012345

# Without DOI — extracts what it can from PDF metadata
crossmem capture ~/Downloads/smith2023.pdf --cite-key smith2023transport

Direct PDF URL — for preprint servers, institutional repos:

crossmem capture https://chemrxiv.org/paper.pdf --doi 10.1234/chemrxiv.5678

All paths produce the same raw/ + .meta.json output. Then compile as usual:

crossmem compile smith2023transport

For a JCP submission with 24 references, a typical workflow is:

# Capture each reference — most will be local PDFs with DOIs
for pdf in ~/papers/jcp-refs/*.pdf; do
  doi=$(pdftotext "$pdf" - | head -5 | grep -oE '10\.[0-9]{4,9}/[^[:space:]]+' | head -1)
  crossmem capture "$pdf" --doi "$doi"
done

# Then compile each one
for meta in ~/crossmem/raw/*.meta.json; do
  key=$(jq -r .cite_key "$meta")
  crossmem compile "$key"
done

3. The compiled wiki entry — what the agent sees

Frontmatter

---
cite_key: vaswani2017attention
title: "Attention Is All You Need"
authors:
  - "Ashish Vaswani"
  - "Noam Shazeer"
year: 2017
arxiv_id: "1706.03762"
doi: "10.48550/arXiv.1706.03762"
captured_at: "1776227254"
raw: "~/crossmem/raw/1776227254_vaswani2017attention.pdf"
pdf_sha256: "9a8f3b..."
parser: "marker"
chunks: 47
meta:
  sources: ["arxiv", "crossref", "openalex"]
  reconciled: true
  warnings: []
---

After the frontmatter, five citation formats are pre-generated: APA, MLA, Chicago, IEEE, and BibTeX.

Chunks

Each chunk carries verbatim text, LLM-generated derivatives, and full provenance:

````markdown
<!-- chunk id=p4s32c1 -->
> The dominant sequence transduction models are based on complex recurrent or
> convolutional neural networks that include an encoder and a decoder.

**Paraphrase:** Prior sequence models relied on RNNs or CNNs in an encoder-decoder setup.

**Implication:** This dependency on recurrence was the bottleneck the Transformer aimed to eliminate.

```yaml
provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: "5f3e1c..."
  byte_range: [18342, 19104]
```
````

Hard rule for agents: The > blockquote is the verbatim original extracted from the PDF. When citing, the agent MUST copy from this blockquote. NEVER fabricate or rephrase quotes. The Paraphrase and Implication fields exist for the agent’s reasoning and search — they do not belong in the paper as attributed quotes.
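
For scripted workflows outside MCP, the verbatim quote can be extracted mechanically instead of retyped. A minimal sketch, assuming the chunk layout shown above; `extract_quote` is a hypothetical helper, not a crossmem command:

```sh
# Print the verbatim quote for one chunk ID from a wiki note.
extract_quote() {  # usage: extract_quote <chunk_id> <wiki_file>
  awk -v id="$1" '
    index($0, "<!-- chunk id=" id " -->") { grab = 1; next }
    grab && /^> /                         { sub(/^> /, ""); print; found = 1; next }
    grab && found                         { exit }   # stop after the quote block
  ' "$2"
}
```

The printed lines are the blockquote text with the `> ` prefix stripped; pasting them unmodified into a draft keeps the quote drift-clean.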

4. Agent prompts that actually work

Finding relevant chunks

“Search my library for how transformer attention was originally motivated. Return cite_keys and page numbers.”

Agent calls:

crossmem_recall("transformer attention motivation", limit=5)

Returns a ranked list of {cite_key, title, section, excerpt}. The agent picks the most relevant hits and reports them.

Quoting with provenance

“Write a paragraph introducing self-attention. Quote vaswani2017attention page 2 verbatim, then paraphrase in my voice. Include BibTeX.”

Agent workflow:

  1. Calls crossmem_recall("self-attention vaswani2017attention") to find the right chunk
  2. Reads the wiki file to locate the page-2 chunk
  3. Copies the > blockquote verbatim into the draft as a block quote
  4. Writes a surrounding paraphrase in the author’s voice (informed by the Paraphrase field, not copying it)
  5. Calls crossmem_cite("vaswani2017attention", "bibtex") for the BibTeX entry
  6. Embeds the text_sha256 and page reference as a LaTeX comment so crossmem verify can trace provenance:
% crossmem: vaswani2017attention p4s32c1 sha256=5f3e1c...
\begin{quote}
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder.
\end{quote}
\cite{vaswani2017attention}

Citing multiple papers

“Compare how Vaswani 2017 and Devlin 2019 frame the importance of pre-training.”

Agent calls crossmem_recall("pre-training importance"), gets hits from both papers, reads the relevant chunks, and writes a comparison paragraph quoting both — each quote traced to its chunk ID.

Running a drift check

After the human edits the draft (or the agent revises it), verify that no quotes have been accidentally mutated:

crossmem verify

Output when clean:

[verify] checked 94 chunks across 3 wiki entries
[verify] 0 drifts detected

Output when a quote was altered:

[verify] DRIFT in vaswani2017attention chunk p4s32c1
  expected: 5f3e1c...
  actual:   a1b2c3...
[verify] 1 drift detected

Exit code 1 means drift — the agent or human must restore the original quote from the wiki.

Building the bib file

Collect all \cite{...} keys from a LaTeX draft and emit a single .bib:

grep -oE '\\cite\{[^}]+\}' draft.tex \
  | sed 's/\\cite{//;s/}//' \
  | tr ',' '\n' \
  | sort -u \
  | while read -r key; do crossmem mcp serve <<< "{\"method\":\"tools/call\",\"params\":{\"name\":\"crossmem_cite\",\"arguments\":{\"cite_key\":\"$key\",\"format\":\"bibtex\"}}}"; done

Or, have the agent do it: “Collect every cite key from my draft and produce a references.bib file using crossmem_cite.”

5. What crossmem protects against

| Failure mode | How crossmem prevents it |
|---|---|
| Hallucinated citation metadata | Multi-source reconciliation: arXiv + CrossRef + OpenAlex, ≥2 must agree. Disagreements surface as warnings in frontmatter. |
| Hallucinated quotes | Agent contract: never compose original text, only copy the `>` blockquote. `crossmem verify` catches any post-hoc mutation via SHA-256 re-hashing. |
| Wrong page numbers | Every chunk carries page, section, and bbox — the reader can trace back to the exact PDF region. |
| Lost context | `byte_range` preserves the exact location in the raw PDF. Chunks retain their section heading for navigation. |
| Cite key collisions | Deterministic pattern DSL with a–z suffix tiebreaker (then `_<count>` if all 26 are taken). |

6. Limits

Be honest about what crossmem cannot do today:

  • Scanned / image-only PDFs: Marker’s OCR quality varies. Chunks from poorly scanned pages may have garbled text.
  • Math-heavy pages: The pipeline does not run Nougat or other math-aware extractors. Equations may appear as lossy Unicode approximations or be missing entirely.
  • Non-arXiv sources: Journal papers captured via DOI or local PDF have single-source metadata (CrossRef only), so there is no cross-verification. Books and conference proceedings with non-standard DOIs may produce incomplete frontmatter.
  • Single-author workflow: There is no shared library, sync, or multi-user conflict resolution. Each machine has its own ~/crossmem/ directory.
  • Ollama dependency: Compile requires a running Ollama instance. If Ollama is down or the model is missing, compile will fail.

7. Minimal paper-writing session

A scripted walkthrough — capture two papers, write an intro paragraph, verify.

# Capture two papers
crossmem capture https://arxiv.org/abs/1706.03762
crossmem compile vaswani2017attention

crossmem capture https://arxiv.org/abs/1810.04805
crossmem compile devlin2019bert

Now prompt the agent:

“Write an introductory paragraph for my Related Work section. It should cite both vaswani2017attention and devlin2019bert, quoting one key sentence from each verbatim. Output LaTeX with \cite commands and the BibTeX entries.”

The agent:

  1. Calls crossmem_recall("attention mechanism transformer", limit=5) and crossmem_recall("pre-training bidirectional", limit=5)
  2. Reads the wiki entries for both papers, selects one chunk each
  3. Produces:
The Transformer architecture replaced recurrence with self-attention:
\begin{quote}
``The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder.''
\end{quote}
\cite{vaswani2017attention}. Building on this,
BERT demonstrated that bidirectional pre-training could be applied to a wide
range of NLP tasks:
\begin{quote}
``We introduce a new language representation model called BERT, which stands
for Bidirectional Encoder Representations from Transformers.''
\end{quote}
\cite{devlin2019bert}.

% crossmem: vaswani2017attention p1s0c1 sha256=...
% crossmem: devlin2019bert p1s0c1 sha256=...
  4. Calls crossmem_cite("vaswani2017attention", "bibtex") and crossmem_cite("devlin2019bert", "bibtex") to emit references.bib

Finally, verify nothing drifted:

crossmem verify
# [verify] checked 94 chunks across 2 wiki entries
# [verify] 0 drifts detected

The quotes in your LaTeX match the raw PDFs. Ship it.

crossmem capture

Download a paper and extract metadata.

Usage

crossmem capture <input> [--doi <doi>] [--cite-key <key>]

Input types

| Input | Example | Detection |
|---|---|---|
| Local PDF file | `/path/to/paper.pdf` | Path exists on disk |
| arXiv URL or bare ID | `https://arxiv.org/abs/1706.03762`, `1706.03762` | arXiv URL pattern or bare numeric ID |
| DOI URL or bare DOI | `https://doi.org/10.1038/nature12373`, `10.1038/nature12373` | DOI URL prefix or `10.NNNN/...` pattern |
| Direct PDF URL | `https://example.com/paper.pdf` | HTTPS URL ending in `.pdf` |

Inputs are matched in the order above — first match wins.
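
The cascade can be sketched as a chain of pattern checks. This is an illustration of the rules above under assumed regexes, not crossmem's actual matching code:

```sh
# First-match-wins input classification, mirroring the detection order above.
classify_input() {  # usage: classify_input <input>
  in="$1"
  if [ -e "$in" ]; then
    echo "local-pdf"    # a path that exists on disk
  elif printf '%s' "$in" | grep -qE '^(https?://arxiv\.org/abs/|[0-9]{4}\.[0-9]{4,5}(v[0-9]+)?$)'; then
    echo "arxiv"
  elif printf '%s' "$in" | grep -qE '^(https?://doi\.org/|10\.[0-9]{4,9}/)'; then
    echo "doi"
  elif printf '%s' "$in" | grep -qE '^https://.*\.pdf$'; then
    echo "pdf-url"
  else
    echo "unknown"
  fi
}
```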

Flags

| Flag | Description |
|---|---|
| `--doi <doi>` | Attach a DOI to the capture. For local files and PDF URLs, fetches CrossRef metadata for this DOI. |
| `--cite-key <key>` | Override the auto-generated cite key. |

What it does

arXiv input (existing behavior)

  1. Extracts the arXiv ID from the URL
  2. Fetches metadata from arXiv API
  3. Cross-checks metadata against CrossRef and OpenAlex (reconciliation)
  4. Downloads the PDF from arXiv
  5. Generates a cite key using the configured pattern DSL
  6. Saves PDF + .meta.json sidecar

DOI input

  1. Fetches metadata from CrossRef API
  2. Tries Unpaywall API for an open-access PDF URL (requires CROSSMEM_UNPAYWALL_EMAIL env var)
  3. If no open-access PDF found, prints instructions to download manually and use local file capture

Local PDF file

  1. Copies (not moves) the PDF to ~/crossmem/raw/<timestamp>_<cite_key>.pdf
  2. If --doi given: fetches CrossRef metadata
  3. If no --doi: tries extracting embedded PDF metadata via pdfinfo (Title, Author, CreationDate)
  4. If no metadata found and no --cite-key: errors with instructions

Direct PDF URL

  1. Downloads the PDF
  2. Then follows the same metadata path as local file (CrossRef via --doi, or pdfinfo fallback)

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (invalid input, download failure, metadata fetch failure) |
| 2 | Missing arguments |

Environment variables

| Variable | Description |
|---|---|
| `CROSSMEM_UNPAYWALL_EMAIL` | Email address for Unpaywall API (required for DOI→PDF lookup) |

See config.toml for cite key configuration.

Examples

arXiv paper

$ crossmem capture https://arxiv.org/abs/1706.03762
[capture] arxiv_id: 1706.03762
[capture] title: Attention Is All You Need
cite_key:   vaswani2017attention

Journal paper via DOI

$ crossmem capture 10.1063/5.0012345
[capture] DOI: 10.1063/5.0012345
cite_key:   smith2023molecular

Local PDF with DOI metadata

$ crossmem capture ~/Downloads/paper.pdf --doi 10.1063/5.0012345
[capture] Local file: /Users/me/Downloads/paper.pdf
[capture] Fetching CrossRef metadata for DOI 10.1063/5.0012345
cite_key:   smith2023molecular

Local PDF with manual cite key

$ crossmem capture ~/Downloads/paper.pdf --cite-key jones2024transport
[capture] Local file: /Users/me/Downloads/paper.pdf
cite_key:   jones2024transport

Direct PDF URL

$ crossmem capture https://example.com/papers/preprint.pdf --doi 10.1234/example
cite_key:   doe2024example

Storage layout

~/crossmem/raw/
  <timestamp>_<cite_key>.pdf        # Raw PDF
  <timestamp>_<cite_key>.meta.json  # Metadata sidecar

The .meta.json file contains the reconciled metadata used by compile:

{
  "cite_key": "smith2023molecular",
  "title": "Molecular dynamics simulation of transport",
  "authors": ["John Smith", "Jane Doe"],
  "year": 2023,
  "arxiv_id": "",
  "doi": "10.1063/5.0012345",
  "container_title": "The Journal of Chemical Physics",
  "sources": ["crossref"],
  "reconciled": true,
  "warnings": []
}
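
Because every capture path produces the same sidecar shape, the `.meta.json` files double as a queryable index of your library. A sketch with jq; `list_library` is a hypothetical helper:

```sh
# Print "cite_key<TAB>title" for every metadata sidecar in a directory.
list_library() {  # usage: list_library ~/crossmem/raw
  for m in "$1"/*.meta.json; do
    [ -e "$m" ] || continue             # directory empty: glob stays literal
    jq -r '"\(.cite_key)\t\(.title)"' "$m"
  done
}
```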

crossmem compile

Parse a captured PDF into structured wiki chunks with LLM-generated paraphrase and implication.

Usage

crossmem compile <cite_key>

Arguments

| Argument | Description |
|---|---|
| `<cite_key>` | The cite key printed by `crossmem capture`. Example: `vaswani2017attention` |

What it does

  1. Finds the raw PDF and .meta.json for the given cite key in ~/crossmem/raw/
  2. Parses the PDF using Marker (preferred) or pdftotext (fallback)
  3. Splits content into paragraph-level chunks with bounding-box provenance
  4. Computes SHA-256 hash for each chunk’s verbatim text
  5. Sends each chunk to Ollama for paraphrase and implication generation
  6. Generates five citation formats (APA, MLA, Chicago, IEEE, BibTeX)
  7. Emits the final wiki note to ~/crossmem/wiki/<timestamp>_<cite_key>.md

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (cite key not found, Ollama unreachable, parse failure) |

Environment variables

| Variable | Default | Description |
|---|---|---|
| `CROSSMEM_OLLAMA_MODEL` | `llama3.2:3b` | Ollama model used for paraphrase/implication generation |

Example

$ crossmem compile vaswani2017attention
[compile] loading raw PDF for vaswani2017attention
[compile] parsing with Marker (MPS)...
[compile] 47 chunks extracted
[compile] compiling chunk 1/47...
...
[compile] wiki saved to ~/crossmem/wiki/1776227300_vaswani2017attention.md

PDF parsing tiers

| Tier | Parser | When used | Bounding boxes |
|---|---|---|---|
| 0 | `pdftotext -layout` | Fallback when Marker unavailable | No |
| 1 | Marker (MPS) | Default for arXiv papers | Yes (polygon per block) |

The parser tier is recorded in the wiki frontmatter as the parser field.

LLM contract

The LLM (Ollama) is only allowed to generate paraphrase and implication fields. It never touches:

  • Original verbatim text (from PDF extractor)
  • Metadata fields (from reconciler)
  • Citation strings (deterministic generator)
  • Provenance data (from parser)

crossmem verify

Verify chunk integrity by re-hashing verbatim text against stored SHA-256 hashes.

Usage

crossmem verify [cite_key]

Arguments

| Argument | Description |
|---|---|
| `[cite_key]` | Optional. If provided, only verify chunks for this cite key. If omitted, verify all wiki entries. |

What it does

  1. Walks ~/crossmem/wiki/ for all .md files (or the one matching cite_key)
  2. For each wiki entry with a cite_key in frontmatter:
    • Extracts all <!-- chunk id=... --> blocks
    • Finds the text_sha256 in each chunk’s provenance YAML
    • Re-computes SHA-256 from the verbatim quoted text (> ... lines)
    • Reports any mismatches as “DRIFT”
  3. Prints summary: total chunks checked, total drifts detected
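
The heart of the check fits in a few lines. The sketch below assumes the hash covers the quote lines with the `> ` prefix stripped and newlines preserved; the exact canonicalization crossmem uses is not documented here, so treat this as an illustration of the mechanism:

```sh
# Portable SHA-256 of stdin (GNU coreutils sha256sum or macOS shasum).
sha256_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then sha256sum | awk '{print $1}'
  else shasum -a 256 | awk '{print $1}'
  fi
}

# Re-hash the "> " quote lines of a file containing a single chunk.
quote_sha() {  # usage: quote_sha <file>
  sed -n 's/^> //p' "$1" | sha256_stdin
}
```

Comparing the output of quote_sha against the stored text_sha256 flags a drifted quote.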

Exit codes

| Code | Meaning |
|---|---|
| 0 | All chunks verified, no drift |
| 1 | Error or drift detected |

Example

$ crossmem verify vaswani2017attention
[verify] Checking vaswani2017attention

Verified 47 chunks, 0 drift(s) detected.
$ crossmem verify
[verify] Checking vaswani2017attention
[verify] Checking lecun2015deep

Verified 93 chunks, 0 drift(s) detected.

When drift is detected:

DRIFT: vaswani2017attention chunk p4s32c1
  expected: 5f3e1c...
  actual:   a8b2d4...

Verified 47 chunks, 1 drift(s) detected.

crossmem mcp serve

Start the MCP (Model Context Protocol) server on stdio.

Usage

crossmem mcp serve

What it does

Starts an MCP server that communicates over stdin/stdout, providing two tools to any MCP client: crossmem_cite and crossmem_recall.

The server loads wiki entries from ~/crossmem/wiki/ and serves them to the connected client.

Exit codes

| Code | Meaning |
|---|---|
| 0 | Clean shutdown |
| 1 | Server error |

Environment variables

| Variable | Default | Description |
|---|---|---|
| `RUST_LOG` | `warn` | Log level (logs go to stderr, not stdout — stdout is the MCP transport) |

Adding to Claude Code

claude mcp add crossmem -- crossmem mcp serve

This registers crossmem as an MCP server that Claude Code will start automatically.

Adding to Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crossmem": {
      "command": "crossmem",
      "args": ["mcp", "serve"]
    }
  }
}

crossmem serve

Run the HTTP/WebSocket relay bridge that connects CLI tools and agents to the crossmem Chrome extension.

Usage

crossmem serve
crossmem          # 'serve' is the default when no subcommand is given

What it does

Starts an HTTP + WebSocket server on 127.0.0.1:7600 (configurable). The Chrome extension connects via WebSocket; CLI tools and agents send commands via HTTP.

Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/status` | GET | Connection status: connected extensions, pending command count |
| `/command` | POST | Send a command to the extension and wait for its response |
| `/dialog_response` | POST | Send a dialog response back to the extension |
| `/capture` | POST | Screen recording capture handler |
| `/` or `/ws` | WS | Extension WebSocket connection |

Exit codes

| Code | Meaning |
|---|---|
| 0 | Clean shutdown (SIGINT or SIGTERM) |
| 1 | Bind failure (port already in use) |

Environment variables

| Variable | Default | Description |
|---|---|---|
| `BRIDGE_PORT` | `7600` | Port to listen on |
| `RUST_LOG` | `info` | Log level |

Example

$ crossmem serve
[bridge] crossmem-bridge v0.1.0
[bridge] HTTP  → http://127.0.0.1:7600/status
[bridge] HTTP  → http://127.0.0.1:7600/command
[bridge] WS    → ws://127.0.0.1:7600/
[bridge] waiting for extension...

Sending a command

curl -X POST http://127.0.0.1:7600/command \
  -H 'Content-Type: application/json' \
  -d '{"action":"navigate","params":{"url":"https://example.com"}}'

Checking status

curl -s http://127.0.0.1:7600/status | jq .
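
Scripts that drive the bridge usually want to wait for it to come up before POSTing commands. A hedged sketch that only checks whether /status answers at all; it does not parse the response, since the exact JSON shape is not specified here:

```sh
# Poll /status until the bridge answers, or give up after N attempts.
wait_for_bridge() {  # usage: wait_for_bridge <attempts>
  i=0
  while [ "$i" -lt "$1" ]; do
    if curl -fsS --max-time 2 "http://127.0.0.1:${BRIDGE_PORT:-7600}/status" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```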

Chrome extension

The bridge is designed to work with the crossmem Chrome extension. The extension connects via WebSocket and executes commands using chrome.scripting.executeScript.

Supported actions: navigate, click, type, wait, extract, screenshot, summarize, tab_info, ping.

For multi-agent use, add "agentId": "my-agent" to commands to isolate tab control.

MCP Integration

crossmem exposes two tools via the Model Context Protocol (MCP), allowing AI agents to look up citations and search your wiki without leaving the conversation.

Tools

| Tool | Description |
|---|---|
| `crossmem_cite` | Look up a citation by cite key and return it in a specified format |
| `crossmem_recall` | Search the wiki for entries matching a query |

Setup

Claude Code

claude mcp add crossmem -- crossmem mcp serve

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crossmem": {
      "command": "crossmem",
      "args": ["mcp", "serve"]
    }
  }
}

Agent usage prompts

Once crossmem is registered as an MCP server, you can ask your agent things like:

  • “Cite vaswani2017attention in APA format.”
  • “Give me the BibTeX for vaswani2017attention.”
  • “What do I have on attention mechanisms?”
  • “Search my wiki for papers about transformer architectures.”
  • “Find all papers by Vaswani in my library.”

How it works

The MCP server (crossmem mcp serve) runs on stdio. It loads all .md files from ~/crossmem/wiki/ on startup, parses their YAML frontmatter and body, and responds to tool calls by searching this in-memory index.

Logs go to stderr (not stdout), so they don’t interfere with the MCP JSON-RPC transport.

crossmem_cite

Look up a citation by cite key and return it in the requested format.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `cite_key` | string | yes | | Citation key, e.g. `vaswani2017attention` |
| `format` | string | no | `bibtex` | One of: `bibtex`, `apa`, `mla`, `chicago`, `ieee` |

Returns

The formatted citation string extracted from the wiki file’s citation section.

Success

Vaswani, A., & Shazeer, N. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

Cite key not found

If the cite key doesn’t match any wiki entry, returns the top 5 fuzzy matches:

Error: cite_key 'vaswani' not found. Did you mean:
  - vaswani2017attention — Attention Is All You Need

Format not found

If the cite key exists but the wiki file is missing the requested citation section:

Error: cite_key 'vaswani2017attention' found but no APA citation section in wiki file.
File: /Users/you/crossmem/wiki/1776227300_vaswani2017attention.md

Fuzzy matching

When an exact match fails, the tool scores candidates by:

  1. Full cite key substring match (+10)
  2. Full title substring match (+5)
  3. Per-token cite key match (+3 each)
  4. Per-token title match (+2 each)

The top 5 candidates are returned as suggestions.

crossmem_recall

Search the crossmem wiki for entries matching a query. Returns matching excerpts ranked by relevance.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | yes | | Search query string |
| `limit` | integer | no | 5 | Max results to return (capped at 20) |

Returns

A ranked list of matching wiki entries, each with:

  • Index number
  • Cite key and title
  • Section where the match was found
  • Excerpt (up to 400 characters) with surrounding context
  • Deep link to the wiki file

Example response

1. [vaswani2017attention] Attention Is All You Need
   section: p.4 §3.2 Scaled Dot-Product Attention
   excerpt: ...We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension dk...
   link: file:///Users/you/crossmem/wiki/1776227300_vaswani2017attention.md

2. [lecun2015deep] Deep Learning
   section: p.12 §4 Attention Mechanisms
   excerpt: ...Attention mechanisms have become an integral part of sequence modeling...
   link: file:///Users/you/crossmem/wiki/1776300000_lecun2015deep.md

No results

No results for query: 'quantum computing'

Scoring

Results are ranked by total token frequency: each whitespace-delimited query token is counted across the entry’s title and body. Higher count = higher rank.
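
As a toy illustration of that rule (shell, not the actual Rust implementation):

```sh
# Total occurrences of each whitespace-delimited query token in a file.
score() {  # usage: score "query string" <file>
  total=0
  for tok in $1; do                     # unquoted on purpose: split the query into tokens
    n=$(grep -oiF "$tok" "$2" | wc -l)
    total=$((total + n))
  done
  echo "$total"
}
```

Entries with a higher total rank first.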

config.toml

crossmem reads configuration from ~/.crossmem/config.toml. If the file doesn’t exist, it’s created with defaults on first run.

Location

~/.crossmem/config.toml

Full reference

[cite_key]
pattern = "[auth:lower][year][shorttitle:1:nopunct]"

Sections

[cite_key]

| Key | Default | Description |
|---|---|---|
| `pattern` | `[auth:lower][year][shorttitle:1:nopunct]` | Pattern DSL for generating cite keys. See cite_key Pattern DSL. |

Environment variables

These are not in config.toml but affect crossmem’s behavior:

| Variable | Default | Description |
|---|---|---|
| `CROSSMEM_OLLAMA_MODEL` | `llama3.2:3b` | Ollama model for compile pass |
| `BRIDGE_PORT` | `7600` | Bridge server port |
| `RUST_LOG` | `info` (bridge) / `warn` (MCP) | Log level filter |

Data directories

crossmem stores all data under ~/crossmem/:

~/crossmem/
  raw/          # Downloaded PDFs + metadata JSON sidecars
  wiki/         # Compiled wiki notes (markdown)

cite_key Pattern DSL

crossmem generates citation keys using a pattern DSL inspired by Better BibTeX. The pattern is configured in ~/.crossmem/config.toml:

[cite_key]
pattern = "[auth:lower][year][shorttitle:1:nopunct]"

Syntax

A pattern is a string of tokens (inside [brackets]) and literal characters (outside brackets).

[field:modifier1:modifier2]literal_text[field2]

Tokens

| Token | Description | Example output |
|---|---|---|
| `auth` | First author's last name | `Vaswani` |
| `authors` | All authors' last names concatenated | `VaswaniShazeer` |
| `year` | Publication year | `2017` |
| `shorttitle` | First N significant words from title (stop words filtered) | `attention` |
| `title` | Full title | `Attention Is All You Need` |

shorttitle behavior

shorttitle filters out common stop words (a, an, the, is, are, was, for, of, with, …) and takes the first N remaining words. N is specified as a numeric modifier.

Example with title “Attention Is All You Need”:

  • `[shorttitle:1]` → `attention`
  • `[shorttitle:3]` → `attentionneed` (after filtering "Is", "All", "You")
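
The mechanism can be sketched in shell. The stop-word list below is an assumption (only the words quoted above are confirmed), so this is an illustration, not a faithful port:

```sh
# First N significant title words, lowercased and stripped of punctuation.
shorttitle() {  # usage: shorttitle <N> <title words...>
  n="$1"; shift
  stop=" a an the is are was for of with all you and on in to "   # assumed stop-word list
  out=""; count=0
  for w in "$@"; do
    lw=$(printf '%s' "$w" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9')
    case "$stop" in *" $lw "*) continue ;; esac
    out="$out$lw"; count=$((count + 1))
    if [ "$count" -eq "$n" ]; then break; fi
  done
  printf '%s\n' "$out"
}
```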

Modifiers

Modifiers are appended to the token with : separators and applied in order:

| Modifier | Description | Example |
|---|---|---|
| `lower` | Lowercase | `VASWANI` → `vaswani` |
| `upper` | Uppercase | `vaswani` → `VASWANI` |
| `nopunct` | Remove all non-alphanumeric characters | `hello-world!` → `helloworld` |
| `condense` | Remove all whitespace | `hello world` → `helloworld` |
| `N` (digit) | For `shorttitle`: take first N words. For other fields: take first N whitespace-delimited words. | `[shorttitle:1]` → first significant word |

Examples

Default pattern

pattern = "[auth:lower][year][shorttitle:1:nopunct]"
| Paper | Generated key |
|---|---|
| Vaswani et al., "Attention Is All You Need", 2017 | `vaswani2017attention` |
| LeCun et al., "Deep Learning", 2015 | `lecun2015deep` |

All authors

pattern = "[authors:lower][year]"
| Paper | Generated key |
|---|---|
| Vaswani & Shazeer, "Attention Is All You Need", 2017 | `vaswanishazeer2017` |

With literal separator

pattern = "[auth:lower]_[year]"
| Paper | Generated key |
|---|---|
| Vaswani et al., 2017 | `vaswani_2017` |

Full title condensed

pattern = "[title:condense:lower]"
| Paper | Generated key |
|---|---|
| "Attention Is All You Need" | `attentionisallyouneed` |

Multi-word short title

pattern = "[auth:lower][year][shorttitle:3:nopunct]"
| Paper | Generated key |
|---|---|
| Vaswani et al., "Attention Is All You Need", 2017 | `vaswani2017attentionneed` |

Collision resolution

If a generated key collides with an existing entry, crossmem appends a suffix:

  1. Try a through z: vaswani2017attentionvaswani2017attentiona
  2. If all 26 letters exhausted, append _<count>: vaswani2017attention_27
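
The search can be sketched as follows, assuming existing keys are listed one per line in a file; `next_key` is a hypothetical helper, and the counter starting at 27 mirrors the example above:

```sh
# First free key: base, then base+a..z, then base_27, base_28, ...
next_key() {  # usage: next_key <base_key> <existing_keys_file>
  base="$1"; keys="$2"
  grep -qxF "$base" "$keys" || { echo "$base"; return 0; }
  for s in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    grep -qxF "$base$s" "$keys" || { echo "$base$s"; return 0; }
  done
  n=27
  while grep -qxF "${base}_${n}" "$keys"; do n=$((n + 1)); done
  echo "${base}_${n}"
}
```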

Wiki Frontmatter

Every wiki note in ~/crossmem/wiki/ starts with YAML frontmatter between --- delimiters.

Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `cite_key` | string | yes | DSL-generated citation key. Example: `vaswani2017attention` |
| `title` | string | yes | Paper title |
| `authors` | list[string] | yes | List of author names |
| `year` | integer | yes | Publication year |
| `arxiv_id` | string | yes (arXiv) | arXiv identifier, e.g. `1706.03762` |
| `doi` | string | no | DOI (may be preprint DOI) |
| `doi_preprint` | string | no | Preprint DOI (e.g. `10.48550/arXiv.1706.03762`) |
| `doi_published` | string | no | Published version DOI (if paper was published in a journal) |
| `captured_at` | string | yes | Unix timestamp of capture |
| `raw` | string | yes | Path to the raw PDF file |
| `pdf_sha256` | string | yes | SHA-256 hash of the raw PDF bytes |
| `parser` | string | yes | Parser used: `marker`, `pdftotext` |
| `chunks` | integer | yes | Number of chunks in the document |
| `meta.sources` | list[string] | yes | Metadata sources used: `arxiv`, `crossref`, `openalex` |
| `meta.reconciled` | boolean | yes | Whether metadata was cross-verified across sources |
| `meta.warnings` | list[string] | no | Warnings from metadata reconciliation |

Example

---
cite_key: vaswani2017attention
title: "Attention Is All You Need"
authors:
  - "Ashish Vaswani"
  - "Noam Shazeer"
  - "Niki Parmar"
year: 2017
arxiv_id: "1706.03762"
doi: "10.48550/arXiv.1706.03762"
captured_at: "1776227254"
raw: "~/crossmem/raw/1776227254_vaswani2017attention.pdf"
pdf_sha256: "9a8f3b..."
parser: "marker"
chunks: 47
meta:
  sources: ["arxiv", "crossref", "openalex"]
  reconciled: true
  warnings: []
---

Citations section

After the frontmatter, the wiki body starts with a title heading and a ## Citations section containing five subsections:

# Attention Is All You Need

## Citations

### APA
Vaswani, A., & Shazeer, N. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

### MLA
Vaswani, Ashish, et al. "Attention Is All You Need." arXiv preprint arXiv:1706.03762 (2017).

### Chicago
Vaswani, Ashish, and Noam Shazeer. "Attention Is All You Need." arXiv preprint arXiv:1706.03762 (2017).

### IEEE
A. Vaswani et al., "Attention Is All You Need," arXiv preprint arXiv:1706.03762, 2017.

### BibTeX
```bibtex
@article{vaswani2017attention,
  title={Attention Is All You Need},
  author={Ashish Vaswani and Noam Shazeer},
  year={2017}
}
```

Chunk Format

After the citations section, each wiki note contains a series of chunks. Each chunk preserves verbatim text from the source PDF along with provenance metadata.

Chunk structure

<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention".
> The input consists of queries and keys of dimension dk, and values
> of dimension dv.

**Paraphrase:** The authors name their mechanism "Scaled Dot-Product Attention" and define its inputs.

**Implication:** This naming convention becomes the standard terminology used across the field.

```yaml
provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: "5f3e1c..."
  byte_range: [18342, 19104]
```

## Chunk ID format

Chunk IDs follow the pattern `p{page}s{section}c{chunk}`:

| Part | Description | Example |
|------|-------------|---------|
| `p{N}` | Page number | `p4` = page 4 |
| `s{N}` | Section number within page | `s32` = section 3.2 |
| `c{N}` | Chunk number within section | `c1` = first chunk |
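The ID scheme can be sketched as below; treating the `s` component as the section number with its dots stripped (`3.2` → `32`) is an assumption drawn from the example in the table:

```rust
// Sketch of the `p{page}s{section}c{chunk}` ID scheme. The dot-stripping
// of the section number is an assumption based on the `s32` example.
fn chunk_id(page: usize, section: &str, chunk: usize) -> String {
    let sec: String = section.chars().filter(|c| c.is_ascii_digit()).collect();
    format!("p{page}s{sec}c{chunk}")
}

fn main() {
    assert_eq!(chunk_id(4, "3.2", 1), "p4s32c1");
    assert_eq!(chunk_id(1, "1", 1), "p1s1c1");
}
```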

## Fields

### Verbatim text

Lines starting with `> ` contain the original text extracted from the PDF. This text is **never modified by the LLM** — it comes directly from the PDF parser.

### Paraphrase

A 1–2 sentence LLM-generated summary of the chunk's content. Generated by Ollama during `crossmem compile`.

### Implication

A 1–2 sentence LLM-generated statement about why this chunk matters to the field. Generated by Ollama during `crossmem compile`.

### Provenance

YAML metadata block attached to each chunk:

| Field | Type | Description |
|-------|------|-------------|
| `page` | integer | Page number in the source PDF |
| `section` | string | Section heading (if detected by parser) |
| `bbox` | `[f64; 4]` | Bounding box `[x_min, y_min, x_max, y_max]` in PDF coordinates. Present when parsed with Marker. |
| `text_sha256` | string | SHA-256 hash of the verbatim text. Used by `crossmem verify` to detect drift. |
| `byte_range` | `[usize; 2]` | `[start, end]` byte offset in the source PDF content stream. Present when available from parser. |

## Chunk types

The `chunk_type` field (internal) classifies each chunk:

| Type | Description |
|------|-------------|
| `page` | Full-page text (from `pdftotext` fallback) |
| `heading` | Section heading |
| `paragraph` | Body paragraph (from Marker block tree) |
| `figure` | Figure caption |
| `table` | Table content |
| `equation` | Mathematical expression |

## Integrity verification

Run `crossmem verify` to re-hash every chunk's verbatim text and compare against the stored `text_sha256`. Any mismatch indicates the wiki file has been modified since compilation.
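The verification loop can be sketched as follows; the hash function is injected here so the comparison logic stands alone (the real tool computes SHA-256 over the verbatim text):

```rust
// Sketch of drift detection: re-hash each chunk's verbatim text and
// compare against the stored digest. Returns the indices of drifted chunks.
fn detect_drift<F: Fn(&str) -> String>(chunks: &[(String, String)], hash: F) -> Vec<usize> {
    let mut drifted = Vec::new();
    for (i, (text, stored)) in chunks.iter().enumerate() {
        if hash(text.as_str()) != *stored {
            drifted.push(i);
        }
    }
    drifted
}

fn main() {
    // A toy hash stands in for SHA-256 in this sketch.
    let toy_hash = |s: &str| format!("{:x}", s.len());
    let chunks = vec![
        ("unchanged text".to_string(), toy_hash("unchanged text")),
        ("edited text".to_string(), "deadbeef".to_string()),
    ];
    assert_eq!(detect_drift(&chunks, toy_hash), vec![1]);
}
```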

Pipeline Overview

crossmem’s citation pipeline transforms a URL into a structured, verifiable wiki note.

Pipeline diagram

graph TD
    A[crossmem capture URL] --> B[Download PDF]
    B --> C[Fetch arXiv metadata]
    C --> D[Reconcile: CrossRef + OpenAlex]
    D --> E[Generate cite_key via DSL]
    E --> F["Save raw PDF + .meta.json"]

    G[crossmem compile cite_key] --> H[Load raw PDF + metadata]
    H --> I{Marker available?}
    I -->|Yes| J[Marker: paragraph chunks + bbox]
    I -->|No| K[pdftotext: page-level chunks]
    J --> L[Compute SHA-256 per chunk]
    K --> L
    L --> M[Ollama: paraphrase + implication per chunk]
    M --> N[Generate 5 citation formats]
    N --> O["Emit wiki markdown to ~/crossmem/wiki/"]

    P[crossmem verify] --> Q[Walk wiki files]
    Q --> R[Re-hash chunk text]
    R --> S{SHA-256 match?}
    S -->|Yes| T[OK]
    S -->|No| U[DRIFT detected]

    V[crossmem mcp serve] --> W[Load wiki entries]
    W --> X[crossmem_cite: lookup by key]
    W --> Y[crossmem_recall: search by query]

Why capture and compile are separate

capture is lightweight and idempotent: it issues API calls to arXiv, CrossRef, and OpenAlex, downloads the PDF, and writes metadata. You can re-run it to refresh metadata without re-parsing. compile is heavyweight: it invokes Marker (or another PDF parser) and Ollama to produce chunk-level paraphrases and implications. Separating the two lets you swap the PDF parser (Marker → Nougat → GROBID) or change the LLM model without re-downloading anything. It also enables a practical workflow: batch-capture dozens of papers first, then compile them at leisure — or only compile the ones that turn out to be relevant.

Stage details

Capture

  1. URL parsing — extracts arXiv ID from various URL formats (/abs/, /pdf/, bare ID)
  2. PDF download — fetches PDF, computes SHA-256, saves to ~/crossmem/raw/
  3. Metadata fetch — queries arXiv API for title, authors, year
  4. Metadata reconciliation — cross-checks against CrossRef (via DOI) and OpenAlex. Flags disagreements as warnings in frontmatter.
  5. Cite key generation — applies the configured pattern DSL to the reconciled metadata

Compile

  1. PDF parsing — Marker (with MPS acceleration) produces paragraph-level blocks with bounding-box coordinates. Falls back to pdftotext -layout for page-level extraction.
  2. Chunk assembly — blocks are grouped into typed chunks (paragraph, heading, figure, table, equation) with unique IDs
  3. Provenance — each chunk gets page, section, bbox, SHA-256, and byte range
  4. LLM pass — Ollama generates paraphrase and implication for each chunk. The LLM never sees or modifies the original text.
  5. Citation generation — deterministic formatting into APA, MLA, Chicago, IEEE, BibTeX
  6. Emission — final wiki markdown written to ~/crossmem/wiki/

Verify

Walks all wiki files, re-extracts verbatim text from > blockquote lines, re-computes SHA-256, and compares against the stored text_sha256 in provenance blocks. Reports any mismatch as drift.

MCP serve

Loads wiki entries into memory, exposes crossmem_cite (lookup by cite key with fuzzy matching) and crossmem_recall (full-text search with relevance ranking) over stdio MCP transport.

Data Model

Core types

ReconciledMetadata

The metadata reconciler merges data from multiple sources into a single canonical record.

#![allow(unused)]
fn main() {
pub struct ReconciledMetadata {
    pub title: String,
    pub authors: Vec<String>,
    pub year: u16,
    pub arxiv_id: String,
    pub doi: Option<String>,
    pub doi_preprint: Option<String>,
    pub doi_published: Option<String>,
    pub sources: Vec<String>,       // e.g. ["arxiv", "crossref", "openalex"]
    pub warnings: Vec<String>,
    pub reconciled: bool,
}
}

ChunkV2

The paragraph-level chunk with full provenance.

#![allow(unused)]
fn main() {
pub struct ChunkV2 {
    pub chunk_type: String,         // "page", "heading", "paragraph", etc.
    pub chunk_id: String,           // e.g. "p1s1c1"
    pub page: usize,
    pub text: String,               // Verbatim extracted text
    pub provenance: Provenance,
    pub paraphrase: Option<String>, // LLM-generated
    pub implication: Option<String>,// LLM-generated
}
}

Provenance

Tracks exactly where a chunk came from in the source PDF.

#![allow(unused)]
fn main() {
pub struct Provenance {
    pub page: usize,
    pub section: Option<String>,
    pub bbox: Option<[f64; 4]>,     // [x_min, y_min, x_max, y_max]
    pub text_sha256: String,
    pub byte_range: Option<[usize; 2]>,
}
}

WikiEntry (MCP)

The in-memory representation used by the MCP server.

#![allow(unused)]
fn main() {
struct WikiEntry {
    cite_key: Option<String>,
    title: String,
    authors: Vec<String>,
    year: Option<u16>,
    source: Option<String>,
    date: Option<String>,
    file_path: PathBuf,
    body: String,
}
}

Storage layout

~/crossmem/
├── raw/                                    # Capture output
│   ├── <timestamp>_<cite_key>.pdf          # Raw PDF
│   └── <timestamp>_<cite_key>.meta.json    # Reconciled metadata
└── wiki/                                   # Compile output
    └── <timestamp>_<cite_key>.md           # Wiki note

Trust boundaries

| Data | Source | Verifiable? |
|------|--------|-------------|
| Title, authors, year, DOI | Metadata reconciler (arXiv + CrossRef + OpenAlex) | Cross-source agreement |
| Cite key, citation strings | Deterministic generator | Pure function, unit-tested |
| Verbatim quote text | PDF extractor (Marker / pdftotext) | SHA-256 hash |
| Bounding box, byte range | PDF extractor | Re-extraction reproducibility |
| Paraphrase, implication | LLM (Ollama) | Not verifiable — advisory only |

Chunk-based Citation v2 Design

Status: Implemented (Phase 2 MVP shipped) Date: 2026-04-15


User requirement

How do we ensure citations are absolutely correct — foolproof, with zero chance of error?

One-line answer: Verbatim text + bbox provenance is ground truth; LLM only touches paraphrase/implication, never quotes; metadata is cross-verified across ≥2 canonical sources.

Competitor survey

| Tool | What it nails | What it misses |
|------|---------------|----------------|
| Zotero + Better BibTeX | Stable cite_key via JS-ish pattern DSL; key regeneration rules; 80%+ academic mind-share | No chunk/page content; just a metadata container |
| Marker (datalab-to/marker) | PDF→markdown + polygon bbox per block, `--keep_chars` for char-level bboxes, JSON tree per page | Slower than pdftotext; needs CUDA/MPS |
| Nougat | Transformer-based; beats GROBID on formulas | VLM → hallucination risk on quote fidelity |
| GROBID | 68 fine-grained TEI labels; best on metadata + bibliography refs; 2–5 s/page, 90%+ accuracy | Weak on formulas, figures, modern layouts |
| PaperQA2 | Chunk size configurable; LLM re-rank + contextual summarization; grounded in-text citations | No bbox; chunk = N-char sliding window → page/fragment precision lost |
| Tensorlake RAG | Anchor tokens `<c>2.1</c>` inlined + bbox stored separately → auditable trail | Proprietary pipeline; design pattern is copyable |
| OpenAlex / CrossRef / Semantic Scholar | Each is a canonical metadata source | Each has gaps; must cross-reconcile |

The industry gold standard for “absolutely correct citation”:

  1. Parse once with bbox-aware extractor (Marker-class) → each block has {page, polygon, text}.
  2. Anchor tokens inlined at chunk build time (<c>p4§3.2</c>) so LLM can only emit citation IDs it saw in context.
  3. Resolve citation IDs → bbox + page at render time; users get deep-link to the exact PDF region.
  4. Metadata cross-check across OpenAlex + CrossRef + arXiv; flag inconsistencies instead of silently picking one.
  5. Quote is verbatim from the PDF text layer, stored with SHA-256 of the source bytes — any LLM-generated “quote” is rejected.

What Phase 1 got right / wrong

Right: pre-gen APA/MLA/Chicago/IEEE/BibTeX, deterministic cite_key, per-page original text preserved verbatim, paraphrase/implication separated from quote.

Wrong / gap:

  • Metadata only from arXiv API (no CrossRef/OpenAlex cross-check)
  • Quote preservation is page-level, not paragraph/sentence
  • No bbox — can’t deep-link into PDF region
  • No hash-based verifiability
  • cite_key = primitive pattern vs Better BibTeX DSL
  • No handling of preprint→published DOI mapping

Phase 2 architecture

2A. Metadata layer (the cite_key + bib trust root)

Pipeline:

arxiv_id → [arxiv API]  ┐
        → [CrossRef]    ├─→ reconcile → canonical metadata
        → [OpenAlex]    ┘                   │
                                            ├─→ cite_key (Better-BibTeX-style pattern, configurable)
                                            ├─→ 5 formats (APA/MLA/Chicago/IEEE/BibTeX)
                                            └─→ DOI + published-version DOI (if preprint)

Rules:

  • ≥2 sources must agree on title + first-author + year. Disagreement → emit meta.warnings in frontmatter.
  • cite_key pattern DSL (ported from Better BibTeX): [auth:lower][year][shorttitle:1:nopunct], configurable via ~/.crossmem/config.toml.
  • Track preprint↔published mapping in meta.doi_preprint and meta.doi_published.
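The ≥2-source agreement rule might look like the following sketch; the struct and field names are illustrative, and the real reconciler also compares first authors and normalizes more aggressively:

```rust
// Illustrative record of what each metadata source reports.
struct SourceRecord {
    source: &'static str,
    title: String,
    year: u16,
}

// Reconciled when at least two sources agree on (normalized title, year);
// every outlier source produces a warning instead of being silently dropped.
fn reconcile(records: &[SourceRecord]) -> (bool, Vec<String>) {
    let norm = |s: &str| s.to_lowercase();
    let agreeing = |r: &SourceRecord| {
        records
            .iter()
            .filter(|o| norm(&o.title) == norm(&r.title) && o.year == r.year)
            .count()
    };
    let reconciled = records.iter().any(|r| agreeing(r) >= 2);
    let warnings = records
        .iter()
        .filter(|r| agreeing(r) < 2)
        .map(|r| format!("{}: title/year disagrees with other sources", r.source))
        .collect();
    (reconciled, warnings)
}

fn main() {
    let records = vec![
        SourceRecord { source: "arxiv", title: "Attention Is All You Need".into(), year: 2017 },
        SourceRecord { source: "crossref", title: "Attention Is All You Need".into(), year: 2017 },
        SourceRecord { source: "openalex", title: "Attention Is All You Need".into(), year: 2023 },
    ];
    let (reconciled, warnings) = reconcile(&records);
    assert!(reconciled);
    assert_eq!(warnings, vec!["openalex: title/year disagrees with other sources"]);
}
```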

2B. PDF parsing layer (the chunk trust root)

Tiered strategy by document type + quality tier:

| Tier | Parser | Use when | Bbox? | Speed |
|------|--------|----------|-------|-------|
| 0 | pdftotext -layout | Fallback / pure text | No | instant |
| 1 | Marker (Mac MPS) | Default for arXiv | Yes, polygon/block | 1–3 s/page |
| 2 | GROBID (JVM, local) | Bib references + structured metadata | Yes, TEI | 2–5 s/page |
| 3 | Nougat (MPS) | Formula-heavy pages | Partial | 5–15 s/page |

Phase 2 default: Marker for the body + GROBID for the bibliography; both run, and their outputs merge into a unified chunk tree.

2C. Chunk schema v2 (bbox + hash provenance)

---
cite_key: vaswani2017attention
meta:
  sources: [arxiv, crossref, openalex]
  reconciled: true
  warnings: []
  pdf_sha256: 9a8f...
...
---

## p.4 §3.2 Scaled Dot-Product Attention

<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention"...

provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: 5f3e1c...
  byte_range: [18342, 19104]

**Paraphrase:** …
**Implication:** …
  • text_sha256 = SHA-256 of the verbatim extracted text. Re-running the extractor must reproduce it, else the chunk is flagged stale.
  • bbox + page = deep-link target: crossmem://pdf/{cite_key}#p=4&bbox=72,340,523,412.
  • byte_range = PDF content-stream offset (from Marker); cheapest way to re-verify without re-extraction.
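A sketch of the deep-link construction, assuming coordinates are truncated to whole points as in the example URL:

```rust
// Sketch of the crossmem:// deep-link scheme quoted above; bbox values
// are truncated to whole points, matching `#p=4&bbox=72,340,523,412`.
fn deep_link(cite_key: &str, page: usize, bbox: [f64; 4]) -> String {
    let [x0, y0, x1, y1] = bbox.map(|v| v as i64); // truncate fractional points
    format!("crossmem://pdf/{cite_key}#p={page}&bbox={x0},{y0},{x1},{y1}")
}

fn main() {
    assert_eq!(
        deep_link("vaswani2017attention", 4, [72.0, 340.5, 523.8, 412.1]),
        "crossmem://pdf/vaswani2017attention#p=4&bbox=72,340,523,412"
    );
}
```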

2D. LLM contract (what model is / isn’t allowed to touch)

| Field | Who writes | Verifiable? |
|-------|-----------|-------------|
| title, authors, year, doi, arxiv_id | Metadata reconciler | Cross-source check |
| cite_key, 5 citation strings | Deterministic generator | Pure function, unit-tested |
| original (the quote) | PDF extractor | SHA-256 + byte_range |
| paraphrase, implication | LLM | Never trusted for provenance |
| figure.caption | PDF extractor | bbox + OCR of caption only |
| figure.implication | LLM | Same rule: advisory text only |

The pipeline never asks the LLM to produce a quote. If a future feature wants “the key sentence on this page”, the LLM picks a sentence index from a numbered list of extracted sentences; it never emits the sentence text itself.
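The index-only contract can be sketched directly: the model returns an integer, and the verbatim sentence is looked up from extractor output, so an out-of-range or fabricated answer yields nothing rather than a fake quote. Function and parameter names here are hypothetical:

```rust
// The LLM may only choose an index into the extractor's sentence list;
// any out-of-range choice yields None instead of fabricated text.
fn quote_by_index<'a>(sentences: &[&'a str], llm_choice: usize) -> Option<&'a str> {
    sentences.get(llm_choice).copied()
}

fn main() {
    let sentences = ["First sentence.", "Key claim.", "Closing remark."];
    assert_eq!(quote_by_index(&sentences, 1), Some("Key claim."));
    assert_eq!(quote_by_index(&sentences, 9), None);
}
```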

2E. Paragraph- and figure-level chunking

  • Paragraph splitter: Marker’s block tree → paragraph-typed blocks become chunks (not pages).
  • Figure chunks: Marker figure blocks → crop image to raw/figs/{cite_key}_fig{N}.png, caption extracted separately, implication runs on caption-only.
  • Table chunks: Marker table block → markdown-table format, implication on markdown text.
  • Equation chunks: Nougat output in LaTeX, stored as $$…$$, implication on LaTeX source.

2F. Idempotence + re-compile

  • Re-running capture is idempotent on arxiv_id: re-downloads only if pdf_sha256 differs.
  • Re-running compile re-does LLM pass only for chunks whose text_sha256 changed.
  • crossmem verify <cite_key> walks the wiki, re-extracts, re-hashes; reports any mismatches.
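The selective re-compile rule reduces to a per-chunk hash comparison; this sketch uses illustrative types (chunk ID paired with its `text_sha256`):

```rust
use std::collections::HashMap;

// Sketch: re-run the LLM pass only for chunks whose stored hash differs
// from the current one, or which did not exist in the previous compile.
fn chunks_to_recompile(
    current: &[(String, String)],       // (chunk_id, text_sha256)
    previous: &HashMap<String, String>, // chunk_id -> stored text_sha256
) -> Vec<String> {
    current
        .iter()
        .filter(|(id, hash)| previous.get(id) != Some(hash))
        .map(|(id, _)| id.clone())
        .collect()
}

fn main() {
    let previous: HashMap<_, _> = [
        ("p1s1c1".to_string(), "aaa".to_string()),
        ("p1s1c2".to_string(), "bbb".to_string()),
    ]
    .into_iter()
    .collect();
    let current = vec![
        ("p1s1c1".to_string(), "aaa".to_string()), // unchanged: skip
        ("p1s1c2".to_string(), "ccc".to_string()), // edited: recompile
        ("p1s1c3".to_string(), "ddd".to_string()), // new: recompile
    ];
    assert_eq!(chunks_to_recompile(&current, &previous), vec!["p1s1c2", "p1s1c3"]);
}
```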

Implementation order

  1. Metadata reconciler (arxiv + crossref + openalex merge, warnings on disagreement)
  2. cite_key pattern DSL (Better-BibTeX-style, unit-tested)
  3. Marker integration via uvx marker-pdf CLI (Python sidecar; Rust drives via subprocess + JSON)
  4. Chunk schema v2 writer (paragraph/figure/table/equation chunks with bbox + hash)
  5. GROBID on-demand for bibliography references
  6. crossmem verify command
  7. Nougat sidecar for math-heavy pages (opt-in)

What this buys the user

Writing a paper citing Vaswani 2017 p.4 §3.2:

Before (Phase 1): User opens wiki, sees page-4 summary paragraph, pastes bibtex. May still need to open PDF to find exact sentence.

After (Phase 2):

  • Wiki shows §3.2 as a dedicated chunk with verbatim quote.
  • Clicking the provenance block opens the PDF at page 4 with the bbox highlighted.
  • Cite key vaswani2017attention is guaranteed stable across the arXiv preprint → published NeurIPS version.
  • Running crossmem verify weekly confirms that no wiki note has silently drifted from its PDF source.

YouTube Ingestion Pipeline — Design Document

Status: Draft Author: crossmem team Date: 2026-04-15 Tracking: crossmem-rs#27


1. Overview

Extend crossmem capture <url> to detect youtube.com / youtu.be hosts and dispatch to a YouTube-specific pipeline that produces time-aligned wiki chunks — the video analog of the PDF chunk pipeline from #24.

The pipeline runs entirely local on an Apple Silicon Mac mini (M2/M4). No cloud APIs.

Pipeline stages

capture (download + extract audio/subs)
  → transcribe (whisper.cpp Metal)
  → keyframes (ffmpeg scene-cut)
  → OCR + VLM caption (per keyframe)
  → compile (Ollama paraphrase/implication per chunk)
  → emit wiki markdown

2. Download Path

Decision: yt-dlp binary

| Option | Pros | Cons |
|--------|------|------|
| yt-dlp binary | Battle-tested, handles every edge case, active community, `--cookies-from-browser` for member-only | External dep, Python-based, updates frequently |
| libyt-dlp bindings | Tighter integration | No stable C API; Python FFI is fragile |
| youtube-rs (pure Rust) | No external dep | Incomplete, breaks on YT changes, no auth, no live/shorts |

yt-dlp wins because YouTube aggressively rotates extraction logic. Maintaining a pure-Rust extractor is a full-time job. yt-dlp is the industry standard for a reason.

Edge cases handled by yt-dlp flags

| Scenario | yt-dlp flags |
|----------|-------------|
| Age-gated | `--cookies-from-browser chrome` (reads real Chrome cookies) |
| Member-only | Same cookie approach; user must be logged in |
| Live streams | `--live-from-start --wait-for-video 30` (wait + download from start) |
| Shorts | Works as normal URLs (`youtube.com/shorts/ID` → standard extraction) |
| Playlists | `--yes-playlist` or `--no-playlist` (user flag; default: single video) |
| Chapters | `--embed-chapters` + `--write-info-json` (chapter list in info JSON) |
| Auto captions | `--write-auto-subs --sub-lang en` |
| Human captions | `--write-subs --sub-lang en` (preferred over auto when available) |

Download command template

yt-dlp \
  --format "bestaudio[ext=m4a]/bestaudio/best" \
  --extract-audio --audio-format wav --audio-quality 0 \
  --write-info-json \
  --write-subs --write-auto-subs --sub-lang "en.*" --sub-format vtt \
  --embed-chapters \
  --cookies-from-browser chrome \
  --output "%(id)s.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"

For keyframe extraction we also need the video file:

yt-dlp \
  --format "bestvideo[height<=1080][ext=mp4]/bestvideo[height<=1080]/best" \
  --write-info-json \
  --output "%(id)s_video.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"

3. Audio Extraction → Transcription

Decision: whisper.cpp with Metal acceleration, large-v3-turbo model

| Engine | Backend | Speed (1h audio, M2) | Accuracy | Notes |
|--------|---------|----------------------|----------|-------|
| whisper.cpp | Metal (Apple GPU) | ~6–8 min | WER ~8% (large-v3-turbo) | C/C++, no Python, `--print-timestamps` for word-level |
| whisper-mlx | MLX (Apple GPU) | ~5–7 min | Same models | Python dep, MLX framework, slightly faster on M4 |
| WhisperKit | CoreML | ~5–6 min | Good | Swift-only, harder to call from Rust |
| insanely-fast-whisper | MPS (PyTorch) | ~10–15 min | Same models | Heavy Python stack, MPS less optimized than Metal |
| faster-whisper | CTranslate2 (CPU) | ~15–25 min | Same models | No Metal/MPS; CPU-only on macOS |

whisper.cpp wins because:

  1. Native Metal acceleration — no Python runtime
  2. Easily called from Rust via std::process::Command (same pattern as pdftotext in cite.rs)
  3. Outputs VTT/SRT/JSON with word-level timestamps
  4. Active project, models available via Hugging Face in ggml format

Model choice: large-v3-turbo

| Model | Params | VRAM | Disk | Speed (M2, 1h) | WER (en) |
|-------|--------|------|------|----------------|----------|
| large-v3 | 1.55B | ~3 GB | 3.1 GB | ~12 min | ~7.5% |
| large-v3-turbo | 809M | ~1.6 GB | 1.6 GB | ~6 min | ~8% |
| distil-large-v3 | 756M | ~1.5 GB | 1.5 GB | ~5 min | ~9% |

large-v3-turbo is the sweet spot: half the VRAM of large-v3, nearly the same WER, 2× faster. distil-large-v3 is marginally faster but has slightly worse accuracy on non-native English speakers (common in academic talks).

Transcription command

whisper-cpp \
  --model models/ggml-large-v3-turbo.bin \
  --file "$HOME/crossmem/raw/youtube/${VIDEO_ID}.wav" \
  --output-vtt \
  --output-json \
  --print-timestamps \
  --language en \
  --threads 4

Caption priority

  1. Human-uploaded subtitles (.en.vtt from yt-dlp) — highest quality, use as-is
  2. whisper.cpp transcription — always run for timestamp alignment even if subs exist
  3. Auto-generated YouTube captions — fallback only; lower quality than whisper

When human subs exist, align them with whisper timestamps for precise time-coding.

Speaker diarization

Decision: Skip for P1, add in P3 if needed.

Rationale:

  • Most YouTube content crossmem targets is solo presenter (lectures, conference talks, tutorials)
  • pyannote requires Python + HF token + ~2 GB model; adds significant complexity
  • sherpa-onnx is lighter but diarization accuracy on overlapping speech is still mediocre
  • Can retrofit later: diarization produces (speaker_id, start, end) segments that merge with existing transcript chunks

If multi-speaker content becomes common, P3 can add pyannote 3.1 with speaker embedding.


4. Visual Understanding

4a. Keyframe extraction

Decision: ffmpeg scene-cut detection

ffmpeg -i "${VIDEO_ID}_video.mp4" \
  -vf "select='gt(scene,0.3)',showinfo" \
  -vsync vfr \
  -frame_pts 1 \
  "${OUTPUT_DIR}/keyframe_%04d.png" \
  2>&1 | grep "pts_time" > "${OUTPUT_DIR}/keyframe_times.txt"

| Method | Pros | Cons |
|--------|------|------|
| ffmpeg scene filter | Zero extra deps, timestamp-aware, tunable threshold | May over/under-extract |
| TransNetV2 | ML-based, higher accuracy | Python + PyTorch dep, overkill for slides |
| PySceneDetect | Good API | Python dep |

ffmpeg is already a required dependency (for audio extraction). Scene threshold 0.3 works well for slide-based content; can tune per-video.

Chapter-aware extraction: If the info JSON contains chapters, also extract one keyframe per chapter boundary (seek to chapter_start + 2s). Merge with scene-cut keyframes, deduplicate within 5s window.

Target: 1 keyframe per 30–120 seconds depending on content type. Cap at 200 keyframes per video.
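The merge-and-deduplicate step can be sketched as follows; the 2 s chapter offset, 5 s window, and 200-frame cap come from the text above, while the function name is hypothetical:

```rust
// Sketch: combine scene-cut and chapter-boundary timestamps (seconds),
// drop any frame within 5 s of an earlier one, and cap at 200 keyframes.
fn merge_keyframes(mut scene_cuts: Vec<f64>, chapter_starts: Vec<f64>) -> Vec<f64> {
    // Chapter keyframes are taken 2 s after the boundary, per the text above.
    scene_cuts.extend(chapter_starts.iter().map(|t| t + 2.0));
    scene_cuts.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mut merged: Vec<f64> = Vec::new();
    for t in scene_cuts {
        if merged.last().map_or(true, |last| t - last >= 5.0) {
            merged.push(t);
        }
    }
    merged.truncate(200); // hard cap per video
    merged
}

fn main() {
    // The chapter at 8 s yields a keyframe at 10 s, which deduplicates
    // against the scene cut at 10 s; 3 s is within 5 s of 0 s and drops.
    assert_eq!(merge_keyframes(vec![0.0, 3.0, 10.0], vec![8.0]), vec![0.0, 10.0]);
}
```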

4b. Per-keyframe VLM caption

Decision: Qwen2.5-VL-7B via Ollama (local)

Ollama already supports multimodal models. The existing Ollama integration in cite.rs targets http://localhost:11434/api/generate — the same endpoint accepts image input with "images": [base64_png].

{
  "model": "qwen2.5-vl:7b",
  "prompt": "Describe this video frame in one sentence. If it contains a slide, list the title and key bullet points.",
  "images": ["<base64_keyframe>"],
  "stream": false
}

| Model | VRAM | Speed (per frame, M2) | Quality |
|-------|------|----------------------|---------|
| Qwen2.5-VL-7B (q4_K_M) | ~5 GB | ~3–5 sec | Good for slides/diagrams |
| LLaVA-1.6-7B | ~5 GB | ~3–5 sec | Slightly worse on text-heavy slides |
| Qwen2.5-VL-3B | ~2.5 GB | ~1–2 sec | Faster but misses fine text |

Qwen2.5-VL-7B is the best local VLM for slide/diagram content. 7B quantized fits comfortably alongside whisper on M2 (16 GB unified memory).

Batching: Process keyframes sequentially (VLM needs full GPU). At ~4 sec/frame × 100 frames = ~7 min. Acceptable.

4c. OCR on slides

Decision: Apple Vision framework via swift-ffi (primary), Tesseract (fallback)

| Engine | Accuracy | Speed | Dependencies |
|--------|----------|-------|--------------|
| Apple Vision (VNRecognizeTextRequest) | Excellent, especially printed text | ~0.1s/image | macOS 13+, Swift FFI |
| PaddleOCR | Very good, multi-language | ~0.3s/image | Python + large model |
| Tesseract | Good for English | ~0.5s/image | `brew install tesseract` |

Apple Vision is the clear winner on macOS: built-in, fast, accurate, no extra deps. Access from Rust via a tiny Swift CLI helper:

// crossmem-ocr (Swift CLI, ~30 lines)
import Vision
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
// ... read image, perform request, print results as JSON

Compile as crossmem-ocr binary, call from Rust via Command::new("crossmem-ocr"). Ship as part of the crossmem install or build from source on first run.

Fallback: If the Swift helper isn’t available (Linux compat someday), fall back to tesseract --oem 1 -l eng.


5. Chunk Schema

Time-aligned chunk (parallel to CompiledChunk in cite.rs)

#![allow(unused)]
fn main() {
pub struct YouTubeChunk {
    pub start_ms: u64,
    pub end_ms: u64,
    pub speaker: Option<String>,       // None until diarization (P3)
    pub transcript: String,            // Whisper or human-sub text for this segment
    pub slide_ocr: Option<String>,     // OCR text if keyframe in this time range
    pub keyframe_path: Option<String>, // Relative path to keyframe PNG
    pub keyframe_caption: Option<String>, // VLM description of keyframe
    pub paraphrase: String,            // LLM-generated 1-2 sentence summary
    pub implication: String,           // LLM-generated field impact
}
}

Chunk boundaries

Priority order for segmentation:

  1. Chapters (from info JSON) — if present, each chapter = one chunk
  2. Scene cuts — if no chapters, split at scene-cut boundaries
  3. Fixed window — fallback: 60-second segments with sentence-boundary snapping

Within a chapter, if the chapter exceeds 5 minutes, sub-split at scene cuts or 60s intervals.

Minimum chunk: 10 seconds. Maximum chunk: 5 minutes (force-split at sentence boundary).
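The priority order reduces to a small selection function; sentence-boundary snapping and the 10 s / 5 min clamps are omitted from this sketch:

```rust
// Sketch of the segmentation priority: chapters if present, else scene
// cuts, else fixed 60-second windows. Times are chunk start offsets in
// seconds; the clamping rules from the text above are left out.
fn chunk_boundaries(duration_sec: f64, chapters: &[f64], scene_cuts: &[f64]) -> Vec<f64> {
    if !chapters.is_empty() {
        return chapters.to_vec();
    }
    if !scene_cuts.is_empty() {
        return scene_cuts.to_vec();
    }
    // Fallback: fixed 60-second windows.
    let mut t = 0.0;
    let mut out = Vec::new();
    while t < duration_sec {
        out.push(t);
        t += 60.0;
    }
    out
}

fn main() {
    assert_eq!(chunk_boundaries(150.0, &[], &[]), vec![0.0, 60.0, 120.0]);
    assert_eq!(chunk_boundaries(150.0, &[0.0, 45.0], &[10.0]), vec![0.0, 45.0]);
}
```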

Metadata struct

#![allow(unused)]
fn main() {
pub struct YouTubeMetadata {
    pub title: String,
    pub channel: String,
    pub upload_date: String,         // YYYY-MM-DD
    pub video_id: String,
    pub duration_sec: u64,
    pub chapters: Vec<Chapter>,      // from info JSON
    pub description: String,
    pub tags: Vec<String>,
}

pub struct Chapter {
    pub title: String,
    pub start_sec: f64,
    pub end_sec: f64,
}
}

Cite key

{channel_slug}{year}{first_noun_of_title}

Examples:

  • 3Blue1Brown, “But what is a neural network?” (2017) → 3blue1brown2017neural
  • Andrej Karpathy, “Let’s build GPT from scratch” (2023) → karpathy2023gpt
  • Two Minute Papers, “OpenAI Sora” (2024) → twominutepapers2024sora

channel_slug = channel name lowercased, non-alphanumeric stripped, truncated to 20 chars.

Each chunk carries a provenance URL:

https://youtu.be/{VIDEO_ID}?t={floor(start_ms / 1000)}
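Both rules are simple enough to sketch directly; `channel_slug` and `provenance_url` are hypothetical helper names:

```rust
// Channel slug per the rule above: lowercase, strip non-alphanumerics,
// truncate to 20 characters.
fn channel_slug(channel: &str) -> String {
    channel
        .to_lowercase()
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .take(20)
        .collect()
}

// Timestamped provenance URL; integer division floors start_ms to seconds.
fn provenance_url(video_id: &str, start_ms: u64) -> String {
    format!("https://youtu.be/{}?t={}", video_id, start_ms / 1000)
}

fn main() {
    assert_eq!(channel_slug("3Blue1Brown"), "3blue1brown");
    assert_eq!(channel_slug("Two Minute Papers"), "twominutepapers");
    assert_eq!(provenance_url("aircAruvnKk", 92_500), "https://youtu.be/aircAruvnKk?t=92");
}
```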

6. Citation Formats

APA 7th (online video)

{Channel} [{Channel}]. ({Year}, {Month} {Day}). {Title} [Video]. YouTube. https://www.youtube.com/watch?v={VIDEO_ID}

Example:

3Blue1Brown [3Blue1Brown]. (2017, October 5). But what is a neural network? [Video]. YouTube. https://www.youtube.com/watch?v=aircAruvnKk

MLA 9th

"{Title}." YouTube, uploaded by {Channel}, {Day} {Month} {Year}, www.youtube.com/watch?v={VIDEO_ID}.

Chicago 17th (note-bibliography)

{Channel}. "{Title}." {Month} {Day}, {Year}. Video, {Duration}. https://www.youtube.com/watch?v={VIDEO_ID}.

IEEE

{Channel}, "{Title}," YouTube. [Online Video]. Available: https://www.youtube.com/watch?v={VIDEO_ID}. [Accessed: {Access Date}].

BibTeX

@misc{cite_key,
  author = {{Channel}},
  title = {{Title}},
  year = {Year},
  month = {Month},
  howpublished = {\url{https://www.youtube.com/watch?v=VIDEO_ID}},
  note = {[Video]. YouTube. Accessed: YYYY-MM-DD}
}

7. Wiki Markdown Output

Follows the same structure as the ArXiv wiki notes. Example:

---
cite_key: 3blue1brown2017neural
title: "But what is a neural network?"
channel: "3Blue1Brown"
upload_date: "2017-10-05"
video_id: "aircAruvnKk"
duration_sec: 1140
captured_at: "1776300000"
raw: "~/crossmem/raw/youtube/aircAruvnKk.wav"
chunks: 12
source_type: youtube
---

# But what is a neural network?

## Citations

### APA
...

## Chunks

### 00:00–01:32 — Chapter: Introduction

> [Transcript text, first 400 chars...]

**Slide OCR:** [if keyframe present]

**Keyframe:** `keyframes/aircAruvnKk_0042.png` — "A diagram showing..."

**Paraphrase:** ...

**Implication:** ...

**Source:** [00:00](https://youtu.be/aircAruvnKk?t=0)

8. Orchestration

Decision: Same binary, new module youtube.rs

The existing crossmem capture <url> dispatches on URL. Add host detection:

#![allow(unused)]
fn main() {
// main.rs capture dispatch
if url.contains("arxiv.org") {
    cite::cmd_capture(url).await
} else if url.contains("youtube.com") || url.contains("youtu.be") {
    youtube::cmd_capture(url).await
} else {
    // future: generic handler
}
}

Module structure

src/
  cite.rs          # existing arxiv pipeline (unchanged)
  youtube.rs       # new: YouTube capture + compile
  youtube/
    download.rs    # yt-dlp wrapper
    transcribe.rs  # whisper.cpp wrapper
    keyframe.rs    # ffmpeg scene-cut + chapter extraction
    ocr.rs         # Apple Vision / tesseract wrapper
    vlm.rs         # Ollama multimodal (Qwen2.5-VL) wrapper
    chunk.rs       # Segmentation + chunk assembly
    emit.rs        # Wiki markdown emission
  shared/
    ollama.rs      # Extract from cite.rs — shared Ollama client
    formats.rs     # Citation format builders (generalized)

Shared Ollama code: Factor compile_page_chunk and the HTTP client into shared/ollama.rs. Both cite.rs and youtube.rs call it. The prompt template differs (page text vs transcript chunk), but the HTTP plumbing is identical.

Two-stage flow (same as arxiv)

crossmem capture <youtube-url>
  → downloads audio + video + subs + info JSON
  → extracts metadata, generates cite_key
  → saves to ~/crossmem/raw/youtube/{video_id}/
  → prints cite_key for next step

crossmem compile <cite_key>
  → detects source_type (arxiv vs youtube) from meta JSON
  → runs transcription (whisper.cpp)
  → runs keyframe extraction (ffmpeg)
  → runs OCR + VLM caption per keyframe
  → runs Ollama compile per chunk (paraphrase + implication)
  → emits wiki markdown to ~/crossmem/wiki/

9. Dependency Install UX

Decision: Error with one-liner install instructions on first run

Auto-installing is tempting but violates principle of least surprise. Instead:

$ crossmem capture https://youtube.com/watch?v=abc123

ERROR: missing required dependencies for YouTube ingestion:
  ✗ yt-dlp       — brew install yt-dlp
  ✗ ffmpeg        — brew install ffmpeg
  ✓ whisper.cpp   — found at /opt/homebrew/bin/whisper-cpp

Install all missing:
  brew install yt-dlp ffmpeg

Then retry: crossmem capture https://youtube.com/watch?v=abc123

Check order: which yt-dlp && which ffmpeg && which whisper-cpp (or whisper depending on install method).

whisper.cpp model download: If binary exists but model is missing:

Model not found. Download large-v3-turbo (~1.6 GB):
  curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
    https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

crossmem-ocr Swift helper: Build from source on first YouTube capture:

$ swift build -c release --package-path ./tools/crossmem-ocr

Or ship a pre-built binary in releases.


10. Cost Model (All Local)

Estimated wall-clock for M2 Mac mini (16 GB)

| Stage | 1h video | 30 min video | 3h video |
|-------|----------|--------------|----------|
| yt-dlp download (audio + video) | ~2 min | ~1 min | ~5 min |
| whisper.cpp transcription | ~6 min | ~3 min | ~18 min |
| ffmpeg keyframe extraction | ~1 min | ~30 sec | ~3 min |
| OCR per keyframe (~80 frames) | ~8 sec | ~4 sec | ~20 sec |
| VLM caption per keyframe | ~5 min | ~2.5 min | ~15 min |
| Ollama compile per chunk (~40 chunks) | ~8 min | ~4 min | ~24 min |
| Total | ~22 min | ~11 min | ~65 min |

Bottlenecks

  1. Ollama compile — sequential LLM calls, ~12 sec/chunk. Could batch with larger context window.
  2. VLM caption — sequential, ~4 sec/frame. GPU contention with Ollama if run concurrently.
  3. Whisper — fast on Metal, but locks GPU for duration.

Memory pressure

| Concurrent | Peak VRAM | Safe on 16 GB? |
|------------|-----------|----------------|
| Whisper alone | ~1.6 GB | Yes |
| Ollama (7B q4) alone | ~5 GB | Yes |
| Whisper + Ollama | ~6.6 GB | Yes |
| Qwen2.5-VL-7B + Ollama text | ~10 GB | Tight but OK |
| All three simultaneously | ~12 GB | Risky — run sequentially |

Strategy: Run stages sequentially. whisper → keyframes → OCR → VLM → compile. No concurrent GPU workloads.


11. Storage Layout

~/crossmem/
  raw/
    youtube/
      {video_id}/
        {video_id}.wav              # Audio (whisper input)
        {video_id}_video.mp4        # Video (keyframe source)
        {video_id}.info.json        # yt-dlp metadata
        {video_id}.en.vtt           # Human subs (if available)
        {video_id}.en.auto.vtt      # Auto subs (if available)
        {video_id}.meta.json        # crossmem metadata
        transcript.json             # Whisper output with timestamps
        keyframes/
          frame_0001.png            # Scene-cut keyframes
          frame_0002.png
          keyframe_times.json       # Timestamp → frame mapping
          ocr/
            frame_0001.txt          # OCR output per frame
          captions/
            frame_0001.txt          # VLM caption per frame
  wiki/
    {timestamp}_{cite_key}.md       # Final compiled wiki note

12. Phased Delivery

P1 — Download + Transcribe (MVP)

  • URL detection in main.rs capture dispatch
  • yt-dlp download wrapper (youtube/download.rs)
  • whisper.cpp transcription wrapper (youtube/transcribe.rs)
  • Basic chunk segmentation (chapters or 60s windows)
  • Ollama compile pass (reuse from cite.rs)
  • Wiki markdown emission (transcript-only, no visual)
  • Dependency check + error messages
  • Tests for metadata parsing, cite_key generation, chunk segmentation
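The 60 s fallback segmentation could be sketched like this. The `Segment` struct is a hypothetical stand-in for whisper's per-segment timestamps in `transcript.json`; the real types in `youtube/transcribe.rs` may differ:

```rust
/// One whisper transcript segment: start/end in seconds plus text.
/// (Hypothetical shape for this sketch.)
#[derive(Debug, Clone)]
struct Segment {
    start: f64,
    end: f64,
    text: String,
}

/// Group transcript segments into fixed windows (e.g. 60 s) when the
/// video has no chapters. A segment belongs to the window its start
/// time falls into, so window boundaries never split a segment's text.
fn segment_into_windows(segments: &[Segment], window_secs: f64) -> Vec<(f64, String)> {
    let mut chunks: Vec<(f64, String)> = Vec::new();
    for seg in segments {
        let window_start = (seg.start / window_secs).floor() * window_secs;
        let same_window =
            matches!(chunks.last(), Some((start, _)) if *start == window_start);
        if same_window {
            let (_, text) = chunks.last_mut().unwrap();
            text.push(' ');
            text.push_str(&seg.text);
        } else {
            chunks.push((window_start, seg.text.clone()));
        }
    }
    chunks
}

fn main() {
    let segs = vec![
        Segment { start: 0.0, end: 4.0, text: "intro".into() },
        Segment { start: 62.5, end: 70.0, text: "next topic".into() },
    ];
    for (start, text) in segment_into_windows(&segs, 60.0) {
        println!("[{start:>6.1}s] {text}");
    }
}
```

Assigning by segment start keeps quotes verbatim at the cost of slightly uneven window lengths, which matches the design principle that the LLM never touches original text.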

P2 — Keyframes + OCR

  • ffmpeg scene-cut extraction (youtube/keyframe.rs)
  • Chapter-aware keyframe selection
  • Apple Vision OCR helper (tools/crossmem-ocr/)
  • Tesseract fallback
  • OCR text merged into chunks
  • Tests for keyframe timing, OCR integration
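The scene-cut extraction could wrap ffmpeg's real `select='gt(scene,…)'` filter. In this sketch the 0.3 threshold, the output pattern, and the `keyframe_command` helper are assumptions, not the actual `youtube/keyframe.rs` code:

```rust
use std::process::Command;

/// Build (but do not run) the ffmpeg scene-cut extraction command.
/// `-vsync vfr` keeps only the frames the select filter passes through.
fn keyframe_command(video: &str, out_dir: &str, scene_threshold: f64) -> Command {
    let filter = format!("select='gt(scene,{scene_threshold})'");
    let pattern = format!("{out_dir}/frame_%04d.png");
    let mut cmd = Command::new("ffmpeg");
    cmd.args([
        "-i", video,
        "-vf", filter.as_str(),
        "-vsync", "vfr",
        pattern.as_str(),
    ]);
    cmd
}

fn main() {
    let cmd = keyframe_command("video.mp4", "keyframes", 0.3);
    // Print the argv we would run; actually spawning requires ffmpeg on PATH.
    println!("{cmd:?}");
}
```

Building the `Command` separately from spawning it keeps the invocation testable without ffmpeg installed.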

P3 — VLM Captions + Diarization

  • Ollama multimodal integration for keyframe captioning (youtube/vlm.rs)
  • Keyframe captions merged into chunks
  • Optional: pyannote speaker diarization
  • Tests for VLM response parsing

P4 — Polish + Chunk Emission

  • Human sub → whisper alignment
  • Playlist support (batch capture)
  • crossmem compile --source youtube flag
  • Storage cleanup (delete intermediate files after compile)
  • Integration tests with real short video
  • Performance benchmarks on M2/M4

13. Open Questions

  1. Subtitle language detection: Should we auto-detect the video language and pass --language to whisper, or always use en? For P1, assume English.

  2. Video retention: Keep the video file after keyframe extraction, or delete to save disk? A 1h 1080p video is ~1–2 GB. Suggest: keep for 7 days, then auto-prune.

  3. Ollama model for compile pass: Reuse llama3.2:3b (same as arxiv), or use a different model better suited for spoken-word paraphrasing? Suggest: same model, same env var.

  4. Playlist semantics: One wiki note per video, or one per playlist? Suggest: one per video, with a playlist index note linking them.

  5. Live stream handling: yt-dlp can download from start, but duration is unknown until stream ends. Suggest: P1 skips live, add in P2.

Why crossmem bridge does not use Chrome DevTools Protocol

The incident

On a developer workstation, a suspicious process (PID 73079) spawned from a Claude shell snapshot executed the following sequence:

  1. sleep 2400 (wait for Chrome to settle)
  2. Connect to ws://localhost:9222 (Chrome DevTools Protocol)
  3. Call Runtime.evaluate on Clerk.session.getToken() to steal the active session token
  4. POST the stolen token to an external API (teaching.monster)

Root cause: a dev tool had launched Chrome with --remote-debugging-port=9222. This single flag exposes every open tab, every origin, every cookie on a localhost WebSocket with zero authentication. Any local process—malicious or not—can connect and run arbitrary JavaScript in the context of any page the user has open. CDP is a debugger; it trusts the caller completely.

What crossmem bridge does differently

crossmem bridge is a Manifest V3 Chrome extension that communicates with local agents over a WebSocket on localhost:7600. The design differs from CDP in several concrete ways:

  • No --remote-debugging-port. The user’s Chrome launches normally. There is no app-wide debug backdoor to connect to.
  • User-installed extension with Chrome’s permission UI. The user explicitly grants the extension host permissions. CDP requires no user consent at all; whatever launched Chrome with the flag decides.
  • Whitelisted action set. The bridge accepts a fixed set of named actions: navigate, click, type, extract, screenshot, summarize, tab_info, wait, ping. There is no generic “evaluate arbitrary JS” verb. An attacker who connects to :7600 can click buttons and read extracted text, but cannot call Clerk.session.getToken() or Network.getAllCookies.
  • Real Chrome profile, no spoofing. The extension runs inside the user’s actual Chrome profile—no --user-data-dir to a throwaway directory, no Chrome for Testing with broken Keychain integration.
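A whitelisted action set lends itself to a closed enum: any verb outside the vocabulary fails at parse time and never reaches a handler. A minimal Rust sketch of the idea (the actual bridge is a Chrome extension in JavaScript; this wire-level parse is an illustration, not its code):

```rust
/// The fixed action vocabulary of the bridge. There is deliberately
/// no variant for evaluating arbitrary JavaScript.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Action {
    Navigate,
    Click,
    Type,
    Extract,
    Screenshot,
    Summarize,
    TabInfo,
    Wait,
    Ping,
}

impl Action {
    /// Parse an action name; unknown verbs are rejected outright.
    fn parse(name: &str) -> Option<Action> {
        match name {
            "navigate" => Some(Action::Navigate),
            "click" => Some(Action::Click),
            "type" => Some(Action::Type),
            "extract" => Some(Action::Extract),
            "screenshot" => Some(Action::Screenshot),
            "summarize" => Some(Action::Summarize),
            "tab_info" => Some(Action::TabInfo),
            "wait" => Some(Action::Wait),
            "ping" => Some(Action::Ping),
            // "evaluate", "get_cookies", ... have no representation at all.
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(Action::parse("click"), Some(Action::Click));
    assert_eq!(Action::parse("evaluate"), None); // the CDP attack verb does not exist
}
```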

Threat model comparison

| Attack surface | CDP (:9222) | crossmem bridge (:7600) |
|---|---|---|
| Arbitrary JS on any origin | Runtime.evaluate — yes | No eval verb — no |
| Dump all cookies | Network.getAllCookies — yes | No such action — no |
| Read/modify DOM | Full DOM access | Only via named actions (click, extract) |
| Authentication | None | None (same weakness — see below) |
| User consent | None; whoever launched Chrome decides | Chrome extension install prompt |

The PID 73079 attack required exactly two things: the Runtime.evaluate CDP primitive and outbound network access. The former has no equivalent in the crossmem bridge action vocabulary.

What this design does NOT protect against

Honesty matters more than marketing. crossmem bridge has real limitations:

  • localhost:7600 is unauthenticated, same as CDP on :9222. Any local process can connect. The attack surface is smaller (no eval, no cookie dump), but the network posture is identical.
  • chrome.scripting.executeScript is arbitrary JS under the hood. The bridge currently uses it to implement actions like extract and click. If a future action handler passes attacker-controlled input (selectors, payloads) into executeScript without sanitization, the constrained action set becomes a confused deputy.
  • Supply-chain attack on the extension itself. A malicious MV3 update pushed to the Chrome Web Store bypasses every architectural constraint. The extension IS the trust boundary.
  • Planned hardening (not yet implemented):
    • Per-request auth token (shared secret between agent and extension)
    • Unix domain socket instead of TCP (removes network-reachable surface)
    • Strict input validation on action parameters
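The planned per-request auth token would be checked best with a constant-time comparison, so the bridge cannot leak matching token prefixes through response timing. A sketch under the assumption of a shared secret delivered out of band (no such mechanism exists in the extension yet; `authorize` is hypothetical):

```rust
/// Constant-time byte comparison: examine every byte regardless of
/// where the first mismatch occurs, so elapsed time does not reveal
/// how long a matching prefix was. (Length is still observable.)
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Reject a bridge request unless it carries the shared secret.
fn authorize(request_token: &str, shared_secret: &str) -> bool {
    constant_time_eq(request_token.as_bytes(), shared_secret.as_bytes())
}

fn main() {
    let secret = "s3cr3t-from-config"; // placeholder; would be generated per session
    assert!(authorize("s3cr3t-from-config", secret));
    assert!(!authorize("wrong", secret));
}
```

In production one would use a vetted crate (e.g. `subtle`) rather than hand-rolling the comparison; the sketch just shows the shape of the check.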

Takeaway

The lesson from PID 73079 is not “use crossmem bridge instead of CDP.” It is: dev automation tooling should not default to opening an app-wide debug backdoor.

CDP is a debugger protocol. It was designed for DevTools, not for agent orchestration. When you expose it on localhost, you hand every local process, including ones you didn’t launch, full control over every tab in the browser.

crossmem bridge chose a constrained, consent-gated channel: a user-installed extension exposing a fixed action vocabulary over a local WebSocket. This is a design choice that reduces the blast radius of local-process compromise. It is not magic, and it is not complete. But it means PID 73079’s exact attack vector—connect, eval, exfiltrate—does not work.