
crossmem

crossmem is a local-first citation and knowledge pipeline. It captures academic papers (arXiv PDFs today, YouTube and more coming), compiles them into structured wiki notes with verbatim quotes and provenance metadata, and serves them to AI agents via MCP.

What it does

  1. Capture — downloads a paper, extracts metadata from arXiv + CrossRef + OpenAlex, generates a deterministic cite key
  2. Compile — parses the PDF (via Marker or pdftotext), splits into paragraph-level chunks with bounding-box provenance, runs a local LLM (Ollama) to add paraphrase and implication per chunk
  3. Verify — re-hashes every chunk’s text against its stored SHA-256; detects silent drift
  4. Cite & Recall — MCP tools that let Claude (or any MCP client) look up citations and search your wiki

Design principles

  • Verbatim quotes are ground truth. The LLM only touches paraphrase/implication fields, never the original text.
  • Provenance is first-class. Every chunk carries page, section, bounding box, SHA-256 hash, and byte range back to the source PDF.
  • Metadata is cross-verified. Title, authors, and year must agree across at least two canonical sources (arXiv, CrossRef, OpenAlex). Disagreements surface as warnings, not silent picks.
  • Everything runs locally. No cloud APIs. Ollama for LLM, Marker for PDF parsing, all on your Mac.

Installation

From source

cargo install --path .

Or directly from GitHub:

cargo install --git https://github.com/crossmem/crossmem-rs

Dependencies

crossmem requires the following tools for PDF capture and compilation:

| Tool | Purpose | Install |
|------|---------|---------|
| pdftotext | Fallback PDF text extraction | `brew install poppler` |
| marker | PDF parsing with bounding boxes (default) | `pip install marker-pdf` or `uvx marker-pdf` |
| Ollama | Local LLM for paraphrase/implication | ollama.com |

Ollama model setup

crossmem uses llama3.2:3b by default. Pull the model before your first compile:

ollama pull llama3.2:3b

Override the model with the CROSSMEM_OLLAMA_MODEL environment variable:

CROSSMEM_OLLAMA_MODEL=mistral crossmem compile vaswani2017attention

Verify installation

crossmem --version

Quick Start

Capture a paper, compile it, and cite it — in under 30 seconds.

1. Capture

Download an arXiv paper and extract metadata:

crossmem capture https://arxiv.org/abs/1706.03762

Output:

[capture] arxiv_id: 1706.03762
[capture] title: Attention Is All You Need
[capture] cite_key: vaswani2017attention
[capture] saved to ~/crossmem/raw/...

2. Compile

Parse the PDF into chunks and run the LLM pass:

crossmem compile vaswani2017attention

This produces a wiki note at ~/crossmem/wiki/<timestamp>_vaswani2017attention.md with:

  • YAML frontmatter (title, authors, year, DOI, cite_key)
  • Five citation formats (APA, MLA, Chicago, IEEE, BibTeX)
  • Per-chunk verbatim quotes with paraphrase, implication, and provenance metadata

3. Cite via MCP

Add crossmem to Claude Code:

claude mcp add crossmem -- crossmem mcp serve

Then ask Claude:

Cite vaswani2017attention in APA format.

Claude calls crossmem_cite and returns:

Vaswani, A., & Shazeer, N. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

4. Search your wiki

Ask Claude:

What do I have on self-attention mechanisms?

Claude calls crossmem_recall and returns matching excerpts ranked by relevance, with cite keys and deep links to your wiki files.

Writing a Paper with crossmem

An end-to-end playbook for AI agents (Claude Code, Cursor, etc.) and their human authors. You have crossmem installed and the MCP server registered. You want to cite prior work correctly and quote-faithfully.

1. One-time setup

Install crossmem and its dependencies:

# Install crossmem
cargo install --path .
# Or, from the repo directly:
# cargo install --git https://github.com/crossmem/crossmem-rs

# Local LLM for paraphrase/implication generation
ollama pull llama3.2:3b

# PDF parser (preferred — produces bounding boxes)
pip install marker-pdf
# Fallback: brew install poppler   (provides pdftotext)

Register the MCP server so your agent can call crossmem_cite and crossmem_recall:

Claude Code:

claude mcp add crossmem -- crossmem mcp serve

Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crossmem": {
      "command": "crossmem",
      "args": ["mcp", "serve"]
    }
  }
}

2. Capturing a paper

crossmem capture https://arxiv.org/abs/1706.03762

Output:

[capture] arxiv_id: 1706.03762
[capture] title: Attention Is All You Need
[capture] cite_key: vaswani2017attention
[capture] saved to ~/crossmem/raw/1776227254_vaswani2017attention.pdf

This does three things:

  1. Downloads the PDF to ~/crossmem/raw/<timestamp>_<cite_key>.pdf
  2. Fetches metadata from arXiv, CrossRef, and OpenAlex — reconciles across all three
  3. Generates a deterministic cite key via the pattern DSL

Then compile it into a wiki entry:

crossmem compile vaswani2017attention

This parses the PDF (Marker by default), splits it into chunks, runs each through Ollama for paraphrase and implication, and emits the wiki note at ~/crossmem/wiki/<timestamp>_vaswani2017attention.md.

Note: YouTube ingestion is design-only — see YouTube Ingestion Pipeline.

Capturing non-arXiv papers

Most journal papers (e.g. JCP, Nature, PRL) are not on arXiv. crossmem capture supports them through DOI lookup and local PDF import.

If you have a DOI — CrossRef metadata is fetched automatically:

# DOI URL
crossmem capture https://doi.org/10.1063/5.0012345

# Bare DOI
crossmem capture 10.1063/5.0012345

If the paper is open-access, the PDF downloads via Unpaywall. Otherwise you’ll get instructions to download it manually.

If you already have the PDF — the most common path for paywalled journals:

# With DOI (recommended — gets full CrossRef metadata)
crossmem capture ~/Downloads/smith2023.pdf --doi 10.1063/5.0012345

# Without DOI — extracts what it can from PDF metadata
crossmem capture ~/Downloads/smith2023.pdf --cite-key smith2023transport

Direct PDF URL — for preprint servers, institutional repos:

crossmem capture https://chemrxiv.org/paper.pdf --doi 10.1234/chemrxiv.5678

All paths produce the same raw/ + .meta.json output. Then compile as usual:

crossmem compile smith2023transport

For a JCP submission with 24 references, a typical workflow is:

# Capture each reference — most will be local PDFs with DOIs
for pdf in ~/papers/jcp-refs/*.pdf; do
  doi=$(pdftotext "$pdf" - | head -5 | grep -oE '10\.[0-9]{4,9}/[^[:space:]]+' | head -1)
  crossmem capture "$pdf" --doi "$doi"
done

# Then compile each one
for meta in ~/crossmem/raw/*.meta.json; do
  key=$(jq -r .cite_key "$meta")
  crossmem compile "$key"
done

3. The compiled wiki entry — what the agent sees

Frontmatter

---
cite_key: vaswani2017attention
title: "Attention Is All You Need"
authors:
  - "Ashish Vaswani"
  - "Noam Shazeer"
year: 2017
arxiv_id: "1706.03762"
doi: "10.48550/arXiv.1706.03762"
captured_at: "1776227254"
raw: "~/crossmem/raw/1776227254_vaswani2017attention.pdf"
pdf_sha256: "9a8f3b..."
parser: "marker"
chunks: 47
meta:
  sources: ["arxiv", "crossref", "openalex"]
  reconciled: true
  warnings: []
---

After the frontmatter, five citation formats are pre-generated: APA, MLA, Chicago, IEEE, and BibTeX.

Chunks

Each chunk carries verbatim text, LLM-generated derivatives, and full provenance:

````markdown
<!-- chunk id=p4s32c1 -->
> The dominant sequence transduction models are based on complex recurrent or
> convolutional neural networks that include an encoder and a decoder.

**Paraphrase:** Prior sequence models relied on RNNs or CNNs in an encoder-decoder setup.

**Implication:** This dependency on recurrence was the bottleneck the Transformer aimed to eliminate.

```yaml
provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: "5f3e1c..."
  byte_range: [18342, 19104]
```
````

Hard rule for agents: The > blockquote is the verbatim original extracted from the PDF. When citing, the agent MUST copy from this blockquote. NEVER fabricate or rephrase quotes. The Paraphrase and Implication fields exist for the agent’s reasoning and search — they do not belong in the paper as attributed quotes.
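
For scripted workflows outside MCP, the verbatim quote can be extracted mechanically instead of retyped. A minimal sketch, assuming the chunk layout shown above; `extract_quote` is a hypothetical helper, not a crossmem command:

```sh
# Print the verbatim quote for one chunk ID from a wiki note.
extract_quote() {  # usage: extract_quote <chunk_id> <wiki_file>
  awk -v id="$1" '
    index($0, "<!-- chunk id=" id " -->") { grab = 1; next }
    grab && /^> /                         { sub(/^> /, ""); print; found = 1; next }
    grab && found                         { exit }   # stop after the quote block
  ' "$2"
}
```

The printed lines are the blockquote text with the `> ` prefix stripped; pasting them unmodified into a draft keeps the quote drift-clean.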

4. Agent prompts that actually work

Finding relevant chunks

“Search my library for how transformer attention was originally motivated. Return cite_keys and page numbers.”

Agent calls:

crossmem_recall("transformer attention motivation", limit=5)

Returns a ranked list of {cite_key, title, section, excerpt}. The agent picks the most relevant hits and reports them.

Quoting with provenance

“Write a paragraph introducing self-attention. Quote vaswani2017attention page 2 verbatim, then paraphrase in my voice. Include BibTeX.”

Agent workflow:

  1. Calls crossmem_recall("self-attention vaswani2017attention") to find the right chunk
  2. Reads the wiki file to locate the page-2 chunk
  3. Copies the > blockquote verbatim into the draft as a block quote
  4. Writes a surrounding paraphrase in the author’s voice (informed by the Paraphrase field, not copying it)
  5. Calls crossmem_cite("vaswani2017attention", "bibtex") for the BibTeX entry
  6. Embeds the text_sha256 and page reference as a LaTeX comment so crossmem verify can trace provenance:
% crossmem: vaswani2017attention p4s32c1 sha256=5f3e1c...
\begin{quote}
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder.
\end{quote}
\cite{vaswani2017attention}

Citing multiple papers

“Compare how Vaswani 2017 and Devlin 2019 frame the importance of pre-training.”

Agent calls crossmem_recall("pre-training importance"), gets hits from both papers, reads the relevant chunks, and writes a comparison paragraph quoting both — each quote traced to its chunk ID.

Running a drift check

After the human edits the draft (or the agent revises it), verify that no quotes have been accidentally mutated:

crossmem verify

Output when clean:

[verify] checked 94 chunks across 3 wiki entries
[verify] 0 drifts detected

Output when a quote was altered:

[verify] DRIFT in vaswani2017attention chunk p4s32c1
  expected: 5f3e1c...
  actual:   a1b2c3...
[verify] 1 drift detected

Exit code 1 means drift — the agent or human must restore the original quote from the wiki.

Building the bib file

Collect all \cite{...} keys from a LaTeX draft and emit a single .bib:

grep -oE '\\cite\{[^}]+\}' draft.tex \
  | sed 's/\\cite{//;s/}//' \
  | tr ',' '\n' \
  | sort -u \
  | while read -r key; do crossmem mcp serve <<< "{\"method\":\"tools/call\",\"params\":{\"name\":\"crossmem_cite\",\"arguments\":{\"cite_key\":\"$key\",\"format\":\"bibtex\"}}}"; done

Or, have the agent do it: “Collect every cite key from my draft and produce a references.bib file using crossmem_cite.”

5. What crossmem protects against

| Failure mode | How crossmem prevents it |
|---|---|
| Hallucinated citation metadata | Multi-source reconciliation: arXiv + CrossRef + OpenAlex, ≥2 must agree. Disagreements surface as warnings in frontmatter. |
| Hallucinated quotes | Agent contract: never compose original text, only copy the `>` blockquote. `crossmem verify` catches any post-hoc mutation via SHA-256 re-hashing. |
| Wrong page numbers | Every chunk carries page, section, and bbox — the reader can trace back to the exact PDF region. |
| Lost context | `byte_range` preserves the exact location in the raw PDF. Chunks retain their section heading for navigation. |
| Cite key collisions | Deterministic pattern DSL with a–z suffix tiebreaker (then `_<count>` if all 26 are taken). |

6. Limits

Be honest about what crossmem cannot do today:

  • Scanned / image-only PDFs: Marker’s OCR quality varies. Chunks from poorly scanned pages may have garbled text.
  • Math-heavy pages: The pipeline does not run Nougat or other math-aware extractors. Equations may appear as lossy Unicode approximations or be missing entirely.
  • Non-arXiv sources: Journal papers captured via DOI or local PDF have single-source metadata (CrossRef only), so there is no cross-verification. Books and conference proceedings with non-standard DOIs may produce incomplete frontmatter.
  • Single-author workflow: There is no shared library, sync, or multi-user conflict resolution. Each machine has its own ~/crossmem/ directory.
  • Ollama dependency: Compile requires a running Ollama instance. If Ollama is down or the model is missing, compile will fail.

7. Minimal paper-writing session

A scripted walkthrough — capture two papers, write an intro paragraph, verify.

# Capture two papers
crossmem capture https://arxiv.org/abs/1706.03762
crossmem compile vaswani2017attention

crossmem capture https://arxiv.org/abs/1810.04805
crossmem compile devlin2019bert

Now prompt the agent:

“Write an introductory paragraph for my Related Work section. It should cite both vaswani2017attention and devlin2019bert, quoting one key sentence from each verbatim. Output LaTeX with \cite commands and the BibTeX entries.”

The agent:

  1. Calls crossmem_recall("attention mechanism transformer", limit=5) and crossmem_recall("pre-training bidirectional", limit=5)
  2. Reads the wiki entries for both papers, selects one chunk each
  3. Produces:
The Transformer architecture replaced recurrence with self-attention:
\begin{quote}
``The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder.''
\end{quote}
\cite{vaswani2017attention}. Building on this,
BERT demonstrated that bidirectional pre-training could be applied to a wide
range of NLP tasks:
\begin{quote}
``We introduce a new language representation model called BERT, which stands
for Bidirectional Encoder Representations from Transformers.''
\end{quote}
\cite{devlin2019bert}.

% crossmem: vaswani2017attention p1s0c1 sha256=...
% crossmem: devlin2019bert p1s0c1 sha256=...
  4. Calls crossmem_cite("vaswani2017attention", "bibtex") and crossmem_cite("devlin2019bert", "bibtex") to emit references.bib

Finally, verify nothing drifted:

crossmem verify
# [verify] checked 94 chunks across 2 wiki entries
# [verify] 0 drifts detected

The quotes in your LaTeX match the raw PDFs. Ship it.

crossmem capture

Download a paper and extract metadata.

Usage

crossmem capture <input> [--doi <doi>] [--cite-key <key>]

Input types

| Input | Example | Detection |
|---|---|---|
| Local PDF file | `/path/to/paper.pdf` | Path exists on disk |
| arXiv URL or bare ID | `https://arxiv.org/abs/1706.03762`, `1706.03762` | arXiv URL pattern or bare numeric ID |
| DOI URL or bare DOI | `https://doi.org/10.1038/nature12373`, `10.1038/nature12373` | DOI URL prefix or `10.NNNN/...` pattern |
| Direct PDF URL | `https://example.com/paper.pdf` | HTTPS URL ending in `.pdf` |

Inputs are matched in the order above — first match wins.
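
The cascade can be sketched as a chain of pattern checks. This is an illustration of the rules above under assumed regexes, not crossmem's actual matching code:

```sh
# First-match-wins input classification, mirroring the detection order above.
classify_input() {  # usage: classify_input <input>
  in="$1"
  if [ -e "$in" ]; then
    echo "local-pdf"    # a path that exists on disk
  elif printf '%s' "$in" | grep -qE '^(https?://arxiv\.org/abs/|[0-9]{4}\.[0-9]{4,5}(v[0-9]+)?$)'; then
    echo "arxiv"
  elif printf '%s' "$in" | grep -qE '^(https?://doi\.org/|10\.[0-9]{4,9}/)'; then
    echo "doi"
  elif printf '%s' "$in" | grep -qE '^https://.*\.pdf$'; then
    echo "pdf-url"
  else
    echo "unknown"
  fi
}
```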

Flags

| Flag | Description |
|---|---|
| `--doi <doi>` | Attach a DOI to the capture. For local files and PDF URLs, fetches CrossRef metadata for this DOI. |
| `--cite-key <key>` | Override the auto-generated cite key. |

What it does

arXiv input (existing behavior)

  1. Extracts the arXiv ID from the URL
  2. Fetches metadata from arXiv API
  3. Cross-checks metadata against CrossRef and OpenAlex (reconciliation)
  4. Downloads the PDF from arXiv
  5. Generates a cite key using the configured pattern DSL
  6. Saves PDF + .meta.json sidecar

DOI input

  1. Fetches metadata from CrossRef API
  2. Tries Unpaywall API for an open-access PDF URL (requires CROSSMEM_UNPAYWALL_EMAIL env var)
  3. If no open-access PDF found, prints instructions to download manually and use local file capture

Local PDF file

  1. Copies (not moves) the PDF to ~/crossmem/raw/<timestamp>_<cite_key>.pdf
  2. If --doi given: fetches CrossRef metadata
  3. If no --doi: tries extracting embedded PDF metadata via pdfinfo (Title, Author, CreationDate)
  4. If no metadata found and no --cite-key: errors with instructions

Direct PDF URL

  1. Downloads the PDF
  2. Then follows the same metadata path as local file (CrossRef via --doi, or pdfinfo fallback)

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (invalid input, download failure, metadata fetch failure) |
| 2 | Missing arguments |

Environment variables

| Variable | Description |
|---|---|
| `CROSSMEM_UNPAYWALL_EMAIL` | Email address for Unpaywall API (required for DOI→PDF lookup) |

See config.toml for cite key configuration.

Examples

arXiv paper

$ crossmem capture https://arxiv.org/abs/1706.03762
[capture] arxiv_id: 1706.03762
[capture] title: Attention Is All You Need
cite_key:   vaswani2017attention

Journal paper via DOI

$ crossmem capture 10.1063/5.0012345
[capture] DOI: 10.1063/5.0012345
cite_key:   smith2023molecular

Local PDF with DOI metadata

$ crossmem capture ~/Downloads/paper.pdf --doi 10.1063/5.0012345
[capture] Local file: /Users/me/Downloads/paper.pdf
[capture] Fetching CrossRef metadata for DOI 10.1063/5.0012345
cite_key:   smith2023molecular

Local PDF with manual cite key

$ crossmem capture ~/Downloads/paper.pdf --cite-key jones2024transport
[capture] Local file: /Users/me/Downloads/paper.pdf
cite_key:   jones2024transport

Direct PDF URL

$ crossmem capture https://example.com/papers/preprint.pdf --doi 10.1234/example
cite_key:   doe2024example

Storage layout

~/crossmem/raw/
  <timestamp>_<cite_key>.pdf        # Raw PDF
  <timestamp>_<cite_key>.meta.json  # Metadata sidecar

The .meta.json file contains the reconciled metadata used by compile:

{
  "cite_key": "smith2023molecular",
  "title": "Molecular dynamics simulation of transport",
  "authors": ["John Smith", "Jane Doe"],
  "year": 2023,
  "arxiv_id": "",
  "doi": "10.1063/5.0012345",
  "container_title": "The Journal of Chemical Physics",
  "sources": ["crossref"],
  "reconciled": true,
  "warnings": []
}
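
Because every capture path produces the same sidecar shape, the `.meta.json` files double as a queryable index of your library. A sketch with jq; `list_library` is a hypothetical helper:

```sh
# Print "cite_key<TAB>title" for every metadata sidecar in a directory.
list_library() {  # usage: list_library ~/crossmem/raw
  for m in "$1"/*.meta.json; do
    [ -e "$m" ] || continue             # directory empty: glob stays literal
    jq -r '"\(.cite_key)\t\(.title)"' "$m"
  done
}
```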

crossmem compile

Parse a captured PDF into structured wiki chunks with LLM-generated paraphrase and implication.

Usage

crossmem compile <cite_key>

Arguments

| Argument | Description |
|---|---|
| `<cite_key>` | The cite key printed by `crossmem capture`. Example: `vaswani2017attention` |

What it does

  1. Finds the raw PDF and .meta.json for the given cite key in ~/crossmem/raw/
  2. Parses the PDF using Marker (preferred) or pdftotext (fallback)
  3. Splits content into paragraph-level chunks with bounding-box provenance
  4. Computes SHA-256 hash for each chunk’s verbatim text
  5. Sends each chunk to Ollama for paraphrase and implication generation
  6. Generates five citation formats (APA, MLA, Chicago, IEEE, BibTeX)
  7. Emits the final wiki note to ~/crossmem/wiki/<timestamp>_<cite_key>.md

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (cite key not found, Ollama unreachable, parse failure) |

Environment variables

| Variable | Default | Description |
|---|---|---|
| `CROSSMEM_OLLAMA_MODEL` | `llama3.2:3b` | Ollama model used for paraphrase/implication generation |

Example

$ crossmem compile vaswani2017attention
[compile] loading raw PDF for vaswani2017attention
[compile] parsing with Marker (MPS)...
[compile] 47 chunks extracted
[compile] compiling chunk 1/47...
...
[compile] wiki saved to ~/crossmem/wiki/1776227300_vaswani2017attention.md

PDF parsing tiers

| Tier | Parser | When used | Bounding boxes |
|---|---|---|---|
| 0 | `pdftotext -layout` | Fallback when Marker unavailable | No |
| 1 | Marker (MPS) | Default for arXiv papers | Yes (polygon per block) |

The parser tier is recorded in the wiki frontmatter as the parser field.

LLM contract

The LLM (Ollama) is only allowed to generate paraphrase and implication fields. It never touches:

  • Original verbatim text (from PDF extractor)
  • Metadata fields (from reconciler)
  • Citation strings (deterministic generator)
  • Provenance data (from parser)

crossmem verify

Verify chunk integrity by re-hashing verbatim text against stored SHA-256 hashes.

Usage

crossmem verify [cite_key]

Arguments

| Argument | Description |
|---|---|
| `[cite_key]` | Optional. If provided, only verify chunks for this cite key. If omitted, verify all wiki entries. |

What it does

  1. Walks ~/crossmem/wiki/ for all .md files (or the one matching cite_key)
  2. For each wiki entry with a cite_key in frontmatter:
    • Extracts all <!-- chunk id=... --> blocks
    • Finds the text_sha256 in each chunk’s provenance YAML
    • Re-computes SHA-256 from the verbatim quoted text (> ... lines)
    • Reports any mismatches as “DRIFT”
  3. Prints summary: total chunks checked, total drifts detected
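
The heart of the check fits in a few lines. The sketch below assumes the hash covers the quote lines with the `> ` prefix stripped and newlines preserved; the exact canonicalization crossmem uses is not documented here, so treat this as an illustration of the mechanism:

```sh
# Portable SHA-256 of stdin (GNU coreutils sha256sum or macOS shasum).
sha256_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then sha256sum | awk '{print $1}'
  else shasum -a 256 | awk '{print $1}'
  fi
}

# Re-hash the "> " quote lines of a file containing a single chunk.
quote_sha() {  # usage: quote_sha <file>
  sed -n 's/^> //p' "$1" | sha256_stdin
}
```

Comparing the output of quote_sha against the stored text_sha256 flags a drifted quote.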

Exit codes

| Code | Meaning |
|---|---|
| 0 | All chunks verified, no drift |
| 1 | Error or drift detected |

Example

$ crossmem verify vaswani2017attention
[verify] Checking vaswani2017attention

Verified 47 chunks, 0 drift(s) detected.
$ crossmem verify
[verify] Checking vaswani2017attention
[verify] Checking lecun2015deep

Verified 93 chunks, 0 drift(s) detected.

When drift is detected:

DRIFT: vaswani2017attention chunk p4s32c1
  expected: 5f3e1c...
  actual:   a8b2d4...

Verified 47 chunks, 1 drift(s) detected.

crossmem mcp serve

Start the MCP (Model Context Protocol) server on stdio.

Usage

crossmem mcp serve

What it does

Starts an MCP server that communicates over stdin/stdout, providing two tools to any MCP client: crossmem_cite and crossmem_recall.

The server loads wiki entries from ~/crossmem/wiki/ and serves them to the connected client.

Exit codes

| Code | Meaning |
|---|---|
| 0 | Clean shutdown |
| 1 | Server error |

Environment variables

| Variable | Default | Description |
|---|---|---|
| `RUST_LOG` | `warn` | Log level (logs go to stderr, not stdout — stdout is the MCP transport) |

Adding to Claude Code

claude mcp add crossmem -- crossmem mcp serve

This registers crossmem as an MCP server that Claude Code will start automatically.

Adding to Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crossmem": {
      "command": "crossmem",
      "args": ["mcp", "serve"]
    }
  }
}

crossmem serve

Run the HTTP/WebSocket relay bridge that connects CLI tools and agents to the crossmem Chrome extension.

Usage

crossmem serve
crossmem          # 'serve' is the default when no subcommand is given

What it does

Starts an HTTP + WebSocket server on 127.0.0.1:7600 (configurable). The Chrome extension connects via WebSocket; CLI tools and agents send commands via HTTP.

Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/status` | GET | Connection status: connected extensions, pending command count |
| `/command` | POST | Send a command to the extension and wait for its response |
| `/dialog_response` | POST | Send a dialog response back to the extension |
| `/capture` | POST | Screen recording capture handler |
| `/` or `/ws` | WS | Extension WebSocket connection |

Exit codes

| Code | Meaning |
|---|---|
| 0 | Clean shutdown (SIGINT or SIGTERM) |
| 1 | Bind failure (port already in use) |

Environment variables

| Variable | Default | Description |
|---|---|---|
| `BRIDGE_PORT` | `7600` | Port to listen on |
| `RUST_LOG` | `info` | Log level |

Example

$ crossmem serve
[bridge] crossmem-bridge v0.1.0
[bridge] HTTP  → http://127.0.0.1:7600/status
[bridge] HTTP  → http://127.0.0.1:7600/command
[bridge] WS    → ws://127.0.0.1:7600/
[bridge] waiting for extension...

Sending a command

curl -X POST http://127.0.0.1:7600/command \
  -H 'Content-Type: application/json' \
  -d '{"action":"navigate","params":{"url":"https://example.com"}}'

Checking status

curl -s http://127.0.0.1:7600/status | jq .
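
Scripts that drive the bridge usually want to wait for it to come up before POSTing commands. A hedged sketch that only checks whether /status answers at all; it does not parse the response, since the exact JSON shape is not specified here:

```sh
# Poll /status until the bridge answers, or give up after N attempts.
wait_for_bridge() {  # usage: wait_for_bridge <attempts>
  i=0
  while [ "$i" -lt "$1" ]; do
    if curl -fsS --max-time 2 "http://127.0.0.1:${BRIDGE_PORT:-7600}/status" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```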

Chrome extension

The bridge is designed to work with the crossmem Chrome extension. The extension connects via WebSocket and executes commands using chrome.scripting.executeScript.

Supported actions: navigate, click, type, wait, extract, screenshot, summarize, tab_info, ping.

For multi-agent use, add "agentId": "my-agent" to commands to isolate tab control.

MCP Integration

crossmem exposes two tools via the Model Context Protocol (MCP), allowing AI agents to look up citations and search your wiki without leaving the conversation.

Tools

| Tool | Description |
|---|---|
| `crossmem_cite` | Look up a citation by cite key and return it in a specified format |
| `crossmem_recall` | Search the wiki for entries matching a query |

Setup

Claude Code

claude mcp add crossmem -- crossmem mcp serve

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "crossmem": {
      "command": "crossmem",
      "args": ["mcp", "serve"]
    }
  }
}

Agent usage prompts

Once crossmem is registered as an MCP server, you can ask your agent things like:

  • “Cite vaswani2017attention in APA format.”
  • “Give me the BibTeX for vaswani2017attention.”
  • “What do I have on attention mechanisms?”
  • “Search my wiki for papers about transformer architectures.”
  • “Find all papers by Vaswani in my library.”

How it works

The MCP server (crossmem mcp serve) runs on stdio. It loads all .md files from ~/crossmem/wiki/ on startup, parses their YAML frontmatter and body, and responds to tool calls by searching this in-memory index.

Logs go to stderr (not stdout), so they don’t interfere with the MCP JSON-RPC transport.

crossmem_cite

Look up a citation by cite key and return it in the requested format.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `cite_key` | string | yes | | Citation key, e.g. `vaswani2017attention` |
| `format` | string | no | `bibtex` | One of: `bibtex`, `apa`, `mla`, `chicago`, `ieee` |

Returns

The formatted citation string extracted from the wiki file’s citation section.

Success

Vaswani, A., & Shazeer, N. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

Cite key not found

If the cite key doesn’t match any wiki entry, returns the top 5 fuzzy matches:

Error: cite_key 'vaswani' not found. Did you mean:
  - vaswani2017attention — Attention Is All You Need

Format not found

If the cite key exists but the wiki file is missing the requested citation section:

Error: cite_key 'vaswani2017attention' found but no APA citation section in wiki file.
File: /Users/you/crossmem/wiki/1776227300_vaswani2017attention.md

Fuzzy matching

When an exact match fails, the tool scores candidates by:

  1. Full cite key substring match (+10)
  2. Full title substring match (+5)
  3. Per-token cite key match (+3 each)
  4. Per-token title match (+2 each)

The top 5 candidates are returned as suggestions.

crossmem_recall

Search the crossmem wiki for entries matching a query. Returns matching excerpts ranked by relevance.

Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | yes | | Search query string |
| `limit` | integer | no | 5 | Max results to return (capped at 20) |

Returns

A ranked list of matching wiki entries, each with:

  • Index number
  • Cite key and title
  • Section where the match was found
  • Excerpt (up to 400 characters) with surrounding context
  • Deep link to the wiki file

Example response

1. [vaswani2017attention] Attention Is All You Need
   section: p.4 §3.2 Scaled Dot-Product Attention
   excerpt: ...We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension dk...
   link: file:///Users/you/crossmem/wiki/1776227300_vaswani2017attention.md

2. [lecun2015deep] Deep Learning
   section: p.12 §4 Attention Mechanisms
   excerpt: ...Attention mechanisms have become an integral part of sequence modeling...
   link: file:///Users/you/crossmem/wiki/1776300000_lecun2015deep.md

No results

No results for query: 'quantum computing'

Scoring

Results are ranked by total token frequency: each whitespace-delimited query token is counted across the entry’s title and body. Higher count = higher rank.
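
As a toy illustration of that rule (shell, not the actual Rust implementation):

```sh
# Total occurrences of each whitespace-delimited query token in a file.
score() {  # usage: score "query string" <file>
  total=0
  for tok in $1; do                     # unquoted on purpose: split the query into tokens
    n=$(grep -oiF "$tok" "$2" | wc -l)
    total=$((total + n))
  done
  echo "$total"
}
```

Entries with a higher total rank first.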

config.toml

crossmem reads configuration from ~/.crossmem/config.toml. If the file doesn’t exist, it’s created with defaults on first run.

Location

~/.crossmem/config.toml

Full reference

[cite_key]
pattern = "[auth:lower][year][shorttitle:1:nopunct]"

Sections

[cite_key]

| Key | Default | Description |
|---|---|---|
| `pattern` | `[auth:lower][year][shorttitle:1:nopunct]` | Pattern DSL for generating cite keys. See cite_key Pattern DSL. |

Environment variables

These are not in config.toml but affect crossmem’s behavior:

| Variable | Default | Description |
|---|---|---|
| `CROSSMEM_OLLAMA_MODEL` | `llama3.2:3b` | Ollama model for compile pass |
| `BRIDGE_PORT` | `7600` | Bridge server port |
| `RUST_LOG` | `info` (bridge) / `warn` (MCP) | Log level filter |

Data directories

crossmem stores all data under ~/crossmem/:

~/crossmem/
  raw/          # Downloaded PDFs + metadata JSON sidecars
  wiki/         # Compiled wiki notes (markdown)

cite_key Pattern DSL

crossmem generates citation keys using a pattern DSL inspired by Better BibTeX. The pattern is configured in ~/.crossmem/config.toml:

[cite_key]
pattern = "[auth:lower][year][shorttitle:1:nopunct]"

Syntax

A pattern is a string of tokens (inside [brackets]) and literal characters (outside brackets).

[field:modifier1:modifier2]literal_text[field2]

Tokens

| Token | Description | Example output |
|---|---|---|
| `auth` | First author's last name | `Vaswani` |
| `authors` | All authors' last names concatenated | `VaswaniShazeer` |
| `year` | Publication year | `2017` |
| `shorttitle` | First N significant words from title (stop words filtered) | `attention` |
| `title` | Full title | `Attention Is All You Need` |

shorttitle behavior

shorttitle filters out common stop words (a, an, the, is, are, was, for, of, with, …) and takes the first N remaining words. N is specified as a numeric modifier.

Example with title “Attention Is All You Need”:

  • `[shorttitle:1]` → `attention`
  • `[shorttitle:3]` → `attentionneed` (after filtering "Is", "All", "You")
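
The mechanism can be sketched in shell. The stop-word list below is an assumption (only the words quoted above are confirmed), so this is an illustration, not a faithful port:

```sh
# First N significant title words, lowercased and stripped of punctuation.
shorttitle() {  # usage: shorttitle <N> <title words...>
  n="$1"; shift
  stop=" a an the is are was for of with all you and on in to "   # assumed stop-word list
  out=""; count=0
  for w in "$@"; do
    lw=$(printf '%s' "$w" | tr '[:upper:]' '[:lower:]' | tr -cd 'a-z0-9')
    case "$stop" in *" $lw "*) continue ;; esac
    out="$out$lw"; count=$((count + 1))
    if [ "$count" -eq "$n" ]; then break; fi
  done
  printf '%s\n' "$out"
}
```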

Modifiers

Modifiers are appended to the token with : separators and applied in order:

| Modifier | Description | Example |
|---|---|---|
| `lower` | Lowercase | `VASWANI` → `vaswani` |
| `upper` | Uppercase | `vaswani` → `VASWANI` |
| `nopunct` | Remove all non-alphanumeric characters | `hello-world!` → `helloworld` |
| `condense` | Remove all whitespace | `hello world` → `helloworld` |
| `N` (digit) | For `shorttitle`: take first N words. For other fields: take first N whitespace-delimited words. | `[shorttitle:1]` → first significant word |

Examples

Default pattern

pattern = "[auth:lower][year][shorttitle:1:nopunct]"
| Paper | Generated key |
|---|---|
| Vaswani et al., "Attention Is All You Need", 2017 | `vaswani2017attention` |
| LeCun et al., "Deep Learning", 2015 | `lecun2015deep` |

All authors

pattern = "[authors:lower][year]"
| Paper | Generated key |
|---|---|
| Vaswani & Shazeer, "Attention Is All You Need", 2017 | `vaswanishazeer2017` |

With literal separator

pattern = "[auth:lower]_[year]"
| Paper | Generated key |
|---|---|
| Vaswani et al., 2017 | `vaswani_2017` |

Full title condensed

pattern = "[title:condense:lower]"
| Paper | Generated key |
|---|---|
| "Attention Is All You Need" | `attentionisallyouneed` |

Multi-word short title

pattern = "[auth:lower][year][shorttitle:3:nopunct]"
| Paper | Generated key |
|---|---|
| Vaswani et al., "Attention Is All You Need", 2017 | `vaswani2017attentionneed` |

Collision resolution

If a generated key collides with an existing entry, crossmem appends a suffix:

  1. Try a through z: vaswani2017attentionvaswani2017attentiona
  2. If all 26 letters exhausted, append _<count>: vaswani2017attention_27
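
The search can be sketched as follows, assuming existing keys are listed one per line in a file; `next_key` is a hypothetical helper, and the counter starting at 27 mirrors the example above:

```sh
# First free key: base, then base+a..z, then base_27, base_28, ...
next_key() {  # usage: next_key <base_key> <existing_keys_file>
  base="$1"; keys="$2"
  grep -qxF "$base" "$keys" || { echo "$base"; return 0; }
  for s in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    grep -qxF "$base$s" "$keys" || { echo "$base$s"; return 0; }
  done
  n=27
  while grep -qxF "${base}_${n}" "$keys"; do n=$((n + 1)); done
  echo "${base}_${n}"
}
```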

Wiki Frontmatter

Every wiki note in ~/crossmem/wiki/ starts with YAML frontmatter between --- delimiters.

Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `cite_key` | string | yes | DSL-generated citation key. Example: `vaswani2017attention` |
| `title` | string | yes | Paper title |
| `authors` | list[string] | yes | List of author names |
| `year` | integer | yes | Publication year |
| `arxiv_id` | string | yes (arXiv) | arXiv identifier, e.g. `1706.03762` |
| `doi` | string | no | DOI (may be preprint DOI) |
| `doi_preprint` | string | no | Preprint DOI (e.g. `10.48550/arXiv.1706.03762`) |
| `doi_published` | string | no | Published version DOI (if paper was published in a journal) |
| `captured_at` | string | yes | Unix timestamp of capture |
| `raw` | string | yes | Path to the raw PDF file |
| `pdf_sha256` | string | yes | SHA-256 hash of the raw PDF bytes |
| `parser` | string | yes | Parser used: `marker`, `pdftotext` |
| `chunks` | integer | yes | Number of chunks in the document |
| `meta.sources` | list[string] | yes | Metadata sources used: `arxiv`, `crossref`, `openalex` |
| `meta.reconciled` | boolean | yes | Whether metadata was cross-verified across sources |
| `meta.warnings` | list[string] | no | Warnings from metadata reconciliation |

Example

---
cite_key: vaswani2017attention
title: "Attention Is All You Need"
authors:
  - "Ashish Vaswani"
  - "Noam Shazeer"
  - "Niki Parmar"
year: 2017
arxiv_id: "1706.03762"
doi: "10.48550/arXiv.1706.03762"
captured_at: "1776227254"
raw: "~/crossmem/raw/1776227254_vaswani2017attention.pdf"
pdf_sha256: "9a8f3b..."
parser: "marker"
chunks: 47
meta:
  sources: ["arxiv", "crossref", "openalex"]
  reconciled: true
  warnings: []
---

Citations section

After the frontmatter, the wiki body starts with a title heading and a ## Citations section containing five subsections:

# Attention Is All You Need

## Citations

### APA
Vaswani, A., & Shazeer, N. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

### MLA
Vaswani, Ashish, et al. "Attention Is All You Need." arXiv preprint arXiv:1706.03762 (2017).

### Chicago
Vaswani, Ashish, and Noam Shazeer. "Attention Is All You Need." arXiv preprint arXiv:1706.03762 (2017).

### IEEE
A. Vaswani et al., "Attention Is All You Need," arXiv preprint arXiv:1706.03762, 2017.

### BibTeX
```bibtex
@article{vaswani2017attention,
  title={Attention Is All You Need},
  author={Ashish Vaswani and Noam Shazeer},
  year={2017}
}
```

Chunk Format

After the citations section, each wiki note contains a series of chunks. Each chunk preserves verbatim text from the source PDF along with provenance metadata.

Chunk structure

<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention".
> The input consists of queries and keys of dimension dk, and values
> of dimension dv.

**Paraphrase:** The authors name their mechanism "Scaled Dot-Product Attention" and define its inputs.

**Implication:** This naming convention becomes the standard terminology used across the field.

```yaml
provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: "5f3e1c..."
  byte_range: [18342, 19104]
```

## Chunk ID format

Chunk IDs follow the pattern `p{page}s{section}c{chunk}`:

| Part | Description | Example |
|------|-------------|---------|
| `p{N}` | Page number | `p4` = page 4 |
| `s{N}` | Section number within page | `s32` = section 3.2 |
| `c{N}` | Chunk number within section | `c1` = first chunk |
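The ID scheme can be sketched as below; treating the `s` component as the section number with its dots stripped (`3.2` → `32`) is an assumption drawn from the example in the table:

```rust
// Sketch of the `p{page}s{section}c{chunk}` ID scheme. The dot-stripping
// of the section number is an assumption based on the `s32` example.
fn chunk_id(page: usize, section: &str, chunk: usize) -> String {
    let sec: String = section.chars().filter(|c| c.is_ascii_digit()).collect();
    format!("p{page}s{sec}c{chunk}")
}

fn main() {
    assert_eq!(chunk_id(4, "3.2", 1), "p4s32c1");
    assert_eq!(chunk_id(1, "1", 1), "p1s1c1");
}
```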

## Fields

### Verbatim text

Lines starting with `> ` contain the original text extracted from the PDF. This text is **never modified by the LLM** — it comes directly from the PDF parser.

### Paraphrase

A 1–2 sentence LLM-generated summary of the chunk's content. Generated by Ollama during `crossmem compile`.

### Implication

A 1–2 sentence LLM-generated statement about why this chunk matters to the field. Generated by Ollama during `crossmem compile`.

### Provenance

YAML metadata block attached to each chunk:

| Field | Type | Description |
|-------|------|-------------|
| `page` | integer | Page number in the source PDF |
| `section` | string | Section heading (if detected by parser) |
| `bbox` | `[f64; 4]` | Bounding box `[x_min, y_min, x_max, y_max]` in PDF coordinates. Present when parsed with Marker. |
| `text_sha256` | string | SHA-256 hash of the verbatim text. Used by `crossmem verify` to detect drift. |
| `byte_range` | `[usize; 2]` | `[start, end]` byte offset in the source PDF content stream. Present when available from parser. |

## Chunk types

The `chunk_type` field (internal) classifies each chunk:

| Type | Description |
|------|-------------|
| `page` | Full-page text (from `pdftotext` fallback) |
| `heading` | Section heading |
| `paragraph` | Body paragraph (from Marker block tree) |
| `figure` | Figure caption |
| `table` | Table content |
| `equation` | Mathematical expression |

## Integrity verification

Run `crossmem verify` to re-hash every chunk's verbatim text and compare against the stored `text_sha256`. Any mismatch indicates the wiki file has been modified since compilation.
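The verification loop can be sketched as follows; the hash function is injected here so the comparison logic stands alone (the real tool computes SHA-256 over the verbatim text):

```rust
// Sketch of drift detection: re-hash each chunk's verbatim text and
// compare against the stored digest. Returns the indices of drifted chunks.
fn detect_drift<F: Fn(&str) -> String>(chunks: &[(String, String)], hash: F) -> Vec<usize> {
    let mut drifted = Vec::new();
    for (i, (text, stored)) in chunks.iter().enumerate() {
        if hash(text.as_str()) != *stored {
            drifted.push(i);
        }
    }
    drifted
}

fn main() {
    // A toy hash stands in for SHA-256 in this sketch.
    let toy_hash = |s: &str| format!("{:x}", s.len());
    let chunks = vec![
        ("unchanged text".to_string(), toy_hash("unchanged text")),
        ("edited text".to_string(), "deadbeef".to_string()),
    ];
    assert_eq!(detect_drift(&chunks, toy_hash), vec![1]);
}
```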

Pipeline Overview

crossmem’s citation pipeline transforms a URL into a structured, verifiable wiki note.

Pipeline diagram

graph TD
    A[crossmem capture URL] --> B[Download PDF]
    B --> C[Fetch arXiv metadata]
    C --> D[Reconcile: CrossRef + OpenAlex]
    D --> E[Generate cite_key via DSL]
    E --> F["Save raw PDF + .meta.json"]

    G[crossmem compile cite_key] --> H[Load raw PDF + metadata]
    H --> I{Marker available?}
    I -->|Yes| J[Marker: paragraph chunks + bbox]
    I -->|No| K[pdftotext: page-level chunks]
    J --> L[Compute SHA-256 per chunk]
    K --> L
    L --> M[Ollama: paraphrase + implication per chunk]
    M --> N[Generate 5 citation formats]
    N --> O["Emit wiki markdown to ~/crossmem/wiki/"]

    P[crossmem verify] --> Q[Walk wiki files]
    Q --> R[Re-hash chunk text]
    R --> S{SHA-256 match?}
    S -->|Yes| T[OK]
    S -->|No| U[DRIFT detected]

    V[crossmem mcp serve] --> W[Load wiki entries]
    W --> X[crossmem_cite: lookup by key]
    W --> Y[crossmem_recall: search by query]

Why capture and compile are separate

capture is lightweight and idempotent: it issues API calls to arXiv, CrossRef, and OpenAlex, downloads the PDF, and writes metadata. You can re-run it to refresh metadata without re-parsing. compile is heavyweight: it invokes Marker (or another PDF parser) and Ollama to produce chunk-level paraphrases and implications. Separating the two lets you swap the PDF parser (Marker → Nougat → GROBID) or change the LLM model without re-downloading anything. It also enables a practical workflow: batch-capture dozens of papers first, then compile them at leisure — or only compile the ones that turn out to be relevant.

Stage details

Capture

  1. URL parsing — extracts arXiv ID from various URL formats (/abs/, /pdf/, bare ID)
  2. PDF download — fetches PDF, computes SHA-256, saves to ~/crossmem/raw/
  3. Metadata fetch — queries arXiv API for title, authors, year
  4. Metadata reconciliation — cross-checks against CrossRef (via DOI) and OpenAlex. Flags disagreements as warnings in frontmatter.
  5. Cite key generation — applies the configured pattern DSL to the reconciled metadata

Compile

  1. PDF parsing — Marker (with MPS acceleration) produces paragraph-level blocks with bounding-box coordinates. Falls back to pdftotext -layout for page-level extraction.
  2. Chunk assembly — blocks are grouped into typed chunks (paragraph, heading, figure, table, equation) with unique IDs
  3. Provenance — each chunk gets page, section, bbox, SHA-256, and byte range
  4. LLM pass — Ollama generates paraphrase and implication for each chunk. The LLM never sees or modifies the original text.
  5. Citation generation — deterministic formatting into APA, MLA, Chicago, IEEE, BibTeX
  6. Emission — final wiki markdown written to ~/crossmem/wiki/

Verify

Walks all wiki files, re-extracts verbatim text from > blockquote lines, re-computes SHA-256, and compares against the stored text_sha256 in provenance blocks. Reports any mismatch as drift.

MCP serve

Loads wiki entries into memory, exposes crossmem_cite (lookup by cite key with fuzzy matching) and crossmem_recall (full-text search with relevance ranking) over stdio MCP transport.

Data Model

Core types

ReconciledMetadata

The metadata reconciler merges data from multiple sources into a single canonical record.

#![allow(unused)]
fn main() {
pub struct ReconciledMetadata {
    pub title: String,
    pub authors: Vec<String>,
    pub year: u16,
    pub arxiv_id: String,
    pub doi: Option<String>,
    pub doi_preprint: Option<String>,
    pub doi_published: Option<String>,
    pub sources: Vec<String>,       // e.g. ["arxiv", "crossref", "openalex"]
    pub warnings: Vec<String>,
    pub reconciled: bool,
}
}

ChunkV2

The paragraph-level chunk with full provenance.

#![allow(unused)]
fn main() {
pub struct ChunkV2 {
    pub chunk_type: String,         // "page", "heading", "paragraph", etc.
    pub chunk_id: String,           // e.g. "p1s1c1"
    pub page: usize,
    pub text: String,               // Verbatim extracted text
    pub provenance: Provenance,
    pub paraphrase: Option<String>, // LLM-generated
    pub implication: Option<String>,// LLM-generated
}
}

Provenance

Tracks exactly where a chunk came from in the source PDF.

#![allow(unused)]
fn main() {
pub struct Provenance {
    pub page: usize,
    pub section: Option<String>,
    pub bbox: Option<[f64; 4]>,     // [x_min, y_min, x_max, y_max]
    pub text_sha256: String,
    pub byte_range: Option<[usize; 2]>,
}
}

WikiEntry (MCP)

The in-memory representation used by the MCP server.

#![allow(unused)]
fn main() {
struct WikiEntry {
    cite_key: Option<String>,
    title: String,
    authors: Vec<String>,
    year: Option<u16>,
    source: Option<String>,
    date: Option<String>,
    file_path: PathBuf,
    body: String,
}
}

Storage layout

~/crossmem/
├── raw/                                    # Capture output
│   ├── <timestamp>_<cite_key>.pdf          # Raw PDF
│   └── <timestamp>_<cite_key>.meta.json    # Reconciled metadata
└── wiki/                                   # Compile output
    └── <timestamp>_<cite_key>.md           # Wiki note

Trust boundaries

| Data | Source | Verifiable? |
|------|--------|-------------|
| Title, authors, year, DOI | Metadata reconciler (arXiv + CrossRef + OpenAlex) | Cross-source agreement |
| Cite key, citation strings | Deterministic generator | Pure function, unit-tested |
| Verbatim quote text | PDF extractor (Marker / pdftotext) | SHA-256 hash |
| Bounding box, byte range | PDF extractor | Re-extraction reproducibility |
| Paraphrase, implication | LLM (Ollama) | Not verifiable — advisory only |

Chunk-based Citation v2 Design

Status: Implemented (Phase 2 MVP shipped) Date: 2026-04-15


User requirement

How do we ensure citations are absolutely correct — foolproof, with zero chance of error?

One-line answer: Verbatim text + bbox provenance is ground truth; LLM only touches paraphrase/implication, never quotes; metadata is cross-verified across ≥2 canonical sources.

Competitor survey

| Tool | What it nails | What it misses |
|------|---------------|----------------|
| Zotero + Better BibTeX | Stable cite_key via JS-ish pattern DSL; key regeneration rules; 80%+ academic mind-share | No chunk/page content; just a metadata container |
| Marker (datalab-to/marker) | PDF→markdown + polygon bbox per block, `--keep_chars` for char-level bboxes, JSON tree per page | Slower than pdftotext; needs CUDA/MPS |
| Nougat | Transformer-based; beats GROBID on formulas | VLM → hallucination risk on quote fidelity |
| GROBID | 68 fine-grained TEI labels; best on metadata + bibliography refs; 2–5 s/page, 90%+ accuracy | Weak on formulas, figures, modern layouts |
| PaperQA2 | Chunk size configurable; LLM re-rank + contextual summarization; grounded in-text citations | No bbox; chunk = N-char sliding window → page/fragment precision lost |
| Tensorlake RAG | Anchor tokens `<c>2.1</c>` inlined + bbox stored separately → auditable trail | Proprietary pipeline; design pattern is copyable |
| OpenAlex / CrossRef / Semantic Scholar | Each is a canonical metadata source | Each has gaps; must cross-reconcile |

The industry gold standard for “absolutely correct citation”:

  1. Parse once with bbox-aware extractor (Marker-class) → each block has {page, polygon, text}.
  2. Anchor tokens inlined at chunk build time (<c>p4§3.2</c>) so LLM can only emit citation IDs it saw in context.
  3. Resolve citation IDs → bbox + page at render time; users get deep-link to the exact PDF region.
  4. Metadata cross-check across OpenAlex + CrossRef + arXiv; flag inconsistencies instead of silently picking one.
  5. Quote is verbatim from the PDF text layer, stored with SHA-256 of the source bytes — any LLM-generated “quote” is rejected.

What Phase 1 got right / wrong

Right: pre-gen APA/MLA/Chicago/IEEE/BibTeX, deterministic cite_key, per-page original text preserved verbatim, paraphrase/implication separated from quote.

Wrong / gap:

  • Metadata only from arXiv API (no CrossRef/OpenAlex cross-check)
  • Quote preservation is page-level, not paragraph/sentence
  • No bbox — can’t deep-link into PDF region
  • No hash-based verifiability
  • cite_key = primitive pattern vs Better BibTeX DSL
  • No handling of preprint→published DOI mapping

Phase 2 architecture

2A. Metadata layer (the cite_key + bib trust root)

Pipeline:

arxiv_id → [arxiv API]  ┐
        → [CrossRef]    ├─→ reconcile → canonical metadata
        → [OpenAlex]    ┘                   │
                                            ├─→ cite_key (Better-BibTeX-style pattern, configurable)
                                            ├─→ 5 formats (APA/MLA/Chicago/IEEE/BibTeX)
                                            └─→ DOI + published-version DOI (if preprint)

Rules:

  • ≥2 sources must agree on title + first-author + year. Disagreement → emit meta.warnings in frontmatter.
  • cite_key pattern DSL (ported from Better BibTeX): [auth:lower][year][shorttitle:1:nopunct], configurable via ~/.crossmem/config.toml.
  • Track preprint↔published mapping in meta.doi_preprint and meta.doi_published.
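The ≥2-source agreement rule might look like the following sketch; the struct and field names are illustrative, and the real reconciler also compares first authors and normalizes more aggressively:

```rust
// Illustrative record of what each metadata source reports.
struct SourceRecord {
    source: &'static str,
    title: String,
    year: u16,
}

// Reconciled when at least two sources agree on (normalized title, year);
// every outlier source produces a warning instead of being silently dropped.
fn reconcile(records: &[SourceRecord]) -> (bool, Vec<String>) {
    let norm = |s: &str| s.to_lowercase();
    let agreeing = |r: &SourceRecord| {
        records
            .iter()
            .filter(|o| norm(&o.title) == norm(&r.title) && o.year == r.year)
            .count()
    };
    let reconciled = records.iter().any(|r| agreeing(r) >= 2);
    let warnings = records
        .iter()
        .filter(|r| agreeing(r) < 2)
        .map(|r| format!("{}: title/year disagrees with other sources", r.source))
        .collect();
    (reconciled, warnings)
}

fn main() {
    let records = vec![
        SourceRecord { source: "arxiv", title: "Attention Is All You Need".into(), year: 2017 },
        SourceRecord { source: "crossref", title: "Attention Is All You Need".into(), year: 2017 },
        SourceRecord { source: "openalex", title: "Attention Is All You Need".into(), year: 2023 },
    ];
    let (reconciled, warnings) = reconcile(&records);
    assert!(reconciled);
    assert_eq!(warnings, vec!["openalex: title/year disagrees with other sources"]);
}
```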

2B. PDF parsing layer (the chunk trust root)

Tiered strategy by document type + quality tier:

| Tier | Parser | Use when | Bbox? | Speed |
|------|--------|----------|-------|-------|
| 0 | pdftotext -layout | Fallback / pure text | No | instant |
| 1 | Marker (Mac MPS) | Default for arXiv | Yes, polygon/block | 1–3 s/page |
| 2 | GROBID (JVM, local) | Bib references + structured metadata | Yes, TEI | 2–5 s/page |
| 3 | Nougat (MPS) | Formula-heavy pages | Partial | 5–15 s/page |

Phase 2 default: Marker for the body + GROBID for the bibliography; both run, and their outputs merge into a unified chunk tree.

2C. Chunk schema v2 (bbox + hash provenance)

---
cite_key: vaswani2017attention
meta:
  sources: [arxiv, crossref, openalex]
  reconciled: true
  warnings: []
  pdf_sha256: 9a8f...
...
---

## p.4 §3.2 Scaled Dot-Product Attention

<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention"...

provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: 5f3e1c...
  byte_range: [18342, 19104]

**Paraphrase:** …
**Implication:** …
  • text_sha256 = SHA-256 of the verbatim extracted text. Re-running the extractor must reproduce it, else the chunk is flagged stale.
  • bbox + page = deep-link target: crossmem://pdf/{cite_key}#p=4&bbox=72,340,523,412.
  • byte_range = PDF content-stream offset (from Marker); cheapest way to re-verify without re-extraction.
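A sketch of the deep-link construction, assuming coordinates are truncated to whole points as in the example URL:

```rust
// Sketch of the crossmem:// deep-link scheme quoted above; bbox values
// are truncated to whole points, matching `#p=4&bbox=72,340,523,412`.
fn deep_link(cite_key: &str, page: usize, bbox: [f64; 4]) -> String {
    let [x0, y0, x1, y1] = bbox.map(|v| v as i64); // truncate fractional points
    format!("crossmem://pdf/{cite_key}#p={page}&bbox={x0},{y0},{x1},{y1}")
}

fn main() {
    assert_eq!(
        deep_link("vaswani2017attention", 4, [72.0, 340.5, 523.8, 412.1]),
        "crossmem://pdf/vaswani2017attention#p=4&bbox=72,340,523,412"
    );
}
```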

2D. LLM contract (what model is / isn’t allowed to touch)

| Field | Who writes | Verifiable? |
|-------|-----------|-------------|
| title, authors, year, doi, arxiv_id | Metadata reconciler | Cross-source check |
| cite_key, 5 citation strings | Deterministic generator | Pure function, unit-tested |
| original (the quote) | PDF extractor | SHA-256 + byte_range |
| paraphrase, implication | LLM | Never trusted for provenance |
| figure.caption | PDF extractor | bbox + OCR of caption only |
| figure.implication | LLM | Same rule: advisory text only |

The pipeline never asks the LLM to produce a quote. If a future feature wants “the key sentence on this page”, the LLM picks a sentence index from a numbered list of extracted sentences; it never emits the sentence text itself.
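The index-only contract can be sketched directly: the model returns an integer, and the verbatim sentence is looked up from extractor output, so an out-of-range or fabricated answer yields nothing rather than a fake quote. Function and parameter names here are hypothetical:

```rust
// The LLM may only choose an index into the extractor's sentence list;
// any out-of-range choice yields None instead of fabricated text.
fn quote_by_index<'a>(sentences: &[&'a str], llm_choice: usize) -> Option<&'a str> {
    sentences.get(llm_choice).copied()
}

fn main() {
    let sentences = ["First sentence.", "Key claim.", "Closing remark."];
    assert_eq!(quote_by_index(&sentences, 1), Some("Key claim."));
    assert_eq!(quote_by_index(&sentences, 9), None);
}
```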

2E. Paragraph- and figure-level chunking

  • Paragraph splitter: Marker’s block tree → paragraph-typed blocks become chunks (not pages).
  • Figure chunks: Marker figure blocks → crop image to raw/figs/{cite_key}_fig{N}.png, caption extracted separately, implication runs on caption-only.
  • Table chunks: Marker table block → markdown-table format, implication on markdown text.
  • Equation chunks: Nougat output in LaTeX, stored as $$…$$, implication on LaTeX source.

2F. Idempotence + re-compile

  • Re-running capture is idempotent on arxiv_id: re-downloads only if pdf_sha256 differs.
  • Re-running compile re-does LLM pass only for chunks whose text_sha256 changed.
  • crossmem verify <cite_key> walks the wiki, re-extracts, re-hashes; reports any mismatches.
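The selective re-compile rule reduces to a per-chunk hash comparison; this sketch uses illustrative types (chunk ID paired with its `text_sha256`):

```rust
use std::collections::HashMap;

// Sketch: re-run the LLM pass only for chunks whose stored hash differs
// from the current one, or which did not exist in the previous compile.
fn chunks_to_recompile(
    current: &[(String, String)],       // (chunk_id, text_sha256)
    previous: &HashMap<String, String>, // chunk_id -> stored text_sha256
) -> Vec<String> {
    current
        .iter()
        .filter(|(id, hash)| previous.get(id) != Some(hash))
        .map(|(id, _)| id.clone())
        .collect()
}

fn main() {
    let previous: HashMap<_, _> = [
        ("p1s1c1".to_string(), "aaa".to_string()),
        ("p1s1c2".to_string(), "bbb".to_string()),
    ]
    .into_iter()
    .collect();
    let current = vec![
        ("p1s1c1".to_string(), "aaa".to_string()), // unchanged: skip
        ("p1s1c2".to_string(), "ccc".to_string()), // edited: recompile
        ("p1s1c3".to_string(), "ddd".to_string()), // new: recompile
    ];
    assert_eq!(chunks_to_recompile(&current, &previous), vec!["p1s1c2", "p1s1c3"]);
}
```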

Implementation order

  1. Metadata reconciler (arxiv + crossref + openalex merge, warnings on disagreement)
  2. cite_key pattern DSL (Better-BibTeX-style, unit-tested)
  3. Marker integration via uvx marker-pdf CLI (Python sidecar; Rust drives via subprocess + JSON)
  4. Chunk schema v2 writer (paragraph/figure/table/equation chunks with bbox + hash)
  5. GROBID on-demand for bibliography references
  6. crossmem verify command
  7. Nougat sidecar for math-heavy pages (opt-in)

What this buys the user

Writing a paper citing Vaswani 2017 p.4 §3.2:

Before (Phase 1): User opens wiki, sees page-4 summary paragraph, pastes bibtex. May still need to open PDF to find exact sentence.

After (Phase 2):

  • Wiki shows §3.2 as a dedicated chunk with verbatim quote.
  • Clicking the provenance block opens the PDF at page 4 with the bbox highlighted.
  • Cite key vaswani2017attention is guaranteed stable across the arXiv preprint → published NeurIPS version.
  • Running crossmem verify weekly confirms that no wiki note has silently drifted from its PDF source.

YouTube Ingestion Pipeline — Design Document

Status: Draft Author: crossmem team Date: 2026-04-15 Tracking: crossmem-rs#27


1. Overview

Extend crossmem capture <url> to detect youtube.com / youtu.be hosts and dispatch to a YouTube-specific pipeline that produces time-aligned wiki chunks — the video analog of the PDF chunk pipeline from #24.

The pipeline runs entirely local on an Apple Silicon Mac mini (M2/M4). No cloud APIs.

Pipeline stages

capture (download + extract audio/subs)
  → transcribe (whisper.cpp Metal)
  → keyframes (ffmpeg scene-cut)
  → OCR + VLM caption (per keyframe)
  → compile (Ollama paraphrase/implication per chunk)
  → emit wiki markdown

2. Download Path

Decision: yt-dlp binary

| Option | Pros | Cons |
|--------|------|------|
| yt-dlp binary | Battle-tested, handles every edge case, active community, `--cookies-from-browser` for member-only | External dep, Python-based, updates frequently |
| libyt-dlp bindings | Tighter integration | No stable C API; Python FFI is fragile |
| youtube-rs (pure Rust) | No external dep | Incomplete, breaks on YT changes, no auth, no live/shorts |

yt-dlp wins because YouTube aggressively rotates extraction logic. Maintaining a pure-Rust extractor is a full-time job. yt-dlp is the industry standard for a reason.

Edge cases handled by yt-dlp flags

| Scenario | yt-dlp flags |
|----------|-------------|
| Age-gated | `--cookies-from-browser chrome` (reads real Chrome cookies) |
| Member-only | Same cookie approach; user must be logged in |
| Live streams | `--live-from-start --wait-for-video 30` (wait + download from start) |
| Shorts | Works as normal URLs (`youtube.com/shorts/ID` → standard extraction) |
| Playlists | `--yes-playlist` or `--no-playlist` (user flag; default: single video) |
| Chapters | `--embed-chapters` + `--write-info-json` (chapter list in info JSON) |
| Auto captions | `--write-auto-subs --sub-lang en` |
| Human captions | `--write-subs --sub-lang en` (preferred over auto when available) |

Download command template

yt-dlp \
  --format "bestaudio[ext=m4a]/bestaudio/best" \
  --extract-audio --audio-format wav --audio-quality 0 \
  --write-info-json \
  --write-subs --write-auto-subs --sub-lang "en.*" --sub-format vtt \
  --embed-chapters \
  --cookies-from-browser chrome \
  --output "%(id)s.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"

For keyframe extraction we also need the video file:

yt-dlp \
  --format "bestvideo[height<=1080][ext=mp4]/bestvideo[height<=1080]/best" \
  --write-info-json \
  --output "%(id)s_video.%(ext)s" \
  --paths "$HOME/crossmem/raw/youtube/" \
  "$URL"

3. Audio Extraction → Transcription

Decision: whisper.cpp with Metal acceleration, large-v3-turbo model

| Engine | Backend | Speed (1h audio, M2) | Accuracy | Notes |
|--------|---------|----------------------|----------|-------|
| whisper.cpp | Metal (Apple GPU) | ~6–8 min | WER ~8% (large-v3-turbo) | C/C++, no Python, `--print-timestamps` for word-level |
| whisper-mlx | MLX (Apple GPU) | ~5–7 min | Same models | Python dep, MLX framework, slightly faster on M4 |
| WhisperKit | CoreML | ~5–6 min | Good | Swift-only, harder to call from Rust |
| insanely-fast-whisper | MPS (PyTorch) | ~10–15 min | Same models | Heavy Python stack, MPS less optimized than Metal |
| faster-whisper | CTranslate2 (CPU) | ~15–25 min | Same models | No Metal/MPS; CPU-only on macOS |

whisper.cpp wins because:

  1. Native Metal acceleration — no Python runtime
  2. Easily called from Rust via std::process::Command (same pattern as pdftotext in cite.rs)
  3. Outputs VTT/SRT/JSON with word-level timestamps
  4. Active project, models available via Hugging Face in ggml format

Model choice: large-v3-turbo

| Model | Params | VRAM | Disk | Speed (M2, 1h) | WER (en) |
|-------|--------|------|------|----------------|----------|
| large-v3 | 1.55B | ~3 GB | 3.1 GB | ~12 min | ~7.5% |
| large-v3-turbo | 809M | ~1.6 GB | 1.6 GB | ~6 min | ~8% |
| distil-large-v3 | 756M | ~1.5 GB | 1.5 GB | ~5 min | ~9% |

large-v3-turbo is the sweet spot: half the VRAM of large-v3, nearly the same WER, 2× faster. distil-large-v3 is marginally faster but has slightly worse accuracy on non-native English speakers (common in academic talks).

Transcription command

whisper-cpp \
  --model models/ggml-large-v3-turbo.bin \
  --file "$HOME/crossmem/raw/youtube/${VIDEO_ID}.wav" \
  --output-vtt \
  --output-json \
  --print-timestamps \
  --language en \
  --threads 4

Caption priority

  1. Human-uploaded subtitles (.en.vtt from yt-dlp) — highest quality, use as-is
  2. whisper.cpp transcription — always run for timestamp alignment even if subs exist
  3. Auto-generated YouTube captions — fallback only; lower quality than whisper

When human subs exist, align them with whisper timestamps for precise time-coding.

Speaker diarization

Decision: Skip for P1, add in P3 if needed.

Rationale:

  • Most YouTube content crossmem targets is solo presenter (lectures, conference talks, tutorials)
  • pyannote requires Python + HF token + ~2 GB model; adds significant complexity
  • sherpa-onnx is lighter but diarization accuracy on overlapping speech is still mediocre
  • Can retrofit later: diarization produces (speaker_id, start, end) segments that merge with existing transcript chunks

If multi-speaker content becomes common, P3 can add pyannote 3.1 with speaker embedding.


4. Visual Understanding

4a. Keyframe extraction

Decision: ffmpeg scene-cut detection

ffmpeg -i "${VIDEO_ID}_video.mp4" \
  -vf "select='gt(scene,0.3)',showinfo" \
  -vsync vfr \
  -frame_pts 1 \
  "${OUTPUT_DIR}/keyframe_%04d.png" \
  2>&1 | grep "pts_time" > "${OUTPUT_DIR}/keyframe_times.txt"

| Method | Pros | Cons |
|--------|------|------|
| ffmpeg scene filter | Zero extra deps, timestamp-aware, tunable threshold | May over/under-extract |
| TransNetV2 | ML-based, higher accuracy | Python + PyTorch dep, overkill for slides |
| PySceneDetect | Good API | Python dep |

ffmpeg is already a required dependency (for audio extraction). Scene threshold 0.3 works well for slide-based content; can tune per-video.

Chapter-aware extraction: If the info JSON contains chapters, also extract one keyframe per chapter boundary (seek to chapter_start + 2s). Merge with scene-cut keyframes, deduplicate within 5s window.

Target: 1 keyframe per 30–120 seconds depending on content type. Cap at 200 keyframes per video.
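The merge-and-deduplicate step can be sketched as follows; the 2 s chapter offset, 5 s window, and 200-frame cap come from the text above, while the function name is hypothetical:

```rust
// Sketch: combine scene-cut and chapter-boundary timestamps (seconds),
// drop any frame within 5 s of an earlier one, and cap at 200 keyframes.
fn merge_keyframes(mut scene_cuts: Vec<f64>, chapter_starts: Vec<f64>) -> Vec<f64> {
    // Chapter keyframes are taken 2 s after the boundary, per the text above.
    scene_cuts.extend(chapter_starts.iter().map(|t| t + 2.0));
    scene_cuts.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mut merged: Vec<f64> = Vec::new();
    for t in scene_cuts {
        if merged.last().map_or(true, |last| t - last >= 5.0) {
            merged.push(t);
        }
    }
    merged.truncate(200); // hard cap per video
    merged
}

fn main() {
    // The chapter at 8 s yields a keyframe at 10 s, which deduplicates
    // against the scene cut at 10 s; 3 s is within 5 s of 0 s and drops.
    assert_eq!(merge_keyframes(vec![0.0, 3.0, 10.0], vec![8.0]), vec![0.0, 10.0]);
}
```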

4b. Per-keyframe VLM caption

Decision: Qwen2.5-VL-7B via Ollama (local)

Ollama already supports multimodal models. The existing Ollama integration in cite.rs targets http://localhost:11434/api/generate — the same endpoint accepts image input with "images": [base64_png].

{
  "model": "qwen2.5-vl:7b",
  "prompt": "Describe this video frame in one sentence. If it contains a slide, list the title and key bullet points.",
  "images": ["<base64_keyframe>"],
  "stream": false
}

| Model | VRAM | Speed (per frame, M2) | Quality |
|-------|------|----------------------|---------|
| Qwen2.5-VL-7B (q4_K_M) | ~5 GB | ~3–5 sec | Good for slides/diagrams |
| LLaVA-1.6-7B | ~5 GB | ~3–5 sec | Slightly worse on text-heavy slides |
| Qwen2.5-VL-3B | ~2.5 GB | ~1–2 sec | Faster but misses fine text |

Qwen2.5-VL-7B is the best local VLM for slide/diagram content. 7B quantized fits comfortably alongside whisper on M2 (16 GB unified memory).

Batching: Process keyframes sequentially (VLM needs full GPU). At ~4 sec/frame × 100 frames = ~7 min. Acceptable.

4c. OCR on slides

Decision: Apple Vision framework via swift-ffi (primary), Tesseract (fallback)

| Engine | Accuracy | Speed | Dependencies |
|--------|----------|-------|--------------|
| Apple Vision (VNRecognizeTextRequest) | Excellent, especially printed text | ~0.1s/image | macOS 13+, Swift FFI |
| PaddleOCR | Very good, multi-language | ~0.3s/image | Python + large model |
| Tesseract | Good for English | ~0.5s/image | `brew install tesseract` |

Apple Vision is the clear winner on macOS: built-in, fast, accurate, no extra deps. Access from Rust via a tiny Swift CLI helper:

// crossmem-ocr (Swift CLI, ~30 lines)
import Vision
let request = VNRecognizeTextRequest()
request.recognitionLevel = .accurate
// ... read image, perform request, print results as JSON

Compile as crossmem-ocr binary, call from Rust via Command::new("crossmem-ocr"). Ship as part of the crossmem install or build from source on first run.

Fallback: If the Swift helper isn’t available (Linux compat someday), fall back to tesseract --oem 1 -l eng.


5. Chunk Schema

Time-aligned chunk (parallel to CompiledChunk in cite.rs)

#![allow(unused)]
fn main() {
pub struct YouTubeChunk {
    pub start_ms: u64,
    pub end_ms: u64,
    pub speaker: Option<String>,       // None until diarization (P3)
    pub transcript: String,            // Whisper or human-sub text for this segment
    pub slide_ocr: Option<String>,     // OCR text if keyframe in this time range
    pub keyframe_path: Option<String>, // Relative path to keyframe PNG
    pub keyframe_caption: Option<String>, // VLM description of keyframe
    pub paraphrase: String,            // LLM-generated 1-2 sentence summary
    pub implication: String,           // LLM-generated field impact
}
}

Chunk boundaries

Priority order for segmentation:

  1. Chapters (from info JSON) — if present, each chapter = one chunk
  2. Scene cuts — if no chapters, split at scene-cut boundaries
  3. Fixed window — fallback: 60-second segments with sentence-boundary snapping

Within a chapter, if the chapter exceeds 5 minutes, sub-split at scene cuts or 60s intervals.

Minimum chunk: 10 seconds. Maximum chunk: 5 minutes (force-split at sentence boundary).
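The priority order reduces to a small selection function; sentence-boundary snapping and the 10 s / 5 min clamps are omitted from this sketch:

```rust
// Sketch of the segmentation priority: chapters if present, else scene
// cuts, else fixed 60-second windows. Times are chunk start offsets in
// seconds; the clamping rules from the text above are left out.
fn chunk_boundaries(duration_sec: f64, chapters: &[f64], scene_cuts: &[f64]) -> Vec<f64> {
    if !chapters.is_empty() {
        return chapters.to_vec();
    }
    if !scene_cuts.is_empty() {
        return scene_cuts.to_vec();
    }
    // Fallback: fixed 60-second windows.
    let mut t = 0.0;
    let mut out = Vec::new();
    while t < duration_sec {
        out.push(t);
        t += 60.0;
    }
    out
}

fn main() {
    assert_eq!(chunk_boundaries(150.0, &[], &[]), vec![0.0, 60.0, 120.0]);
    assert_eq!(chunk_boundaries(150.0, &[0.0, 45.0], &[10.0]), vec![0.0, 45.0]);
}
```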

Metadata struct

#![allow(unused)]
fn main() {
pub struct YouTubeMetadata {
    pub title: String,
    pub channel: String,
    pub upload_date: String,         // YYYY-MM-DD
    pub video_id: String,
    pub duration_sec: u64,
    pub chapters: Vec<Chapter>,      // from info JSON
    pub description: String,
    pub tags: Vec<String>,
}

pub struct Chapter {
    pub title: String,
    pub start_sec: f64,
    pub end_sec: f64,
}
}

Cite key

{channel_slug}{year}{first_noun_of_title}

Examples:

  • 3Blue1Brown, “But what is a neural network?” (2017) → 3blue1brown2017neural
  • Andrej Karpathy, “Let’s build GPT from scratch” (2023) → karpathy2023gpt
  • Two Minute Papers, “OpenAI Sora” (2024) → twominutepapers2024sora

channel_slug = channel name lowercased, non-alphanumeric stripped, truncated to 20 chars.

Each chunk carries a provenance URL:

https://youtu.be/{VIDEO_ID}?t={floor(start_ms / 1000)}
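Both rules are simple enough to sketch directly; `channel_slug` and `provenance_url` are hypothetical helper names:

```rust
// Channel slug per the rule above: lowercase, strip non-alphanumerics,
// truncate to 20 characters.
fn channel_slug(channel: &str) -> String {
    channel
        .to_lowercase()
        .chars()
        .filter(|c| c.is_ascii_alphanumeric())
        .take(20)
        .collect()
}

// Timestamped provenance URL; integer division floors start_ms to seconds.
fn provenance_url(video_id: &str, start_ms: u64) -> String {
    format!("https://youtu.be/{}?t={}", video_id, start_ms / 1000)
}

fn main() {
    assert_eq!(channel_slug("3Blue1Brown"), "3blue1brown");
    assert_eq!(channel_slug("Two Minute Papers"), "twominutepapers");
    assert_eq!(provenance_url("aircAruvnKk", 92_500), "https://youtu.be/aircAruvnKk?t=92");
}
```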

6. Citation Formats

APA 7th (online video)

{Channel} [{Channel}]. ({Year}, {Month} {Day}). {Title} [Video]. YouTube. https://www.youtube.com/watch?v={VIDEO_ID}

Example:

3Blue1Brown [3Blue1Brown]. (2017, October 5). But what is a neural network? [Video]. YouTube. https://www.youtube.com/watch?v=aircAruvnKk

MLA 9th

"{Title}." YouTube, uploaded by {Channel}, {Day} {Month} {Year}, www.youtube.com/watch?v={VIDEO_ID}.

Chicago 17th (note-bibliography)

{Channel}. "{Title}." {Month} {Day}, {Year}. Video, {Duration}. https://www.youtube.com/watch?v={VIDEO_ID}.

IEEE

{Channel}, "{Title}," YouTube. [Online Video]. Available: https://www.youtube.com/watch?v={VIDEO_ID}. [Accessed: {Access Date}].

BibTeX

@misc{cite_key,
  author = {{Channel}},
  title = {{Title}},
  year = {Year},
  month = {Month},
  howpublished = {\url{https://www.youtube.com/watch?v=VIDEO_ID}},
  note = {[Video]. YouTube. Accessed: YYYY-MM-DD}
}

7. Wiki Markdown Output

Follows the same structure as the ArXiv wiki notes. Example:

---
cite_key: 3blue1brown2017neural
title: "But what is a neural network?"
channel: "3Blue1Brown"
upload_date: "2017-10-05"
video_id: "aircAruvnKk"
duration_sec: 1140
captured_at: "1776300000"
raw: "~/crossmem/raw/youtube/aircAruvnKk.wav"
chunks: 12
source_type: youtube
---

# But what is a neural network?

## Citations

### APA
...

## Chunks

### 00:00–01:32 — Chapter: Introduction

> [Transcript text, first 400 chars...]

**Slide OCR:** [if keyframe present]

**Keyframe:** `keyframes/aircAruvnKk_0042.png` — "A diagram showing..."

**Paraphrase:** ...

**Implication:** ...

**Source:** [00:00](https://youtu.be/aircAruvnKk?t=0)

8. Orchestration

Decision: Same binary, new module youtube.rs

The existing crossmem capture <url> dispatches on URL. Add host detection:

#![allow(unused)]
fn main() {
// main.rs capture dispatch
if url.contains("arxiv.org") {
    cite::cmd_capture(url).await
} else if url.contains("youtube.com") || url.contains("youtu.be") {
    youtube::cmd_capture(url).await
} else {
    // future: generic handler
}
}

Module structure

src/
  cite.rs          # existing arxiv pipeline (unchanged)
  youtube.rs       # new: YouTube capture + compile
  youtube/
    download.rs    # yt-dlp wrapper
    transcribe.rs  # whisper.cpp wrapper
    keyframe.rs    # ffmpeg scene-cut + chapter extraction
    ocr.rs         # Apple Vision / tesseract wrapper
    vlm.rs         # Ollama multimodal (Qwen2.5-VL) wrapper
    chunk.rs       # Segmentation + chunk assembly
    emit.rs        # Wiki markdown emission
  shared/
    ollama.rs      # Extract from cite.rs — shared Ollama client
    formats.rs     # Citation format builders (generalized)

Shared Ollama code: Factor compile_page_chunk and the HTTP client into shared/ollama.rs. Both cite.rs and youtube.rs call it. The prompt template differs (page text vs transcript chunk), but the HTTP plumbing is identical.

Two-stage flow (same as arxiv)

crossmem capture <youtube-url>
  → downloads audio + video + subs + info JSON
  → extracts metadata, generates cite_key
  → saves to ~/crossmem/raw/youtube/{video_id}/
  → prints cite_key for next step

crossmem compile <cite_key>
  → detects source_type (arxiv vs youtube) from meta JSON
  → runs transcription (whisper.cpp)
  → runs keyframe extraction (ffmpeg)
  → runs OCR + VLM caption per keyframe
  → runs Ollama compile per chunk (paraphrase + implication)
  → emits wiki markdown to ~/crossmem/wiki/

9. Dependency Install UX

Decision: Error with one-liner install instructions on first run

Auto-installing is tempting but violates principle of least surprise. Instead:

$ crossmem capture https://youtube.com/watch?v=abc123

ERROR: missing required dependencies for YouTube ingestion:
  ✗ yt-dlp       — brew install yt-dlp
  ✗ ffmpeg        — brew install ffmpeg
  ✓ whisper.cpp   — found at /opt/homebrew/bin/whisper-cpp

Install all missing:
  brew install yt-dlp ffmpeg

Then retry: crossmem capture https://youtube.com/watch?v=abc123

Check order: which yt-dlp && which ffmpeg && which whisper-cpp (or whisper depending on install method).

whisper.cpp model download: If binary exists but model is missing:

Model not found. Download large-v3-turbo (~1.6 GB):
  curl -L -o ~/.cache/whisper/ggml-large-v3-turbo.bin \
    https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

crossmem-ocr Swift helper: Build from source on first YouTube capture:

$ swift build -c release --package-path ./tools/crossmem-ocr

Or ship a pre-built binary in releases.


10. Cost Model (All Local)

Estimated wall-clock for M2 Mac mini (16 GB)

| Stage | 1h video | 30 min video | 3h video |
|-------|----------|--------------|----------|
| yt-dlp download (audio + video) | ~2 min | ~1 min | ~5 min |
| whisper.cpp transcription | ~6 min | ~3 min | ~18 min |
| ffmpeg keyframe extraction | ~1 min | ~30 sec | ~3 min |
| OCR per keyframe (~80 frames) | ~8 sec | ~4 sec | ~20 sec |
| VLM caption per keyframe | ~5 min | ~2.5 min | ~15 min |
| Ollama compile per chunk (~40 chunks) | ~8 min | ~4 min | ~24 min |
| Total | ~22 min | ~11 min | ~65 min |

Bottlenecks

  1. Ollama compile — sequential LLM calls, ~12 sec/chunk. Could batch with larger context window.
  2. VLM caption — sequential, ~4 sec/frame. GPU contention with Ollama if run concurrently.
  3. Whisper — fast on Metal, but locks GPU for duration.

Memory pressure

| Concurrent | Peak VRAM | Safe on 16 GB? |
|------------|-----------|----------------|
| Whisper alone | ~1.6 GB | Yes |
| Ollama (7B q4) alone | ~5 GB | Yes |
| Whisper + Ollama | ~6.6 GB | Yes |
| Qwen2.5-VL-7B + Ollama text | ~10 GB | Tight but OK |
| All three simultaneously | ~12 GB | Risky — run sequentially |

Strategy: Run stages sequentially. whisper → keyframes → OCR → VLM → compile. No concurrent GPU workloads.


11. Storage Layout

~/crossmem/
  raw/
    youtube/
      {video_id}/
        {video_id}.wav              # Audio (whisper input)
        {video_id}_video.mp4        # Video (keyframe source)
        {video_id}.info.json        # yt-dlp metadata
        {video_id}.en.vtt           # Human subs (if available)
        {video_id}.en.auto.vtt      # Auto subs (if available)
        {video_id}.meta.json        # crossmem metadata
        transcript.json             # Whisper output with timestamps
        keyframes/
          frame_0001.png            # Scene-cut keyframes
          frame_0002.png
          keyframe_times.json       # Timestamp → frame mapping
          ocr/
            frame_0001.txt          # OCR output per frame
          captions/
            frame_0001.txt          # VLM caption per frame
  wiki/
    {timestamp}_{cite_key}.md       # Final compiled wiki note

12. Phased Delivery

P1 — Download + Transcribe (MVP)

  • URL detection in main.rs capture dispatch
  • yt-dlp download wrapper (youtube/download.rs)
  • whisper.cpp transcription wrapper (youtube/transcribe.rs)
  • Basic chunk segmentation (chapters or 60s windows)
  • Ollama compile pass (reuse from cite.rs)
  • Wiki markdown emission (transcript-only, no visual)
  • Dependency check + error messages
  • Tests for metadata parsing, cite_key generation, chunk segmentation
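The 60 s fallback segmentation could be sketched like this. The `Segment` struct is a hypothetical stand-in for whisper's per-segment timestamps in `transcript.json`; the real types in `youtube/transcribe.rs` may differ:

```rust
/// One whisper transcript segment: start/end in seconds plus text.
/// (Hypothetical shape for this sketch.)
#[derive(Debug, Clone)]
struct Segment {
    start: f64,
    end: f64,
    text: String,
}

/// Group transcript segments into fixed windows (e.g. 60 s) when the
/// video has no chapters. A segment belongs to the window its start
/// time falls into, so window boundaries never split a segment's text.
fn segment_into_windows(segments: &[Segment], window_secs: f64) -> Vec<(f64, String)> {
    let mut chunks: Vec<(f64, String)> = Vec::new();
    for seg in segments {
        let window_start = (seg.start / window_secs).floor() * window_secs;
        let same_window =
            matches!(chunks.last(), Some((start, _)) if *start == window_start);
        if same_window {
            let (_, text) = chunks.last_mut().unwrap();
            text.push(' ');
            text.push_str(&seg.text);
        } else {
            chunks.push((window_start, seg.text.clone()));
        }
    }
    chunks
}

fn main() {
    let segs = vec![
        Segment { start: 0.0, end: 4.0, text: "intro".into() },
        Segment { start: 62.5, end: 70.0, text: "next topic".into() },
    ];
    for (start, text) in segment_into_windows(&segs, 60.0) {
        println!("[{start:>6.1}s] {text}");
    }
}
```

Assigning by segment start keeps quotes verbatim at the cost of slightly uneven window lengths, which matches the design principle that the LLM never touches original text.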

P2 — Keyframes + OCR

  • ffmpeg scene-cut extraction (youtube/keyframe.rs)
  • Chapter-aware keyframe selection
  • Apple Vision OCR helper (tools/crossmem-ocr/)
  • Tesseract fallback
  • OCR text merged into chunks
  • Tests for keyframe timing, OCR integration
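The scene-cut extraction could wrap ffmpeg's real `select='gt(scene,…)'` filter. In this sketch the 0.3 threshold, the output pattern, and the `keyframe_command` helper are assumptions, not the actual `youtube/keyframe.rs` code:

```rust
use std::process::Command;

/// Build (but do not run) the ffmpeg scene-cut extraction command.
/// `-vsync vfr` keeps only the frames the select filter passes through.
fn keyframe_command(video: &str, out_dir: &str, scene_threshold: f64) -> Command {
    let filter = format!("select='gt(scene,{scene_threshold})'");
    let pattern = format!("{out_dir}/frame_%04d.png");
    let mut cmd = Command::new("ffmpeg");
    cmd.args([
        "-i", video,
        "-vf", filter.as_str(),
        "-vsync", "vfr",
        pattern.as_str(),
    ]);
    cmd
}

fn main() {
    let cmd = keyframe_command("video.mp4", "keyframes", 0.3);
    // Print the argv we would run; actually spawning requires ffmpeg on PATH.
    println!("{cmd:?}");
}
```

Building the `Command` separately from spawning it keeps the invocation testable without ffmpeg installed.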

P3 — VLM Captions + Diarization

  • Ollama multimodal integration for keyframe captioning (youtube/vlm.rs)
  • Keyframe captions merged into chunks
  • Optional: pyannote speaker diarization
  • Tests for VLM response parsing

P4 — Polish + Chunk Emission

  • Human sub → whisper alignment
  • Playlist support (batch capture)
  • crossmem compile --source youtube flag
  • Storage cleanup (delete intermediate files after compile)
  • Integration tests with real short video
  • Performance benchmarks on M2/M4

13. Open Questions

  1. Subtitle language detection: Should we auto-detect the video language and pass --language to whisper, or always use en? For P1, assume English.

  2. Video retention: Keep the video file after keyframe extraction, or delete to save disk? A 1h 1080p video is ~1–2 GB. Suggest: keep for 7 days, then auto-prune.

  3. Ollama model for compile pass: Reuse llama3.2:3b (same as arxiv), or use a different model better suited for spoken-word paraphrasing? Suggest: same model, same env var.

  4. Playlist semantics: One wiki note per video, or one per playlist? Suggest: one per video, with a playlist index note linking them.

  5. Live stream handling: yt-dlp can download from start, but duration is unknown until stream ends. Suggest: P1 skips live, add in P2.

Why crossmem bridge does not use Chrome DevTools Protocol

The incident

On a developer workstation, a suspicious process (PID 73079) spawned from a Claude shell snapshot executed the following sequence:

  1. sleep 2400 (wait for Chrome to settle)
  2. Connect to ws://localhost:9222 (Chrome DevTools Protocol)
  3. Call Runtime.evaluate on Clerk.session.getToken() to steal the active session token
  4. POST the stolen token to an external API (teaching.monster)

Root cause: a dev tool had launched Chrome with --remote-debugging-port=9222. This single flag exposes every open tab, every origin, every cookie on a localhost WebSocket with zero authentication. Any local process—malicious or not—can connect and run arbitrary JavaScript in the context of any page the user has open. CDP is a debugger; it trusts the caller completely.

What crossmem bridge does differently

crossmem bridge is a Manifest V3 Chrome extension that communicates with local agents over a WebSocket on localhost:7600. The design differs from CDP in several concrete ways:

  • No --remote-debugging-port. The user’s Chrome launches normally. There is no app-wide debug backdoor to connect to.
  • User-installed extension with Chrome’s permission UI. The user explicitly grants the extension host permissions. CDP requires no user consent at all; whatever launched Chrome with the flag decides.
  • Whitelisted action set. The bridge accepts a fixed set of named actions: navigate, click, type, extract, screenshot, summarize, tab_info, wait, ping. There is no generic “evaluate arbitrary JS” verb. An attacker who connects to :7600 can click buttons and read extracted text, but cannot call Clerk.session.getToken() or Network.getAllCookies.
  • Real Chrome profile, no spoofing. The extension runs inside the user’s actual Chrome profile—no --user-data-dir to a throwaway directory, no Chrome for Testing with broken Keychain integration.
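A whitelisted action set lends itself to a closed enum: any verb outside the vocabulary fails at parse time and never reaches a handler. A minimal Rust sketch of the idea (the actual bridge is a Chrome extension in JavaScript; this wire-level parse is an illustration, not its code):

```rust
/// The fixed action vocabulary of the bridge. There is deliberately
/// no variant for evaluating arbitrary JavaScript.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Action {
    Navigate,
    Click,
    Type,
    Extract,
    Screenshot,
    Summarize,
    TabInfo,
    Wait,
    Ping,
}

impl Action {
    /// Parse an action name; unknown verbs are rejected outright.
    fn parse(name: &str) -> Option<Action> {
        match name {
            "navigate" => Some(Action::Navigate),
            "click" => Some(Action::Click),
            "type" => Some(Action::Type),
            "extract" => Some(Action::Extract),
            "screenshot" => Some(Action::Screenshot),
            "summarize" => Some(Action::Summarize),
            "tab_info" => Some(Action::TabInfo),
            "wait" => Some(Action::Wait),
            "ping" => Some(Action::Ping),
            // "evaluate", "get_cookies", ... have no representation at all.
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(Action::parse("click"), Some(Action::Click));
    assert_eq!(Action::parse("evaluate"), None); // the CDP attack verb does not exist
}
```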

Threat model comparison

| Attack surface | CDP (:9222) | crossmem bridge (:7600) |
|---|---|---|
| Arbitrary JS on any origin | Runtime.evaluate — yes | No eval verb — no |
| Dump all cookies | Network.getAllCookies — yes | No such action — no |
| Read/modify DOM | Full DOM access | Only via named actions (click, extract) |
| Authentication | None | None (same weakness — see below) |
| User consent | None; whoever launched Chrome decides | Chrome extension install prompt |

The PID 73079 attack required exactly two things: the Runtime.evaluate CDP primitive and outbound network access. The former has no equivalent in the crossmem bridge action vocabulary.

What this design does NOT protect against

Honesty matters more than marketing. crossmem bridge has real limitations:

  • localhost:7600 is unauthenticated, same as CDP on :9222. Any local process can connect. The attack surface is smaller (no eval, no cookie dump), but the network posture is identical.
  • chrome.scripting.executeScript is arbitrary JS under the hood. The bridge currently uses it to implement actions like extract and click. If a future action handler passes attacker-controlled input (selectors, payloads) into executeScript without sanitization, the constrained action set becomes a confused deputy.
  • Supply-chain attack on the extension itself. A malicious MV3 update pushed to the Chrome Web Store bypasses every architectural constraint. The extension IS the trust boundary.
  • Planned hardening (not yet implemented):
    • Per-request auth token (shared secret between agent and extension)
    • Unix domain socket instead of TCP (removes network-reachable surface)
    • Strict input validation on action parameters
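The planned per-request auth token would be checked best with a constant-time comparison, so the bridge cannot leak matching token prefixes through response timing. A sketch under the assumption of a shared secret delivered out of band (no such mechanism exists in the extension yet; `authorize` is hypothetical):

```rust
/// Constant-time byte comparison: examine every byte regardless of
/// where the first mismatch occurs, so elapsed time does not reveal
/// how long a matching prefix was. (Length is still observable.)
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Reject a bridge request unless it carries the shared secret.
fn authorize(request_token: &str, shared_secret: &str) -> bool {
    constant_time_eq(request_token.as_bytes(), shared_secret.as_bytes())
}

fn main() {
    let secret = "s3cr3t-from-config"; // placeholder; would be generated per session
    assert!(authorize("s3cr3t-from-config", secret));
    assert!(!authorize("wrong", secret));
}
```

In production one would use a vetted crate (e.g. `subtle`) rather than hand-rolling the comparison; the sketch just shows the shape of the check.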

Takeaway

The lesson from PID 73079 is not “use crossmem bridge instead of CDP.” It is: dev automation tooling should not default to opening an app-wide debug backdoor.

CDP is a debugger protocol. It was designed for DevTools, not for agent orchestration. When you expose it on localhost, you hand every local process, including ones you didn’t launch, full control over every tab in the browser.

crossmem bridge chose a constrained, consent-gated channel: a user-installed extension exposing a fixed action vocabulary over a local WebSocket. This is a design choice that reduces the blast radius of local-process compromise. It is not magic, and it is not complete. But it means PID 73079’s exact attack vector—connect, eval, exfiltrate—does not work.