## Pipeline Overview
crossmem’s citation pipeline transforms a URL into a structured, verifiable wiki note.
### Pipeline diagram

```mermaid
graph TD
    A[crossmem capture URL] --> B[Download PDF]
    B --> C[Fetch arXiv metadata]
    C --> D[Reconcile: CrossRef + OpenAlex]
    D --> E[Generate cite_key via DSL]
    E --> F["Save raw PDF + .meta.json"]
    G[crossmem compile cite_key] --> H[Load raw PDF + metadata]
    H --> I{Marker available?}
    I -->|Yes| J[Marker: paragraph chunks + bbox]
    I -->|No| K[pdftotext: page-level chunks]
    J --> L[Compute SHA-256 per chunk]
    K --> L
    L --> M[Ollama: paraphrase + implication per chunk]
    M --> N[Generate 5 citation formats]
    N --> O["Emit wiki markdown to ~/crossmem/wiki/"]
    P[crossmem verify] --> Q[Walk wiki files]
    Q --> R[Re-hash chunk text]
    R --> S{SHA-256 match?}
    S -->|Yes| T[OK]
    S -->|No| U[DRIFT detected]
    V[crossmem mcp serve] --> W[Load wiki entries]
    W --> X[crossmem_cite: lookup by key]
    W --> Y[crossmem_recall: search by query]
```
## Why capture and compile are separate

`capture` is lightweight and idempotent: it issues API calls to arXiv, CrossRef, and OpenAlex, downloads the PDF, and writes metadata. You can re-run it to refresh metadata without re-parsing. `compile` is heavyweight: it invokes Marker (or another PDF parser) and Ollama to produce chunk-level paraphrases and implications. Separating the two lets you swap the PDF parser (Marker → Nougat → GROBID) or change the LLM model without re-downloading anything. It also enables a practical workflow: batch-capture dozens of papers first, then compile them at leisure, or compile only the ones that turn out to be relevant.
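The two-phase design above can be sketched in a few lines. This is a simplified illustration, not crossmem's actual code: the file layout (`raw/<id>.meta.json`) and metadata fields are assumptions standing in for the real schema.

```python
import json
from pathlib import Path

RAW = Path("raw")  # stand-in for ~/crossmem/raw/

def capture(arxiv_id: str, title: str) -> Path:
    """Lightweight phase: write metadata next to the downloaded PDF.
    Re-running simply overwrites the .meta.json, so it is idempotent."""
    RAW.mkdir(exist_ok=True)
    meta = RAW / f"{arxiv_id}.meta.json"
    meta.write_text(json.dumps({"id": arxiv_id, "title": title}))
    return meta

def compile_note(arxiv_id: str) -> str:
    """Heavyweight phase: reads only local artifacts, never the network,
    so the parser or LLM can change without re-downloading anything."""
    meta = json.loads((RAW / f"{arxiv_id}.meta.json").read_text())
    return f"# {meta['title']}"  # real compile would parse, hash, paraphrase
```

Because `compile_note` touches only files that `capture` already wrote, the two phases can run days apart or be re-run independently.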
## Stage details
### Capture
- URL parsing — extracts arXiv ID from various URL formats (`/abs/`, `/pdf/`, bare ID)
- PDF download — fetches the PDF, computes its SHA-256, saves to `~/crossmem/raw/`
- Metadata fetch — queries the arXiv API for title, authors, year
- Metadata reconciliation — cross-checks against CrossRef (via DOI) and OpenAlex. Flags disagreements as warnings in frontmatter.
- Cite key generation — applies the configured pattern DSL to the reconciled metadata
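A minimal sketch of what a cite-key pattern DSL could look like. The placeholder names (`{author}`, `{year}`, `{word}`) and the stopword list are illustrative assumptions, not crossmem's actual DSL vocabulary:

```python
import re

STOPWORDS = {"a", "an", "the", "on", "of"}  # toy list for illustration

def cite_key(meta: dict, pattern: str = "{author}{year}{word}") -> str:
    """Apply a pattern to reconciled metadata to produce a stable key."""
    fields = {
        "author": meta["authors"][0].split()[-1].lower(),  # first author's surname
        "year": str(meta["year"]),
        # first significant word of the title
        "word": next(w.lower() for w in re.findall(r"[A-Za-z]+", meta["title"])
                     if w.lower() not in STOPWORDS),
    }
    return pattern.format(**fields)
```

Keeping the pattern declarative means the key scheme can change in config without touching the reconciliation code.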
### Compile
- PDF parsing — Marker (with MPS acceleration) produces paragraph-level blocks with bounding-box coordinates. Falls back to `pdftotext -layout` for page-level extraction.
- Chunk assembly — blocks are grouped into typed chunks (paragraph, heading, figure, table, equation) with unique IDs
- Provenance — each chunk gets page, section, bbox, SHA-256, and byte range
- LLM pass — Ollama generates paraphrase and implication for each chunk. The LLM never sees or modifies the original text.
- Citation generation — deterministic formatting into APA, MLA, Chicago, IEEE, BibTeX
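The per-chunk hashing step can be sketched as follows. The chunk dicts here are a simplified stand-in for the parser's real output; only the `text_sha256` field mirrors the provenance block described above:

```python
import hashlib

def chunk_provenance(chunks: list) -> list:
    """Attach an ID and a SHA-256 of the verbatim text to each chunk.
    The hash covers the original text only, never the LLM's paraphrase,
    so later verification can detect any drift in the quoted source."""
    out = []
    for i, c in enumerate(chunks):
        digest = hashlib.sha256(c["text"].encode("utf-8")).hexdigest()
        out.append({
            "id": f"chunk-{i:04d}",
            "page": c["page"],
            "text_sha256": digest,
            "text": c["text"],
        })
    return out
```

Hashing before the LLM pass is what makes the "LLM never modifies the original text" guarantee checkable rather than merely asserted.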
- Emission — final wiki markdown written to `~/crossmem/wiki/`
### Verify
Walks all wiki files, re-extracts verbatim text from `>` blockquote lines, re-computes SHA-256, and compares against the stored `text_sha256` in the provenance blocks. Reports any drift.
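The core of the check can be sketched like this. It assumes one blockquote per chunk and takes the stored hashes as a plain mapping; the real verifier parses them out of the provenance blocks in each wiki file:

```python
import hashlib

def verify_entry(markdown: str, stored: dict) -> list:
    """Re-hash verbatim blockquote text and compare against the stored
    text_sha256 values. Returns the IDs of chunks that have drifted."""
    drifts = []
    quotes = [line[2:] for line in markdown.splitlines() if line.startswith("> ")]
    for (chunk_id, expected), text in zip(stored.items(), quotes):
        actual = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if actual != expected:
            drifts.append(chunk_id)
    return drifts
```

Any edit to a quoted line, even a single character, changes the digest and surfaces as DRIFT.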
### MCP serve
Loads wiki entries into memory and exposes `crossmem_cite` (lookup by cite key with fuzzy matching) and `crossmem_recall` (full-text search with relevance ranking) over the stdio MCP transport.
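The fuzzy lookup behind `crossmem_cite` could be as simple as the following sketch. The `difflib`-based matching and the 0.6 cutoff are illustrative choices, not the tool's actual algorithm:

```python
import difflib

def cite_lookup(key: str, entries: dict, cutoff: float = 0.6):
    """Exact match first; otherwise fall back to the closest cite key
    above the similarity cutoff, or None if nothing is close enough."""
    if key in entries:
        return entries[key]
    matches = difflib.get_close_matches(key, entries, n=1, cutoff=cutoff)
    return entries[matches[0]] if matches else None
```

Fuzzy matching makes the tool forgiving of near-miss keys typed from memory, while exact hits still short-circuit without any scoring.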