## Pipeline Overview
crossmem’s citation pipeline transforms a URL into a structured, verifiable wiki note.
### Pipeline diagram

```mermaid
graph TD
    A[crossmem capture URL] --> B[Download PDF]
    B --> C[Fetch arXiv metadata]
    C --> D[Reconcile: CrossRef + OpenAlex]
    D --> E[Generate cite_key via DSL]
    E --> F["Save raw PDF + .meta.json"]
    G[crossmem compile cite_key] --> H[Load raw PDF + metadata]
    H --> I{Marker available?}
    I -->|Yes| J[Marker: paragraph chunks + bbox]
    I -->|No| K[pdftotext: page-level chunks]
    J --> L[Compute SHA-256 per chunk]
    K --> L
    L --> M[Ollama: paraphrase + implication per chunk]
    M --> N[Generate 5 citation formats]
    N --> O["Emit wiki markdown to ~/crossmem/wiki/"]
    P[crossmem verify] --> Q[Walk wiki files]
    Q --> R[Re-hash chunk text]
    R --> S{SHA-256 match?}
    S -->|Yes| T[OK]
    S -->|No| U[DRIFT detected]
    V[crossmem mcp serve] --> W[Load wiki entries]
    W --> X[crossmem_cite: lookup by key]
    W --> Y[crossmem_recall: search by query]
```
## Why capture and compile are separate

`capture` is lightweight and idempotent: it issues API calls to arXiv, CrossRef, and OpenAlex, downloads the PDF, and writes metadata. You can re-run it to refresh metadata without re-parsing. `compile` is heavyweight: it invokes Marker (or another PDF parser) and Ollama to produce chunk-level paraphrases and implications. Separating the two lets you swap the PDF parser (Marker → Nougat → GROBID) or change the LLM model without re-downloading anything. It also enables a practical workflow: batch-capture dozens of papers first, then compile them at leisure, or compile only the ones that turn out to be relevant.
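The two-phase design above can be sketched in a few lines. This is a simplified illustration, not crossmem's actual code: the file layout (`raw/<id>.meta.json`) and metadata fields are assumptions standing in for the real schema.

```python
import json
from pathlib import Path

RAW = Path("raw")  # stand-in for ~/crossmem/raw/

def capture(arxiv_id: str, title: str) -> Path:
    """Lightweight phase: write metadata next to the downloaded PDF.
    Re-running simply overwrites the .meta.json, so it is idempotent."""
    RAW.mkdir(exist_ok=True)
    meta = RAW / f"{arxiv_id}.meta.json"
    meta.write_text(json.dumps({"id": arxiv_id, "title": title}))
    return meta

def compile_note(arxiv_id: str) -> str:
    """Heavyweight phase: reads only local artifacts, never the network,
    so the parser or LLM can change without re-downloading anything."""
    meta = json.loads((RAW / f"{arxiv_id}.meta.json").read_text())
    return f"# {meta['title']}"  # real compile would parse, hash, paraphrase
```

Because `compile_note` touches only files that `capture` already wrote, the two phases can run days apart or be re-run independently.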
## Stage details
### Capture
- URL parsing — extracts arXiv ID from various URL formats (`/abs/`, `/pdf/`, bare ID)
- PDF download — fetches the PDF, computes its SHA-256, saves to `~/crossmem/raw/`
- Metadata fetch — queries the arXiv API for title, authors, year
- Metadata reconciliation — cross-checks against CrossRef (via DOI) and OpenAlex. Flags disagreements as warnings in frontmatter.
- Cite key generation — applies the configured pattern DSL to the reconciled metadata
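A minimal sketch of what a cite-key pattern DSL could look like. The placeholder names (`{author}`, `{year}`, `{word}`) and the stopword list are illustrative assumptions, not crossmem's actual DSL vocabulary:

```python
import re

STOPWORDS = {"a", "an", "the", "on", "of"}  # toy list for illustration

def cite_key(meta: dict, pattern: str = "{author}{year}{word}") -> str:
    """Apply a pattern to reconciled metadata to produce a stable key."""
    fields = {
        "author": meta["authors"][0].split()[-1].lower(),  # first author's surname
        "year": str(meta["year"]),
        # first significant word of the title
        "word": next(w.lower() for w in re.findall(r"[A-Za-z]+", meta["title"])
                     if w.lower() not in STOPWORDS),
    }
    return pattern.format(**fields)
```

Keeping the pattern declarative means the key scheme can change in config without touching the reconciliation code.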
### Compile
- PDF parsing — Marker (with MPS acceleration) produces paragraph-level blocks with bounding-box coordinates. Falls back to `pdftotext -layout` for page-level extraction.
- Chunk assembly — blocks are grouped into typed chunks (paragraph, heading, figure, table, equation) with unique IDs
- Provenance — each chunk gets page, section, bbox, SHA-256, and byte range
- LLM pass — Ollama generates paraphrase and implication for each chunk. The LLM never sees or modifies the original text.
- Citation generation — deterministic formatting into APA, MLA, Chicago, IEEE, BibTeX
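The per-chunk hashing step can be sketched as follows. The chunk dicts here are a simplified stand-in for the parser's real output; only the `text_sha256` field mirrors the provenance block described above:

```python
import hashlib

def chunk_provenance(chunks: list) -> list:
    """Attach an ID and a SHA-256 of the verbatim text to each chunk.
    The hash covers the original text only, never the LLM's paraphrase,
    so later verification can detect any drift in the quoted source."""
    out = []
    for i, c in enumerate(chunks):
        digest = hashlib.sha256(c["text"].encode("utf-8")).hexdigest()
        out.append({
            "id": f"chunk-{i:04d}",
            "page": c["page"],
            "text_sha256": digest,
            "text": c["text"],
        })
    return out
```

Hashing before the LLM pass is what makes the "LLM never modifies the original text" guarantee checkable rather than merely asserted.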
- Emission — final wiki markdown written to `~/crossmem/wiki/`
### Verify
Walks all wiki files, re-extracts verbatim text from `>` blockquote lines, re-computes SHA-256, and compares against the stored `text_sha256` in the provenance blocks. Reports any drift.
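The core of the check can be sketched like this. It assumes one blockquote per chunk and takes the stored hashes as a plain mapping; the real verifier parses them out of the provenance blocks in each wiki file:

```python
import hashlib

def verify_entry(markdown: str, stored: dict) -> list:
    """Re-hash verbatim blockquote text and compare against the stored
    text_sha256 values. Returns the IDs of chunks that have drifted."""
    drifts = []
    quotes = [line[2:] for line in markdown.splitlines() if line.startswith("> ")]
    for (chunk_id, expected), text in zip(stored.items(), quotes):
        actual = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if actual != expected:
            drifts.append(chunk_id)
    return drifts
```

Any edit to a quoted line, even a single character, changes the digest and surfaces as DRIFT.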
### MCP serve
Loads wiki entries into memory and exposes `crossmem_cite` (lookup by cite key with fuzzy matching) and `crossmem_recall` (full-text search with relevance ranking) over the stdio MCP transport.
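The fuzzy lookup behind `crossmem_cite` could be as simple as the following sketch. The `difflib`-based matching and the 0.6 cutoff are illustrative choices, not the tool's actual algorithm:

```python
import difflib

def cite_lookup(key: str, entries: dict, cutoff: float = 0.6):
    """Exact match first; otherwise fall back to the closest cite key
    above the similarity cutoff, or None if nothing is close enough."""
    if key in entries:
        return entries[key]
    matches = difflib.get_close_matches(key, entries, n=1, cutoff=cutoff)
    return entries[matches[0]] if matches else None
```

Fuzzy matching makes the tool forgiving of near-miss keys typed from memory, while exact hits still short-circuit without any scoring.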