Chunk-based Citation v2 Design

Status: Implemented (Phase 2 MVP shipped) Date: 2026-04-15


User requirement

How do we ensure citations are absolutely correct — foolproof, with no chance of error?

One-line answer: Verbatim text + bbox provenance is ground truth; LLM only touches paraphrase/implication, never quotes; metadata is cross-verified across ≥2 canonical sources.

Competitor survey

| Tool | What it nails | What it misses |
|---|---|---|
| Zotero + Better BibTeX | Stable cite_key via JS-ish pattern DSL; key regeneration rules; 80%+ academic mind-share | No chunk/page content; just a metadata container |
| Marker (datalab-to/marker) | PDF→markdown + polygon bbox per block; `--keep_chars` for char-level bboxes; JSON tree per page | Slower than pdftotext; needs CUDA/MPS |
| Nougat | Transformer-based; beats GROBID on formulas | VLM → hallucination risk on quote fidelity |
| GROBID | 68 fine-grained TEI labels; best on metadata + bibliography refs; 2–5 s/page, 90%+ accuracy | Weak on formulas, figures, modern layouts |
| PaperQA2 | Configurable chunk size; LLM re-rank + contextual summarization; grounded in-text citations | No bbox; chunk = N-char sliding window → page/fragment precision lost |
| Tensorlake RAG | Anchor tokens `<c>2.1</c>` inlined + bbox stored separately → auditable trail | Proprietary pipeline; design pattern is copyable |
| OpenAlex / CrossRef / Semantic Scholar | Each is a canonical metadata source | Each has gaps; must cross-reconcile |

The industry gold standard for “absolutely correct citation”:

  1. Parse once with bbox-aware extractor (Marker-class) → each block has {page, polygon, text}.
  2. Anchor tokens inlined at chunk build time (<c>p4§3.2</c>) so LLM can only emit citation IDs it saw in context.
  3. Resolve citation IDs → bbox + page at render time; users get deep-link to the exact PDF region.
  4. Metadata cross-check across OpenAlex + CrossRef + arXiv; flag inconsistencies instead of silently picking one.
  5. Quote is verbatim from the PDF text layer, stored with SHA-256 of the source bytes — any LLM-generated “quote” is rejected.

What Phase 1 got right / wrong

Right: pre-generated APA/MLA/Chicago/IEEE/BibTeX, deterministic cite_key, per-page original text preserved verbatim, paraphrase/implication separated from the quote.

Wrong / gap:

  • Metadata comes only from the arXiv API (no CrossRef/OpenAlex cross-check)
  • Quote preservation is page-level, not paragraph- or sentence-level
  • No bbox — can’t deep-link into a PDF region
  • No hash-based verifiability
  • cite_key uses a primitive pattern, not a Better-BibTeX-style DSL
  • No handling of the preprint→published DOI mapping

Phase 2 architecture

2A. Metadata layer (the cite_key + bib trust root)

Pipeline:

arxiv_id → [arxiv API]  ┐
        → [CrossRef]    ├─→ reconcile → canonical metadata
        → [OpenAlex]    ┘                   │
                                            ├─→ cite_key (Better-BibTeX-style pattern, configurable)
                                            ├─→ 5 formats (APA/MLA/Chicago/IEEE/BibTeX)
                                            └─→ DOI + published-version DOI (if preprint)

Rules:

  • ≥2 sources must agree on title + first-author + year. Disagreement → emit meta.warnings in frontmatter.
  • cite_key pattern DSL (ported from Better BibTeX): [auth:lower][year][shorttitle:1:nopunct], configurable via ~/.crossmem/config.toml.
  • Track preprint↔published mapping in meta.doi_preprint and meta.doi_published.
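The reconcile step and default cite_key pattern can be sketched as follows. This is a minimal Python sketch under assumptions: the record dicts, field names, and a hard-coded rendering of the default pattern stand in for the real DSL, which would be parsed from `~/.crossmem/config.toml`:

```python
from collections import Counter


def reconcile(records: list[dict]) -> tuple[dict, list[str]]:
    """Majority-vote reconciliation across metadata sources.

    Each record is one source's view, e.g. {"source": "crossref",
    "title": ..., "first_author": ..., "year": ...}. Any field where
    fewer than two sources agree produces a meta.warnings entry
    instead of silently picking one value.
    """
    canonical, warnings = {}, []
    for field in ("title", "first_author", "year"):
        votes = Counter(r[field] for r in records if r.get(field) is not None)
        if not votes:
            warnings.append(f"{field}: missing from all sources")
            continue
        value, count = votes.most_common(1)[0]
        canonical[field] = value
        if count < 2:
            warnings.append(f"{field}: no two sources agree ({dict(votes)})")
    return canonical, warnings


def cite_key(meta: dict) -> str:
    """Hard-coded rendering of [auth:lower][year][shorttitle:1:nopunct]."""
    first_word = "".join(c for c in meta["title"].split()[0] if c.isalnum())
    return f"{meta['first_author'].lower()}{meta['year']}{first_word.lower()}"
```

For the running example, three agreeing sources yield `vaswani2017attention` with an empty warnings list; a lone disagreeing source still reconciles but is recorded in the frontmatter.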

2B. PDF parsing layer (the chunk trust root)

Tiered strategy by document type + quality tier:

| Tier | Parser | Use when | Bbox? | Speed |
|---|---|---|---|---|
| 0 | `pdftotext -layout` | Fallback / pure text | No | instant |
| 1 | Marker (Mac MPS) | Default for arXiv | Yes, polygon/block | 1–3 s/page |
| 2 | GROBID (JVM, local) | Bib references + structured metadata | Yes, TEI | 2–5 s/page |
| 3 | Nougat (MPS) | Formula-heavy pages | Partial | 5–15 s/page |

Phase 2 default: Marker for the body + GROBID for the bibliography; run both and merge into a unified chunk tree.
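The routing logic above amounts to a small dispatch table. A sketch in Python, where the `doc` flags are hypothetical outputs of a quick pre-scan, not fields any of the listed tools emit:

```python
def pick_parsers(doc: dict) -> list[str]:
    """Route a document to parser tiers (Phase 2 default: Marker for the
    body, GROBID for the bibliography, Nougat opt-in for formula-heavy
    pages, pdftotext as the no-text-layer fallback)."""
    parsers = []
    if doc.get("has_text_layer", True):
        parsers.append("marker")     # tier 1: body blocks with polygon bbox
    else:
        parsers.append("pdftotext")  # tier 0 fallback, no bbox
    if doc.get("has_bibliography", True):
        parsers.append("grobid")     # tier 2: TEI bibliography refs
    if doc.get("formula_heavy", False):
        parsers.append("nougat")     # tier 3, opt-in
    return parsers
```

Running multiple parsers per document is deliberate: each output feeds a different part of the merged chunk tree rather than competing for the same blocks.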

2C. Chunk schema v2 (bbox + hash provenance)

---
cite_key: vaswani2017attention
meta:
  sources: [arxiv, crossref, openalex]
  reconciled: true
  warnings: []
  pdf_sha256: 9a8f...
...
---

## p.4 §3.2 Scaled Dot-Product Attention

<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention"...

provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: 5f3e1c...
  byte_range: [18342, 19104]

**Paraphrase:** …
**Implication:** …

  • text_sha256 = SHA-256 of the verbatim extracted text. Re-running the extractor must reproduce it; otherwise the chunk is flagged stale.
  • bbox + page = deep-link target: crossmem://pdf/{cite_key}#p=4&bbox=72,340,523,412.
  • byte_range = PDF content-stream offset (from Marker); cheapest way to re-verify without re-extraction.
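The deep-link rendering and staleness check over these provenance fields can be sketched directly. A minimal Python sketch; the coordinate truncation and helper names are assumptions, not a spec:

```python
import hashlib


def deep_link(cite_key: str, page: int, bbox: list[float]) -> str:
    """Render the crossmem:// deep-link target from provenance fields.

    Coordinates are truncated to integers for the URL fragment.
    """
    x0, y0, x1, y1 = (int(v) for v in bbox)
    return f"crossmem://pdf/{cite_key}#p={page}&bbox={x0},{y0},{x1},{y1}"


def is_stale(stored_sha256: str, re_extracted_text: str) -> bool:
    """A chunk is stale when re-extraction no longer reproduces text_sha256."""
    digest = hashlib.sha256(re_extracted_text.encode("utf-8")).hexdigest()
    return digest != stored_sha256
```

Applied to the example chunk, `deep_link("vaswani2017attention", 4, [72.0, 340.5, 523.8, 412.1])` yields the URL shown above.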

2D. LLM contract (what model is / isn’t allowed to touch)

| Field | Who writes | Verifiable? |
|---|---|---|
| title, authors, year, doi, arxiv_id | Metadata reconciler | Cross-source check |
| cite_key, 5 citation strings | Deterministic generator | Pure function, unit-tested |
| original (the quote) | PDF extractor | SHA-256 + byte_range |
| paraphrase, implication | LLM | Never trusted for provenance |
| figure.caption | PDF extractor | bbox + OCR of caption only |
| figure.implication | LLM | Same rule: advisory text only |

The pipeline never asks the LLM to produce a quote. If a future feature wants “the key sentence on this page”, the LLM picks a sentence index from a numbered list of extracted sentences, never emits the sentence text.
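The sentence-index contract can be sketched in a few lines. A Python sketch of the hypothetical future feature, with invented helper names:

```python
def sentence_menu(sentences: list[str]) -> str:
    """Build the numbered list shown to the LLM; the model may only
    answer with an index, never with sentence text."""
    return "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))


def resolve_choice(sentences: list[str], llm_reply: str) -> str:
    """Map the model's index back to the verbatim extracted sentence.

    Anything that is not a valid index is rejected outright, so the
    model physically cannot inject a fabricated quote.
    """
    idx = int(llm_reply.strip().strip("[]"))
    if not 0 <= idx < len(sentences):
        raise ValueError(f"index {idx} out of range")
    return sentences[idx]
```

The returned text always comes from the extractor's sentence list, so the SHA-256 gate on quotes still holds even for this feature.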

2E. Paragraph- and figure-level chunking

  • Paragraph splitter: Marker’s block tree → paragraph-typed blocks become chunks (not pages).
  • Figure chunks: Marker figure blocks → crop image to raw/figs/{cite_key}_fig{N}.png, caption extracted separately, implication runs on caption-only.
  • Table chunks: Marker table block → markdown-table format, implication on markdown text.
  • Equation chunks: Nougat output in LaTeX, stored as $$…$$, implication on LaTeX source.
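The paragraph splitter is a walk over Marker's JSON output. A minimal Python sketch, assuming a simplified tree shape (`page`, `blocks`, `block_type`, `text`, `polygon` — field names are illustrative, not Marker's exact schema):

```python
def paragraph_chunks(pages: list[dict]) -> list[dict]:
    """Emit one chunk per paragraph-typed block, not per page.

    `pages` is a simplified Marker-style tree: one dict per page, each
    carrying a list of typed blocks with text and a polygon bbox.
    """
    chunks = []
    for page in pages:
        for n, block in enumerate(page.get("blocks", []), start=1):
            if block.get("block_type") != "Text":
                continue  # figures/tables/equations get their own chunkers
            chunks.append({
                "id": f"p{page['page']}c{n}",
                "page": page["page"],
                "text": block["text"],
                "bbox": block.get("polygon"),
            })
    return chunks
```

Figure, table, and equation blocks fall through here and are handled by their dedicated chunkers, per the list above.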

2F. Idempotence + re-compile

  • Re-running capture is idempotent on arxiv_id: re-downloads only if pdf_sha256 differs.
  • Re-running compile re-does LLM pass only for chunks whose text_sha256 changed.
  • crossmem verify <cite_key> walks the wiki, re-extracts, re-hashes; reports any mismatches.
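The verify walk reduces to re-extract, re-hash, compare. A Python sketch of the core loop, where `re_extract` stands in for the byte_range/bbox re-extraction step (the real command is implemented in the Rust driver):

```python
import hashlib


def verify_wiki(chunks: list[dict], re_extract) -> list[str]:
    """Sketch of `crossmem verify`: re-extract each chunk's source text,
    re-hash it, and report the ids of any chunks whose stored
    text_sha256 no longer matches."""
    mismatches = []
    for chunk in chunks:
        fresh = re_extract(chunk)
        digest = hashlib.sha256(fresh.encode("utf-8")).hexdigest()
        if digest != chunk["text_sha256"]:
            mismatches.append(chunk["id"])
    return mismatches
```

An empty result means the wiki still matches its PDF source byte-for-byte; any returned id pinpoints exactly which chunk drifted.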

Implementation order

  1. Metadata reconciler (arxiv + crossref + openalex merge, warnings on disagreement)
  2. cite_key pattern DSL (Better-BibTeX-style, unit-tested)
  3. Marker integration via uvx marker-pdf CLI (Python sidecar; Rust drives via subprocess + JSON)
  4. Chunk schema v2 writer (paragraph/figure/table/equation chunks with bbox + hash)
  5. GROBID on-demand for bibliography references
  6. crossmem verify command
  7. Nougat sidecar for math-heavy pages (opt-in)

What this buys the user

Writing a paper citing Vaswani 2017 p.4 §3.2:

Before (Phase 1): the user opens the wiki, sees the page-4 summary paragraph, and pastes the BibTeX entry. They may still need to open the PDF to find the exact sentence.

After (Phase 2):

  • Wiki shows §3.2 as a dedicated chunk with verbatim quote.
  • Clicking the provenance block opens the PDF at page 4 with the bbox highlighted.
  • Cite key vaswani2017attention is guaranteed stable across the arXiv preprint → published NeurIPS version.
  • crossmem verify run weekly confirms no wiki has silently drifted from its PDF source.
