Chunk-based Citation v2 Design
Status: Implemented (Phase 2 MVP shipped)
Date: 2026-04-15
User requirement
How do we ensure citations are absolutely correct — foolproof, with no room for error?
One-line answer: Verbatim text + bbox provenance is ground truth; LLM only touches paraphrase/implication, never quotes; metadata is cross-verified across ≥2 canonical sources.
Competitor survey
| Tool | What it nails | What it misses |
|---|---|---|
| Zotero + Better BibTeX | Stable cite_key via JS-ish pattern DSL; key regeneration rules; 80%+ academic mind-share | No chunk/page content; just metadata container |
| Marker (datalab-to/marker) | PDF→markdown + polygon bbox per block, --keep_chars for char-level bboxes, JSON tree-per-page | Slower than pdftotext; needs CUDA/MPS |
| Nougat | Transformer-based; beats GROBID on formulas | VLM → hallucination risk on quote fidelity |
| GROBID | 68 fine-grained TEI labels; best on metadata + bibliography refs; 2–5s/page, 90%+ accuracy | Weak on formulas, figures, modern layouts |
| PaperQA2 | Chunk-size configurable; LLM re-rank + contextual summarization; grounded in-text citations | No bbox, chunk = N-char sliding window → page/fragment precision lost |
| Tensorlake RAG | Anchor tokens <c>2.1</c> inlined + bbox stored separately → auditable trail | Proprietary pipeline; design pattern is copyable |
| OpenAlex / CrossRef / Semantic Scholar | Each is a canonical metadata source | Each has gaps; must cross-reconcile |
The industry gold standard for “absolutely correct citation”:
- Parse once with a bbox-aware extractor (Marker-class) → each block carries `{page, polygon, text}`.
- Anchor tokens inlined at chunk build time (`<c>p4§3.2</c>`) so the LLM can only emit citation IDs it saw in context.
- Resolve citation IDs → bbox + page at render time; users get a deep link to the exact PDF region.
- Metadata cross-check across OpenAlex + CrossRef + arXiv; flag inconsistencies instead of silently picking one.
- Quote is verbatim from the PDF text layer, stored with SHA-256 of the source bytes — any LLM-generated “quote” is rejected.
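The anchor-token rule above can be enforced mechanically: collect the IDs actually inlined into the model's context, then reject any cited ID that was never there. A minimal sketch, with illustrative function names and example strings (not the shipped code):

```python
import re

# Anchor tokens look like <c>ID</c>; the LLM may only cite IDs it saw.
ANCHOR_RE = re.compile(r"<c>(.*?)</c>")

def extract_anchor_ids(context: str) -> set[str]:
    """Collect the citation IDs that were inlined into the model's context."""
    return set(ANCHOR_RE.findall(context))

def fabricated_citations(llm_output: str, context: str) -> list[str]:
    """Return cited IDs that never appeared in the context (to be rejected)."""
    seen = extract_anchor_ids(context)
    return [cid for cid in ANCHOR_RE.findall(llm_output) if cid not in seen]

context = "Attention is computed as <c>p4§3.2</c> describes."
output = "The mechanism <c>p4§3.2</c> scales dot products; see <c>p9§6</c>."
# "p9§6" was never inlined into the context, so it is flagged as fabricated.
```

Because validation is a pure string check against the context the model actually received, it needs no trust in the model at all.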
What Phase 1 got right / wrong
Right: pre-gen APA/MLA/Chicago/IEEE/BibTeX, deterministic cite_key, per-page original text preserved verbatim, paraphrase/implication separated from quote.
Wrong / gap:
- Metadata only from arXiv API (no CrossRef/OpenAlex cross-check)
- Quote preservation is page-level, not paragraph/sentence
- No bbox — can’t deep-link into PDF region
- No hash-based verifiability
- cite_key = primitive pattern vs Better BibTeX DSL
- No handling of preprint→published DOI mapping
Phase 2 architecture
2A. Metadata layer (the cite_key + bib trust root)
Pipeline:
arxiv_id → [arxiv API] ┐
→ [CrossRef] ├─→ reconcile → canonical metadata
→ [OpenAlex] ┘ │
├─→ cite_key (Better-BibTeX-style pattern, configurable)
├─→ 5 formats (APA/MLA/Chicago/IEEE/BibTeX)
└─→ DOI + published-version DOI (if preprint)
Rules:
- ≥2 sources must agree on title + first-author + year. Disagreement → emit `meta.warnings` in frontmatter.
- cite_key pattern DSL (ported from Better BibTeX): `[auth:lower][year][shorttitle:1:nopunct]`, configurable via `~/.crossmem/config.toml`.
- Track preprint↔published mapping in `meta.doi_preprint` and `meta.doi_published`.
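A minimal sketch of the two rules above. The field names, normalisation, and the heavily simplified pattern evaluator are illustrative assumptions, not the shipped reconciler or the full Better BibTeX DSL:

```python
import re
from collections import Counter

def reconcile(records: list[dict]) -> tuple[dict, list[str]]:
    """≥2 sources must agree per field; otherwise record a warning."""
    canonical, warnings = {}, []
    for field in ("title", "first_author", "year"):
        votes = Counter(str(r[field]).strip().lower()
                        for r in records if field in r)
        value, count = votes.most_common(1)[0]
        if count < 2:
            warnings.append(f"sources disagree on {field}")
        canonical[field] = value
    return canonical, warnings

def cite_key(meta: dict) -> str:
    """Toy evaluator for [auth:lower][year][shorttitle:1:nopunct]."""
    auth = re.sub(r"\W", "", meta["first_author"]).lower()
    year = str(meta["year"])
    short = re.sub(r"\W", "", meta["title"].split()[0]).lower()
    return f"{auth}{year}{short}"

arxiv    = {"title": "Attention Is All You Need", "first_author": "Vaswani", "year": 2017}
crossref = {"title": "Attention Is All You Need", "first_author": "Vaswani", "year": 2017}
openalex = {"title": "Attention is all you need", "first_author": "Vaswani", "year": 2017}
meta, warnings = reconcile([arxiv, crossref, openalex])
# cite_key(meta) → "vaswani2017attention"
```

Normalising case before voting means OpenAlex's lowercase title still counts as agreement; a raw string compare would have produced a spurious warning.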
2B. PDF parsing layer (the chunk trust root)
Tiered strategy by document type + quality tier:
| Tier | Parser | Use when | Bbox? | Speed |
|---|---|---|---|---|
| 0 | pdftotext -layout | Fallback / pure text | No | instant |
| 1 | Marker (Mac MPS) | Default for arxiv | Yes, polygon/block | 1–3 s/page |
| 2 | GROBID (JVM, local) | Bib-references + structured metadata | Yes, TEI | 2–5 s/page |
| 3 | Nougat (MPS) | Formula-heavy pages | Partial | 5–15 s/page |
Phase 2 default: Marker for body + GROBID for bibliography, both run, merge into unified chunk tree.
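The tier table can be read as a simple dispatch. A sketch under stated assumptions: the `math_density` heuristic, its threshold, and the GPU check are illustrative, not the real selection logic:

```python
# Hypothetical parser dispatch mirroring the tier table above.
def pick_parser(doc_kind: str, math_density: float, has_accelerator: bool) -> str:
    if not has_accelerator:
        return "pdftotext"   # Tier 0: instant fallback, no bbox
    if math_density > 0.3:
        return "nougat"      # Tier 3: formula-heavy pages (opt-in)
    if doc_kind == "bibliography":
        return "grobid"      # Tier 2: references + TEI metadata
    return "marker"          # Tier 1: default for arXiv bodies
```

In the Phase 2 default, this choice is made per region rather than per document: Marker handles the body while GROBID handles the bibliography, and the outputs merge into one chunk tree.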
2C. Chunk schema v2 (bbox + hash provenance)
```markdown
---
cite_key: vaswani2017attention
meta:
  sources: [arxiv, crossref, openalex]
  reconciled: true
  warnings: []
  pdf_sha256: 9a8f...
  ...
---

## p.4 §3.2 Scaled Dot-Product Attention
<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention"...

provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: 5f3e1c...
  byte_range: [18342, 19104]

**Paraphrase:** …
**Implication:** …
```
- `text_sha256` = SHA-256 of the verbatim extracted text. Re-running the extractor must reproduce it, else the chunk is flagged stale.
- `bbox` + `page` = deep-link target: `crossmem://pdf/{cite_key}#p=4&bbox=72,340,523,412`.
- `byte_range` = PDF content-stream offset (from Marker); the cheapest way to re-verify without re-extraction.
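The staleness check and deep-link construction are both pure functions of the chunk's provenance fields. A minimal sketch, assuming the `crossmem://` URI format from the schema; function names are illustrative:

```python
import hashlib

def text_hash(verbatim: str) -> str:
    """SHA-256 of the verbatim extracted text, as stored in text_sha256."""
    return hashlib.sha256(verbatim.encode("utf-8")).hexdigest()

def is_stale(chunk: dict, re_extracted_text: str) -> bool:
    """A chunk is stale if re-extraction no longer reproduces its hash."""
    return chunk["text_sha256"] != text_hash(re_extracted_text)

def deep_link(cite_key: str, chunk: dict) -> str:
    """Render the crossmem:// deep link from page + bbox provenance."""
    coords = ",".join(str(int(v)) for v in chunk["bbox"])
    return f"crossmem://pdf/{cite_key}#p={chunk['page']}&bbox={coords}"

quote = 'We call our particular attention "Scaled Dot-Product Attention"...'
chunk = {
    "page": 4,
    "bbox": [72.0, 340.5, 523.8, 412.1],
    "text_sha256": text_hash(quote),
}
# deep_link("vaswani2017attention", chunk)
# → "crossmem://pdf/vaswani2017attention#p=4&bbox=72,340,523,412"
```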
2D. LLM contract (what model is / isn’t allowed to touch)
| Field | Who writes | Verifiable? |
|---|---|---|
title, authors, year, doi, arxiv_id | Metadata reconciler | Cross-source check |
cite_key, 5 citation strings | Deterministic generator | Pure function, unit-tested |
original (the quote) | PDF extractor | SHA-256 + byte_range |
paraphrase, implication | LLM | Never trusted for provenance |
figure.caption | PDF extractor | bbox + OCR-of-caption-only |
figure.implication | LLM | Same rule: advisory text only |
The pipeline never asks the LLM to produce a quote. If a future feature wants “the key sentence on this page”, the LLM picks a sentence index from a numbered list of extracted sentences, never emits the sentence text.
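The sentence-index contract can be sketched as follows; the prompt wording and function names are illustrative assumptions. The key property: the returned quote is always the extractor's verbatim string, looked up by index, never text the model produced:

```python
def build_prompt(sentences: list[str]) -> str:
    """Show the model a numbered list; ask for an index, never text."""
    listing = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))
    return f"Pick the key sentence. Reply with the index only.\n{listing}"

def resolve_choice(sentences: list[str], llm_reply: str) -> str:
    """Map the model's reply back to the extractor's verbatim sentence."""
    idx = int(llm_reply.strip())        # non-numeric replies raise here
    if not 0 <= idx < len(sentences):
        raise ValueError(f"index {idx} out of range")
    return sentences[idx]               # verbatim, straight from the extractor

sents = ["Attention maps queries to outputs.", "We scale by 1/sqrt(d_k)."]
# resolve_choice(sents, "1") returns the verbatim second sentence;
# an out-of-range or non-numeric reply is rejected outright.
```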
2E. Paragraph- and figure-level chunking
- Paragraph splitter: Marker’s block tree → paragraph-typed blocks become chunks (not pages).
- Figure chunks: Marker `figure` blocks → crop image to `raw/figs/{cite_key}_fig{N}.png`; caption extracted separately; implication runs on the caption only.
- Table chunks: Marker `table` blocks → markdown-table format; implication runs on the markdown text.
- Equation chunks: Nougat output in LaTeX, stored as `$$…$$`; implication runs on the LaTeX source.
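The block-type-to-chunk mapping above can be sketched as a dispatch over the parser's block tree. The flat block dicts here are a simplified stand-in for Marker's actual JSON structure, an assumption for illustration:

```python
def blocks_to_chunks(blocks: list[dict], cite_key: str) -> list[dict]:
    """Map parser block types to chunk kinds per the rules above."""
    chunks, fig_n = [], 0
    for b in blocks:
        if b["type"] == "paragraph":
            chunks.append({"kind": "paragraph", "text": b["text"]})
        elif b["type"] == "figure":
            fig_n += 1
            chunks.append({
                "kind": "figure",
                "image": f"raw/figs/{cite_key}_fig{fig_n}.png",
                "caption": b.get("caption", ""),  # implication runs on caption only
            })
        elif b["type"] == "table":
            chunks.append({"kind": "table", "markdown": b["text"]})
        elif b["type"] == "equation":
            chunks.append({"kind": "equation", "latex": "$$" + b["text"] + "$$"})
    return chunks

blocks = [
    {"type": "paragraph", "text": "Intro paragraph."},
    {"type": "figure", "caption": "Model architecture."},
    {"type": "equation", "text": r"\mathrm{softmax}(QK^T/\sqrt{d_k})V"},
]
out = blocks_to_chunks(blocks, "vaswani2017attention")
```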
2F. Idempotence + re-compile
- Re-running `capture` is idempotent on `arxiv_id`: re-downloads only if `pdf_sha256` differs.
- Re-running `compile` re-runs the LLM pass only for chunks whose `text_sha256` changed.
- `crossmem verify <cite_key>` walks the wiki, re-extracts, re-hashes, and reports any mismatches.
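Both idempotence checks reduce to hash comparisons. A sketch with illustrative names, assuming `old`/`new` map chunk id → `text_sha256`:

```python
import hashlib

def needs_redownload(stored_pdf_sha256: str, fetched_bytes: bytes) -> bool:
    """capture re-downloads only when the PDF bytes actually changed."""
    return hashlib.sha256(fetched_bytes).hexdigest() != stored_pdf_sha256

def chunks_to_recompile(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """compile re-runs the LLM pass only for changed or brand-new chunks."""
    return [cid for cid, h in new.items() if old.get(cid) != h]

old = {"p4s32c1": "5f3e1c", "p5s40c1": "aa11bb"}
new = {"p4s32c1": "5f3e1c", "p5s40c1": "deadbe", "p6s41c1": "c0ffee"}
# Only p5s40c1 (changed hash) and p6s41c1 (new chunk) get a fresh LLM pass.
```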
Implementation order
- Metadata reconciler (arxiv + crossref + openalex merge, warnings on disagreement)
- cite_key pattern DSL (Better-BibTeX-style, unit-tested)
- Marker integration via `uvx marker-pdf` CLI (Python sidecar; Rust drives via subprocess + JSON)
- Chunk schema v2 writer (paragraph/figure/table/equation chunks with bbox + hash)
- GROBID on-demand for bibliography references
- `crossmem verify` command
- Nougat sidecar for math-heavy pages (opt-in)
What this buys the user
Writing a paper citing Vaswani 2017 p.4 §3.2:
Before (Phase 1): the user opens the wiki, sees the page-4 summary paragraph, and pastes the BibTeX; they may still need to open the PDF to find the exact sentence.
After (Phase 2):
- The wiki shows §3.2 as a dedicated chunk with a verbatim quote.
- Clicking the provenance block opens the PDF at page 4 with the bbox highlighted.
- Cite key `vaswani2017attention` is guaranteed stable across the arXiv preprint → published NeurIPS version.
- `crossmem verify`, run weekly, confirms no wiki has silently drifted from its PDF source.
Sources
- PaperQA2 — chunk-size configurable RAG
- Tensorlake citation grounding — anchor token pattern
- Marker — PDF→markdown with bbox
- GROBID — structured metadata extraction
- Better BibTeX cite keys — pattern DSL
- OpenAlex Work object — DOI canonical