Chunk-based Citation v2 Design
Status: Implemented (Phase 2 MVP shipped)
Date: 2026-04-15
User requirement
How do we ensure citations are absolutely correct — foolproof, with no room for error?
One-line answer: Verbatim text + bbox provenance is ground truth; LLM only touches paraphrase/implication, never quotes; metadata is cross-verified across ≥2 canonical sources.
Competitor survey
| Tool | What it nails | What it misses |
|---|---|---|
| Zotero + Better BibTeX | Stable cite_key via JS-ish pattern DSL; key regeneration rules; 80%+ academic mind-share | No chunk/page content; just metadata container |
| Marker (datalab-to/marker) | PDF→markdown + polygon bbox per block, --keep_chars for char-level bboxes, JSON tree-per-page | Slower than pdftotext; needs CUDA/MPS |
| Nougat | Transformer-based; beats GROBID on formulas | VLM → hallucination risk on quote fidelity |
| GROBID | 68 fine-grained TEI labels; best on metadata + bibliography refs; 2–5s/page, 90%+ accuracy | Weak on formulas, figures, modern layouts |
| PaperQA2 | Chunk-size configurable; LLM re-rank + contextual summarization; grounded in-text citations | No bbox, chunk = N-char sliding window → page/fragment precision lost |
| Tensorlake RAG | Anchor tokens <c>2.1</c> inlined + bbox stored separately → auditable trail | Proprietary pipeline; design pattern is copyable |
| OpenAlex / CrossRef / Semantic Scholar | Each is a canonical metadata source | Each has gaps; must cross-reconcile |
The industry gold standard for “absolutely correct citation”:
- Parse once with a bbox-aware extractor (Marker-class) → each block carries `{page, polygon, text}`.
- Anchor tokens inlined at chunk build time (`<c>p4§3.2</c>`) so the LLM can only emit citation IDs it saw in context.
- Resolve citation IDs → bbox + page at render time; users get a deep link to the exact PDF region.
- Metadata cross-check across OpenAlex + CrossRef + arXiv; flag inconsistencies instead of silently picking one.
- Quote is verbatim from the PDF text layer, stored with SHA-256 of the source bytes — any LLM-generated “quote” is rejected.
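The anchor-token rule above can be enforced mechanically: collect the IDs actually inlined into the model's context, then reject any cited ID that was never there. A minimal sketch, with illustrative function names and example strings (not the shipped code):

```python
import re

# Anchor tokens look like <c>ID</c>; the LLM may only cite IDs it saw.
ANCHOR_RE = re.compile(r"<c>(.*?)</c>")

def extract_anchor_ids(context: str) -> set[str]:
    """Collect the citation IDs that were inlined into the model's context."""
    return set(ANCHOR_RE.findall(context))

def fabricated_citations(llm_output: str, context: str) -> list[str]:
    """Return cited IDs that never appeared in the context (to be rejected)."""
    seen = extract_anchor_ids(context)
    return [cid for cid in ANCHOR_RE.findall(llm_output) if cid not in seen]

context = "Attention is computed as <c>p4§3.2</c> describes."
output = "The mechanism <c>p4§3.2</c> scales dot products; see <c>p9§6</c>."
# "p9§6" was never inlined into the context, so it is flagged as fabricated.
```

Because validation is a pure string check against the context the model actually received, it needs no trust in the model at all.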
What Phase 1 got right / wrong
Right: pre-gen APA/MLA/Chicago/IEEE/BibTeX, deterministic cite_key, per-page original text preserved verbatim, paraphrase/implication separated from quote.
Wrong / gap:
- Metadata only from arXiv API (no CrossRef/OpenAlex cross-check)
- Quote preservation is page-level, not paragraph/sentence
- No bbox — can’t deep-link into PDF region
- No hash-based verifiability
- cite_key = primitive pattern vs Better BibTeX DSL
- No handling of preprint→published DOI mapping
Phase 2 architecture
2A. Metadata layer (the cite_key + bib trust root)
Pipeline:
arxiv_id → [arxiv API] ┐
→ [CrossRef] ├─→ reconcile → canonical metadata
→ [OpenAlex] ┘ │
├─→ cite_key (Better-BibTeX-style pattern, configurable)
├─→ 5 formats (APA/MLA/Chicago/IEEE/BibTeX)
└─→ DOI + published-version DOI (if preprint)
Rules:
- ≥2 sources must agree on title + first-author + year. Disagreement → emit `meta.warnings` in frontmatter.
- cite_key pattern DSL (ported from Better BibTeX): `[auth:lower][year][shorttitle:1:nopunct]`, configurable via `~/.crossmem/config.toml`.
- Track preprint↔published mapping in `meta.doi_preprint` and `meta.doi_published`.
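A minimal sketch of the two rules above. The field names, normalisation, and the heavily simplified pattern evaluator are illustrative assumptions, not the shipped reconciler or the full Better BibTeX DSL:

```python
import re
from collections import Counter

def reconcile(records: list[dict]) -> tuple[dict, list[str]]:
    """≥2 sources must agree per field; otherwise record a warning."""
    canonical, warnings = {}, []
    for field in ("title", "first_author", "year"):
        votes = Counter(str(r[field]).strip().lower()
                        for r in records if field in r)
        value, count = votes.most_common(1)[0]
        if count < 2:
            warnings.append(f"sources disagree on {field}")
        canonical[field] = value
    return canonical, warnings

def cite_key(meta: dict) -> str:
    """Toy evaluator for [auth:lower][year][shorttitle:1:nopunct]."""
    auth = re.sub(r"\W", "", meta["first_author"]).lower()
    year = str(meta["year"])
    short = re.sub(r"\W", "", meta["title"].split()[0]).lower()
    return f"{auth}{year}{short}"

arxiv    = {"title": "Attention Is All You Need", "first_author": "Vaswani", "year": 2017}
crossref = {"title": "Attention Is All You Need", "first_author": "Vaswani", "year": 2017}
openalex = {"title": "Attention is all you need", "first_author": "Vaswani", "year": 2017}
meta, warnings = reconcile([arxiv, crossref, openalex])
# cite_key(meta) → "vaswani2017attention"
```

Normalising case before voting means OpenAlex's lowercase title still counts as agreement; a raw string compare would have produced a spurious warning.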
2B. PDF parsing layer (the chunk trust root)
Tiered strategy by document type + quality tier:
| Tier | Parser | Use when | Bbox? | Speed |
|---|---|---|---|---|
| 0 | pdftotext -layout | Fallback / pure text | No | instant |
| 1 | Marker (Mac MPS) | Default for arxiv | Yes, polygon/block | 1–3 s/page |
| 2 | GROBID (JVM, local) | Bib-references + structured metadata | Yes, TEI | 2–5 s/page |
| 3 | Nougat (MPS) | Formula-heavy pages | Partial | 5–15 s/page |
Phase 2 default: Marker for body + GROBID for bibliography, both run, merge into unified chunk tree.
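The tier table can be read as a simple dispatch. A sketch under stated assumptions: the `math_density` heuristic, its threshold, and the GPU check are illustrative, not the real selection logic:

```python
# Hypothetical parser dispatch mirroring the tier table above.
def pick_parser(doc_kind: str, math_density: float, has_accelerator: bool) -> str:
    if not has_accelerator:
        return "pdftotext"   # Tier 0: instant fallback, no bbox
    if math_density > 0.3:
        return "nougat"      # Tier 3: formula-heavy pages (opt-in)
    if doc_kind == "bibliography":
        return "grobid"      # Tier 2: references + TEI metadata
    return "marker"          # Tier 1: default for arXiv bodies
```

In the Phase 2 default, this choice is made per region rather than per document: Marker handles the body while GROBID handles the bibliography, and the outputs merge into one chunk tree.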
2C. Chunk schema v2 (bbox + hash provenance)
```markdown
---
cite_key: vaswani2017attention
meta:
  sources: [arxiv, crossref, openalex]
  reconciled: true
  warnings: []
  pdf_sha256: 9a8f...
  ...
---

## p.4 §3.2 Scaled Dot-Product Attention
<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention"...

provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: 5f3e1c...
  byte_range: [18342, 19104]

**Paraphrase:** …
**Implication:** …
```
- `text_sha256` = SHA-256 of the verbatim extracted text. Re-running the extractor must reproduce it, else the chunk is flagged stale.
- `bbox` + `page` = deep-link target: `crossmem://pdf/{cite_key}#p=4&bbox=72,340,523,412`.
- `byte_range` = PDF content-stream offset (from Marker); the cheapest way to re-verify without re-extraction.
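The staleness check and deep-link construction are both pure functions of the chunk's provenance fields. A minimal sketch, assuming the `crossmem://` URI format from the schema; function names are illustrative:

```python
import hashlib

def text_hash(verbatim: str) -> str:
    """SHA-256 of the verbatim extracted text, as stored in text_sha256."""
    return hashlib.sha256(verbatim.encode("utf-8")).hexdigest()

def is_stale(chunk: dict, re_extracted_text: str) -> bool:
    """A chunk is stale if re-extraction no longer reproduces its hash."""
    return chunk["text_sha256"] != text_hash(re_extracted_text)

def deep_link(cite_key: str, chunk: dict) -> str:
    """Render the crossmem:// deep link from page + bbox provenance."""
    coords = ",".join(str(int(v)) for v in chunk["bbox"])
    return f"crossmem://pdf/{cite_key}#p={chunk['page']}&bbox={coords}"

quote = 'We call our particular attention "Scaled Dot-Product Attention"...'
chunk = {
    "page": 4,
    "bbox": [72.0, 340.5, 523.8, 412.1],
    "text_sha256": text_hash(quote),
}
# deep_link("vaswani2017attention", chunk)
# → "crossmem://pdf/vaswani2017attention#p=4&bbox=72,340,523,412"
```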
2D. LLM contract (what model is / isn’t allowed to touch)
| Field | Who writes | Verifiable? |
|---|---|---|
title, authors, year, doi, arxiv_id | Metadata reconciler | Cross-source check |
cite_key, 5 citation strings | Deterministic generator | Pure function, unit-tested |
original (the quote) | PDF extractor | SHA-256 + byte_range |
paraphrase, implication | LLM | Never trusted for provenance |
figure.caption | PDF extractor | bbox + OCR-of-caption-only |
figure.implication | LLM | Same rule: advisory text only |
The pipeline never asks the LLM to produce a quote. If a future feature wants “the key sentence on this page”, the LLM picks a sentence index from a numbered list of extracted sentences, never emits the sentence text.
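The sentence-index contract can be sketched as follows; the prompt wording and function names are illustrative assumptions. The key property: the returned quote is always the extractor's verbatim string, looked up by index, never text the model produced:

```python
def build_prompt(sentences: list[str]) -> str:
    """Show the model a numbered list; ask for an index, never text."""
    listing = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences))
    return f"Pick the key sentence. Reply with the index only.\n{listing}"

def resolve_choice(sentences: list[str], llm_reply: str) -> str:
    """Map the model's reply back to the extractor's verbatim sentence."""
    idx = int(llm_reply.strip())        # non-numeric replies raise here
    if not 0 <= idx < len(sentences):
        raise ValueError(f"index {idx} out of range")
    return sentences[idx]               # verbatim, straight from the extractor

sents = ["Attention maps queries to outputs.", "We scale by 1/sqrt(d_k)."]
# resolve_choice(sents, "1") returns the verbatim second sentence;
# an out-of-range or non-numeric reply is rejected outright.
```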
2E. Paragraph- and figure-level chunking
- Paragraph splitter: Marker’s block tree → paragraph-typed blocks become chunks (not pages).
- Figure chunks: Marker `figure` blocks → crop image to `raw/figs/{cite_key}_fig{N}.png`; caption extracted separately; implication runs on the caption only.
- Table chunks: Marker `table` blocks → markdown-table format; implication runs on the markdown text.
- Equation chunks: Nougat output in LaTeX, stored as `$$…$$`; implication runs on the LaTeX source.
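The block-type-to-chunk mapping above can be sketched as a dispatch over the parser's block tree. The flat block dicts here are a simplified stand-in for Marker's actual JSON structure, an assumption for illustration:

```python
def blocks_to_chunks(blocks: list[dict], cite_key: str) -> list[dict]:
    """Map parser block types to chunk kinds per the rules above."""
    chunks, fig_n = [], 0
    for b in blocks:
        if b["type"] == "paragraph":
            chunks.append({"kind": "paragraph", "text": b["text"]})
        elif b["type"] == "figure":
            fig_n += 1
            chunks.append({
                "kind": "figure",
                "image": f"raw/figs/{cite_key}_fig{fig_n}.png",
                "caption": b.get("caption", ""),  # implication runs on caption only
            })
        elif b["type"] == "table":
            chunks.append({"kind": "table", "markdown": b["text"]})
        elif b["type"] == "equation":
            chunks.append({"kind": "equation", "latex": "$$" + b["text"] + "$$"})
    return chunks

blocks = [
    {"type": "paragraph", "text": "Intro paragraph."},
    {"type": "figure", "caption": "Model architecture."},
    {"type": "equation", "text": r"\mathrm{softmax}(QK^T/\sqrt{d_k})V"},
]
out = blocks_to_chunks(blocks, "vaswani2017attention")
```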
2F. Idempotence + re-compile
- Re-running `capture` is idempotent on `arxiv_id`: re-downloads only if `pdf_sha256` differs.
- Re-running `compile` re-runs the LLM pass only for chunks whose `text_sha256` changed.
- `crossmem verify <cite_key>` walks the wiki, re-extracts, re-hashes, and reports any mismatches.
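Both idempotence checks reduce to hash comparisons. A sketch with illustrative names, assuming `old`/`new` map chunk id → `text_sha256`:

```python
import hashlib

def needs_redownload(stored_pdf_sha256: str, fetched_bytes: bytes) -> bool:
    """capture re-downloads only when the PDF bytes actually changed."""
    return hashlib.sha256(fetched_bytes).hexdigest() != stored_pdf_sha256

def chunks_to_recompile(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """compile re-runs the LLM pass only for changed or brand-new chunks."""
    return [cid for cid, h in new.items() if old.get(cid) != h]

old = {"p4s32c1": "5f3e1c", "p5s40c1": "aa11bb"}
new = {"p4s32c1": "5f3e1c", "p5s40c1": "deadbe", "p6s41c1": "c0ffee"}
# Only p5s40c1 (changed hash) and p6s41c1 (new chunk) get a fresh LLM pass.
```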
Implementation order
- Metadata reconciler (arxiv + crossref + openalex merge, warnings on disagreement)
- cite_key pattern DSL (Better-BibTeX-style, unit-tested)
- Marker integration via `uvx marker-pdf` CLI (Python sidecar; Rust drives via subprocess + JSON)
- Chunk schema v2 writer (paragraph/figure/table/equation chunks with bbox + hash)
- GROBID on-demand for bibliography references
- `crossmem verify` command
- Nougat sidecar for math-heavy pages (opt-in)
What this buys the user
Writing a paper citing Vaswani 2017 p.4 §3.2:
Before (Phase 1): the user opens the wiki, sees the page-4 summary paragraph, and pastes the BibTeX; they may still need to open the PDF to find the exact sentence.
After (Phase 2):
- The wiki shows §3.2 as a dedicated chunk with a verbatim quote.
- Clicking the provenance block opens the PDF at page 4 with the bbox highlighted.
- Cite key `vaswani2017attention` is guaranteed stable across the arXiv preprint → published NeurIPS version.
- `crossmem verify`, run weekly, confirms no wiki has silently drifted from its PDF source.
Sources
- PaperQA2 — chunk-size configurable RAG
- Tensorlake citation grounding — anchor token pattern
- Marker — PDF→markdown with bbox
- GROBID — structured metadata extraction
- Better BibTeX cite keys — pattern DSL
- OpenAlex Work object — DOI canonical