Data Model
Core types
ReconciledMetadata
The metadata reconciler merges data from multiple sources into a single canonical record.
pub struct ReconciledMetadata {
    pub title: String,
    pub authors: Vec<String>,
    pub year: u16,
    pub arxiv_id: String,
    pub doi: Option<String>,
    pub doi_preprint: Option<String>,
    pub doi_published: Option<String>,
    pub sources: Vec<String>,  // e.g. ["arxiv", "crossref", "openalex"]
    pub warnings: Vec<String>,
    pub reconciled: bool,
}
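How the reconciler decides that sources agree is not spelled out here. As a minimal sketch, assuming a field counts as agreed when every source that reports it gives the same value, a helper for `year` might look like this. `SourceYear` and `reconcile_year` are hypothetical names introduced for illustration, and the warning text is invented.

```rust
/// Hypothetical per-source observation of the publication year.
pub struct SourceYear {
    pub source: String, // e.g. "arxiv"
    pub year: Option<u16>,
}

/// Returns the chosen year (if any source reported one) plus one warning
/// per disagreeing source.
/// Assumption: the first source that reports a year wins.
pub fn reconcile_year(observations: &[SourceYear]) -> (Option<u16>, Vec<String>) {
    let mut chosen: Option<(&str, u16)> = None;
    let mut warnings = Vec::new();
    for obs in observations {
        // Sources that did not report a year are skipped silently.
        let Some(year) = obs.year else { continue };
        if let Some((first_source, first_year)) = chosen {
            if first_year != year {
                warnings.push(format!(
                    "year mismatch: {first_source}={first_year}, {}={year}",
                    obs.source
                ));
            }
        } else {
            chosen = Some((obs.source.as_str(), year));
        }
    }
    (chosen.map(|(_, year)| year), warnings)
}
```

Under the same assumption, a caller could append the returned strings to `warnings` and set `reconciled` only when none were produced.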
ChunkV2
The paragraph-level chunk with full provenance.
pub struct ChunkV2 {
    pub chunk_type: String,          // "page", "heading", "paragraph", etc.
    pub chunk_id: String,            // e.g. "p1s1c1"
    pub page: usize,
    pub text: String,                // Verbatim extracted text
    pub provenance: Provenance,
    pub paraphrase: Option<String>,  // LLM-generated
    pub implication: Option<String>, // LLM-generated
}
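The example ID `"p1s1c1"` suggests a page/section/chunk encoding, although the format is not defined here. The sketch assumes IDs have the shape `p<page>s<section>c<chunk>`. `ChunkIdParts` and `parse_chunk_id` are hypothetical names, not part of the documented types.

```rust
/// Hypothetical decoded form of a chunk ID such as "p1s1c1".
#[derive(Debug, PartialEq, Eq)]
pub struct ChunkIdParts {
    pub page: usize,
    pub section: usize,
    pub chunk: usize,
}

/// Parses IDs of the assumed form `p<page>s<section>c<chunk>`.
/// Returns `None` when a marker is missing or a number fails to parse.
pub fn parse_chunk_id(id: &str) -> Option<ChunkIdParts> {
    let rest = id.strip_prefix('p')?;
    let (page, rest) = rest.split_once('s')?;
    let (section, chunk) = rest.split_once('c')?;
    Some(ChunkIdParts {
        page: page.parse().ok()?,
        section: section.parse().ok()?,
        chunk: chunk.parse().ok()?,
    })
}
```

With this reading, `parse_chunk_id("p1s1c1")` yields page 1, section 1, chunk 1, and an ID that lacks the `p`, `s`, or `c` markers yields `None`.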
Provenance
Tracks exactly where a chunk came from in the source PDF.
pub struct Provenance {
    pub page: usize,
    pub section: Option<String>,
    pub bbox: Option<[f64; 4]>,         // [x_min, y_min, x_max, y_max]
    pub text_sha256: String,
    pub byte_range: Option<[usize; 2]>,
}
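One way to use `byte_range` is to check that it still selects the chunk's text from a fresh extraction. The sketch below assumes the range is half-open, `[start, end)`, and counts bytes of the UTF-8 extracted text. Neither detail is confirmed here, and `byte_range_matches` is a hypothetical helper. Checking `text_sha256` would need a hashing crate, so it is left out.

```rust
/// Checks that `byte_range` selects exactly `chunk_text` from `source`.
/// Assumption: the range is half-open, `[start, end)`, in bytes of UTF-8 text.
pub fn byte_range_matches(source: &str, byte_range: [usize; 2], chunk_text: &str) -> bool {
    let [start, end] = byte_range;
    // `str::get` returns None for out-of-bounds or inverted ranges, and for
    // ranges that cut through a multi-byte character, so this never panics.
    source.get(start..end) == Some(chunk_text)
}
```

A `false` result means the stored range no longer lines up with the extractor's output, for instance after re-running extraction with a different tool.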
WikiEntry (MCP)
The in-memory representation used by the MCP server.
struct WikiEntry {
    cite_key: Option<String>,
    title: String,
    authors: Vec<String>,
    year: Option<u16>,
    source: Option<String>,
    date: Option<String>,
    file_path: PathBuf,
    body: String,
}
Storage layout
~/crossmem/
├── raw/                                    # Capture output
│   ├── <timestamp>_<cite_key>.pdf          # Raw PDF
│   └── <timestamp>_<cite_key>.meta.json    # Reconciled metadata
└── wiki/                                   # Compile output
    └── <timestamp>_<cite_key>.md           # Wiki note
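The naming scheme in the tree can be captured in a small path builder. This is a sketch, not the project's actual code: `CapturePaths` and `capture_paths` are hypothetical names, the caller is assumed to have already expanded `~/crossmem` to an absolute root, and the timestamp is passed through unchanged because its format is not specified here.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: the three files that belong to one capture.
pub struct CapturePaths {
    pub pdf: PathBuf,
    pub meta: PathBuf,
    pub wiki: PathBuf,
}

/// Builds the per-capture paths under `root` (the expanded `~/crossmem`).
pub fn capture_paths(root: &Path, timestamp: &str, cite_key: &str) -> CapturePaths {
    // All three files share the `<timestamp>_<cite_key>` stem.
    let stem = format!("{timestamp}_{cite_key}");
    CapturePaths {
        pdf: root.join("raw").join(format!("{stem}.pdf")),
        meta: root.join("raw").join(format!("{stem}.meta.json")),
        wiki: root.join("wiki").join(format!("{stem}.md")),
    }
}
```

Deriving all three paths from one stem keeps the raw PDF, its metadata, and the compiled note trivially matchable by filename.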
Trust boundaries
| Data | Source | Verification |
|---|---|---|
| Title, authors, year, DOI | Metadata reconciler (arXiv + CrossRef + OpenAlex) | Cross-source agreement |
| Cite key, citation strings | Deterministic generator | Pure function, unit-tested |
| Verbatim quote text | PDF extractor (Marker / pdftotext) | SHA-256 hash |
| Bounding box, byte range | PDF extractor | Re-extraction reproducibility |
| Paraphrase, implication | LLM (Ollama) | Not verifiable; advisory only |