# Chunk Format
After the citations section, each wiki note contains a series of chunks. Each chunk preserves verbatim text from the source PDF along with provenance metadata.
## Chunk structure
<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention".
> The input consists of queries and keys of dimension dk, and values
> of dimension dv.
**Paraphrase:** The authors name their mechanism "Scaled Dot-Product Attention" and define its inputs.
**Implication:** This naming convention becomes the standard terminology used across the field.
```yaml
provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: "5f3e1c..."
  byte_range: [18342, 19104]
```
## Chunk ID format
Chunk IDs follow the pattern `p{page}s{section}c{chunk}`:
| Part | Description | Example |
|------|-------------|---------|
| `p{N}` | Page number | `p4` = page 4 |
| `s{N}` | Section number within page | `s32` = section 3.2 |
| `c{N}` | Chunk number within section | `c1` = first chunk |
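The pattern above can be split into its three numeric parts with a small regular expression. The following is a minimal sketch, not crossmem's own parser: the names `ChunkId` and `parse_chunk_id` are hypothetical, and the sketch assumes each part is a plain run of decimal digits.

```python
import re
from typing import NamedTuple

# Pattern assumed from the documented `p{page}s{section}c{chunk}` layout.
CHUNK_ID_RE = re.compile(r"p([0-9]+)s([0-9]+)c([0-9]+)")


class ChunkId(NamedTuple):
    """Hypothetical container for the three parts of a chunk ID."""
    page: int
    section: int
    chunk: int


def parse_chunk_id(chunk_id: str) -> ChunkId:
    """Split a chunk ID such as 'p4s32c1' into its numeric parts."""
    match = CHUNK_ID_RE.fullmatch(chunk_id)
    if match is None:
        raise ValueError(f"not a valid chunk ID: {chunk_id!r}")
    page, section, chunk = (int(part) for part in match.groups())
    return ChunkId(page, section, chunk)


print(parse_chunk_id("p4s32c1"))  # ChunkId(page=4, section=32, chunk=1)
```

Using `fullmatch` rejects IDs with leading or trailing characters, which is a design choice of this sketch rather than a documented rule.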
## Fields
### Verbatim text
Lines starting with `> ` contain the original text extracted from the PDF. This text is **never modified by the LLM**; it comes directly from the PDF parser.
### Paraphrase
A 1–2 sentence LLM-generated summary of the chunk's content. Generated by Ollama during `crossmem compile`.
### Implication
A 1–2 sentence LLM-generated statement about why this chunk matters to the field. Generated by Ollama during `crossmem compile`.
### Provenance
YAML metadata block attached to each chunk:
| Field | Type | Description |
|-------|------|-------------|
| `page` | integer | Page number in the source PDF |
| `section` | string | Section heading (if detected by parser) |
| `bbox` | `[f64; 4]` | Bounding box `[x_min, y_min, x_max, y_max]` in PDF coordinates. Present when parsed with Marker. |
| `text_sha256` | string | SHA-256 hash of the verbatim text. Used by `crossmem verify` to detect drift. |
| `byte_range` | `[usize; 2]` | `[start, end]` byte offset in the source PDF content stream. Present when available from parser. |
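For readers consuming these fields from a script, the table can be mirrored in a small typed record. This is an illustrative sketch only: `Provenance` and `provenance_from_mapping` are hypothetical names, the input is assumed to be the `provenance:` mapping after it has already been parsed from YAML, and treating `section`, `bbox`, and `byte_range` as optional is inferred from the "if detected" and "present when" wording in the table.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    """Hypothetical typed view of one chunk's provenance block."""
    page: int
    text_sha256: str
    section: Optional[str] = None  # assumed absent when no heading is detected
    bbox: Optional[Tuple[float, ...]] = None  # [x_min, y_min, x_max, y_max]
    byte_range: Optional[Tuple[int, ...]] = None  # [start, end]


def provenance_from_mapping(data: Mapping[str, Any]) -> Provenance:
    """Build a Provenance from an already-parsed `provenance:` mapping."""
    bbox = data.get("bbox")
    if bbox is not None and len(bbox) != 4:
        raise ValueError("bbox must contain exactly four numbers")
    byte_range = data.get("byte_range")
    if byte_range is not None and len(byte_range) != 2:
        raise ValueError("byte_range must contain exactly two offsets")
    return Provenance(
        page=int(data["page"]),
        text_sha256=str(data["text_sha256"]),
        section=data.get("section"),
        bbox=tuple(float(v) for v in bbox) if bbox is not None else None,
        byte_range=tuple(int(v) for v in byte_range) if byte_range is not None else None,
    )


# Values copied from the example chunk; the hash is truncated in the source.
example = provenance_from_mapping({
    "page": 4,
    "section": "3.2 Scaled Dot-Product Attention",
    "bbox": [72.0, 340.5, 523.8, 412.1],
    "text_sha256": "5f3e1c...",
    "byte_range": [18342, 19104],
})
print(example.page, example.bbox)
```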
## Chunk types
The `chunk_type` field (internal) classifies each chunk:
| Type | Description |
|------|-------------|
| `page` | Full-page text (from `pdftotext` fallback) |
| `heading` | Section heading |
| `paragraph` | Body paragraph (from Marker block tree) |
| `figure` | Figure caption |
| `table` | Table content |
| `equation` | Mathematical expression |
## Integrity verification
Run `crossmem verify` to re-hash every chunk's verbatim text and compare the result against the stored `text_sha256`. A mismatch indicates that the chunk's verbatim text has changed since compilation.
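The idea behind the check can be sketched in a few lines. This is not the `crossmem verify` implementation: the helper names are hypothetical, and the normalization used here (strip the `> ` prefix, join lines with `\n`, encode as UTF-8) is an assumption that may differ from what crossmem actually hashes.

```python
import hashlib
from typing import Iterable


def verbatim_text(chunk_lines: Iterable[str]) -> str:
    """Collect a chunk's blockquote lines, dropping the '> ' prefix."""
    return "\n".join(line[2:] for line in chunk_lines if line.startswith("> "))


def chunk_is_intact(chunk_lines: Iterable[str], stored_sha256: str) -> bool:
    """Re-hash the verbatim text and compare it with the stored digest."""
    text = verbatim_text(chunk_lines)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest == stored_sha256


original = [
    '> We call our particular attention "Scaled Dot-Product Attention".',
    "**Paraphrase:** The authors name their mechanism.",
]
# Stand-in for the digest that would be stored at compile time.
stored = hashlib.sha256(verbatim_text(original).encode("utf-8")).hexdigest()

edited = ['> We call our attention "Scaled Dot-Product Attention".']
print(chunk_is_intact(original, stored))  # True
print(chunk_is_intact(edited, stored))    # False
```

Because only the `> ` lines feed the hash in this sketch, editing the paraphrase or implication would not trigger a mismatch.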