# Chunk Format

After the citations section, each wiki note contains a series of chunks. Each chunk preserves verbatim text from the source PDF along with provenance metadata.

## Chunk structure

<!-- chunk id=p4s32c1 -->
> We call our particular attention "Scaled Dot-Product Attention".
> The input consists of queries and keys of dimension dk, and values
> of dimension dv.

**Paraphrase:** The authors name their mechanism "Scaled Dot-Product Attention" and define its inputs.

**Implication:** This naming convention becomes the standard terminology used across the field.

```yaml
provenance:
  page: 4
  section: "3.2 Scaled Dot-Product Attention"
  bbox: [72.0, 340.5, 523.8, 412.1]
  text_sha256: "5f3e1c..."
  byte_range: [18342, 19104]
```

## Chunk ID format

Chunk IDs follow the pattern `p{page}s{section}c{chunk}`:

| Part | Description | Example |
|------|-------------|---------|
| `p{N}` | Page number | `p4` = page 4 |
| `s{N}` | Section number, written without the dot | `s32` = section 3.2 |
| `c{N}` | Chunk number within section | `c1` = first chunk |
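The pattern above can be split into its three numeric parts with a regular expression. The following is a minimal sketch, not crossmem's own parser: the names `ChunkId` and `parse_chunk_id` are hypothetical, and it assumes each part is a plain run of decimal digits.

```python
import re
from typing import NamedTuple

# Assumed shape: p<digits>s<digits>c<digits>, with nothing before or after.
CHUNK_ID_RE = re.compile(r"p(\d+)s(\d+)c(\d+)")


class ChunkId(NamedTuple):
    """Hypothetical container for the three parts of a chunk ID."""
    page: int
    section: int
    chunk: int


def parse_chunk_id(chunk_id: str) -> ChunkId:
    """Split an ID such as 'p4s32c1' into its numeric parts."""
    match = CHUNK_ID_RE.fullmatch(chunk_id)
    if match is None:
        raise ValueError(f"not a valid chunk ID: {chunk_id!r}")
    page, section, chunk = (int(group) for group in match.groups())
    return ChunkId(page, section, chunk)
```

For the example ID `p4s32c1` this yields page 4, section 32, and chunk 1. The section part stays an opaque integer because the ID alone does not say where the dot in "3.2" belongs.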

## Fields

### Verbatim text

Lines starting with `> ` contain the original text extracted from the PDF. This text is **never modified by the LLM**; it comes directly from the PDF parser.
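As an illustration of the `> ` convention, the sketch below collects the verbatim lines from one chunk's Markdown. The function name `extract_verbatim` is hypothetical, and joining the lines with a newline is an assumption; crossmem may normalize the text differently.

```python
def extract_verbatim(chunk_markdown: str) -> str:
    """Return the text of all lines that start with '> ', prefix removed.

    Assumption: lines are rejoined with a single newline.
    """
    verbatim_lines = []
    for line in chunk_markdown.splitlines():
        if line.startswith("> "):
            verbatim_lines.append(line[2:])  # drop the two-character prefix
    return "\n".join(verbatim_lines)
```

Lines such as `**Paraphrase:** ...` carry no `> ` prefix, so they are skipped and only the parser's text is returned.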

### Paraphrase

A 1–2 sentence LLM-generated summary of the chunk's content. Generated by Ollama during `crossmem compile`.

### Implication

A 1–2 sentence LLM-generated statement about why this chunk matters to the field. Generated by Ollama during `crossmem compile`.

### Provenance

YAML metadata block attached to each chunk:

| Field | Type | Description |
|-------|------|-------------|
| `page` | integer | Page number in the source PDF |
| `section` | string | Section heading (if detected by parser) |
| `bbox` | `[f64; 4]` | Bounding box `[x_min, y_min, x_max, y_max]` in PDF coordinates. Present when parsed with Marker. |
| `text_sha256` | string | SHA-256 hash of the verbatim text. Used by `crossmem verify` to detect drift. |
| `byte_range` | `[usize; 2]` | `[start, end]` byte offsets in the source PDF content stream. Present when available from the parser. |
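The table above can be mirrored as a small typed record. This is a sketch under assumptions rather than crossmem's internal type: `Provenance` and `provenance_from_dict` are hypothetical names, and `section`, `bbox`, and `byte_range` are modelled as optional because the table describes them as present only when the parser provides them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record mirroring the provenance fields."""
    page: int
    text_sha256: str
    section: Optional[str] = None
    bbox: Optional[Tuple[float, float, float, float]] = None
    byte_range: Optional[Tuple[int, int]] = None


def provenance_from_dict(raw: dict) -> Provenance:
    """Build a Provenance from an already-parsed YAML mapping."""
    bbox = raw.get("bbox")
    byte_range = raw.get("byte_range")
    return Provenance(
        page=int(raw["page"]),
        text_sha256=str(raw["text_sha256"]),
        section=raw.get("section"),
        bbox=tuple(float(v) for v in bbox) if bbox is not None else None,
        byte_range=tuple(int(v) for v in byte_range) if byte_range is not None else None,
    )
```

The function expects a plain mapping, so the YAML can be loaded with any library beforehand.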

## Chunk types

The `chunk_type` field (internal) classifies each chunk:

| Type | Description |
|------|-------------|
| `page` | Full-page text (from `pdftotext` fallback) |
| `heading` | Section heading |
| `paragraph` | Body paragraph (from Marker block tree) |
| `figure` | Figure caption |
| `table` | Table content |
| `equation` | Mathematical expression |

## Integrity verification

Run `crossmem verify` to re-hash every chunk's verbatim text and compare the result against the stored `text_sha256`. Any mismatch indicates that the chunk's verbatim text in the wiki file has been modified since compilation.
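The comparison itself is simple to sketch. The code below is illustrative only and is not the `crossmem verify` implementation: the function names are hypothetical, and hashing the UTF-8 bytes of the text with no further normalization is an assumption.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of the text's UTF-8 bytes (encoding is assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_is_intact(verbatim_text: str, stored_sha256: str) -> bool:
    """True when the recomputed hash equals the stored `text_sha256`."""
    return sha256_hex(verbatim_text) == stored_sha256.lower()
```

Changing a single character of the verbatim text produces a different digest, which is what lets the check detect drift.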
