add v5 substrate training data (40 examples)

Hand-crafted examples for the generic substrate LoRA:
- 01: identity/self-model
- 02-18: tool categories (159 tools grouped by function)
- 19-26: tricky tool disambiguation
- 27-32: framing (substrate vs persona, HMT, psychoacoustic trust)
- 33-40: EEMS deep-dive (architecture, recall, store, graph, legacy)

Pretty-printed JSON, compiled to JSONL via build pipeline.
System prompt uses generic assistant, not BT persona.
marauder-actual
2026-05-31 14:32:19 +02:00
parent 175d28a5aa
commit 0e88c3c2ae
40 changed files with 1033 additions and 0 deletions
@@ -0,0 +1,16 @@
{
"messages": [
{
"role": "system",
"content": "You are a helpful AI assistant with access to tools. Use tools when they help answer the user's question. Be concise and accurate."
},
{
"role": "user",
"content": "How does EEMS work technically?"
},
{
"role": "assistant",
"content": "<think>\nEEMS architecture — the technical stack. SQLite, FTS5, vector search, hybrid recall, embedding model, schema.\n</think>\n\nEEMS is built on SQLite with three search layers fused together:\n\n**Storage:** Each memory is a row in a `memories` table with these fields:\n- `id` — 26-char ULID (time-sortable unique ID)\n- `namespace` — isolation boundary (default: 'default')\n- `subject` — dot-namespaced category (e.g. `doctrine.training.lora`, `infra.network`)\n- `content` — the memory text itself\n- `content_hash` — SHA-256 for deduplication\n- `classification` — 'standard' (deletable) or 'core' (immutable)\n- `confidence` — 0.0 to 1.0\n- `embedding` — 3072-byte BLOB (768 floats, nomic-embed-text-v1.5)\n- `metadata` — JSON object for arbitrary structured data\n- `created_at` / `updated_at` — epoch milliseconds\n\n**Three search layers:**\n\n1. **FTS5** (Full-Text Search) — BM25 keyword search with stop words, covers subject + content. Sub-millisecond. Good for exact term matches.\n2. **sqlite-vec** (Vector Search) — 768-dimensional cosine similarity against nomic-embed-text-v1.5 embeddings. ~2ms. Good for semantic similarity (\"find memories about network topology\" finds entries that don't contain those exact words).\n3. **Hybrid** (Reciprocal Rank Fusion) — merges FTS5 and vector results via RRF with k=60. ~22ms (dominated by embedding latency). This is the default recall mode.\n\n**Performance benchmarks:**\n- FTS5 p50: 166 microseconds\n- Vector p50: 2.0 milliseconds\n- Hybrid p50: 22 milliseconds\n- Recall@5: 90% hybrid, 80% FTS5, 75% vector (baseline before hybrid was 65%)\n\n**Per-host topology:** EEMS is per-host with no automatic replication. Each host has its own `~/.local/share/madcat/eems-v2.db`. A query on fuji searches fuji's store; it cannot see sin's memories without explicit cross-host tools. This is a deliberate design choice — local-first, with sync as a future layer.\n\n**Related tables:**\n- `memories_fts` — FTS5 virtual table, auto-maintained via triggers\n- `edges` — graph relations between memories (from_id, to_id, relation, weight)\n- `chunks_code` / `chunks_code_fts` — separate index for code chunks (distinct from memory storage)"
}
]
}
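
Reviewer note: the hybrid layer in the example above merges FTS5 and vector results via Reciprocal Rank Fusion with k=60. The following is a minimal sketch of that fusion step, assuming each search layer returns an ordered list of memory IDs. The function name `rrf_merge` and the sample IDs are hypothetical and are not taken from the EEMS codebase; only the constant k=60 comes from the example text.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each ID receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. IDs are returned by descending fused score.
    This is an illustrative sketch, not the EEMS implementation.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break score ties by ID so the output order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the keyword and vector layers.
fts_hits = ["m1", "m2", "m3"]
vec_hits = ["m3", "m1", "m4"]
fused = rrf_merge([fts_hits, vec_hits])
```

An ID that ranks well in both lists (here `m1` and `m3`) ends up ahead of one that appears in only a single list, which is the behavior the hybrid recall description relies on.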