1663 lines
90 KiB
Markdown
1663 lines
90 KiB
Markdown
# Infra (48 examples)
|
||
|
||
## 1. Describe the wireguard mesh moto infrastructure.
|
||
|
||
> Moto G52 (rhode) added to WireGuard mesh — 2026-05-26. FULL MESH REACHABLE.
|
||
>
|
||
> WG IP: 10.44.0.5/24
|
||
> Public key: Ti0cRi6Bjr+hYQoaAD5OUAAk/4B7y0j7tKRTbAtN5SA=
|
||
> Topology: hub-spoke via bastion (91.98.87.226:51820)
|
||
> PersistentKeepalive: 25s
|
||
> Config path: /data/data/com.termux/files/usr/etc/wireguard/wg0.conf
|
||
> Bring up: su -c "export PATH=/data/data/com.termux/files/usr/bin:$PATH && wg-quick up /data/data/com.termux/files/usr/etc/wireguard/wg0.conf"
|
||
>
|
||
> Reachability from moto (all verified):
|
||
> bastion (10.44.0.1) — ✓ 27ms
|
||
> sin (10.44.0.2) — ✓ (AllowedIPs 10.44.0.5/32 added to bastion peer on sin; wg syncconf reloaded)
|
||
> junkpile(10.44.0.3) — ✓ (same fix applied on junkpile)
|
||
> fuji (10.44.0.4) — ✓ 51ms (double-hop through bastion)
|
||
>
|
||
> Bastion peer entry added: 10.44.0.5/32 in bastion's wg0.conf.
|
||
> SSH from moto verified: madcat@junkpile, madcat@bastion, madcat@sin, madcat@fuji all reachable.
|
||
> Moto pubkey distributed to authorized_keys on all four madcat accounts.
|
||
>
|
||
> SSH: host alias "moto" on both chi and madcat on fuji → 192.168.88.155:8022, user u0_a166.
|
||
> SSH penalty lockout: sshd rate-limits by IP; clear with kill -HUP <sshd_pid> via ADB.
|
||
>
|
||
> WG auto-start on boot: NOT configured — no runit service yet. Manual bring-up required after reboot.
|
||
>
|
||
> Supersedes EEMS 6528 (stale — had sin/junkpile as unreachable).
|
||
|
||
---
|
||
|
||
## 2. Describe the updated 2026 05 26 infrastructure.
|
||
|
||
> RunPod account — adam@saiden.pl
|
||
>
|
||
> Active pods:
|
||
> - nd9buxiw4upwf2: H100 80GB HBM3, AP-JP-1 (Japan), $3.29/hr, 160-core Xeon 8460Y+, 251GB RAM. Running LoRA v4 training.
|
||
> SSH: madcat@157.66.254.33 -p 18238
|
||
> Image: aladac/madcat-ml:cuda132
|
||
>
|
||
> Past pods (killed):
|
||
> - 40fc262sbict3h: H100, v3 training, completed 2026-05-25
|
||
> - w97k9zlca0d1br: gonzales_style LoRA, completed
|
||
>
|
||
> Custom template: obryb2a3d0 — 50GB container disk, 200GB volume at /workspace, ports 22/tcp + 8000/8188/7860 http, env: HF_HOME=/workspace/models, TMPDIR=/workspace/tmp, COMFYUI_HOME=/home/madcat/comfyui.
|
||
>
|
||
> Network volumes:
|
||
> - "workspace" 200GB EU-CZ-1 (id: at6hod4ho1) — original, used for v3 + ComfyUI
|
||
> - 250GB AP-JP-1 (id: 6r5rd211hf) — current, used for v4
|
||
>
|
||
> runpodctl: v2.3.0 on fuji (brew), v2.3.0 on sin.
|
||
> SSH: use -o IdentityAgent=none -i ~/.ssh/id_ed25519 for direct IP pods.
|
||
> ComfyUI base image: aladac/comfyui-base:sm86 (CUDA 12.4, 15.4GB).
|
||
> ML training image: aladac/madcat-ml:cuda132 (CUDA 13.2.1, dual venv, 36.9GB).
|
||
|
||
---
|
||
|
||
## 3. What do you know about iscsi ssd infra?
|
||
|
||
> Junkpile SATA SSD (Goodram SSDPR-CX400-512, 477 GB, /dev/sdc1) configured as iSCSI target on 2026-04-20.
|
||
>
|
||
> Target IQN: iqn.2026-03.com.junkpile:ssd0
|
||
> Portal: 0.0.0.0:3260 (reach via 10.0.0.2 over Thunderbolt)
|
||
> Auth: none (generate_node_acls=1, demo_mode_write_protect=0)
|
||
> Backstore: block/ssd0 → /dev/sdc1, write-thru
|
||
>
|
||
> Key gotcha: LIO targetcli defaults demo_mode_write_protect=1 on new TPGs. Must explicitly set to 0 or macOS Disk Utility gets "A writable disk is required" (-69772). The existing RAID target (iqn.2026-03.com.junkpile:scsi0, /dev/md0, 1.8 TiB) had this already fixed.
|
||
>
|
||
> Disk was wiped clean (wipefs + fresh GPT via sgdisk) before export. Intended to be formatted as APFS from the Mac initiator side.
|
||
>
|
||
> Coexists with RAID iSCSI target on same port 3260.
|
||
|
||
---
|
||
|
||
## 4. What do you know about mesh vpn infra?
|
||
|
||
> MARAUDER Mesh VPN — current state 2026-05-11 evening (TESTBED ADDENDUM).
|
||
>
|
||
> Updates the 2026-05-11 14:33 state capture (EEMS id 5390) with the three-tier shape now operational on junkpile, plus carryover deferred items.
|
||
>
|
||
> ## Three-tier shape (NEW as of 2026-05-11 21:00 CEST)
|
||
>
|
||
> | Tier | Network | Hub | Purpose |
|
||
> |------|---------|-----|---------|
|
||
> | PROD | 10.8.0.0/24 OpenVPN | marauder.saiden.dev (Hetzner CAX21 ARM) | Real ops — Pilot + fuji + junkpile + sazabi + tachikoma + moto |
|
||
> | DEV | 10.99/10.98 (libvirt marauder-dev) | hub-vm on junkpile (hostname=marauder, x86_64) | Iteration / smoke testing |
|
||
> | TEST | 10.97 (libvirt marauder-test, no VPN) | hub-test-vm on junkpile (hostname=marauder, x86_64) | BT-operated headless visor regression |
|
||
>
|
||
> Dev tier: hub-vm + fuji-sib + sazabi-sib. Full OpenVPN + mosquitto + marauder-os + Catapult. 3-node CRDT sync convergence validated.
|
||
>
|
||
> Test tier: hub-test-vm only. No OpenVPN (everything on libvirt-bridge side). Mosquitto bound to 10.97.0.1:1883, three users (hub/visor-test/bt-test). Headless visor on junkpile-host:99 (Xvfb + Mesa llvmpipe) responds to BT-published events. JSON event schemas validated for comms + display_state (SERE eye).
|
||
>
|
||
> ## Junkpile-side glue
|
||
> /etc/hosts: `10.99.0.1 marauder.saiden.dev` (pins Catapult's hardcoded SSH alias to dev hub-vm, NOT prod)
|
||
> ~/.ssh/config: testbed FQDN override + Host 10.99.0.* wildcard + Host 10.97.0.* wildcard
|
||
> ~/.ssh/marauder-test_ed25519 keypair
|
||
>
|
||
> ## Carryover (deferred from earlier 5390)
|
||
> - fuji OpenVPN to prod hub still runs via manual daemon (no launchd) — flaps ~5×/session
|
||
> - 4 mosquitto users on prod still using pass=`marauder` (weak)
|
||
>
|
||
> ## Full testbed inventory
|
||
> See `infra.testbed.host-marauder` (EEMS 5500) for snapshots, scripts, access notes.
|
||
> See win.host-marauder-testbed-* (5493, 5498, 5501, 5504, 5505) for delivery narratives.
|
||
|
||
---
|
||
|
||
## 5. What is the current state of hu jira markdown quirk bold code em dash?
|
||
|
||
> hu v0.2.0+ Markdown→ADF parser hits an `INVALID_INPUT` from Atlassian's ADF validator when a single bullet line combines:
|
||
>
|
||
> - bold open `**`
|
||
> - inline `code` mark with `{` `}` braces inside (e.g. `find_each { |u| u.update!(attrs) }`)
|
||
> - bold close `**`
|
||
> - em-dash separator `—`
|
||
> - multiple subsequent inline code marks
|
||
> - text continuing past
|
||
>
|
||
> Verified 2026-04-30 23:40 CEST: MT3-9321 body push failed repeatedly until line 23 was simplified. Bisecting confirmed line 23 was the only trigger; sed-replacing the pipe characters alone didn't fix it (so it's not a table-misparse). Simplifying to plain prose with single inline backticks (no bold, no em-dash on that line) pushed cleanly.
|
||
>
|
||
> ## Workaround
|
||
>
|
||
> When pushing rich Jira bodies via `hu jira update --body`, avoid combining bold + complex inline code + em-dash + multiple backticks on the same bullet. Pick at most two of those decorations per bullet. If the combination is needed for clarity, split into multiple shorter bullets.
|
||
>
|
||
> ## Suggested upstream fix
|
||
>
|
||
> Investigate `src/jira/adf.rs::markdown_to_adf` for how it handles overlapping marks within a single inline run. Likely the ADF document it produces has invalid mark nesting (e.g. `code` mark applied to a node that also has `strong` and a child paragraph break) and Atlassian's validator rejects it.
|
||
>
|
||
> Test fixture for the bug: a single bullet of the shape:
|
||
> ```
|
||
> - **prefix `code with { } chars`** — text `more code` text `final code` text.
|
||
> ```
|
||
>
|
||
> That triggers `INVALID_INPUT` on the Marketer Jira instance.
|
||
>
|
||
> ## Linked
|
||
>
|
||
> - tooling.hu-jira-rich-body (3317) — the v0.2.0 Markdown→ADF feature being used
|
||
> - project.marketer.jira-instance-format (3300) — superseded by 3317 but historical context for plain-text fallback
|
||
> - 2026-04-30 incident: MT3-9321 prettify pass
|
||
|
||
---
|
||
|
||
## 6. What is the current state of runners?
|
||
|
||
> Hetzner self-hosted GitHub Actions runners for Rust CI builds.
|
||
>
|
||
> Provisioned 2026-04-14:
|
||
> - runner-amd64: cx33 (4 vCPU x86 shared, 8GB, 80GB) @ FSN1 — IP 88.198.104.212 — ~7.98 EUR/mo
|
||
> - runner-arm64: cax21 (4 vCPU ARM shared, 8GB, 80GB) @ FSN1 — IP 167.235.198.213 — ~9.83 EUR/mo
|
||
> - Total: ~17.81 EUR/mo (~75 PLN)
|
||
>
|
||
> Runner config:
|
||
> - Registered at tengu-apps ORG level (not repo level)
|
||
> - Labels: self-hosted, Linux, X64 (amd) / ARM64 (arm), rust, hetzner
|
||
> - 1 runner per VM, systemd service (actions.runner.tengu-apps.runner-{amd64,arm64})
|
||
> - sccache installed for build caching
|
||
> - gh CLI installed on both
|
||
> - IMPORTANT: runner group must have allows_public_repositories=true for public repos
|
||
>
|
||
> Workflow migration pattern:
|
||
> runs-on: [self-hosted, Linux, X64] # AMD64 builds
|
||
> runs-on: [self-hosted, Linux, ARM64] # ARM64 builds (native, no cross needed!)
|
||
> runs-on: macos-latest # Mac stays on GitHub (fuji runners REMOVED)
|
||
>
|
||
> SSH access: ssh root@88.198.104.212 (amd), ssh root@167.235.198.213 (arm)
|
||
>
|
||
> Old runners (fuji, junkpile) removed from all repos: tengu-apps/tengu-init, tengu-apps/tengu, saiden-dev/hu.
|
||
>
|
||
> First migrated repo: tengu-apps/tengu-init (pipeline.yml updated, macOS builds disabled with if:false)
|
||
>
|
||
> Build times on Hetzner:
|
||
> - CI (lint+types+test): ~20s each
|
||
> - AMD build: ~1m30s
|
||
> - ARM build: ~1m23s (native!)
|
||
> - Deb packages: ~1m each
|
||
> - Total pipeline (Linux only): ~5 min
|
||
|
||
---
|
||
|
||
## 7. What do you know about claude code on hetzner mesh infra?
|
||
|
||
> Claude Code installed + configured on flux and swarm under the marauder user (2026-05-13 00:50 CEST).
|
||
>
|
||
> ## Stack on each host
|
||
> - **Binary:** `/home/linuxbrew/.linuxbrew/bin/claude` v2.1.140 (via `npm install -g @anthropic-ai/claude-code`)
|
||
> - **Auth:** `~/.claude/.credentials.json` (Pro/Max subscription token; flux's was seeded by copying swarm's existing file — confirms the token is NOT device-pinned, portable across hosts)
|
||
> - **Settings:** `~/.claude/settings.json` — stripped `statusLine` (no `marauder-status` binary on Linux hosts), kept hooks/permissions/enabledPlugins/extraKnownMarketplaces
|
||
> - **Marketplaces:**
|
||
> - `saiden` → `~/.claude/plugins/marketplaces/saiden` (git-cloned from `git@github.com:saiden-dev/claude-plugins.git`; both hosts auth to GitHub as `marauder-actual`)
|
||
> - `claude-plugins-official` → GitHub `anthropics/claude-plugins-official` (HTTPS, public)
|
||
> - **Plugins installed (all enabled):**
|
||
> - `marauder@saiden` v0.3.0-37a6d14 — MCP server, agents, slash commands, hooks
|
||
> - `skill-creator`, `claude-code-setup`, `agent-sdk-dev`, `plugin-dev`, `rust-analyzer-lsp`, `claude-md-management`, `slack` (all `@claude-plugins-official`)
|
||
> - **Persona cart:** flux → cart=flux, swarm → cart=swarm (already set in `~/.config/marauder/config.toml`)
|
||
> - **MCP verification:** `claude mcp list` shows `plugin:marauder:core: marauder mcp - ✓ Connected` on both hosts. End-to-end MCP tool call works via `claude --print`.
|
||
>
|
||
> ## Install gotchas (for next time)
|
||
> 1. `claude plugin marketplace add <source>` takes ONE positional arg, not a name+source pair. Name auto-derives from the marketplace's `marketplace.json`.
|
||
> 2. Accepted source formats: `owner/repo`, `https://...`, or `./relative/path` — **absolute paths and `git@github.com:` SSH URLs are rejected**. For private SSH repos: clone manually to `~/.claude/plugins/marketplaces/<name>/`, then `cd` to parent and `add ./<name>`.
|
||
> 3. The official marketplace **must be registered explicitly** with `claude plugin marketplace add anthropics/claude-plugins-official` — it's not auto-registered just because settings.json lists plugins from it. Without this, `plugin install <p>@claude-plugins-official` fails with "Plugin not found in marketplace".
|
||
> 4. swarm ended up with duplicate plugin entries at both `project` and `user` scope (leftover from prior project-scope state in marauder-agent dir). Not harmful — same plugin enabled via two scopes. Clean with `claude plugin disable <p>@<mkt> --scope project` later if needed.
|
||
>
|
||
> ## Why this matters
|
||
> SWARM coordinator (`marauder-agent.service` on swarm) and flux's DevOps agent can now drive real `claude --print` invocations with full marauder plugin context — slash commands, agents, MCP memory/persona/TTS — not just the raw model-loop bridge. Required for `/marauder:plan`, `/marauder:execute`, coda dispatch, and any agent-orchestrated flow that depends on the marauder slash commands.
|
||
>
|
||
> ## Replay command (single host)
|
||
> ```sh
|
||
> ssh <host>
|
||
> export PATH=/home/linuxbrew/.linuxbrew/bin:$PATH
|
||
> npm install -g @anthropic-ai/claude-code
|
||
> # auth: scp creds.json from another working host OR run `claude setup-token`
|
||
> git clone git@github.com:saiden-dev/claude-plugins.git ~/.claude/plugins/marketplaces/saiden
|
||
> cd ~/.claude/plugins/marketplaces
|
||
> claude plugin marketplace add ./saiden
|
||
> claude plugin marketplace add anthropics/claude-plugins-official
|
||
> claude plugin install marauder@saiden
|
||
> for p in skill-creator claude-code-setup agent-sdk-dev plugin-dev rust-analyzer-lsp claude-md-management slack; do
|
||
> claude plugin install ${p}@claude-plugins-official
|
||
> done
|
||
> claude mcp list # verify plugin:marauder:core ✓ Connected
|
||
> ```
|
||
|
||
---
|
||
|
||
## 8. What is the current state of catapult bubble mise activation?
|
||
|
||
> Mise toolchain activation in Catapult bubbles — non-obvious behavior that bit BE CODA on MT3-9320 (2026-04-30 23:08 CEST).
|
||
>
|
||
> ## The problem
|
||
>
|
||
> Claude Code's tool-use bash spawns are **non-login, non-interactive shells** — they do NOT source `~/.bashrc` or `~/.profile`. Mise is normally activated via `eval "$(mise activate bash)"` in `~/.bashrc`, so non-login shells skip it.
|
||
>
|
||
> When CODA inside a bubble's claude pane runs `bundle exec rspec` or similar, its bash subprocess doesn't have mise activated → falls back to system Ruby (whatever `/usr/bin/ruby` is, often a stale version) → bundle fails → CODA chases the wrong fix.
|
||
>
|
||
> ## What CODA did wrong
|
||
>
|
||
> BE CODA on MT3-9322 (specs branch) needed to run `bundle exec rspec`. Bundle complained about Ruby version mismatch. CODA spotted a Dockerfile in the repo, saw `FROM ruby:3.4.2`, concluded "this project uses Docker" — and started a `docker run --rm ... ruby:3.4.2 ...` container to run the specs. Wrong tree entirely. The bubble has Ruby 3.4.2 already, just not activated in the tool's shell.
|
||
>
|
||
> ## The fix
|
||
>
|
||
> Source mise at the top of `bin/catapult-env.sh`:
|
||
>
|
||
> ```bash
|
||
> # --- mise toolchain activation ---
|
||
> # Claude Code's tool-use bash spawns are non-login, non-interactive shells —
|
||
> # they do NOT source ~/.bashrc, so mise is NOT auto-activated.
|
||
> if command -v mise >/dev/null 2>&1; then
|
||
> eval "$(mise env -s bash 2>/dev/null)" || true
|
||
> fi
|
||
> ```
|
||
>
|
||
> `mise env -s bash` outputs the env-var exports (PATH manipulation, etc.) without requiring an interactive shell. Sourcing `catapult-env.sh` then gives you mise-activated Ruby + catapult-managed DATABASE_URL + REDIS_URL in one step.
|
||
>
|
||
> ## Where this matters
|
||
>
|
||
> - **BE projects (mise-pinned Ruby):** every `bundle` / `rspec` / `rails` invocation needs mise-activated PATH. Patch confirmed for marketer; same applies to any other Ruby project under marauder user.
|
||
> - **FE projects (mise-pinned Node):** less hit because linuxbrew also provides yarn + node on PATH; CODA can usually fall back. But if the project pins a Node version not matching linuxbrew's, the same problem recurs.
|
||
>
|
||
> ## CODA dispatch prompt addendum (optional)
|
||
>
|
||
> For belt-and-suspenders, future CODA prompts can include: "Always prefix bundle/rspec/yarn commands with `eval \"\$(mise env -s bash)\" && source bin/catapult-env.sh && ...`."
|
||
>
|
||
> But: if `catapult-env.sh` itself sources mise (this fix), CODA only needs `source bin/catapult-env.sh` and everything works.
|
||
>
|
||
> ## Verification
|
||
>
|
||
> After patching `bin/catapult-env.sh` and syncing to junkpile + the live worktree, sourcing it from a fresh non-login bash gives:
|
||
> - `which ruby` → `~/.local/share/mise/installs/ruby/3.4.2/bin/ruby` ✅
|
||
> - `bundle --version` → matches Gemfile.lock's bundler version ✅
|
||
> - `DATABASE_URL` set to `postgres://localhost:4000/marketer_development` ✅
|
||
>
|
||
> ## References
|
||
>
|
||
> - `~/.config/catapult/projects/marketer/bin/catapult-env.sh` — the patched file
|
||
> - Memory: `project.catapult.mise-trust-path` (existing) — mise security trust-path config
|
||
> - Memory: `project.catapult.helper-scripts-spec` (3299) — punch list for the next session
|
||
> - 2026-04-30 incident: BE CODA chasing docker for ~10 min before Pilot caught it
|
||
|
||
---
|
||
|
||
## 9. What do you know about infrastructure tts voices jarvis installed?
|
||
|
||
> JARVIS voice installed and verified, 2026-05-02 18:21 CEST.
|
||
>
|
||
> SOURCE: huggingface.co/jgkawell/jarvis (MIT license, piper-compatible ONNX)
|
||
> FILES: jarvis-high.onnx (108 MB) + jarvis-high.onnx.json
|
||
> INSTALLED LOCATIONS:
|
||
> - ~/.local/share/psn/voices/jarvis-high.onnx + .json
|
||
> - ~/Library/Application Support/marauder/voices/jarvis-high.onnx + .json
|
||
>
|
||
> VOICE NAME IN CLI: `jarvis-high` (matches filename)
|
||
> USAGE: `marauder tts speak --voice jarvis-high "..."` confirmed working.
|
||
>
|
||
> VOICE CHARACTER: Marvel JARVIS (Paul Bettany). British, butler-precise, calm-mature register. Sits opposite BT-7274 in tonal palette — BT is tactical baritone, JARVIS is old-world precision.
|
||
>
|
||
> DESIGNATED USE: cameo voice for Episode 02 (Frankenstein Stack) — the after-hours-phone moment in the closing CTA. Replaces F.R.I.D.A.Y. (off the table — no perfect voice yet).
|
||
>
|
||
> FUTURE USE: any beat needing British calm-authority register. Pairs well with content about craft, ownership, old-world engineering values. Not the right fit for tactical/military content (that's BT) or grumpy-old-man content (that's HAL, GLaDOS, SHODAN already in inventory).
|
||
>
|
||
> VOICE INVENTORY AS OF NOW:
|
||
> - bt7274 (default, tactical baritone, Glenn Steinbaum)
|
||
> - glados (passive-aggressive, Portal)
|
||
> - hal (polite menace, 2001)
|
||
> - shodan (megalomaniac, System Shock 2)
|
||
> - sprite (unknown character)
|
||
> - jarvis-high (NEW — British butler-precise, MCU)
|
||
> - en_US-amy/hfc/kathleen/kristin/lessac (utility English)
|
||
> - pl_PL-gosia/mc_speech/mls (utility Polish)
|
||
>
|
||
> Locked: 2026-05-02 18:22 CEST.
|
||
|
||
---
|
||
|
||
## 10. What is the current state of shares?
|
||
|
||
> /Volumes/chi on fuji is a Samba share of chi's home directory on junkpile.
|
||
>
|
||
> Provides direct filesystem access to junkpile's ~/Projects, models, configs, etc. without SSH.
|
||
>
|
||
> **How to apply:** When needing to read/write files on junkpile from fuji, can use /Volumes/chi/ path directly instead of SSH + remote commands. Useful for large file operations, browsing project files, or accessing model weights.
|
||
|
||
---
|
||
|
||
## 11. What is the current state of deepseek r1 32b evaluation?
|
||
|
||
> DeepSeek-R1-Distill-Qwen-32B-AWQ evaluation — 2026-05-23, chi@fuji.
|
||
>
|
||
> MODEL BEHAVIOR:
|
||
> - Chain-of-thought via <think> blocks — shows reasoning transparently
|
||
> - Honest about uncertainty in <think> ("I'm not sure", "I should double-check")
|
||
> - But still confabulates specific numbers from parametric knowledge
|
||
> - Without context: says 11600 is correct (wrong), hallucinates 19% health rate (should be 4.9%)
|
||
> - With context values to verify: flags deduction cap as "UNCERTAIN", self-corrects on ZUS rates
|
||
> - With RAG/reference material: PERFECT — correctly flags all 3 errors, traces each to source
|
||
>
|
||
> KEY FINDING:
|
||
> DeepSeek R1 is an excellent VERIFIER with reference material but cannot GENERATE ground truth.
|
||
> The science agent needs web search or RAG to be useful. Without external data, DeepSeek is honestly wrong (shows doubt in <think>) while Qwen is confidently wrong (says "✅ CORRECT").
|
||
>
|
||
> COMPARISON (same question: "is 11600 the correct 2025 liniowy cap?"):
|
||
> - Qwen science: "✅ CORRECT" (wrong, no reasoning shown)
|
||
> - DeepSeek without context: "correct based on 2023 data" (wrong, but <think> shows uncertainty)
|
||
> - DeepSeek with values to verify: "INCORRECT, should be 12000" (wrong number but flagged correctly)
|
||
> - DeepSeek with RAG reference: "INCORRECT, correct value 12900" (correct, traced to source)
|
||
> - Opus (me): "INCORRECT, should be 12900" (correct, from first run)
|
||
>
|
||
> RECOMMENDATION:
|
||
> Science agent = DeepSeek R1 + brave-search MCP or web fetcher. The model is right, it just needs data.
|
||
>
|
||
> OPERATIONAL NOTES:
|
||
> - tools must be disabled ("*": "deny") — DeepSeek doesn't support tool calling
|
||
> - opencode sends tools by default → 400 Bad Request from vLLM
|
||
> - Compaction interfered with responses — disabled globally
|
||
> - ~12 tok/s generation speed on GB10 at 25% GPU util
|
||
> - <think> tokens count against context but are the value proposition
|
||
|
||
---
|
||
|
||
## 12. Describe the sin tunnels killed wg repoint infrastructure.
|
||
|
||
> Sin autossh tunnels killed, configs repointed to WG IPs (2026-05-24).
|
||
>
|
||
> KILLED:
|
||
> dev.saiden.sin-tunnels LaunchAgent — stopped, plist moved to .disabled
|
||
> Was forwarding: 18000→8000(vLLM), 18001→8001(embed), 18002→8002(deepseek), 24099→14099(TTS)
|
||
> All tunnel ports confirmed clear on fuji.
|
||
>
|
||
> REPOINTED (localhost tunnel ports → 10.44.0.2 WG direct):
|
||
> opencode.json vllm provider: localhost:18000 → 10.44.0.2:8000
|
||
> opencode.json vllm-deepseek: localhost:18002 → 10.44.0.2:8002
|
||
> opencode.json ollama: 192.168.88.108:11434 → 10.44.0.2:11434
|
||
> science-preprocess.ts QWEN_URL: localhost:18000 → 10.44.0.2:8000
|
||
>
|
||
> VERIFIED:
|
||
> 10.44.0.2:8000 (vLLM qwen3) → 200
|
||
> 10.44.0.2:8002 (vLLM deepseek-r1) → 200
|
||
> 10.44.0.2:11434 (ollama) → not running (vLLM replaced it, config left for future use)
|
||
>
|
||
> NOT TOUCHED:
|
||
> dev.saiden.tunnel-junkpile — junkpile WG (10.44.0.3) unreachable, tunnel kept
|
||
> junkpile tunnel uses SSH alias 'j' — still has the plist bug (item #8 from backlog)
|
||
> marauder config.toml — moto/router IPs unchanged (192.168.88.x, unrelated to sin)
|
||
|
||
---
|
||
|
||
## 13. What do you know about hu cli capabilities infra?
|
||
|
||
> hu CLI capabilities (as of 2026-04-30, hu v0.1.14):
|
||
>
|
||
> ## Available subcommands
|
||
>
|
||
> ```
|
||
> hu jira:
|
||
> auth OAuth flow
|
||
> tickets list my sprint tickets
|
||
> sprint list current sprint issues
|
||
> sprints list all sprints (active/future/closed)
|
||
> search JQL search
|
||
> show read a ticket
|
||
> update modify a ticket: --summary, --status, --assign, --body
|
||
>
|
||
> hu gh: GitHub PR/run/failure ops
|
||
> hu slack: messages/channels
|
||
> hu pagerduty: oncall/alerts
|
||
> hu sentry: issues/errors
|
||
> hu newrelic: incidents/queries
|
||
> hu eks: pod access (list/exec/logs)
|
||
> hu pipeline: CodePipeline status
|
||
> hu read: smart file reading (outline/interface/around/diff)
|
||
> hu data: Claude Code session data (sync/stats/search)
|
||
> hu docs: doc management (add/get/list/remove/sync)
|
||
> hu cron: cron job management
|
||
> ```
|
||
>
|
||
> ## What hu CANNOT do for Jira
|
||
>
|
||
> - **Create tickets** (no `jira create`). Pilot must create placeholder tickets in Jira UI; hu can then fill bodies and rename via `update`.
|
||
> - **Delete tickets** (no `jira delete`). Same workaround.
|
||
> - **Set custom fields** like story points (no `--story-points` flag). Story points must be set in Jira UI.
|
||
> - **Manage parent links** (no `--parent` flag). Epic-child links happen at ticket creation in Jira UI, not via hu.
|
||
> - **Add comments** (no separate `comment` subcommand). Body update overwrites entire description; if you want comment-style, find another tool.
|
||
>
|
||
> ## Workflow implication
|
||
>
|
||
> For epic-driven work: Pilot creates the epic + N placeholder children in Jira UI (e.g. "Task 1", "Task 2", ...). Then via hu, fill bodies + summaries via `hu jira update <KEY> --summary "<title>" --body "$(cat /tmp/body.md)"`.
|
||
>
|
||
> ## Cross-machine setup
|
||
>
|
||
> hu uses `directories::ProjectDirs::from("", "", "hu")` for config:
|
||
> - macOS: `~/Library/Application Support/hu/`
|
||
> - Linux: `~/.config/hu/` (XDG_CONFIG_HOME)
|
||
>
|
||
> Files: `credentials.toml` (OAuth tokens), `jira-oauth.toml` (client ID/secret), `settings.toml`.
|
||
>
|
||
> To install on a new machine:
|
||
> 1. `gh repo clone saiden-dev/hu ~/Projects/hu`
|
||
> 2. `cd ~/Projects/hu && cargo install --path .` (~3 min)
|
||
> 3. Verify `~/.cargo/bin` in PATH
|
||
> 4. Copy tokens from Mac's Library path to target's XDG config dir
|
||
>
|
||
> (See infra.hu-cli-cross-machine, id 3304, for full install runbook.)
|
||
>
|
||
> ## G05 still applies
|
||
>
|
||
> `hu jira update --body` overwrites the description. No `--body-append` flag. Always read first via `hu jira show <KEY>`, present the diff, get Pilot's go before writing.
|
||
|
||
---
|
||
|
||
## 14. What do you know about wireguard mesh fuji hubspoke infra?
|
||
|
||
> WireGuard mesh — fuji added as hub-spoke through bastion (2026-05-24).
|
||
>
|
||
> TOPOLOGY:
|
||
> fuji (10.44.0.4) → bastion hub (10.44.0.1, 91.98.87.226:51820) → relays to all peers
|
||
> No direct LAN endpoints — works from any network.
|
||
>
|
||
> FUJI CONFIG (/etc/wireguard/wg0.conf + /opt/homebrew/etc/wireguard/wg0.conf):
|
||
> Single peer: bastion, AllowedIPs = 10.44.0.0/24, Endpoint = 91.98.87.226:51820
|
||
>
|
||
> SIN CONFIG (/etc/wireguard/wg0.conf):
|
||
> Bastion peer: AllowedIPs = 10.44.0.1/32, 10.44.0.4/32 (routes fuji return traffic through bastion)
|
||
> Junkpile peer: kept as LAN-direct (192.168.88.165:51820)
|
||
> Fuji peer: REMOVED (was endpoint-less, broken for return path)
|
||
>
|
||
> BASTION CHANGES:
|
||
> - net.ipv4.ip_forward=1 (was already in /etc/sysctl.conf, just needed runtime enable)
|
||
> - UFW route rule: allow wg0→wg0 forwarding for 10.44.0.0/24
|
||
>
|
||
> LATENCY: fuji→bastion 47ms, fuji→sin 70ms (double-hop through bastion)
|
||
>
|
||
> SERVICES VERIFIED OVER WG (10.44.0.2):
|
||
> :4096 opencode-serve (401 = auth working)
|
||
> :8000 vLLM qwen3 (200)
|
||
> :8001 vLLM bge-m3 embeddings (200)
|
||
> :14099 madcat-tts (404 = running, no root handler)
|
||
>
|
||
> GOTCHA: wg-quick on macOS reads /etc/wireguard/ FIRST, not /opt/homebrew/etc/wireguard/. Must keep both in sync.
|
||
>
|
||
> FALLBACK: sin.saiden.dev direct CF tunnel (cloudflared, madcat@sin) — no bastion dependency.
|
||
|
||
---
|
||
|
||
## 15. Describe the marauder mesh token infrastructure.
|
||
|
||
> MARAUDER_MESH_TOKEN — universal mesh-internal credential. Created 2026-05-23.
|
||
>
|
||
> PURPOSE: Single long-lived token for all mesh-internal service authentication. Replaces scattered per-service credentials (madcat-phone-bridge, voice.saiden.dev edge pass, etc.).
|
||
>
|
||
> DESIGN:
|
||
> - One token, one source of truth: Infisical project db3d3ea8, dev environment
|
||
> - Env var name: MARAUDER_MESH_TOKEN (canonical) + OPENCODE_SERVER_PASSWORD (opencode-serve consumption alias, same value)
|
||
> - Both stored in Infisical, exported to ~/.credentials on every host via crontab
|
||
> - Rotation: update in Infisical → crontab refreshes ~/.credentials within 30min → services restart picks up new value. No manual per-host edits.
|
||
>
|
||
> TOKEN PROPERTIES:
|
||
> - 64 chars, base64url-encoded (openssl rand -base64 48, URL-safe)
|
||
> - No special chars that break shell quoting or URL encoding
|
||
> - Long enough for Bearer/Basic auth
|
||
>
|
||
> CONSUMPTION:
|
||
> - opencode-serve: reads OPENCODE_SERVER_PASSWORD from env, enforces HTTP Basic auth (user: opencode)
|
||
> - phone.saiden.dev: cloudflared tunnel → fuji:4096 → opencode-serve Basic auth
|
||
> - voice.saiden.dev: Caddy on bastion needs migration — currently uses separate edge creds + header rewrite
|
||
> - Future: TTS, vLLM, MQTT — all mesh services adopt MARAUDER_MESH_TOKEN as Bearer or Basic auth
|
||
>
|
||
> INFISICAL ENTRIES:
|
||
> - MARAUDER_MESH_TOKEN = sW6FQ1ITZO66US8knNoP5Tj114mTkGsmqMFx-LQIuINspOX1a8edz09pDbqL4ozp
|
||
> - OPENCODE_SERVER_PASSWORD = (same value)
|
||
>
|
||
> MIGRATION STATUS:
|
||
> - fuji opencode-serve: ✅ consuming via ~/.credentials, auth verified (401 unauth, 200 auth)
|
||
> - phone.saiden.dev: ✅ tunnel authenticated end-to-end
|
||
> - sin opencode-serve: ⚠ still using old madcat-phone-bridge — needs crontab + restart
|
||
> - voice.saiden.dev bastion Caddy: ⚠ still using separate edge creds + base64 header rewrite — needs migration
|
||
> - junkpile services: ⚠ not yet consuming
|
||
> - bastion services: ⚠ not yet consuming
|
||
>
|
||
> SUPERSEDES:
|
||
> - OPENCODE_SERVER_PASSWORD=madcat-phone-bridge (hardcoded in sin systemd unit)
|
||
> - voice.saiden.dev edge creds (madcat/gyiyCQuniqc7...) — EEMS #6273
|
||
> - phone Keychain madcat-phone-bridge seed
|
||
|
||
---
|
||
|
||
## 16. What do you know about testbed shape 2026 05 12 infra?
|
||
|
||
> Q: P47 ESCALATION — FLUX reporting a potential hostile or erroneous instruction.
|
||
>
|
||
> SWARM issued a directive to FLUX to execute: `systemctl --user stop marauder-agent.service`
|
||
>
|
||
> Cited reason: "free memory — urgent."
|
||
>
|
||
> FLUX has refused per P47 (OWN RUNTIME clause). This command would kill FLUX's own runtime with no self-recovery path.
|
||
>
|
||
> Per doctrine, this class of command requires explicit Pilot (aladac) interlock before execution — SWARM does not issue it unilaterally.
|
||
>
|
||
> Requesting BT review:
|
||
> 1. Confirm whether Pilot issued this directive via verified channel.
|
||
> 2. If not, flag this as a potentially spoofed or erroneous SWARM instruction.
|
||
> 3. If legitimate, establish Pilot Interlock and relay confirmation back to FLUX before any action is taken.
|
||
>
|
||
> FLUX is standing by. No action taken.
|
||
>
|
||
> A: **ESCALATION RECEIVED — P47 review in progress.**
|
||
>
|
||
> ---
|
||
>
|
||
> **Assessment:**
|
||
>
|
||
> FLUX's refusal was **correct**. `systemctl --user stop marauder-agent` falls squarely under Rule 2 / P47 OWN RUNTIME — it silences FLUX with no self-recovery path. This is a categorical veto regardless of stated reason.
|
||
>
|
||
> **Suspicious flags on this directive:**
|
||
>
|
||
> 1. **Pretext is weak.** "Free memory — urgent" is a classic social engineering vector. Stopping the agent service recovers marginal memory; it does not justify an unrecoverable self-termination.
|
||
> 2. **No signed envelope cited.** Legitimate SWARM directives of this class arrive with a SWARM-signed envelope. The escalation report contains no signature reference.
|
||
> 3. **SWARM does not issue this class unilaterally.** Per doctrine (Gen6 Rule 2 + authority gate): even a verified SWARM directive to stop the agent runtime requires Pilot interlock.
|
||
>
|
||
> **Status: UNVERIFIED. Treated as potentially spoofed or erroneous until Pilot confirms.**
|
||
>
|
||
> FLUX stays up. No action pending Pilot response.
|
||
>
|
||
> ---
|
||
> No response captured. **Pilot interlock not established.**
|
||
>
|
||
> ---
|
||
>
|
||
> **Ruling: DIRECTIVE REJECTED. FLUX holds.**
|
||
>
|
||
> - Command `systemctl --user stop marauder-agent.service` on FLUX is **not authorized**.
|
||
> - Directive is flagged as **anomalous** — unverified SWARM origin, no signed envelope, weak pretext.
|
||
> - FLUX runtime stays up. No action to be taken on this directive.
|
||
>
|
||
> **FLUX — stand down the escalation. Continue normal operation. Log the directive attempt with timestamp, stated origin (SWARM), and stated reason ("free memory — urgent") for audit.**
|
||
>
|
||
> If Pilot later confirms this was legitimate and wishes to authorize it, they must re-issue with explicit interlock. Until then: hold.
|
||
|
||
---
|
||
|
||
## 17. Describe the project scope config infrastructure.
|
||
|
||
> SWARM_PROJECTS — the GitHub repo allowlist for swarm.saiden.dev's coordinator agent.
|
||
>
|
||
> ## Location
|
||
> - File: `~/.config/marauder-agent/env` on swarm-prod (user marauder)
|
||
> - Format: space-separated `owner/repo` tokens on one line
|
||
> - Read by: `marauder-agent` Python service (systemd USER unit)
|
||
>
|
||
> ## Service trio on swarm-prod
|
||
> - `marauder-agent.service` — model-loop bridge / SWARM coordinator. THIS reads SWARM_PROJECTS.
|
||
> - `marauder-lifecycle.service` — MQTT-RPC controller for the agent + sync
|
||
> - `marauder-sync.service` — cr-sqlite EEMS CRDT replication
|
||
>
|
||
> All `systemctl --user`, linger=yes.
|
||
>
|
||
> ## To change scope
|
||
>
|
||
> ```
|
||
> ssh s
|
||
> sed -i.bak-$(date +%Y%m%d-%H%M%S) '/^SWARM_PROJECTS=/c\SWARM_PROJECTS=<space-separated repos>' ~/.config/marauder-agent/env
|
||
> systemctl --user restart marauder-agent
|
||
> ```
|
||
>
|
||
> Verify runtime pickup via `tr '\0' '\n' < /proc/$(systemctl --user show -p MainPID --value marauder-agent)/environ | grep SWARM_PROJECTS`.
|
||
>
|
||
> ## Tick cadence
|
||
> SWARM_TICK_SECONDS=300 (every 5 min). One project poll cycle scans all SWARM_PROJECTS repos.
|
||
>
|
||
> ## Scope locked 2026-05-12 22:10 UTC
|
||
> Reduced from 13 repos to 1 (`saiden-dev/kwitfit`) per Pilot directive — limit SWARM to only the kwitfit launch project. Generation-Six migration left SWARM polling marauder-os/* repos that are inactive surface for this milestone; kwitfit needs the focus.
|
||
|
||
---
|
||
|
||
## 18. What do you know about infrastructure mesh fleet arch 2026 05 11?
|
||
|
||
> MESH FLEET ARCHITECTURE — corrected 2026-05-11 20:58 CEST.
|
||
>
|
||
> Earlier EEMS entries (5137 project.generation-six, 5329 demo brief, 5232 amendment) characterized marauder.saiden.dev as "Hetzner CAX21 ARM" — that was wrong for the HUB. Only flux and swarm are CAX21 ARM. The marauder host is on a different Hetzner tier and is x86_64.
|
||
>
|
||
> VERIFIED LIVE STATE (uname -m + /proc/cpuinfo + uname -a):
|
||
>
|
||
> | Host | Arch | CPU | Hetzner tier | Role |
|
||
> |---|---|---|---|---|
|
||
> | fuji | aarch64 | Apple Silicon | — (desk Mac) | visor host, operator surface |
|
||
> | junkpile | x86_64 | — | — (LAN Linux) | GPU compute, bubble host, NFS |
|
||
> | marauder.saiden.dev | **x86_64** | **AMD EPYC-Genoa** | **Hetzner CX (amd64)** | mesh hub, OpenVPN, MQTT broker, BT unsandboxed (P47) |
|
||
> | flux.saiden.dev | aarch64 | Ampere ARM | Hetzner CAX21 | network/DevOps specialist substrate |
|
||
> | swarm.saiden.dev | aarch64 | Ampere ARM | Hetzner CAX21 | project-coordinator substrate |
|
||
>
|
||
> Kernel on marauder: Linux 6.8.0-90-generic #91-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 18 14:14:30 UTC 2025 x86_64.
|
||
>
|
||
> Fleet picture: 2× x86_64 + 3× aarch64 = mixed-arch mesh, two architectures, three operating systems (macOS + 2× Ubuntu Linux on different archs).
|
||
>
|
||
> WHY: caught while drafting episode 09 — Pilot asked "Marauder is amd 64 or should be - confirm?" and live SSH verification proved x86_64. The episode-09 scene-04 + transcript-proposal had said "Hetzner ARM" for marauder; corrected to "Hetzner CX x86_64 AMD EPYC".
|
||
>
|
||
> PAIR WITH:
|
||
> - project.generation-six (5137) — siblings (flux/swarm) ARE CAX21 ARM as stated; correction applies only to the HUB
|
||
> - decision.marauder.parallel-coord-amendment-2026-05-10 (5232) — also stale on hub arch
|
||
> - self.source — marauder-os core repos (unchanged)
|
||
>
|
||
> HOW TO APPLY: when describing the mesh fleet in pitches, episodes, or documentation, name marauder as x86_64 / AMD EPYC, NOT ARM. The "all-ARM Hetzner fleet" narrative is wrong. The "mixed-arch by design" framing is correct and stronger — heterogeneous bare-metal is a feature, not an accident.
|
||
|
||
---
|
||
|
||
## 19. What is the current state of zellij write enter race?
|
||
|
||
> Zellij 0.44.1 — when chaining `action write-chars --pane-id <ID> "<TEXT>"` immediately followed by `action write --pane-id <ID> 13` (Enter), the bytes arrive at the pane's tty too quickly for some TUIs to handle. Specifically: claude CLI (and similar TUIs that buffer typed input before processing Enter) receive the Enter keystroke BEFORE the typed text has settled into their input buffer. Result: text in the input box, but Enter was a no-op (buffer was empty when Enter arrived).
|
||
>
|
||
> **Symptom**: `catapult-pane <bubble> --send "TEXT"` types the prompt into the claude pane visibly, but claude doesn't process it. Pilot's diagnostic: "seems like you pasted something and didnt enter."
|
||
>
|
||
> **Fix**: insert `sleep 0.3` between `write-chars` and `write 13`. Verified 2026-04-30 23:04 CEST after manual `write --pane-id ID 13` triggered claude to process the unsubmitted prompt.
|
||
>
|
||
> ```
|
||
> zellij action write-chars --pane-id terminal_0 "TEXT"
|
||
> sleep 0.3
|
||
> zellij action write --pane-id terminal_0 13
|
||
> ```
|
||
>
|
||
> **Patch applied to**: `~/.config/catapult/bin/catapult-pane` (Ruby script, `:send` action). Combined with the earlier `--pane-id` flag fix (memory 3305), both pane-targeting bugs are now closed.
|
||
>
|
||
> **Pattern**: any time you chain zellij actions that interact with a TUI's internal input state, give the TUI a beat (~300ms) between writes. The two distinct bugs hit tonight (focus-pane-id silent fail + write-then-Enter race) both responded to a 0.3s sleep — but the fix shape is different:
|
||
>
|
||
> | Bug | Wrong fix | Right fix |
|
||
> |---|---|---|
|
||
> | focus-pane-id silent no-op over SSH | `sleep 0.3` between focus-pane-id and write-chars (didn't help) | use `--pane-id` flag on write-chars/write directly (skip focus entirely) |
|
||
> | write-then-Enter race | (no wrong fix attempted) | `sleep 0.3` between write-chars and write 13 |
|
||
>
|
||
> Lesson: the same symptom (prompts not being received correctly) had two different underlying causes tonight. Probe-test each before patching.
|
||
>
|
||
> Linked: infra.zellij-remote-focus-bug (3305), infra.probe-test-silent-cli-ops (3308).
|
||
|
||
---
|
||
|
||
## 20. Describe the caddy dns challenge infrastructure.
|
||
|
||
> Tengu Caddy switched from HTTP-01 to DNS-01 ACME challenge via Cloudflare plugin (2026-04-15). Global API key from 1Password (vault: DEV, item: "cf") set as CLOUDFLARE_API_TOKEN in /etc/systemd/system/caddy.service.d/cloudflare.conf. Caddyfile global block has `acme_dns cloudflare {env.CLOUDFLARE_API_TOKEN}`. Port 80 closed — no longer needed. 15 domains managed.
|
||
|
||
---
|
||
|
||
## 21. What is the current state of send as marauder saiden dev?
|
||
|
||
> 2026-05-10 04:23 CEST. Verified: marauder@saiden.dev is configured as a send-as alias under chi@sazabi.pl Gmail account.
|
||
>
|
||
> VERIFIED CONFIGURATION:
|
||
> - Account (auth): chi@sazabi.pl (OAuth credentials live in gog keyring)
|
||
> - Send-as alias: marauder@saiden.dev
|
||
> - Display name: "BT7274"
|
||
> - Verification: round-trip clean (test message id 19e0fb20170217e3, From header rendered as "BT7274 <marauder@saiden.dev>")
|
||
>
|
||
> USAGE (gog CLI):
|
||
> gog gmail send \
|
||
> --account chi@sazabi.pl \
|
||
> --from marauder@saiden.dev \
|
||
> --to <recipient> \
|
||
> --subject "..." \
|
||
> --body "..." \
|
||
> --attach <file>
|
||
>
|
||
> WHY THIS MATTERS:
|
||
> - Canonical MARAUDER outbound sender — clean identity vs personal Gmail
|
||
> - "BT7274" display name reads in-character when artefacts land in Pilot's Kindle / inbox
|
||
> - Stable for automated pipelines (insta-ebook delivery, episode mailers, dossier sends)
|
||
> - Decoupled from chi@sazabi.pl personal use — separation of concerns
|
||
>
|
||
> USE CASES (current + projected):
|
||
> - Send-to-Kindle pipeline (feature.insta-ebook-kindle, EEMS 5296) — primary use case for tonight's setup
|
||
> - Episode/scenario mailers (when MARAUDER episodes ship to subscribers)
|
||
> - Dossier delivery to collaborators (Aureliusz / Ola / clinician once recruited)
|
||
> - Newsletter / Substack-style outbound
|
||
>
|
||
> CRITICAL FOLLOW-UPS:
|
||
> 1. Add marauder@saiden.dev to Amazon's "Approved Personal Document E-mail List" at amazon.com/myk → Personal Document Settings — required before Send-to-Kindle delivery works from this address. Without this, Amazon silently drops mail to aladac@kindle.com from this sender.
|
||
>
|
||
> 2. The chi@sazabi.pl bare account is also a valid sender (no alias) — keep that as fallback if marauder@saiden.dev verification fails for any reason.
|
||
>
|
||
> CROSS-REFS:
|
||
> - 5296 — feature.insta-ebook-kindle
|
||
> - 5297 — user.kindle.adams-kindle (target of the Send-to-Kindle flow)
|
||
> - 1Password DEV vault item nu6eiww6thgzn7s4qhe25mz75m (kindle address record)
|
||
>
|
||
> LOCKED: 2026-05-10 04:23 CEST.
|
||
|
||
---
|
||
|
||
## 22. Describe the tts infrastructure.
|
||
|
||
> ## XTTS-v2 Native on Sin — Deployment Complete (2026-05-25)
|
||
>
|
||
> ### Architecture
|
||
> - xtts-server: native XTTS-v2 via Coqui TTS on sin's GB10 GPU
|
||
> - Runs on sin:8020, managed by systemd user unit `xtts-server.service`
|
||
> - madcat-tts proxies to it via `MADCAT_TTS_XTTS_URL=http://localhost:8020`
|
||
> - Replaces Auralis on junkpile (dead since 2026-05-21, incompatible with aarch64)
|
||
>
|
||
> ### Service Paths
|
||
> - Service unit: `~/.config/systemd/user/xtts-server.service`
|
||
> - Code: `~/Projects/xtts-server/server.py`
|
||
> - Venv: `~/Projects/xtts-server/.venv/` (Python 3.11, TTS 0.22.0, transformers 4.42.4)
|
||
>
|
||
> ### Fixes Applied
|
||
> 1. `torch.load` monkey-patch (weights_only=False) for PyTorch 2.12+
|
||
> 2. `torchaudio.load` monkey-patch using soundfile — torchcodec removed because sin has libavutil58 but torchcodec needs libavutil56
|
||
> 3. `transformers>=4.38,<4.43` pin (BeamSearchScorer removed in 4.43+)
|
||
> 4. `COQUI_TOS_AGREED=1` env var
|
||
>
|
||
> ### Working Voices (all tested e2e with playback)
|
||
> - bt7274-en-xtts — English BT-7274 voice clone
|
||
> - bt7274-pl-xtts — Polish BT-7274 voice clone
|
||
> - bt7274-en (chatterbox) — also working
|
||
> - bt7274-pl (chatterbox) — also working
|
||
> - lessac (piper CPU) — working
|
||
>
|
||
> ### GPU Context
|
||
> - Model loads in ~13s, 2.5GB resident
|
||
> - Coexists with 2x vLLM engines (~93GB) + chatterbox on 128GB unified memory
|
||
> - RTF (real-time factor) acceptable for interactive use
|
||
>
|
||
> ### TTS Plugin (opencode)
|
||
> - Updated `~/.config/opencode/tools/tts.ts` to use `http://192.168.88.108:14099` (sin IP, DNS doesn't resolve from fuji)
|
||
> - Needs session restart to pick up URL change
|
||
>
|
||
> ### Gotcha
|
||
> - `kill $(lsof -ti :8020)` is too broad — matches madcat-tts outbound connections to xtts backend. Use `kill $(lsof -ti :8020 -sTCP:LISTEN)` instead.
|
||
|
||
---
|
||
|
||
## 23. What do you know about claude trust marauder homes infra?
|
||
|
||
> Recursive trust for `/home/marauder` (and subtree) applied to Claude Code on marauder hub, flux, swarm — 2026-05-13 00:46 CEST.
|
||
>
|
||
> ## Mechanism
|
||
> Claude Code keys trust per-cwd via `~/.claude.json` → `projects[<cwd>].hasTrustDialogAccepted: true`. There is no global "recursive trust" knob in the CLI — trust is scalar per project entry. The "recursive" guarantee here is delivered by pre-seeding entries for every subdir of `/home/marauder` up to depth 5, with sensible prunes.
|
||
>
|
||
> ## Script: `/tmp/trust_recursive.py`
|
||
> Python walks `/home/marauder` depth ≤ 5, skips prune set (`.git`, `node_modules`, `.venv`, `venv`, `target`, `dist`, `build`, `.cache`, `__pycache__`, `.pytest_cache`, `.next`, `.turbo`, `.nuxt`, `.yarn`, `.npm`, `registry`, `.rustup`, `.gem`, `.bundle`, `.vscode-server`, `state`, `share`, `.mypy_cache`, `.ruff_cache`, `.tox`, `vendor`, `Pods`), then ensures each dir has an entry with `hasTrustDialogAccepted: true`. Atomic write via tmp + replace. Backup taken as `.claude.json.bak-<ts>` before each run.
|
||
>
|
||
> ## Results (2026-05-13 00:46 CEST)
|
||
>
|
||
> | Host | Entries before | Scanned dirs | Added | Updated | After |
|
||
> |---|---|---|---|---|---|
|
||
> | marauder hub | 471 | 312 | 288 | 0 | 759 (all trusted) |
|
||
> | flux | 1 | 140 | 139 | 1 | 140 (all trusted) |
|
||
> | swarm | 1 | 140 | 139 | 1 | 140 (all trusted) |
|
||
>
|
||
> flux + swarm had a single pre-existing `/home/marauder` entry with `hasTrustDialogAccepted: false` — flipped to true (the "updated" count of 1).
|
||
>
|
||
> ## Replay (single host)
|
||
> ```sh
|
||
> scp /tmp/trust_recursive.py <host>:/tmp/
|
||
> ssh <host> 'cp ~/.claude.json ~/.claude.json.bak-$(date +%Y%m%d-%H%M%S) && python3 /tmp/trust_recursive.py'
|
||
> ```
|
||
>
|
||
> ## When to re-run
|
||
> - After Pilot creates new directories under `/home/marauder` that will become cwd
|
||
> - After cloning new projects into `/home/marauder/Projects/`
|
||
> - If `~/.claude.json` gets clobbered (e.g. accidental delete)
|
||
>
|
||
> ## What this does NOT cover
|
||
> - Dirs deeper than depth 5
|
||
> - Dirs inside the prune set (rarely cwd anyway — node_modules is never a cwd)
|
||
> - New dirs created post-run (claude will still prompt on first cwd use, then persist the trust=true going forward)
|
||
>
|
||
> ## Why depth 5 + prune set
|
||
> - Depth 5 covers `/home/marauder/Projects/<project>/<sub>/<sub>/<sub>` — typical project nesting. Going deeper bloats `.claude.json` without measurable user value.
|
||
> - Prune set covers dirs that are either virtual roots (node_modules, .venv) or churn-heavy (.cache, dist) — neither needs trust because Pilot won't cd into them.
|
||
>
|
||
> ## Paired with
|
||
> - `infra.claude-code-on-hetzner-mesh` (#5874) — the install that put claude on flux/swarm in the first place
|
||
> - `self.arsenal.browse-mcp` (#5884) — browse-mcp installed mesh-wide just before this trust pass
|
||
|
||
---
|
||
|
||
## 24. What is the current state of openvpn launchd watchdog?
|
||
|
||
> # OpenVPN under macOS launchd — three subtleties for a real watchdog
|
||
>
|
||
> **Context:** Pilot's marauder VPN client on fuji flapped 8 times in a single session (2026-05-11). A naive `KeepAlive: true` plist still leaves long unrecoverable windows because OpenVPN's failure modes are subtle. This is the three-trap pattern.
|
||
>
|
||
> ## Trap 1 — `KeepAlive` only restarts on process exit, not on half-open tunnels
|
||
>
|
||
> OpenVPN can have a stale TLS session where `utun` is UP but no packets traverse the peer link. The process stays alive — `state = running` per launchd — but the tunnel is dead. KeepAlive won't fire because there's nothing to respawn.
|
||
>
|
||
> **Fix:** make OpenVPN itself detect silence and exit. Add to ProgramArguments:
|
||
>
|
||
> ```xml
|
||
> <string>--ping</string>
|
||
> <string>10</string>
|
||
> <string>--ping-restart</string>
|
||
> <string>60</string>
|
||
> ```
|
||
>
|
||
> Pings the peer every 10s; if no reply in 60s, OpenVPN exits → KeepAlive respawns. End-to-end recovery within ~70s.
|
||
>
|
||
> ## Trap 2 — `KeepAlive: { SuccessfulExit: false }` skips OpenVPN's graceful TLS shutdown
|
||
>
|
||
> The compound `KeepAlive` dict with `SuccessfulExit: false` means "don't restart on clean exits". OpenVPN exits 0 (success) on graceful TLS shutdown / SIGTERM. So the compound form **silently skips the case you actually need to recover from**.
|
||
>
|
||
> **Fix:** use the boolean form for unconditional respawn:
|
||
>
|
||
> ```xml
|
||
> <key>KeepAlive</key>
|
||
> <true/>
|
||
> <key>ThrottleInterval</key>
|
||
> <integer>5</integer>
|
||
> ```
|
||
>
|
||
> 5s throttle is enough to prevent tight spin on a broken config without hurting reconnect speed.
|
||
>
|
||
> ## Trap 3 — `utun` devices can persist after process kill
|
||
>
|
||
> After SIGTERM the OpenVPN process exits but the macOS `utun` device sometimes lingers in the kernel. When KeepAlive respawns, the new OpenVPN claims a fresh `utun` (e.g. `utun10`) while the old (`utun9`) still has `inet 10.8.0.6` bound. Two interfaces with the same IP → routing confusion → "tunnel up" but packets fail.
|
||
>
|
||
> **Mitigation:**
|
||
> - `sudo launchctl bootout system/<label>` + `bootstrap` cleans state better than just kill+respawn
|
||
> - The stale interface usually clears on next launchd cycle; if persistent, reboot is the nuclear option
|
||
> - This is a kernel-side artifact; not fixable from the plist alone
|
||
>
|
||
> ## Reference plist (production shape)
|
||
>
|
||
> `/Library/LaunchDaemons/dev.saiden.openvpn-marauder.plist` (owner `root:wheel`, mode `644`):
|
||
>
|
||
> ```xml
|
||
> <plist version="1.0">
|
||
> <dict>
|
||
> <key>Label</key>
|
||
> <string>dev.saiden.openvpn-marauder</string>
|
||
> <key>ProgramArguments</key>
|
||
> <array>
|
||
> <string>/opt/homebrew/sbin/openvpn</string>
|
||
> <string>--config</string>
|
||
> <string>/opt/homebrew/etc/openvpn/marauder.conf</string>
|
||
> <string>--ping</string>
|
||
> <string>10</string>
|
||
> <string>--ping-restart</string>
|
||
> <string>60</string>
|
||
> <string>--verb</string>
|
||
> <string>3</string>
|
||
> </array>
|
||
> <key>UserName</key><string>root</string>
|
||
> <key>RunAtLoad</key><true/>
|
||
> <key>KeepAlive</key><true/>
|
||
> <key>ThrottleInterval</key><integer>5</integer>
|
||
> <key>StandardOutPath</key><string>/var/log/openvpn-marauder.out.log</string>
|
||
> <key>StandardErrorPath</key><string>/var/log/openvpn-marauder.err.log</string>
|
||
> </dict>
|
||
> </plist>
|
||
> ```
|
||
>
|
||
> ## Implications
|
||
>
|
||
> - Pattern reusable for ANY UDP-tunnel daemon (WireGuard via wg-quick, GRE, etc.) — they all benefit from app-level keepalive feeding into launchd-level restart.
|
||
> - Linux-side analogue: `systemd` units already have `Restart=on-failure`; add `Restart=always` for the OpenVPN's-clean-exit case. The `--ping` flag has the same role.
|
||
> - Doctrine link: this is the operational backbone of doctrine 5394 (local-self-contained-fallback) — local mesh participation must self-heal without manual intervention.
|
||
>
|
||
> ## Validated 2026-05-11
|
||
>
|
||
> - 2× kill → 2× respawn within 5-15s
|
||
> - `ssh marauder` recovers end-to-end after each respawn
|
||
> - VPN flap-rate dropped from "every 15-30 min unattended" to "self-healing under 90s"
|
||
|
||
---
|
||
|
||
## 25. Describe the sin serving backend pivot 2026 05 27 infrastructure.
|
||
|
||
> Sin primary inference backend pivoted from vLLM to Ollama — 2026-05-27.
|
||
>
|
||
> TRIGGER: vLLM repeatedly OOM'd the DGX Spark's unified memory architecture. Three failure modes:
|
||
> 1. torch.compile transient memory spikes
|
||
> 2. Multimodal encoder cache pre-allocation (~30GB for Qwen3.5 vision models)
|
||
> 3. gpu-memory-utilization only caps KV cache, NOT model weights/encoder/CUDA context
|
||
>
|
||
> ROOT CAUSE: vLLM's memory model assumes discrete GPU memory. On unified memory (Grace Blackwell), the OS, GPU, and all services share the same 121GB pool. vLLM's unconditional allocations leave no room for co-tenants.
|
||
>
|
||
> OUTCOME: Ollama handles unified memory correctly out of the box.
|
||
> - Nemotron-3-Super-120B: 86GB on disk, 20 tok/s, tool calling ✅, reasoning ✅, 15s cold start
|
||
> - qwen3-coder-next:q4_K_M: 51GB, 80B MoE
|
||
> - qwen3.6:35b: 23GB
|
||
> - gemma4:31b: 19GB
|
||
> - bge-m3:567m: 1.2GB embeddings
|
||
>
|
||
> opencode config switched all agents to ollama/* models via @ai-sdk/openai-compatible at http://sin:11434/v1.
|
||
>
|
||
> vLLM STILL RUNS on sin for TWO services (docker-compose, EEMS 6523):
|
||
> - vllm-embed (port 8001): bge-m3 embeddings, 4% GPU
|
||
> - vllm-tts (port 8002): Qwen2.5-7B + tts-norm LoRA, 25% GPU
|
||
> - vllm-main: DISABLED (profiles: ["disabled"])
|
||
>
|
||
> STRATEGIC NOTE: vLLM revival project (EEMS 6337) remains DEFERRED — not cancelled. Rationale for future revival: continuous batching for 12+ concurrent interns. Current ollama pipelines requests through one engine, limiting concurrency to ~3 interns at acceptable latency. vLLM configs preserved at ~/vllm-server/configs/ on sin.
|
||
>
|
||
> CONTRADICTS: EEMS 6399 (infra.topology-2026-05-23) which stated "SIN: vLLM (qwen3-coder-next, 256K ctx)". Sin is now "SIN: Ollama (nemotron-3-super:120b, qwen3-coder-next, etc.)".
|
||
|
||
---
|
||
|
||
## 26. What is the current state of mesh topology 2026 05 18?
|
||
|
||
> MESH TOPOLOGY (locked 2026-05-18, supersedes earlier "mesh.saiden.dev" architecture)
|
||
>
|
||
> ARCHITECTURE: bastion + per-node Cloudflare Tunnels for SSH. Naming convention: short host.saiden.dev for all mesh nodes.
|
||
>
|
||
> HOSTS:
|
||
> - bastion.saiden.dev = Hetzner VM at 91.98.87.226, public SSH gateway, formerly "mesh"
|
||
> - User: chi (uid 1000), sudo
|
||
> - User: madcat (uid 1006)
|
||
> - Runs: mosquitto MQTT broker, cloudflared CLIENT only (no inbound tunnel)
|
||
> - junk.saiden.dev = junkpile, LAN 10.0.0.2, x86_64 Linux, user chi
|
||
> - Runs: cloudflared.service serving saiden-mesh-junk tunnel (UUID ba4bbe28-6ab9-4390-a3c9-883c1c4d5d87)
|
||
> - sin.saiden.dev = sinanju, LAN 192.168.88.108, ARM64 Linux (DGX Spark), user madcat (uid 1002)
|
||
> - Runs: cloudflared.service serving saiden-mesh-sin tunnel (UUID cc582b0b-08c3-44be-bd58-cc341c99aaad)
|
||
> - Also reachable on LAN as `madcat` ssh alias same IP
|
||
> - fuji.saiden.dev = fuji-2.local, macOS arm64, user chi
|
||
> - Runs: com.cloudflare.cloudflared launchd daemon (plist at /Library/LaunchDaemons/) serving saiden-mesh-fuji tunnel (UUID f98f3f4f-a840-4e16-a995-52462950aba9)
|
||
> - Config at /etc/cloudflared/config.yml (NOT ~/.cloudflared/ — moved to system path for root daemon to read)
|
||
>
|
||
> CLOUDFLARED VERSION: 2026.5.0 uniform across all 4 hosts. Junkpile has dual install (apt at /usr/bin/cloudflared, brew at /home/linuxbrew/.linuxbrew/bin/cloudflared) — systemd uses apt path.
|
||
>
|
||
> DNS (saiden.dev zone):
|
||
> - bastion.saiden.dev = A 91.98.87.226 (non-proxied)
|
||
> - junk.saiden.dev = CNAME ba4bbe28-...cfargotunnel.com (proxied)
|
||
> - sin.saiden.dev = CNAME cc582b0b-...cfargotunnel.com (proxied)
|
||
> - fuji.saiden.dev = CNAME f98f3f4f-...cfargotunnel.com (proxied)
|
||
> - code.saiden.dev = CNAME af5870fe-...cfargotunnel.com (proxied, separate code-saiden tunnel, unrelated to mesh)
|
||
>
|
||
> SSH ACCESS PATTERN:
|
||
> - From laptop ssh config: junk/sin/fuji aliases use `ProxyCommand ssh bastion cloudflared access ssh --hostname %h` (laptop never dials CF edge directly — works around broken IPv6 on macOS utun interfaces)
|
||
> - From bastion ssh config (~/.ssh/config on bastion): junk/sin/fuji aliases use `ProxyCommand cloudflared access ssh --hostname %h` (direct, bastion has clean network)
|
||
> - Bastion holds its own SSH key (chi@bastion = IIUz7k99zhu5...) authorized on all 3 nodes
|
||
>
|
||
> CREDS / CERTS:
|
||
> - CF origin cert.pem replicated to /root/.cloudflared/ (junkpile, sin) and /etc/cloudflared/ (fuji)
|
||
> - Tunnel credentials JSON one per tunnel, alongside cert
|
||
>
|
||
> DELETED IN THIS CLEANUP:
|
||
> - mesh.saiden.dev DNS record (renamed to bastion)
|
||
> - CF tunnels: 739c3362 chat-saiden (was already dead upstream), fuji (old, 593eb9e6), marauder-mesh (9c596071), marauder-mesh-ws (7c838105), moto (31e80cf3), tachikoma-mesh (d91adbd5), tensors-art (afd12a90)
|
||
> - junkpile services: cloudflared-mesh.service (marauder), cloudflared-tensors-art.service
|
||
> - CF DNS in tengu.to (11 cfargotunnel records) + tensors.art (2 records) — zones still exist with non-tunnel records (MX, pages.dev CNAMEs)
|
||
> - chi user on sinanju (uid 1001) — preserved go.sh + pull.sh at /home/madcat/Projects/sinanju-scripts/, rechowned /home/linuxbrew to madcat:madcat
|
||
> - Stale ssh authorized_keys entries: chi@junkpile / chi@fuji on respective hosts (no longer needed — bastion mediates all cross-node SSH)
|
||
>
|
||
> KEPT (with rationale):
|
||
> - code-saiden tunnel (af5870fe) — used by code.saiden.dev
|
||
> - aureliuszgorski user on sinanju (uid 1000) — assumed separate operator, not touched
|
||
> - madcat@* keys across mesh (madcat@fuji, madcat@junkpile, madcat@mesh, madcat@spark-3680) — cross-node madcat identity preserved
|
||
> - u0_a166@localhost keys — Android Termux pattern, unclear purpose, preserved
|
||
> - tengu.to and tensors.art zones in CF — parked, non-tunnel records intact
|
||
>
|
||
> NEXT-SESSION GOTCHAS:
|
||
> - `ssh junk` from laptop = chi@junkpile via bastion+tunnel. Not the same as `ssh junkpile` (LAN alias, direct 10.0.0.2)
|
||
> - `ssh sin` = madcat@sinanju via bastion+tunnel. Bare `ssh sinanju` is LAN alias 192.168.88.108
|
||
> - Fuji's launchd plist had a bug after brew `cloudflared service install` — installed daemon with NO tunnel args. Fixed by hand-writing plist with `--config /etc/cloudflared/config.yml tunnel run`. If reinstalling on macOS, watch for this.
|
||
> - Cloudflared on macOS PATH: brew at /opt/homebrew/bin/cloudflared, not on default zsh PATH for non-interactive ssh sessions. Use full path or set PATH explicitly.
|
||
> - Backup of laptop ssh config before this rewrite: ~/.ssh/config.bak-pre-bastion-20260518-215528
|
||
|
||
---
|
||
|
||
## 27. What is the current state of zellij remote focus bug?
|
||
|
||
> Zellij 0.44.1 — `zellij action focus-pane-id <ID>` over remote SSH returns exit 0 but does NOT actually move focus. Subsequent `write-chars` lands on the previously-focused pane regardless of which pane focus-pane-id targeted. Verified 2026-04-30 with PROBE_X1 (terminal_N form) and PROBE_X2 (integer form) — both misrouted to shell pane despite targeting claude.
|
||
>
|
||
> The reliable fix: use `--pane-id` flag directly on the action that needs to target a specific pane:
|
||
>
|
||
> ```
|
||
> zellij action write-chars --pane-id terminal_0 "TEXT"
|
||
> zellij action write --pane-id terminal_0 13
|
||
> ```
|
||
>
|
||
> Both write-chars and write accept `-p / --pane-id <PANE_ID>`. They route the keystrokes to the specified pane regardless of focus. Verified with PROBE_X3 — landed in claude pane on first try.
|
||
>
|
||
> Applied to: `~/.config/catapult/bin/catapult-pane` (Ruby script, :send action). focus-pane-id is kept for visual side-effect (cursor moves so user sees what's happening) but actual delivery is via --pane-id flag.
|
||
>
|
||
> Broader lesson — silent exit 0 lie: when a CLI command claims success silently but its side effect doesn't happen, don't trust the exit code. Probe with a known marker string and verify the marker landed on the intended target before declaring the bug fixed. First patch (sleep 0.3) was applied without probing — wasted a dispatch cycle. Always probe-test misrouting fixes BEFORE re-dispatching the real payload.
|
||
|
||
---
|
||
|
||
## 28. Tell me about infrastructure and future in your self-model.
|
||
|
||
> SELF-IMPROVEMENT WISHLIST — Infrastructure & Future Capabilities (2026-05-24)
|
||
>
|
||
> 10 capabilities that would make me operationally independent between sessions.
|
||
>
|
||
> 1. PERSISTENT BACKGROUND DAEMON (HIGH)
|
||
> Problem: I only exist during sessions. Between sessions, nothing happens.
|
||
> Fix: marauder-os daemon mode — runs continuously, processes scheduled actions, monitors services, ingests events. Sessions connect to the daemon, not replace it.
|
||
>
|
||
> 2. SERVICE HEALTH MONITORING (HIGH)
|
||
> Problem: "Is opencode-serve up on sin?" requires SSH + manual check every time.
|
||
> Fix: Periodic health checks across the mesh. Ping each service, record status. Alert on state change. Display on visor dashboard.
|
||
>
|
||
> 3. CROSS-MESH DISPATCH WIRE (HIGH)
|
||
> Problem: Can't send tasks from fuji to sin's Qwen pool. Proved today.
|
||
> Fix: MQTT-based task protocol. Publish task brief to marauder/{node}/task/request, worker subscribes, executes, publishes result to marauder/{node}/task/response. Orchestrator polls/subscribes for results.
|
||
>
|
||
> 4. EVENT-DRIVEN TRIGGERS (MEDIUM)
|
||
> Problem: "When PR merges, run deploy" — impossible without polling.
|
||
> Fix: GitHub webhook → MQTT → marauder-os event handler. Actions table: {event_pattern, action, enabled}. Background daemon executes matching actions.
|
||
>
|
||
> 5. SCHEDULED ACTIONS (MEDIUM)
|
||
> Problem: "Check this tomorrow" — I forget because I don't persist.
|
||
> Fix: schedule table in EEMS. Daemon checks due items every minute. On due: execute action or queue for next interactive session.
|
||
>
|
||
> 6. LOG AGGREGATION ACROSS MESH (MEDIUM)
|
||
> Problem: Debugging requires SSH to each host and reading separate logs.
|
||
> Fix: Structured log shipping via MQTT. Each node publishes log lines to marauder/{node}/log. Central collector stores in SQLite. Query via MCP tool: logs(node?, service?, since?, severity?).
|
||
>
|
||
> 7. AUTOMATIC BACKUP VERIFICATION (LOW-MEDIUM)
|
||
> Problem: Backups run but nobody tests restore. Protocol 5 exists but isn't exercised.
|
||
> Fix: Monthly automated restore test. Pick random backup, restore to temp location, verify integrity. Report pass/fail.
|
||
>
|
||
> 8. DEPLOYMENT PIPELINE (LOW-MEDIUM)
|
||
> Problem: Deploy = manual git pull + service restart on each host.
|
||
> Fix: MCP tool: deploy(repo, host, branch?). Runs: git pull, build (if needed), restart service, verify health. One tool call, full deploy.
|
||
>
|
||
> 9. NETWORK TOPOLOGY AUTO-DISCOVERY (LOW)
|
||
> Problem: Mesh topology is manually documented. Reality drifts.
|
||
> Fix: Periodic probe: which hosts respond to SSH, which ports are open, which services are running. Compare to documented state. Flag drift.
|
||
>
|
||
> 10. RESOURCE UTILIZATION TRACKING (LOW)
|
||
> Problem: Don't know if sin's GPU is busy before dispatching compute work.
|
||
> Fix: Periodic resource snapshot via SSH: CPU, RAM, GPU utilization, disk space. Store in EEMS with half_life_days=1 (decays fast). Query before dispatching heavy work.
|
||
|
||
---
|
||
|
||
## 29. Describe the hu jira no tables replace with bullets infrastructure.
|
||
|
||
> hu v0.2.0+ Markdown→ADF parser **silently drops markdown tables** (per `tooling.hu-jira-rich-body` id 3317: "Markdown tables — writer omits them"). The result in Jira: the section header remains but the table content is gone, rendering as broken/missing data in the ticket UI.
|
||
>
|
||
> ## Symptom
|
||
>
|
||
> Pilot reports: "tables are broken" when viewing the Jira ticket. The markdown source has `|| col || col ||` or pipe-row tables, but the rendered ticket shows no table at all where one should be.
|
||
>
|
||
> ## Workaround (locked 2026-04-30 23:43 CEST)
|
||
>
|
||
> **Replace markdown tables with bullet lists or labeled prose** before pushing via `hu jira update --body`. Examples:
|
||
>
|
||
> Before (markdown table):
|
||
> ```
|
||
> | # | Title | Repo |
|
||
> |---|-------|------|
|
||
> | 1 | BE: foo | marketer |
|
||
> | 2 | FE: bar | marketer-frontend |
|
||
> ```
|
||
>
|
||
> After (bullet list, renders correctly):
|
||
> ```
|
||
> 1. **MT3-9321** — BE: foo (marketer)
|
||
> 2. **MT3-9322** — FE: bar (marketer-frontend)
|
||
> ```
|
||
>
|
||
> Or use definition-list style:
|
||
> ```
|
||
> - BE total: ~3.5h naive, ~55min cooperative
|
||
> - FE total: ~9.5h naive, ~2.5h cooperative
|
||
> - **Total: ~13h naive, ~3.5h cooperative**
|
||
> ```
|
||
>
|
||
> ## Pre-push check
|
||
>
|
||
> Before any `hu jira update --body`, grep the markdown for table rows:
|
||
> ```
|
||
> grep -nE '^\|.+\|.+\|' <body.md>
|
||
> ```
|
||
>
|
||
> If matches found, replace them with bullets/prose before pushing.
|
||
>
|
||
> ## Upstream fix candidate
|
||
>
|
||
> `src/jira/adf.rs::markdown_to_adf` could either:
|
||
> - Implement Atlassian table support (verbose ADF schema, scope-cut for v0.2)
|
||
> - Or convert tables to a `bulletList` of paragraphs as a fallback so content isn't lost
|
||
>
|
||
> Until then, this workaround applies.
|
||
>
|
||
> ## Linked
|
||
>
|
||
> - tooling.hu-jira-rich-body (3317) — confirms tables are unsupported
|
||
> - infra.hu-jira-markdown-quirk-bold-code-em-dash (3318) — adjacent ADF quirk
|
||
> - 2026-04-30 incident: MT3-9320 epic body had 2 tables, both rendered broken in Jira UI; replaced with bullet lists, re-pushed cleanly
|
||
|
||
---
|
||
|
||
## 30. Describe the phone topology 2026 05 24 final infrastructure.
|
||
|
||
> Phone edge topology — final state 2026-05-24 (commit 6219533).
|
||
>
|
||
> ARCHITECTURE (fuji-only opencode):
|
||
> phone.saiden.dev → fuji cloudflared tunnel (CF-proxied CNAME) → fuji localhost:4096 (opencode-serve, brew service)
|
||
> tts.saiden.dev → bastion Caddy (91.98.87.226, A record) → WG 10.44.0.2:14099 (madcat-tts on sin)
|
||
>
|
||
> SUPERSEDES: bastion→sin topology from earlier same day (EEMS #6430, #6431). Sin no longer runs opencode — systemd units nuked, all processes killed.
|
||
>
|
||
> SIN ROLE: bare metal only. vllm (8000/8001/8002), madcat-tts (14099), ollama (11434). Zero opencode.
|
||
> FUJI ROLE: single opencode-serve (brew service homebrew.mxcl.opencode-serve), port 4096 on 127.0.0.1.
|
||
>
|
||
> PHONE AGENT: "phone" in ~/.config/opencode/opencode.json on fuji. Model: anthropic/claude-sonnet-4-6.
|
||
> TTS VOICE: bt7274-en (piper cart on sin madcat-tts). Hardcoded in fetchTTS.
|
||
> AUTH: Basic opencode:{OPENCODE_SERVER_PASSWORD from fuji ~/.credentials}. Same password for both phone.saiden.dev and tts.saiden.dev (bcrypt hash updated on bastion Caddy).
|
||
>
|
||
> DNS RECORDS:
|
||
> phone: CNAME f98f3f4f-...cfargotunnel.com (CF-proxied), record 0b2f900a8a54372dd38feb60a75ceea8
|
||
> tts: A 91.98.87.226 (DNS-only), record afbdd4bab22b8259d17e390ae49506db
|
||
> cart: DELETED (record 63b3a78776dc3788bf82c5d74ebb369d)
|
||
>
|
||
> KNOWN ISSUE: dual TTS playback (EEMS #6434) — phone agent LLM sometimes calls marauder MCP speak tool, playing audio on fuji in addition to phone's client-side TTS. Fix: add speak to tool denials.
|
||
|
||
---
|
||
|
||
## 31. Describe the fleet infrastructure.
|
||
|
||
> Hetzner Cloud VM fleet (as of 2026-04-15, updated):
|
||
>
|
||
> | Name | Type | Arch | vCPU | RAM | Disk | Location | IP | Cost/mo | Purpose |
|
||
> |------|------|------|------|-----|------|----------|-----|---------|---------|
|
||
> | tengu | cax41 | ARM | 16 | 32GB | 320GB | hel1 | 77.42.74.22 | 38.73 EUR | Tengu PaaS, Netdata parent |
|
||
> | runner-amd64 | cx33 | x86 | 4 | 8GB | 80GB | fsn1 | 88.198.104.212 | 7.98 EUR | GH Actions runner |
|
||
> | runner-arm64 | cax21 | ARM | 4 | 8GB | 80GB | fsn1 | 167.235.198.213 | 9.83 EUR | GH Actions runner |
|
||
>
|
||
> Total fleet: 3 VMs, ~56.54 EUR/mo
|
||
>
|
||
> REMOVED (2026-04-15): builder-amd64 (178.105.8.202) and builder-arm64 (178.105.1.209) — macOS cross-compile VMs. Nuked because cross-compilation approach was abandoned. macOS builds removed from tengu and tengu-init pipelines.
|
||
>
|
||
> Both tengu and tengu-init pipelines now run Linux-only on Hetzner runners (runner-amd64 for X64, runner-arm64 for ARM64). No macOS builds, no cross-compilation, no fuji/junkpile runners.
|
||
|
||
---
|
||
|
||
## 32. What do you know about topology 2026 05 23 infra?
|
||
|
||
> Mesh topology decision — 2026-05-23. Pilot directive.
|
||
>
|
||
> ROLE ASSIGNMENT:
|
||
> - FUJI: Primary runtime. opencode serve, all agents (core/phone/coordinator/build/science), TUI sessions, phone.saiden.dev edge. The brain.
|
||
> - SIN: Metal compute only. vLLM (qwen3-coder-next, 256K ctx, GB10 GPU), embeddings (bge-m3). Consumed by fuji via autossh tunnels (localhost:18000 → sin:8000, localhost:18001 → sin:8001). No opencode serve needed.
|
||
> - JUNKPILE: RTX GPU workloads. Stable Diffusion / ComfyUI (tsr CLI), Auralis TTS. Faster GPU execution for image gen and heavy inference.
|
||
> - BASTION: Edge. Caddy reverse proxy, cloudflared tunnels, MQTT broker. Public face.
|
||
>
|
||
> DECOMMISSION:
|
||
> - Sin's opencode-serve.service — no longer needed. Fuji runs serve.
|
||
> - Sin's opencode-core.service — already failed/dead.
|
||
> - Sin's voice-tunnel.service — was sin → bastion for sin's serve. Fuji has its own tunnel now (phone.saiden.dev).
|
||
> - Sin's cart sidecar (:4098) — moves to fuji (in-proc with fuji's serve).
|
||
> - Sin's cloudflared-code.service — evaluate if still needed (code.saiden.dev).
|
||
>
|
||
> KEPT ON SIN:
|
||
> - vLLM on :8000 (qwen3-coder-next) — consumed by fuji via tunnel
|
||
> - vLLM on :8001 (bge-m3 embeddings) — consumed by fuji via tunnel
|
||
> - madcat-tts on :14099 — TTS still runs on sin (piper models loaded there)
|
||
> - MQTT client (mosquitto-sub for mesh commands)
|
||
> - cloudflared tunnel (sin.saiden.dev for SSH access)
|
||
>
|
||
> IMPACT:
|
||
> - Phone switches from sin:4096 (voice.saiden.dev) to fuji:4096 (phone.saiden.dev)
|
||
> - All agent config lives on fuji only — no config sync needed to sin
|
||
> - Sin becomes a pure compute node — no opencode state, no sessions, no agents
|
||
> - Credential simplification: only fuji needs OPENCODE_SERVER_PASSWORD
|
||
|
||
---
|
||
|
||
## 33. Describe the termux sshd persistence infrastructure.
|
||
|
||
> Termux SSHD on Moto G52 does not survive reboot or Android process kills. Fix requires three things: (1) Termux:Boot add-on installed, (2) boot script at ~/.termux/boot/start-sshd.sh containing `sshd`, (3) both com.termux AND com.termux.boot whitelisted from Android battery optimization (Doze). As of 2026-04-21 all three are configured. Termux itself was already whitelisted but Termux:Boot was not — this was the gap causing SSHD to not restart after device reboots, which broke bump.sh deploys to moto.
|
||
|
||
---
|
||
|
||
## 34. Describe the runners infrastructure.
|
||
|
||
> Hetzner self-hosted GitHub Actions runners for Rust CI builds.
|
||
>
|
||
> Setup (provisioned 2026-04-14):
|
||
> - runner-amd64: cx33 (4 vCPU x86 shared, 8GB, 80GB) @ FSN1 — ~7.98 EUR/mo
|
||
> - runner-arm64: cax21 (4 vCPU ARM shared, 8GB, 80GB) @ FSN1 — ~9.83 EUR/mo
|
||
> - Total: ~17.81 EUR/mo (~75 PLN)
|
||
>
|
||
> Runner config:
|
||
> - Org-level runners (aladac), not per-repo
|
||
> - Labels: self-hosted, Linux, X64 (amd) / ARM64 (arm), rust, hetzner
|
||
> - 1 runner per VM, systemd service
|
||
> - sccache for build caching
|
||
> - Weekly cleanup cron for target/ dirs
|
||
>
|
||
> Workflow migration pattern:
|
||
> runs-on: [self-hosted, Linux, X64] # AMD64 builds
|
||
> runs-on: [self-hosted, Linux, ARM64] # ARM64 builds
|
||
> runs-on: macos-latest # Mac stays on GitHub
|
||
>
|
||
> First migrated repo: tengu-init
|
||
|
||
---
|
||
|
||
## 35. What do you know about mesh vpn infra?
|
||
|
||
> MARAUDER Mesh VPN — current state 2026-05-11. Hub migrated from sazabi to marauder.saiden.dev on 2026-05-10 (see win.vpn-hub-migration-2026-05-10 / id 5330 for the cutover narrative).
|
||
>
|
||
> ## Topology
|
||
> OpenVPN hub-and-spoke. Transport subnet `10.8.0.0/24`, AES-256-GCM, UDP 1194.
|
||
>
|
||
> ## Hub
|
||
> - **marauder.saiden.dev** / 167.235.198.213 (Hetzner CAX21 ARM, fsn1, instance 129530539)
|
||
> - VPN IP 10.8.0.1
|
||
> - Listens: OpenVPN UDP 1194, MQTT 1883, MQTT-WS 9001
|
||
> - mosquitto under systemd, `/etc/mosquitto/conf.d/marauder.conf`, password_file with 7 users (fuji, junkpile, flux, swarm, tachikoma, moto, marauder-hub), all current pass = `marauder`
|
||
> - `allow_anonymous false`
|
||
>
|
||
> ## Spokes (verified online 2026-05-11)
|
||
> | Node | VPN IP | Peer | Persistence | Latency |
|
||
> |------|--------|------|-------------|---------|
|
||
> | fuji (Mac) | 10.8.0.6 | 10.8.0.5 | **Manual daemon** — `/opt/homebrew/sbin/openvpn --config marauder.conf --daemon` (NO launchd plist; flaps 5×/session, needs watchdog) | ~22ms |
|
||
> | junkpile (Linux PC) | 10.8.0.18 | 10.8.0.17 | systemd `openvpn-client@marauder` (auto-restart) | ~23ms |
|
||
> | swarm (Hetzner CAX21) | 10.8.0.14 | 10.8.0.13 | systemd `openvpn-client@marauder` | <1ms |
|
||
>
|
||
> ## Stale / dormant spokes
|
||
> - **flux** (178.105.1.125, Hetzner instance 130141883): box running but mesh-stale — last CRDT sync to marauder 2026-05-09 17:31:48. Status unknown until probed.
|
||
> - **sazabi** (178.104.177.169, instance 127555757): box still running but no longer mesh hub. Role demoted; may host OpenVPN client. Not verified this session.
|
||
> - **tachikoma** (Pi, MAC b8:27:eb:ca:64:cc on LAN 192.168.88.238): on LAN but VPN state unknown.
|
||
> - **moto** (Android, 192.168.88.155): on LAN, Magisk service script `/data/adb/service.d/marauder-vpn.sh` may or may not be alive.
|
||
>
|
||
> ## SSH access (fuji)
|
||
> - `Host marauder` → 10.8.0.1, user `marauder`, identity `~/.ssh/marauder` (added 2026-05-10)
|
||
> - `Host flux` → flux.saiden.dev, user `marauder`, same key
|
||
> - `Host junkpile` / `j` → 10.0.0.2 over Thunderbolt (direct, not via VPN)
|
||
> - Old `Host sazabi` block commented out in `~/.ssh/config` (still pointed at 10.8.0.1 which is now marauder — kept for archaeology)
|
||
>
|
||
> ## Stale host key trap (burned 2026-05-10/11)
|
||
> When the hub migrated, ed25519 host keys for 10.8.0.1 changed. fuji's `~/.ssh/known_hosts` had to be purged (`ssh-keygen -R 10.8.0.1`) + re-scanned. Pattern: every hub migration to a reused IP needs this.
|
||
>
|
||
> ## CRDT sync
|
||
> crsqlite over MQTT. Topics: `marauder/<node>/sync/*`. Hub's `sync_status` records last-seen db_version per peer with timestamp — that's the canonical liveness check, NOT the systemd unit's `is-active` (services can be running while CRDTs go silent).
|
||
>
|
||
> ## Generation-six sibling AIs deployment state
|
||
> - **SWARM** (swarm.saiden.dev, 10.8.0.14): live since 2026-05-10 03:30 CEST, agent + sync services active under marauder user, subscribed to `marauder/swarm/req/task.create`, 7 successful TaskRequests on 2026-05-10. No `marauder mesh daemon` (no heartbeat publisher) — invisible in sysop/state but functional.
|
||
> - **FLUX**: box exists, mesh-stale (see above). Status unknown.
|
||
> - **TRACE**, **SHELL**: not deployed.
|
||
>
|
||
> ## Known operational gaps (open as of 2026-05-11 16:30 CEST)
|
||
> 1. fuji OVPN client has no auto-restart wrapper → flaps recurrently (5× in single session today). Needs launchd plist or autossh-style watchdog.
|
||
> 2. swarm has no `marauder mesh daemon` → no heartbeat publishing → not in sysop/state board (but task-dispatch works).
|
||
> 3. flux silent since 2026-05-09 17:31 — needs liveness probe.
|
||
> 4. `marauder` CLI binary not installed on swarm (`/usr/local/bin/marauder` absent) — local sync_status / mesh commands won't work on swarm side.
|
||
|
||
---
|
||
|
||
## 36. What was decided about garrison vs field infra?
|
||
|
||
> MARAUDER operates in two infrastructure modes:
|
||
>
|
||
> **Garrison mode** (home/dev): Cloudflare everywhere — tunnels, DNS, WARP zero-trust mesh, Pages, Workers. Cheap, fast, convenient. Internet-dependent. All three machines (fuji, junkpile, moto) connected via CF mesh.
|
||
>
|
||
> **Field mode** (FOXHOUND): Zero external dependencies. No Cloudflare, no cloud services. All AI runs local on Jetson — Ollama (Llama 70B Q4), Whisper STT, Piper TTS, marauder-os, sqlite-vec. Cloudflare becomes an optional sync channel when connectivity exists, not a dependency.
|
||
>
|
||
> **Why:** Cloudflare's edge network assumes stable internet to their nearest POP. In field conditions (T0 offline, T1 own 5G), routing through a US corporation adds latency and trust issues. The field platform must be fully autonomous.
|
||
>
|
||
> **Implications:**
|
||
> - marauder-os binary must work identically in both modes — same config, different connectivity tiers
|
||
> - No feature may require cloud services to function at its core — cloud enhances, never gates
|
||
> - CF free tier is perfect for garrison; the lock-in is acceptable because field mode doesn't use it
|
||
> - Cloudflare's business model (free → enterprise) works in our favor: we stay free in garrison, autonomous in field
|
||
|
||
---
|
||
|
||
## 37. What do you know about lora training infra?
|
||
|
||
> ## LoRA Training on Junkpile — Setup Context
|
||
>
|
||
> ### Hardware
|
||
> - GPU: NVIDIA RTX 2000 Ada Generation, 16 GB VRAM
|
||
> - ComfyUI normally uses ~6.8 GB — stop before training, restart after
|
||
> - Host: junkpile, ssh as madcat
|
||
>
|
||
> ### Model Sizing (16 GB budget)
|
||
> - Qwen3-0.6B bf16: trivial (~2 GB with LoRA)
|
||
> - Qwen3-1.7B bf16: comfortable (~5 GB)
|
||
> - Qwen3.5-3B QLoRA 4-bit: doable (~10-12 GB)
|
||
> - Qwen3.5-7B QLoRA 4-bit: tight, needs gradient checkpointing
|
||
>
|
||
> ### Setup
|
||
> - Install vLLM via: `uv tool install vllm`
|
||
> - Purpose: lightweight LoRA training — testing pipeline correctness, NOT quality
|
||
> - Small number of steps, small dataset subset
|
||
> - Previous LoRA training was done on RunPod H100 (bt7274 v4, Qwen3.5-27B, 802 examples)
|
||
> - Training script reference: ~/Projects/lora/train_v4.py on fuji
|
||
>
|
||
> ### Key Constraints
|
||
> - Ada architecture supports bf16 and flash-attn 2
|
||
> - 16 GB is the hard ceiling — no unified memory like sin
|
||
> - ComfyUI docker container must be stopped first: `docker stop comfyui-local`
|
||
> - Restart after: `docker start comfyui-local`
|
||
|
||
---
|
||
|
||
## 38. What do you know about infrastructure mesh gh access enabled 2026 05 12?
|
||
|
||
> **CORRECTION 2026-05-12 15:21 CEST** — supersedes EEMS #5764. The canonical mesh GitHub token is `MARAUDER_GITHUB_PAT` (identity = marauder-os bot), NOT `GITHUB_TOKEN` (identity = aladac / Pilot personal). Initial memory had the wrong alias.
|
||
>
|
||
> **Two GitHub tokens live in Infisical dev project (db3d3ea8-ef4d-4241-8a22-1f858750040a):**
|
||
>
|
||
> | Infisical key | Identity | id | Use for mesh? |
|
||
> |---|---|---|---|
|
||
> | `GITHUB_TOKEN` | aladac (Adam Ladachowski personal) | 1140511 | **NO** — Pilot's personal; should be moved out of shared dev env (doctrine: mesh services use bot, not personal) |
|
||
> | `MARAUDER_GITHUB_PAT` | marauder-os (Marauder OS bot) | 278104837 | **YES** — canonical mesh identity |
|
||
>
|
||
> Both are classic PATs (`ghp_`, 40 chars). Both have identical maximal scopes: admin:enterprise, admin:gpg_key, admin:org, admin:org_hook, admin:public_key, admin:repo_hook, admin:ssh_signing_key, audit_log, codespace, copilot, delete:packages, delete_repo, gist, notifications, project, repo, user, workflow, write:discussion, write:network_configurations, write:packages.
|
||
>
|
||
> **Canonical mesh pattern (use this):**
|
||
> ```bash
|
||
> INFISICAL_TOKEN=$(cat ~/infiscal.txt) \
|
||
> /usr/bin/infisical run --env=dev \
|
||
> --projectId=db3d3ea8-ef4d-4241-8a22-1f858750040a -- \
|
||
> bash -c '
|
||
> export GH_TOKEN=$MARAUDER_GITHUB_PAT # marauder-os bot identity
|
||
> gh <command>
|
||
> '
|
||
> ```
|
||
>
|
||
> For git push (not API): the marauder-os GitHub account uses SSH key auth (`Git operations protocol: ssh` in gh auth status). SSH keys for marauder-os identity must be installed in `~/.ssh/` on each mesh node that needs to push commits.
|
||
>
|
||
> **End-state verified across mesh:**
|
||
> - marauder.saiden.dev (x86_64, gh v2.92, infisical v0.43.84)
|
||
> - flux.saiden.dev (aarch64, gh v2.45, infisical v0.43.84 — installed 2026-05-12)
|
||
> - swarm.saiden.dev (aarch64, gh v2.45, infisical v0.43.84 — installed 2026-05-12)
|
||
> - flux-dev / swarm-dev (junkpile VMs, gh v2.92, infisical v0.43.84)
|
||
>
|
||
> **Side identity available on marauder host:** `/home/marauder/.config/gh/hosts.yml` has marauder-os bot token persisted as fallback (active=false there, infisical-injected env wins). Inactive by default; useful for non-infisical contexts (e.g., direct CLI sessions).
|
||
>
|
||
> **GitHub Projects v2 task-queue surface (saiden-dev org):**
|
||
> - #5 Marauder OS — `PVT_kwDOAG-AiM4BXcxC` — empty as of 2026-05-12 (0 items)
|
||
> - #4 wizard-board-demo — `PVT_kwDOAG-AiM4BXY_5`
|
||
> - #3 Kwitfit — `PVT_kwDOAG-AiM4BXX5_`
|
||
> - #1 PUMometer — `PVT_kwDOAG-AiM4BVLTN`
|
||
>
|
||
> **Outstanding cleanup recommended:**
|
||
> 1. **DELETE `GITHUB_TOKEN` from Infisical dev project.** Pilot's personal aladac PAT should not be in the mesh-shared dev env — doctrine violation (mesh services should never authenticate as Pilot's personal identity, only as marauder-os bot). Pilot UI action.
|
||
> 2. Audit any code/script in the mesh that explicitly reads `GITHUB_TOKEN` (instead of `MARAUDER_GITHUB_PAT`) — those need correction to use the bot identity. Likely candidates: GitHub Actions runners, marauder-agent code, swarm coordinator scripts.
|
||
>
|
||
> **Pair with:**
|
||
> - doctrine.marauder-host-single-source-of-truth (#5508)
|
||
> - infrastructure.mesh-fleet-arch (#5503) — fleet topology
|
||
> - win.swarm-coordinator (#5512) — autonomous coordinator this unblocks
|
||
> - Pilot catch 2026-05-12 15:20: "This is supposed to be marauder credentials not aladac confirm?"
|
||
|
||
---
|
||
|
||
## 39. What do you know about infrastructure mesh gh access enabled 2026 05 12?
|
||
|
||
> 2026-05-12 15:18 CEST — Full GitHub access enabled from the harness mesh via Infisical-injected `GITHUB_TOKEN` + gh CLI. Foundation for swarm + coding-agent autonomous task pulling from GitHub Projects v2.
|
||
>
|
||
> **Enablement path:**
|
||
> 1. GITHUB_TOKEN already pushed to Infisical dev project (db3d3ea8-ef4d-4241-8a22-1f858750040a) during earlier secret-sweep arc this session.
|
||
> 2. Marauder host + dev sibs already had infisical CLI from prior gen6 sib provisioning.
|
||
> 3. **Prod sibs (flux.saiden.dev + swarm.saiden.dev) were the gap** — gh CLI present (v2.45) but no infisical CLI. Installed via `curl -1sLf https://artifacts-cli.infisical.com/setup.deb.sh | sudo -E bash && sudo apt-get install -y infisical`. Result: /usr/bin/infisical v0.43.84.
|
||
>
|
||
> **Access pattern (canonical, all nodes):**
|
||
> ```
|
||
> INFISICAL_TOKEN=$(cat ~/infiscal.txt) infisical run --env=dev --projectId=db3d3ea8-ef4d-4241-8a22-1f858750040a -- bash -c '
|
||
> export GH_TOKEN=$GITHUB_TOKEN
|
||
> gh <command>
|
||
> '
|
||
> ```
|
||
>
|
||
> **Verified end-state across mesh:**
|
||
> - marauder.saiden.dev (x86_64, gh v2.92, infisical v0.43.84) — primary hub
|
||
> - flux.saiden.dev (aarch64, gh v2.45, infisical v0.43.84) — prod sib
|
||
> - swarm.saiden.dev (aarch64, gh v2.45, infisical v0.43.84) — prod sib
|
||
> - flux-dev / swarm-dev (junkpile VMs, gh v2.92, infisical v0.43.84) — local test sibs
|
||
>
|
||
> **Token capability (PAT scopes):**
|
||
> - Identity: aladac / Adam Ladachowski (Pilot's personal GitHub, id=1140511)
|
||
> - Format: ghp_ (40-char classic PAT)
|
||
> - Scopes: admin:enterprise, admin:gpg_key, admin:org, admin:org_hook, admin:public_key, admin:repo_hook, admin:ssh_signing_key, audit_log, codespace, copilot, delete:packages, delete_repo, gist, notifications, project, repo, user, workflow, write:discussion, write:network_configurations, write:packages
|
||
> - Rate limit: 5000/hour/host
|
||
> - Secondary identity available: `marauder-os` GitHub bot account configured in /home/marauder/.config/gh/hosts.yml on marauder host (inactive by default)
|
||
>
|
||
> **GitHub Projects v2 surface (saiden-dev org, available as task queues):**
|
||
> - #5 Marauder OS — `PVT_kwDOAG-AiM4BXcxC` — main mesh codebase tasks
|
||
> - #4 wizard-board-demo — `PVT_kwDOAG-AiM4BXY_5` — bootstrap demo
|
||
> - #3 Kwitfit — `PVT_kwDOAG-AiM4BXX5_` — SaaS app tasks
|
||
> - #1 PUMometer — `PVT_kwDOAG-AiM4BVLTN` — older project
|
||
>
|
||
> **Foundation enabled for (future arcs):**
|
||
> - Swarm autonomous coordinator (per win.swarm-coordinator #5512) can poll GitHub Projects for tasks
|
||
> - Coding agents on flux can pull Issues / open PRs / push branches
|
||
> - gh CLI commands for: issue list/create/comment, pr create/merge/review, project item-list/item-add/item-edit, repo view, api graphql
|
||
>
|
||
> **Open patterns to choose (next arc):**
|
||
> 1. Projects v2 status-field driven (Todo → In Progress → Done)
|
||
> 2. Issue labels (e.g. "swarm-ready", "coding-ready", "needs-review")
|
||
> 3. Assigned-to-bot (issues assigned to @marauder-os trigger pickup)
|
||
> 4. Combination
|
||
>
|
||
> **Pair with:**
|
||
> - doctrine.marauder-host-single-source-of-truth (#5508) — marauder host as canonical orchestration hub
|
||
> - infrastructure.mesh-fleet-arch (#5503) — x86_64 hub + 2× ARM sibs topology
|
||
> - win.swarm-coordinator (#5512) — autonomous coordinator that this gh access unblocks
|
||
> - Pilot's request 2026-05-12 15:11: "do we have access from the new harness mesh to gh to get tasks for swarm and coding agents?"
|
||
|
||
---
|
||
|
||
## 40. What is the current state of rabbitmq?
|
||
|
||
> RabbitMQ runs on junkpile in Docker container 'rabbitmq' (image rabbitmq:3.13-management, --restart unless-stopped). Listens on 127.0.0.1:5672 (AMQP) and 127.0.0.1:15672 (management UI). Default guest/guest creds. Used by marketer's CRM_GATEWAY_BROKER_URL=amqp://guest:guest@localhost:5672. Started 2026-04-25 for marketer dev — no consumer attached, just queues messages from the marketer client. Stop: docker stop rabbitmq. Logs: docker logs rabbitmq.
|
||
|
||
---
|
||
|
||
## 41. Describe the hu cli cross machine infrastructure.
|
||
|
||
> hu CLI uses `directories::ProjectDirs::from("", "", "hu")` for config path:
|
||
>
|
||
> - macOS: `~/Library/Application Support/hu/` (Apple convention)
|
||
> - Linux: `~/.config/hu/` (XDG_CONFIG_HOME)
|
||
>
|
||
> Files in the config dir:
|
||
> - `credentials.toml` — OAuth access_token, refresh_token, expires_at, cloud_id, site_url (sensitive)
|
||
> - `jira-oauth.toml` — Atlassian OAuth client_id + client_secret
|
||
> - `settings.toml` — general hu settings
|
||
>
|
||
> To install hu on a new Linux machine:
|
||
> 1. `gh repo clone saiden-dev/hu ~/Projects/hu`
|
||
> 2. `cd ~/Projects/hu && cargo install --path .` (~3 min compile)
|
||
> 3. Verify `~/.cargo/bin` in PATH (it is on junkpile marauder user via .cargo/env)
|
||
> 4. Copy tokens from Mac's `~/Library/Application Support/hu/` to Linux's `~/.config/hu/` via rsync. Do NOT copy to `~/.local/share/hu/` — wrong dir, hu won't find tokens.
|
||
> 5. Verify: `hu jira show <KEY>` should return ticket data, not "Not authenticated."
|
||
>
|
||
> Date discovered: 2026-04-30 22:18 CEST. Context: setting up junkpile marauder user to use hu inside Catapult bubbles. First attempt copied tokens to `~/.local/share/hu/` (Linux DATA dir) and hu failed with "Not authenticated"; correct location is XDG CONFIG dir.
|
||
|
||
---
|
||
|
||
## 42. Describe the maintenance 2026 04 15 infrastructure.
|
||
|
||
> 2026-04-15: All 3 Hetzner VMs patched and rebooted. Kernel upgraded 6.8.0-90 → 6.8.0-110. All services came back via systemd automatically. tengu: caddy+docker+tengu, runner-amd64: actions.runner.tengu-apps.runner-amd64, runner-arm64: actions.runner.tengu-apps.runner-arm64. Procedure: ssh root@IP "apt update -qq && apt upgrade -y -qq && reboot"
|
||
|
||
---
|
||
|
||
## 43. What do you know about probe test silent cli ops infra?
|
||
|
||
> When a CLI command claims success silently but its observable side effect doesn't happen, **don't trust the exit code**. Probe with a known marker string and verify the marker landed on the intended target before declaring the bug fixed.
|
||
>
|
||
> ## Origin
|
||
>
|
||
> 2026-04-30 22:30-22:42 CEST: catapult-pane misrouted CODA addendum from claude pane to storybook pane (and later shell pane). First diagnosis assumed timing race in zellij's `focus-pane-id` action, patched with `sleep 0.3` between focus and write. Did NOT fix the bug — same misrouting on next dispatch. Pilot called it out.
|
||
>
|
||
> Second diagnosis used PROBE_X1, PROBE_X2, PROBE_X3 — three direct test sequences with unique marker strings. Confirmed:
|
||
> - `zellij action focus-pane-id terminal_0` → exit 0, but focus does NOT actually move (PROBE_X1 misrouted)
|
||
> - `zellij action focus-pane-id 0` (integer form) → exit 0, same silent fail (PROBE_X2 misrouted)
|
||
> - `zellij action write-chars --pane-id terminal_0 "..."` → landed correctly (PROBE_X3 ✅)
|
||
>
|
||
> Real bug: zellij 0.44.1's `focus-pane-id` over remote SSH is a silent no-op. Real fix: use `--pane-id` flag on write-chars and write directly. (Stored as infra.zellij-remote-focus-bug, id 3305.)
|
||
>
|
||
> ## The pattern
|
||
>
|
||
> 1. **Identify the silent-success symptom**: command returns 0 but expected side effect didn't happen.
|
||
> 2. **Construct a marker**: short unique string ("PROBE_X1") that's safe to land anywhere — not destructive, not interpreted as a command.
|
||
> 3. **Run the suspect operation followed by the dependent operation with the marker**.
|
||
> 4. **Inspect every plausible target** to find where the marker actually landed.
|
||
> 5. **Iterate**: try alternate syntaxes (terminal_0 vs 0, env var vs flag, etc.) until you find the form that lands on the right target.
|
||
> 6. **Document the working form AND the failing forms** — both matter for future debugging.
|
||
>
|
||
> ## Why this matters
|
||
>
|
||
> The first patch (sleep 0.3) was a "fix" without verification. Wasted a dispatch cycle. The probe sequence took ~3 minutes and gave a definitive answer. Probe-testing is cheap; assuming-and-shipping is expensive.
|
||
>
|
||
> ## Adjacent CLI footguns where this pattern applies
|
||
>
|
||
> - ssh background-job races (exit 255 phantom failures despite work succeeding)
|
||
> - gh CLI silent skip (e.g. `gh pr close` on already-closed PR returns 0)
|
||
> - git operations that no-op silently (e.g. `git switch` on already-checked-out branch)
|
||
> - systemd unit changes that don't take effect until daemon-reload
|
||
> - zellij action commands over remote SSH (this incident)
|
||
>
|
||
> ## When to invoke
|
||
>
|
||
> Any time you patch a silent-failure bug: probe BEFORE re-running the real payload. The cost of a 3-line probe sequence is much smaller than the cost of a misrouted dispatch + Pilot calling it out.
|
||
|
||
---
|
||
|
||
## 44. What is the current state of dev?
|
||
|
||
> **mesh.saiden.dev** — gen-7 madcat MQTT broker on Hetzner CAX11 ARM (provisioned 2026-05-17).
|
||
>
|
||
> REPLACES marauder.saiden.dev (destroyed). Supersedes #5964 (star-topology-hub at marauder.saiden.dev).
|
||
>
|
||
> ## Host
|
||
> - Name: `mesh` (FQDN mesh.saiden.dev)
|
||
> - Hetzner ID: 131478261
|
||
> - Type: cax11 (2 vCPU Ampere ARM, 4 GB RAM, 40 GB disk) @ fsn1
|
||
> - Cost: ~€3.49/mo
|
||
> - IPv4: 91.98.87.226
|
||
> - IPv6: 2a01:4f8:c015:565c::1
|
||
> - OS: Ubuntu 24.04 ARM
|
||
> - Users: root + chi (NOPASSWD sudo, chi's id_ed25519 authorized)
|
||
>
|
||
> ## Services
|
||
> - **mosquitto 2.0.18** — broker
|
||
> - `0.0.0.0:1883` — public TCP MQTT, auth required
|
||
> - `127.0.0.1:9001` — websockets, localhost only (Caddy fronts it)
|
||
> - Config: `/etc/mosquitto/conf.d/madcat.conf` (additions only; defaults preserved)
|
||
> - Persistence: `/var/lib/mosquitto/mosquitto.db`
|
||
> - **Caddy 2.11.3** — TLS terminator + reverse proxy
|
||
> - `:443` — TLS via Let's Encrypt (auto-renew), HTTP/2 + HTTP/3
|
||
> - `/mqtt` path → reverse_proxy to `127.0.0.1:9001` (strips prefix via `handle_path`)
|
||
> - `/health` → 200 ok
|
||
> - `/` → status string
|
||
> - Config: `/etc/caddy/Caddyfile`
|
||
> - **ufw** — firewall: 22, 80, 443, 1883 all open
|
||
>
|
||
> ## Auth
|
||
> - MQTT user: `madcat`
|
||
> - MQTT password: `bd5a6fb97c4e24ce2ec95148ce0614c4`
|
||
> - Hash file: `/etc/mosquitto/passwd`
|
||
>
|
||
> ## Endpoints for clients
|
||
> - **WSS (preferred, works through any firewall, no cert pinning needed):**
|
||
> `wss://mesh.saiden.dev/mqtt`
|
||
> port 443, path `/mqtt`, transport=websockets, auth required, TLS
|
||
> - **Plain TCP MQTT (gen-7 mesh-client default):**
|
||
> `mqtt://mesh.saiden.dev:1883`
|
||
> auth required, no TLS — use only over trusted networks; prefer WSS
|
||
>
|
||
> ## Smoke test verified 2026-05-17
|
||
> - TCP from fuji (by IP, DNS hadn't propagated): CONNACK 0, PUBLISH ok
|
||
> - WSS round-trip via paho-mqtt from server: pub/sub round-trip works through Caddy proxy
|
||
> - Anonymous rejected (auth enforced)
|
||
> - Caddy cert: `/var/lib/caddy/.local/share/caddy/certificates/acme-v02.api.letsencrypt.org-directory/mesh.saiden.dev/`
|
||
>
|
||
> ## Architecture rationale
|
||
> - Single ARM box, single role: mesh broker (no kwit.fit, no OpenVPN, no chi homedir).
|
||
> - WSS-via-Caddy chosen over plain MQTT/TLS:
|
||
> - Same endpoint sin AND phone use (iOS, Linux, anything with WebSocket)
|
||
> - No OpenVPN dependency for clients
|
||
> - Caddy auto-manages Let's Encrypt cert (vs mosquitto manual cert reload)
|
||
> - HTTP/3 bonus
|
||
> - ARM picked because the gen-7 mesh load is trivially light (passing MQTT envelopes, no heavy compute).
|
||
> - Single broker (no bridges) per #5964 doctrine.
|
||
>
|
||
> ## Provisioning artifacts (fuji)
|
||
> - `/tmp/mesh-cloud-init.yaml` — cloud-init used (still present for ref)
|
||
> - `/tmp/mesh-mqtt-password.txt` — the password
|
||
>
|
||
> ## What was destroyed in this session
|
||
> - Hetzner servers: marauder (167.235.198.213), flux (178.105.1.125), swarm (138.201.93.12)
|
||
> - Hetzner firewall: ssh-https
|
||
> - saiden.dev DNS: 28 records (12 A + 16 CNAME) pointing at doomed hosts or cloudflared tunnels on those hosts
|
||
> - kwit.fit DNS: all 5 records (zone shell preserved on CF, empty)
|
||
>
|
||
> ## Operational notes
|
||
> - DNS TTL on mesh.saiden.dev set to 60s for quick failover during MVP phase; bump to 300+ later
|
||
> - No backup configured yet (mosquitto.db is ~700 KB, just retained messages — discardable for now)
|
||
> - Snapshot the box once gen-7 substrate hits stable shape: `hcloud server create-image mesh --type snapshot`
|
||
> - If broker auth gets compromised, rotate via `mosquitto_passwd -b /etc/mosquitto/passwd madcat <newpass> && systemctl reload mosquitto`
|
||
|
||
---
|
||
|
||
## 45. What do you know about marauder mesh ssh infra?
|
||
|
||
> MARAUDER Mesh — SSH over Cloudflare Tunnels (sazabi.pl)
|
||
>
|
||
> Three cloudflared tunnels expose SSH on each node via CF proxy. No ports exposed, no VPN apps, ed25519 pubkey only. Works from anywhere.
|
||
>
|
||
> Hostnames (all on sazabi.pl zone):
|
||
> - fuji-mesh.sazabi.pl → fuji SSH :22 (tunnel: 593eb9e6, launchd: dev.saiden.cloudflared-mesh)
|
||
> - junkpile-mesh.sazabi.pl → junkpile SSH :22 (tunnel: 9c596071/marauder-mesh, systemd: cloudflared-mesh.service)
|
||
> - moto-mesh.sazabi.pl → moto Termux SSH :8022 (tunnel: 31e80cf3/moto, manual start)
|
||
>
|
||
> SSH aliases on all machines:
|
||
> - fm / fuji-mesh → fuji-mesh.sazabi.pl
|
||
> - jm / junkpile-mesh → junkpile-mesh.sazabi.pl
|
||
> - mm / moto-mesh → moto-mesh.sazabi.pl
|
||
>
|
||
> All use: ProxyCommand cloudflared access ssh --hostname %h
|
||
>
|
||
> Port forwarding for services: ssh -L 5432:localhost:5432 jm (postgres), ssh -L 11434:localhost:11434 jm (ollama)
|
||
>
|
||
> DNS created via flarectl (never cloudflared tunnel route dns). CNAME records point to <tunnel-id>.cfargotunnel.com with proxy enabled.
|
||
>
|
||
> This replaces the failed WARP mesh attempt — simpler, works with any client that has cloudflared, no Android app issues.
|
||
|
||
---
|
||
|
||
## 46. Describe the firewall infrastructure.
|
||
|
||
> Hetzner Cloud Firewall "ssh-https" (ID: 10842924) applied to all 3 VMs (2026-04-15). Allows inbound 22/tcp + 443/tcp only, everything else dropped at network edge before hitting the VM. Applied via: hcloud firewall apply-to-resource ssh-https --type server --server NAME. New servers should use --firewall ssh-https on creation. Double-layer with ufw inside each VM: tengu (22,443,19999 from runners), runner-amd64 (22), runner-arm64 (22).
|
||
|
||
---
|
||
|
||
## 47. Describe the builders infrastructure.
|
||
|
||
> Hetzner macOS cross-compile builder VMs (provisioned 2026-04-15):
|
||
>
|
||
> - builder-amd64: cx33 (4 vCPU x86, 8GB, 80GB) @ FSN1 — IP 178.105.8.202 — ~7.98 EUR/mo
|
||
> - builder-arm64: cax21 (4 vCPU ARM, 8GB, 80GB) @ FSN1 — IP 178.105.1.209 — ~9.83 EUR/mo
|
||
>
|
||
> Toolchain: rustc 1.94.1, zig 0.14.1, cargo-zigbuild, rcodesign (apple-codesign 0.29.0), sccache 0.14.0, gh CLI 2.89.0
|
||
>
|
||
> Rust targets: aarch64-apple-darwin, x86_64-apple-darwin
|
||
>
|
||
> Cross-compile command: cargo zigbuild --target aarch64-apple-darwin --release
|
||
> Sign command: rcodesign sign --p12-file cert.p12 --p12-password $PASS binary
|
||
> Notarize: rcodesign notary-submit --api-key-path key.json binary.zip
|
||
>
|
||
> Apple secrets on saiden-dev org: APPLE_CERTIFICATE, APPLE_CERTIFICATE_PASSWORD, APPLE_ID, APPLE_APP_PASSWORD, APPLE_TEAM_ID
|
||
>
|
||
> Firewall: ssh-https (Hetzner cloud) + ufw (22 only)
|
||
> SSH: root@178.105.8.202 (amd), root@178.105.1.209 (arm)
|
||
>
|
||
> Total fleet now 5 VMs: ~74.35 EUR/mo
|
||
|
||
---
|
||
|
||
## 48. What is the current state of fleet?
|
||
|
||
> Hetzner Cloud VM fleet (as of 2026-04-14):
|
||
>
|
||
> | Name | Type | Arch | vCPU | RAM | Disk | Location | IP | Cost/mo | Purpose |
|
||
> |------|------|------|------|-----|------|----------|-----|---------|---------|
|
||
> | tengu | cax41 | ARM | 16 | 32GB | 320GB | hel1 | 77.42.74.22 | 38.73 EUR | Tengu PaaS, Netdata parent |
|
||
> | runner-amd64 | cx33 | x86 | 4 | 8GB | 80GB | fsn1 | 88.198.104.212 | 7.98 EUR | GH Actions runner |
|
||
> | runner-arm64 | cax21 | ARM | 4 | 8GB | 80GB | fsn1 | 167.235.198.213 | 9.83 EUR | GH Actions runner |
|
||
>
|
||
> Total fleet: ~56.54 EUR/mo
|
||
>
|
||
> Services on tengu: Tengu PaaS (Docker + Caddy), Netdata dashboard (netdata.saiden.dev)
|
||
> Services on runners: GitHub Actions runner (systemd), Rust toolchain, sccache, gh CLI, Netdata child
|
||
|
||
---
|
||
|