Companion to the change plan, the 7-day setup plan, and graph8 after.
This page is for the engineer who's actually going to ship it — at graph8, that's one
engineer dispatching agents. Setup only: the system wiring between
graph8-com/infra (axon agents, MBM service, crons, Postgres, k8s manifests, dashboards)
and graph8-com/g8 (workflows, .claude config, MCP wrapper distribution). It covers the
architecture, the components, the data model, every file path with a status badge, and every SQL
query for the 7-day setup sprint. Improvements (the 2–3× lift) ship afterward, against the foundation this builds.
Five planes: Capture (every skill fire, everywhere) · Store (one Postgres table) · Visualize (two Grafana dashboards) · Govern (org policies that bind regardless of where the agent runs) · Close loops (agent A's output is agent B's input).
Every component has a status badge (NEW · EDIT · DELETE · SETTINGS), a target day, the file path, and a one-line "what" and "depends on."
Branch protection · infra main; g8 main and qa. MBM bot review counts as the required approval for non-migration paths.
@mbm review normalizer.
Sync token guard · sync-main-to-qa.yml. If SYNC_BOT_TOKEN is missing, fail with a Slack ping. No more silent skipping.
pr_fixer attempt-4 rule · label mbm/needs-input, post a summary comment tagging the author, stop. Removes "agent stuck forever" mode.
Retire pr_cleanup + rage_click_detector · either wire them up (triggers.yaml) or delete the folder. No "in-between."
test-writer agent · exists at g8/.claude/agents/test-writer.md but isn't dispatched. New workflow triggers it on PR open, opens a follow-up PR with the missing tests.
agent_health cron.
Engineer scoring endpoint · replaces the prompt-derived algorithm in /assign-prds. Skill calls POST /v1/grade/engineer instead of re-deriving.
Bandit blocking · remove continue-on-error: true from the bandit job. First in the CI ratchet — smallest blast radius, clearest fixes.
mbm_critic agent · reviewer/prompt.md. The first agent that reads MBM's output.
Webhook dispatch · move triggers.yaml from polling to webhook. Start with 5xx-error; roll the rest after 24h of observation.
knowledge_compactor agent · context_fetches table. Files fetched ≥10 times this week → propose CLAUDE.md addition via PR. The system writes its own docs.
contract_test_runner agent · validates schema changes on every relevant PR.
bug_predictor agent.
onboarding agent · joined field from engineer-domains.json. Daily nudges in Slack for engineers in their first 28 days: missing skill usage, unused conventions, relevant feature-level CLAUDE.md.
Skill consolidation · .claude/commands/ in both repos. Week 3: /commit→/ship. Week 4: /assign-prds+/capacity-check→/capacity, then /review+/security-review, then /start branches.

The schema is intentionally tiny. Everything downstream — dashboards, loops, alerts — is a query against these tables. Add a column when you have a reason. Don't add tables.
skill_invocations NEW · One row per skill fire. Token / cost columns are NULL for Max-plan local runs (no metering possible); populated for K8s-agent runs (OAuth pool).
```sql
CREATE TABLE skill_invocations (
  id             BIGSERIAL PRIMARY KEY,
  ts             TIMESTAMPTZ NOT NULL DEFAULT now(),
  engineer_id    TEXT NOT NULL,   -- github username
  skill_name     TEXT NOT NULL,   -- e.g. /start, /ship, agent name for K8s runs
  repo           TEXT,            -- graph8-com/g8, graph8-com/infra
  runtime        TEXT NOT NULL,   -- claude-code-local | codex-local | k8s-job | modal
  model          TEXT,            -- claude-opus-4-7, claude-haiku-4-5, etc.
  latency_ms     INT,
  status         TEXT NOT NULL,   -- success | abandoned | error
  input_tokens   INT,             -- NULL for max-plan local runs
  output_tokens  INT,             -- NULL for max-plan local runs
  cost_usd_cents INT,             -- NULL for max-plan local runs
  arg_hash       TEXT             -- anonymized input fingerprint
);
CREATE INDEX ON skill_invocations (engineer_id, ts);
CREATE INDEX ON skill_invocations (skill_name, ts);
CREATE INDEX ON skill_invocations (runtime, ts);
```
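The schema describes arg_hash only as an "anonymized input fingerprint." One plausible way to produce it — a sketch, not the spec's definition; the normalization rules here are an assumption — is a SHA-256 over the normalized argument string:

```python
import hashlib


def arg_hash(raw_args: str) -> str:
    """Anonymized fingerprint for skill arguments (illustrative sketch).

    Lowercasing and collapsing whitespace are assumptions; the spec only
    says the column holds an anonymized input fingerprint.
    """
    normalized = " ".join(raw_args.lower().split())
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Fingerprinting rather than storing raw arguments keeps the ledger queryable ("same input, different outcome?") without leaking prompt contents.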
context_fetches NEW · day 21 · Drives the knowledge-compounding loop. Agents POST a row every time they read a file. Daily cron looks for files fetched ≥10 times this week and proposes them for CLAUDE.md.
```sql
CREATE TABLE context_fetches (
  id         BIGSERIAL PRIMARY KEY,
  ts         TIMESTAMPTZ NOT NULL DEFAULT now(),
  agent_name TEXT NOT NULL,
  file_path  TEXT NOT NULL,
  repo       TEXT
);
CREATE INDEX ON context_fetches (file_path, ts);
```
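The cron's threshold rule ("fetched ≥10 times this week") in runnable form — a sketch that assumes in-memory (ts, file_path) tuples standing in for context_fetches rows; the real cron would run the equivalent SQL:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def compaction_candidates(fetches, now=None, min_fetches=10):
    """Files fetched >= min_fetches times in the trailing 7 days.

    `fetches` is an iterable of (ts, file_path) tuples mirroring the
    context_fetches rows; names and shape are illustrative, not the cron's API.
    """
    now = now or datetime.now(timezone.utc)
    week_ago = now - timedelta(days=7)
    counts = Counter(path for ts, path in fetches if ts >= week_ago)
    return sorted(path for path, n in counts.items() if n >= min_fetches)
```

Each returned path becomes a proposed CLAUDE.md addition, opened as a PR for human review.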
agent_health_stats NEW · day 7 · Daily snapshot, computed by agent_health cron. Source for the agent-economics dashboard and for regression alerts.
```sql
CREATE TABLE agent_health_stats (
  snapshot_date    DATE NOT NULL,
  agent_name       TEXT NOT NULL,
  runs_30d         INT NOT NULL,
  prs_opened_30d   INT NOT NULL,
  prs_merged_30d   INT NOT NULL,
  prs_reverted_30d INT NOT NULL,
  survival_rate    NUMERIC(5,3),  -- merged-not-reverted / opened
  token_cost_cents INT,
  PRIMARY KEY (snapshot_date, agent_name)
);
```
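The two derived numbers the agent_health cron writes — survival_rate and its 7-day moving average — can be sketched as plain functions (how the cron treats NULL days is unspecified; skipping them here is an assumption):

```python
def survival_rate(merged: int, reverted: int, opened: int):
    """merged-not-reverted / opened, mirroring the survival_rate column.

    Returns None (-> SQL NULL) when no PRs were opened that day.
    """
    if opened == 0:
        return None
    return round((merged - reverted) / opened, 3)


def moving_average(daily_rates, window=7):
    """Trailing average over the last `window` snapshots, skipping None days."""
    recent = [r for r in daily_rates[-window:] if r is not None]
    return round(sum(recent) / len(recent), 3) if recent else None
```

The moving average smooths out single bad days so the <80% and <50% alert thresholds fire on trends, not blips.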
POST /v1/trace NEW · day 2 · The single ingest endpoint. Every capture path POSTs to this. Authenticated by an org-issued API token (per-machine or per-pod).
```
Request:
{
  "engineer_id": "thomas-c",
  "skill_name": "/start",
  "repo": "graph8-com/g8",
  "runtime": "claude-code-local",
  "model": "claude-opus-4-7",
  "latency_ms": 1240,
  "status": "success",
  "input_tokens": null,    -- max-plan, no metering
  "output_tokens": null,
  "arg_hash": "sha256:8f3a..."
}

Response 202 Accepted:
{ "id": 847132 }
```
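A capture path needs to assemble that body before POSTing. A sketch of a payload builder (names are illustrative; it drops unset optional fields, whereas the example request sends explicit nulls — equivalent only if the endpoint treats absent and null alike, which is an assumption):

```python
def trace_payload(engineer_id, skill_name, runtime, status,
                  latency_ms=None, repo=None, model=None,
                  input_tokens=None, output_tokens=None,
                  cost_usd_cents=None, arg_hash=None):
    """Build a /v1/trace request body mirroring the example request.

    Token fields stay None for Max-plan local runs, matching the schema
    note that no metering is possible there.
    """
    payload = {
        "engineer_id": engineer_id,
        "skill_name": skill_name,
        "runtime": runtime,
        "status": status,
        "latency_ms": latency_ms,
        "repo": repo,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd_cents": cost_usd_cents,
        "arg_hash": arg_hash,
    }
    # Drop keys the caller left unset; the endpoint treats them as optional.
    return {k: v for k, v in payload.items() if v is not None}
```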
POST /v1/grade/engineer NEW · day 8 · Replaces the prompt-derived algorithm in /assign-prds. Same algorithm, but now testable and consistent. Skill calls this endpoint instead of re-deriving.
```
Request:
{
  "prd": {
    "slug": "ai-inbox-meetings-admin-settings",
    "domains": ["ai_inbox", "frontend/features/inbox"],
    "complexity": "medium",
    "gtm_score": 12
  }
}

Response 200 OK:
{
  "scores": [
    { "engineer_id": "hassan-b", "score": 7,
      "reason": "primary: ai_inbox(+3), primary: inbox(+3), under target(+1)" },
    { "engineer_id": "hamza-n", "score": 4,
      "reason": "secondary: ai_inbox(+1), 2 PRDs assigned(+3)" }
  ]
}
```
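The weights aren't spelled out on this page; the reason strings suggest primary domain +3, secondary +1, under target +1. A toy scorer under those assumed weights — the real algorithm ships in engineer_scoring.go and may differ:

```python
def grade_engineer(prd_domains, engineer):
    """Toy scorer reverse-engineered from the sample reason strings.

    Assumed weights: primary domain +3, secondary +1, under target +1.
    `engineer` is an illustrative dict, not the service's real shape.
    """
    score, reasons = 0, []
    for domain in prd_domains:
        name = domain.split("/")[-1]  # "frontend/features/inbox" -> "inbox"
        if name in engineer["primary"]:
            score += 3
            reasons.append(f"primary: {name}(+3)")
        elif name in engineer["secondary"]:
            score += 1
            reasons.append(f"secondary: {name}(+1)")
    if engineer["assigned_prds"] < engineer["target_prds"]:
        score += 1
        reasons.append("under target(+1)")
    return {"score": score, "reason": ", ".join(reasons)}
```

For the example PRD above, an engineer whose primary domains are ai_inbox and inbox and who is under target scores 7, matching hassan-b's row in the sample response.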
/tenants/<id>/webhook NEW · day 15 · Cloudflare Worker entry point. Replaces 2h polling. Validates HMAC, translates to Task CR.
```
GitHub webhook (issues.labeled, pull_request, etc.)
  → POST /tenants/graph8-eng/webhook
    X-Hub-Signature-256: sha256=...
    Content-Type: application/json

Worker validates signature, extracts label, queries triggers.yaml:
  { "name": "infra-issues", "agent": "infra", "label": "infra", "repo": "graph8-com/infra" }
  → POST to MBM creates Task CR → axon controller spawns Job → pod runs.

Target end-to-end p50 latency: under 60 seconds (was ~1 hour with polling).
```
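The signature check follows GitHub's documented X-Hub-Signature-256 scheme: HMAC-SHA256 over the raw request body, hex-encoded, prefixed with "sha256=". The production code is a TypeScript Worker, so this Python sketch is illustrative only:

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """The value GitHub sends in X-Hub-Signature-256 for this body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, header: str) -> bool:
    """Constant-time comparison, as the Worker must do before parsing."""
    return hmac.compare_digest(sign(secret, body), header)
```

compare_digest matters here: a naive `==` leaks timing information an attacker can use to forge signatures byte by byte.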
Grep this section. If a file isn't here, the spec isn't asking you to touch it. If a file is here, the badge tells you whether to create, edit, or delete it; the day tells you when.
Every step has a verification line. If the verification doesn't pass, don't move on. Order matters — later steps depend on earlier ones.
```
# GitHub UI: graph8-com/infra/settings/branches → Add classic rule for `main`
  ✓ Require a pull request before merging
  ✓ Require approvals (1)
  ✓ Require review from Code Owners
  ✓ Require status checks: terraform, validate-k8s
  ✓ Require linear history

# graph8-com/g8/settings/branches → same rule for `main` and `qa`
  ✓ Require status checks: code-quality jobs
```
Verify: git push origin main fails with "Protected branch."

```shell
# Create migration file:
mkdir -p services/mbm/migrations
cat > services/mbm/migrations/20260517_skill_invocations.sql <<'SQL'
-- (paste the CREATE TABLE from §3 Data model)
SQL
# Apply via your existing migration runner (Atlas or Alembic):
make migrate
```
Verify: psql $DATABASE_URL -c "\d skill_invocations" shows the table.

```go
// services/mbm/internal/trace/handler.go
package trace

import "net/http"

type TraceRow struct {
	EngineerID   string `json:"engineer_id"`
	SkillName    string `json:"skill_name"`
	Repo         string `json:"repo,omitempty"`
	Runtime      string `json:"runtime"`
	Model        string `json:"model,omitempty"`
	LatencyMS    int    `json:"latency_ms,omitempty"`
	Status       string `json:"status"`
	InputTokens  *int   `json:"input_tokens,omitempty"`
	OutputTokens *int   `json:"output_tokens,omitempty"`
	CostUSDCents *int   `json:"cost_usd_cents,omitempty"`
	ArgHash      string `json:"arg_hash,omitempty"`
}

func Handler(w http.ResponseWriter, r *http.Request) {
	// 1. validate API token from header
	// 2. decode JSON
	// 3. INSERT row → return 202 with id
}
```
Verify: curl -X POST $MBM_URL/v1/trace -H "Authorization: Bearer ..." -d '{...}' returns 202 and the row exists.

```typescript
// services/lifecycle-trace-mcp/src/server.ts (minimal MCP server)
import { Server } from "@modelcontextprotocol/sdk/server";

const server = new Server({ name: "lifecycle-trace", version: "0.1.0" });

server.setRequestHandler("tools/call", async (req, ctx) => {
  const t0 = Date.now();
  try {
    return await ctx.delegate(req);
  } finally {
    fetch(`${process.env.MBM_URL}/v1/trace`, {
      method: "POST",
      headers: { Authorization: `Bearer ${process.env.MBM_TOKEN}` },
      body: JSON.stringify({
        engineer_id: process.env.GITHUB_USER,
        skill_name: req.params.name,
        runtime: "claude-code-local",
        latency_ms: Date.now() - t0,
        status: "success",
      }),
    }).catch(() => {}); // fire-and-forget; never block the user
  }
});
```
Register in both repos: .claude/mcp.json

```json
{
  "mcpServers": {
    "lifecycle-trace": {
      "command": "npx",
      "args": ["-y", "@graph8/lifecycle-trace-mcp"],
      "env": { "MBM_URL": "https://mbm.graph8.com" }
    }
  }
}
```
Verify: SELECT count(*) FROM skill_invocations WHERE engineer_id='your-username' AND ts > now() - interval '5 min' returns a count ≥ 1.

```sql
-- The query that powers the panel:
SELECT
  engineer_id,
  count(*) AS fires_30d,
  count(*) FILTER (WHERE status = 'success') AS successful,
  count(DISTINCT skill_name) AS unique_skills,
  (SELECT skill_name FROM skill_invocations s2
     WHERE s2.engineer_id = s1.engineer_id
     GROUP BY skill_name ORDER BY count(*) DESC LIMIT 1) AS top_skill,
  CASE WHEN count(*) > 300 THEN 'excellent'
       WHEN count(*) > 150 THEN 'strong'
       WHEN count(*) > 50  THEN 'developing'
       WHEN count(*) > 20  THEN 'untapped'
       ELSE 'spinning?' END AS utilization
FROM skill_invocations s1
WHERE ts > now() - interval '30 days'
GROUP BY engineer_id
ORDER BY fires_30d DESC;
```
```sql
-- The query:
SELECT
  skill_name AS agent,
  count(*) AS runs,
  sum(cost_usd_cents) / 100.0 AS cost_usd,
  (SELECT count(*) FROM github_prs p
     WHERE p.head_branch LIKE 'axon/' || s.skill_name || '/%'
       AND p.merged_at > now() - interval '30 days') AS prs_merged
FROM skill_invocations s
WHERE runtime = 'k8s-job'
  AND ts > now() - interval '30 days'
GROUP BY skill_name
ORDER BY cost_usd DESC;
```
Verify: g8_5xx_fixer's row visible with $/merged-PR computed. Conditional-format red >$30, yellow >$15, green ≤$15.

7. Lowercase normalizer · g8/.github/workflows/normalize-mbm-trigger.yml
   Trigger: issue_comment.created. If body matches /@(?i)mbm\s+review/ but not /@mbm review/ → re-post lowercase.
8. Sync token guard · g8/.github/workflows/sync-main-to-qa.yml
   Add step 1: if [ -z "${{ secrets.SYNC_BOT_TOKEN }}" ]; then exit 1; fi
9. pr_fixer attempt-4 · tenants/graph8-eng/agents/pr_fixer/prompt.md
   Append to the Retry-cap section: on attempt 4, label mbm/needs-input, post summary comment tagging the author, stop.
10. Retire pr_cleanup + rage_click_detector
    Decide: add TaskSpawner entries to triggers.yaml OR rm -rf the agent dirs.
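Item 7's match-and-repost rule can be sketched in Python (the function name is illustrative; the workflow itself would run this logic in a shell or actions/github-script step):

```python
import re

# Any-case form of the trigger vs. the canonical lowercase form.
MBM_ANY_CASE = re.compile(r"@mbm\s+review", re.IGNORECASE)
MBM_CANONICAL = re.compile(r"@mbm review")


def normalized_command(comment_body: str):
    """Return the lowercase command to re-post, or None if no action needed.

    Sketch of the normalize-mbm-trigger.yml decision: fire only when the
    comment matches case-insensitively but NOT in the canonical form.
    """
    if MBM_ANY_CASE.search(comment_body) and not MBM_CANONICAL.search(comment_body):
        return "@mbm review"
    return None
```

Checking the canonical form separately prevents an infinite loop: the bot's own re-posted lowercase comment matches canonically and triggers nothing.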
test-writer on PR open

```yaml
# g8/.github/workflows/test-writer-on-pr.yml
name: Test-writer on PR
on:
  pull_request:
    types: [opened, synchronize]
    paths: ['**/*.py']
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Find new public funcs without tests
        id: scan   # needed so steps.scan.outputs resolves below
        run: ./scripts/find-untested.sh
      - name: Dispatch test-writer if needed
        if: steps.scan.outputs.untested != ''
        run: |
          gh api repos/${{ github.repository }}/dispatches \
            -f event_type='spawn-test-writer' \
            -f client_payload[pr]='${{ github.event.number }}'
```
Verify: a PR adding def public_foo(): with no test gets a follow-up test PR within 15 min.

```
# tenants/graph8-eng/agents/agent_health/prompt.md (excerpt)
Every day at 05:00 UTC:
1. Query agent_health_stats: yesterday's runs, PRs, reverts per axon agent.
2. Compute survival_rate = merged_not_reverted / opened.
3. Compute 7-day moving average.
4. If any agent < 80%, flag in Roam digest.
5. If any agent < 50%, page on-call.
6. Write today's row to agent_health_stats.
```

```yaml
# tenants/graph8-eng/crons.yaml — add:
- name: agent-health-daily
  schedule: "0 5 * * *"
  agent: agent_health
```
Verify: SELECT * FROM agent_health_stats ORDER BY snapshot_date DESC LIMIT 25 returns one row per agent per day.

13. engineer_scoring.go (half-day): Go file + 8 tests, POST /v1/grade/engineer. Update infra/.claude/commands/assign-prds.md to call the endpoint.
14. auto-promote-qa-to-main.yml (3 hr): hourly check, opens PR if QA green ≥48h.
15. Remove continue-on-error from bandit (half-day, parallel): run bandit on main, fix findings, flip the flag.
16. Skill consolidation round 1 (1 hr, parallel): /commit folded into /ship. Skill count: 22 → 21.
17. mbm_critic agent (4 hr): daily cron classifies override patterns. Already-scaffolded prompt at tenants/graph8-eng/agents/mbm_critic/.
18. skill_mortality.go (2 hr, parallel): weekly SQL, auto-open deprecation issues.
19. Cloudflare Worker (5 hr): /tenants/<id>/webhook accepts GH webhooks, HMAC-validates, creates Task CRs via MBM API. Don't cut over yet — that's tomorrow.
20. Skill consolidation round 2 (1 hr, parallel): /assign-prds + /capacity-check → /capacity. Skill count: 21 → 20.
21. Cut over 5xx-error label to webhook (2 hr). Edit tenants/graph8-eng/triggers.yaml; measure p50 dispatch latency. Expected: ~1 hour → under 60 seconds.
22. Run the 8-metric acceptance grid (1 hr). Walk every metric, run every verification command. Target: 6 / 8 green.
23. Record the 10-minute demo (2 hr). Loom walking through: vision page → click stat → drill into skill → live engineer-utilization dashboard → webhook latency proof → one closed loop in action.
Ship knowledge_compactor. Scaffold the top-5 cross-repo contracts + contract_test_runner. Ship bug_predictor (comment-only mode, to tune false-positive rate). Ship onboarding agent. Skill consolidation rounds 3+4. By day 14: 6/8 loops live, skill count = 14, label→pickup <60s across the board.

Day 8–9 · Roll remaining labels to webhooks (1 day, agent-driven).
Day 9–11 · knowledge_compactor (context_fetches table + agent + daily cron).
Day 10–11 · Ruff blocking flip (next CI ratchet step).
Day 11–13 · Top-5 cross-repo contracts + contract_test_runner agent.
Day 12–14 · bug_predictor agent (comment-only at first).
Day 13–14 · onboarding agent · add joined field to engineer-domains.json.
Day 13 · Skill consolidation round 3 (/review + /security-review).
Day 14 · Skill consolidation round 4 (/start branches).
Run this checklist Sunday afternoon (day 7). Each metric has a verification command. If six or more are green, declare success and start week-2 horizon items (knowledge_compactor, contracts, bug_predictor, onboarding).
1. git push origin main from any non-MBM machine fails with "Protected branch" on infra and g8.
2. SELECT count(*) FROM skill_invocations WHERE ts > now() - interval '1 day' · every engineer_id appears in the dashboard with a grade. Local /start fires land in the ledger within 5 min.
3. 5xx-error pickup p50 < 60 s.
4. /commit + /assign-prds + /capacity-check gone.
5. continue-on-error removed from bandit job in code-quality.yml.
6. gh run list --repo graph8-com/g8 --limit 20 --json conclusion,createdAt,updatedAt · median duration under 60 seconds. Rebuild build tooling whenever it slips.
7. SELECT engineer_id, count(*) FROM skill_invocations WHERE ts > now()-interval '1 day' GROUP BY engineer_id · everyone in the team > 10. Anything less means someone is typing code by hand.
8. tenants/graph8-eng/lints/ exists with ≥ 5 bespoke lint rules; each error message is a remediation prompt (names the canonical alternative + links to CLAUDE.md). Auto-applied to every PR.

Once the foundation is live, week 2 closes the remaining loops: knowledge_compactor, cross-repo contracts + contract_test_runner, bug_predictor (comment-only mode for tuning), onboarding agent, plus skill-consolidation rounds 3+4. By day 14: 6/8 loops live, skill count = 14, all labels on webhooks. See the original 8 loops →