graph8 today: 15 engineers + 3 QA across 13 product boards (Studio · Inbox · Enrichment · Web Chat · Copilot · Agents · Signals · Voice AI · Dialer · Stripe/Credits · Engage · Mashup · UX/UI). Real throughput last 30 days: ~565 merged PRs across the top 4 repos (top performer Usjid: 37 PRs/30d, hand-writing the code). Most agent activity (~70%) runs invisibly on personal Max plans. The change is harness engineering, full send: one shared trace, one policy plane, one dashboard — engineers banned from touching editors — each running 10+ parallel agents at any moment against the 13 boards, plus overnight long-runs. Set up in 7 days. Same 18 people. 10–15× the output. The new bottleneck isn't engineering — it's how fast we feed the assembly line with PRDs. Read the 10 principles first →
graph8-com/g8 · after-Lifecycle = 8–15×: the same engineer dispatching agents instead of typing. Verify your own count: `gh pr list --repo graph8-com/g8 --state merged --search "author:<you> merged:>2026-02-17"`. Don't see yourself? Eeshan · Musa · Ibrahim · Joaquin · Hamza · Muhammad I · plus the 3 QA (Ayesha · Rania · Immama) are all on the full engineer wall. Click any number below to drill into the full list. Click any item in the list to see the underlying file content — the actual skill prompt, the actual agent definition, the actual workflow YAML.
Click any journey to walk through it end-to-end. Click a skill name (purple) to see what it does, who can fire it, and whether the org has any record of the run. The most common surprise: more than half of the steps run somewhere the org cannot see.
K8s-job agent runs are observable (Loki logs, MBM postgres). Engineer-local Claude Code and Codex sessions — where most skill invocations actually happen — are not. There's no audit log of which skills fired, on whose machine, against which repo, with what model token. This is the single biggest control gap.
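One way to close that gap is a thin hook that local Claude Code / Codex wrappers call before each skill fires, appending one JSONL event to a shared ledger an org-side collector can tail. This is a hypothetical sketch — the field names and ledger path are assumptions, not graph8's actual schema.

```python
import json
import os
import socket
import time
from pathlib import Path

def record_skill_fire(skill: str, repo: str,
                      ledger: Path = Path("/tmp/g8-trace.jsonl")) -> dict:
    """Append one trace event: who fired what, where, and when."""
    event = {
        "ts": time.time(),
        "host": socket.gethostname(),           # whose machine
        "engineer": os.environ.get("USER", "unknown"),
        "skill": skill,                         # e.g. "/start"
        "repo": repo,                           # e.g. "graph8-com/g8"
    }
    with ledger.open("a") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

A cron on the K8s side could then ship the file into the same postgres MBM already writes to, so local sessions land on the same timeline as K8s-job runs.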
Each move is one PR away from starting. The order matters: the first three unlock the rest by giving the org the data and the levers it currently lacks.
22 skills sounds modest, but several do the same job under different names. Here's the merge table — fewer, sharper skills means engineers actually remember what fires when, and the org can write meaningful policy.
| Skills today | Becomes | Why | Removes |
|---|---|---|---|
| /start · /analyse-system · /investigate-bug | /start (auto-routes by tier) | /start already classifies tier; the other two should be branches inside it | 2 skills |
| /commit · /ship | /ship (commit becomes a sub-step) | You never want to commit without intent to ship | 1 skill |
| /review · /security-review | /review --security | Security review is a stricter rubric on the same skill | 1 skill |
| /assign-prds · /capacity-check | /capacity (one command, two views) | Both read the same data; splitting forces two queries for one decision | 1 skill |
| /check-prds · /check-architecture | /check (subcommand picks target) | "Check" skills should be a verb with an object, not separate skills | 1 skill |
| /article-image · /changelog-image · /write-article · /changelog | /publish (article + image bundled) | These belong together — writing an article without picking its image is rare | 2 skills |
| **22 skills** | **14 skills** | 8 fewer skills. Same coverage. Org writes 14 policies instead of 22. | **8 skills** |
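The first row's auto-routing can be sketched in a few lines. The tier names and keyword heuristics below are assumptions for illustration, not graph8's actual classifier:

```python
# Illustrative sketch of the merged /start: one entry point classifies the
# task and routes to what used to be separate skills. Keywords and branch
# names are invented for the sketch.
def classify_tier(prompt: str) -> str:
    p = prompt.lower()
    if "bug" in p or "traceback" in p:
        return "investigate-bug"      # former /investigate-bug
    if "architecture" in p or "system" in p:
        return "analyse-system"       # former /analyse-system
    return "feature"

def start(prompt: str) -> str:
    # the two retired skills become branches inside /start
    return f"routing to branch: {classify_tier(prompt)}"

start("5xx traceback in g8-eda-server")  # routes to the investigate-bug branch
```

One skill name to remember, one policy to write, and the tier decision is recorded in one place.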
That's the leverage. Right now 25 agents fan out into Roam digests, Slack pings, and GitHub issues — and a human reads them all and decides what to do. That doesn't scale at 18 engineers running 13 product boards. Ten loops below close the gap. All shippable in days or weeks. None take 30 days end-to-end. Each loop maps to one or more of the 10 principles →
Ryan Lapo's team at OpenAI Frontier (3 people · 1 million lines of code · 1,500 PRs · zero human-typed lines · 9 months) ships exactly this way: "PR comments indicate some context failure on behalf of the agent — get it into the repository and figure out ways to automatically prompt-inject the agent so it self-heals." Same pattern, different name. Their full playbook → openai.com/index/harness-engineering · our adaptation → the 10 principles.
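The quoted loop — PR comment signals a context failure, get it into the repository — can be sketched as a tiny recurrence counter. The threshold, file layout, and dedup rule are assumptions for the sketch:

```python
from collections import Counter
from pathlib import Path

SEEN = Counter()  # in practice this state would live in the trace ledger

def maybe_encode(comment: str, claude_md: Path, threshold: int = 2) -> bool:
    """Once the same review comment recurs, inject it into CLAUDE.md so
    the agent sees the rule on every future run (self-healing context)."""
    key = comment.strip().lower()
    SEEN[key] += 1
    if SEEN[key] == threshold:  # encode exactly once, at the threshold
        with claude_md.open("a") as f:
            f.write(f"\n- (encoded from recurring PR review) {comment.strip()}\n")
        return True
    return False
```

The point is the direction of flow: review feedback moves into repo context automatically, instead of being re-typed by a human each cohort.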
Closing question: why are bugs reaching prod when their patterns are visible at PR time? Mechanism: an agent reads each PR diff plus the recent Sentry corpus and flags lines whose patterns historically broke things, then comments on the PR with predicted risk.
bug_predictor agent (new TaskSpawner triggered on pull_request.opened in g8). Day 6–7, tune the false-positive rate on the last 30 days of merged PRs before turning it on for everyone.

Closing question: why are PRs landing with new public functions that have no tests? Mechanism: the test-writer agent already exists at g8/.claude/agents/test-writer.md — it's defined but not dispatched. Wire it to PR-open events on any new public function without coverage.
test-writer agent against the diff. Day 3, agent opens a follow-up PR with the missing tests; MBM reviews both PRs together.

Closing question: is MBM getting smarter, or drifting? Mechanism: a nightly agent reads PRs where MBM's CHANGES_REQUESTED was dismissed by the merger and classifies each as "MBM was wrong" vs "human shipped anyway." It proposes rubric updates as a PR against tenants/graph8-eng/agents/reviewer/prompt.md.
mbm_critic agent (runs daily via cron). Day 8–14, after the first week of data, draft the first rubric-update PR. Each future PR is opt-in via human approval, not auto-applied.

Closing question: are our 25 agents actually good, or are some silently making things worse? Mechanism: measure each axon agent's PR survival rate (merged and not reverted within 30 days). Flag agents below threshold. Surface the unwired ones (pr_cleanup, rage_click_detector) and dispatch them or delete them.
agent_health cron — reads MBM postgres + GitHub, writes per-agent stats to a table. Day 6–7, add a per-agent panel to the dashboard, alert on regression.

Closing question: why do agents re-derive the same context every run? Mechanism: instrument agents to log which files and searches they repeatedly fetch. A daily cron analyzes the log and opens a PR proposing additions to CLAUDE.md (root or feature-level). Same pattern for engineer-domains.json — derive expertise from actual commit patterns, not human-maintained guesses.
knowledge_compactor cron that opens the first CLAUDE.md-update PR.

Closing question: which of our 22 skills should we delete this week? Mechanism: once the trace ledger lands (rec #1, day 7), usage analytics flag skills with <5 fires/month. Auto-open consolidation issues, linked to the merge-table proposal already documented.
auto-issue creator. Cleanup PR per dead skill. By day 30, you're back to 14.

Closing question: how do we catch a change in g8 that breaks graph8-com/g8-eda-server before deploy, not after? Mechanism: a registry of cross-repo contracts (event schemas, RPC signatures). An agent runs contract tests on every PR that touches a known interface. Failure blocks merge.
contract_test_runner agent triggered on PR. Week 3–4, broaden to the remaining contracts, turn on enforcement.

Closing question: why does a new engineer ask Shaharyar the same questions every cohort? Mechanism: a per-engineer onboarding agent watches the first 4 weeks of activity. It surfaces "you haven't tried /start yet" and "you keep editing this file — here's the convention," and points to the feature-level CLAUDE.md they'd benefit from. Graduates them off training wheels at week 4.
Add joined: YYYY-MM-DD to engineer-domains.json. Week 2–4, build an onboarding agent triggered for engineers within 28 days of joining.

Closing question: what stops Friday from being the day last week's slop becomes next week's encoded knowledge? Mechanism: every Friday afternoon, each engineer picks one recurring pattern from the week's PR comments, fix-cycles, or friction and turns it into a lint, a CLAUDE.md addition, a review-agent rubric update, or a new test. That eliminates a class of misbehavior, not an instance. This is principle #5.
knowledge_compactor + mbm_critic deliver a curated "GC candidates" digest. Friday afternoon: ship one cleanup per engineer. Commit prefix gc-friday:.

Closing question: why is every lint error message generic — "unused variable" — when it could be a remediation prompt that names the canonical alternative and links to the right CLAUDE.md section? Mechanism: audit every lint message in the codebase. Rewrite each as a one-line prompt for the agent ("use fx_org_id from tests/conftest.py:42 — canonical at graph8"). Then add bespoke rules that enforce the patterns the agents most often violate (file size, package boundaries, a single canonical zod schema). This is principle #7.
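A remediation-prompt lint can be this small. The "org_"-literal pattern and the rule code G801 are invented for the sketch; the fixture path comes from the example in the text:

```python
import ast

# Bespoke rule in the remediation-prompt style: the message tells the agent
# exactly what to do instead, and where the canonical pattern lives.
REMEDIATION = (
    "G801 hard-coded org id — use fx_org_id from tests/conftest.py:42 "
    "(canonical at graph8); see the fixtures section of CLAUDE.md"
)

def check_org_ids(source: str):
    """Yield (lineno, message) for string literals that look like org ids."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, str)
                and node.value.startswith("org_")):
            yield node.lineno, REMEDIATION
```

When the agent hits this message, the fix is one retrieval away instead of one human question away.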
Of the 25 axon agents running today, not one reads the output of another. The improver agent writes to Roam. The pipeline-analyzer writes to Roam. The monitor opens GitHub issues. The flywheel writes monthly digests. MBM's reviews go to PRs. Every agent's output is consumed by a human, who then decides whether to do something about it. That's the bottleneck. The eight loops above are eight places to wire agent → agent directly. Compound intelligence starts there.
Built on top of what graph8 already runs — Claude Code locally, Codex locally, K8s axon agents, MBM, Modal, GitHub, Slack, Roam — these five capabilities turn the 25 existing agents into a measurable, governed fleet. Then the 18-person team starts dispatching 3+ agents per day in parallel. See what the team looks like after →
Every skill, every agent, every run — captured wherever it happens. Local Claude Code, local Codex, K8s Jobs, Modal, webhook handlers. One timeline per engineer, per repo, per journey.
Two slices that actually matter for AI-first teams. Per-engineer Max-plan utilization — are they leveraging their seat or coasting? — and per-K8s-agent unit economics — which autonomous agents are worth their token cost? Token $-per-engineer isn't a thing on Max plans; utilization is.
Write a policy once, enforce it everywhere. "No /ship on prod database migrations without 2 reviewers" — binds whether the engineer fires from laptop or K8s agent.
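A minimal sketch of that rule as code, assuming a simple fire-event shape (field names are invented, not graph8's actual policy schema) — the key property is that the same function evaluates every origin:

```python
from dataclasses import dataclass

@dataclass
class SkillFire:
    skill: str
    paths: list[str]   # files the skill touches
    approvals: int     # reviewer approvals on the change
    origin: str        # "laptop" | "k8s" — irrelevant to the verdict

def allows(fire: SkillFire) -> bool:
    """'No /ship on prod database migrations without 2 reviewers.'"""
    touches_migration = any("migrations/" in p for p in fire.paths)
    if fire.skill == "/ship" and touches_migration and fire.approvals < 2:
        return False  # blocked, regardless of where it fired from
    return True

allows(SkillFire("/ship", ["db/migrations/0042_x.sql"], 1, "laptop"))  # False
```

Because the check never branches on origin, there is no laptop loophole to audit separately.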
Every workflow visualized as a traceable journey. Stale PR? Click it, see every retry, every skill that fired, every reviewer comment, every backoff, every human handoff.
Work doesn't stop when engineers log off. Lifecycle keeps agents running on cron + webhook triggers, escalates to human-in-loop only when policy requires, presents results as a morning digest.
The metering that actually exists on AI-first teams using Max plans. Engineer side: $ per engineer is flat ($200/mo seat), so the live metric is utilization — skill fires, skill-to-PR ratio, whether the seat is being leveraged or wasted. Agent side: K8s autonomous agents pay per-token via the OAuth pool, so unit economics are real and measurable. Both tables below are illustrative — real numbers populate the day the ledger ships (target: 7 days).
| Engineer | Skill fires | PRs merged | Skill→PR | Top skill | Utilization |
|---|---|---|---|---|---|
| Thomas C. | 312 | 24 | 13:1 | /start | |
| Shaharyar K. | 198 | 19 | 10:1 | /investigate-bug | |
| Engineer C | 254 | 16 | 16:1 | /analyse-system | |
| Engineer D | 387 | 28 | 14:1 | /ship | |
| Engineer E | 84 | 11 | 8:1 | /commit | |
| Engineer F | 412 | 7 | 59:1 | /investigate-bug | |
| Engineer G | 142 | 14 | 10:1 | /start | |
| Engineer H | 12 | 6 | 2:1 | (mostly manual) | |
| Team median | 170 | 15 | 11:1 | — | — |
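The two seat-level flags can be recomputed directly from the illustrative table. The thresholds (3× the median ratio for looping, <20 fires/month for idle) are assumptions for the sketch, not graph8 policy:

```python
from statistics import median

engineers = {  # name: (skill_fires, prs_merged) — from the table above
    "Thomas C.": (312, 24), "Shaharyar K.": (198, 19),
    "Engineer C": (254, 16), "Engineer D": (387, 28),
    "Engineer E": (84, 11),  "Engineer F": (412, 7),
    "Engineer G": (142, 14), "Engineer H": (12, 6),
}

ratios = {name: fires / prs for name, (fires, prs) in engineers.items()}
med = median(ratios.values())  # ~11.7, close to the table's 11:1

# Looping: burning far more skill fires per merged PR than the team.
looping = [n for n, r in ratios.items() if r > 3 * med]
# Idle: the seat is paid for but barely fires.
idle = [n for n, (fires, _) in engineers.items() if fires < 20]
```

On these numbers the flags land on exactly one engineer each — the looping outlier and the unused seat.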
| Agent | Runs | Token $ | PRs opened | Merged (30d survival) | $ / merged PR |
|---|---|---|---|---|---|
| pr_fixer | 142 | $580 | 89 | 76 | $7.63 |
| infra | 24 | $390 | 22 | 22 | $17.73 |
| g8_5xx_fixer | 38 | $1,240 | 38 | 31 | $40.00 |
| g8_frontend | 19 | $310 | 18 | 16 | $19.38 |
| monitor | 60 | $180 | opens issues, no PRs | — | — |
| improver | 30 | $290 | writes to Roam, no PRs | — | — |
| pr_cleanup · rage_click_detector | 0 | $0 | defined, not dispatched | — | — |
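The $/merged-PR column is just token spend divided by PRs that survived 30 days; recomputing it from the illustrative table (agents that open no PRs have no unit economics):

```python
agents = {  # name: (token_dollars, merged_surviving_30d)
    "pr_fixer": (580, 76),
    "infra": (390, 22),
    "g8_5xx_fixer": (1240, 31),
    "g8_frontend": (310, 16),
}
# cost per merged-and-surviving PR, rounded to cents
cost_per_merged_pr = {n: round(usd / prs, 2) for n, (usd, prs) in agents.items()}
```

This is the per-agent number worth alerting on: a rising $/merged PR means the agent is burning more tokens for the same survivable output.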
Engineer F fired 412 skills for 7 merged PRs (59:1 vs team median 11:1). Most likely looping in /investigate-bug on a hard problem. A pair-session today beats another week of solo grinding.
Engineer H fired 12 skills total this month — the seat is paid for and barely used. Either skeptical of the tools or never onboarded properly. AI-first orgs can't afford a 1-in-8 "mostly manual" engineer.
g8_5xx_fixer costs $40 per merged PR — 5× the median agent cost. Either the fixer prompt is bloated, or it's burning tokens on impossible bugs. Either way, audit the last 10 runs and trim.
pr_cleanup and rage_click_detector are defined with prompts but never dispatched. They're noise in the catalog. Either wire them this week or delete them today.
bug_predictor · 5xx auto-fixed in < 2 hr 85% of the time

graph8 is past the point where one eng lead holds it all in their head — too many products, too many surfaces, too many agents. But graph8 is not yet at the size where a platform team funds itself. The 7-day setup gives the existing 18 people the leverage of a 70-person team — without the hiring, the comms overhead, or the platform-org tax. The window is open right now; close it before the next product launch.
Three first-party repos hold everything: g8 monorepo (all 13 product boards) · agent-os (company operations) · infra (autonomous K8s engineering). Plus jitsu as an open-source dependency to fold back into g8. Each engineer context-switches across 2–3 product boards inside g8; the dashboard collapses that into one queue.
The axon platform is here, the OAuth pool is here, MBM is here. Lifecycle isn't a from-scratch build — it's wiring plus visibility on what graph8 already has. The 7 days are mostly plumbing.
Real today: ~565 PRs/30d org-wide. After: 3,400–6,500 PRs/30d. graph8 ships like a 75-person team at 5×, a 150-person team at 10×, a 270-person team at the ambitious 15×. Full velocity math →