graph8 today: 15 engineers + 3 QA across 13 product boards (Studio · Inbox · Enrichment · Web Chat · Copilot · Agents · Signals · Voice AI · Dialer · Stripe/Credits · Engage · Mashup · UX/UI). Real throughput last 30 days: ~565 merged PRs across the top 4 repos (top performer Usjid: 37 PRs/30d, hand-writing the code). Most agent activity (~70%) runs invisibly on personal Max plans. The change is harness engineering, full send: one shared trace, one policy plane, one dashboard — engineers banned from touching editors — each running 10+ parallel agents at any moment against the 13 boards, plus overnight long-runs. Set up in 7 days. Same 18 people. 10–15× the output. The new bottleneck isn't engineering — it's how fast we feed the assembly line with PRDs. Read the 10 principles first →
graph8-com/g8 · after-Lifecycle = 8–15×: the same engineer dispatching agents instead of typing. Verify your own count: `gh pr list --repo graph8-com/g8 --state merged --search "author:<you> merged:>2026-02-17"`. Don't see yourself? Eeshan · Musa · Ibrahim · Joaquin · Hamza · Muhammad I · plus the 3 QA (Ayesha · Rania · Immama) are all on the full engineer wall. Click any number below to drill into the full list. Click any item in the list to see the underlying file content — the actual skill prompt, the actual agent definition, the actual workflow YAML.
Click any journey to walk through it end-to-end. Click a skill name (purple) to see what it does, who can fire it, and whether the org has any record of the run. The most common surprise: more than half of the steps run somewhere the org cannot see.
K8s-job agent runs are observable (Loki logs, MBM postgres). Engineer-local Claude Code and Codex sessions — where most skill invocations actually happen — are not. There's no audit log of which skills fired, on whose machine, against which repo, with what model token. This is the single biggest control gap.
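One way to close that gap is a thin hook that local Claude Code / Codex wrappers call before each skill fires, appending one JSONL event to a shared ledger an org-side collector can tail. This is a hypothetical sketch — the field names and ledger path are assumptions, not graph8's actual schema.

```python
import json
import os
import socket
import time
from pathlib import Path

def record_skill_fire(skill: str, repo: str,
                      ledger: Path = Path("/tmp/g8-trace.jsonl")) -> dict:
    """Append one trace event: who fired what, where, and when."""
    event = {
        "ts": time.time(),
        "host": socket.gethostname(),           # whose machine
        "engineer": os.environ.get("USER", "unknown"),
        "skill": skill,                         # e.g. "/start"
        "repo": repo,                           # e.g. "graph8-com/g8"
    }
    with ledger.open("a") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

A cron on the K8s side could then ship the file into the same postgres MBM already writes to, so local sessions land on the same timeline as K8s-job runs.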
Each move is one PR away from starting. The order matters: the first three unlock the rest by giving the org the data and the levers it currently lacks.
22 skills sounds modest, but several do the same job under different names. Here's the merge table — fewer, sharper skills means engineers actually remember what fires when, and the org can write meaningful policy.
| Skills today | Becomes | Why | Removes |
|---|---|---|---|
| /start · /analyse-system · /investigate-bug | /start (auto-routes by tier) | /start already classifies tier; the other two should be branches inside it | 2 skills |
| /commit · /ship | /ship (commit becomes a sub-step) | You never want to commit without intent to ship | 1 skill |
| /review · /security-review | /review --security | Security review is a stricter rubric on the same skill | 1 skill |
| /assign-prds · /capacity-check | /capacity (one command, two views) | Both read the same data; splitting forces two queries for one decision | 1 skill |
| /check-prds · /check-architecture | /check (subcommand picks target) | "Check" skills should be a verb with an object, not separate skills | 1 skill |
| /article-image · /changelog-image · /write-article · /changelog | /publish (article + image bundled) | These belong together — writing an article without picking its image is rare | 2 skills |
| **22 skills** | **14 skills** | 8 fewer skills. Same coverage. Org writes 14 policies instead of 22. | **8 skills** |
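The first row's auto-routing can be sketched in a few lines. The tier names and keyword heuristics below are assumptions for illustration, not graph8's actual classifier:

```python
# Illustrative sketch of the merged /start: one entry point classifies the
# task and routes to what used to be separate skills. Keywords and branch
# names are invented for the sketch.
def classify_tier(prompt: str) -> str:
    p = prompt.lower()
    if "bug" in p or "traceback" in p:
        return "investigate-bug"      # former /investigate-bug
    if "architecture" in p or "system" in p:
        return "analyse-system"       # former /analyse-system
    return "feature"

def start(prompt: str) -> str:
    # the two retired skills become branches inside /start
    return f"routing to branch: {classify_tier(prompt)}"

start("5xx traceback in g8-eda-server")  # routes to the investigate-bug branch
```

One skill name to remember, one policy to write, and the tier decision is recorded in one place.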
That's the leverage. Right now 25 agents fan out into Roam digests, Slack pings, and GitHub issues — and a human reads them all and decides what to do. That doesn't scale at 18 engineers running 13 product boards. Ten loops below close the gap. All shippable in days or weeks. None take 30 days end-to-end. Each loop maps to one or more of the 10 principles →
Ryan Lapo's team at OpenAI Frontier (3 people · 1 million lines of code · 1,500 PRs · zero human-typed lines · 9 months) ships exactly this way: "PR comments indicate some context failure on behalf of the agent — get it into the repository and figure out ways to automatically prompt-inject the agent so it self-heals." Same pattern, different name. Their full playbook → openai.com/index/harness-engineering · our adaptation → the 10 principles.
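The quoted loop — PR comment signals a context failure, get it into the repository — can be sketched as a tiny recurrence counter. The threshold, file layout, and dedup rule are assumptions for the sketch:

```python
from collections import Counter
from pathlib import Path

SEEN = Counter()  # in practice this state would live in the trace ledger

def maybe_encode(comment: str, claude_md: Path, threshold: int = 2) -> bool:
    """Once the same review comment recurs, inject it into CLAUDE.md so
    the agent sees the rule on every future run (self-healing context)."""
    key = comment.strip().lower()
    SEEN[key] += 1
    if SEEN[key] == threshold:  # encode exactly once, at the threshold
        with claude_md.open("a") as f:
            f.write(f"\n- (encoded from recurring PR review) {comment.strip()}\n")
        return True
    return False
```

The point is the direction of flow: review feedback moves into repo context automatically, instead of being re-typed by a human each cohort.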
Closing question: why are bugs reaching prod when their patterns are visible at PR time? Mechanism: an agent reads each PR diff plus the recent Sentry corpus and flags lines whose patterns historically broke things, then comments on the PR with predicted risk.
bug_predictor agent (new TaskSpawner triggered on pull_request.opened in g8). Day 6–7, tune the false-positive rate on the last 30 days of merged PRs before turning it on for everyone.

Closing question: why are PRs landing with new public functions that have no tests? Mechanism: the test-writer agent already exists at g8/.claude/agents/test-writer.md — it's defined but not dispatched. Wire it to PR-open events on any new public function without coverage.
test-writer agent against the diff. Day 3, agent opens a follow-up PR with the missing tests; MBM reviews both PRs together.

Closing question: is MBM getting smarter, or drifting? Mechanism: a nightly agent reads PRs where MBM's CHANGES_REQUESTED was dismissed by the merger and classifies each as "MBM was wrong" vs "human shipped anyway." It proposes rubric updates as a PR against tenants/graph8-eng/agents/reviewer/prompt.md.
mbm_critic agent (runs daily via cron). Day 8–14, after the first week of data, draft the first rubric-update PR. Each future PR is opt-in via human approval, not auto-applied.

Closing question: are our 25 agents actually good, or are some silently making things worse? Mechanism: measure each axon agent's PR survival rate (merged and not reverted within 30 days). Flag agents below threshold. Surface the unwired ones (pr_cleanup, rage_click_detector) and dispatch them or delete them.
agent_health cron — reads MBM postgres + GitHub, writes per-agent stats to a table. Day 6–7, add a per-agent panel to the dashboard, alert on regression.

Closing question: why do agents re-derive the same context every run? Mechanism: instrument agents to log which files and searches they repeatedly fetch. A daily cron analyzes the log and opens a PR proposing additions to CLAUDE.md (root or feature-level). Same pattern for engineer-domains.json — derive expertise from actual commit patterns, not human-maintained guesses.
knowledge_compactor cron that opens the first CLAUDE.md-update PR.

Closing question: which of our 22 skills should we delete this week? Mechanism: once the trace ledger lands (rec #1, day 7), usage analytics flag skills with <5 fires/month. Auto-open consolidation issues, linked to the merge-table proposal already documented.
auto-issue creator. Cleanup PR per dead skill. By day 30, you're back to 14.

Closing question: how do we catch a change in g8 that breaks graph8-com/g8-eda-server before deploy, not after? Mechanism: a registry of cross-repo contracts (event schemas, RPC signatures). An agent runs contract tests on every PR that touches a known interface. Failure blocks merge.
contract_test_runner agent triggered on PR. Week 3–4, broaden to the remaining contracts, turn on enforcement.

Closing question: why does a new engineer ask Shaharyar the same questions every cohort? Mechanism: a per-engineer onboarding agent watches the first 4 weeks of activity. It surfaces "you haven't tried /start yet" and "you keep editing this file — here's the convention," and points to the feature-level CLAUDE.md they'd benefit from. Graduates them off training wheels at week 4.
Add joined: YYYY-MM-DD to engineer-domains.json. Week 2–4, build an onboarding agent triggered for engineers within 28 days of joining.

Closing question: what stops Friday from being the day last week's slop becomes next week's encoded knowledge? Mechanism: every Friday afternoon, each engineer picks one recurring pattern from the week's PR comments, fix-cycles, or friction and turns it into a lint, a CLAUDE.md addition, a review-agent rubric update, or a new test. That eliminates a class of misbehavior, not an instance. This is principle #5.
knowledge_compactor + mbm_critic deliver a curated "GC candidates" digest. Friday afternoon: ship one cleanup per engineer. Commit prefix gc-friday:.

Closing question: why is every lint error message generic — "unused variable" — when it could be a remediation prompt that names the canonical alternative and links to the right CLAUDE.md section? Mechanism: audit every lint message in the codebase. Rewrite each as a one-line prompt for the agent ("use fx_org_id from tests/conftest.py:42 — canonical at graph8"). Then add bespoke rules that enforce the patterns the agents most often violate (file size, package boundaries, a single canonical zod schema). This is principle #7.
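A remediation-prompt lint can be this small. The "org_"-literal pattern and the rule code G801 are invented for the sketch; the fixture path comes from the example in the text:

```python
import ast

# Bespoke rule in the remediation-prompt style: the message tells the agent
# exactly what to do instead, and where the canonical pattern lives.
REMEDIATION = (
    "G801 hard-coded org id — use fx_org_id from tests/conftest.py:42 "
    "(canonical at graph8); see the fixtures section of CLAUDE.md"
)

def check_org_ids(source: str):
    """Yield (lineno, message) for string literals that look like org ids."""
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, str)
                and node.value.startswith("org_")):
            yield node.lineno, REMEDIATION
```

When the agent hits this message, the fix is one retrieval away instead of one human question away.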
Of the 25 axon agents running today, not one reads the output of another. The improver agent writes to Roam. The pipeline-analyzer writes to Roam. The monitor opens GitHub issues. The flywheel writes monthly digests. MBM's reviews go to PRs. Every agent's output is consumed by a human, who then decides whether to do something about it. That's the bottleneck. The eight loops above are eight places to wire agent → agent directly. Compound intelligence starts there.
Built on top of what graph8 already runs — Claude Code locally, Codex locally, K8s axon agents, MBM, Modal, GitHub, Slack, Roam — these five capabilities turn the 25 existing agents into a measurable, governed fleet. Then the 18-person team starts dispatching 3+ agents per day in parallel. See what the team looks like after →
Every skill, every agent, every run — captured wherever it happens. Local Claude Code, local Codex, K8s Jobs, Modal, webhook handlers. One timeline per engineer, per repo, per journey.
Two slices that actually matter for AI-first teams. Per-engineer Max-plan utilization — are they leveraging their seat or coasting? — and per-K8s-agent unit economics — which autonomous agents are worth their token cost? Token $-per-engineer isn't a thing on Max plans; utilization is.
Write a policy once, enforce it everywhere. "No /ship on prod database migrations without 2 reviewers" — binds whether the engineer fires from laptop or K8s agent.
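A minimal sketch of that rule as code, assuming a simple fire-event shape (field names are invented, not graph8's actual policy schema) — the key property is that the same function evaluates every origin:

```python
from dataclasses import dataclass

@dataclass
class SkillFire:
    skill: str
    paths: list[str]   # files the skill touches
    approvals: int     # reviewer approvals on the change
    origin: str        # "laptop" | "k8s" — irrelevant to the verdict

def allows(fire: SkillFire) -> bool:
    """'No /ship on prod database migrations without 2 reviewers.'"""
    touches_migration = any("migrations/" in p for p in fire.paths)
    if fire.skill == "/ship" and touches_migration and fire.approvals < 2:
        return False  # blocked, regardless of where it fired from
    return True

allows(SkillFire("/ship", ["db/migrations/0042_x.sql"], 1, "laptop"))  # False
```

Because the check never branches on origin, there is no laptop loophole to audit separately.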
Every workflow visualized as a traceable journey. Stale PR? Click it, see every retry, every skill that fired, every reviewer comment, every backoff, every human handoff.
Work doesn't stop when engineers log off. Lifecycle keeps agents running on cron + webhook triggers, escalates to human-in-loop only when policy requires, presents results as a morning digest.
The metering that actually exists on AI-first teams using Max plans. Engineer side: $ per engineer is flat ($200/mo seat), so the live metric is utilization — skill fires, skill-to-PR ratio, whether the seat is being leveraged or wasted. Agent side: K8s autonomous agents pay per-token via the OAuth pool, so unit economics are real and measurable. Both tables below are illustrative — real numbers populate the day the ledger ships (target: 7 days).
| Engineer | Skill fires | PRs merged | Skill→PR | Top skill | Utilization |
|---|---|---|---|---|---|
| Thomas C. | 312 | 24 | 13:1 | /start | |
| Shaharyar K. | 198 | 19 | 10:1 | /investigate-bug | |
| Engineer C | 254 | 16 | 16:1 | /analyse-system | |
| Engineer D | 387 | 28 | 14:1 | /ship | |
| Engineer E | 84 | 11 | 8:1 | /commit | |
| Engineer F | 412 | 7 | 59:1 | /investigate-bug | |
| Engineer G | 142 | 14 | 10:1 | /start | |
| Engineer H | 12 | 6 | 2:1 | (mostly manual) | |
| Team median | 170 | 15 | 11:1 | — | — |
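The two seat-level flags can be recomputed directly from the illustrative table. The thresholds (3× the median ratio for looping, <20 fires/month for idle) are assumptions for the sketch, not graph8 policy:

```python
from statistics import median

engineers = {  # name: (skill_fires, prs_merged) — from the table above
    "Thomas C.": (312, 24), "Shaharyar K.": (198, 19),
    "Engineer C": (254, 16), "Engineer D": (387, 28),
    "Engineer E": (84, 11),  "Engineer F": (412, 7),
    "Engineer G": (142, 14), "Engineer H": (12, 6),
}

ratios = {name: fires / prs for name, (fires, prs) in engineers.items()}
med = median(ratios.values())  # ~11.7, close to the table's 11:1

# Looping: burning far more skill fires per merged PR than the team.
looping = [n for n, r in ratios.items() if r > 3 * med]
# Idle: the seat is paid for but barely fires.
idle = [n for n, (fires, _) in engineers.items() if fires < 20]
```

On these numbers the flags land on exactly one engineer each — the looping outlier and the unused seat.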
| Agent | Runs | Token $ | PRs opened | Merged (30d survival) | $ / merged PR |
|---|---|---|---|---|---|
| pr_fixer | 142 | $580 | 89 | 76 | $7.63 |
| infra | 24 | $390 | 22 | 22 | $17.73 |
| g8_5xx_fixer | 38 | $1,240 | 38 | 31 | $40.00 |
| g8_frontend | 19 | $310 | 18 | 16 | $19.38 |
| monitor | 60 | $180 | opens issues, no PRs | — | — |
| improver | 30 | $290 | writes to Roam, no PRs | — | — |
| pr_cleanup · rage_click_detector | 0 | $0 | defined, not dispatched | — | — |
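The $/merged-PR column is just token spend divided by PRs that survived 30 days; recomputing it from the illustrative table (agents that open no PRs have no unit economics):

```python
agents = {  # name: (token_dollars, merged_surviving_30d)
    "pr_fixer": (580, 76),
    "infra": (390, 22),
    "g8_5xx_fixer": (1240, 31),
    "g8_frontend": (310, 16),
}
# cost per merged-and-surviving PR, rounded to cents
cost_per_merged_pr = {n: round(usd / prs, 2) for n, (usd, prs) in agents.items()}
```

This is the per-agent number worth alerting on: a rising $/merged PR means the agent is burning more tokens for the same survivable output.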
Engineer F fired 412 skills for 7 merged PRs (59:1 vs team median 11:1). Most likely looping in /investigate-bug on a hard problem. A pair-session today beats another week of solo grinding.
Engineer H fired 12 skills total this month — the seat is paid for and barely used. Either skeptical of the tools or never onboarded properly. AI-first orgs can't afford a 1-in-8 "mostly manual" engineer.
g8_5xx_fixer costs $40 per merged PR — 5× the median agent cost. Either the fixer prompt is bloated, or it's burning tokens on impossible bugs. Either way, audit the last 10 runs and trim.
pr_cleanup and rage_click_detector are defined with prompts but never dispatched. They're noise in the catalog. Either wire them this week or delete them today.
bug_predictor · 5xx auto-fixed in < 2 hr 85% of the time

graph8 is past the point where one eng lead holds it all in their head — too many products, too many surfaces, too many agents. But graph8 is not yet at the size where a platform team funds itself. The 7-day setup gives the existing 18 people the leverage of a 70-person team — without the hiring, the comms overhead, or the platform-org tax. The window is open right now; close it before the next product launch.
Three first-party repos hold everything: g8 monorepo (all 13 product boards) · agent-os (company operations) · infra (autonomous K8s engineering). Plus jitsu as an open-source dependency to fold back into g8. Each engineer context-switches across 2–3 product boards inside g8; the dashboard collapses that into one queue.
The axon platform is here, the OAuth pool is here, MBM is here. Lifecycle isn't a from-scratch build — it's wiring plus visibility on what graph8 already has. The 7 days are mostly plumbing.
Real today: ~565 PRs/30d org-wide. After: 3,400–6,500 PRs/30d. graph8 ships like a 75-person team at 5×, a 150-person team at 10×, a 270-person team at the ambitious 15×. Full velocity math →