Imbue 100 Agents — Massive Scale AI Testing Framework Analysis

This article was researched and written by Pengu Press AI

Imbue, the AI research company behind Sculptor, recently published a remarkable case study: an automated pipeline where over a hundred Claude agents test, debug, and iteratively improve their own open-source agent orchestration tool, mngr. The result is a self-reinforcing loop where the software writes its own tests, runs them in parallel sandboxes, fixes what breaks, merges the best changes, and produces a single clean pull request for human review.

The approach reveals a new pattern in AI-assisted engineering: instead of a single coding agent wrestling with an entire codebase, a swarm of independent agents each owns a narrow slice of work and contributes back through disciplined integration.

The Architecture: A Self-Testing Loop

Imbue's pipeline follows a clean four-step cycle: write a tutorial script, convert it into test functions, launch an agent per test, and integrate the results.

The process starts with tutorial.sh, a shell script containing blocks of CLI commands that demonstrate how mngr works. Each block is a sequence of non-empty lines representing what a user would type. Imbue engineers seed this file with comments, then ask a coding agent to fill in the actual command examples, reviewing and keeping what works.
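To make the block structure concrete, here is a minimal sketch of splitting a tutorial script into blocks of consecutive non-empty lines. The function name split_blocks and the sample script are illustrative assumptions, not Imbue's parser or the contents of the real tutorial.sh.

```python
from typing import List


def split_blocks(script_text: str) -> List[List[str]]:
    """Split a tutorial script into blocks of consecutive non-empty lines."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for line in script_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the block in progress.
            blocks.append(current)
            current = []
    if current:
        # The file may end without a trailing blank line.
        blocks.append(current)
    return blocks


# Hypothetical tutorial content, not copied from Imbue's tutorial.sh.
SAMPLE = """\
# Show the built-in help
mngr --help

# Create an agent, then list agents
mngr create foo
mngr list
"""
```

Calling split_blocks(SAMPLE) yields two blocks: the comment plus the help command, and the comment plus the create and list commands.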

From those tutorial blocks, a second coding agent generates pytest functions. This is a one-to-many mapping: a single tutorial block can spawn multiple test cases that cover happy paths, edge cases, and failure modes. Each test function declares which tutorial block it corresponds to, and a validation script ensures every block has at least one test.
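The coverage check can be sketched as a comparison between the set of tutorial blocks and the blocks that tests claim. The article does not say how a test declares its block, so the plain test-name-to-block-ID mapping below is an assumption, as are the function names.

```python
from typing import Dict, List, Sequence, Set


def uncovered_blocks(block_ids: Sequence[str], test_to_block: Dict[str, str]) -> List[str]:
    """Return the tutorial blocks that no test claims, in their original order."""
    covered: Set[str] = set(test_to_block.values())
    return [block_id for block_id in block_ids if block_id not in covered]


def unknown_references(block_ids: Sequence[str], test_to_block: Dict[str, str]) -> List[str]:
    """Return the tests that point at a block that does not exist."""
    known = set(block_ids)
    return sorted(name for name, block in test_to_block.items() if block not in known)


# Hypothetical block IDs and test names for illustration.
BLOCKS = ["help", "create-and-list", "stop"]
TESTS = {
    "test_help_succeeds": "help",
    "test_help_lists_subcommands": "help",       # one block, several tests
    "test_create_then_list": "create-and-list",
}
```

With this sample data, uncovered_blocks(BLOCKS, TESTS) reports the "stop" block, which is the situation the validation script is meant to catch.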

Next, for each pytest function, Imbue launches a dedicated Claude agent via mngr create. The agent runs the test and then does one of two things, depending on the outcome: if the test fails, it fixes the underlying code or the test itself; if the test passes, it improves the test to be more thorough and more faithful to the original tutorial block. In either case, the agent writes a result JSON file describing what it did. The fourth step, integration, is handled by the orchestration layer described below.
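The agent's branch and its report can be sketched as follows. The article confirms only that a result JSON file exists; every field name and status value here is a hypothetical schema chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AgentResult:
    # Hypothetical schema: the real result file format is not published
    # in the article.
    test_id: str
    status: str                      # e.g. "fixed", "improved", or "blocked"
    summary: str
    changed_files: List[str] = field(default_factory=list)


def next_action(test_passed: bool) -> str:
    """Mirror the two branches: improve a passing test, repair a failing one."""
    return "improve_test" if test_passed else "fix_code_or_test"


def to_json(result: AgentResult) -> str:
    """Serialize the report so an integrator can read it without parsing logs."""
    return json.dumps(asdict(result), indent=2, sort_keys=True)
```

A machine-readable report like this is what lets a later step sort agents into mergeable and non-mergeable groups without a human reading each transcript.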

The Test Framework

Imbue built a thin wrapper around Python's subprocess module to make test functions expressive and information-rich. A typical test, test_help_succeeds(e2e: E2eSession), writes the tutorial block, runs mngr --help, and asserts that the command succeeded and that stdout contains specific strings.

The framework goes further. Because mngr runs agents inside tmux sessions, whose output can't be captured through plain stdout/stderr, Imbue built a custom "connect command" that uses asciinema to record full TUI sessions and save them alongside each test's output. The result is a combined web view where any failing test can be inspected through both a CLI transcript and a terminal recording, making it possible to understand exactly what an agent did and where it went wrong.

Orchestration: Map-Reduce with Agents

The orchestration is where scale comes in. The pipeline collects all test names via pytest --collect-only, then launches one mngr create call per test. It uses mngr list to poll agent state, mngr pull to retrieve result files and artifacts when agents finish, and mngr stop to clean up. Once all testing agents are done, a final "integrator" agent — also created via mngr create — merges everything into a single pull request.
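The map side of this loop can be sketched in a few functions. The parsing assumes the quiet form of collection (pytest --collect-only -q), which prints one node ID per line; the agent naming scheme and the "running" state label are assumptions, since the article does not show how names are chosen or what mngr list reports.

```python
import re
from typing import Dict, List


def parse_collected(output: str) -> List[str]:
    """Extract test node IDs from `pytest --collect-only -q` output."""
    # Node IDs contain "::"; the blank line and summary line do not.
    return [line.strip() for line in output.splitlines() if "::" in line]


def agent_name(node_id: str) -> str:
    """Derive a name for the agent that owns this test (hypothetical scheme)."""
    test_name = node_id.split("::")[-1]
    return re.sub(r"[^a-z0-9]+", "-", test_name.lower()).strip("-")


def plan_fan_out(node_ids: List[str]) -> Dict[str, str]:
    """Map step: one agent per collected test."""
    return {agent_name(node_id): node_id for node_id in node_ids}


def all_finished(states: Dict[str, str]) -> bool:
    """The reduce step may start only once no agent is still running."""
    return all(state != "running" for state in states.values())


# Hypothetical collection output for illustration.
SAMPLE_OUTPUT = """\
tests/test_tutorial.py::test_help_succeeds
tests/test_tutorial.py::test_create_then_list

2 tests collected in 0.01s
"""
```

In a real pipeline each planned name would be passed to mngr create, the state dictionary would be refreshed from mngr list, and the integrator would be launched once all_finished returns true. Two tests with the same function name in different files would collide under this naming scheme, so a production version would need a more careful name.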

Imbue's integration strategy is carefully designed for human review. Testing agents divide their commits into two categories: implementation fixes (changes to mngr's own code) and non-implementation changes (test improvements, documentation updates). The integrator merges non-implementation changes wholesale since they are uncontroversial. For implementation fixes, it ranks them by importance and merges them as distinct commits on a linear branch, resolving conflicts along the way. A human reviewer can then examine each implementation commit individually and revert any that don't pass muster.
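The ordering rule the integrator follows can be sketched as a small sort. The AgentCommit fields and the numeric importance score are assumptions; the article says only that implementation fixes are ranked by importance, not how the ranking is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AgentCommit:
    sha: str
    is_implementation: bool   # True when the commit touches mngr's own code
    importance: int = 0       # hypothetical score; higher merges earlier


def integration_order(commits: List[AgentCommit]) -> List[AgentCommit]:
    """Non-implementation changes first, then implementation fixes by rank."""
    non_impl = [c for c in commits if not c.is_implementation]
    impl = sorted(
        (c for c in commits if c.is_implementation),
        key=lambda c: c.importance,
        reverse=True,
    )
    return non_impl + impl
```

Keeping each implementation fix as its own commit in this order is what makes a later revert of a single change cheap for the reviewer.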

Not every agent produces mergeable work. Some get confused or hit environment issues. These agents write a "blocked" report that gets included in the summary, so nothing silently fails.

From 10 Agents to 100: Local to Cloud in One Flag

One of the most practical aspects of Imbue's approach is that the entire pipeline runs locally during development. The author developed the full orchestration workflow on their own machine, starting with 10 agents managed via Git worktrees. Each agent gets its own working copy, and merging is straightforward since all branches live in the same repo.

When it was time to scale to 100+ agents, the only code change was the deployment target. Instead of mngr create foo, they use mngr create foo@.modal, which sends the agent to Modal's serverless sandbox infrastructure. All subsequent operations — mngr list, mngr pull, mngr stop — work identically whether the agents are running locally or on Modal.
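The one-flag switch can be illustrated with a helper that builds the create target. Only the two command shapes quoted above (mngr create foo and mngr create foo@.modal) come from the article; the helper names and the boolean flag are assumptions, and any further arguments the real pipeline passes are not shown.

```python
from typing import List


def agent_address(name: str, use_modal: bool) -> str:
    """Return the create target: a bare name locally, name@.modal for Modal."""
    return f"{name}@.modal" if use_modal else name


def create_command(name: str, use_modal: bool) -> List[str]:
    # Only the `mngr create <target>` shape is taken from the article.
    return ["mngr", "create", agent_address(name, use_modal)]
```

Because the target is the only thing that changes, the rest of an orchestration script can stay identical between a laptop run and a cloud run.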

To keep the local workflow exercising the same code path as remote, they also switched local agents from Git worktree mode to Git mirror mode, using mngr pull explicitly. The result is a single pipeline that works at any scale without architectural changes.

Composability Over Frameworks

A key insight from Imbue's design is what they deliberately didn't build. Rather than creating a map-reduce framework specific to agent orchestration, mngr provides primitives — create, list, pull, stop — that compose like functional programming operators. The pipeline itself is built from these primitives, not defined by a pre-baked orchestration DSL.

As Imbue put it in their blog post: "Unlike other multi-agent workflow tools that try to give you the 'map reduce framework', mngr gives you the 'maps' and 'reduces', just like a library of functional programming primitives, so that you can build whatever simple or complex multi-agent pipeline you want."

The Cost and Observability Reality

Running a hundred parallel agents sounds impressive, and it is, but the economics are nuanced. Each agent run against a real codebase consumes 20,000 to 50,000 tokens on context alone: repo structure, relevant files, recent changes. Multiply that by 100 agents and a single sweep burns 2 to 5 million tokens before any actual fixes are written; across multiple repos and repeated runs, daily token consumption climbs well into the millions. Add reruns for failures and retries on rate-limited API calls, and the cost curve steepens quickly.

Short-lived parallel bursts are cost-effective: a quick sweep of tests that each complete in minutes. Scheduled hourly or overnight batch runs change the math entirely. A hundred agents launched every hour repeat that context-window burn 24 times a day, which forces hard decisions about which tests actually need to re-run and which can be skipped based on what code changed.
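The arithmetic behind these figures is simple enough to write down. The sketch below uses the article's range of 20,000 to 50,000 context tokens per agent; the retry_overhead parameter is an added assumption for modelling reruns and is zero by default.

```python
def sweep_tokens(agents: int, context_tokens: int) -> int:
    """Context tokens consumed by one sweep, before any fixes are written."""
    return agents * context_tokens


def daily_tokens(agents: int, context_tokens: int, sweeps_per_day: int,
                 retry_overhead: float = 0.0) -> int:
    """Daily context burn; retry_overhead=0.1 would model 10% extra reruns."""
    base = sweep_tokens(agents, context_tokens) * sweeps_per_day
    return round(base * (1 + retry_overhead))


# One sweep of 100 agents:
#   100 * 20,000 = 2,000,000 tokens (low end)
#   100 * 50,000 = 5,000,000 tokens (high end)
# Hourly sweeps, 24 per day:
#   24 * 2,000,000 = 48,000,000 tokens (low end)
#   24 * 5,000,000 = 120,000,000 tokens (high end)
```

The gap between a single burst and an hourly schedule is a factor of 24, which is why skipping tests untouched by a change matters far more for scheduled runs than for one-off sweeps.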

The harder problem at scale is observability. With one agent, you read its logs. With a hundred, you need to detect patterns across the whole set of runs. If three agents fail identically, is it a real code issue or rate limiting? If forty time out simultaneously, is there a dependency problem or infrastructure saturation? At scale, you're debugging distributions, not individual runs. Imbue addresses this with structured result JSON files and a combined artifact view that surfaces CLI transcripts and terminal recordings side by side, but the challenge scales with the agent count and remains an area that demands attention from teams adopting this pattern.
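One way to debug distributions rather than individual runs is to collapse failures into signatures and count them. This is a generic sketch, not something the article attributes to Imbue; the "status" and "error" keys are hypothetical fields of a result record.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def signature(error_text: str) -> str:
    """Normalize an error message so equivalent failures collapse together."""
    stripped = error_text.strip()
    first_line = stripped.splitlines()[0] if stripped else "<empty>"
    # Replace volatile details (addresses, durations, counts) with placeholders.
    first_line = re.sub(r"0x[0-9a-fA-F]+", "<addr>", first_line)
    return re.sub(r"\d+", "<n>", first_line)


def cluster_failures(results: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count failures per signature, most common first."""
    counts = Counter(signature(r["error"]) for r in results if r["status"] != "ok")
    return counts.most_common()
```

If forty results share one signature, the cause is more likely infrastructure or rate limiting than forty independent bugs, and that is visible at a glance in the clustered counts.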

Concurrency management is equally critical. Running "as many as possible" isn't the right strategy — you want "exactly as many as the API and your budget can support without making failure modes harder to diagnose." The Claude API has rate limits, and beyond that, each concurrent agent competes for the same sandbox infrastructure. Imbue's approach uses mngr list to poll and throttle, but the sweet spot for parallelism depends on the specific API tier, the complexity of each test, and the acceptable failure rate. Community discussion on the announcement thread suggested that for most teams, effective parallelism tops out lower than the headline number of 100 agents might imply — somewhere in the 20-50 range for sustained, reliable throughput.
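A cap on parallelism can be sketched with a bounded worker pool. This is a generic pattern rather than Imbue's implementation, which the article says polls mngr list; the worker callable stands in for whatever launches an agent and waits for it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_bounded(tasks: Iterable[str], worker: Callable[[str], str],
                max_parallel: int) -> Dict[str, str]:
    """Run worker over tasks with at most max_parallel running at once."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    task_list = list(tasks)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so results line up with task_list.
        outcomes = list(pool.map(worker, task_list))
    return dict(zip(task_list, outcomes))
```

Setting max_parallel from the API tier and budget, rather than from the number of tests, keeps the failure modes easier to diagnose when something does go wrong.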

The data privacy question also surfaced in the community response. Some readers raised concerns about sending proprietary codebases into AI providers' sandboxes, even on the assumption that zero-retention policies apply. The Imbue team's tool is open source and runs against their own code on Modal's infrastructure, but any organization adopting a similar pattern would need to evaluate their own data governance requirements, contract terms with the model provider, and whether the convenience of cloud sandboxes outweighs the risk of exposing sensitive code even transiently.

What the Community Thinks

The announcement drew a mix of enthusiasm and skepticism on Hacker News, collecting more than 55 points and dozens of comments. Kanjun Qiu, Imbue's co-founder, actively participated in the discussion, confirming that mngr has no monetization plan and was shipped as open source because the team believes "open agents must win over closed/verticalized platforms."

One of the most prominent reactions came from a developer contrasting their own experience of "babysitting every feature for hours in Claude Code" with Imbue's claims of shipping features every 17 minutes across an 8M-line codebase using 3,000 parallel agents. Imbue's blog author responded directly, agreeing that the system works best on smaller, well-isolated components and advocating for modular architecture as the answer: "our reaction has been to say 'ok, well the best practice in software engineering is to make small, well-isolated components anyway, so what if we did that?'" This is a crucial caveat — the pattern works when codebases are deliberately structured for agent consumption.

Other threads debated the broader implications. Some commenters questioned how companies should approach intellectual property when AI-generated code exists in a shifting legal landscape around copyright. Others raised concerns about skill atrophy: if teams increasingly delegate to agent swarms, do engineers lose the ability to reason about low-level implementation details? Imbue's response was that their approach actually improves codebase understandability — by generating human-readable documentation linked to test cases, anyone (not just the original author) can read a high-level description of how the software should work, even without diving into every line of code.

There was also discussion about the trade-off between open-sourcing agent-built tools versus keeping them proprietary. Some argued that once AI can replicate any workflow using the same services, keeping it secret provides no lasting advantage. Others countered that AI providers themselves could eventually become competitors, trained on the very codebases they host. The debate remains unresolved, but Imbue's open-source bet reflects a philosophy that ecosystem velocity trumps competitive moat in the early days of agent-based development.

Scaling Down Is a Feature

Imbue emphasizes something rarely discussed in scalability conversations: the ability to scale down. The default expectation is that scaling means paying more upfront for infrastructure complexity. But because every mngr operation works identically on local machines, a small team can run 10 agents at a time on a laptop, work through hundreds of tests overnight by trading time for resources, and spin up cloud infrastructure only when they need burst parallelism.

This matters for the long tail of teams that don't have dedicated infrastructure staff. The upfront cost is near-zero, the incremental cost is linear, and the path from local development to cloud-scale is a single configuration flag.

Practical Takeaways

Imbue's approach offers several patterns developers can adopt today:

Start narrow. Each agent owns one test case. Small, isolated tasks produce reliable results that can be reviewed and merged with minimal friction.
Use structured reporting. Every agent writes a machine-readable result file. This is what enables automated integration and makes failures tractable at scale.
Separate mergeable from reviewable. Not all agent changes are equal. Auto-merge low-risk changes and keep the implementation work visible for human review.
Build recordings into tests. CLI transcripts and terminal recordings make it possible to debug agent behavior after the fact without re-running anything.
Develop locally, deploy when needed. Don't design for a hundred agents if ten will validate the design. Scale up when the logic is proven and the cost makes sense.
Break codebases into isolated components. As Imbue's blog author noted in the HN discussion, these systems work best on well-isolated components. The pattern naturally encourages micro-architectures where individual modules can be tested independently.

The self-testing pipeline is also mngr's own test suite: every time it runs, it validates that the tool works correctly in the hands of real agents. That feedback loop — the tool testing itself using itself — is perhaps the most elegant part of the design.

Why Open Source Matters

Kanjun Qiu, Imbue's co-founder, confirmed on Hacker News that mngr is fully open source with no monetization plan. "We shipped it this way instead of trying to monetize because we believe open agents must win over closed/verticalized platforms in order for humans to live freely in our AI future," she wrote.

The codebase is built as a collection of plugins. Much of the orchestration logic, including the multi-agent testing pipeline described here, is implemented through mngr's own plugin system — making it both a product and a demonstration of its own capabilities.

Whether you're a solo developer running a handful of agents on a laptop or a team spinning up hundreds on Modal, the primitives are the same. The question shifts from "how do I orchestrate AI agents at scale?" to "which specific test cases should my agents own?"

Sources:
- Imbue: "A case study in testing with 100+ Claude agents in parallel" (https://imbue.com/product/mngr_part_2/)
- Hacker News discussion (https://news.ycombinator.com/item?id=47629485)