Terminal Bench Leaderboard as of August 2025
We've reached an inflection point. Over the past year, we've been systematically building the infrastructure to create world-class agents:
- We started by building a leading text-to-SQL agent for blockchain data—leveraging our deep domain expertise to solve real developer pain points.
- We then launched Agent Arena, the best way to get high-quality human feedback on agents.
This quarter, everything came together.
🏆 Major Milestone Achieved
We built OB-1, our specialized coding agent, which achieved a 49% success rate on Terminal Bench and claimed the #2 position globally. We outperformed Anthropic's Claude Code (43.2%) and scored more than double OpenAI's Codex (20%).
Terminal Bench has become the defining benchmark for coding agents as CLI-based workflows dominate enterprise adoption.
Why This Matters
Specialized agents built by small teams can outperform big labs. The market is shifting from instruction-tuning to specialized RL environments. As Karpathy notes: "the highest leverage thing you can do is help construct a high diversity of RL environments that help elicit LLM cognitive strategies." We're building these environments via decentralized networks, which we see as a better fit for the job than centralized coordination.
We see two major opportunities ahead:
Platform-Specific Coding Agents
We've identified significant white space in the coding agent market. Large developer ecosystems lack tailored coding agents that can truly supercharge developer productivity. Claude Code and Cursor excel at general-purpose coding, but they're optimized for breadth, not depth. Platform-specific nuances, such as Ethereum's gas optimization patterns, Android's lifecycle management, or Stripe's idempotency requirements, demand specialized agents. We're positioning ourselves as the strategic partner to fill this gap because decentralized coding agents give ecosystems vendor-agnostic solutions without lock-in. With our #2 position on Terminal Bench, we now have a seat at the table to drive this transformation.
Decentralized Agent Network
Open-source will be fundamental to achieving AGI—not through a singular model, but through continuous domain-specific learning integrated into existing model architectures. We're transforming our infrastructure into the go-to open-source stack for building, evaluating, and training specialized agents. The market is validating this approach: OpenAI and Anthropic are spinning out dozens of teams for high-value domains. This fragmentation creates a coordination challenge that favors decentralized networks over centralized entities. Our infrastructure becomes the substrate for this inevitable shift. If good environments remain expensive and hidden, open-source models will fall further behind—but if robust open-source tooling emerges, open-source can also be state-of-the-art.
Next Steps
- Capability: Pushing OB-1 to #1 on Terminal Bench.
- Commercialization: Our leaderboard standing creates immediate go-to-market leverage. XBOW's Series B validates that technical leadership translates to commercial success. We're pursuing ecosystem partnerships with Apple, Microsoft, Android, AWS, Ethereum, and Stripe.
Where You Can Help
- Amplification: Please share our announcement to help us reach developer communities.
- Strategic Outreach: Help us connect with OpenAI and other major players. OpenAI should want to co-market this: we helped their GPT-5 model reach the #2 position on Terminal Bench, ahead of Claude. Our specialized agent OB-1 reached #2 globally (49% success rate), showing how focused expertise can complement their general models.
- Connections: We're seeking introductions to ecosystem leaders at Ethereum, Android, AWS, Stripe, and similar platforms. These massive developer communities are an underserved market for specialized coding agents tailored to their unique APIs, patterns, and best practices. Each ecosystem partnership is an opportunity to reach millions of developers. If you have other ideas for how to commercialize coding agents, please let us know.
We're building an in-person team in SF and are hiring for two key roles.
View the live Terminal Bench leaderboard:
Visit Terminal Bench →