The flagship content asset of the DevRel program: reproducible, open-harness benchmarks across AI code review, CI runners, and agent sandboxes, published where Tenki loses, not just where it wins. For an unknown brand with a real precision gap, honesty is the credibility. It's also literally a Luxor value: "honesty over kindness." Every benchmark number below is Tenki first-party; the whole point is that you can re-run it.
Score every coding agent against every open model on real coding tasks — then slice it both ways. Pick a model and see which agent gets the most out of it; pick an agent and see which open model it actually shines on. The pairing matters more than either pick, and no neutral benchmark maps it — SWE-bench and terminal-bench rank models alone.
Claude Code, Codex, Aider, Goose, ForgeCode, Droid… ~25 agents run side by side (the dockingstation harness).
Qwen3-Coder, GLM, Kimi-K2, DeepSeek, Llama… served on Luxor GPU compute — no frontier API required.
Pick a model → the agent ranking flips. Pick an agent → the best open model changes. The winners aren't who you'd guess.
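A minimal sketch of the "slice it both ways" idea, assuming results are stored as one score per agent×model cell. The agent and model names and the scores are placeholders invented for illustration, not benchmark results, and the function names are hypothetical.

```python
# Hypothetical pass rates per (agent, model) cell. Placeholder names and
# numbers for illustration only; these are not benchmark results.
SCORES = {
    ("agent_a", "model_x"): 0.42,
    ("agent_a", "model_y"): 0.31,
    ("agent_b", "model_x"): 0.35,
    ("agent_b", "model_y"): 0.47,
}


def best_agent_per_model(scores):
    """Slice by model: which agent gets the most out of each model."""
    best = {}
    for (agent, model), score in scores.items():
        if model not in best or score > best[model][1]:
            best[model] = (agent, score)
    return best


def best_model_per_agent(scores):
    """Slice by agent: which model each agent scores highest on."""
    best = {}
    for (agent, model), score in scores.items():
        if agent not in best or score > best[agent][1]:
            best[agent] = (model, score)
    return best


print(best_agent_per_model(SCORES))
print(best_model_per_agent(SCORES))
```

With these placeholder values the top agent differs by model, which is the kind of flip a model-only ranking cannot show.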
It runs entirely on Tenki — that's the point. Runners orchestrate a daily sweep (schedule → fan out → score → publish); Sandbox runs each agent×model cell in its own Firecracker microVM, in parallel; Luxor compute serves the open models on GPU. The benchmark itself proves the thesis: the agent era runs on Tenki, top to bottom.
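The fan out → score → publish flow can be sketched in a few lines. This is an illustration under stated assumptions, not the dockingstation harness: `run_cell`, `sweep`, and `publish` are hypothetical names, tasks are modeled as plain callables, and a thread pool stands in for launching one microVM per cell.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_cell(agent, model, tasks):
    """Hypothetical stand-in for one agent x model cell.

    A real harness would start an isolated sandbox here; this version
    just calls each task and returns the fraction that passed.
    """
    passed = sum(1 for task in tasks if task(agent, model))
    return passed / len(tasks)


def sweep(agents, models, tasks, max_workers=8):
    """Fan out every agent x model cell in parallel and collect scores."""
    cells = list(product(agents, models))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = pool.map(lambda cell: run_cell(*cell, tasks), cells)
        return dict(zip(cells, scores))


def publish(results):
    """Order cells best-first, ready to render as leaderboard rows."""
    return sorted(results.items(), key=lambda item: item[1], reverse=True)
```

Each cell is independent of the others, so the parallel fan-out needs no coordination beyond collecting the scores at the end.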
Bring your own evals. Point your own task suite at the matrix and run it on Tenki — turning the public leaderboard into a hosted agent×model eval surface. Developers come to run their evals and stay as users. Neutral by design: Tenki ships infra, not a coding agent, so the upsets get published straight. Phasing: fixed matrix weekly → daily → self-serve.
Most vendor benchmarks are marketing dressed as data — cherry-picked repos, hidden harnesses, no way to reproduce. The Tenki Benchmark Series inverts that. The harness is open, the test sets are bugs that aren't in Tenki's own benchmark repos, and we publish the columns where competitors beat us. That is exactly what turns a benchmark from a brag into a trust asset — and it's the only way an unknown brand earns credibility on a metric.
The test harness and methodology are public so anyone — competitors included — can re-run the numbers. No hidden setup, no cherry-picked repos.
We report the columns Tenki trails on (precision, today). Showing the loss is what makes the win believable — and it maps a public roadmap.
A stated Luxor value, made operational. The benchmark is the clearest proof that Tenki's marketing tells developers the truth.
Every figure here is Tenki-run, first-party data — labeled as such. The credibility comes from reproducibility, not from pretending to be third-party.
How many real bugs does each reviewer catch (recall), how many of its flags are real (precision), and the harmonic mean of the two (F1). Tenki leads recall by a wide margin, edges out Devin on F1 (41.7 vs 40.9), and trails on precision. We publish all three, because the gap is the story.
| Reviewer | Recall | Precision | F1 (%) |
|---|---|---|---|
| Tenki | 68.9% | 29.9% | 41.7 |
| Devin | 36.1% | 47.3% | 40.9 |
| Cursor | 32.0% | 51.3% | 39.4 |
| CodeRabbit | 28.7% | 25.0% | 26.7 |
| Greptile | 36.1% | 15.9% | 22.1 |
| Copilot | 24.6% | 18.9% | 21.4 |
| Graphite | 3.3% | 50.0% | 6.2 |
Tenki leads recall and F1, and trails on precision. It catches roughly twice as many real bugs as the next-best reviewers (recall 68.9% vs 36.1% for Devin and Greptile) and posts the highest balanced score (F1 41.7, narrowly ahead of Devin's 40.9), but only ~30% of its flags are true positives, behind Cursor (51.3%), Graphite (50.0%), and Devin (47.3%). We publish that gap on purpose: precision is the public roadmap and the content thesis. "We catch the most bugs; here's exactly how we're closing the noise" is a more credible, and more interesting, story than a sanitized win, and it's the spine of Initiative 1.
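The F1 column can be re-derived from the other two, which is a quick sanity check for anyone re-running the numbers. The sketch below assumes F1 is the standard harmonic mean of recall and precision; the recall and precision values are copied from the table.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision, both as fractions in [0, 1]."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall %, precision %) copied from the reviewer table.
TABLE = {
    "Tenki": (68.9, 29.9),
    "Devin": (36.1, 47.3),
    "Cursor": (32.0, 51.3),
    "CodeRabbit": (28.7, 25.0),
    "Greptile": (36.1, 15.9),
    "Copilot": (24.6, 18.9),
    "Graphite": (3.3, 50.0),
}

for reviewer, (recall, precision) in TABLE.items():
    score = 100 * f1(recall / 100, precision / 100)
    print(f"{reviewer}: F1 = {score:.1f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, Graphite's 50.0% precision cannot offset its 3.3% recall, and Tenki's recall lead shrinks to a narrow F1 lead.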
Drop-in GitHub Actions runners on Firecracker microVMs over Luxor's owned compute. Headline: ~30% faster and up to 60% cheaper. Per-workload speedups vary widely — and where a number is cache-driven, we say so rather than imply it's pure hardware.
| Workload | Speedup vs GitHub-hosted | Note |
|---|---|---|
| Rust build | +40% | Compute-bound build |
| Docker build | +30% | Image build pipeline |
| Android build | +37% | Toolchain-heavy build |
| n8n monorepo | +48% | Real-world monorepo CI |
| Go build | +99% | Cache-driven — flagged honestly; not representative of raw-hardware gains |
Why the honest flag matters. The Go result nearly doubles throughput — but it's cache-driven, not a raw-silicon win, so we label it that way. The same discipline that publishes the precision gap publishes the asterisk on a flattering speedup. Buyers evaluating runners can trust the ~30% / up-to-60% headline precisely because we don't dress up the outliers. The cost advantage is structural — Luxor owns the metal — so we can lead with economics without a fragile dollar-savings claim.
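To read the speedup column as time saved, the definition matters. The sketch below assumes "+N%" means N% more throughput, the reading suggested by "nearly doubles throughput" for the +99% Go result; the source does not spell out the definition, so treat the conversion as illustrative. `wall_time_fraction` is a hypothetical helper name.

```python
def wall_time_fraction(speedup_pct):
    """Fraction of baseline wall-clock time left after a throughput speedup.

    Assumes "+N%" means N% more work per unit time, not N% less time.
    """
    return 1 / (1 + speedup_pct / 100)


# Speedups copied from the runner table.
SPEEDUPS = {
    "Rust build": 40,
    "Docker build": 30,
    "Android build": 37,
    "n8n monorepo": 48,
    "Go build": 99,
}

for workload, pct in SPEEDUPS.items():
    saved = 1 - wall_time_fraction(pct)
    print(f"{workload}: +{pct}% throughput -> {saved:.0%} less wall-clock time")
```

Under this assumption a +99% speedup roughly halves wall-clock time rather than removing nearly all of it, so a published harness should state which definition it uses.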
Benchmarks ship quarterly and on every major model release — because the AI-review field reshuffles each time a new frontier model lands. Three tracks run in rotation, each owning its metric definitions before competitors do.
Re-run on every major model release, using bugs that are not in Tenki's own benchmark repos. Owns the recall/precision/F1 definitions for the category and tracks the precision gap closing over time.
Per-workload speed and cost with a public cost calculator. Cache-driven results flagged. Quarterly, timed to the GitHub-pricing conversation.
microVM boot time and per-second cost for agent workloads. The agent-era proof point, co-published with the reference architectures.
Drop #1: the AI-reviewer teardown. The series opens with a first-party, open-harness study: Tenki vs CodeRabbit vs Greptile on real open-source bugs the reviewers have never seen. Same methodology as the table above, fully reproducible, published where Tenki loses. It's the first proof that the benchmark is a trust asset, not a brag, and it lands in Week 2 of the plan.
Before the sandbox benchmark ships, here's the honest frame: the purpose-built agent-sandbox category, on the dimensions that actually decide a buy — isolation, cold start, persistence, GPU, and product shape. Tenki sits squarely in the Firecracker microVM pack on speed; the difference is what's wrapped around the sandbox.
| Vendor | Isolation | Cold start | Persistence | GPU | Note |
|---|---|---|---|---|---|
| Tenki | Firecracker microVM | <100ms boot | Persistent volumes + snapshot/restore | No public GPU | Sandbox-first; native ADE app, integrated with runners + AI review, owned compute |
| E2B | Firecracker microVM | ~150ms | Ephemeral / pause (beta) | No | SOC 2; 200M+ sandboxes; SDK-first |
| Daytona | Docker | ~90ms | Stateful / unlimited | Yes | "Fastest creation," Computer Use desktops — but pricey and hard to self-host |
| Modal | gVisor | Sub-sec | Snapshots | Yes (A100/H100) | 50K+ concurrency; SOC 2 |
| Sprites.dev | Firecracker | Instant | Indefinite + hibernate (~300ms) | No | Zero idle cost |
| Blaxel | Firecracker | ~25ms resume | Standby snapshots | No | YC X25; $7.3M seed; SOC 2 / HIPAA / ISO 27001 |
| Contree (Nebius) | microVM | Sub-sec | Git-like branching + snapshots | Yes | MCP-native; 7,000+ SWE-bench environments |
| Runloop | Custom hypervisor | Sub-sec | Snapshots | No | SWE-bench focus; 10K+ parallel; VPC |
| Northflank | microVM / gVisor | Sub-sec | Stateful | Yes (H100) | Enterprise VPC, multi-cloud |
| Vercel Sandbox | Firecracker | Sub-sec | Snapshot / hibernate | No | Part of the Vercel Agents stack |
Tenki's cold start (<100ms) is competitive with the whole Firecracker pack — but raw speed is not the differentiation. E2B, Sprites.dev, Blaxel, and Vercel all live in the same microVM-boot neighborhood, and Blaxel's ~25ms resume is faster. The wedge is the native ADE + the integrated runners-and-review loop + long-running-agent fit + owned-compute price: Tenki is the only one of these where the sandbox, the CI runners, and the AI code reviewer are one product over Luxor's own metal. The honest gaps: no public GPU (Modal, Daytona, Contree, Northflank expose it; Tenki doesn't yet) and no published sandbox benchmark yet. That second gap is a near-term DevRel deliverable — run an open, reproducible sandbox benchmark on the same "release the harness, publish where we lose" philosophy as the code-review study above.
Competitor figures are drawn from public sources and opencolin's agentic-engineering research; Tenki figures are first-party.
Every benchmark figure on this page (recall/precision/F1, runner speedups, and the headline 30%/60%) is Tenki first-party data, run on Tenki's own harness. We label it as such on purpose. The credibility doesn't come from claiming third-party validation; it comes from an open harness anyone can re-run, test sets drawn from bugs outside Tenki's benchmark repos, and a standing commitment to publish the columns where Tenki loses. That is the trust engine, and the reason this is Initiative 1 of the whole DevRel program.