Initiative 1 · The trust engine

The Tenki Benchmark Series.

The flagship content asset of the DevRel program: reproducible, open-harness benchmarks across AI code review, CI runners, and agent sandboxes, published where Tenki loses, not just where it wins. For an unknown brand with a real precision gap, honesty is the credibility. It is also literally a Luxor value: "honesty over kindness." Every benchmark number below is Tenki first-party; the whole point is that you can re-run it.

Cadence: Quarterly + per-model-release
Coverage: AI review · runners · sandbox
Method: Open harness, reproducible

The flagship

TenkiBench — the agent × open-model matrix.

Score every coding agent against every open model on real coding tasks — then slice it both ways. Pick a model and see which agent gets the most out of it; pick an agent and see which open model it actually shines on. The pairing matters more than either pick, and no neutral benchmark maps it — SWE-bench and terminal-bench rank models alone.

Agents

Every coding agent

Claude Code, Codex, Aider, Goose, ForgeCode, Droid… ~25 agents run side by side (the dockingstation harness).

Models

Every open model

Qwen3-Coder, GLM, Kimi-K2, DeepSeek, Llama… served on Luxor GPU compute — no frontier API required.

The map

Sliced both ways

Pick a model → the agent ranking flips. Pick an agent → the best open model changes. The winners aren't who you'd guess.
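To make "sliced both ways" concrete, here is a minimal sketch of the two views over one score matrix. The agent and model names and every score are placeholders invented for illustration, not TenkiBench results, and the function names are hypothetical.

```python
# Hypothetical pass rates (fraction of tasks solved) per agent x model cell.
# Names and numbers are placeholders for illustration, not TenkiBench results.
scores = {
    ("agent_a", "model_x"): 0.42,
    ("agent_a", "model_y"): 0.31,
    ("agent_b", "model_x"): 0.35,
    ("agent_b", "model_y"): 0.47,
    ("agent_c", "model_x"): 0.40,
    ("agent_c", "model_y"): 0.29,
}


def rank_agents_for_model(scores, model):
    """Slice by model: agents sorted from best to worst on that model."""
    cells = [(agent, s) for (agent, m), s in scores.items() if m == model]
    return sorted(cells, key=lambda cell: cell[1], reverse=True)


def rank_models_for_agent(scores, agent):
    """Slice by agent: models sorted from best to worst for that agent."""
    cells = [(model, s) for (a, model), s in scores.items() if a == agent]
    return sorted(cells, key=lambda cell: cell[1], reverse=True)


def best_pairing(scores):
    """The single highest-scoring (agent, model) cell in the matrix."""
    return max(scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    print(rank_agents_for_model(scores, "model_x"))
    print(rank_agents_for_model(scores, "model_y"))
    print(best_pairing(scores))
```

With these placeholder values, agent_a tops model_x while agent_b tops model_y, which is the kind of ranking flip the matrix is meant to surface.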

It runs entirely on Tenki — that's the point. Runners orchestrate a daily sweep (schedule → fan out → score → publish); Sandbox runs each agent×model cell in its own Firecracker microVM, in parallel; Luxor compute serves the open models on GPU. The benchmark itself proves the thesis: the agent era runs on Tenki, top to bottom.
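The schedule → fan out → score → publish loop can be sketched in a few lines. This is an illustrative outline only: `run_cell` is a hypothetical stand-in for launching one agent×model cell in its own sandbox, and a local thread pool stands in for the real parallel microVMs. None of these names come from a Tenki API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_cell(agent, model):
    """Hypothetical stand-in for running one agent x model cell in a sandbox.

    Returns a dummy score so the sketch runs anywhere.
    """
    return {"agent": agent, "model": model, "score": 0.0}


def sweep(agents, models, run=run_cell, max_workers=8):
    """Fan out: run every agent x model cell in parallel and collect results."""
    cells = list(product(agents, models))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `cells` in its results.
        return list(pool.map(lambda cell: run(*cell), cells))


def publish(results):
    """Group scored cells into a nested {model: {agent: score}} table."""
    table = {}
    for r in results:
        table.setdefault(r["model"], {})[r["agent"]] = r["score"]
    return table


if __name__ == "__main__":
    results = sweep(["agent_a", "agent_b"], ["model_x", "model_y"])
    print(publish(results))
```

Keeping each cell a pure function of (agent, model) is what makes the fan-out trivially parallel: no cell depends on another cell's result.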

Bring your own evals. Point your own task suite at the matrix and run it on Tenki — turning the public leaderboard into a hosted agent×model eval surface. Developers come to run their evals and stay as users. Neutral by design: Tenki ships infra, not a coding agent, so the upsets get published straight. Phasing: fixed matrix weekly → daily → self-serve.

The philosophy

Open harness. Publish where we lose.

Most vendor benchmarks are marketing dressed as data — cherry-picked repos, hidden harnesses, no way to reproduce. The Tenki Benchmark Series inverts that. The harness is open, the test sets are bugs that aren't in Tenki's own benchmark repos, and we publish the columns where competitors beat us. That is exactly what turns a benchmark from a brag into a trust asset — and it's the only way an unknown brand earns credibility on a metric.

Reproducible

Open harness

The test harness and methodology are public so anyone — competitors included — can re-run the numbers. No hidden setup, no cherry-picked repos.

Honest

Publish where we lose

We report the columns Tenki trails on (precision, today). Showing the loss is what makes the win believable — and it maps a public roadmap.

On-brand

"Honesty over kindness"

A stated Luxor value, made operational. The benchmark is the clearest proof that Tenki's marketing tells developers the truth.

First-party

We say so, plainly

Every figure here is Tenki-run, first-party data — labeled as such. The credibility comes from reproducibility, not from pretending to be third-party.

Benchmark 1

AI Code Review — recall, precision, F1.

How many real bugs each reviewer catches (recall), how many of its flags are real (precision), and the harmonic mean of the two (F1). Tenki leads recall by a wide margin, leads F1 narrowly (41.7 vs Devin's 40.9), and trails on precision. We publish all three, because the gap is the story.

68.9% · Tenki recall · ~2× the next-best reviewer
41.7 · Tenki F1 (best) · highest balanced score in the field
29.9% · Tenki precision · the honest gap, and the roadmap

| Reviewer | Recall | Precision | F1 |
| --- | --- | --- | --- |
| Tenki | 68.9% | 29.9% | 41.7 |
| Devin | 36.1% | 47.3% | 40.9 |
| Cursor | 32.0% | 51.3% | 39.4 |
| CodeRabbit | 28.7% | 25.0% | 26.7 |
| Greptile | 36.1% | 15.9% | 22.1 |
| Copilot | 24.6% | 18.9% | 21.4 |
| Graphite | 3.3% | 50.0% | 6.2 |

Tenki leads recall and F1 — and trails on precision. It catches roughly twice as many real bugs as the next reviewer (recall 68.9%) and posts the highest balanced score (F1 41.7), but only ~30% of its flags are true positives, behind Cursor (51.3%), Graphite (50.0%), and Devin (47.3%). We publish that gap on purpose: precision is the public roadmap and the content thesis. "We catch the most bugs; here's exactly how we're closing the noise" is a more credible — and more interesting — story than a sanitized win, and it's the spine of Initiative 1.
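Because F1 is the harmonic mean of recall and precision, the F1 column can be re-derived from the other two. The sketch below recomputes it from the published recall and precision figures. Inputs are the rounded percentages from the table above, so small rounding differences against the unrounded harness output are possible, and `f1_score` and `mismatches` are illustrative names rather than part of the Tenki harness.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall %, precision %, published F1), copied from the table above.
reviewers = {
    "Tenki": (68.9, 29.9, 41.7),
    "Devin": (36.1, 47.3, 40.9),
    "Cursor": (32.0, 51.3, 39.4),
    "CodeRabbit": (28.7, 25.0, 26.7),
    "Greptile": (36.1, 15.9, 22.1),
    "Copilot": (24.6, 18.9, 21.4),
    "Graphite": (3.3, 50.0, 6.2),
}


def mismatches(rows, tolerance=0.05):
    """Names whose recomputed F1 differs from the published one by more than tolerance."""
    return [
        name
        for name, (recall, precision, published) in rows.items()
        if abs(f1_score(recall, precision) - published) > tolerance
    ]


if __name__ == "__main__":
    for name, (recall, precision, published) in reviewers.items():
        print(f"{name}: recomputed {f1_score(recall, precision):.1f}, published {published}")
    print("mismatches:", mismatches(reviewers))
```

The harmonic mean punishes imbalance, which is why Graphite's 50.0% precision cannot rescue its 3.3% recall, and why Tenki's F1 lead over Devin is narrow despite the large recall lead.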

Benchmark 2

Runner performance — faster and cheaper, flagged honestly.

Drop-in GitHub Actions runners on Firecracker microVMs over Luxor's owned compute. Headline: ~30% faster and up to 60% cheaper. Per-workload speedups vary widely — and where a number is cache-driven, we say so rather than imply it's pure hardware.

30%↑ · Faster (typical) · vs GitHub-hosted runners
60%↓ · Cheaper (up to) · owned-compute cost moat
+48% · n8n monorepo · real-world monorepo CI
+99% · Go (cache-driven) · flagged: cache, not raw hardware

| Workload | Speedup vs GitHub-hosted | Note |
| --- | --- | --- |
| Rust build | +40% | Compute-bound build |
| Docker build | +30% | Image build pipeline |
| Android build | +37% | Toolchain-heavy build |
| n8n monorepo | +48% | Real-world monorepo CI |
| Go build | +99% | Cache-driven; flagged honestly, not representative of raw-hardware gains |

Why the honest flag matters. The Go result nearly doubles throughput — but it's cache-driven, not a raw-silicon win, so we label it that way. The same discipline that publishes the precision gap publishes the asterisk on a flattering speedup. Buyers evaluating runners can trust the ~30% / up-to-60% headline precisely because we don't dress up the outliers. The cost advantage is structural — Luxor owns the metal — so we can lead with economics without a fragile dollar-savings claim.
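A quick way to see why the flag matters is to summarize the per-workload speedups with and without the cache-driven row. The sketch below uses the figures from the table above. The `cache_driven` field, the assumption that only the Go row is cache-driven, and the choice of a plain arithmetic mean are all illustrative, not how the headline number is derived.

```python
from statistics import mean

# Speedups vs GitHub-hosted runners, in percent, from the table above.
# Assumption: only the Go build is cache-driven, as flagged in the table.
results = [
    {"workload": "Rust build", "speedup": 40, "cache_driven": False},
    {"workload": "Docker build", "speedup": 30, "cache_driven": False},
    {"workload": "Android build", "speedup": 37, "cache_driven": False},
    {"workload": "n8n monorepo", "speedup": 48, "cache_driven": False},
    {"workload": "Go build", "speedup": 99, "cache_driven": True},
]


def summarize(results):
    """Mean speedup over all rows vs. over unflagged rows only."""
    unflagged = [r["speedup"] for r in results if not r["cache_driven"]]
    return {
        "mean_all": mean(r["speedup"] for r in results),
        "mean_unflagged": mean(unflagged),
        "flagged": [r["workload"] for r in results if r["cache_driven"]],
    }


if __name__ == "__main__":
    print(summarize(results))
```

Including the flagged Go row lifts the mean from 38.75 to 50.8, so one cache-driven outlier accounts for about twelve points of apparent speedup. That is the distortion the asterisk guards against.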

The cadence

A series, not a one-off.

Benchmarks ship quarterly and on every major model release — because the AI-review field reshuffles each time a new frontier model lands. Three tracks run in rotation, each owning its metric definitions before competitors do.

1 · AI Code Review track

Recall · precision · F1 (vs CodeRabbit, Greptile, Copilot, Cursor, Devin, Graphite)

Re-run every major model release on bugs that are not in Tenki's own benchmark repos. Owns the recall/precision/F1 definitions for the category and tracks the precision gap closing over time.

2 · Runners track

Speed + $/run (vs GitHub-hosted, Depot, Blacksmith, Namespace)

Per-workload speed and cost with a public cost calculator. Cache-driven results flagged. Quarterly, timed to the GitHub-pricing conversation.

3 · Sandbox track

Boot latency + cost (vs E2B, Modal, Daytona)

microVM boot time and per-second cost for agent workloads. The agent-era proof point, co-published with the reference architectures.
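As a rough illustration of what the sandbox track measures, the sketch below times repeated boots and reports latency percentiles plus a per-second cost. It is a minimal outline under stated assumptions: `boot` is any callable that starts a sandbox (hypothetical here), the per-second price is a caller-supplied parameter, and none of this reflects the actual Tenki harness or any vendor SDK.

```python
import time
from statistics import quantiles


def measure_boot_latency_ms(boot, runs=20):
    """Call `boot` repeatedly and return wall-clock latencies in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        boot()  # hypothetical: starts one sandbox and returns when it is ready
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize_latency(samples_ms):
    """p50 / p95 / max over the samples (needs at least two samples)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples_ms)}


def cost_per_run(seconds, price_per_second):
    """Cost of one agent run billed per second at a caller-supplied price."""
    return seconds * price_per_second


if __name__ == "__main__":
    # A no-op stands in for a real sandbox boot so the sketch runs anywhere.
    samples = measure_boot_latency_ms(lambda: None, runs=10)
    print(summarize_latency(samples))
```

Reporting percentiles rather than a single best-case number keeps a lucky fast boot from standing in for typical behavior.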

Drop #1: the AI-reviewer teardown. The series opens with a first-party, open-harness study: Tenki vs CodeRabbit vs Greptile on real open-source bugs the reviewers have never seen. Same methodology as the table above, fully reproducible, published where Tenki loses. It is the first proof that the benchmark is a trust asset, not a brag, and it lands in Week 2 of the plan.

The competitive frame

Sandbox competitive landscape.

Before the sandbox benchmark ships, here's the honest frame: the purpose-built agent-sandbox category, on the dimensions that actually decide a buy — isolation, cold start, persistence, GPU, and product shape. Tenki sits squarely in the Firecracker microVM pack on speed; the difference is what's wrapped around the sandbox.

| Vendor | Isolation | Cold start | Persistence | GPU | Note |
| --- | --- | --- | --- | --- | --- |
| Tenki | Firecracker microVM | <100ms boot | Persistent volumes + snapshot/restore | No public GPU | Sandbox-first; native ADE app, integrated with runners + AI review, owned compute |
| E2B | Firecracker microVM | ~150ms | Ephemeral / pause (beta) | No | SOC 2; 200M+ sandboxes; SDK-first |
| Daytona | Docker | ~90ms | Stateful / unlimited | Yes | "Fastest creation," Computer Use desktops; but pricey and hard to self-host |
| Modal | gVisor | Sub-sec | Snapshots | Yes (A100/H100) | 50K+ concurrency; SOC 2 |
| Sprites.dev | Firecracker | Instant | Indefinite + hibernate (~300ms) | No | Zero idle cost |
| Blaxel | Firecracker | ~25ms resume | Standby snapshots | No | YC X25; $7.3M seed; SOC 2 / HIPAA / ISO 27001 |
| Contree (Nebius) | microVM | Sub-sec | Git-like branching + snapshots | Yes | MCP-native; 7,000+ SWE-bench environments |
| Runloop | Custom hypervisor | Sub-sec | Snapshots | No | SWE-bench focus; 10K+ parallel; VPC |
| Northflank | microVM / gVisor | Sub-sec | Stateful | Yes (H100) | Enterprise VPC, multi-cloud |
| Vercel Sandbox | Firecracker | Sub-sec | Snapshot / hibernate | No | Part of the Vercel Agents stack |

Tenki's cold start (<100ms) is competitive with the whole Firecracker pack — but raw speed is not the differentiation. E2B, Sprites.dev, Blaxel, and Vercel all live in the same microVM-boot neighborhood, and Blaxel's ~25ms resume is faster. The wedge is the native ADE + the integrated runners-and-review loop + long-running-agent fit + owned-compute price: Tenki is the only one of these where the sandbox, the CI runners, and the AI code reviewer are one product over Luxor's own metal. The honest gaps: no public GPU (Modal, Daytona, Contree, Northflank expose it; Tenki doesn't yet) and no published sandbox benchmark yet. That second gap is a near-term DevRel deliverable — run an open, reproducible sandbox benchmark on the same "release the harness, publish where we lose" philosophy as the code-review study above.

Competitor figures are drawn from public sources and opencolin's agentic-engineering research; Tenki figures are first-party.

Methodology & disclosure

All numbers are Tenki first-party.

Every benchmark figure on this page (recall/precision/F1, runner speedups, and the headline 30%/60%) is Tenki first-party data, run on Tenki's own harness; only the competitor specs in the sandbox landscape table are third-party-sourced, as noted under that table. We label it as such on purpose. The credibility doesn't come from claiming third-party validation; it comes from an open harness anyone can re-run, test sets drawn from bugs outside Tenki's benchmark repos, and a standing commitment to publish the columns where Tenki loses. That is the trust engine, and the reason this is Initiative 1 of the whole DevRel program.