In December we published the first MCPMark benchmark results comparing InsForge MCP, Supabase MCP, and Postgres MCP across 21 real-world database tasks using Claude Sonnet 4.5. InsForge came out ahead on accuracy, speed, and token efficiency.
We reran the benchmarks. This time on Claude Sonnet 4.6, the latest model from Anthropic. InsForge MCP achieves 28% higher Pass⁴ accuracy while using 2.4x fewer tokens than Supabase MCP. The efficiency gap has widened.
## Updated Results: Claude Sonnet 4.6
Same 21 MCPMark Postgres tasks, 4 runs per task, strict Pass⁴ scoring.
| Metric | InsForge | Supabase MCP |
|---|---|---|
| Pass⁴ Accuracy | 42.86% | 33.33% |
| Pass@1 Average | 58.33% | 47.62% |
| Pass@4 | 76.19% | 66.67% |
| Tokens Per Run | 7.3M | 17.9M |
| Avg Tokens Per Task | 358K | 862K |
| Avg Time Per Task | 156.6s | 198.8s |
| Avg Turns Per Task | 18.6 | 17.0 |
InsForge MCP maintains higher accuracy across all three metrics and uses 2.4x fewer tokens per run.
## Accuracy
Pass@1 is average single-run accuracy, Pass@4 means the agent passed at least once in 4 runs, and Pass⁴ requires passing all 4. InsForge passes 76% of tasks at least once (Pass@4) and 43% under strict repeated execution (Pass⁴). Supabase MCP reaches 67% and 33% respectively.
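As a minimal sketch of the scoring (not the MCPMark harness itself), the three metrics can be computed from per-task pass/fail results like this:

```python
def score(results):
    """results: one list per task, one boolean per run (True = task passed).
    Returns (Pass@1 average, Pass@4, Pass^4) as fractions."""
    n = len(results)
    pass_at_1 = sum(sum(runs) / len(runs) for runs in results) / n  # avg single-run accuracy
    pass_at_4 = sum(any(runs) for runs in results) / n              # passed at least once
    pass_pow_4 = sum(all(runs) for runs in results) / n             # passed every run
    return pass_at_1, pass_at_4, pass_pow_4

# Toy example: 3 tasks x 4 runs
results = [
    [True, True, True, True],      # consistently passes
    [True, False, True, False],    # flaky
    [False, False, False, False],  # consistently fails
]
p1, p4, strict = score(results)  # -> 0.5, ~0.667, ~0.333
```

Pass⁴ is the strictest of the three: a task that passes three runs out of four counts toward Pass@4 and Pass@1 but not Pass⁴.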
| Model | Metric | InsForge | Supabase MCP |
|---|---|---|---|
| Sonnet 4.5 | Pass⁴ | 47.6% | 28.6% |
| Sonnet 4.6 | Pass⁴ | 42.86% | 33.33% |
| Sonnet 4.6 | Pass@4 | 76.19% | 66.67% |
| Sonnet 4.6 | Pass@1 Avg | 58.33% | 47.62% |
InsForge's accuracy advantage comes from surfacing backend state before the agent acts. When the agent can see record counts, RLS policies, and foreign keys upfront, it writes correct queries on the first attempt instead of guessing and retrying. That is why the gap holds across model versions.
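As a purely hypothetical illustration (the field names below are invented, not InsForge's actual MCP response format), the difference is between receiving backend state in one structured payload versus discovering it piecemeal:

```python
# Hypothetical structured-context payload an MCP server might surface
# before the agent writes any SQL (field names are illustrative only)
schema_context = {
    "tables": {
        "orders": {
            "row_count": 1204,
            "foreign_keys": [{"column": "user_id", "references": "users.id"}],
            "rls_policies": ["owner_can_read"],
        },
    },
}

# Without such a payload, the agent must reconstruct the same facts
# with discovery queries against the Postgres catalogs, e.g.:
discovery_queries = [
    "SELECT count(*) FROM orders;",
    "SELECT conname FROM pg_constraint WHERE contype = 'f';",
    "SELECT polname FROM pg_policy;",
]
```

Each discovery round trip adds turns, tokens, and a chance to guess wrong before the real work starts.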
## The Token Gap Widened
This is the most notable change from v1.
With Sonnet 4.5, InsForge used approximately 30% fewer tokens than Supabase MCP (8.2M vs 11.6M per run). With Sonnet 4.6, the gap has grown to 59% fewer tokens (7.3M vs 17.9M per run).
| Model | InsForge Tokens/Run | Supabase Tokens/Run | Difference |
|---|---|---|---|
| Sonnet 4.5 | 8.2M | 11.6M | 1.4x |
| Sonnet 4.6 | 7.3M | 17.9M | 2.4x |
InsForge got slightly more efficient on Sonnet 4.6 (8.2M down to 7.3M). Supabase MCP went in the opposite direction (11.6M up to 17.9M). The newer model appears to reason more extensively when backend context is incomplete, which increases token consumption on backends that do not surface schema details upfront.
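The ratios are straightforward to verify from the per-run totals in the table (the prose truncates 2.45x to 2.4x):

```python
# Tokens per full benchmark run, taken from the table above
runs = {
    "Sonnet 4.5": {"insforge": 8.2e6, "supabase": 11.6e6},
    "Sonnet 4.6": {"insforge": 7.3e6, "supabase": 17.9e6},
}

for model, t in runs.items():
    ratio = t["supabase"] / t["insforge"]          # how many times more tokens Supabase MCP uses
    reduction = 1 - t["insforge"] / t["supabase"]  # InsForge's relative saving
    print(f"{model}: {ratio:.2f}x, {reduction:.0%} fewer tokens")
# Sonnet 4.5: 1.41x, 29% fewer tokens
# Sonnet 4.6: 2.45x, 59% fewer tokens
```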
When the backend provides structured context from the start, the agent reasons less and executes more. When it does not, the agent compensates with additional discovery queries and verification steps, and that compensation costs more tokens on a more capable model.
## Where the Extra Tokens Go
Two factors account for most of the gap:
- Documentation overhead. Supabase's `search_docs` returns full GraphQL schema metadata on every call, 5-10x more tokens per query than InsForge's `fetch-docs`.
- Exploration before execution. Without structured schema context upfront, the agent runs more discovery queries before doing actual work.
## Speed
InsForge completes tasks in an average of 156.6 seconds compared to 198.8 seconds for Supabase MCP. This is a 1.27x speed advantage, consistent with what we observed on Sonnet 4.5.
## What This Means
The core finding from our original benchmark post holds: agents perform better when the backend gives them structured context and workflow upfront. The Sonnet 4.6 results reinforce this and show that the advantage grows as models become more capable.
More capable models do not eliminate the need for structured backend context. They amplify the cost of not having it.
We will continue running benchmarks as new models are released and as we improve the InsForge MCP layer. All benchmark methodology follows MCPMark standards and is fully reproducible. The latest raw results are available on GitHub.