MCPMark v2: InsForge on Sonnet 4.6

02 Mar 2026 · 4 minute read
Tony Chang

CTO & Co-Founder

MCPMark v2 Benchmark Results

In December we published the first MCPMark benchmark results comparing InsForge MCP, Supabase MCP, and Postgres MCP across 21 real-world database tasks using Claude Sonnet 4.5. InsForge came out ahead on accuracy, speed, and token efficiency.

We reran the benchmarks, this time on Claude Sonnet 4.6, the latest model from Anthropic. InsForge MCP achieves 28% higher Pass⁴ accuracy while using 2.4x fewer tokens than Supabase MCP. The efficiency gap has widened.

Updated Results: Claude Sonnet 4.6

Same 21 MCPMark Postgres tasks, 4 runs per task, strict Pass⁴ scoring.

| Metric | InsForge | Supabase MCP |
| --- | --- | --- |
| Pass⁴ Accuracy | 42.86% | 33.33% |
| Pass@1 Average | 58.33% | 47.62% |
| Pass@4 | 76.19% | 66.67% |
| Tokens Per Run | 7.3M | 17.9M |
| Avg Tokens Per Task | 358K | 862K |
| Avg Time Per Task | 156.6s | 198.8s |
| Avg Turns Per Task | 18.6 | 17.0 |

InsForge MCP maintains higher accuracy across all three metrics and uses 2.4x fewer tokens per run.

Accuracy

Pass@1 is average single-run accuracy, Pass@4 means the agent passed at least once in 4 runs, and Pass⁴ requires passing all 4. InsForge passes 76% of tasks at least once (Pass@4) and 43% under strict repeated execution (Pass⁴). Supabase MCP reaches 67% and 33% respectively.
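The three scoring metrics can be sketched in a few lines. This is an illustrative computation, not MCPMark's actual implementation; the function names and the example results matrix below are made up.

```python
# Pass metrics over a results matrix: one row per task,
# one boolean per run (True = the run passed the checker).

def pass_at_1_avg(results):
    # Average single-run success rate across all task-runs.
    total = sum(sum(runs) for runs in results)
    count = sum(len(runs) for runs in results)
    return total / count

def pass_at_k(results):
    # Fraction of tasks that passed at least once in k runs.
    return sum(any(runs) for runs in results) / len(results)

def pass_pow_k(results):
    # Strict Pass^k: fraction of tasks that passed ALL k runs.
    return sum(all(runs) for runs in results) / len(results)

# Hypothetical 3-task benchmark, 4 runs per task.
runs = [
    [True, True, True, True],      # stable task
    [True, False, True, False],    # flaky task
    [False, False, False, False],  # failing task
]
print(pass_at_1_avg(runs))  # 0.5
print(pass_at_k(runs))      # ~0.667
print(pass_pow_k(runs))     # ~0.333
```

Note how the flaky task counts toward Pass@4 but not Pass⁴, which is why the strict metric is always the lowest of the three.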

| Model | Metric | InsForge | Supabase MCP |
| --- | --- | --- | --- |
| Sonnet 4.5 | Pass⁴ | 47.6% | 28.6% |
| Sonnet 4.6 | Pass⁴ | 42.86% | 33.33% |
| Sonnet 4.6 | Pass@4 | 76.19% | 66.67% |
| Sonnet 4.6 | Pass@1 Avg | 58.33% | 47.62% |

InsForge's accuracy advantage comes from surfacing backend state before the agent acts. When the agent can see record counts, RLS policies, and foreign keys upfront, it writes correct queries on the first attempt instead of guessing and retrying. That is why the gap holds across model versions.

The Token Gap Widened

This is the most notable change from v1.

With Sonnet 4.5, InsForge used approximately 30% fewer tokens than Supabase MCP (8.2M vs 11.6M per run). With Sonnet 4.6, the gap has grown to 59% fewer tokens (7.3M vs 17.9M per run).

| Model | InsForge Tokens/Run | Supabase Tokens/Run | Difference |
| --- | --- | --- | --- |
| Sonnet 4.5 | 8.2M | 11.6M | 1.4x |
| Sonnet 4.6 | 7.3M | 17.9M | 2.4x |

InsForge got slightly more efficient on Sonnet 4.6 (8.2M down to 7.3M). Supabase MCP went in the opposite direction (11.6M up to 17.9M). The newer model appears to reason more extensively when backend context is incomplete, which increases token consumption on backends that do not surface schema details upfront.
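The "percent fewer" and "x-fold" figures above follow directly from the per-run token counts. A quick back-of-envelope check (the `gap` helper is ours, not benchmark code):

```python
# Token-gap arithmetic from the per-run totals in the table above.

def gap(insforge_tokens, supabase_tokens):
    ratio = supabase_tokens / insforge_tokens          # x-fold gap
    pct_fewer = (1 - insforge_tokens / supabase_tokens) * 100
    return ratio, pct_fewer

print(gap(8.2, 11.6))   # Sonnet 4.5: ~1.4x, ~29% fewer
print(gap(7.3, 17.9))   # Sonnet 4.6: ~2.45x, ~59% fewer
```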

When the backend provides structured context from the start, the agent reasons less and executes more. When it does not, the agent compensates with additional discovery queries and verification steps, and that compensation costs more tokens on a more capable model.

Where the Extra Tokens Go

Two factors account for most of the gap:

  1. Documentation overhead. Supabase's search_docs returns full GraphQL schema metadata on every call, consuming 5-10x more tokens per query than InsForge's fetch-docs.
  2. Exploration before execution. Without structured schema context upfront, the agent runs more discovery queries before doing actual work.

Speed

InsForge completes tasks in an average of 156.6 seconds compared to 198.8 seconds for Supabase MCP. This is a 1.27x speed advantage, consistent with what we observed on Sonnet 4.5.

What This Means

The core finding from our original benchmark post holds: agents perform better when the backend gives them structured context and workflow upfront. The Sonnet 4.6 results reinforce this and show that the advantage grows as models become more capable.

More capable models do not eliminate the need for structured backend context. They amplify the cost of not having it.

We will continue running benchmarks as new models are released and as we improve the InsForge MCP layer. All benchmark methodology follows MCPMark standards and is fully reproducible. The latest raw results are available on GitHub.

Try It

InsForge on GitHub

Quickstart guide