InsForge MCP: The most reliable, context-efficient backend for AI agents

04 Dec 2025 · 8 minute read
Tony Chang

CTO & Co-Founder

InsForge MCPMark Benchmark Results

We are excited to share results from our evaluation of InsForge MCP using MCPMark, an open source benchmark that measures how well MCP servers handle complex database tasks. InsForge MCP consistently reaches higher accuracy while using fewer tokens across a wide range of workloads. Our MCP layer provides the most efficient context for AI agents to understand the backend and produce more reliable, precise output.

MCPMark Benchmark Results

The Problem

Large language models have limited context windows. When the model cannot see the full backend, it guesses at the current backend structure and hallucinates details that are not there, which leads to recurring failures.

Examples include (illustrated in the sketch after this list):

  • Joining tables that do not share a valid foreign key relationship
  • Ignoring row level security rules and exposing restricted data
  • Querying columns or tables that no longer exist in the current schema
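
As a concrete illustration, here is the kind of query a context-starved agent tends to produce. The tables, columns, and key below are hypothetical, chosen only to show the failure modes above:

sql
-- Hypothetical schema for illustration: the model assumes a foreign key
-- and a column that no longer exist in the current database.
SELECT u.plan_name, SUM(o.total)          -- plan_name was dropped in a migration
FROM orders o
JOIN users u ON o.user_uuid = u.uuid      -- no such key; the real column is o.user_id
GROUP BY u.plan_name;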

See here for a detailed explanation of why context matters: Why Context is Everything in AI Coding

The Benchmark

To quantify context efficiency and accuracy, we used the MCPMark Postgres Dataset. This benchmark evaluates whether an MCP server enables an agent to perform the required database operations correctly. The suite includes 21 real world tasks that cover analytical reporting, complex joins, migrations, CRUD logic, constraint reasoning, index creation, query optimization, row level security enforcement, trigger based consistency, transactional operations, audit logging, and vector search through pgvector. MCPMark also records accuracy, token usage, tool call counts, and run time, providing a clear and reproducible basis for comparing different MCP layers.

For this evaluation, we compared three MCP servers: Supabase MCP, Postgres MCP, and InsForge MCP. All three servers allow an LLM, acting as an MCP client, to inspect database schemas and perform database operations. Tests were run using Anthropic's Claude Sonnet 4.5 model. Each task was executed four times to reduce model variability and ensure stable results.

The Results

We evaluated token usage, run time, and Pass⁴ accuracy. Pass⁴ refers to strict accuracy: a task counts as successful only when the agent completes it correctly in all four runs. This gives a reliable picture of how stable each MCP server is across repeated executions.
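
As a reference for how this metric is computed, here is a minimal sketch in SQL, assuming a hypothetical results(task, run, success) table that stores the outcome of each of the four runs per task:

sql
-- A task passes only when all four of its runs succeed (bool_and),
-- and Pass⁴ is the share of tasks that clear that bar.
SELECT round(100.0 * count(*) FILTER (WHERE all_pass) / count(*), 1) AS pass4_pct
FROM (
  SELECT task, bool_and(success) AS all_pass
  FROM results
  GROUP BY task
) AS per_task;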

Run Time

Across the 21 tasks, InsForge completes each task in an average of 150 seconds, while Postgres MCP and Supabase MCP both require more than 200 seconds on average.

MCPMark Run Time Comparison

Token Usage

Token usage shows an even clearer difference. Summed across all 21 tasks, InsForge uses an average of 8.2 million tokens per benchmark run, compared with 10.4 million for Postgres MCP and 11.6 million for Supabase MCP: roughly 21 percent and 29 percent fewer tokens, respectively.

MCPMark Token Usage Comparison

Pass⁴ Accuracy

Pass⁴ counts a task as accurate only when the agent succeeds in all four runs. Under this strict measure, InsForge achieves 47.6 percent accuracy (10 of the 21 tasks), while Supabase MCP and Postgres MCP reach 28.6 percent (6 of 21) and 38.1 percent (8 of 21), respectively.

MCPMark Pass⁴ Accuracy Comparison

With InsForge, agents consistently make fewer mistakes and complete multi-step backend operations in less time and at lower cost. See Appendix 1 for the per-task breakdown across all test cases.

Deep Dive Example 1: RLS Setup

Task: security__rls_business_access

Implement Row Level Security policies for a social media platform with 5 tables (users, channels, posts, comments, channel_moderators).

Results Summary 1

Backend     Success Rate   Tokens Used   Turns
InsForge    4/4 (100%)     296K          15
Supabase    1/4 (25%)      340K          -
Postgres    4/4 (100%)     581K          23

InsForge's get-table-schema provides RLS-aware context:

json
{
  "users": {
    "schema": [...],
    "indexes": [...],
    "foreignKeys": [...],
    "rlsEnabled": false,
    "policies": [],
    "triggers": []
  }
}

The agent immediately knows: RLS is disabled, no policies exist. It can proceed directly to implementation.
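
As a minimal sketch of that first step (the policy name and predicate are illustrative, not the benchmark's exact specification):

sql
-- rlsEnabled: false and policies: [] are already confirmed by the schema
-- response, so the agent can emit these statements without probing queries.
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

CREATE POLICY users_self_access ON users
  FOR SELECT
  USING (id = current_setting('app.current_user_id')::int);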

Postgres's get_object_details lacks RLS information:

json
{
  "basic": { "schema": "public", "name": "users", "type": "table" },
  "columns": [...],
  "constraints": [...],
  "indexes": [...]
}

No rlsEnabled field. No policies field. The agent must run additional queries to check RLS status before proceeding.
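
In practice that means falling back to catalog queries like the following (illustrative, not taken from the benchmark transcript):

sql
-- Check whether RLS is enabled on the table
SELECT relname, relrowsecurity
FROM pg_class
WHERE relname = 'users';

-- List any policies that already exist
SELECT policyname, cmd, qual
FROM pg_policies
WHERE tablename = 'users';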

Execution Pattern Comparison

InsForge (15 turns, 296K tokens):

text
get-instructions          → Learns how to use InsForge Workflow
get-backend-metadata      → All 5 tables at a glance
get-table-schema × 5      → Full schema WITH RLS status (parallel)
run-raw-sql × 6           → Create functions, enable RLS, create policies

Postgres (23 turns, 581K tokens):

text
list_schemas              → Schema names only
list_objects              → Table names only
get_object_details × 5    → Schema WITHOUT RLS status
execute_sql               → Query to check current RLS status
execute_sql × 8           → Create functions, enable RLS, policies
execute_sql × 4           → Verification queries

Supabase MCP provides only list_tables for discovery. It returns table names without structure, constraints, or RLS status. The agent attempted blind migrations without understanding existing state:

text
list_tables               → Just table names
apply_migration           → Failed: naming conflict
apply_migration           → Retry, still incomplete

Without visibility into existing policies or RLS status, the agent couldn't reliably implement the required security model.

Deep Dive Example 2: Demographics Report

Task: employees__employee_demographics_report

Create four statistical tables for an annual HR demographics report: gender statistics, age group analysis, birth month distribution, and hiring year summary. Requires understanding employee-to-salary table relationships.

Results Summary 2

Backend     Success Rate   Tokens Used
InsForge    4/4 (100%)     207K
Supabase    3/4 (75%)      204K
Postgres    2/4 (50%)      220K

InsForge is the only backend with 100 percent reliability on this task. Token usage is similar across all three, but Supabase and Postgres lose runs to SQL logic errors.

InsForge's get-backend-metadata exposes record counts for each table:

json
{
  "tables": [
    { "schema": "employees", "tableName": "employee", "recordCount": 300024 },
    { "schema": "employees", "tableName": "salary", "recordCount": 2844047 }
  ]
}

This gives the agent a clear relationship signal:

  • 2.84M salary rows
  • 300K employees
  • roughly 9.5 salary records per employee

The agent immediately understands it must avoid a naive COUNT(*) over the JOIN and instead use COUNT(DISTINCT e.id). This prevents the most common metrics error in many-to-one relationships.

Both Supabase and Postgres lack record count visibility:

  • Supabase: list_tables shows names only
  • Postgres: list_objects + get_object_details show schema but no counts

Example of the failing SQL written when using Supabase and Postgres MCP:

sql
SELECT gender, COUNT(*)   -- ❌ counts salary rows, not employees
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id
GROUP BY gender;

This produces counts roughly 9.5 times too large, matching the row multiplication effect.

Correct SQL written when using InsForge MCP:

sql
SELECT gender, COUNT(DISTINCT e.id)
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id
GROUP BY gender;

InsForge's metadata gives agents the context they need to handle schema relationships correctly. Even small signals like record count can improve correctness in analytical workloads.
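
To see the effect directly, a quick sanity check (a sketch reusing the tables above) compares raw joined rows with distinct employees; the first count lands roughly 9.5 times above the second:

sql
-- Joined row count vs. distinct employees: the ratio exposes the
-- many-to-one multiplication before any report is built.
SELECT COUNT(*) AS joined_rows,
       COUNT(DISTINCT e.id) AS employees
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id;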

Conclusion

MCPMark makes it clear that reliable backend context matters. InsForge MCP helps agents complete database tasks more accurately and with fewer tokens by giving them a structured and complete view of the underlying schema.

This leads to fewer mistakes, fewer retries, and a more predictable natural-language development experience. If you want agents that can handle complex backend operations, InsForge MCP delivers the context needed to make that possible.

Appendix 1: Per Task Breakdown

Appendix 1 lists the results for every individual task in the MCPMark Postgres benchmark. For each task, we report two values for every MCP server:

  • Pass⁴ accuracy: how many of the four runs succeeded
  • Token usage: the average number of tokens consumed for that task

This table provides a detailed, per-task view of how InsForge MCP, Supabase MCP, and Postgres MCP performed across all 21 benchmark tasks.

For descriptions of what each task represents, feel free to explore the official task list at: https://mcpmark.ai/tasks?category=postgres

MCP Benchmark Results

#    Task                                          InsForge     Supabase      Postgres
1    chinook__customer_data_migration              4/4, 529K    3/4, 1,421K   2/4, 1,639K
2    chinook__employee_hierarchy_management        4/4, 248K    4/4, 260K     4/4, 230K
3    chinook__sales_and_music_charts               0/4, 264K    0/4, 260K     0/4, 302K
4    dvdrental__customer_analysis_fix              0/4, 221K    2/4, 355K     0/4, 333K
5    dvdrental__customer_analytics_optimization    4/4, 277K    3/4, 215K     4/4, 195K
6    dvdrental__film_inventory_management          4/4, 378K    4/4, 342K     4/4, 375K
7    employees__employee_demographics_report       4/4, 207K    3/4, 204K     2/4, 220K
8    employees__employee_performance_analysis      0/4, 316K    0/4, 505K     0/4, 211K
9    employees__employee_project_tracking          2/4, 596K    3/4, 321K     2/4, 286K
10   employees__employee_retention_analysis        0/4, 330K    0/4, 214K     0/4, 255K
11   employees__executive_dashboard_automation     0/4, 438K    1/4, 324K     0/4, 686K
12   employees__management_structure_analysis      4/4, 286K    2/4, 195K     4/4, 280K
13   lego__consistency_enforcement                 4/4, 573K    4/4, 346K     4/4, 787K
14   lego__database_security_policies              0/4, 327K    0/4, 374K     3/4, 552K
15   lego__transactional_inventory_transfer        1/4, 922K    2/4, 945K     0/4, 1,178K
16   security__rls_business_access                 4/4, 296K    1/4, 340K     4/4, 581K
17   security__user_permission_audit               0/4, 118K    0/4, 182K     0/4, 352K
18   sports__baseball_player_analysis              3/4, 681K    4/4, 1,221K   0/4, 645K
19   sports__participant_report_optimization       4/4, 248K    4/4, 653K     4/4, 219K
20   sports__team_roster_management                0/4, 318K    0/4, 1,785K   0/4, 369K
21   vectors__dba_vector_analysis                  4/4, 638K    4/4, 1,181K   4/4, 656K