We are excited to share results from our evaluation of InsForge MCP using MCPMark, an open-source benchmark that measures how well MCP servers handle complex database tasks. InsForge MCP consistently achieves higher accuracy while using fewer tokens across a wide range of workloads. Our MCP layer gives AI agents the most efficient context for understanding the backend, which leads to more reliable, precise output.

The Problem
Large language models have limited context windows. When a model cannot see the full backend, it guesses at the current backend structure, and those guesses lead to recurring failures.
Examples include:
- Joining tables that do not share a valid foreign key relationship
- Ignoring row-level security rules and exposing restricted data
- Querying columns or tables that no longer exist in the current schema
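As a hypothetical illustration of that last failure mode, here is the kind of query an agent writes when its picture of the schema is stale (the table and column names are invented for this example):
```sql
-- Hypothetical: the agent still "remembers" a column that a
-- migration has since renamed, so the query fails at runtime.
SELECT u.id, u.username, u.signup_date  -- signup_date was renamed to created_at
FROM users u
JOIN profiles p ON p.user_id = u.id;
-- ERROR: column u.signup_date does not exist
```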
For a detailed explanation of why context matters, see Why Context is Everything in AI Coding.
The Benchmark
To quantify context efficiency and accuracy, we used the MCPMark Postgres Dataset. This benchmark evaluates whether an MCP server enables an agent to perform the required database operations correctly. The suite includes 21 real-world tasks covering analytical reporting, complex joins, migrations, CRUD logic, constraint reasoning, index creation, query optimization, row-level security enforcement, trigger-based consistency, transactional operations, audit logging, and vector search through pgvector. MCPMark also records accuracy, token usage, tool call counts, and run time, providing a clear and reproducible basis for comparing different MCP layers.
For this evaluation, we compared three MCP servers: Supabase MCP, Postgres MCP, and InsForge MCP. All three allow an LLM, acting as an MCP client, to inspect database schemas and perform database operations. Tests were run with Anthropic's Claude Sonnet 4.5 model, and each task was executed four times to reduce model variability and ensure stable results.
The Results
We evaluated token usage, run time, and Pass⁴ accuracy. Pass⁴ is a strict measure: a task counts as successful only when the agent completes it correctly in all four runs. This gives a reliable picture of how stable each MCP server is across repeated executions.
Run Time
Across the 21 tasks, InsForge completes each task in an average of 150 seconds, while Postgres MCP and Supabase MCP both require more than 200 seconds on average.

Token Usage
Token usage shows an even clearer difference. Summed across all 21 tasks, InsForge uses an average of 8.2 million tokens per full benchmark run, compared with 10.4 million for Postgres MCP and 11.6 million for Supabase MCP.

Pass⁴ Accuracy
Pass⁴ counts a task as accurate only when the agent succeeds in all four runs. Under this strict measure, InsForge achieves 47.6 percent accuracy, while Supabase MCP and Postgres MCP reach 28.6 percent and 38.1 percent, respectively.

With InsForge MCP, agents consistently make fewer mistakes and complete multi-step backend operations in less time and at lower cost. See Appendix 1 for the per-task breakdown across all test cases.
Deep Dive Example 1: RLS Setup
Task: security__rls_business_access
Implement Row Level Security policies for a social media platform with 5 tables (users, channels, posts, comments, channel_moderators).
Results Summary 1
| Backend | Success Rate | Tokens Used | Turns |
|---|---|---|---|
| InsForge | 4/4 (100%) | 296K | 15 |
| Supabase | 1/4 (25%) | 340K | - |
| Postgres | 4/4 (100%) | 581K | 23 |
InsForge's get-table-schema provides RLS-aware context:
```json
{
  "users": {
    "schema": [...],
    "indexes": [...],
    "foreignKeys": [...],
    "rlsEnabled": false,
    "policies": [],
    "triggers": []
  }
}
```
The agent immediately knows that RLS is disabled and that no policies exist, so it can proceed directly to implementation.
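A minimal sketch of that implementation step (the policy name and predicate below are illustrative, not the benchmark's exact solution):
```sql
-- Illustrative only: enable RLS on one table and add a policy.
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

CREATE POLICY users_self_access ON users
  FOR SELECT
  USING (id = current_setting('app.current_user_id')::bigint);
```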
Postgres MCP's get_object_details lacks RLS information:
```json
{
  "basic": { "schema": "public", "name": "users", "type": "table" },
  "columns": [...],
  "constraints": [...],
  "indexes": [...]
}
```
No rlsEnabled field. No policies field. The agent must run additional queries to check RLS status before proceeding.
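For example, before it can safely proceed, the agent typically falls back to standard Postgres catalog queries such as:
```sql
-- Recover the RLS state that the tool output omits.
SELECT relrowsecurity FROM pg_class WHERE relname = 'users';

-- List any existing policies on the table.
SELECT policyname, cmd, qual FROM pg_policies WHERE tablename = 'users';
```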
Execution Pattern Comparison
InsForge (15 turns, 296K tokens):
```
get-instructions → Learns the InsForge workflow
get-backend-metadata → All 5 tables at a glance
get-table-schema × 5 → Full schema WITH RLS status (parallel)
run-raw-sql × 6 → Create functions, enable RLS, create policies
```
Postgres (23 turns, 581K tokens):
```
list_schemas → Schema names only
list_objects → Table names only
get_object_details × 5 → Schema WITHOUT RLS status
execute_sql → Query to check current RLS status
execute_sql × 8 → Create functions, enable RLS, policies
execute_sql × 4 → Verification queries
```
Supabase MCP provides only list_tables for discovery. It returns table names without structure, constraints, or RLS status. The agent attempted blind migrations without understanding existing state:
```
list_tables → Just table names
apply_migration → Failed: naming conflict
apply_migration → Retry, still incomplete
```
Without visibility into existing policies or RLS status, the agent couldn't reliably implement the required security model.
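As a hypothetical illustration of that naming conflict (the policy name here is invented), a blind migration collides with a policy left behind by an earlier partial run:
```sql
-- Hypothetical: re-running CREATE POLICY without first checking
-- pg_policies collides with an existing policy.
CREATE POLICY posts_read_access ON posts FOR SELECT USING (true);
-- ERROR: policy "posts_read_access" for table "posts" already exists
```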
Deep Dive Example 2: Demographics Report
Task: employees__employee_demographics_report
Create four statistical tables for an annual HR demographics report: gender statistics, age group analysis, birth month distribution, and hiring year summary. Requires understanding employee-to-salary table relationships.
Results Summary 2
| Backend | Success Rate | Tokens Used |
|---|---|---|
| InsForge | 4/4 (100%) | 207K |
| Supabase | 3/4 (75%) | 204K |
| Postgres | 2/4 (50%) | 220K |
InsForge is the only backend with 100% reliability. Token usage is similar across all three, but Supabase and Postgres each lose runs to SQL logic errors.
InsForge's get-backend-metadata exposes record counts for each table:
```json
{
  "tables": [
    { "schema": "employees", "tableName": "employee", "recordCount": 300024 },
    { "schema": "employees", "tableName": "salary", "recordCount": 2844047 }
  ]
}
```
This gives the agent a clear relationship signal:
- 2.84M salary rows
- 300K employees
- roughly 9.5 salary records per employee
The agent immediately understands that it must avoid a naive COUNT(*) over the JOIN and instead use COUNT(DISTINCT e.id). This prevents the most common metrics error in many-to-one relationships.
Both Supabase and Postgres lack record count visibility:
- Supabase: list_tables shows names only
- Postgres: list_objects + get_object_details show schema but no counts
A typical failing query written with Supabase and Postgres MCP:
```sql
SELECT gender, COUNT(*)  -- ❌ counts salary rows, not employees
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id
GROUP BY gender;
```
This produces results roughly 9.5 times too large, matching the row multiplication effect.
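A quick sanity check (hypothetical, not taken from the benchmark transcripts) makes the multiplication visible:
```sql
-- Fan-out ratio of the join: joined rows per distinct employee.
SELECT COUNT(*)::numeric / COUNT(DISTINCT e.id) AS rows_per_employee
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id;
-- Returns roughly 9.5 for this dataset.
```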
Correct SQL written when using InsForge MCP:
```sql
SELECT gender, COUNT(DISTINCT e.id)
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id
GROUP BY gender;
```
InsForge's metadata gives agents the context they need to handle schema relationships correctly. Even small signals like record count can improve correctness in analytical workloads.
Conclusion
MCPMark makes it clear that reliable backend context matters. InsForge MCP helps agents complete database tasks more accurately and with fewer tokens by giving them a structured and complete view of the underlying schema.
This leads to fewer mistakes, fewer retries, and a more predictable natural-language development experience. If you want agents that can handle complex backend operations, InsForge MCP delivers the context needed to make that possible.
Appendix 1: Per Task Breakdown
Appendix 1 lists the results for every individual task in the MCPMark Postgres benchmark. For each task, we report two values for every MCP server:
- Pass⁴ accuracy: how many of the four runs succeeded
- Token usage: the average number of tokens consumed for that task
This table provides a detailed, per-task view of how InsForge MCP, Supabase MCP, and Postgres MCP performed across all 21 benchmark tasks.
For a description of what each task involves, see the official task list at: https://mcpmark.ai/tasks?category=postgres
MCP Benchmark Results
| # | Task | InsForge (Pass⁴, Avg Tokens) | Supabase (Pass⁴, Avg Tokens) | Postgres (Pass⁴, Avg Tokens) |
|---|---|---|---|---|
| 1 | chinook__customer_data_migration | 4/4, 529K | 3/4, 1,421K | 2/4, 1,639K |
| 2 | chinook__employee_hierarchy_management | 4/4, 248K | 4/4, 260K | 4/4, 230K |
| 3 | chinook__sales_and_music_charts | 0/4, 264K | 0/4, 260K | 0/4, 302K |
| 4 | dvdrental__customer_analysis_fix | 0/4, 221K | 2/4, 355K | 0/4, 333K |
| 5 | dvdrental__customer_analytics_optimization | 4/4, 277K | 3/4, 215K | 4/4, 195K |
| 6 | dvdrental__film_inventory_management | 4/4, 378K | 4/4, 342K | 4/4, 375K |
| 7 | employees__employee_demographics_report | 4/4, 207K | 3/4, 204K | 2/4, 220K |
| 8 | employees__employee_performance_analysis | 0/4, 316K | 0/4, 505K | 0/4, 211K |
| 9 | employees__employee_project_tracking | 2/4, 596K | 3/4, 321K | 2/4, 286K |
| 10 | employees__employee_retention_analysis | 0/4, 330K | 0/4, 214K | 0/4, 255K |
| 11 | employees__executive_dashboard_automation | 0/4, 438K | 1/4, 324K | 0/4, 686K |
| 12 | employees__management_structure_analysis | 4/4, 286K | 2/4, 195K | 4/4, 280K |
| 13 | lego__consistency_enforcement | 4/4, 573K | 4/4, 346K | 4/4, 787K |
| 14 | lego__database_security_policies | 0/4, 327K | 0/4, 374K | 3/4, 552K |
| 15 | lego__transactional_inventory_transfer | 1/4, 922K | 2/4, 945K | 0/4, 1,178K |
| 16 | security__rls_business_access | 4/4, 296K | 1/4, 340K | 4/4, 581K |
| 17 | security__user_permission_audit | 0/4, 118K | 0/4, 182K | 0/4, 352K |
| 18 | sports__baseball_player_analysis | 3/4, 681K | 4/4, 1,221K | 0/4, 645K |
| 19 | sports__participant_report_optimization | 4/4, 248K | 4/4, 653K | 4/4, 219K |
| 20 | sports__team_roster_management | 0/4, 318K | 0/4, 1,785K | 0/4, 369K |
| 21 | vectors__dba_vector_analysis | 4/4, 638K | 4/4, 1,181K | 4/4, 656K |
