We are excited to share results from our evaluation of InsForge MCP using MCPMark, an open-source benchmark that measures how well MCP servers handle complex database tasks. InsForge MCP consistently achieves higher accuracy while using fewer tokens across a wide range of workloads. Our MCP layer gives AI agents the most efficient context for understanding the backend, which leads to more reliable, precise output.

The Problem
Large language models have limited context windows. When a model cannot see the full backend, it guesses at the current backend structure, and those guesses lead to recurring failures.
Examples include:
- Joining tables that do not share a valid foreign key relationship
- Ignoring row-level security rules and exposing restricted data
- Querying columns or tables that no longer exist in the current schema
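As a hypothetical illustration of that last failure mode, here is the kind of query an agent writes when its picture of the schema is stale (the table and column names are invented for this example):
```sql
-- Hypothetical: the agent still "remembers" a column that a
-- migration has since renamed, so the query fails at runtime.
SELECT u.id, u.username, u.signup_date  -- signup_date was renamed to created_at
FROM users u
JOIN profiles p ON p.user_id = u.id;
-- ERROR: column u.signup_date does not exist
```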
For a detailed explanation of why context matters, see Why Context is Everything in AI Coding.
The Benchmark
To quantify context efficiency and accuracy, we used the MCPMark Postgres Dataset. This benchmark evaluates whether an MCP server enables an agent to perform the required database operations correctly. The suite includes 21 real-world tasks covering analytical reporting, complex joins, migrations, CRUD logic, constraint reasoning, index creation, query optimization, row-level security enforcement, trigger-based consistency, transactional operations, audit logging, and vector search through pgvector. MCPMark also records accuracy, token usage, tool call counts, and run time, providing a clear and reproducible basis for comparing different MCP layers.
For this evaluation, we compared three MCP servers: Supabase MCP, Postgres MCP, and InsForge MCP. All three allow an LLM, acting as an MCP client, to inspect database schemas and perform database operations. Tests were run with Anthropic's Claude Sonnet 4.5 model, and each task was executed four times to reduce model variability and ensure stable results.
The Results
We evaluated token usage, run time, and Pass⁴ accuracy. Pass⁴ is a strict measure: a task counts as successful only when the agent completes it correctly in all four runs. This gives a reliable picture of how stable each MCP server is across repeated executions.
Run Time
Across the 21 tasks, InsForge completes each task in an average of 150 seconds, while Postgres MCP and Supabase MCP both require more than 200 seconds on average.

Token Usage
Token usage shows an even clearer difference. Summed across all 21 tasks, InsForge uses an average of 8.2 million tokens per full benchmark run, compared with 10.4 million for Postgres MCP and 11.6 million for Supabase MCP.

Pass⁴ Accuracy
Pass⁴ counts a task as accurate only when the agent succeeds in all four runs. Under this strict measure, InsForge achieves 47.6 percent accuracy, while Supabase MCP and Postgres MCP reach 28.6 percent and 38.1 percent, respectively.

With InsForge MCP, agents consistently make fewer mistakes and complete multi-step backend operations in less time and at lower cost. See Appendix 1 for the per-task breakdown across all test cases.
Deep Dive Example 1: RLS Setup
Task: security__rls_business_access
Implement Row Level Security policies for a social media platform with 5 tables (users, channels, posts, comments, channel_moderators).
Results Summary 1
| Backend | Success Rate | Tokens Used | Turns |
|---|---|---|---|
| InsForge | 4/4 (100%) | 296K | 15 |
| Supabase | 1/4 (25%) | 340K | - |
| Postgres | 4/4 (100%) | 581K | 23 |
InsForge's get-table-schema provides RLS-aware context:
```json
{
  "users": {
    "schema": [...],
    "indexes": [...],
    "foreignKeys": [...],
    "rlsEnabled": false,
    "policies": [],
    "triggers": []
  }
}
```
The agent immediately knows that RLS is disabled and that no policies exist, so it can proceed directly to implementation.
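A minimal sketch of that implementation step (the policy name and predicate below are illustrative, not the benchmark's exact solution):
```sql
-- Illustrative only: enable RLS on one table and add a policy.
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

CREATE POLICY users_self_access ON users
  FOR SELECT
  USING (id = current_setting('app.current_user_id')::bigint);
```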
Postgres MCP's get_object_details lacks RLS information:
```json
{
  "basic": { "schema": "public", "name": "users", "type": "table" },
  "columns": [...],
  "constraints": [...],
  "indexes": [...]
}
```
No rlsEnabled field. No policies field. The agent must run additional queries to check RLS status before proceeding.
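For example, before it can safely proceed, the agent typically falls back to standard Postgres catalog queries such as:
```sql
-- Recover the RLS state that the tool output omits.
SELECT relrowsecurity FROM pg_class WHERE relname = 'users';

-- List any existing policies on the table.
SELECT policyname, cmd, qual FROM pg_policies WHERE tablename = 'users';
```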
Execution Pattern Comparison
InsForge (15 turns, 296K tokens):
```
get-instructions → Learns the InsForge workflow
get-backend-metadata → All 5 tables at a glance
get-table-schema × 5 → Full schema WITH RLS status (parallel)
run-raw-sql × 6 → Create functions, enable RLS, create policies
```
Postgres (23 turns, 581K tokens):
```
list_schemas → Schema names only
list_objects → Table names only
get_object_details × 5 → Schema WITHOUT RLS status
execute_sql → Query to check current RLS status
execute_sql × 8 → Create functions, enable RLS, policies
execute_sql × 4 → Verification queries
```
Supabase MCP provides only list_tables for discovery. It returns table names without structure, constraints, or RLS status. The agent attempted blind migrations without understanding existing state:
```
list_tables → Just table names
apply_migration → Failed: naming conflict
apply_migration → Retry, still incomplete
```
Without visibility into existing policies or RLS status, the agent couldn't reliably implement the required security model.
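As a hypothetical illustration of that naming conflict (the policy name here is invented), a blind migration collides with a policy left behind by an earlier partial run:
```sql
-- Hypothetical: re-running CREATE POLICY without first checking
-- pg_policies collides with an existing policy.
CREATE POLICY posts_read_access ON posts FOR SELECT USING (true);
-- ERROR: policy "posts_read_access" for table "posts" already exists
```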
Deep Dive Example 2: Demographics Report
Task: employees__employee_demographics_report
Create four statistical tables for an annual HR demographics report: gender statistics, age group analysis, birth month distribution, and hiring year summary. Requires understanding employee-to-salary table relationships.
Results Summary 2
| Backend | Success Rate | Tokens Used |
|---|---|---|
| InsForge | 4/4 (100%) | 207K |
| Supabase | 3/4 (75%) | 204K |
| Postgres | 2/4 (50%) | 220K |
InsForge is the only backend with 100% reliability. Token usage is similar across all three, but Supabase and Postgres each lose runs to SQL logic errors.
InsForge's get-backend-metadata exposes record counts for each table:
```json
{
  "tables": [
    { "schema": "employees", "tableName": "employee", "recordCount": 300024 },
    { "schema": "employees", "tableName": "salary", "recordCount": 2844047 }
  ]
}
```
This gives the agent a clear relationship signal:
- 2.84M salary rows
- 300K employees
- roughly 9.5 salary records per employee
The agent immediately understands that it must avoid a naive COUNT(*) over the JOIN and instead use COUNT(DISTINCT e.id). This prevents the most common metrics error in many-to-one relationships.
Both Supabase and Postgres lack record count visibility:
- Supabase: list_tables shows names only
- Postgres: list_objects + get_object_details show schema but no counts
A typical failing query written with Supabase and Postgres MCP:
```sql
SELECT gender, COUNT(*)  -- ❌ counts salary rows, not employees
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id
GROUP BY gender;
```
This produces results roughly 9.5 times too large, matching the row multiplication effect.
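A quick sanity check (hypothetical, not taken from the benchmark transcripts) makes the multiplication visible:
```sql
-- Fan-out ratio of the join: joined rows per distinct employee.
SELECT COUNT(*)::numeric / COUNT(DISTINCT e.id) AS rows_per_employee
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id;
-- Returns roughly 9.5 for this dataset.
```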
Correct SQL written when using InsForge MCP:
```sql
SELECT gender, COUNT(DISTINCT e.id)
FROM employees.employee e
LEFT JOIN employees.salary s ON e.id = s.employee_id
GROUP BY gender;
```
InsForge's metadata gives agents the context they need to handle schema relationships correctly. Even small signals like record count can improve correctness in analytical workloads.
Conclusion
MCPMark makes it clear that reliable backend context matters. InsForge MCP helps agents complete database tasks more accurately and with fewer tokens by giving them a structured and complete view of the underlying schema.
This leads to fewer mistakes, fewer retries, and a more predictable natural-language development experience. If you want agents that can handle complex backend operations, InsForge MCP delivers the context needed to make that possible.
Appendix 1: Per Task Breakdown
Appendix 1 lists the results for every individual task in the MCPMark Postgres benchmark. For each task, we report two values for every MCP server:
- Pass⁴ accuracy: how many of the four runs succeeded
- Token usage: the average number of tokens consumed for that task
This table provides a detailed, per-task view of how InsForge MCP, Supabase MCP, and Postgres MCP performed across all 21 benchmark tasks.
For a description of what each task involves, see the official task list at: https://mcpmark.ai/tasks?category=postgres
MCP Benchmark Results
| # | Task | InsForge (Pass⁴, Avg Tokens) | Supabase (Pass⁴, Avg Tokens) | Postgres (Pass⁴, Avg Tokens) |
|---|---|---|---|---|
| 1 | chinook__customer_data_migration | 4/4, 529K | 3/4, 1,421K | 2/4, 1,639K |
| 2 | chinook__employee_hierarchy_management | 4/4, 248K | 4/4, 260K | 4/4, 230K |
| 3 | chinook__sales_and_music_charts | 0/4, 264K | 0/4, 260K | 0/4, 302K |
| 4 | dvdrental__customer_analysis_fix | 0/4, 221K | 2/4, 355K | 0/4, 333K |
| 5 | dvdrental__customer_analytics_optimization | 4/4, 277K | 3/4, 215K | 4/4, 195K |
| 6 | dvdrental__film_inventory_management | 4/4, 378K | 4/4, 342K | 4/4, 375K |
| 7 | employees__employee_demographics_report | 4/4, 207K | 3/4, 204K | 2/4, 220K |
| 8 | employees__employee_performance_analysis | 0/4, 316K | 0/4, 505K | 0/4, 211K |
| 9 | employees__employee_project_tracking | 2/4, 596K | 3/4, 321K | 2/4, 286K |
| 10 | employees__employee_retention_analysis | 0/4, 330K | 0/4, 214K | 0/4, 255K |
| 11 | employees__executive_dashboard_automation | 0/4, 438K | 1/4, 324K | 0/4, 686K |
| 12 | employees__management_structure_analysis | 4/4, 286K | 2/4, 195K | 4/4, 280K |
| 13 | lego__consistency_enforcement | 4/4, 573K | 4/4, 346K | 4/4, 787K |
| 14 | lego__database_security_policies | 0/4, 327K | 0/4, 374K | 3/4, 552K |
| 15 | lego__transactional_inventory_transfer | 1/4, 922K | 2/4, 945K | 0/4, 1,178K |
| 16 | security__rls_business_access | 4/4, 296K | 1/4, 340K | 4/4, 581K |
| 17 | security__user_permission_audit | 0/4, 118K | 0/4, 182K | 0/4, 352K |
| 18 | sports__baseball_player_analysis | 3/4, 681K | 4/4, 1,221K | 0/4, 645K |
| 19 | sports__participant_report_optimization | 4/4, 248K | 4/4, 653K | 4/4, 219K |
| 20 | sports__team_roster_management | 0/4, 318K | 0/4, 1,785K | 0/4, 369K |
| 21 | vectors__dba_vector_analysis | 4/4, 638K | 4/4, 1,181K | 4/4, 656K |
