How Context-First MCP Design Cuts Agent Failures on Backend Tasks

TL;DR

The problem: AI agents handle frontend tasks well, but backend operations expose a consistent gap. Even when MCP is connected, most backends return table names without record counts, schema without RLS state, and tool responses without success signals. The agent fills that gap with extra queries, retries, and guesses.
Why existing backends fall short: Platforms like Supabase and Postgres MCP were designed for human operators who can read a dashboard, check a UI, and make judgment calls. When an agent is the one operating the backend, that design assumption breaks at every step.
What changes with an agent-native backend: InsForge builds MCP into the backend from the start, not on top of it. Two tool calls give the agent a full map of the backend and deep context for the table it is working on, with live record counts, RLS state, and policy definitions returned before the agent writes a single query. Across 21 MCPMark tasks, the design produces 47.6% Pass⁴ accuracy (against 38.1% for Postgres MCP and 28.6% for Supabase MCP), up to 30% fewer tokens, and 1.6x faster execution.

Introduction

AI agents handle frontend tasks well. Give one a component structure, a routing pattern, or a state management problem, and it will usually get it right. But backend operations are different.

Ask an agent to implement Row Level Security on a production Postgres database. It will write the policies, run the migration, and return success. But if the MCP server it is connected to does not expose RLS status in its schema response, the agent has no way to verify what it actually changed. It assumes the operation worked. The policies it wrote may be incomplete, applied to the wrong roles, or missing entire tables. Nothing throws an error, but the data access rules are silently wrong.

This is not a model quality problem. It is a context problem. And it happens consistently, across models, across tasks, even when MCP is part of the setup.

In this article, we examine where agents consistently fail on backend tasks, what the current MCP layer design in most backends is missing, and how a backend built specifically for agents changes the execution pattern.

MCP Exists but Agents Still Fail

Most major backend platforms have MCP servers now. Supabase has one. Postgres has one. The assumption is that once you connect an agent to a backend via MCP, the backend visibility problem is solved. But connecting to an MCP server does not determine what that server returns.

MCP is a protocol. It defines how tools are called and how responses are structured. It does not define what those responses contain. That part is entirely up to whoever built the MCP server.

Here is what Supabase MCP returns when an agent calls list_tables:

json

["users", "orders", "products", "sessions"]

The response contains table names only, with no record counts, no foreign key relationships, no RLS status, no policy definitions, no trigger logic, and no index information.

Supabase was built for a specific use case, and it does that well. Tools like Lovable, Bolt, and v0 use it because it is fast to set up, visually manageable, and good for getting a product off the ground quickly. The dashboard-first design works for that workflow because a human is always in the loop, reading errors, checking the UI, and making judgment calls.

When an agent is the one operating the backend, there is no human in the loop. The agent cannot open the Supabase dashboard and read the RLS policy panel. It can only work with what the MCP tool returns. And what it returns is not enough to act correctly on anything beyond a basic read or write.

So the agent does what any system does when it lacks information. It runs more queries. It guesses. It retries. Each of those extra steps costs tokens and time, and none of them are guaranteed to surface the right answer.

Every extra query the agent runs, every retry, every guess is a symptom of the same underlying gap. The MCP layer was built for a human operator, and the agent is paying the cost of that assumption at runtime.

How Agents Actually Fail

The surface problem shows up in four consistent failure patterns. These are not edge cases or misconfigured setups. They happen across models, across tasks, and across backends that have MCP connected.

Two of the failures below reference MCPMark tasks. MCPMark is an open-source benchmark that measures how well MCP servers support language models on real database tasks. Each task is a non-trivial backend operation run against actual databases with real data volumes. The task names map directly to the operation being tested. They are used here because they are reproducible, measurable, and not hypothetical.

Failure 1: Non-Deterministic Tool Calls

Agents interact with backends by calling tools, and some of those tools trigger real, irreversible operations like creating a resource, provisioning infrastructure, or writing to a database. When one of those calls times out, the agent has no success signal to work from. The tool response came back empty, the operation has no ID, and the backend has no idempotency key, so the agent does what it was built to do: it retries. By the time the second request lands, the first one has already gone through. Two identical resources now exist in production, and neither the agent nor the developer caught it in real time.

APIs built for humans assume someone will check the dashboard after an operation. Agents cannot do that. They need a deterministic success or failure signal from the tool response itself, or they will keep retrying until something breaks.
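One common way to give a retrying agent a deterministic signal is an idempotency key. The sketch below is a minimal in-memory illustration, not InsForge's API: `FakeBackend`, `create_resource`, and the `replayed` field are hypothetical names invented for this example. It simulates a first response lost to a timeout, followed by the agent's retry.

```python
import uuid


class FakeBackend:
    """Hypothetical in-memory backend that deduplicates by idempotency key."""

    def __init__(self):
        self.resources = {}  # resource_id -> name
        self._seen = {}      # idempotency_key -> resource_id

    def create_resource(self, name, idempotency_key):
        # A repeated key returns the original result instead of creating again.
        if idempotency_key in self._seen:
            return {"ok": True, "id": self._seen[idempotency_key], "replayed": True}
        resource_id = str(uuid.uuid4())
        self.resources[resource_id] = name
        self._seen[idempotency_key] = resource_id
        return {"ok": True, "id": resource_id, "replayed": False}


def create_after_lost_response(backend, name):
    """One key per logical operation, reused on the retry."""
    key = str(uuid.uuid4())
    backend.create_resource(name, idempotency_key=key)         # response lost to a timeout
    return backend.create_resource(name, idempotency_key=key)  # the agent retries


backend = FakeBackend()
result = create_after_lost_response(backend, "uploads-bucket")
print(result["replayed"], len(backend.resources))  # True 1
```

Because the retry carries the same key, the second call returns the original resource ID with an explicit success signal, and only one resource exists.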

Failure 2: Schema Blindness in Real Databases

The MCPMark task employees__employee_demographics_report asks an agent to generate gender statistics, age group breakdowns, birth month distributions, and hiring year summaries from an HR database.

The employee table had 300,024 rows. The salary table had 2,844,047 rows. That is roughly 9.5 salary records per employee.

The MCP server returns table names and column definitions. No record counts. The agent sees a join between two tables and writes the most natural query:

sql

SELECT e.gender, COUNT(*)  -- counts salary rows, not employees
FROM employee e
LEFT JOIN salary s ON e.id = s.employee_id
GROUP BY e.gender;

The query runs clean. No error. The output is 9.5 times too large because COUNT(*) multiplies across joined salary rows instead of counting distinct employees. The agent returns it as correct.

Figure: Why COUNT(*) returns the wrong number on a joined table, and what the correct query looks like

The backend had the record counts. It just never surfaced them.
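The row multiplication is easy to reproduce at small scale. The following sketch uses SQLite as a stand-in for Postgres and a handful of made-up rows. The table and column names follow the query above, and everything else is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, gender TEXT);
    CREATE TABLE salary (employee_id INTEGER, amount INTEGER);
    INSERT INTO employee VALUES (1, 'F'), (2, 'M');
    INSERT INTO salary VALUES (1, 50), (1, 55), (1, 60), (2, 40), (2, 45);
""")

# COUNT(*) counts one row per joined salary record.
inflated = conn.execute("""
    SELECT e.gender, COUNT(*)
    FROM employee e LEFT JOIN salary s ON e.id = s.employee_id
    GROUP BY e.gender ORDER BY e.gender
""").fetchall()

# COUNT(DISTINCT e.id) counts each employee once, however many salary rows join.
correct = conn.execute("""
    SELECT e.gender, COUNT(DISTINCT e.id)
    FROM employee e LEFT JOIN salary s ON e.id = s.employee_id
    GROUP BY e.gender ORDER BY e.gender
""").fetchall()

print(inflated)  # [('F', 3), ('M', 2)]
print(correct)   # [('F', 1), ('M', 1)]
```

Both queries run without error. Only the second one answers the question that was asked.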

Failure 3: Missing Context Compounds Cost

The MCPMark task security__rls_business_access asks an agent to implement Row Level Security policies across 5 tables on a social media platform.

The MCP server's get_object_details returns schema, columns, constraints, and indexes for each table. No rlsEnabled field. No policies array. The agent does not know RLS is disabled. It has to run a separate SQL query just to check the current RLS status before it can start the actual task. Then it runs verification queries after implementation to confirm the policies applied correctly.

text

list_schemas          → schema names only
list_objects          → table names only
get_object_details × 5  → schema, no RLS status
execute_sql           → check current RLS status
execute_sql × 8       → create functions, enable RLS, create policies
execute_sql × 4       → verification queries

23 turns. 581K tokens. Every extra query in that path exists because RLS state was missing from the original response.
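For reference, the status check the agent has to issue through execute_sql can look roughly like the query below. This is a sketch rather than the query recorded in the benchmark run: it assumes the tables live in the `public` schema, and the sample rows are made up to show how the result would be read.

```python
# Catalog query an agent could send through execute_sql on Postgres.
# pg_class.relrowsecurity is true when RLS is enabled on the table.
RLS_STATUS_SQL = """
SELECT c.relname AS table_name, c.relrowsecurity AS rls_enabled
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
ORDER BY c.relname;
"""


def tables_without_rls(rows):
    """Return the table names whose rls_enabled flag is false."""
    return [row["table_name"] for row in rows if not row["rls_enabled"]]


# Illustrative rows shaped like the query result; the values are made up.
sample_rows = [
    {"table_name": "channels", "rls_enabled": False},
    {"table_name": "posts", "rls_enabled": True},
    {"table_name": "users", "rls_enabled": False},
]
print(tables_without_rls(sample_rows))  # ['channels', 'users']
```

The information is one catalog lookup away, but the agent pays a full turn to fetch it whenever the schema tool does not include it.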

Failure 4: No Guardrails for Autonomous Operations

An agent runs a schema migration. The migration has a logic error. There is no audit log of what the agent changed, no agent-scoped permissions limiting what operations it can run, and no rollback path. Production breaks. Nothing was recorded.

Backends built for human developers assume a human is reviewing the migration before it runs. When an agent is the one initiating the change, that assumption breaks. There is no mechanism to scope what the agent is allowed to modify, no record of what it actually did, and no way to recover cleanly if something goes wrong.
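The three missing pieces (scope, audit trail, rollback) can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any platform's implementation: it uses SQLite in place of Postgres, an in-memory list in place of a durable audit log, and a hypothetical prefix allowlist as the agent's scope.

```python
import sqlite3

ALLOWED_PREFIXES = ("CREATE TABLE", "ALTER TABLE", "CREATE INDEX")  # hypothetical agent scope
audit_log = []  # a real system would use durable storage


def run_agent_migration(conn, agent_id, statements):
    """Apply all statements atomically, refusing anything outside the agent's scope."""
    for sql in statements:
        if not sql.strip().upper().startswith(ALLOWED_PREFIXES):
            audit_log.append({"agent": agent_id, "sql": sql, "status": "denied"})
            return False
    conn.execute("BEGIN")
    try:
        for sql in statements:
            conn.execute(sql)
        conn.execute("COMMIT")
    except sqlite3.Error as exc:
        conn.execute("ROLLBACK")  # undo every statement in this migration
        audit_log.append({"agent": agent_id, "sql": statements, "status": f"rolled back: {exc}"})
        return False
    audit_log.append({"agent": agent_id, "sql": statements, "status": "applied"})
    return True


conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions only
applied = run_agent_migration(conn, "agent-1", [
    "CREATE TABLE t (id INTEGER)",
    "ALTER TABLE missing ADD COLUMN x TEXT",  # logic error: table does not exist
])
leftover = conn.execute("SELECT name FROM sqlite_master WHERE name = 't'").fetchall()
print(applied, leftover)  # False []
```

The failed migration leaves no partial state behind, and the audit log records what was attempted and why it was undone.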

What an Agent-Native Backend Looks Like

The failures in the previous section share a common cause. The backend gave the agent a name when it needed a state. A table name when it needed a record count. A schema when it needed a policy definition.

Fixing this is not about a better prompt or a smarter model. It is about what the MCP layer returns by default. That means the backend itself has to be designed around what an agent needs to see, not retrofitted with MCP after the fact.

That is a different design starting point entirely. It is what separates a backend built for humans from one built for agents. InsForge is built on that starting point.

InsForge is an open-source backend platform built for AI-assisted development. It provides database management, authentication, storage, serverless functions, and AI integrations, with APIs structured specifically for deterministic agent execution. Unlike backends that expose MCP as a layer on top of a human-facing platform, InsForge builds the MCP design into the backend itself.

Its MCP server is built around two principles: hierarchical context and a live source of truth. Every tool call is designed so the agent gets exactly what it needs for the current step, with a clear signal pointing to what it needs next.

Figure: Backend retrofitted with MCP vs a backend built around agent execution from the start

In practice, that means two layers: one that gives the agent a map of the entire backend, and one that gives it the full detail for exactly the table it is working on.

Layer 1: Global Context with get-backend-metadata

The first call an agent makes is get-backend-metadata. It returns the full backend surface in one response: every table with its live record count, auth configuration, storage buckets, AI model integrations, and a built-in hint that tells the agent exactly what tool to call next.

json

{
  "auth": {
    "oauths": [
      {
        "provider": "google",
        "clientId": null,
        "redirectUri": null,
        "scopes": ["openid", "email", "profile"],
        "useSharedKey": true
      }
    ]
  },
  "database": {
    "tables": [
      { "tableName": "users", "recordCount": 1 }
    ],
    "hint": "To retrieve detailed schema information for a specific table, call the get-table-schema tool with the table name."
  },
  "storage": {
    "buckets": [],
    "totalSizeInGB": 0
  },
  "aiIntegration": {
    "models": [
      {
        "inputModality": ["text", "image"],
        "outputModality": ["text"],
        "modelId": "anthropic/claude-sonnet-4.5"
      }
    ]
  },
  "version": "1.0.0"
}

The recordCount field directly resolves Failure 2. Before writing a single query, the agent already knows there are 300,024 employee rows and 2,844,047 salary rows. From that roughly 9.5:1 ratio it can infer that each employee has many salary rows, so COUNT(*) on a join will multiply rows. It writes COUNT(DISTINCT e.id) on the first attempt.

The hint field resolves the discovery loop problem. The agent does not have to guess what tool to call next or run exploratory queries to understand the backend topology. The response tells it directly.

Figure: Everything the agent gets from a single get-backend-metadata call
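The inference an agent can draw from recordCount is simple arithmetic. The sketch below uses the response shape shown above with the employee and salary counts from Failure 2. The `row_ratio` helper and the "ratio above 1" rule are assumptions made for illustration, not part of the InsForge response.

```python
metadata = {
    "database": {
        "tables": [
            {"tableName": "employee", "recordCount": 300024},
            {"tableName": "salary", "recordCount": 2844047},
        ]
    }
}


def row_ratio(metadata, parent, child):
    """Rows in `child` per row in `parent`, computed from recordCount values."""
    counts = {t["tableName"]: t["recordCount"] for t in metadata["database"]["tables"]}
    return counts[child] / counts[parent]


ratio = row_ratio(metadata, "employee", "salary")
needs_distinct = ratio > 1.0  # a join would repeat each parent row
print(round(ratio, 2), needs_distinct)  # 9.48 True
```

A ratio well above 1 is the cue that a join on this pair will fan out and that per-employee counts need DISTINCT or a pre-aggregated subquery.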

Layer 2: Local Context with get-table-schema

Once the agent knows which table it needs to work with, it calls get-table-schema. This returns the full definition for that table in a single response.

json

{
  "users": {
    "schema": [
      { "columnName": "id", "dataType": "uuid", "isNullable": "NO", "columnDefault": null },
      { "columnName": "nickname", "dataType": "text", "isNullable": "YES", "columnDefault": null },
      { "columnName": "bio", "dataType": "text", "isNullable": "YES", "columnDefault": null },
      { "columnName": "created_at", "dataType": "timestamp with time zone", "isNullable": "YES", "columnDefault": "now()" },
      { "columnName": "updated_at", "dataType": "timestamp with time zone", "isNullable": "YES", "columnDefault": "now()" }
    ],
    "indexes": [
      {
        "indexname": "users_pkey",
        "indexdef": "CREATE UNIQUE INDEX users_pkey ON public.users USING btree (id)",
        "isUnique": true,
        "isPrimary": true
      }
    ],
    "foreignKeys": [
      {
        "constraintName": "users_id_fkey",
        "columnName": "id",
        "foreignTableName": "accounts",
        "foreignColumnName": "id",
        "deleteRule": "CASCADE",
        "updateRule": "NO ACTION"
      }
    ],
    "rlsEnabled": true,
    "policies": [
      {
        "policyname": "Enable read access for all users",
        "cmd": "SELECT",
        "roles": "{public}",
        "qual": "true",
        "withCheck": null
      },
      {
        "policyname": "Disable delete for users",
        "cmd": "DELETE",
        "roles": "{authenticated}",
        "qual": "false",
        "withCheck": null
      },
      {
        "policyname": "Enable update for users based on user_id",
        "cmd": "UPDATE",
        "roles": "{authenticated}",
        "qual": "(uid() = id)",
        "withCheck": "(uid() = id)"
      },
      {
        "policyname": "Allow project_admin to update any user",
        "cmd": "UPDATE",
        "roles": "{project_admin}",
        "qual": "true",
        "withCheck": "true"
      }
    ],
    "triggers": []
  }
}

This single response resolves Failure 3 completely. The agent sees rlsEnabled: true and the full policy definitions before it writes any SQL. It knows exactly which roles have access to which operations, what the qual conditions are, and what withCheck constraints apply. With full policy definitions and RLS state returned upfront, there is nothing left to discover or guess.

Compare this to what Postgres MCP returns for the same table:

json

{
  "basic": { "schema": "public", "name": "users", "type": "table" },
  "columns": [...],
  "constraints": [...],
  "indexes": [...]
}

No rlsEnabled field. No policies array. The agent has no way to know whether RLS exists on this table from this response alone. It has to spend extra queries checking RLS state before it starts and verifying the result afterwards, which on security__rls_business_access accounts for 285K extra tokens.

Figure: Postgres MCP vs InsForge: what the same table call returns
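From the agent's side, the difference between the two responses comes down to whether the RLS fields are present at all. A minimal sketch follows, using trimmed copies of the two responses above. The `rls_context` helper and its return shape are hypothetical.

```python
def rls_context(detail):
    """Classify what a table-detail response tells the agent about RLS."""
    if "rlsEnabled" not in detail or "policies" not in detail:
        return {"known": False, "next_step": "query RLS state with SQL"}
    covered = sorted({policy["cmd"] for policy in detail["policies"]})
    return {"known": True, "enabled": detail["rlsEnabled"], "commands_with_policies": covered}


# Trimmed versions of the two responses shown above.
insforge_detail = {
    "rlsEnabled": True,
    "policies": [
        {"policyname": "Enable read access for all users", "cmd": "SELECT"},
        {"policyname": "Disable delete for users", "cmd": "DELETE"},
        {"policyname": "Enable update for users based on user_id", "cmd": "UPDATE"},
        {"policyname": "Allow project_admin to update any user", "cmd": "UPDATE"},
    ],
}
postgres_mcp_detail = {
    "basic": {"schema": "public", "name": "users", "type": "table"},
    "columns": [], "constraints": [], "indexes": [],
}

print(rls_context(insforge_detail)["commands_with_policies"])  # ['DELETE', 'SELECT', 'UPDATE']
print(rls_context(postgres_mcp_detail)["known"])               # False
```

With the first response the agent can also see that no INSERT policy exists yet. With the second it cannot say anything about RLS until it runs more SQL.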

Why This Is an Architecture Decision, Not a Feature

The two-layer design is intentional.

get-backend-metadata is global context: it gives the agent a high-level map of the entire backend without overloading the context window.
get-table-schema is local context: scoped, deep, and called only for the table the agent is actively working on.

This matters because of how context windows work in practice. Loading full schema details for every table upfront can consume 20K to 30K tokens of irrelevant information, pushing out logic the agent wrote earlier in the session. The hierarchical design keeps the context window clean while ensuring the agent always has what it needs for the current operation.

The hint field in get-backend-metadata is what connects the two layers. The agent does not have to reason about what to fetch next. The backend tells it. That is the difference between a backend that was retrofitted with MCP and one that was designed around how agents actually operate.
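The calling pattern can be sketched with a stub in place of a real MCP client. Everything about `StubClient` is hypothetical, including the `call_tool` signature and the `tableName` argument name. Only the two tool names and the response shapes come from the examples above.

```python
class StubClient:
    """Hypothetical stand-in for an MCP client; records which tools are called."""

    def __init__(self, tables):
        self.tables = tables  # table name -> record count
        self.calls = []

    def call_tool(self, name, arguments=None):
        arguments = arguments or {}
        self.calls.append((name, arguments))
        if name == "get-backend-metadata":
            return {"database": {"tables": [
                {"tableName": t, "recordCount": n} for t, n in self.tables.items()
            ]}}
        if name == "get-table-schema":
            return {arguments["tableName"]: {"rlsEnabled": False, "policies": []}}
        raise ValueError(f"unknown tool: {name}")


def load_context(client, tables_in_task):
    """Global map first, then deep detail only for the tables the task touches."""
    metadata = client.call_tool("get-backend-metadata")
    known = {t["tableName"] for t in metadata["database"]["tables"]}
    schemas = {}
    for table in tables_in_task:
        if table in known:
            schemas.update(client.call_tool("get-table-schema", {"tableName": table}))
    return metadata, schemas


client = StubClient({"users": 1, "posts": 10, "comments": 50})
metadata, schemas = load_context(client, ["posts"])
print(len(client.calls), list(schemas))  # 2 ['posts']
```

Three tables exist, but only one detailed schema enters the context window, because the task touches only one table.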

The context layer handles Failures 2 and 3. But a fully agent-native backend has to go further. Failures 1 and 4, non-deterministic operations and unguarded autonomous changes, are addressed at the platform level. InsForge's tool contracts return deterministic success and failure signals by design. Agent-initiated schema changes are logged, scoped, and reversible. The MCP layer and the platform contract layer work together.

Figure: How the two-layer context cycle works: global map first, table detail second

The Numbers

The architecture described in the previous section is not a theoretical improvement. MCPMark makes it measurable.

InsForge, Supabase MCP, and Postgres MCP were all evaluated against the same 21 tasks using Anthropic Claude Sonnet 4.5 as the model. Each task was run 4 consecutive times.

The accuracy metric used is Pass⁴. A task counts as successful only if the agent completes it correctly in all four independent runs. Not once. Not three out of four. All four. This is what reliability actually looks like in production.
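The metric as described can be computed as follows. This is a sketch of the definition given here, not MCPMark's own scoring code, and the task names and run results are placeholders.

```python
def pass_k(runs_by_task):
    """Fraction of tasks for which every run succeeded."""
    passed = sum(1 for runs in runs_by_task.values() if all(runs))
    return passed / len(runs_by_task)


runs = {
    "task_a": [True, True, True, True],   # counts as a pass
    "task_b": [True, True, True, False],  # 3 of 4 still counts as a failure
}
print(pass_k(runs))  # 0.5
```

A backend that succeeds three times out of four on every task would score 0% under this metric, which is why it is stricter than an average success rate.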

Case Study 1: Demographics Report

Task: employees__employee_demographics_report

Generate gender statistics, age group breakdowns, birth month distributions, and hiring year summaries from an HR database with an employee table (300,024 rows) and a salary table (2,844,047 rows).

Backend	Success Rate	Tokens Used
InsForge	4/4 (100%)	207K
Supabase MCP	3/4 (75%)	204K
Postgres MCP	2/4 (50%)	220K

Token usage is similar across all three. The failures are not a cost problem. They are a correctness problem caused entirely by missing record count information.

Neither Supabase MCP nor Postgres MCP tells the agent how many rows each table contains. The agent sees a join between two tables and writes the most natural query, which counts salary rows instead of employees. The output is 9.5 times too large. No error is thrown. The agent returns it as correct.

InsForge surfaces record counts in the first call, so the agent sees the row ratio before writing any SQL, knows COUNT(*) will multiply rows on this join, and writes COUNT(DISTINCT e.id) on the first attempt, passing all four runs. The only difference is the two record counts in the first MCP response.

Case Study 2: RLS Setup

Task: security__rls_business_access

Implement Row Level Security policies across 5 tables on a social media platform: users, channels, posts, comments, channel_moderators.

Backend	Success Rate	Tokens Used	Turns
InsForge	4/4 (100%)	296K	15
Supabase MCP	1/4 (25%)	340K	—
Postgres MCP	4/4 (100%)	581K	23

InsForge and Postgres MCP both reach 100% accuracy. But Postgres MCP uses 581K tokens and 23 turns to get there. InsForge uses 296K tokens and 15 turns. The 285K token difference is not model behavior. It is the direct cost of the agent not knowing RLS state upfront.

Postgres MCP's execution path:

text

list_schemas          → schema names only
list_objects          → table names only
get_object_details × 5  → schema, no RLS status
execute_sql           → query to check current RLS status
execute_sql × 8       → create functions, enable RLS, create policies
execute_sql × 4       → verification queries

InsForge's execution path:

text

get-instructions        → learns InsForge workflow
get-backend-metadata    → all 5 tables at a glance
get-table-schema × 5    → full schema with RLS status (parallel)
run-raw-sql × 6         → create functions, enable RLS, create policies

The Postgres agent runs extra queries to check RLS status and then verify the policies applied correctly because the information was never in the original response. InsForge skips both phases entirely because it was already there.
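Tallying the two listings makes the gap concrete. The figures below are copied from the execution paths and the results table above. The dictionary labels for the three execute_sql phases are added here only to keep them apart.

```python
postgres_mcp_path = {
    "list_schemas": 1,
    "list_objects": 1,
    "get_object_details": 5,
    "execute_sql (check RLS)": 1,
    "execute_sql (implement)": 8,
    "execute_sql (verify)": 4,
}
insforge_path = {
    "get-instructions": 1,
    "get-backend-metadata": 1,
    "get-table-schema": 5,
    "run-raw-sql": 6,
}

postgres_calls = sum(postgres_mcp_path.values())
insforge_calls = sum(insforge_path.values())
token_gap_k = 581 - 296  # thousands of tokens, from the results table

print(postgres_calls, insforge_calls, token_gap_k)  # 20 13 285
```

The listed tool calls sum to 20 and 13, while the reported turn counts are 23 and 15, so the listings evidently do not enumerate every turn. Five of the seven-call difference comes from the check and verify phases that InsForge skips.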

With Supabase MCP, the agent had no visibility into existing policies or RLS state and could not reliably implement the required security model. It passed only 1 of 4 consecutive runs.

Aggregate Results

Across all 21 tasks:

Metric	InsForge	Postgres MCP	Supabase MCP
Pass⁴ Accuracy	47.6%	38.1%	28.6%
Avg Tokens Per Run	8.2M	10.4M	11.6M
Avg Time Per Task	150 seconds	200+ seconds	200+ seconds

InsForge is 1.6x faster, uses up to 30% fewer tokens (8.2M against 10.4M for Postgres MCP and 11.6M for Supabase MCP), and achieves 47.6% Pass⁴ accuracy against Postgres MCP's 38.1% and Supabase MCP's 28.6%.

Figure: Token cost of missing RLS context across three backends on the same task

The accuracy gap is significant given what Pass⁴ measures. Passing once is not the bar. The bar is passing the same complex backend operation 4 times in a row, without mistakes, without retries caused by missing context. At that bar, InsForge clears 10 of the 21 tasks, against 8 for Postgres MCP and 6 for Supabase MCP.
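The percentages in the aggregate table can be turned back into task counts and relative token savings with a little arithmetic. The inputs below are copied from the table, and the rounding choices are the only assumptions.

```python
TASKS = 21
pass4 = {"InsForge": 0.476, "Postgres MCP": 0.381, "Supabase MCP": 0.286}
tokens_m = {"InsForge": 8.2, "Postgres MCP": 10.4, "Supabase MCP": 11.6}

# Number of tasks passed in all four runs, recovered from the percentages.
tasks_passed = {name: round(rate * TASKS) for name, rate in pass4.items()}

# InsForge's token reduction relative to each of the other backends, in percent.
savings = {
    name: round(100 * (1 - tokens_m["InsForge"] / used), 1)
    for name, used in tokens_m.items()
    if name != "InsForge"
}

print(tasks_passed)  # {'InsForge': 10, 'Postgres MCP': 8, 'Supabase MCP': 6}
print(savings)       # {'Postgres MCP': 21.2, 'Supabase MCP': 29.3}
```

The token saving is about 21% relative to Postgres MCP and about 29% relative to Supabase MCP.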

Closing

The future of agent-native development will not be defined by better models alone. It will be defined by what those models can see, the context they are given, the signals they receive, and the backend layer that determines both.

If this is the problem you are working on, InsForge is open source and we welcome contributions from the community.

Try InsForge

Quickstart guide here