The hard part of building agents isn't reasoning. It's teaching a model to use a large and growing set of tools it was never trained on — essentially, handing it capabilities that didn't exist when its weights were frozen. And it breaks in a specific, predictable way: past a few dozen tools, the definitions stop fitting the context window. You burn tokens describing capabilities before the model has used a single one. For frontier models this is annoying. For local and small models it's a hard ceiling on what you can build.

The current best answer, and its catch

The strongest response so far is code execution. The insight is that models already know an operating system. They were trained on enormous amounts of Unix, bash, and filesystem navigation, so instead of inventing a bespoke tool-calling protocol, you give the model a Linux box and a handful of primitives and let it work the way it already knows how. Frontier labs report this outperforms direct tool-calling, and it's the direction behind Anthropic's code execution with MCP and Cloudflare's Code Mode. It fixes two costs at once: the model loads only the tools it needs (instead of all definitions up front), and intermediate results stay in the execution environment instead of round-tripping through the context window.

The catch is the box. Code execution needs a sandbox, and a sandbox cuts out a large slice of deployment surfaces — edge runtimes, constrained on-device targets, anywhere you can't or won't hand a model arbitrary code execution. You can emulate one, but a virtual environment with real Linux parity is its own infrastructure problem, and now you're maintaining an emulator to avoid maintaining a sandbox.

So the question is: is there a substrate models know as well as bash, that gives you the same composition and on-demand discovery, but doesn't require a general-purpose execution environment?

Databases

Databases are about as old as the field. SQL, the language for querying and mutating them, dates to the early 1970s and has been an ANSI standard since 1986. Fifty-plus years of use means it saturates every model's training data. A 3B local model that fumbles a novel tool schema will usually still write a clean JOIN, because it has seen millions of them. SQL is the rare interface where small models and frontier models come close to equal fluency.

So instead of emulating Linux, emulate a database. Tools become tables. The model expresses what it wants to read or change as queries over those tables, and a single execute_sql tool parses each query down to the real tool calls underneath. The model never sees the plumbing — it sees a schema and writes SQL against it.

The simplest case: a get_time tool becomes a one-row time table. To answer a question that needs the current time, the model runs:

SELECT now FROM time;

One tool — execute_sql — and the model gets a scalar back. The system prompt tells it which tables exist at init, and from there it composes.

A slightly more involved case — an agent maintaining a task list:

User: Do I have anything due tomorrow?
SELECT id, name, description
FROM tasks
WHERE due_date = (SELECT tomorrow FROM time);

Two tools, one query, no intermediate hop. The model doesn't fetch "tomorrow," read it back into context, then issue a second call against tasks. The relational layer resolves it in one shot.
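A minimal sketch of what execute_sql could look like underneath, assuming an in-memory SQLite database as the relational layer. The tool functions (get_time, list_tasks), the TOOLS registry, and the fixed date are all hypothetical stand-ins, and the "retry on missing table" trick is one possible way to load a tool only when a query references it, not a claim about how any real system does it. It also assumes each tool returns at least one row.

```python
import re
import sqlite3
from datetime import date, timedelta

# Hypothetical tools; a real deployment would call external APIs here.
def get_time():
    today = date(2024, 5, 1)  # fixed so the example is reproducible
    return [{"now": today.isoformat(),
             "tomorrow": (today + timedelta(days=1)).isoformat()}]

def list_tasks():
    return [
        {"id": 1, "name": "File report", "description": "Q1 summary", "due_date": "2024-05-02"},
        {"id": 2, "name": "Renew domain", "description": "example.com", "due_date": "2024-06-10"},
    ]

TOOLS = {"time": get_time, "tasks": list_tasks}  # table name -> backing tool

def materialize(conn, table, calls):
    """Call the tool behind `table` and load its rows into a real SQLite table."""
    rows = TOOLS[table]()
    calls.append(table)
    cols = list(rows[0].keys())
    conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row[c] for c in cols) for row in rows],
    )

def execute_sql(query):
    """Run `query`, calling a tool only when the query references its table."""
    conn = sqlite3.connect(":memory:")
    calls = []
    while True:
        try:
            return conn.execute(query).fetchall(), calls
        except sqlite3.OperationalError as exc:
            match = re.fullmatch(r"no such table: (\w+)", str(exc))
            if not match or match.group(1) not in TOOLS or match.group(1) in calls:
                raise
            materialize(conn, match.group(1), calls)
```

Running the task query through this returns the matching row and a call log containing both tools, while SELECT now FROM time touches only get_time. The value of "tomorrow" moves from one tool to the other inside SQLite and never appears in the model's context.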

Composition without round-trips

That second example is the whole point in miniature. Different tools interleave inside a single query, and the data between them never passes through the model. This is exactly the cost that code execution exists to eliminate — and SQL gets it for free, because joining and filtering across sources is what the language was built to do.

It scales the way joins scale. Five tools or fifty, the model writes one declarative statement describing the shape of the answer, and the engine handles the wiring.

Discovery is already a solved problem in SQL

Here's the part that should be the headline. The context-window problem — too many tool definitions to load up front — is something SQL already solved decades ago. It's called the catalog.

SELECT table_name FROM information_schema.tables;
SELECT column_name, data_type FROM information_schema.columns
WHERE table_name = 'tasks';

You don't stuff every capability into the system prompt. You expose a catalog, tell the model the catalog exists, and let it query for what it needs the moment it needs it. Progressive discovery isn't a feature you build on top — it's information_schema, and every model already knows how to walk it. An agent can land in an unfamiliar environment, list its tables, inspect the ones that look relevant, and figure out its own capabilities from there.
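As a rough illustration, here is one way to make those two catalog queries answerable without any real tables behind them. SQLite has no built-in information_schema, so this sketch attaches a second in-memory database under that name; the TOOL_SCHEMAS dictionary and build_catalog are hypothetical, and in practice the schemas would come from whatever metadata your tools already publish.

```python
import sqlite3

# Hypothetical tool schemas: table name -> {column name: declared type}.
TOOL_SCHEMAS = {
    "time": {"now": "text", "tomorrow": "text"},
    "tasks": {"id": "integer", "name": "text", "description": "text", "due_date": "text"},
}

def build_catalog(schemas):
    """Return a connection whose information_schema describes the available tools."""
    conn = sqlite3.connect(":memory:")
    # Assumption: emulate information_schema with an attached in-memory database.
    conn.execute("ATTACH DATABASE ':memory:' AS information_schema")
    conn.execute("CREATE TABLE information_schema.tables (table_name TEXT)")
    conn.execute(
        "CREATE TABLE information_schema.columns "
        "(table_name TEXT, column_name TEXT, data_type TEXT)"
    )
    for table, cols in schemas.items():
        conn.execute("INSERT INTO information_schema.tables VALUES (?)", (table,))
        conn.executemany(
            "INSERT INTO information_schema.columns VALUES (?, ?, ?)",
            [(table, col, typ) for col, typ in cols.items()],
        )
    return conn
```

The system prompt then only needs to say that information_schema exists. Adding a fiftieth tool adds rows to the catalog, not tokens to the prompt.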

Preview before you commit

Because the substrate is a database, you inherit transactions. A write doesn't have to fire blind — the agent (or a human gating it) can see the plan first:

BEGIN;
INSERT INTO emails ("to", subject, body) VALUES (...);
-- inspect the staged effect, then:
ROLLBACK;  -- or COMMIT

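EXPLAIN can show which underlying tools a query will hit before it runs, provided the interpreter implements it that way. Reads can be previewed, writes can be staged and approved, and the whole thing maps onto an approval-gated outbound model without inventing new machinery. Dry-run is BEGIN; … ROLLBACK. The one requirement is that the interpreter defers the real tool calls until COMMIT: an email that has already been sent cannot be rolled back. This is a much more natural safety surface than "the model ran some code; hope it was fine."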
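A small sketch of the staging idea, assuming SQLite as the staging area and a hypothetical send_email tool. The approve callback stands in for whatever gate you use (a human, a policy check, the agent itself). The key design choice in this sketch is that the real side effect fires only after COMMIT, so a ROLLBACK really is a dry run.

```python
import sqlite3

sent = []  # stands in for the real outbound side effect

def send_email(to, subject, body):
    """Hypothetical tool; a real system would call a mail API here."""
    sent.append((to, subject, body))

def run_staged(statement, params, approve):
    """Stage a write in a transaction, show the staged rows, then commit or roll back."""
    # isolation_level=None gives manual control over BEGIN / COMMIT / ROLLBACK.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute('CREATE TABLE emails ("to" TEXT, subject TEXT, body TEXT)')
    conn.execute("BEGIN")
    conn.execute(statement, params)
    staged = conn.execute('SELECT "to", subject, body FROM emails').fetchall()
    if approve(staged):
        conn.execute("COMMIT")
        for row in staged:  # side effects fire only after COMMIT
            send_email(*row)
    else:
        conn.execute("ROLLBACK")
    return staged
```

Calling run_staged with an approve function that returns False yields the staged rows for inspection and sends nothing; returning True commits and then dispatches each row to the tool.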

Security is the underrated argument

Arbitrary code execution is a large attack surface — exfiltration, lateral movement, anything bash can reach. SQL is a grammar you fully control. You can parse every statement before executing it, default to read-only, whitelist statement types, and scope grants per table. The model's "infinite capability" is bounded by a syntax tree you can inspect and reject. For the untrusted and on-device deployments where you can't hand over a sandbox, that bound is the entire reason this works.
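To make "default to read-only" concrete, here is a sketch using the authorizer hook in Python's sqlite3 module, which SQLite consults while compiling each statement. The helper names (open_read_only, try_query) are made up for this example, and a production interpreter would likely parse and validate statements itself as well; this only shows that the rejection can happen before anything executes.

```python
import sqlite3

# Only plain reads are allowed; everything else is denied at compile time.
READ_ONLY_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}

def read_only_authorizer(action, arg1, arg2, db_name, source):
    """Called by SQLite for each action a statement would perform."""
    return sqlite3.SQLITE_OK if action in READ_ONLY_ACTIONS else sqlite3.SQLITE_DENY

def open_read_only(setup_statements):
    """Build a database from trusted setup SQL, then install the read-only guard."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    for stmt in setup_statements:  # runs before the guard is installed
        conn.execute(stmt)
    conn.set_authorizer(read_only_authorizer)
    return conn

def try_query(conn, query):
    """Run a model-written query; return rows, or a rejection message."""
    try:
        return conn.execute(query).fetchall()
    except sqlite3.DatabaseError as exc:
        return f"rejected: {exc}"
```

With this guard installed, a SELECT over tasks returns rows as usual, while DELETE, DROP, and INSERT fail during statement preparation and leave the data untouched. Note that this allowlist is deliberately strict: it also denies SQL function calls, which would need their own entry.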

MCP servers are mostly CRUD anyway

Look at what most MCP tools actually do: create, read, update, delete over some resource type. So decompose a server into its data types and give each one a table. A GitHub server and a Linear server become github_pr and linear_issue. Want to push merged PRs into another system?

INSERT INTO notion_page (title, body, external_id, status)
SELECT title, body, id, 'Done'
FROM github_pr
WHERE merged = true AND base = 'main';

Two servers, one statement, no glue code and no intermediate results in context. Compose this across however many servers an agent has wired up and it starts to "figure them out" on its own — list the catalog, learn the schemas, write the query.
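Here is a toy version of that statement running end to end, with SQLite in the middle. Both servers are faked: github_list_prs and notion_create_page are hypothetical stand-ins, and the sample PRs are invented for the example. The sketch loads the read side into a table, runs the model's statement unchanged, then turns each row that landed in notion_page into a call to the write tool.

```python
import sqlite3

# Hypothetical stand-in for a GitHub server: (title, body, id, merged, base).
def github_list_prs():
    return [
        ("Fix login bug", "Patch session expiry", 101, True, "main"),
        ("Draft: new API", "Work in progress", 102, False, "main"),
        ("Backport fix", "For release branch", 103, True, "release"),
    ]

created_pages = []  # stands in for pages created on the other server

def notion_create_page(title, body, external_id, status):
    """Hypothetical write tool; a real one would call the remote API."""
    created_pages.append((title, body, external_id, status))

def run_cross_server(statement):
    """Run one INSERT ... SELECT across two tool-backed tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE github_pr (title, body, id, merged, base)")
    conn.execute("CREATE TABLE notion_page (title, body, external_id, status)")
    conn.executemany("INSERT INTO github_pr VALUES (?, ?, ?, ?, ?)", github_list_prs())
    conn.execute(statement)
    # Each inserted row becomes one write call; none of them pass through the model.
    inserted = conn.execute(
        "SELECT title, body, external_id, status FROM notion_page"
    ).fetchall()
    for row in inserted:
        notion_create_page(*row)
    return len(inserted)
```

Only the merged PR targeting main reaches the write tool. The model wrote the wiring; the PR titles and bodies were never in its context.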

It's not CRUD, it's dataflow

It's tempting to say SQL only fits the create-read-update-delete majority and transformations like summarize, translate, or generate_image fall outside it. That's the wrong cut. A table is just a relation — a set of tuples — and nothing requires those tuples to be stored. A relation can be a function presented as rows, materialized on demand, with the WHERE values pushed down as inputs rather than evaluated as predicates over existing data. Steampipe already leans on this: some plugin tables have required key columns that must appear in the WHERE clause or the query errors, because they're API parameters, not filters. So this is a perfectly honest SELECT:

SELECT url FROM images WHERE prompt = '...' AND style = 'watercolor';

prompt and style aren't filters — they're arguments the interpreter consumes. The four CRUD verbs are incidental.
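A toy interpreter for that one query shape shows the push-down. Everything here is an assumption made for illustration: the regular expressions accept only SELECT col FROM table WHERE k = 'v' AND ..., the generate_image function returns a fake URL, and the "required" set mimics the required-key-column behavior described above rather than reproducing any real plugin API.

```python
import re

def generate_image(prompt, style):
    """Hypothetical tool; returns a fake URL instead of calling an image model."""
    return [{"url": f"https://img.example/{style}/{len(prompt)}.png"}]

# Function-backed tables: WHERE equalities are arguments, not filters.
FUNCTION_TABLES = {
    "images": {"tool": generate_image, "required": {"prompt", "style"}},
}

QUERY = re.compile(
    r"SELECT\s+(\w+)\s+FROM\s+(\w+)\s+WHERE\s+(.+?);?\s*$",
    re.IGNORECASE | re.DOTALL,
)
EQUALITY = re.compile(r"(\w+)\s*=\s*'([^']*)'")

def execute_function_select(query):
    """Push WHERE equalities down into the tool call behind a function-backed table."""
    match = QUERY.match(query.strip())
    if not match:
        raise ValueError("unsupported query shape")
    column, table, where = match.groups()
    spec = FUNCTION_TABLES[table]
    args = dict(EQUALITY.findall(where))
    missing = spec["required"] - args.keys()
    if missing:
        # The arguments are mandatory inputs, so a missing one is an error.
        raise ValueError(f"missing required key columns: {sorted(missing)}")
    return [row[column] for row in spec["tool"](**args)]
```

A query that omits style fails with a missing-key error instead of returning an empty result, which is the behavioral difference between an argument and a filter.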

And the input direction is only half of it. The same relational machinery carries values between tools, not just into them. A WHERE pushes an argument down into a tool; a JOIN, a subquery, or an INSERT … SELECT takes one tool's output and feeds it as the input to the next. Argument-passing and result-passing are the same operation viewed from two ends, and both happen off-context — the model writes the wiring once and never sees the values flow through it. That's what SQL actually gives you: a uniform medium for moving data into tools and between tools in a single expression. It generalizes to anything you can describe as inputs-to-rows. Which is everything.

Where it actually leaks

So the limit isn't expressiveness. It's two narrower things, and naming them honestly is what separates this from hand-waving.

Purity. SELECT makes a promise to the query planner: referential transparency. The engine is free to evaluate an expression zero times, twice, once per row, reorder it, cache it, or fold it away — and a pure API read tolerates all of that. An effectful tool tolerates none of it. generate_image is nondeterministic, it costs money, and it must run exactly once. Syntactically it's a SELECT; semantically it's a volatile one, and the naive thing happens the moment you compose it:

SELECT generate_image(t.prompt) FROM themes t;   -- N rows → N calls, N charges

That might be what you meant. Or you wanted a single image, and the planner invoked the tool once per output row, re-evaluated it, or skipped it entirely. Postgres had to invent IMMUTABLE / STABLE / VOLATILE for exactly this reason. So the real constraint is: every tool is a relation, but effectful tools are volatile relations, and your interpreter has to carry a volatility model so the engine doesn't quietly call them the wrong number of times. The syntax generalizes perfectly. The execution contract doesn't, and you own that gap.
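One possible shape for that volatility model, sketched under loud assumptions: the three labels are borrowed from Postgres, but StatementContext, its call budget, and the caching rule are hypothetical design choices for an interpreter, not an existing API. Non-volatile tools are memoized per statement (which requires hashable arguments), and volatile tools get an explicit call budget so an accidental per-row fan-out fails instead of billing you.

```python
from enum import Enum

class Volatility(Enum):
    IMMUTABLE = "immutable"  # same arguments always give the same result
    STABLE = "stable"        # result is fixed within a single statement
    VOLATILE = "volatile"    # side effects or cost; every call counts

class StatementContext:
    """Tracks tool calls for one SQL statement and enforces a volatile-call budget."""

    def __init__(self, max_volatile_calls=1):
        self.max_volatile_calls = max_volatile_calls
        self.cache = {}           # (tool name, args) -> cached result
        self.volatile_calls = {}  # tool name -> calls made so far

    def call(self, tool, volatility, *args):
        if volatility is not Volatility.VOLATILE:
            # Safe to deduplicate: repeated evaluation returns the cached value.
            key = (tool.__name__, args)
            if key not in self.cache:
                self.cache[key] = tool(*args)
            return self.cache[key]
        used = self.volatile_calls.get(tool.__name__, 0)
        if used >= self.max_volatile_calls:
            raise RuntimeError(
                f"{tool.__name__} exceeded its budget of "
                f"{self.max_volatile_calls} call(s) in this statement"
            )
        self.volatile_calls[tool.__name__] = used + 1
        return tool(*args)
```

For the themes query above, an interpreter built this way would either be given a budget of N deliberately or stop after the first image, which turns a silent N charges into a visible decision.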

Atomic conditional composition. The one thing that genuinely won't reduce is branching where the next tool depends on a prior tool's output, inside a single statement. That's declarative-versus-imperative, not CRUD-versus-transform, and SQL is on the wrong side of it. You can punt it to the agent loop — query, look, query again — but then you've given back the no-round-trip property that was the whole point, for that one case.

And a smaller, orthogonal caveat worth keeping in view: these tables are live API views, not stored data. They hit rate limits and pagination, and a JOIN across two services can half-fail when one API times out mid-query. The abstraction is honest about shape and silent about guarantees — partial failure is yours to handle.
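One conservative way to handle that, sketched with hypothetical names: fetch every source a query needs first, record which ones failed, and refuse to run the join at all if any did. Whether to refuse, retry, or return partial rows with a warning is a policy choice; this sketch picks refusal because a silently half-empty JOIN is the hardest failure for a model to notice.

```python
def load_sources(fetchers):
    """Fetch each source; collect failures instead of letting one abort the rest."""
    loaded, failures = {}, {}
    for table, fetch in fetchers.items():
        try:
            loaded[table] = list(fetch())  # list() drains paginated generators too
        except (TimeoutError, ConnectionError) as exc:
            failures[table] = str(exc)
    return loaded, failures

def run_or_refuse(fetchers, run_query):
    """Run the query only if every source it depends on loaded completely."""
    loaded, failures = load_sources(fetchers)
    if failures:
        return {"ok": False, "failed_sources": failures}
    return {"ok": True, "rows": run_query(loaded)}
```

The failure report names the source that timed out, so the agent can decide whether to retry or answer from what it has.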

This substrate already exists

The "APIs as SQL tables" idea isn't hypothetical. Steampipe has done it in production for years, mapping cloud and SaaS APIs to Postgres foreign tables via FDWs and letting you join across them in real time, zero-ETL. That prior art strengthens this argument rather than undercutting it: it's the existence proof that the hard part (translating SQL down to live API calls across many providers) is tractable and performant.

The new claim is narrower and, I think, more interesting: point that substrate at LLM tool use specifically. The properties that make Steampipe convenient for a human analyst — one ubiquitous language, joins across sources, a queryable catalog — are the exact properties that fix the things breaking agents today: context bloat, intermediate-result round-trips, and discovery at scale. The human got ergonomics. The model gets a capability surface it was fluent in before it ever saw your tools.

Give a model a database instead of a box, and a lot of the infrastructure melts away. It already knows the language.