May 21st, 2026

The AX stack: what’s fixed, where you can win

Principal Developer Advocate

AI coding agents promise to make you more productive. On the surface they do, but in practice they fall short: agents generate code that doesn’t compile, use a deprecated SDK, or pick the wrong service entirely. Are you using them wrong? Is it your tech stack? Or is it the tools you haven’t configured yet?

The stack between a developer’s prompt and the generated code has layers. Some of those layers are fixed: you can’t change them no matter what you do. But there’s one layer where you have all the leverage. And if you don’t know which is which, you’ll waste time optimizing the wrong thing without seeing any results.

This is the first article in a series about Agent Experience (AX): the practice of making AI coding agents work correctly with your technology. The series covers what you can and can’t control in the agent stack, how to measure whether your extensions are helping or hurting, and how to iterate toward better outcomes.

The stack

When a developer asks an AI coding agent to build something with your technology, here’s what actually happens: the developer sends an instruction to the agent. The agent sends it to the LLM along with information about the workspace and available tools (MCP, skills, etc.). The LLM processes the instruction and tells the agent which tools to call. The agent runs those tools and sends the results back to the LLM. This repeats until the LLM considers its job done or needs more information from the developer.

Developer prompt
  → Agent (harness)
    → Model
      → Agent extensions (skills, MCP servers, instructions, custom agents)
        → Your technology surface (CLI, SDK, API)
          → Generated code
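To make that loop concrete, here’s a minimal sketch in Python. It’s an illustration, not how any particular harness is implemented: run_agent, ModelReply, ToolCall, and the scripted stand-in model are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    argument: str


@dataclass
class ModelReply:
    # Either the model asks for tools to be called, or it gives a final answer.
    tool_calls: list = field(default_factory=list)
    final_answer: Optional[str] = None


def run_agent(prompt: str, tools: dict, model: Callable, max_turns: int = 10):
    """Hypothetical harness loop: call the model, run the tools it asks for, repeat."""
    transcript = [f"user: {prompt}"]
    for _ in range(max_turns):
        # The model sees the conversation so far plus the names of available tools.
        reply = model(transcript, sorted(tools))
        if reply.final_answer is not None:
            return reply.final_answer, transcript
        for call in reply.tool_calls:
            result = tools[call.name](call.argument)
            transcript.append(f"tool {call.name}: {result}")
    raise RuntimeError("model did not finish within max_turns")


def scripted_model(transcript, tool_names):
    # Stand-in for an LLM: asks for docs once, then answers using what it got back.
    if not any(line.startswith("tool ") for line in transcript):
        return ModelReply(tool_calls=[ToolCall("lookup_docs", "create client")])
    return ModelReply(final_answer="code written using: " + transcript[-1])


tools = {"lookup_docs": lambda query: f"docs for '{query}'"}
answer, transcript = run_agent("Build a client", tools, scripted_model)
print(answer)
```

The point of the sketch is where each layer sits: the model only decides what to ask for, the harness owns the loop, and your extension is the entry in tools.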

Three layers matter for this conversation: the model, the harness, and the agent extensions. Each has a different owner, different constraints, and a different relationship to you.

The model

The model is a fixed constraint: you didn’t train it, you can’t retrain it, and you can’t control what’s in its weights.

If the model learned your API from outdated docs, it’ll generate code using deprecated patterns. If it never saw your technology during training, it’ll hallucinate something plausible and wrong. And if there’s a competing technology that has more training data, the model will default to it even when yours is the right fit.

You can’t fix this directly. What you can do is provide information at inference time that overrides or supplements what the model knows. That’s what agent extensions are for. But you need to understand: the model is the foundation, and its biases are the default behavior. Everything you build on top is fighting or reinforcing those defaults.

The harness

The harness is the agent itself: Copilot, Claude Code CLI, Cursor, Windsurf, whatever the developer is using. It controls the system prompt, the tool-calling protocol, and how the context gets assembled. It decides what gets included in the context window, what gets dropped, and what the agent does next.

You don’t control the harness either. You might build an extension that works perfectly in Copilot and breaks in Claude Code CLI because the two harnesses handle tool descriptions differently. The harness decides how the agent consumes your extensions, and that decision is opaque to you. As a result, the same MCP server, with the same tool descriptions, can produce completely different results across harnesses, because each harness interprets and invokes tools differently. Your extension isn’t running in a vacuum. It’s running inside someone else’s orchestration layer.

So when developers say that one model is better than another, they’re telling only half the story. The same model will behave differently in different harnesses, and you should consider both when evaluating performance.

Agent extensions

Agent extensions are the surface you control. They’re everything you can put in front of the agent to shape its behavior: skills, MCP servers, instruction files, and custom agents. If you own a technology, you ship these to help agents use it correctly. If you’re a developer, you configure them in your workspace to get better results.

Agent extensions teach the model about your technology, correct its misconceptions, and steer it away from competing approaches. They’re how you get the model to do what you want instead of what it defaults to. They’re also how you inject up-to-date information that the model might not have learned during training. But agent extensions don’t exist in isolation: they compete.

The zero-sum context window

Every agent has a finite context window. Your MCP server’s tool descriptions, your instruction files, your skill definitions: they all consume tokens. And so do everyone else’s.

When a developer has 15 extensions installed and asks the agent to do something, the harness has to decide which tools to invoke, which context to include, and what to drop. Your tool description might get summarized, truncated, or ignored entirely because something else claimed the space first.

We’ve seen this in evaluations repeatedly. An extension with high discovery, correct invocations, and good outcomes in isolation degrades when other extensions are present. More extensions don’t mean better outcomes. Sometimes they mean worse outcomes.

Why would that happen? Because extensions fight for the same context window. Your tool description says “use this for database operations.” Another extension’s tool description says “use this for database operations.” The model has to pick, and its choice depends on factors you don’t control: the order things appear in context, how the harness ranks tools, what the model’s training says about each option. That’s the composition problem, and nobody has a good answer for it yet.
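Here’s a toy sketch of that budget pressure. Real harnesses use their own, opaque strategies, so treat everything here as an assumption: the four-characters-per-token estimate, the first-come-first-served packing, and the extension names are all made up for illustration.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def pack_context(descriptions: list, budget: int):
    """Keep tool descriptions in the order given until the token budget runs out."""
    kept, dropped, used = [], [], 0
    for name, text in descriptions:
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append(name)
            used += cost
        else:
            dropped.append(name)
    return kept, dropped, used


# Hypothetical extensions; the strings only stand in for descriptions of a given size.
other_db = ("other-db", "x" * 400)   # about 100 tokens
linter = ("linter", "x" * 200)       # about 50 tokens
yours = ("your-tool", "x" * 320)     # about 80 tokens

# Same extensions, same budget, different order.
yours_last = pack_context([other_db, linter, yours], budget=200)
yours_first = pack_context([yours, other_db, linter], budget=200)
print(yours_last)
print(yours_first)
```

With your tool listed last it gets dropped; listed first, it survives and something else is dropped instead. Nothing about your extension changed between the two runs, which is why ordering and ranking decisions you don’t control matter so much.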

The three failure modes

After hundreds of agent sessions across different products and configurations, we keep seeing three ways extensions fail.

Discovery failure

Your extension exists, but the agent never sees it. The developer has too many extensions installed, and yours gets dropped before it reaches the context window. Or the harness doesn’t load it at all because of how extensions are registered or prioritized. The tool is invisible: the model can’t use what it can’t see. It’s a packaging and distribution problem. Your extension needs to be installed, registered, and small enough to survive context limits.

Selection failure

Your extension is in context, but the agent doesn’t connect it to the developer’s intent. The developer says “set up authentication”, while your tool is called configure-identity-provider, and the model never makes the connection, or your description is optimized for specific keywords that the developer doesn’t know or use.

Of the three, this one shows up the most – and it’s the most fixable. It comes down to the vocabulary you use to describe your tool vs. how developers (and models) think about the problem. Fix the description and you’ll fix the selection.
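One cheap way to spot a vocabulary gap is to check how many words from a typical developer prompt show up in your tool description. The sketch below does that. Models don’t select tools by literal word overlap, so this is only a rough lint, and both descriptions are invented for the example.

```python
import re


def words(text: str) -> set:
    # Lowercase alphabetic words; hyphens and punctuation act as separators.
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(prompt: str, description: str) -> float:
    """Share of the prompt's words that also appear in the tool description."""
    prompt_words = words(prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & words(description)) / len(prompt_words)


prompt = "set up authentication"

# Hypothetical descriptions for the same tool, before and after a rewrite.
before = "configure-identity-provider: Configures the identity provider."
after = (
    "configure-identity-provider: Set up authentication (sign-in, login) "
    "by configuring the identity provider."
)

print(overlap(prompt, before))
print(overlap(prompt, after))
```

The first description shares no words with the prompt; the second shares all of them. A low score doesn’t prove the agent will miss your tool, but it’s a good prompt to go read your description the way a developer would.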

Quality failure

The agent discovers, selects, and invokes your extension, but the content the extension provides hurts more than it helps. The MCP server returns a wall of text that the model either ignores or misinterprets. The skill provides instructions that conflict with what the model already knows. The content returned is accurate but so verbose that it pushes other useful context out of the window.

This one’s subtler than the other two. The extension is working – it’s being called, it’s returning content, but the outcome is worse than if the extension didn’t exist at all. That’s drag. And you won’t know it’s happening unless you measure.

This also shows that when testing AX, you can’t just verify that the agent is calling your extension. You have to verify that the content it returns is improving outcomes. Otherwise, you might be optimizing for the wrong thing.

What you’re actually optimizing

When you’re working on AX, changing agent extensions gives you immediate results; everything else is fixed. You can improve your public docs and hope future model training picks them up, but that’s a long-term bet with no guaranteed payoff. You can’t change how the harness orchestrates. What you can change right now is your agent extensions, and there are four things to get right:

  1. Does the agent discover your extensions? If it doesn’t see them, nothing else matters. This is especially important when testing your extension in combination with other popular extensions relevant to your audience.
  2. Does the agent select your extensions for the task? If it sees them but never connects them to the developer’s intent, they’re dead weight.
  3. When it uses them, do outcomes improve? If outcomes don’t improve or get worse, your extension is drag, not lift.
  4. Do your extensions compose well with others? If your extension works in isolation but breaks when other extensions are present, you have a composition problem.

Every AX improvement you make maps to one of these four things: discovery, selection, quality, composition.
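If you log a few signals per session, you can map each one to the first of these four things that went wrong. The sketch below assumes you can extract three booleans from your harness’s logs; the SessionRecord fields and both functions are hypothetical, and the quality check is deliberately simplified (a fuller check would compare against a run without the extension).

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    extension_in_context: bool  # did the extension reach the model's context?
    extension_invoked: bool     # did the model actually call it?
    task_succeeded: bool        # did the generated code meet your bar?


def diagnose(record: SessionRecord) -> str:
    """Return the first stage that failed, or 'ok'."""
    if not record.extension_in_context:
        return "discovery"
    if not record.extension_invoked:
        return "selection"
    if not record.task_succeeded:
        return "quality"
    return "ok"


def diagnose_composition(alone: SessionRecord, with_others: SessionRecord) -> str:
    """Fine in isolation but failing next to other extensions means composition."""
    if diagnose(alone) == "ok" and diagnose(with_others) != "ok":
        return "composition"
    return diagnose(alone)


alone = SessionRecord(True, True, True)
crowded = SessionRecord(False, False, False)
print(diagnose(crowded))
print(diagnose_composition(alone, crowded))
```

The order of the checks matters: there’s no point judging quality for a session where the extension never made it into context.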

Lift and drag

Every AX conversation comes down to the question of lift vs. drag. Are your extensions creating lift by improving outcomes, or are they creating drag by making things worse?

Lift

You add your extension, and outcomes improve. The agent discovers your tool, uses it correctly, and the generated code actually works – right SDK, right patterns, up to date. That’s what you’re building extensions for.

Drag

Drag is the opposite of lift. Your extension is present, but outcomes are the same or worse. Maybe the agent never discovers it: zero lift, but at least no harm. Maybe it’s discovered but the content confuses the model. Maybe it works fine alone but conflicts with other extensions. The worst part? You usually don’t know it’s happening.

The only way to tell the difference is to measure. Run the same scenario with and without your extension and compare the outcomes. Better with the extension? You’ve got lift. The same or worse? Drag. Better, but token costs tripled? Expensive lift, and whether it’s worth it depends on the size of the improvement.

This is what AX measurement comes down to: controlled comparisons. A baseline without extensions versus a profile with your extensions. Everything else (the model, the harness, the developer’s prompt) stays the same.
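As a rough sketch, the comparison could look like this in Python. The RunSummary shape, the success-rate metric, and both thresholds are assumptions picked for illustration; what counts as success and how much extra cost is acceptable are for you to define.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    success_rate: float  # fraction of runs where the output met your bar
    avg_tokens: float    # average tokens consumed per run


def classify(
    baseline: RunSummary,
    with_extension: RunSummary,
    min_gain: float = 0.05,    # assumed noise threshold: smaller gains count as "same"
    cost_factor: float = 2.0,  # assumed cutoff for calling the lift "expensive"
) -> str:
    """Compare a baseline profile against the same scenario with the extension."""
    gain = with_extension.success_rate - baseline.success_rate
    if gain < min_gain:
        return "drag"  # same or worse outcomes
    if with_extension.avg_tokens > cost_factor * baseline.avg_tokens:
        return "expensive lift"
    return "lift"


baseline = RunSummary(success_rate=0.6, avg_tokens=10_000)
print(classify(baseline, RunSummary(0.9, 12_000)))
print(classify(baseline, RunSummary(0.9, 30_000)))
print(classify(baseline, RunSummary(0.6, 11_000)))
```

The three calls cover the three cases from above: better outcomes at similar cost, better outcomes at triple the tokens, and no improvement. For the comparison to mean anything, both summaries need to come from the same model, harness, and prompts.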

Summary

You can’t change the model or the harness. Agent extensions are your one lever, and the only way to know if they’re helping is to measure. Run the same scenario with and without your extensions, keep everything else the same, and compare the outcomes. That’s the difference between shipping lift and shipping drag. In the following articles, we’ll talk more about what good outcomes look like, how to measure them, and how to iterate on your extensions to get more lift and less drag.

Author

Waldek Mastykarz
Principal Developer Advocate

Waldek Mastykarz is a Principal Cloud Advocate.
