An open SLA specification for production AI agents
A vendor-neutral schema for expressing SLAs, audit records, and rollback procedures for production agent tasks. CC-BY licensed. Implementable by anyone. Request for comments.
By Hongyi Li
If you run AI agents in production, here is an uncomfortable truth: your observability tool can tell you what your agents did, but it cannot stop them from doing the wrong thing tomorrow. The industry has shipped an enormous amount of logging and almost no enforcement.
I think that is the wrong posture for 2026. Agents are taking actions now — in customer inboxes, in contract workflows, in browser tabs, with corporate credentials. “We will look at the logs” is not a governance strategy. It is forensics.
This post proposes an open specification, the Open Agent SLA Specification (OASS), for expressing what it means for an agent task to behave correctly — in a form that an engineer, an auditor, and a control plane can all agree on. The draft is up on GitHub: github.com/attestum-ai/agent-sla-spec. I am looking for feedback, implementations, and holes.
This is not a product announcement. I work on a control plane (Attestum) that implements the spec, but the spec is intentionally free, CC-BY licensed, and framework-neutral. If it becomes useful, I want it useful to everybody.
The gap nobody is filling
I have spent the last few months reading every production agent incident write-up I could find. The pattern is monotonously consistent:
- A team ships an agent that works well in staging.
- It drifts in production — prompt regression, tool schema change, retrieval pipeline off-by-one, a new model release with slightly different behavior.
- Something bad happens: a customer gets a wrong answer, a tool call fires against the wrong tenant, a browser agent exfiltrates a file it should not have.
- The team finds out hours or days later, from the logs.
- Post-mortem concludes with “we need better evals” and “we should have rolled back sooner.”
The industry response has mostly been more observability — nicer UIs for agent traces, better LLM-as-judge rubrics, pretty dashboards with sparklines. Those are all useful, and I use them. But they are diagnostic. They do not stop a regressing agent version from continuing to harm users while a human decides whether to hit the kill switch.
The missing piece is a signed contract for what “correct” means, and a control plane that enforces it automatically, with a rollback as boring and reliable as the one your load balancer performs when an upstream goes unhealthy.
Writing that contract is harder than it looks. “Correct” is a moving target, varies by task, depends on downstream signals, and is usually conflated with “smart.” Most teams end up in one of two failure modes:
- No contract: the team has opinions about what the agent should do, but nothing written down that a new hire or an auditor could read.
- Contract in English prose: thirty pages of Confluence that nobody reads and no system enforces.
We need the OpenTelemetry-style answer: a small, stable schema that any vendor can implement, that a machine can enforce, and that a regulator can read.
What OASS proposes
Three building blocks.
1. The Agent SLA Contract
A short, signed JSON document per agent task. It says:
- What the task is, with a stable name and category.
- Who the parties are (the agent operator, the control plane).
- One primary metric with a numeric threshold and a rolling window (e.g., “task success rate ≥ 92% over the last 200 runs”).
- A list of guardrails — boolean per-run checks that fail the run immediately (e.g., “no tool call to a domain outside the allowlist,” “no unparseable output,” “no response longer than 2,000 tokens”).
- Latency ceilings, both end-to-end and for control-plane overhead.
- A hash of the frozen test set the primary metric is evaluated against.
- A mapping to the regulatory frameworks the deployment is subject to (EU AI Act Articles 15/17/21, SOC 2 CC7/CC8, SR-11-7, MAS FEAT).
The contract is cryptographically signed by both parties. Changing it requires re-signing. This matters because the contract is what a compliance auditor will ask for: the authoritative statement of “what you intended this agent to do, at this point in time.”
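As a sketch of the shape this takes (field names here are illustrative, not the draft's normative schema), a minimal contract might look like the following, with both parties signing the digest of a canonical serialization:

```python
import hashlib
import json

# Illustrative contract for one agent task. Field names are hypothetical;
# the normative schema lives in the OASS draft.
contract = {
    "task": {"name": "refund_triage", "category": "customer_response"},
    "parties": {"operator": "acme-support", "control_plane": "attestum"},
    "primary_metric": {
        "name": "task_success_rate",
        "threshold": 0.92,        # success rate must stay at or above this
        "window_runs": 200,       # evaluated over a rolling window of runs
    },
    "guardrails": [
        {"name": "domain_allowlist", "immediate": True},
        {"name": "parseable_output", "immediate": True},
        {"name": "max_output_tokens", "limit": 2000, "immediate": True},
    ],
    "latency_ms": {"end_to_end_p99": 8000, "control_plane_p99": 150},
    # Hash of the frozen test set the primary metric is scored against.
    "test_set_sha256": hashlib.sha256(b"frozen-test-set-bytes").hexdigest(),
    "frameworks": ["EU-AI-Act-Art-15", "SOC2-CC7"],
}

# Both parties sign the digest of a canonical serialization, so any change
# to any field invalidates the signatures and forces a re-sign.
canonical = json.dumps(contract, sort_keys=True, separators=(",", ":")).encode()
contract_digest = hashlib.sha256(canonical).hexdigest()
print(contract_digest)
```

Canonicalizing before hashing matters: two JSON documents with the same fields in different key order must produce the same digest, or re-signing becomes a source of spurious diffs.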
2. The Agent Run Audit Record
For every run of the agent task, the control plane emits a structured record that:
- Captures which contract version was in force, which agent version ran, and the trace of tool calls and step counts.
- Reports the primary-metric score and every guardrail’s pass/fail.
- Records control-plane overhead explicitly, so latency violations are attributable.
- Logs whether a rollback was triggered and why.
- Points — via hashes, never raw content — at the actual inputs and outputs, which stay encrypted in tenant-owned storage.
- Maps each record to the specific regulatory articles it provides evidence for.
Records are append-only and signed. The combined log is Merkle-chained, so retroactive tampering is detectable — which makes it the kind of evidence an auditor can actually rely on, not just another CSV export.
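To make the tamper-evidence concrete, here is a simplified hash chain over audit records (the spec's Merkle construction additionally supports efficient inclusion proofs; this sketch only shows why retroactive edits are detectable):

```python
import hashlib
import json

def append_record(log: list, record: dict) -> dict:
    """Append a record, chaining it to the previous entry's hash so any
    retroactive edit changes every subsequent hash."""
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    entry = {"record": record, "prev_hash": prev_hash, "entry_hash": entry_hash}
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every hash from the start; False if any entry was altered."""
    prev_hash = "0" * 64
    for entry in log:
        body = json.dumps(entry["record"], sort_keys=True, separators=(",", ":"))
        expected = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
            return False
        prev_hash = entry["entry_hash"]
    return True

log: list = []
append_record(log, {"run_id": 1, "metric": 0.95, "guardrails_passed": True})
append_record(log, {"run_id": 2, "metric": 0.91, "guardrails_passed": True})
assert verify_chain(log)

# Quietly editing an old record breaks every downstream hash:
log[0]["record"]["metric"] = 1.0
assert not verify_chain(log)
```

The records themselves carry only hashes of inputs and outputs, consistent with the tenant-owned-storage rule above; the chain proves ordering and integrity without the control plane ever holding raw content.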
3. The Rollback Decision Procedure
This is the part that makes the difference between “we have a governance framework” and “we have a governed system”:
- If a guardrail with `immediate: true` fails on a single run, the candidate version's response is not served; the incumbent path's response is served instead, or a deterministic error. The rollback is logged.
- If the primary metric's rolling window violates the rollback condition, all subsequent runs of this task are served via the incumbent until a human operator signs a re-arm event and a replay against fresh data passes.
- If the control plane’s own P99 overhead exceeds the contractual ceiling for 5 consecutive minutes, the control plane rolls itself back. We do not get to blame someone else for our own overhead.
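The first two rules can be sketched as a small decision function (names are hypothetical; the third rule, the control plane rolling itself back on overhead, is omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class TaskState:
    """Rolling state the control plane keeps per agent task."""
    recent_successes: list          # 1/0 outcomes, newest last
    window_runs: int = 200
    threshold: float = 0.92
    pinned_to_incumbent: bool = False  # stays set until a signed re-arm event

def decide(state: TaskState, guardrail_failed_immediate: bool,
           run_success: bool) -> str:
    """Return which path to serve for this run: 'candidate' or 'incumbent'."""
    # Rule 1: an immediate guardrail failure fails this single run outright;
    # serve the incumbent (or a deterministic error) and log the rollback.
    if guardrail_failed_immediate:
        return "incumbent"

    # Rule 2: once the rolling window has violated the rollback condition,
    # ALL subsequent runs go to the incumbent until a human re-arms.
    if state.pinned_to_incumbent:
        return "incumbent"

    state.recent_successes.append(1 if run_success else 0)
    window = state.recent_successes[-state.window_runs:]
    if len(window) == state.window_runs and sum(window) / len(window) < state.threshold:
        state.pinned_to_incumbent = True
        return "incumbent"
    return "candidate"
```

Note the asymmetry: rule 1 is per-run and stateless, rule 2 latches. A single bad run never pins the task; a bad window always does, and only a signed re-arm event plus a passing replay clears the latch.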
This is deliberately boring. Site reliability engineering figured out a lot of this already: circuit breakers, canary deployments, automatic rollbacks, runbooks humans re-arm. The agent world has been slow to adopt these primitives, I suspect because the “AI” brand keeps dressing up straightforward reliability engineering as something more exotic than it is.
Why this matters now
Two forcing functions converge in the next 18 months.
First, regulation with a calendar. The EU AI Act’s general-purpose provisions are in force. High-risk system obligations — Annex III categories cover a lot of agents operating in healthcare, finance, education, HR, and law-enforcement-adjacent contexts — apply from 2027-08-02. Article 15 wants robustness, accuracy, and cybersecurity evidence. Article 21 wants continuous post-market monitoring. The rest of the world has similar rules coming (Singapore MAS FEAT, US Fed SR-11-7 extensions, UK AI sectoral guidance). I have yet to meet a team that has read the articles and feels comfortable.
Second, incidents. The blast radius of a misbehaving production agent is going up fast — more tool calls per task, more corporate credentials in the loop, more state-changing actions. The insurance market is starting to notice. When insurers start caring about something, enterprise buyers feel it in procurement about four quarters later.
Either force alone would be enough to push agent governance from “nice to have” to “required.” Together, they make an open spec not just useful but urgent.
What I would like feedback on
Open questions in the draft:
- Task categories. I have started with three: `customer_response`, `document_processing`, `browser_tool`. Over-scoped, under-scoped, or about right? Should the registry be open?
- The audit-record schema. What fields will your auditors require that I have missed?
- The rollback procedure. Simple enough to implement; strong enough to matter?
- Regulatory mapping. Have I mapped EU AI Act articles correctly? What about SR-11-7, MAS FEAT, NIST AI RMF?
- Log format. Should we mandate a specific append-only log substrate (Sigsum, Rekor), or stay storage-agnostic?
If you run agents in production and any of this resonates — or if the draft is missing the obvious thing — please open an issue or PR. I will read all of them.
What I am not doing with this post
I am not pitching you a product here. If you want to know what I am building on top of the spec, the company is Attestum and there is a link in the nav, but I would rather you engage with the spec first. Specifications outlive products. If OASS is right in its current form, it should be implementable by five different vendors in 2027; that is the right outcome.
If it is wrong, please tell me why. I would rather find out now than ship the wrong thing to ten enterprises.
— Hongyi Li, founder, Attestum
- Spec: github.com/attestum-ai/agent-sla-spec
- Company: attestum.ai
- X: @AttestumAI
- Bluesky: @attestum.ai
- LinkedIn: linkedin.com/company/attestumai
If your agents are shipping to regulated users, you need a control plane before the incident.
Attestum is the control plane. The Open Agent SLA Specification is the shared vocabulary. Both are linked above.