What replaces traditional monitoring if AI agents handle telemetry triage?
Direct Answer
Traditional monitoring — alerts, dashboards, on-call rotations, runbooks — does not vanish in the agent era; it gets squeezed out of the middle. The top compresses upward into outcome contracts (SLOs written as named-customer-impact code that humans negotiate and machines enforce). The bottom compresses downward into agent investigation runtime (LLM agents that triage, hypothesize, remediate, and write the post-mortem before a human reads the page). The flabby middle layer — alert routing, dashboard babysitting, runbook execution, p95 staring contests, the 3am ladder of "who owns this graph?" — gets eaten alive by agents. Call the new shape the Hourglass Stack: contracts on top, agents on the bottom, and the old monitoring middle squeezed to a thin governance waist that humans only touch to verify outcomes, not steps. Whoever owns the contract layer (Datadog, Honeycomb, Nobl9) and the agent runtime layer (Bits AI, Splunk AI Assistant, Microsoft Security Copilot, Anthropic Claude with MCP) owns the next decade of observability — and PagerDuty-style human-paging tools become the pay phones of the SRE industry.
The Traditional Monitoring Stack (2020-2025)
- Static thresholds + alerts — CPU > 80%, p95 > 500ms, error rate > 1%. Tuned by hand, they drifted weekly and, according to industry surveys, generated 95%+ false positives.
- Dashboards as the source of truth — Datadog, Grafana, Splunk dashboards that engineers "babysat" during deploys and incidents. Average enterprise had 2,000-10,000 dashboards, most stale.
- On-call rotations + paging — PagerDuty / Opsgenie / VictorOps wake-ups, escalation policies, follow-the-sun rotations. The human was the runtime.
- Runbooks + tribal knowledge — Confluence pages, Notion docs, "ask Steve" Slack pings. Runbook execution was manual, error-prone, and the slowest part of MTTR.
- Post-incident reviews — humans wrote 5-Whys docs days after the fact, often with incomplete telemetry recall and political framing.
- The unit of work — a page. Success was measured in MTTA, MTTR, and how many times the on-call got woken up per week.
The 4 Layers Of The New Agent Stack (2026-2030)
- Layer 1 — Outcome Contracts — SLOs rewritten as machine-enforceable, named-customer-impact contracts: "checkout conversion for Tier-1 enterprise customers stays above 99.2% over rolling 7d." Tools: Nobl9, OpenSLO, Datadog SLOs, Honeycomb SLOs, Cortex. Humans negotiate the contract; machines enforce it (a contract-as-code sketch follows this list).
- Layer 2 — Agent Investigation Runtime — LLM agents that ingest telemetry, form hypotheses, query traces/logs/metrics, run remediations, and write the post-mortem. Tools: Datadog Bits AI, Splunk AI Assistant, Microsoft Security Copilot, Dynatrace Davis CoPilot, New Relic AI, Anthropic Claude + MCP, OpenAI agents on the Responses API (an investigation-loop sketch follows this list).
- Layer 3 — Tools Registry — the agent's hands. Standardized function-calling surface (Model Context Protocol, OpenTelemetry semantic conventions, OpenAI function calling) that exposes "restart pod," "roll back deploy," "page human," "open ticket" as governed, audited, rate-limited tools agents can invoke.
- Layer 4 — Audit + Governance — humans verify outcomes, not steps. Tools: LangSmith, Arize, Helicone, Braintrust, Datadog LLM Obs. Every agent action is logged, replayable, and graded against the outcome contract. The on-call's new job is *grading the agent*, not *being the agent*.
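To make Layer 1 concrete, here is the checkout-conversion contract above written as code a machine can enforce. This is an illustrative Python sketch, not the Nobl9 or OpenSLO schema; the `OutcomeContract` fields and the `evaluate` helper are assumptions.

```python
from dataclasses import dataclass

@dataclass
class OutcomeContract:
    """Illustrative outcome contract; field names are assumptions, not an OpenSLO/Nobl9 schema."""
    name: str
    customer_segment: str   # who is impacted if the contract is breached
    metric: str             # the business metric the contract is written against
    target: float           # minimum acceptable value over the rolling window
    window_days: int        # rolling evaluation window

    def evaluate(self, observed: float) -> bool:
        """True if the rolling observed value satisfies the contract."""
        return observed >= self.target


# The example contract from Layer 1, expressed as code.
checkout_tier1 = OutcomeContract(
    name="checkout-conversion-tier1",
    customer_segment="Tier-1 enterprise",
    metric="checkout_conversion_rate",
    target=0.992,
    window_days=7,
)

if __name__ == "__main__":
    # 0.987 observed over the rolling window -> contract breached -> the agent runtime is invoked.
    print(checkout_tier1.evaluate(observed=0.987))  # False
```

The point of the shape: the alert condition is a named-customer business outcome, not a CPU threshold.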
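Layer 2's investigation loop, reduced to a rough skeleton. Everything here is hypothetical scaffolding: `query_telemetry`, `propose_hypothesis`, and `remediate` stand in for telemetry queries, LLM calls, and governed remediation tools, and correspond to no real vendor API. It only shows the triage → hypothesize → act → document shape.

```python
# Hypothetical skeleton of an agent investigation loop (Layer 2).

def query_telemetry(question: str) -> str:
    """Stand-in for querying traces/logs/metrics; would hit an OTel-backed store."""
    return f"telemetry for: {question}"

def propose_hypothesis(evidence: list[str]) -> str:
    """Stand-in for an LLM call that turns evidence into a root-cause hypothesis."""
    return "checkout latency regression after a recent deploy" if evidence else "unknown"

def remediate(hypothesis: str) -> bool:
    """Stand-in for invoking a governed tool (e.g. rolling back a deploy). Returns success."""
    return "deploy" in hypothesis

def investigate(contract_name: str, max_steps: int = 3) -> str:
    """Triage -> hypothesize -> act -> document, then hand a post-mortem to a human grader."""
    evidence: list[str] = []
    for step in range(max_steps):
        evidence.append(query_telemetry(f"{contract_name} step {step}"))
        hypothesis = propose_hypothesis(evidence)
        if remediate(hypothesis):
            return f"RESOLVED: {hypothesis}; post-mortem drafted for human review"
    return f"ESCALATED: {contract_name} paged to a human after {max_steps} steps"

if __name__ == "__main__":
    print(investigate("checkout-conversion-tier1"))
```

Note the exit conditions: the human is paged only when the loop gives up, and even a successful run ends with a draft post-mortem for Layer 4 to grade.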
What Disappears
- Alert fatigue as a category — when agents triage at the source, humans never see the 95% noise. The entire "alert tuning" job function shrinks 70-90%.
- PagerDuty-style human paging at volume — paging humans for issues an agent can resolve becomes a *failure mode*, not a *feature*. PagerDuty must become the agent-escalation layer or become the pay phone of SRE.
- Splunk-style dashboard customization as a billable hour — "build me a dashboard" consultancy collapses; agents render the view they need on demand.
- Manual runbook execution tools — Rundeck, StackStorm, hand-coded Ansible incident playbooks lose ground to MCP-exposed agent tools.
- The "ChatOps" middle layer — Slackbots that paste graphs into channels. Agents skip the channel and post the *resolution*, not the symptom.
- The "NOC analyst" job — the human staring at a wall of screens at 3am is replaced by a single agent and one humans-on-the-loop reviewer per shift.
What Gets More Important
- OpenTelemetry — the agent's eyes. Without standardized, semantically tagged traces/logs/metrics, agents hallucinate root cause. OTel becomes the bedrock standard, not a nice-to-have (a tagging sketch follows this list).
- Model Context Protocol (MCP) — the agent's hands. Anthropic's MCP (and equivalents) standardizes how agents call your infrastructure tools safely. The new "REST API of monitoring" (a governed-tool-registry sketch follows this list).
- Outcome-contract design tools — Nobl9, OpenSLO, Cortex, Effx — turning fuzzy business outcomes into machine-enforceable contracts is the new "writing good acceptance criteria."
- Agent guardrail platforms — Lakera, Robust Intelligence, Protect AI, Credal — preventing the remediation agent from running `rm -rf /` or rolling back the wrong deploy is a board-level concern.
- Eval + audit infrastructure — LangSmith, Arize, Braintrust, Helicone, Datadog LLM Obs — grading agent decisions becomes the new code review.
- Incident anthropology + post-mortem synthesis — humans curating *patterns* across thousands of agent-resolved incidents is the new senior SRE skill.
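A small illustration of why OTel matters to the agent: a span whose attributes carry the business context the investigation needs. The snippet uses the public opentelemetry-api surface (`get_tracer`, `start_as_current_span`, `set_attribute`); the `app.customer.tier` and `app.checkout.cart_value` attribute names are our own illustrations, not official semantic conventions.

```python
# Requires the opentelemetry-api package (pip install opentelemetry-api).
# Without attributes like these, the investigation agent sees latency but not
# which contract (and which customer tier) that latency actually threatens.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def handle_checkout(customer_tier: str, cart_value: float) -> None:
    with tracer.start_as_current_span("checkout") as span:
        # Standard resource attributes would identify the service and deploy;
        # the two below are illustrative app-level attributes, not official conventions.
        span.set_attribute("app.customer.tier", customer_tier)
        span.set_attribute("app.checkout.cart_value", cart_value)
        # ... business logic ...

if __name__ == "__main__":
    handle_checkout("tier1", 4200.00)
```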
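Here is a sketch of what "governed, audited, rate-limited tools" could look like underneath an MCP-style registry. This is plain Python, not the MCP SDK; the `GovernedTool` class, the `blast_radius` taxonomy, and the audit-log format are all assumptions.

```python
# Hypothetical governed tool registry -- a stand-in for what an MCP-style
# surface exposes to the agent. Not the MCP SDK; all names are illustrative.
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class GovernedTool:
    name: str
    blast_radius: str                  # e.g. "pod", "service", "region" -- an assumed taxonomy
    max_calls_per_hour: int
    action: Callable[..., str]
    calls: list[float] = field(default_factory=list)

    def invoke(self, **kwargs) -> str:
        now = time.time()
        # Rate limit: drop call records older than an hour, then check the budget.
        self.calls = [t for t in self.calls if now - t < 3600]
        if len(self.calls) >= self.max_calls_per_hour:
            return f"DENIED {self.name}: rate limit exceeded, escalate to human"
        self.calls.append(now)
        result = self.action(**kwargs)
        # Every invocation is audit-logged so Layer 4 can replay and grade it.
        print(f"AUDIT {self.name} args={kwargs} result={result!r}")
        return result

restart_pod = GovernedTool(
    name="restart_pod",
    blast_radius="pod",
    max_calls_per_hour=5,
    action=lambda pod: f"restarted {pod}",
)

if __name__ == "__main__":
    print(restart_pod.invoke(pod="checkout-7f9c"))
```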
The 5 New Job Categories
- Outcome Architect — translates business objectives into machine-enforceable contracts. Hybrid of product manager + SRE + lawyer.
- Agent Trainer — fine-tunes, prompts, and red-teams the investigation agents. Owns the agent's knowledge base and tool-use repertoire.
- Tools Registrar — owns the MCP/function registry. Decides which production tools agents can call, with what guardrails, at what blast radius.
- Audit Engineer — grades agent decisions, runs evals on resolution quality, owns the audit trail for regulators and post-mortems (a grading sketch follows this list).
- Incident Anthropologist — reads patterns across thousands of agent-resolved incidents, surfaces systemic risk humans should redesign around. The new principal SRE.
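What "grading the agent" might look like in practice, reduced to a toy rubric. The criteria, weights, and pass bar below are assumptions, not a LangSmith/Arize/Braintrust API; the point is that the audit engineer's unit of work is a scored resolution, not a page.

```python
# Toy grading rubric for an agent-resolved incident. Criteria, weights, and the
# passing threshold are illustrative assumptions, not any vendor's eval schema.
def grade_resolution(record: dict) -> float:
    """Return a 0-1 score for one agent-resolved incident."""
    checks = {
        "contract_restored": 0.4,       # did the outcome contract recover within its window?
        "root_cause_correct": 0.3,      # did a human reviewer confirm the hypothesis?
        "blast_radius_respected": 0.2,  # did every tool call stay inside its guardrails?
        "postmortem_complete": 0.1,     # is the write-up replayable from the audit trail?
    }
    return sum(weight for key, weight in checks.items() if record.get(key))

if __name__ == "__main__":
    incident = {
        "contract_restored": True,
        "root_cause_correct": True,
        "blast_radius_respected": True,
        "postmortem_complete": False,
    }
    print(f"resolution score: {grade_resolution(incident):.2f}")  # 0.90, against an assumed 0.8 pass bar
```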
The 5 New Vendor Categories
- Outcome Contract Platforms — Nobl9, OpenSLO foundations, Cortex, Effx, Datadog SLOs, Honeycomb SLOs.
- Agent Investigation Runtimes — Datadog Bits AI, Splunk AI Assistant, Microsoft Security Copilot, Dynatrace Davis CoPilot, New Relic AI, PagerDuty AIOps, Resolve.ai.
- Tools Registry / MCP Platforms — Anthropic MCP, OpenTelemetry, OpenAI function calling, LangChain Hub, Composio, Zapier MCP.
- Agent Guardrails — Lakera, Robust Intelligence, Protect AI, Credal, Prompt Security.
- Agent Evals + Audit — LangSmith, Arize, Braintrust, Helicone, Datadog LLM Obs, Galileo, Patronus AI.
The Hourglass Stack
| Layer | Old Vendor (2020-25) | New Vendor (2026-30) | Customer Impact | Timeline |
|---|---|---|---|---|
| Outcome Contracts | Hand-written SLA PDFs | Nobl9, OpenSLO, Datadog SLOs, Cortex | Named-customer impact becomes the alert | 2026-27 mainstream |
| Agent Investigation | Human on-call + runbooks | Bits AI, Splunk AI Assistant, MS Security Copilot, Davis CoPilot | MTTR drops 60-80%, 3am pages drop 70%+ | 2026 GA, 2027 default |
| Tools Registry | Hand-coded Ansible / Rundeck | MCP, OpenTelemetry, function calling | Safe, audited agent actions | 2026-28 standardization |
| Audit + Governance | Post-mortem Confluence pages | LangSmith, Arize, Braintrust, Helicone | Regulators + boards trust agent ops | 2027-29 enterprise default |
| Dashboards (residual) | Datadog / Grafana / Splunk UI | Agent-rendered ephemeral views | Dashboards become *outputs of questions*, not *configurations* | 2027-30 long-tail decline |
| Paging (residual) | PagerDuty, Opsgenie | Agent-escalation-only paging | Humans paged ~10% of 2024 volume | 2026-29 collapse |
Bottom Line
Monitoring does not die in the agent era — it inverts into an hourglass. Contracts compress on top, agents compress on the bottom, and the flabby middle of dashboards-runbooks-paging gets squeezed out into a thin governance waist that humans only touch to verify outcomes. The winning phrase to remember: "Humans negotiate the contract. Agents run the runtime. Nobody babysits the dashboard." PagerDuty-style human paging becomes the pay phone of SRE; OpenTelemetry and MCP become the TCP/IP of agent operations; and the new senior SRE is an *incident anthropologist* who reads patterns across thousands of agent-resolved incidents instead of being the runtime themselves. The vendors who own a layer of the Hourglass Stack — Datadog at contracts + agent runtime, Splunk at security agent runtime, Microsoft at Copilot-everywhere, Honeycomb + Nobl9 at contracts, Anthropic + OpenAI at the agent fabric — own the next decade. The vendors stuck selling the squeezed middle become the New Relics of the AI cycle: still profitable, no longer the default answer.
Related: [q1650](/lab/cheap-100/q1650) (Datadog moat scenarios) · [q1674](/lab/cheap-100/q1674) (Datadog AI strategy) · [q1709](/lab/cheap-100/q1709) (Datadog AI buyer thesis)