
Why Internal AI SRE Tools Fail (And How We Fixed the Eval Loop)

Jimmy Wei
Dec 10, 2025 · 6 min read

Last week at AWS re:Invent, I met dozens of engineering leaders. Almost all of them whispered the same thing: "We're building an internal AI SRE agent."

It makes sense. The promise is intoxicating: an agent that never sleeps, instantly reads every log, and fixes the site while you snooze. But having built (and watched others try to build) these tools internally at companies like Roblox, I can tell you the uncomfortable truth: Most of these projects will fail.

They won't fail because the LLMs aren't smart enough. They will fail because "building a demo" and "building a product" are two very different disciplines, and internal teams rarely have the resources for the latter.

The "Hello World" Trap

It is incredibly easy to build a POC. You hook up GPT-4 to PagerDuty, feed it the alert payload, and ask it to summarize. Boom. It works. It looks like magic. The VP of Infra is impressed.
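To be concrete, the entire demo is basically a webhook handler and one prompt. Here's a rough sketch of what that POC usually looks like (the payload shape, model name, and prompt are illustrative, not production code):

```python
# A minimal sketch of the typical POC: take a PagerDuty-style alert payload
# and ask an LLM to summarize it. Payload shape and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_alert(alert_payload: dict) -> str:
    """Ask the model for a short incident summary of a raw alert."""
    prompt = (
        "You are an SRE assistant. Summarize this alert and suggest a first "
        "diagnostic step:\n" + json.dumps(alert_payload, indent=2)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# The kind of payload a PagerDuty webhook might deliver (hypothetical example)
alert = {"service": "checkout-api", "alert": "p99 latency > 2s", "region": "us-east-1"}
print(summarize_alert(alert))
```

Thirty lines, one afternoon, great demo. None of the hard parts are in it.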

Then you deploy it. And the reality sets in.

  • It gets spammy. Engineers start ignoring the bot because it comments on every noisy CPU spike that self-resolves in 2 minutes (the naive pipeline has no debounce step; see the sketch after this list).
  • It hallucinates. It confidently suggests running a script that was deprecated three years ago.
  • It lacks depth. For complex incidents, "reading logs" isn't enough. You need to inspect traffic, check recent config changes, and query specific database replicas.
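The first bullet, for instance, is partly just missing suppression logic. Here's the kind of debounce check the naive pipeline skips (the five-minute window is an arbitrary placeholder, not a recommendation):

```python
# Illustrative suppression check: ignore alerts that haven't been firing long
# enough to matter. The window below is a made-up threshold for illustration.
from datetime import datetime, timedelta, timezone

SELF_RESOLVE_WINDOW = timedelta(minutes=5)

def should_comment(alert_started_at: datetime, still_firing: bool) -> bool:
    """Comment only on alerts that are still firing after the debounce window."""
    age = datetime.now(timezone.utc) - alert_started_at
    return still_firing and age >= SELF_RESOLVE_WINDOW
```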

The project dies a quiet death. The channel gets muted.

The "Accuracy Ceiling" & The Eval Problem

The biggest reason these tools fail is a lack of rigorous evaluation. Internal teams often rely on "vibes": "Hey, the bot's answer looked pretty good for that last outage."

But "vibes" don't scale. If you change the system prompt to fix a database issue, did you just break how it handles Kubernetes crash loops? Without a regression suite, you don't know.

At IncidentFox, we realized that Accuracy = Evals. We built a "Time Travel" evaluation engine to solve this.

Time-Travel Backtesting

We take historical incidents—real Slack threads, real logs, real alerts—and turn them into test cases. But there's a catch: Data Leakage.

If the agent can "see" the future (e.g., a human engineer finding the root cause 20 minutes later in the Slack thread), it's cheating. It will just parrot the human answer.

Our engine rigorously filters the context window. When the agent is "simulating" minute 5 of the incident, it is strictly forbidden from seeing data from minute 6. We then compare its hypothesis at minute 5 against the root cause the humans eventually confirmed.
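The core rule is simple to state, even if the plumbing isn't. Here's a sketch of the "no peeking" filter, with an illustrative event shape (not our actual schema):

```python
# A sketch of the "no peeking" rule in time-travel backtesting: when replaying
# minute N of an incident, the agent only sees events recorded up to that point.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Event:
    timestamp: datetime
    source: str   # "slack", "logs", "alerts", ...
    body: str

def context_at(events: list[Event], simulated_now: datetime) -> list[Event]:
    """Return only the events the agent could have seen at simulated_now."""
    return sorted(
        (e for e in events if e.timestamp <= simulated_now),
        key=lambda e: e.timestamp,
    )
```

Because the human's eventual diagnosis arrives later in the timeline, the same timestamp cutoff that hides "minute 6" also hides the answer the agent would otherwise parrot.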

This discipline allows us to confidently say: "This new prompt improved Root Cause Analysis accuracy by 14% across 500 historical incidents." Internal teams rarely have the time to build this infrastructure.

The "God Agent" Fallacy

Another common mistake is trying to build one single "SRE Bot" for the whole company.

I experienced this pain personally. At Roblox, if I (an Application Engineer) got paged for a dependency failure in the "Experimentation Platform," I was helpless. I didn't know their codebase. I didn't know their runbooks.

An AI agent has the same problem. A generic system prompt cannot be an expert in Payments, Databases, and Rendering simultaneously.

The solution is Federation.

We built IncidentFox to support Team-Aware Agents. The Payments team configures their agent with their specific tools (Stripe API, Fraud Logs). The Database team configures theirs (RDS Performance Insights, slow query logs).

When an incident hits, our graph determines which agent owns the context. If the issue bleeds across boundaries, the agents can coordinate—just like human engineers do, but instantly.
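Concretely, federation is mostly a registry plus a routing step. Here's a hypothetical sketch (the team names, tools, and service-to-team map are made up for illustration):

```python
# An illustrative routing sketch for team-aware agents: each team registers
# its own toolset, and an ownership map picks the agent(s) for an incident.
TEAM_AGENTS = {
    "payments": {"tools": ["stripe_api", "fraud_logs"], "runbooks": "payments/"},
    "database": {"tools": ["rds_performance_insights", "slow_query_logs"], "runbooks": "db/"},
}

SERVICE_OWNERS = {
    "checkout-api": "payments",
    "orders-db": "database",
}

def route_incident(affected_services: list[str]) -> list[str]:
    """Return the team agents that own the affected services (may be several)."""
    owners = {SERVICE_OWNERS[s] for s in affected_services if s in SERVICE_OWNERS}
    return sorted(owners)

# A cross-boundary incident pulls in both agents, which can then coordinate.
print(route_incident(["checkout-api", "orders-db"]))  # ['database', 'payments']
```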

Buy vs. Build

We built IncidentFox because we believe this platform layer needs to exist so teams don't have to reinvent the wheel.

You shouldn't be spending your limited engineering cycles building eval harnesses, Slack integrations, and RBAC permission layers for AI tools. You should be spending that time building your actual product.

If you're tired of maintaining an internal bot that everyone snoozes, let's talk. We're looking for partners who want to stop debugging the debugger.

Stop Fighting Fires Manually

We are accepting 2 design partners for our Q1 pilot program. Get the disciplined AI SRE platform your team deserves.

Book Free Pilot