How Snorkel AI Went from Zero to 93% RCA Accuracy in Days

When the Snorkel AI team started piloting Herald, the critical question was: how long before an AI DevOps agent actually becomes useful? Snorkel’s engineering team ships fast and didn't have months to test, tune, and configure an agent before seeing results.

Most AI tools sold to engineering teams come with a setup tax. You spend weeks writing runbooks, tagging past incidents, and tweaking agent behavior. The intelligence you were promised shows up months later, after significant human investment. All of that work typically goes out of date almost immediately, and we expected something similar here.

Fortunately, that's not how things worked with Herald.

Day One, No History, No Runbooks

We connected four data sources to get started: GitHub, Jira, Buildkite, and Datadog. That was it. No runbooks. No historical RCA documents. No Slack history. The Snorkel AI product line we piloted Herald on was relatively new, so there wasn't a rich archive of past incidents to draw from.

Herald had to work with what it could discover on its own.

Two days in, we had our first correct RCA. A few days later, one of our PMs put the agent through a proper stress test. He worked it through 90 tasks: real issues that required an understanding of our systems, our code, and our infrastructure.

The agent was 93% accurate.

We shared these results at our next engineering all-hands, explaining that when we asked the Herald agent 90 challenging questions, it got nearly every one right. That's a big sell with that kind of audience. As a result, we discontinued our POC with a general-purpose AI enterprise search platform, and our team went on to win Snorkel's internal Engineering Excellence Award.

The Cold Start Advantage

One thing that truly surprised us was how little setup it took. There was no onboarding period, no lengthy data preparation phase, no historical RCA library to learn from. Most DevOps agents require weeks of setup before they can do anything useful. A team has to write runbooks, label past incidents, tag infrastructure components, and build a knowledge base. Even then, the initial accuracy is low and requires extensive tuning.

We skipped all of that. Herald ingested our four data sources in hours and produced its first correct RCA within two days. It learned our product, infrastructure, and business from the ground up in the same way a strong engineer does when they join a new team. It read the code, followed the logs, and learned the relationships between services.

Historical context like runbooks, RCAs, and postmortems always helps, but we're a fast-moving team and, honestly, our documentation is never fully up to date. We needed a tool that could work things out even when the docs don't have the answer, the way an engineer reasons from first principles. Building an agent that takes an autonomous approach to onboarding isn’t easy, but the results spoke for themselves.

An Incident That Said It All

A few weeks after go-live, the delivery team pinged engineering on Slack saying that the product was down. They didn't share a stack trace or logs, just an observation that things weren’t working.

Someone asked Herald if it could figure out what was going on.

The agent did the investigation, pulling from our Datadog logs, correlating signals across services, and returning a precise diagnosis. Not a list of possibilities, just the root cause. An engineer looked at the results and quickly applied a fix. The platform was back up within minutes.

I brought that story to a leadership meeting: end-to-end resolution in minutes, from a team member noticing something might be wrong to a fixed production issue. This is not a typical outcome. Typically, an engineer gets notified of an issue, drops what they’re doing, works out the context in which it is occurring, then spends hours investigating until the root cause is identified and resolved. Herald completely changed that equation.

After two weeks, 27 engineers at Snorkel had root-caused 52 incidents, and only one received a negative feedback rating. At no point did engineers have to manually comb through Datadog or ping the right subject matter expert.

What's Next

We started the pilot asking how long it would take before Herald became useful. The answer was days.

Herald is going to be our first line of defense for production issues. The next step is closing the loop entirely. The ability for Herald to identify an issue, fork the relevant repo, write a code change, and open a pull request for human review is already on the table. Engineers stay in control while the agent does the bulk of the work.