6 min read | Saved February 14, 2026
Do you care about this?
This article details how Datadog's teams used LLM Observability to enhance their natural language query (NLQ) agent for analyzing cloud costs. It covers the creation of a ground truth dataset, the challenges of evaluating AI-generated queries, and the implementation of a structured debugging process to identify and address errors.
If you do, here's more
Datadog's engineering teams have developed a natural language query (NLQ) agent that translates plain-English questions into valid metrics queries for cloud cost management. The focus is on making complex data easier to understand for financial operations and engineering teams. The process began with lightweight user testing, where internal users posed real questions. This generated LLM traces that documented user prompts, tool responses, and final queries. The team then created a reference dataset from these traces, ensuring it reflected genuine user language instead of artificial prompts.
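A reference dataset like the one described can be pictured as a list of records pairing each real user prompt with the query a correct agent should produce. The record shape, field names, and query string below are illustrative assumptions for the sketch, not Datadog's actual schema or query syntax.

```python
from dataclasses import dataclass, field

@dataclass
class GroundTruthCase:
    """One entry in a hypothetical NLQ reference dataset (illustrative only)."""
    prompt: str                # the user's real plain-English question
    expected_query: str        # the metrics query a correct agent would emit
    tags: list = field(default_factory=list)  # which query features the case exercises

# A plausible cloud-cost question captured from user testing (hypothetical):
case = GroundTruthCase(
    prompt="What did we spend on S3 last month, broken down by team?",
    expected_query="sum:aws.cost{service:s3} by {team}",
    tags=["filter", "group-by"],
)
```

Keeping the cases as plain data like this makes it easy to grow the dataset from new traces as more internal users exercise the agent.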
The challenges with evaluating the NLQ agent stem from the nondeterministic nature of large language models. Early attempts at validation via exact string comparison were misleading: semantically equivalent queries that differed only in formatting or ordering were counted as failures, producing unreasonably low success rates. Instead of a simple pass/fail approach, the team broke correctness down into specific components such as parsing, metric selection, roll-up accuracy, group-bys, and filters. Each component has its own evaluator, allowing for more nuanced feedback. This helped categorize errors effectively, showing where improvements were needed without the labor-intensive manual comparisons that previously slowed down debugging.
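The component-wise idea can be sketched as parsing both queries into their parts and comparing each part independently. The toy grammar below (aggregator, metric, filters, group-bys) is a simplified stand-in for a real metrics query language, and the evaluator names are taken from the components the article lists; none of this is Datadog's actual implementation.

```python
import re

def parse_query(q: str) -> dict:
    """Naively split a query of the form `agg:metric{filters} by {groups}`.
    This grammar is a simplified stand-in, not a real query parser."""
    m = re.match(r"(?P<agg>\w+):(?P<metric>[\w.]+)\{(?P<filters>[^}]*)\}"
                 r"(?:\s+by\s+\{(?P<groups>[^}]*)\})?", q.strip())
    if not m:
        return {"parsed": False}
    groups = m.group("groups")
    return {
        "parsed": True,
        "metric": m.group("metric"),
        # Sort so that ordering differences don't count as failures:
        "filters": sorted(f for f in m.group("filters").split(",") if f),
        "group_bys": sorted(groups.split(",")) if groups else [],
    }

def evaluate(generated: str, expected: str) -> dict:
    """Score each component independently instead of exact string matching."""
    g, e = parse_query(generated), parse_query(expected)
    if not g["parsed"]:
        return {"parsing": False}
    return {
        "parsing": True,
        "metric_selection": g["metric"] == e["metric"],
        "filters": g["filters"] == e["filters"],
        "group_bys": g["group_bys"] == e["group_bys"],
    }

# These two queries differ only in filter order, so a string comparison
# would mark the first a failure, while every component check passes:
scores = evaluate("sum:aws.cost{team:web,service:s3} by {team}",
                  "sum:aws.cost{service:s3,team:web} by {team}")
```

The per-component dict is what makes error categorization cheap: a run that fails only `group_bys` across many cases points at one specific weakness rather than an opaque overall failure rate.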
The integration of LLM Observability has streamlined the testing process significantly. Every code change triggers a run of the evaluators against the dataset, providing immediate feedback. The system captures detailed traces of each run, making it easier to pinpoint exactly where failures occur. This methodology has reduced the time spent on debugging agent failures from hours to minutes, achieving a roughly 20x efficiency gain. With this setup, the team can also evaluate different models, like Anthropic Claude, using the same standardized process, allowing for objective comparisons across various performance metrics.
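A per-change regression run of this kind might aggregate evaluator results as shown below. This is a minimal self-contained sketch, not Datadog's LLM Observability API: the `agent` callable and the two trivial evaluators are hypothetical stand-ins for the real pipeline.

```python
from collections import Counter

def run_regression(cases, agent, evaluators):
    """Run every evaluator over every dataset case and tally pass rates.
    `agent` maps a prompt to a generated query; each evaluator maps
    (generated, expected) to a boolean. All are hypothetical stand-ins."""
    passes, totals = Counter(), Counter()
    for prompt, expected in cases:
        generated = agent(prompt)
        for name, check in evaluators.items():
            totals[name] += 1
            passes[name] += bool(check(generated, expected))
    # Pass rate per evaluator, across the whole dataset:
    return {name: passes[name] / totals[name] for name in totals}

# Toy demonstration: an "agent" that happens to emit the expected query,
# checked by two crude component evaluators (text before/after the first brace):
cases = [("spend on s3 by team", "sum:aws.cost{service:s3} by {team}")]
evaluators = {
    "metric_selection": lambda g, e: g.split("{")[0] == e.split("{")[0],
    "filters": lambda g, e: g.split("{", 1)[1] == e.split("{", 1)[1],
}
report = run_regression(
    cases,
    agent=lambda prompt: "sum:aws.cost{service:s3} by {team}",
    evaluators=evaluators,
)
# report maps each evaluator name to its pass rate across the dataset
```

Because the harness only depends on the `agent` callable, swapping in a different model (the article mentions Anthropic Claude) means passing a different function while the dataset and evaluators stay fixed, which is what makes the model comparisons objective.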