7 min read | Saved February 14, 2026
Do you care about this?
This article discusses a method for identifying software vulnerabilities by integrating large language models (LLMs) with static analysis tools like CodeQL. The authors highlight their tool, Vulnhalla, which filters out false positives and focuses on genuine security issues, illustrating the challenges of using LLMs in vulnerability research.
If you do, here's more
The blog post outlines a new tool called Vulnhalla, which combines large language model (LLM) reasoning with static analysis to identify software vulnerabilities more effectively. Layering an LLM on top of CodeQL significantly reduces false positives, allowing developers and security researchers to concentrate on genuine security threats. In just two days and on a budget under $80, the team discovered multiple vulnerabilities, including significant ones like CVE-2025-38676 in the Linux Kernel and CVE-2025-0518 in FFmpeg.
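The layering described above can be pictured as a simple triage loop: static-analysis alerts go in, and an LLM verdict decides which ones reach a human reviewer. This is a minimal sketch, not Vulnhalla's actual implementation; the `Alert` shape, the prompt wording, and `query_llm` (here a stub that always answers "false positive") are all assumptions, and a real pipeline would call an actual model API.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    rule_id: str   # e.g. "cpp/uncontrolled-allocation-size"
    file: str
    line: int
    snippet: str

def query_llm(prompt: str) -> str:
    # Placeholder: a real implementation would send `prompt` to a model
    # and parse its answer. This stub always dismisses the alert.
    return "FALSE_POSITIVE"

def triage(alerts: list[Alert]) -> list[Alert]:
    """Keep only the alerts the model judges to be genuine vulnerabilities."""
    kept = []
    for a in alerts:
        prompt = (
            f"CodeQL flagged {a.rule_id} at {a.file}:{a.line}.\n"
            f"Code:\n{a.snippet}\n"
            "Is this a genuine vulnerability? "
            "Answer TRUE_POSITIVE or FALSE_POSITIVE."
        )
        if query_llm(prompt) == "TRUE_POSITIVE":
            kept.append(a)
    return kept
```

The design point is that the LLM never has to find the bug; it only has to judge a single, already-localized alert, which is a far more tractable prompt.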
The challenge with using LLMs for vulnerability detection lies in two main issues: the WHERE problem, which concerns identifying the right part of the code to analyze, and the WHAT problem, which involves determining the type of bug to search for. Existing efforts from Google and OpenAI attempt to tackle these problems through different strategies, but neither fully resolves them. Google's Big Sleep analyzes code changes related to known vulnerabilities, while OpenAI's Aardvark focuses on identifying hazardous commits. Both approaches still face limitations in effectively narrowing down the types of bugs to hunt for.
Static analysis tools like CodeQL can parse code to identify security issues, but they often generate an overwhelming number of alerts, most of which are false positives. The blog illustrates this with a scenario in which the alerts from a single repository's results could keep a researcher busy for two years. That much noise makes it hard to pinpoint real vulnerabilities. The authors argue that integrating LLMs with static analysis can enhance the detection process: the combined approach filters alerts to determine which are legitimate concerns versus false alarms.
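The scale of that noise is easy to measure, since CodeQL emits its alerts in the standard SARIF format (JSON). A short sketch like the following, which only assumes the standard `runs[*].results[*].ruleId` layout of a SARIF file, tallies alerts per rule and makes the triage backlog concrete:

```python
import json
from collections import Counter

def alert_counts(sarif_text: str) -> Counter:
    """Count alerts per rule from the JSON text of a SARIF results file."""
    sarif = json.loads(sarif_text)
    counts: Counter = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Each SARIF result carries the ID of the query that fired.
            counts[result.get("ruleId", "unknown")] += 1
    return counts
```

Feeding each counted alert, with its surrounding source snippet, to an LLM for a keep/discard verdict is exactly the filtering step the article advocates.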
The article also emphasizes the inadequacy of static analysis alone, citing studies where developers reported false-positive rates of up to 80% in some tools. The proposed method aims to streamline the vulnerability assessment process by using CodeQL to generate a database of potential issues and leveraging an LLM to evaluate the alerts generated. This could lead to a more focused and efficient review process, ultimately improving code security without getting lost in the noise of excessive alerts.