4 min read | Saved February 14, 2026
Do you care about this?
This article benchmarks GPT-5.1, Claude Opus 4.5, and Gemini 3 Pro for security operations tasks. GPT-5.1 and Opus 4.5 show improved accuracy and speed, while Gemini 3 Pro lags behind. The findings help teams choose the best AI model for automation in SecOps.
If you do, here's more
The latest benchmark evaluates the AI models GPT-5.1, Claude Opus 4.5, and Gemini 3 Pro on security operations tasks using the Splunk BOTSv3 dataset. GPT-5.1 and Opus 4.5 tied for the highest accuracy at 65%, a slight improvement over the previous state of the art (SOTA) of 63%. Gemini 3 Pro improved significantly over its predecessor, hitting 51% accuracy, but it still lags behind the top performers. The analysis indicates that while Opus 4.5 matches GPT-5.1's accuracy, it comes at a higher cost per task.
In terms of efficiency, Opus 4.5 completed tasks in an average of 122 seconds, making it the fastest model tested; that speed matters for time-sensitive investigations. GPT-5.1 took an average of 354 seconds, while Gemini 3 Pro averaged 500 seconds. Completion rates were strong across the board: GPT-5.1 finished 100% of tasks, while Opus 4.5 and Gemini 3 Pro each finished 92%, pointing to potential limitations on long-context tasks without fine-tuning.
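A quick back-of-envelope check from the numbers above: if a failed run is simply retried until it completes (a simplifying assumption of mine, not how the benchmark scored models), the expected wall-clock time per successfully completed task can be sketched as:

```python
# Average runtimes and completion rates as reported in the article.
avg_seconds = {"GPT-5.1": 354, "Opus 4.5": 122, "Gemini 3 Pro": 500}
completion_rate = {"GPT-5.1": 1.00, "Opus 4.5": 0.92, "Gemini 3 Pro": 0.92}

# Expected seconds per completed task, assuming failures are retried
# independently (illustrative only; the benchmark did not score retries).
for model, secs in avg_seconds.items():
    effective = secs / completion_rate[model]
    print(f"{model}: {effective:.1f}s per completed task")
```

Under this assumption, Opus 4.5 stays well ahead (roughly 133 seconds per completed task) even after accounting for its 92% completion rate.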
Tool and token efficiency were also assessed. GPT-5.1 averaged 14.5 tool calls per task, fewer than Opus 4.5's 16, suggesting tighter reasoning; Gemini 3 Pro made the fewest calls at 9.3, though that did not translate into better accuracy. Token consumption showed Opus 4.5 at 1.1 million tokens per task versus GPT-5.1's 1.2 million, both more efficient than Sonnet 4.5. For security teams, GPT-5.1 is recommended for blue-team investigations due to its reliability and cost-effectiveness, while Opus 4.5 is better suited to speed-critical tasks. Gemini 3 Pro, despite its improvements, still can't compete with the top models.
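The article notes that Opus 4.5 uses fewer tokens per task yet costs more per task than GPT-5.1, which only holds if its per-token price is higher. A minimal sketch of that arithmetic, using placeholder prices (the dollar figures below are my assumptions for illustration, not the vendors' actual pricing):

```python
# Tokens per task as reported in the article.
tokens_per_task = {"GPT-5.1": 1.2e6, "Opus 4.5": 1.1e6}

# HYPOTHETICAL per-million-token prices, chosen only to illustrate
# how a lower token count can still yield a higher cost per task.
price_per_mtok = {"GPT-5.1": 2.0, "Opus 4.5": 5.0}

def cost_per_task(model: str) -> float:
    """Estimated dollars per task: tokens used times price per token."""
    return tokens_per_task[model] / 1e6 * price_per_mtok[model]

for m in tokens_per_task:
    print(f"{m}: ${cost_per_task(m):.2f} per task")
```

With these placeholder rates, Opus 4.5 comes out more expensive per task despite consuming roughly 8% fewer tokens, matching the cost observation in the article.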