4 min read | Saved February 14, 2026
Do you care about this?
This article benchmarks GPT-5.1, Claude Opus 4.5, and Gemini 3 Pro for security operations tasks. GPT-5.1 and Opus 4.5 show improved accuracy and speed, while Gemini 3 Pro lags behind. The findings help teams choose the best AI model for automation in SecOps.
If you do, here's more
The latest benchmark evaluates the AI models GPT-5.1, Claude Opus 4.5, and Gemini 3 Pro on security operations tasks using the Splunk BOTSv3 dataset. GPT-5.1 and Opus 4.5 tied for the highest accuracy at 65%, a slight improvement over the previous state of the art (SOTA) of 63%. Gemini 3 Pro improved significantly over its predecessor, hitting 51% accuracy, but it still lags behind the top performers. The analysis indicates that while Opus 4.5 matches GPT-5.1's accuracy, it comes at a higher cost per task.
In terms of efficiency, Opus 4.5 completed tasks in an average of 122 seconds, making it the fastest model tested; that speed matters for time-sensitive investigations. GPT-5.1 took an average of 354 seconds, while Gemini 3 Pro averaged 500 seconds. Completion rates were strong across the board: GPT-5.1 finished 100% of tasks, while Opus 4.5 and Gemini 3 Pro each finished 92%, pointing to potential limitations on long-context tasks without fine-tuning.
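A quick back-of-envelope check from the numbers above: if a failed run is simply retried until it completes (a simplifying assumption of mine, not how the benchmark scored models), the expected wall-clock time per successfully completed task can be sketched as:

```python
# Average runtimes and completion rates as reported in the article.
avg_seconds = {"GPT-5.1": 354, "Opus 4.5": 122, "Gemini 3 Pro": 500}
completion_rate = {"GPT-5.1": 1.00, "Opus 4.5": 0.92, "Gemini 3 Pro": 0.92}

# Expected seconds per completed task, assuming failures are retried
# independently (illustrative only; the benchmark did not score retries).
for model, secs in avg_seconds.items():
    effective = secs / completion_rate[model]
    print(f"{model}: {effective:.1f}s per completed task")
```

Under this assumption, Opus 4.5 stays well ahead (roughly 133 seconds per completed task) even after accounting for its 92% completion rate.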
Tool and token efficiency were also assessed. GPT-5.1 averaged 14.5 tool calls per task, fewer than Opus 4.5's 16, suggesting tighter reasoning; Gemini 3 Pro made the fewest calls at 9.3, though that did not translate into better accuracy. Token consumption showed Opus 4.5 at 1.1 million tokens per task versus GPT-5.1's 1.2 million, both more efficient than Sonnet 4.5. For security teams, GPT-5.1 is recommended for blue-team investigations due to its reliability and cost-effectiveness, while Opus 4.5 is better suited to speed-critical tasks. Gemini 3 Pro, despite its improvements, still can't compete with the top models.
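The article notes that Opus 4.5 uses fewer tokens per task yet costs more per task than GPT-5.1, which only holds if its per-token price is higher. A minimal sketch of that arithmetic, using placeholder prices (the dollar figures below are my assumptions for illustration, not the vendors' actual pricing):

```python
# Tokens per task as reported in the article.
tokens_per_task = {"GPT-5.1": 1.2e6, "Opus 4.5": 1.1e6}

# HYPOTHETICAL per-million-token prices, chosen only to illustrate
# how a lower token count can still yield a higher cost per task.
price_per_mtok = {"GPT-5.1": 2.0, "Opus 4.5": 5.0}

def cost_per_task(model: str) -> float:
    """Estimated dollars per task: tokens used times price per token."""
    return tokens_per_task[model] / 1e6 * price_per_mtok[model]

for m in tokens_per_task:
    print(f"{m}: ${cost_per_task(m):.2f} per task")
```

With these placeholder rates, Opus 4.5 comes out more expensive per task despite consuming roughly 8% fewer tokens, matching the cost observation in the article.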