Do you care about this?
Terminal-Bench 2.0 launches with a new testing framework, Harbor, aimed at improving the evaluation of AI agents on terminal-based tasks. The update ships 89 validated tasks, fixes the inconsistencies of the previous version, and Harbor adds support for scalable testing in cloud environments.
If you do, here's more
Terminal-Bench 2.0 has launched alongside Harbor, a new framework for testing AI agents in terminal-based environments. The release replaces version 1.0 and fixes problems that plagued it, most notably inconsistently specified tasks. Terminal-Bench 2.0 includes 89 rigorously validated tasks that are more realistic and more clearly defined, and tasks that relied on unstable third-party APIs, such as downloading from YouTube, have been removed or revised to ensure reliability.
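The article doesn't show Terminal-Bench's actual task format, but the reliability fix it describes follows a general pattern: replace live third-party calls with pinned local fixtures so a task's verifier is deterministic. Here is a minimal illustrative sketch of that pattern in Python; every name and value in it is hypothetical, not taken from the benchmark.

```python
import hashlib
from pathlib import Path

# Pinned at task-creation time; the media file itself ships inside the
# task's container image instead of being fetched from YouTube at runtime.
EXPECTED_SHA256 = "0" * 64  # placeholder digest, illustrative only

def verify_solution(output_path: Path) -> bool:
    """Deterministic verifier: hash the agent's output and compare it to a
    pinned digest, so no live third-party service can make the test flaky."""
    digest = hashlib.sha256(output_path.read_bytes()).hexdigest()
    return digest == EXPECTED_SHA256
```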
Harbor supports evaluating AI agents in cloud-based containers, enabling large-scale testing and integration with training pipelines. It lets developers create custom benchmarks and is compatible with major cloud providers. The framework was used during the development of Terminal-Bench 2.0 itself, facilitating thousands of test rollouts. Initial results show OpenAI's Codex CLI, powered by GPT-5, achieving the highest success rate at 49.6%, followed closely by other GPT-5 variants and Claude Sonnet 4.5 models. The top scores are closely clustered, indicating a dynamic landscape in which no single model dominates.
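The article doesn't detail Harbor's API, but "thousands of rollouts" implies a driver that fans tasks out across isolated containers and collects pass/fail results. A rough sketch of that orchestration shape in Python follows; the container image names and the use of Docker here are assumptions for illustration, not Harbor's real interface.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

TASKS = [f"task-{i:03d}" for i in range(89)]  # one entry per validated task

def run_rollout(task_id: str) -> bool:
    # Hypothetical: each rollout runs in its own container so concurrent
    # agents cannot interfere with one another; the image name is made up.
    result = subprocess.run(
        ["docker", "run", "--rm", f"tbench/{task_id}"],
        capture_output=True,
    )
    return result.returncode == 0  # exit code 0 == the task's tests passed

# Fan out rollouts across worker threads; each thread blocks on one container.
with ThreadPoolExecutor(max_workers=32) as pool:
    outcomes = list(pool.map(run_rollout, TASKS))

print(f"success rate: {sum(outcomes) / len(outcomes):.1%}")
```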
To submit an agent for evaluation, users install Harbor and run benchmarks from the command line. Together, Terminal-Bench 2.0 and Harbor aim to standardize how AI agents are tested, answering the growing need for controlled, reproducible evaluation in increasingly complex operational environments.
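The article mentions command-line installation but doesn't reproduce the exact commands, and Harbor's real subcommands and flags aren't given, so the sketch below only shows the intended shape (install once, then run a named benchmark) by driving a stand-in CLI from Python. Both command lines are assumptions; consult the project's documentation for the actual invocation.

```python
import subprocess

# Assumed package name and subcommand, purely to illustrate the workflow.
subprocess.run(["pip", "install", "harbor"], check=True)             # one-time install
subprocess.run(["harbor", "run", "terminal-bench-2.0"], check=True)  # run the benchmark
```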