Click any tag below to further narrow down your results
Links
Terminal-Bench 2.0 launches with a new testing framework, Harbor, aimed at improving the evaluation of AI agents in terminal-based tasks. The update includes 89 validated tasks and addresses previous inconsistencies, while Harbor supports scalable testing in cloud environments.