PACT (Pairwise Auction Conversation Testbed) is a benchmark designed to evaluate conversational bargaining skills of language models through 20-round matches where a buyer and seller exchange messages and bids. The benchmark allows for analysis of negotiation strategies and performance, offering insights into how agents adapt and negotiate over time. With over 5,000 games played, it provides a comprehensive view of each model's bargaining capabilities through metrics like the Composite Model Score (CMS) and Glicko-2 ratings.
AI Diplomacy reimagines the classic game Diplomacy by having a dozen large language models compete for dominance in a simulated 1901 Europe. The experiment aims to evaluate the negotiation strategies and behaviors of these AIs, revealing insights into their trustworthiness and capabilities. Viewers can watch the AIs interact in real-time through a live Twitch stream.