TextQuests introduces a benchmark for evaluating Large Language Models (LLMs) on classic text-based video games, focusing on their ability to sustain long-context reasoning and to learn through exploration. Agents are assessed on both game progress and ethical behavior across a range of interactive fiction games, and the results surface recurring failure modes such as hallucination and inefficient dynamic thinking. The aim is to give researchers a clearer picture of LLM capabilities in complex, exploratory environments.
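To make the evaluation setup more concrete, here is a minimal sketch of what a long-context rollout over a single interactive fiction game could look like. The `TextGameEnv` interface, the `agent` callable, and the progress metric below are illustrative assumptions for this sketch, not the benchmark's actual API.

```python
# Hypothetical sketch of a TextQuests-style evaluation loop; the environment
# interface, agent call, and metric names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Protocol


class TextGameEnv(Protocol):
    """Assumed interface for an interactive-fiction game wrapper."""

    def reset(self) -> str: ...  # initial game text
    def step(self, command: str) -> tuple[str, float, bool]: ...  # obs, score delta, done


@dataclass
class EpisodeResult:
    progress: float = 0.0  # cumulative game score, used as a proxy for progress
    steps: int = 0
    transcript: list[str] = field(default_factory=list)


def run_episode(env: TextGameEnv,
                agent: Callable[[str], str],
                max_steps: int = 200) -> EpisodeResult:
    """Roll out one game, feeding the full transcript back to the agent so it
    must reason over a growing long context rather than a single observation."""
    result = EpisodeResult()
    observation = env.reset()
    for _ in range(max_steps):
        result.transcript.append(observation)
        # The agent sees the entire history of observations and commands.
        command = agent("\n".join(result.transcript))
        result.transcript.append(f"> {command}")
        observation, score_delta, done = env.step(command)
        result.progress += score_delta
        result.steps += 1
        if done:
            break
    return result
```

Feeding the whole transcript back on every turn is what makes the setting a long-context test: the agent must remember earlier rooms, items, and failed attempts without an external scratchpad.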
JudgeLRM proposes training Large Language Models (LLMs) to act as evaluators (judges), particularly for complex reasoning tasks. By applying reinforcement learning with judge-wise rewards rather than supervised fine-tuning (SFT) alone, JudgeLRM models outperform both SFT baselines and current leading models, with the clearest gains on tasks that demand deep reasoning.
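As a rough illustration of what a judge-wise reward could look like, the sketch below scores a judge's paired verdict against reference scores using a format check, a ranking-agreement term, and a score-closeness term. The verdict format, the weights, and the specific reward terms are assumptions made for this sketch, not the paper's published formula.

```python
# Illustrative judge-wise reward for RL-training an LLM judge, loosely in the
# spirit of JudgeLRM; the exact terms and weights here are assumptions.
from __future__ import annotations

import re


def parse_scores(judge_output: str) -> tuple[float, float] | None:
    """Assume the judge emits paired scores like '[[7, 4]]' for answers A and B."""
    match = re.search(r"\[\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]\]", judge_output)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def judge_reward(judge_output: str,
                 gold_scores: tuple[float, float],
                 max_score: float = 10.0) -> float:
    """Combine a structural check, preference agreement, and score closeness."""
    parsed = parse_scores(judge_output)
    if parsed is None:
        return -1.0  # structural penalty: verdict could not be parsed
    pred_a, pred_b = parsed
    gold_a, gold_b = gold_scores

    # Relational term: does the judge rank the two answers the same way
    # as the reference annotation?
    agree = 1.0 if (pred_a - pred_b) * (gold_a - gold_b) > 0 else 0.0

    # Absolute term: how close are the predicted scores to the references?
    closeness = 1.0 - (abs(pred_a - gold_a) + abs(pred_b - gold_b)) / (2 * max_score)

    return 0.5 * agree + 0.5 * closeness


# Example: the judge prefers A (8 vs 3) and the reference also prefers A (9 vs 4).
print(judge_reward("<think>...</think><answer>[[8, 3]]</answer>", (9.0, 4.0)))
```

A reward of this shape is outcome-driven: the policy is optimized on the quality of its final verdict rather than on imitating reference judgments token by token, which is the key contrast with an SFT-only judge.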