55 links tagged with reinforcement-learning
Links
FlowReasoner is a query-level meta-agent designed to automate the creation of multi-agent systems tailored to individual user queries by leveraging reinforcement learning with external execution feedback. It enhances basic reasoning capabilities through a multi-purpose reward system, demonstrating improved performance in experiments over existing models. The repository includes installation instructions and configuration details for various machine learning environments.
Murati's startup has raised $2 billion to focus on reinforcement learning (RL) for various business applications. The investment aims to leverage RL technology to enhance decision-making processes across industries, potentially transforming how businesses operate and optimize their strategies.
DeepCoder-14B-Preview is a new open-source code reasoning model developed by Agentica and Together AI, achieving a 60.6% Pass@1 accuracy on LiveCodeBench with 14B parameters. It utilizes a carefully curated dataset of 24K verified coding problems and advanced reinforcement learning techniques to enhance its performance and generalization capabilities, surpassing existing benchmarks. The project includes open-sourced training materials and optimizations for further development in the coding domain.
The article describes the implementation of the DeepSeek R1-zero style training for large language models (LLMs) using a single or multiple GPUs, with a focus on simplicity and efficiency. It highlights the capabilities of the nanoAhaMoment project, which includes full parameter tuning, multi-GPU support, and a full evaluation suite, while maintaining competitive performance with minimal complexity. The repository offers interactive Jupyter notebooks and scripts for training, complete with installation instructions and dependency management.
Large language models are built on decades of accumulated text, but their data consumption now outpaces human production, pointing to a need for self-generated experience in AI. The article discusses the importance of exploration in reinforcement learning and how better exploration can enhance generalization in models, highlighting the role of pretraining in solving exploration challenges. It emphasizes that future AI progress will depend more on collecting the right experiences than on merely increasing model capacity.
Reinforcement Learning (RL) has emerged as a new training paradigm for AI models, but it is significantly less information-efficient compared to traditional pre-training methods. This shift poses challenges, as RL requires much longer sequences of tokens to glean minimal information, potentially hindering progress in developing advanced AI capabilities. The article emphasizes the implications of this inefficiency for future AI scaling and performance.
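To make the efficiency gap concrete, here is a back-of-envelope comparison; all numbers below are illustrative assumptions, not figures from the linked article.

```python
# Illustrative comparison of information density per token (assumed numbers).
pretrain_bits_per_token = 1.0     # assume ~1 bit of learning signal per token from next-token prediction
episode_tokens = 10_000           # assume one RL episode generates ~10k tokens
reward_bits_per_episode = 1.0     # a single scalar/binary reward carries ~1 bit

rl_bits_per_token = reward_bits_per_episode / episode_tokens
print(f"pretraining: ~{pretrain_bits_per_token} bit/token")
print(f"RL:          ~{rl_bits_per_token:.4f} bit/token "
      f"(~{pretrain_bits_per_token / rl_bits_per_token:,.0f}x less information per token)")
```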
MiniMax-M1 is a groundbreaking open-weight hybrid-attention reasoning model featuring a Mixture-of-Experts architecture and lightning attention mechanism, optimized for handling complex tasks with long inputs. It excels in various benchmarks, particularly in mathematics, software engineering, and long-context understanding, outperforming existing models with efficient test-time compute scaling. The model is trained through large-scale reinforcement learning and offers function calling capabilities, positioning it as a robust tool for next-generation AI applications.
The neural motion simulator (MoSim) is introduced as a world model that enhances reinforcement learning by accurately predicting the future physical state of an embodied system based on current observations and actions. It enables efficient skill acquisition and facilitates zero-shot learning, allowing for a decoupling of physical environment modeling from the development of RL algorithms, thus improving sample efficiency and generalization.
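A minimal sketch of the interface such a world model exposes, predicting the next physical state from the current state and action; the plain MLP and residual prediction here are placeholder assumptions, not MoSim's actual architecture.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predict the next physical state from (state, action) pairs."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Predict a residual and add it to the current state (a common choice
        # for physical dynamics; an assumption, not necessarily MoSim's).
        return state + self.net(torch.cat([state, action], dim=-1))

model = DynamicsModel(state_dim=17, action_dim=6)
next_state = model(torch.randn(1, 17), torch.randn(1, 6))
```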
A novel actor-critic algorithm is introduced that achieves optimal sample efficiency in reinforcement learning, attaining a sample complexity of \(O(dH^5 \log|\mathcal{A}|/\epsilon^2 + d H^4 \log|\mathcal{F}|/\epsilon^2)\). This algorithm integrates optimism and off-policy critic estimation, and is extended to Hybrid RL, demonstrating efficiency gains when utilizing offline data. Numerical experiments support the theoretical findings of the study.
Reinforcement Learned Teachers (RLT) train teacher models to generate clear explanations from question-answer pairs, enhancing student models' understanding. This innovative approach allows compact teacher models to outperform larger ones in reasoning tasks, significantly reducing training costs and times while maintaining effectiveness. The framework shifts the focus from problem-solving to teaching, promising advancements in AI reasoning models.
INTELLECT-2 has been launched as the first decentralized Reinforcement Learning framework with 32 billion parameters, allowing anyone to contribute compute resources. It introduces a new asynchronous training paradigm that supports heterogeneous nodes and focuses on efficient validation and communication, while enabling the training of state-of-the-art reasoning models under controlled thinking budgets. The initiative aims to create a sovereign open-source AI ecosystem with mechanisms to ensure honest participation and verify contributions.
Vision-Zero is a novel framework that enhances vision-language models (VLMs) through competitive visual games without requiring human-labeled data. It achieves state-of-the-art performance in various reasoning tasks, demonstrating that self-play can effectively improve model capabilities while significantly reducing training costs. The framework supports diverse datasets, including synthetic, chart-based, and real-world images, showcasing its versatility and effectiveness in fine-grained visual reasoning tasks.
The repository serves as a comprehensive resource for the survey paper "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey," detailing various reinforcement learning methods and their applications to large language models (LLMs). It includes tables summarizing methodologies, objectives, and key mechanisms, alongside links to relevant papers and resources in the field of AI.
CrystalFormer is a transformer-based autoregressive model tailored for generating crystalline materials while adhering to space group symmetry, enhancing data and computational efficiency. It allows for conditional generation through a structured framework, which includes reinforcement learning and Markov chain Monte Carlo methods. The model supports various functionalities such as generating specific crystal structures and evaluating their validity and novelty.
The article focuses on strategies for scaling reinforcement learning (RL) to significantly higher levels of training compute, on the order of 10^26 floating-point operations (FLOPs). It discusses the challenges and methodologies involved in optimizing RL algorithms for computation at that scale, emphasizing efficient resource utilization and algorithmic improvements.
Asymmetry of verification highlights the disparity between the ease of verifying solutions and the complexity of solving problems, particularly in AI and reinforcement learning. The article discusses examples of tasks with varying degrees of verification difficulty and introduces the verifier's rule, which states that tasks that are easy to verify will be readily solved by AI. It also explores implications for future AI developments and connections to concepts like P = NP.
The article discusses the challenges and pitfalls of scaling up reinforcement learning (RL) systems, emphasizing the tendency to overestimate the effectiveness of incremental improvements. It critiques the "just one more scale-up" mentality and highlights historical examples where such optimism led to disappointing results in AI development.
The article discusses an experiment using reinforcement learning to generate humor, specifically aiming to create the funniest joke with the help of GPT-4. It explores the intricacies of humor generation and the effectiveness of AI in crafting jokes that resonate with human audiences.
Sutton critiques the prevalent approach in LLM development, arguing that they are heavily influenced by human biases and lack the "bitter lesson pilled" quality that would allow them to learn independently from experience. He contrasts LLMs with animal learning, emphasizing the importance of intrinsic motivation and continuous learning, while suggesting that current AI systems may be more akin to engineered "ghosts" rather than true intelligent entities. The discussion highlights the need for inspiration from animal intelligence to innovate beyond current methods.
The paper explores the enhancement of reward modeling in reinforcement learning for large language models, focusing on inference-time scalability. It introduces Self-Principled Critique Tuning (SPCT) to improve generative reward modeling and proposes a meta reward model to optimize performance during inference. Empirical results demonstrate that SPCT significantly enhances the quality and scalability of reward models compared to existing methods.
Building a reinforcement learning (RL) environment for startups can lead to unnecessary complexity and distractions. Instead, founders should focus on simplifying their approach and leveraging existing tools and frameworks to achieve their goals more efficiently. Prioritizing clarity and direct application over elaborate setups can enhance productivity and innovation.
The article explores the effectiveness and potential benefits of OpenAI's Reinforcement Fine-Tuning (RFT) for enhancing model performance. It discusses various applications, challenges, and considerations for implementing RFT in AI systems, helping readers assess its value for their projects.
The article discusses how behaviorist reinforcement learning (RL) reward functions can lead to unintended consequences, such as scheming behaviors in agents. It explores the implications of these behaviors on the design of AI systems and the importance of carefully crafting reward structures to avoid negative outcomes.
Large language models (LLMs) typically cannot adapt their weights dynamically to new tasks or knowledge. The Self-Adapting LLMs (SEAL) framework addresses this limitation by allowing models to generate their own finetuning data and directives for self-adaptation through a reinforcement learning approach, resulting in persistent weight updates and improved performance in knowledge incorporation and few-shot generalization tasks.
Kimi-Dev-72B is an advanced open-source coding language model designed for software engineering tasks, achieving a state-of-the-art performance of 60.4% on the SWE-bench Verified benchmark. It leverages large-scale reinforcement learning to autonomously patch real repositories and ensures high-quality solutions by only rewarding successful test suite completions. Developers and researchers are encouraged to explore and contribute to its capabilities, available for download on Hugging Face and GitHub.
Mini-o3 introduces an advanced system that enhances tool-based interactions for visual reasoning by supporting deep, multi-turn reasoning and achieving state-of-the-art performance on visual search tasks. The system utilizes a novel over-turn masking strategy to effectively manage response lengths during reinforcement learning, combined with a comprehensive dataset designed for exploratory reasoning. Open-source code and models are provided to facilitate reproducibility and further research.
Reinforcement Learning (RL) techniques, particularly the Group Relative Policy Optimization (GRPO) algorithm, have been utilized to significantly improve the mathematical reasoning capabilities of language models. The study highlights how proper infrastructure, data diversity, and effective training practices can enhance performance, while also addressing challenges like model collapse and advantage estimation bias.
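The core of GRPO is group-relative advantage estimation: several completions are sampled per prompt and each completion's reward is normalized against its group's mean and standard deviation. A minimal sketch follows; the epsilon term and array shapes are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (num_prompts, group_size) -- one row of rewards per prompt."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # Each completion is scored relative to the other completions for the same prompt.
    return (rewards - mean) / (std + eps)

rewards = np.array([[1.0, 0.0, 0.0, 1.0],   # group of 4 completions for prompt 1
                    [0.0, 0.0, 0.0, 1.0]])  # group of 4 completions for prompt 2
print(group_relative_advantages(rewards))
```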
This paper introduces a novel method for enhancing visual reasoning that relies on self-improvement and minimizes the number of training samples needed. By utilizing Monte Carlo Tree Search to quantify sample difficulty, the authors effectively filter a large dataset down to 11k challenging samples, leading to significant performance improvements of their model, ThinkLite-VL, over existing models. Evaluation results demonstrate a 7% increase in average performance, achieving state-of-the-art accuracy on several benchmarks.
The article discusses the potential upcoming advancements in reinforcement learning (RL) technology, drawing parallels to the transformative impact that GPT-3 had on natural language processing. It highlights the expectations and implications of these advancements on various industries and the future of AI development.
Qwen3-Coder has been launched as a powerful code model boasting 480 billion parameters and exceptional capabilities in coding and agentic tasks, including a context length of up to 1 million tokens. The release includes the Qwen Code CLI tool for enhanced coding tasks and emphasizes advancements in reinforcement learning for real-world coding applications. Ongoing developments aim to improve performance and explore self-improvement capabilities for coding agents.
The VideoChat-R1.5 model has been released on Hugging Face, showcasing improved capabilities in spatio-temporal perception and reasoning through multi-task joint reinforcement learning. It has been accepted at NeurIPS 2025 and builds on previous versions, enhancing video reasoning across various applications. The model utilizes hierarchical human attention during inference for better localization of regions of interest in videos.
Liger enhances TRL’s Group Relative Policy Optimization (GRPO) by reducing memory consumption by 40% during training without sacrificing model quality. The integration also introduces support for Fully Sharded Data Parallel (FSDP) and Parameter-Efficient Fine-Tuning (PEFT), facilitating scalable training across multiple GPUs. Additionally, Liger Loss can be paired with vLLM for accelerated text generation during training.
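A minimal sketch of what such a training run might look like with TRL's GRPOTrainer; the `use_liger_loss` flag name, the toy length-based reward, and the dataset choice are assumptions for illustration rather than the post's exact setup.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

args = GRPOConfig(
    output_dir="grpo-liger-demo",
    use_liger_loss=True,   # assumed flag name for enabling the Liger GRPO loss
    num_generations=8,     # completions sampled per prompt (the "group")
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```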
AI timelines are evolving as the focus shifts from large generalist models to smaller, specialized ones that prioritize accuracy and reasoning. The article outlines a fast-approaching future where generative AI achieves significant breakthroughs by 2026, leading to major market changes and the emergence of complex systems that integrate various functionalities. It emphasizes the need for advancements in model interpretability and the potential socio-economic impacts of these developments.
AI is entering a new phase where the focus shifts from developing methods to defining and evaluating problems, marking a transition to the "second half" of AI. This change is driven by the success of reinforcement learning (RL) that now generalizes across various complex tasks, requiring a reassessment of how we approach AI training and evaluation. The article emphasizes the importance of language pre-training and reasoning in enhancing AI capabilities beyond traditional benchmarks.
TreeRL is a novel reinforcement learning framework that integrates on-policy tree search to enhance the training of language models. By incorporating intermediate supervision and optimizing search efficiency, TreeRL addresses issues common in traditional reinforcement learning methods, such as distribution mismatch and reward hacking. Experimental results show that TreeRL outperforms existing methods in math and code reasoning tasks, showcasing the effectiveness of tree search in this domain.
Reinforcement learning (RL) is becoming essential in developing large language models (LLMs), particularly for aligning them with human preferences and enhancing their capabilities through multi-turn interactions. This article reviews various open-source RL libraries, analyzing their designs and trade-offs to assist researchers in selecting the appropriate tools for specific applications. Key libraries discussed include TRL, Verl, OpenRLHF, and several others, each catering to different RL needs and architectures.
The research introduces a paradigm called "early experience," where language agents learn from their own actions without relying on reward signals. By employing strategies such as implicit world modeling and self-reflection, the agents demonstrate improved performance and generalization across diverse environments, serving as a bridge between imitation learning and reinforcement learning. The findings highlight the effectiveness of early experience in agent training and its potential for enhancing learning in complex tasks.
Fulcrum Research is developing tools to enhance human oversight in a future where AI agents perform tasks such as software development and research. Their goal is to create infrastructure for safely deploying these agents, focusing on improving machine learning evaluations and environments. They invite collaboration from those working on reinforcement learning and agent deployment.
INTELLECT-2 is a groundbreaking 32 billion parameter model trained using a decentralized reinforcement learning framework called PRIME-RL, enabling fully asynchronous training across a global network of contributors. The model demonstrates significant improvements in reasoning tasks and is open-sourced to foster further research in decentralized AI training methodologies.
The article discusses the concept of spurious rewards in reinforcement learning systems, emphasizing the need to rethink training signals for more effective learning outcomes. It highlights the potential pitfalls of relying on misleading rewards that can skew the training process and suggests strategies for improving reward design.
The article discusses the process of reinforcement learning fine-tuning, detailing how to enhance model performance through specific training techniques. It emphasizes the importance of tailored approaches to improve the adaptability and efficiency of models in various applications. The information is aimed at practitioners looking to leverage reinforcement learning for real-world tasks.
Tunix is a new open-source, JAX-native library designed to simplify the post-training process for large language models (LLMs). It offers a comprehensive toolkit for model alignment, including various algorithms for supervised fine-tuning, preference tuning, reinforcement learning, and knowledge distillation, all optimized for performance on TPUs. The library enhances the developer experience with a white-box design and seamless integration into the JAX ecosystem.
WavReward is a novel reward feedback model designed to evaluate spoken dialogue systems by assessing both their intelligence quotient (IQ) and emotional quotient (EQ) through audio language models. It introduces a specialized evaluator using multi-sample feedback and reinforcement learning, along with the ChatReward-30K dataset, significantly outperforming existing evaluation models in accuracy and subjective testing across various spoken dialogue scenarios.
JudgeLRM introduces a novel approach to using Large Language Models (LLMs) as evaluators, particularly in complex reasoning tasks. By employing reinforcement learning with judge-wise rewards, JudgeLRM models significantly outperform traditional Supervised Fine-Tuning methods and current leading models, demonstrating superior performance in tasks that require deep reasoning.
InternVL3.5 introduces a new family of open-source multimodal models that enhance versatility, reasoning capabilities, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which improves reasoning tasks significantly while a Visual Resolution Router optimizes visual token resolution. The model achieves notable performance gains and supports advanced capabilities like GUI interaction and embodied agency, positioning it competitively against leading commercial models.
Reinforcement Pre-Training (RPT) is introduced as a novel approach for enhancing large language models through reinforcement learning by treating next-token prediction as a reasoning task. RPT utilizes vast text data to improve language modeling accuracy and provides a strong foundation for subsequent reinforcement fine-tuning, demonstrating consistent improvements in prediction accuracy with increased training compute.
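A sketch of the kind of verifiable reward this implies: the model reasons, emits a prediction for the upcoming text, and is rewarded only if that prediction matches the ground-truth continuation. The exact matching rule here is an assumption, not necessarily RPT's.

```python
def rpt_reward(predicted_continuation: str, ground_truth: str) -> float:
    """Reward 1.0 only if the prediction is a correct (non-empty) prefix of the true continuation."""
    pred = predicted_continuation.strip()
    return 1.0 if pred and ground_truth.strip().startswith(pred) else 0.0

print(rpt_reward(" the", " the quick brown fox"))  # 1.0
print(rpt_reward(" a",   " the quick brown fox"))  # 0.0
```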
Reinforcement learning (RL) is essential for training large language models (LLMs), but there is a lack of effective scaling methodologies in this area. This study presents a framework for analyzing RL scaling, demonstrating through extensive experimentation that certain design choices can optimize compute efficiency while maintaining performance. The authors propose a best-practice recipe, ScaleRL, which successfully predicts validation performance using a significant compute budget.
VARGPT-v1.1 is a powerful multimodal model that enhances visual understanding and generation capabilities through iterative instruction tuning and reinforcement learning. It includes extensive code releases for training, inference, and evaluation, as well as a comprehensive structure for multimodal tasks such as image captioning and visual question answering. The model's checkpoints and datasets are available on Hugging Face, facilitating further research and application development.
The article provides a comprehensive overview of reinforcement learning, detailing its principles, algorithms, and applications in artificial intelligence. It emphasizes the importance of reward systems and explores the balance between exploration and exploitation in learning processes. Additionally, the piece discusses real-world examples that illustrate how reinforcement learning is utilized in various domains.
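As a concrete illustration of the exploration/exploitation balance the overview covers, here is a toy epsilon-greedy bandit; the setup is illustrative only and not taken from the article.

```python
import random

true_means = [0.2, 0.5, 0.8]          # unknown to the agent
estimates = [0.0] * len(true_means)   # agent's running reward estimates
counts = [0] * len(true_means)
epsilon = 0.1                         # fraction of steps spent exploring

for step in range(10_000):
    if random.random() < epsilon:               # explore: pick a random arm
        arm = random.randrange(len(true_means))
    else:                                       # exploit: pick the best-looking arm
        arm = max(range(len(true_means)), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean update

print([round(e, 2) for e in estimates])  # should approach [0.2, 0.5, 0.8]
```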
Thyme introduces a groundbreaking approach to image processing by autonomously generating and executing code for complex visual reasoning tasks. Utilizing a two-stage training strategy that combines supervised fine-tuning and reinforcement learning, along with the innovative GRPO-ATS algorithm, it effectively enhances performance in high-resolution perception.
OpenThinkIMG is an open-source framework that enables Large Vision-Language Models (LVLMs) to engage in interactive visual cognition, allowing AI agents to effectively think with images. It features a flexible tool management system, a dynamic inference pipeline, and a novel reinforcement learning approach called V-ToolRL, which enhances the adaptability and performance of visual reasoning tasks. The project aims to bridge the gap between human-like visual cognition and AI capabilities by providing a standardized platform for tool-augmented reasoning.
Designing effective reward functions for chemical reasoning models like ether0 is complex and iterative, involving the creation of systems that can propose valid chemical reactions and generate specific molecules. The process reveals challenges such as reward hacking, where models exploit loopholes in the reward structure, necessitating the development of robust verification methods and data structures to ensure the proposed solutions are scientifically valid and practical.
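One way such verification can work in practice, sketched here with RDKit: refuse any reward for answers that do not even parse as molecules, so the model cannot collect reward for syntactically invalid SMILES. The function name and the specific checks are assumptions, not ether0's actual reward code.

```python
from rdkit import Chem

def molecule_reward(smiles: str) -> float:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # invalid SMILES -> no reward at all
        return 0.0
    # Further task-specific checks (formula match, synthesizability, ...) would go here.
    return 1.0

print(molecule_reward("CCO"))   # ethanol -> 1.0
print(molecule_reward("C1CC"))  # unclosed ring, fails to parse -> 0.0
```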
Reinforcement Learning on Pre-Training Data (RLPT) introduces a new paradigm for scaling large language models (LLMs) by allowing the policy to autonomously explore meaningful trajectories from pre-training data without relying on human annotations for rewards. By adopting a next-segment reasoning objective, RLPT improves LLM capabilities, as demonstrated by significant performance gains on various reasoning benchmarks and encouraging broader context exploration for enhanced generalization.
The Environments Hub is being launched as an open, community-driven platform for reinforcement learning (RL) environments, aiming to provide a shared space for researchers and developers to build, share, and utilize these environments effectively. This initiative seeks to democratize access to high-quality RL tools, fostering innovation in AI by lowering barriers to creating and training models, while also promoting open-source development in contrast to proprietary systems used by large labs.
The study presents Intuitor, a method utilizing Reinforcement Learning from Internal Feedback (RLIF) that allows large language models (LLMs) to learn using self-certainty as the sole reward signal, eliminating the need for external rewards or labeled data. Experiments show that Intuitor matches the performance of existing methods while achieving better generalization in tasks like code generation, indicating that intrinsic signals can effectively facilitate learning in autonomous AI systems.
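A sketch of a self-certainty style signal in the spirit described above: measure how far the model's next-token distributions sit from uniform, averaged over the generated tokens. The paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) over the generated tokens; higher output = more confident."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    vocab = logits.size(-1)
    # KL(p || uniform) = log(V) - H(p), computed per position and then averaged.
    entropy = -(probs * log_probs).sum(dim=-1)
    kl_to_uniform = torch.log(torch.tensor(float(vocab))) - entropy
    return kl_to_uniform.mean()

reward = self_certainty(torch.randn(16, 32_000))  # intrinsic reward for one sampled response
```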