ShapeR offers a method for generating 3D shapes from image sequences. It processes input images to extract relevant data, then uses a transformer model to create a mesh representation of each object in the scene. The project includes tools for setup, data exploration, and evaluation.
This GitHub repository provides RBench, a benchmark for evaluating robotics video generation, and RoVid-X, a dataset for training models with RGB, depth, and optical flow videos. The authors highlight limitations in existing video models and aim to enhance embodied AI research.
Egocentric-10K is the largest dataset focused on egocentric video collected in real factory settings, featuring over 1 billion frames across nearly 193,000 clips. It includes detailed camera intrinsics and metadata for each video, making it valuable for research in human-robot interaction and computer vision.
This article introduces FinCDM, a framework for assessing financial large language models (LLMs) by evaluating their knowledge and skills rather than relying on a single score. It highlights the creation of a new dataset, CPA-KQA, based on CPA exam questions, which allows for a more nuanced analysis of LLM capabilities in financial contexts. The framework aims to uncover knowledge gaps and enhance model development for real-world applications.
The article presents Golden Goose, a method to create unlimited Reinforcement Learning with Verifiable Rewards (RLVR) tasks by using unverifiable internet text. It describes how the authors developed a large-scale dataset, GooseReason-0.7M, which includes over 700,000 tasks across various domains. The approach successfully enhances model performance, even in areas like cybersecurity where prior data was unavailable.
This article details how Datadog's teams used LLM Observability to enhance their natural language query (NLQ) agent for analyzing cloud costs. It covers the creation of a ground truth dataset, the challenges of evaluating AI-generated queries, and the implementation of a structured debugging process to identify and address errors.
This GitHub repository provides an open-source dataset of over 20,000 identified malicious software packages. It includes samples from npm, PyPI, and IDE extensions, along with tools for analysis. Users can look up whether specific package versions are known to be malicious, and should handle the samples with caution.
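A lookup against such a dataset can be as simple as an exact match on (ecosystem, name, version). The sketch below is illustrative only; the package names and index structure are hypothetical assumptions, not the repository's actual tooling.

```python
# Hypothetical sketch: checking a package version against a local index
# of known-malicious entries. Entries here are made up for illustration.
KNOWN_MALICIOUS = {
    ("npm", "evil-pkg", "1.0.0"),
    ("pypi", "totally-safe-lib", "0.2.1"),
}

def is_flagged(ecosystem: str, name: str, version: str) -> bool:
    """Return True if this exact package version appears in the index."""
    return (ecosystem.lower(), name.lower(), version) in KNOWN_MALICIOUS

print(is_flagged("NPM", "evil-pkg", "1.0.0"))      # True: flagged version
print(is_flagged("pypi", "requests", "2.31.0"))    # False: not in the index
```

Note that exact-version matching is deliberate: a package is often benign in most releases and malicious in only one compromised version.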
DeepMath-103K is a newly released dataset designed to enhance mathematical reasoning in language models, featuring a broad range of challenging and diverse math problems. It includes rigorous decontamination processes to ensure fair evaluation, with detailed problem structures that support various research applications. The accompanying models and code are open-sourced to facilitate further exploration and development in the field.
This article introduces IMAGGarment-1, a framework for generating garments with detailed control over silhouette, color, and logo placement. It features a two-stage training process that separates global appearance from local details, allowing for precise customization in fashion design. The authors also present GarmentBench, a large dataset of nearly 190,000 garment samples with various design conditions, to support their model.
The article discusses a webinar focused on the world's largest multimodal dataset for artificial intelligence, highlighting its significance in advancing AI research and applications. It features insights from experts on the dataset's capabilities and potential impact across various industries.
Web Bench introduces a new dataset for evaluating AI browser agents, consisting of 5,750 tasks across 452 websites. The dataset aims to address limitations in existing benchmarks by focusing on both read and write tasks, revealing that agents struggle significantly with write-heavy tasks like form filling and authentication, while performing better on read tasks. Skyvern 2.0 currently leads in performance for write tasks, highlighting opportunities for improvement in AI browser capabilities.
MedReason is a comprehensive medical reasoning dataset that enhances large language models (LLMs) by utilizing a structured medical knowledge graph to create detailed reasoning paths from clinical question-answer pairs. The dataset includes 32,682 QA pairs with step-by-step explanations, and the MedReason-8B model, fine-tuned on this data, achieves state-of-the-art performance in medical reasoning tasks. The project is open-sourced, providing access to models, data, and deployment codes for further research and applications.
OmniSVG is a unified framework for generating high-quality scalable vector graphics (SVG) using pre-trained Vision-Language Models (VLMs), which decouples structural logic from low-level geometry. It introduces the MMSVG-2M dataset with two million annotated SVG assets and supports multiple generation modalities, demonstrating superior performance over existing methods for diverse creative tasks. The model is designed to handle complexity ranging from simple icons to intricate illustrations, offering flexibility for professional design workflows.
StreamBridge is a framework designed to convert offline Video Large Language Models (Video-LLMs) into proactive streaming assistants, addressing issues of multi-turn understanding and proactive response mechanisms. It utilizes a memory buffer and a lightweight activation model for continuous engagement, alongside the creation of the Stream-IT dataset for enhanced streaming video comprehension. Experiments demonstrate that StreamBridge outperforms existing models, showcasing significant improvements in video understanding tasks.
VistaDPO is a new framework for optimizing video understanding in Large Video Models (LVMs) by aligning text-video preferences at three hierarchical levels: instance, temporal, and perceptive. The authors introduce a dataset, VistaDPO-7k, consisting of 7.2K annotated QA pairs to address the challenges of video-language misalignment and hallucinations, showing significant performance improvements in various benchmarks.
MS MARCO Web Search is a comprehensive dataset designed for information retrieval research, featuring millions of real clicked query-document labels and a vast corpus from ClueWeb22. It supports various tasks in machine learning and retrieval systems, offering a benchmark for evaluating retrieval methods and performance across large datasets. Researchers can utilize this dataset to investigate the effectiveness of their techniques on both small and large data scales.
Access to the Institutional Books dataset requires users to agree to specific terms of use, emphasizing noncommercial use, no redistribution, and proper attribution. The dataset consists of 983,004 public domain books digitized by Harvard Library, aimed at supporting research and public-interest purposes while encouraging feedback for ongoing refinements. Users can create derivative works for noncommercial purposes but must adhere to the outlined guidelines and limitations of liability.
OpenAI MRCR (Multi-round co-reference resolution) is a long context dataset designed to evaluate a language model's ability to identify multiple instances of similar requests embedded in a conversation. This dataset incorporates varying levels of complexity by including multiple identical asks within long, multi-turn dialogues, challenging the model to accurately differentiate and respond to specific instances. Implementation details and grading methods for assessing model performance are also provided.
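The core difficulty is retrieval of a specific instance among near-duplicates: when the same ask appears several times in a long dialogue, the model must reproduce the answer tied to, say, the second occurrence. The toy sketch below illustrates that setup with a similarity-based grade; the conversation layout and reference answers are illustrative assumptions, not the dataset's exact harness.

```python
from difflib import SequenceMatcher

def grade(response: str, answer: str) -> float:
    """Continuous grade in [0, 1]: string similarity between the model's
    response and the reference answer for the targeted instance."""
    return SequenceMatcher(None, response, answer).ratio()

# The dialogue interleaves distractor turns with n identical asks; the
# model is then told to reproduce e.g. the 2nd instance, not the 3rd.
asks = ["write a poem about penguins"] * 3
answers = ["poem A", "poem B", "poem C"]   # hypothetical references
target = 1                                  # ask for the 2nd instance

print(round(grade("poem B", answers[target]), 2))  # exact match -> 1.0
print(round(grade("poem C", answers[target]), 2))  # wrong instance scores lower
```

Grading against the instance-specific reference is what makes the task hard: a model that merely recognizes the topic, but not which occurrence was requested, still scores poorly.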
REverse-Engineered Reasoning (REER) introduces a novel approach to instilling deep reasoning in language models by working backwards from known solutions to discover the underlying reasoning process. This method addresses the limitations of traditional reinforcement learning and instruction distillation, resulting in the creation of a large dataset, DeepWriting-20K, and a model, DeepWriter-8B, that outperforms existing models in open-ended tasks. The research emphasizes the importance of structured reasoning and iterative refinement in generating high-quality outputs.
Mini-o3 introduces an advanced system that enhances tool-based interactions for visual reasoning by supporting deep, multi-turn reasoning and achieving state-of-the-art performance on visual search tasks. The system utilizes a novel over-turn masking strategy to effectively manage response lengths during reinforcement learning, combined with a comprehensive dataset designed for exploratory reasoning. Open-source code and models are provided to facilitate reproducibility and further research.
REGEN introduces a new benchmark dataset aimed at enhancing the capabilities of large language models (LLMs) in generating personalized recommendations through natural language interactions. By augmenting the Amazon Product Reviews dataset with user critiques and contextual narratives, REGEN allows for more nuanced conversational recommendations that adapt to user feedback. The study demonstrates how models like LUMEN can effectively integrate recommendation and narrative generation, paving the way for more intuitive user experiences.
Weak-to-Strong Decoding (WSD) is a novel framework designed to enhance the alignment capabilities of large language models (LLMs) by utilizing a smaller aligned model to guide the initial drafting of responses. By integrating a well-aligned draft model, WSD significantly improves the quality of generated content while minimizing the alignment tax, as demonstrated through extensive experiments and the introduction of the GenerAlign dataset. The framework provides a structured approach for researchers to develop safe AI systems while navigating the complexities of preference alignment.
Migrating from DataFrame to Dataset in Apache Spark can significantly reduce runtime errors thanks to type safety, compile-time checks, and clearer schema awareness. This transition addresses common issues such as human errors and schema mismatches, ultimately leading to more robust and maintainable data processing systems. The article provides insights into the advantages of using Dataset over DataFrame for large-scale data processing, emphasizing correctness and maintainability.
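Spark's typed Dataset API is specific to Scala/Java, but the failure mode the article describes is language-agnostic: untyped access blows up mid-job when a bad row is touched, while a declared schema rejects malformed rows at the boundary. The Python sketch below is an analogy using dataclasses, not Spark code, and the record shape is hypothetical.

```python
from dataclasses import dataclass

# One row has a typo'd field name ("amnt"), a classic schema mismatch.
raw = [{"user_id": 1, "amount": 9.99}, {"user_id": 2, "amnt": 5.00}]

# Untyped, DataFrame-style access: the mismatch surfaces only when the
# bad row is reached, deep inside the computation.
def total_untyped(rows):
    return sum(r["amount"] for r in rows)   # KeyError at runtime, mid-job

# Typed, Dataset-style access: converting to a declared schema up front
# rejects malformed rows at ingest instead.
@dataclass
class Purchase:
    user_id: int
    amount: float

def parse(rows):
    return [Purchase(**r) for r in rows]    # TypeError raised here, at the boundary

try:
    total_untyped(raw)
except KeyError as e:
    print("failed mid-computation:", e)

try:
    parse(raw)
except TypeError:
    print("rejected at the schema boundary")
```

In Scala the second failure would be a compile-time error rather than an ingest-time exception, which is precisely the correctness gain the migration targets.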
EleutherAI has released the Common Pile v0.1, an 8 TB dataset of openly licensed and public domain text for training large language models, marking a significant advancement from its predecessor, the Pile. The initiative emphasizes the importance of transparency and openness in AI research, aiming to provide researchers with essential tools and a shared corpus for better collaboration and accountability in the field. Future collaborations with cultural heritage institutions are planned to enhance the quality and accessibility of public domain works.
The article introduces the Pico-Banana-400K dataset, a large-scale collection of approximately 400,000 text-image-edit triplets for text-guided image editing. It addresses limitations in existing datasets by providing high-quality, diverse edit pairs generated from real photographs, spanning a variety of edit operations across multiple semantic categories, with edit quality evaluated using advanced AI models. Specialized subsets support multi-turn editing, preference research, and instruction summarization.