Links
This GitHub repository provides RBench, a benchmark for evaluating robotics video generation, and RoVid-X, a dataset for training models with RGB, depth, and optical flow videos. The authors highlight limitations in existing video models and aim to enhance embodied AI research.
This article introduces FinCDM, a framework for assessing financial large language models (LLMs) at the level of individual knowledge points and skills rather than a single aggregate score. It highlights the creation of a new dataset, CPA-KQA, built from CPA exam questions, which enables a more fine-grained analysis of LLM capabilities in financial contexts. The framework aims to uncover knowledge gaps and guide model development for real-world applications.
This article details how Datadog's teams used LLM Observability to enhance their natural language query (NLQ) agent for analyzing cloud costs. It covers the creation of a ground truth dataset, the challenges of evaluating AI-generated queries, and the implementation of a structured debugging process to identify and address errors.
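The ground-truth evaluation described above can be sketched in a few lines: hand-labeled question/query pairs, a normalization step so cosmetic differences are not counted as errors, and an accuracy metric over the set. All names here are illustrative assumptions, not Datadog's actual API or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str        # natural-language input to the NLQ agent
    expected_query: str  # hand-written reference query (ground truth)

def normalize(q: str) -> str:
    # Collapse case and whitespace so purely cosmetic differences
    # between generated and reference queries are not scored as errors.
    return " ".join(q.lower().split())

def accuracy(examples: list[Example], generate_query) -> float:
    # generate_query stands in for the agent under test (a hypothetical callable).
    hits = sum(
        normalize(generate_query(ex.question)) == normalize(ex.expected_query)
        for ex in examples
    )
    return hits / len(examples)
```

Exact-match after normalization is the simplest scoring rule; the article's structured debugging process implies richer error categories, which would replace the boolean comparison here.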
OpenAI MRCR (Multi-round co-reference resolution) is a long-context dataset designed to evaluate a language model's ability to identify multiple instances of similar requests embedded in a conversation. The dataset varies difficulty by including multiple near-identical requests within long, multi-turn dialogues, challenging the model to accurately distinguish and respond to a specific instance. Implementation details and grading methods for assessing model performance are also provided.
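Since the task asks the model to reproduce one specific instance among near-duplicates, grading can be based on string similarity between the response and the reference answer. Below is a minimal sketch of such a similarity grader using Python's standard library; the actual grading method shipped with the dataset may differ.

```python
from difflib import SequenceMatcher

def grade_response(response: str, reference: str) -> float:
    """Score a model response against the reference answer by
    character-level sequence similarity in [0, 1].

    This is one plausible grading rule for an MRCR-style task:
    a response that reproduces the wrong instance of a repeated
    request will diverge from the reference and score lower.
    """
    return SequenceMatcher(None, response, reference).ratio()
```

A continuous score (rather than exact match) gives partial credit when the model retrieves the right instance but paraphrases it slightly.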