Links
This GitHub repository provides RBench, a benchmark for evaluating robotics video generation, and RoVid-X, a dataset for training models with RGB, depth, and optical flow videos. The authors highlight limitations in existing video models and aim to enhance embodied AI research.
This article introduces FinCDM, a framework for assessing financial large language models (LLMs) at the level of individual knowledge points and skills rather than a single aggregate score. It highlights the creation of a new dataset, CPA-KQA, built from CPA exam questions, which enables a more fine-grained analysis of LLM capabilities in financial contexts. The framework aims to uncover knowledge gaps and guide model development for real-world applications.
This article details how Datadog's teams used LLM Observability to enhance their natural language query (NLQ) agent for analyzing cloud costs. It covers the creation of a ground truth dataset, the challenges of evaluating AI-generated queries, and the implementation of a structured debugging process to identify and address errors.
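The ground-truth evaluation described above can be sketched in a few lines: hand-labeled question/query pairs, a normalization step so cosmetic differences are not counted as errors, and an accuracy metric over the set. All names here are illustrative assumptions, not Datadog's actual API or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str        # natural-language input to the NLQ agent
    expected_query: str  # hand-written reference query (ground truth)

def normalize(q: str) -> str:
    # Collapse case and whitespace so purely cosmetic differences
    # between generated and reference queries are not scored as errors.
    return " ".join(q.lower().split())

def accuracy(examples: list[Example], generate_query) -> float:
    # generate_query stands in for the agent under test (a hypothetical callable).
    hits = sum(
        normalize(generate_query(ex.question)) == normalize(ex.expected_query)
        for ex in examples
    )
    return hits / len(examples)
```

Exact-match after normalization is the simplest scoring rule; the article's structured debugging process implies richer error categories, which would replace the boolean comparison here.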
OpenAI MRCR (Multi-round co-reference resolution) is a long-context dataset designed to evaluate a language model's ability to identify multiple instances of similar requests embedded in a conversation. The dataset varies difficulty by including multiple near-identical requests within long, multi-turn dialogues, challenging the model to accurately distinguish and respond to a specific instance. Implementation details and grading methods for assessing model performance are also provided.
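Since the task asks the model to reproduce one specific instance among near-duplicates, grading can be based on string similarity between the response and the reference answer. Below is a minimal sketch of such a similarity grader using Python's standard library; the actual grading method shipped with the dataset may differ.

```python
from difflib import SequenceMatcher

def grade_response(response: str, reference: str) -> float:
    """Score a model response against the reference answer by
    character-level sequence similarity in [0, 1].

    This is one plausible grading rule for an MRCR-style task:
    a response that reproduces the wrong instance of a repeated
    request will diverge from the reference and score lower.
    """
    return SequenceMatcher(None, response, reference).ratio()
```

A continuous score (rather than exact match) gives partial credit when the model retrieves the right instance but paraphrases it slightly.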