Links
The article discusses various open problems in machine learning inspired by a graduate class. It critiques current methodologies, emphasizing the need for a design-based perspective, better evaluation methods, and innovations in large language models. The author encourages researchers to explore these under-addressed areas.
This article discusses how fine-tuning open-source LLM judges using Direct Preference Optimization (DPO) can lead to performance that matches or exceeds GPT-5.2 in evaluating model outputs. The authors trained models like GPT-OSS 120B and Qwen 3 235B on human preference data, achieving better accuracy and efficiency at a lower cost.
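The DPO objective mentioned above trains the policy directly on preference pairs, without a separate reward model. As a rough sketch (not the authors' code), the standard per-pair DPO loss can be computed from the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    pi_* are summed log-probs under the policy being trained;
    ref_* are summed log-probs under the frozen reference model.
    beta scales how strongly the policy may deviate from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response than the rejected one, relative to the reference.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(logits): minimized by widening the margin.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy and reference agree, the margin is zero and the loss is log 2; as the policy learns to assign higher likelihood to the human-preferred response, the loss falls toward zero. The function names and the choice of beta are illustrative, not taken from the article.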
Open Deep Research is an open-source agent designed for deep research tasks, compatible with various model providers and search tools. It ranks high on the Deep Research Bench leaderboard and offers flexibility for customization through its API. The platform supports multiple LLMs and search APIs, making it versatile for different research needs.
Bloom is an open-source framework that automates the evaluation of AI model behaviors, allowing researchers to specify a desired behavior and generate relevant scenarios for assessment. The tool produces evaluations quickly and offers flexibility in measuring different behavioral traits, complementing existing tools like Petri.
This article details the implementation of Google's Nested Learning (HOPE) architecture, focusing on its mechanism-level components and testing procedures. It provides guidance on installation, usage, and evaluation, including various training configurations and memory management strategies for machine learning models.
LLM-SRBench is a new benchmark aimed at enhancing scientific equation discovery using large language models, featuring comprehensive evaluation methods and open-source implementation. It includes a structured setup guide for running and contributing new search methods, as well as the necessary configurations for various datasets. The benchmark has been recognized for its significance, being selected for oral presentation at ICML 2025.
LRAGE is an open-source toolkit designed for evaluating Large Language Models in a Retrieval-Augmented Generation context, specifically for legal applications. It integrates various tools and datasets to streamline the evaluation process, allowing researchers to effectively assess model performance with minimal engineering effort. Key features include a modular architecture for retrievers and rerankers, a user-friendly GUI, and support for LLM-as-a-Judge evaluations.