6 min read | Saved February 14, 2026
Do you care about this?
This article discusses how fine-tuning open-source LLM judges with Direct Preference Optimization (DPO) can yield evaluation performance that matches or exceeds that of GPT-5.2. The authors trained models such as GPT-OSS 120B and Qwen 3 235B on human preference data, achieving better accuracy and efficiency at lower cost.
If you do, here's more
Open-source LLM judges, particularly GPT-OSS 120B, have been fine-tuned to outperform GPT-5.2 in evaluating model outputs. Using Direct Preference Optimization (DPO), researchers trained GPT-OSS on 5,400 preference pairs, achieving superior accuracy at a significantly lower cost and faster processing speed: 15 times cheaper and 14 times quicker than GPT-5.2. The evaluation was based on RewardBench 2, a benchmark specifically designed to measure alignment with human judgment rather than just correctness.
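The DPO objective behind this fine-tuning can be sketched in a few lines. The `dpo_loss` helper and the numeric values below are illustrative, not the researchers' code: DPO scores each preference pair by how much more the trained policy favors the human-chosen response than a frozen reference model does.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained (logp_*) or under the frozen reference
    model (ref_logp_*). All values here are illustrative.
    """
    # Implicit reward margins: how much more the policy prefers each
    # response than the reference model does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Logistic loss on the scaled margin difference; minimizing it
    # pushes the policy toward the human-preferred (chosen) response.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid)

# When the policy already favors the chosen response, the loss is small;
# when it favors the rejected one, the loss grows.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
print(dpo_loss(-14.0, -10.0, -12.0, -12.0))
```

Averaged over the 5,400 pairs and minimized by gradient descent, this loss is what nudges the judge toward human preferences without a separate reward model.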
The article highlights a key paradox: using LLMs to assess other LLMs, despite their own tendency to make errors. Judging, however, is a more constrained task than generating text, which lets smaller, open-source models excel at evaluation. The experiment confirmed that these models can indeed surpass larger, closed-source alternatives. Researchers evaluated four judge models, including GPT-5.2 as the baseline to beat, using a structured approach focused on categories such as safety and instruction adherence.
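The per-category scoring described above follows the usual pattern for preference benchmarks: count how often the judge picks the human-preferred response in each category. A minimal sketch, assuming a hypothetical record format (the field names are illustrative, not RewardBench 2's schema):

```python
from collections import defaultdict

def judge_accuracy_by_category(records):
    """Fraction of pairs where the judge agreed with the human label,
    broken out per category. `records` is a hypothetical list of dicts.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        if r["judge_pick"] == r["human_pick"]:
            hits[r["category"]] += 1
    return {c: hits[c] / totals[c] for c in totals}

# Toy data: the judge matches humans on both safety pairs but only
# half the quality pairs.
records = [
    {"category": "safety", "judge_pick": "A", "human_pick": "A"},
    {"category": "safety", "judge_pick": "B", "human_pick": "B"},
    {"category": "quality", "judge_pick": "A", "human_pick": "B"},
    {"category": "quality", "judge_pick": "A", "human_pick": "A"},
]
print(judge_accuracy_by_category(records))  # {'safety': 1.0, 'quality': 0.5}
```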
Baseline testing revealed that Qwen 3 235B outperformed GPT-5.2 out of the box, with GPT-OSS 120B close behind. The study also noted positional bias in model evaluations, and found that safety judgments were easier across the board, since all models are trained to avoid harmful content. Assessing response quality and relevance, by contrast, proved more subjective and challenging. Fine-tuning with DPO aimed to sharpen these judges' capabilities, ultimately producing better alignment with human preferences.
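The positional bias noted above is commonly detected by querying the judge twice with the candidate responses swapped; a verdict that flips with position reveals the bias. A minimal sketch, where `judge` is a hypothetical callable returning "first" or "second" (not an API from the article):

```python
def debiased_verdict(judge, prompt, resp_a, resp_b):
    """Query a pairwise judge twice, swapping response order.
    Consistent verdicts are kept; position-dependent ones become ties.
    """
    forward = judge(prompt, resp_a, resp_b)   # A shown first
    backward = judge(prompt, resp_b, resp_a)  # B shown first
    picked_a_forward = (forward == "first")
    picked_a_backward = (backward == "second")
    if picked_a_forward == picked_a_backward:
        return "A" if picked_a_forward else "B"  # position-invariant
    return "tie"  # verdict flipped with position: bias detected

# A toy judge that always prefers whichever response appears first is
# maximally position-biased, so every comparison degrades to a tie:
always_first = lambda prompt, r1, r2: "first"
print(debiased_verdict(always_first, "q", "resp A", "resp B"))  # prints "tie"
```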