2 min read | Saved February 14, 2026
Do you care about this?
This article introduces Generative Adversarial Distillation (GAD), a method for training student models using only teacher-generated texts. Unlike traditional knowledge distillation, GAD employs a two-player game between a generator and a discriminator, enabling effective learning without probability supervision. The results demonstrate that models trained with GAD achieve performance comparable to their larger teacher models.
If you do, here's more
Black-box knowledge distillation (KD) of large language models (LLMs) is challenging when only teacher-generated texts are available, as with proprietary APIs. Traditional likelihood-based KD methods fall short here because they rely on access to the teacher's token probabilities. The authors propose Generative Adversarial Distillation (GAD), which lets the student model learn on-policy without direct probability supervision. GAD is framed as a two-player minimax game in which a discriminator evaluates the student's outputs against the teacher's, effectively acting as an on-policy reward model that guides the student.
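The adversarial dynamic can be illustrated with a minimal one-dimensional toy, not the paper's implementation: all names, the linear discriminator, and every hyperparameter below are illustrative assumptions. A "teacher" emits samples near a fixed mean, the "student" emits samples near a learned mean, and a discriminator is trained to rank teacher samples above student samples; the student then ascends the discriminator's score as a reward.

```python
import math
import random

# Toy sketch of GAD's two-player minimax game in one dimension.
# All names and hyperparameters are illustrative, not from the paper.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
mu_T = 2.0            # teacher's mean (fixed, black-box: we only see samples)
mu_S = -2.0           # student's mean (learned)
w = 0.0               # discriminator parameter: score(x) = w * x
lr_d, lr_g = 0.1, 0.05

for _ in range(3000):
    x_t = mu_T + random.gauss(0.0, 0.1)   # teacher-generated sample
    x_s = mu_S + random.gauss(0.0, 0.1)   # on-policy student sample

    # Discriminator step: ascend log sigmoid(score(x_t) - score(x_s)),
    # i.e. learn to rank the teacher's output above the student's.
    margin = w * (x_t - x_s)
    w += lr_d * (1.0 - sigmoid(margin)) * (x_t - x_s)
    w *= 0.995  # small decay keeps the toy game from oscillating forever

    # Generator step: ascend the discriminator's score (the learned reward);
    # d/d(mu_S) of w * x_s is just w.
    mu_S += lr_g * w
```

After training, the student's mean drifts from -2 toward the teacher's mean of 2, even though the student never saw the teacher's distribution directly, only the discriminator's score of its own on-policy samples.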
In their experiments, the authors compare GAD against standard sequence-level KD (SeqKD) on the LMSYS-Chat dataset. Students trained with GAD, such as Qwen2.5-14B-Instruct, perform comparably to their teacher, GPT-5 Chat. GAD also generalizes markedly better out of distribution: where SeqKD yields minimal or even negative gains on datasets like Dolly, SelfInst, and Vicuna, GAD maintains strong performance. The authors attribute this to the advantages of on-policy reinforcement learning over traditional supervised fine-tuning.
The training process begins with a warm-up phase in which the generator is trained with cross-entropy loss and the discriminator with a Bradley-Terry loss; this stabilizes training before the full adversarial setup takes over. The findings also highlight the stability of the on-policy discriminator, which avoids issues like reward hacking that can affect off-policy reward models. Overall, GAD offers a robust alternative for distilling LLMs when only teacher-generated text is available, advancing the field of knowledge distillation for AI models.
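The two warm-up objectives can be sketched as follows; the helper names and inputs are hypothetical, but the loss forms are standard: cross-entropy is ordinary supervised fine-tuning on teacher texts, and the Bradley-Terry loss trains the discriminator to prefer teacher responses over student responses.

```python
import math

# Sketch of the two warm-up objectives (names are illustrative,
# not from the paper).

def bradley_terry_loss(score_teacher, score_student):
    """-log P(teacher preferred), where under the Bradley-Terry model
    P(teacher > student) = sigmoid(score_teacher - score_student)."""
    margin = score_teacher - score_student
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def cross_entropy_loss(student_probs_of_teacher_tokens):
    """Mean negative log-likelihood of the teacher's tokens under the
    student -- i.e. supervised fine-tuning on teacher outputs."""
    nll = -sum(math.log(p) for p in student_probs_of_teacher_tokens)
    return nll / len(student_probs_of_teacher_tokens)

# A larger score gap in the teacher's favor means a lower ranking loss,
# so warm-up leaves the discriminator already preferring teacher text:
assert bradley_terry_loss(2.0, -2.0) < bradley_terry_loss(0.0, 0.0)
```

Driving both losses down before the adversarial phase gives the minimax game a sensible starting point: the student already imitates the teacher at the token level, and the discriminator already provides a meaningful ranking signal.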