scalable-oversight


1 link tagged with all of: language-models + alignment-research + scalable-oversight


Links

Automated Alignment Researchers: Using large language models to scale scalable oversight

Nine copies of Claude Opus 4.6 were given sandbox environments and tasked with autonomously developing weak-to-strong supervision methods, with progress scored by “performance gap recovered” (PGR). The automated alignment researchers (AARs) reached a PGR of 0.97 versus a human baseline of 0.23 and showed partial generalization to new tasks, but the gains did not replicate at production scale, underscoring both the promise and the limits of automated alignment experiments.
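For readers unfamiliar with the metric, here is a minimal sketch of how PGR is commonly computed in weak-to-strong supervision work: the fraction of the gap between a weak supervisor and a strong model's ceiling that the weakly supervised strong model closes. It assumes the linked article uses this standard definition; the function name and the example accuracies are hypothetical and are not taken from the article.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised strong model.

    weak:            score of the weak supervisor on the task
    weak_to_strong:  score of the strong model trained on the weak supervisor's labels
    strong_ceiling:  score of the strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is undefined when the strong ceiling does not beat the weak supervisor.
        raise ValueError("strong_ceiling must exceed weak for PGR to be defined")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.80)
print(f"PGR = {pgr:.2f}")
```

Under this definition, a PGR of 0 means the strong model does no better than its weak supervisor, and a PGR of 1 means it fully matches the ground-truth-trained ceiling.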

Last saved Apr 15, 2026 · 6 min read

Tags: automated-research, weak-to-strong-supervision, scalable-oversight, language-models, alignment-research