1 link tagged with all of: language-models + alignment-research + scalable-oversight
Links
Nine copies of Claude Opus 4.6 were given sandbox environments and tasked with autonomously developing weak-to-strong supervision methods, with progress scored by “performance gap recovered” (PGR). These automated alignment researchers (AARs) reached a PGR of 0.97 versus a human baseline of 0.23 and showed partial generalization to new tasks, but their gains failed to replicate at production scale, underscoring both the promise and the limits of automated alignment experiments.
+ automated-research
+ weak-to-strong-supervision
scalable-oversight ✓
language-models ✓
alignment-research ✓
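
For readers unfamiliar with the PGR metric mentioned in the summary, here is a minimal sketch of how it is usually computed in the weak-to-strong literature: the fraction of the gap between the weak supervisor and the strong model's ceiling that the weak-to-strong method closes. This assumes the linked work uses that standard definition; the function name and the accuracy values are hypothetical and are not taken from the linked work.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling performance gap closed by a method.

    weak:            score of the weak supervisor alone
    weak_to_strong:  score of the strong model trained with weak supervision
    strong_ceiling:  score of the strong model trained with ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap == 0:
        raise ValueError("strong_ceiling equals weak; PGR is undefined")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.80)
print(f"PGR = {pgr:.2f}")
```

Under this definition, a PGR of 0 means the method does no better than the weak supervisor, and a PGR of 1 means it matches the strong ceiling.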