1 link tagged with all of: language-models + weak-to-strong-supervision + automated-research
Links
Nine copies of Claude Opus 4.6 were given sandbox environments and tasked with autonomously developing weak-to-strong supervision methods, with progress scored by “performance gap recovered” (PGR). These automated alignment researchers (AARs) reached a PGR of 0.97 versus a human baseline of 0.23 and showed partial generalization to new tasks, but their gains failed to replicate at production scale, underscoring both the promise and the limits of automated alignment experiments.
automated-research ✓
weak-to-strong-supervision ✓
+ scalable-oversight
language-models ✓
+ alignment-research
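
The summary above reports PGR values without defining the metric. As a minimal sketch, assuming the definition commonly used in weak-to-strong generalization work (the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model recovers), it could be computed as below. The function name, parameter names, and example numbers are hypothetical and are not taken from the linked work.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling performance gap recovered (assumed definition).

    weak            -- score of the weak supervisor on the task
    weak_to_strong  -- score of the strong model trained on weak labels
    strong_ceiling  -- score of the strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap == 0:
        # PGR is undefined when the weak supervisor already matches the ceiling.
        raise ValueError("strong_ceiling must differ from weak")
    return (weak_to_strong - weak) / gap


# Illustrative numbers only: weak 0.60, weak-to-strong 0.75, ceiling 0.80.
example_pgr = performance_gap_recovered(0.60, 0.75, 0.80)  # about 0.75
```

Under this assumed definition, a PGR of 0 means the strong model does no better than its weak supervisor, and a PGR of 1 means it fully matches the ground-truth-trained ceiling.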