Links
Humanity's Last Exam (HLE), an AI benchmark of PhD-level research questions, has been criticized after a review found that a significant fraction of its biology and chemistry questions (29 ± 3.7%) contradict the peer-reviewed literature. An independent follow-up found 18% of a sampled subset of questions problematic, prompting the HLE team to begin a rolling revision process to improve the evaluation. The design of the original review process may have allowed confusing or incorrect questions that do not reflect established scientific knowledge into the benchmark.