
About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong | FutureHouse

6 min read | Saved October 29, 2025

hle 🤖 evaluation 🤖 science 🤖 benchmarks 🤖 peer-reviewed 🤖


Humanity's Last Exam (HLE), an AI benchmark of PhD-level questions, has been criticized because a significant share of its biology and chemistry answers (29 ± 3.7%) contradict peer-reviewed literature. An independent follow-up found that 18% of a subset of questions were problematic, prompting the HLE team to start a rolling revision process to improve the benchmark. The design of the question review process may have let through confusing and incorrect questions that do not reflect actual scientific knowledge.
