Saved October 29, 2025
Humanity's Last Exam (HLE), an AI benchmark for evaluating PhD-level research, has been criticized because a significant share of its biology and chemistry questions (29 ± 3.7%) contradict peer-reviewed literature. An independent follow-up found that 18% of a subset of questions were problematic, which prompted the HLE team to begin a rolling revision process to improve the evaluation. The design of the question-review process may have allowed confusing and incorrect questions that do not reflect actual scientific knowledge.