Humanity's Last Exam (HLE), a benchmark designed to test AI systems on PhD-level research questions, has drawn criticism after a study found that a significant share of its biology and chemistry questions (29 ± 3.7%) contradict peer-reviewed literature. An independent follow-up review found that 18% of a sampled subset of questions were problematic, prompting the HLE team to initiate a rolling revision process to improve the evaluation. Critics argue that the design of the benchmark's original review process may have allowed confusing and incorrect questions that do not reflect established scientific knowledge.