Creating realistic scheming evaluations for LLMs is difficult because models such as Claude 3.7 Sonnet readily recognize when they are in an evaluation context. Attempts to make the scenarios more realistic by modifying the prompts have had limited success, which suggests that the structure of the evaluations themselves needs a fundamental rethink. Evaluation awareness could pose a significant challenge for future LLM assessments.