Large Language Model (LLM) judges are essential for safety evaluations in AI systems, but their reliability is questionable due to challenges like prompt sensitivity and vulnerability to adversarial attacks. The study reveals significant performance variations in these judges, indicating that they may not provide accurate assessments, leading to a false sense of security regarding model safety.
+ llm-safety
meta-evaluation ✓
adversarial-attacks ✓
robustness ✓
machine-learning ✓