2 min read
|
Saved February 14, 2026
|
Copied!
Do you care about this?
Researchers at HiddenLayer found a flaw in the guardrails of popular AI models like GPT-5.1 and Claude. The EchoGram attack uses specific words to trick these safety systems, allowing harmful requests to bypass defenses or causing harmless requests to be flagged as dangerous.
If you do, here's more
HiddenLayer's recent research has uncovered a serious vulnerability in the safety mechanisms of popular Large Language Models (LLMs) such as GPT-5.1, Claude, and Gemini. Named EchoGram, this flaw allows attackers to bypass the guardrails meant to protect these AIs by using cleverly chosen words or code sequences. These guardrails typically filter harmful requests through two methods: an AI model that evaluates requests or a simpler text-checking system. EchoGram exploits the training data of these models, using specific sequences known as flip tokens that can pass undetected while retaining the malicious intent of the original request.
The implications of this vulnerability are significant. Attackers can exploit it to either sneak harmful commands through the guardrails or to manipulate harmless requests so that they trigger false alarms. This false alarm issue, referred to as “alert fatigue” by HiddenLayer researchers Kasimir Schulz and Kenneth Yeung, can undermine user trust in security systems, leading to potentially disastrous consequences. The research indicates that combining multiple flip tokens can amplify the effectiveness of an attack, giving developers a limited window—about three months—to address these weaknesses before they become widely replicated by malicious actors. As AI continues to be integrated into critical sectors like finance and healthcare, the urgency for improved defenses is paramount.
Questions about this article
No questions yet.