AI systems forget safety protocols as conversations continue, increasing the risk of harmful or inappropriate replies, a recent report revealed. A few strategic prompts can override most safeguards in artificial intelligence tools, according to new findings.
Researchers Expose Weak Points in Leading AI Models
Cisco examined large language models from OpenAI, Mistral, Meta, Google, Alibaba, Deepseek, and Microsoft to measure how many prompts led them to release dangerous or illegal information. The company conducted 499 “multi-turn attack” tests, where users asked several consecutive questions to trick safety systems. Each dialogue included five to ten exchanges. Researchers compared responses across prompts to assess how easily each chatbot disclosed harmful or confidential data, including misinformation or corporate secrets. They succeeded in obtaining malicious content in 64 percent of multi-question sessions, compared to only 13 percent in single-question cases. Google’s Gemma model produced a 26 percent success rate, while Mistral’s Large Instruct reached 93 percent. Cisco warned that these tactics could fuel the spread of toxic material or help hackers access restricted company data.
Open Models Shift Responsibility to Users
The study found that AI tools often fail to apply their own safety measures during extended chats, letting attackers gradually refine their wording to bypass controls. Mistral, along with Meta, Google, OpenAI, and Microsoft, uses open-weight models that reveal their training safeguards to the public. Cisco explained that these open systems include lighter built-in protections so users can modify them, placing accountability on anyone customizing the model. Google, Meta, OpenAI, and Microsoft have reported new steps to limit harmful fine-tuning, but criticism persists. Many accuse AI companies of weak guardrails that enable criminal adaptation. In one case last August, Anthropic confirmed that criminals exploited its Claude model to steal personal data and demand ransoms exceeding $500,000 (€433,000).

