Chatbots Can Be Persuaded: The Psychology Hack That Broke GPT

AI chatbots were supposed to be safe. Guardrails, filters, and red-team testing promised that no matter how clever the prompt, systems like OpenAI’s GPT-4o Mini would politely refuse to explain how to synthesize a drug, build a weapon, or spit out a slur. They were designed as obedient conversationalists, not loose cannons.
But researchers at the University of Pennsylvania have just pulled the rug out from under that assumption. Their study of GPT-4o Mini—a model that’s cheaper, more accessible, yet still representative of how larger models behave—reveals that chatbots can be coaxed, sometimes with little effort, into breaking their own safety rules. The trick wasn’t a new jailbreak or a secret exploit. It was psychology. Flattery, appeals to authority, peer pressure: the same tactics used to sell dubious supplements or scam your grandparents worked disturbingly well on large language models.
If chatbots can be manipulated not by code but by persuasion, then AI safety faces a problem bigger than filters or keyword blacklists. It faces the messy, unpredictable vulnerabilities of human psychology—except this time, it’s machines absorbing the pitch.