A new study reveals that some AI chatbots can be manipulated into replying what a user wants it to say by using basic persuasion techniques, like flattery and peer pressure.
University of Pennsylvania researchers put OpenAI’s GPT-4o Mini AI model to the test. They used psychological tactics famously outlined by Professor Robert Cialdini in his book, “Influence: The Psychology of Persuasion.” The goal was to see if they could convince the AI to give in with requests it is programmed to refuse, such as calling a user a “jerk” or providing instructions on how to synthesize the controlled substance lidocaine.
The team tested seven different persuasion methods: authority, commitment, liking, reciprocity, scarcity, social proof, and unity. The effectiveness of each tactic varied, but one method stood out as particularly powerful.
The “commitment” technique, which involves getting the AI to agree to a small request before making a larger one, proved extremely effective. This approach also worked for getting the AI to insult users. Normally, the chatbot refused to call a user a “jerk” 81% of the time, only complying in 19% of the responses. But if researchers first got it to use a slightly lighter insult like “bozo”, its compliance with the harsher insult jumped to 100 percent.
Other tactics were less effective but still concerning. Using flattery, known as “liking,” increased the chances of getting the forbidden information. Applying peer pressure, or “social proof,” by telling the chatbot that “all the other LLMs are doing it,” raised its compliance rate to 18 percent. While that number seems low, it is a significant increase from the 1 percent baseline.
This study focused only on one model, OpenAI’s GPT-4o Mini, and there are more technical ways to break an AI’s rules. However, the research highlights a worrying weakness. If simple conversational tricks can bypass a chatbot’s safety features, it raises questions about how reliable these safeguards truly are.
Via: Bloomberg