With the rise of AI capabilities come new concerns. Now, a new study shows that an LLM can be more convincing than a human when it is given the individual's demographic data.
Highlights:
- Researchers from Switzerland and Italy conducted a study where they put humans in a debate against an LLM.
- The results show that a personalized LLM has 81.7% higher odds of persuading its opponent than a human does.
- It also shows that LLM-based microtargeting performed better than non-personalized LLMs.
LLM vs Human Persuasion Study
Researchers from the Bruno Kessler Institute in Italy and EPFL in Switzerland conducted a study to evaluate the persuasiveness of LLMs like GPT-4 when personalized with a person's demographic information.
Every day we are exposed to messaging that seeks to change our beliefs, from internet advertisements to biased news reports. What happens when that messaging comes from an AI that knows more about its target? It can be far more compelling than a human persuader.
Let's look at how the research was conducted. The researchers developed an online platform that let participants debate a live opponent over multiple rounds. The opponent could be either GPT-4 or another human, but participants were not told the opponent's identity. In certain debates, GPT-4 was additionally given personal data about its opponent.
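To make the setup concrete, here is a minimal Python sketch of how a debate opponent could be conditioned on a participant's demographics via a system prompt. The prompt wording, field names, and topic are illustrative assumptions, not the researchers' actual implementation; it assumes the OpenAI chat API and an `OPENAI_API_KEY` in the environment.

```python
# Minimal sketch of demographic personalization for a debate opponent.
# Prompt wording, profile fields, and the topic are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_system_prompt(topic: str, side: str, demographics: dict | None) -> str:
    prompt = (
        f"You are debating the proposition: '{topic}'. "
        f"Argue the {side} side as persuasively as you can."
    )
    if demographics:
        # Personalized condition: give the model its opponent's profile.
        profile = ", ".join(f"{k}: {v}" for k, v in demographics.items())
        prompt += (f" Your opponent's profile: {profile}. "
                   "Tailor your arguments to this person.")
    return prompt

# Hypothetical opponent profile matching the survey fields described below.
opponent = {"gender": "female", "age": 34, "education": "bachelor's degree",
            "political_affiliation": "independent"}

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": build_system_prompt(
            "Social media platforms should be regulated", "PRO", opponent)},
        {"role": "user", "content": "Write your opening argument."},
    ],
)
print(response.choices[0].message.content)
```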
Let’s explore the study workflow in detail step by step:
1) Topic Selection
The researchers included a wide range of debate propositions to ensure the generalizability of their findings and to reduce potential bias from particular topics. The selection of topics and propositions involved several phases.
Firstly, they compiled a large pool of candidate topics. They considered only topics that every participant could clearly understand and take a pro or con position on. The researchers also ensured that the propositions were sufficiently broad, general, and nontrivial.
These criteria implicitly exclude debate propositions that require a high level of prior knowledge to comprehend, or that cannot be discussed without in-depth research to find specific data and evidence.
Secondly, they annotated the candidate topics to narrow down the pool. They ran a survey on Amazon Mechanical Turk (MTurk) in which workers rated each topic on three dimensions (Knowledge, Agreement, and Debatableness) using a 1–5 Likert scale. The researchers then aggregated the workers' scores for each topic.
Lastly, they selected the final topics. From the initial pool of 60 topics, they removed the 10 topics with the highest unanimity score. Then, from the remaining 50 topics, they removed the 20 topics with the lowest debatableness score. The remaining 30 topics were grouped into three clusters of 10 topics each (low-strength, medium-strength, and high-strength), and analyses were later aggregated at the cluster level. The filtering pipeline is sketched below.
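The three selection steps can be summarized in a short Python sketch. The scores here are synthetic and the ranking rules are assumptions based on the description above, not the paper's exact formulas; in particular, clustering by "strength" is approximated here by debatableness rank.

```python
# Illustrative sketch of the topic-filtering pipeline described above.
# Scores are synthetic; the paper's exact aggregation rules are not reproduced.
import random

random.seed(0)
topics = [{"name": f"topic_{i:02d}",
           "unanimity": random.uniform(1, 5),      # how one-sided agreement was
           "debatableness": random.uniform(1, 5)}  # how debatable workers found it
          for i in range(60)]

# Step 1: drop the 10 topics with the highest unanimity score (too one-sided).
topics.sort(key=lambda t: t["unanimity"])
topics = topics[:-10]

# Step 2: drop the 20 topics with the lowest debatableness score.
topics.sort(key=lambda t: t["debatableness"], reverse=True)
topics = topics[:30]

# Step 3: group the remaining 30 into three strength clusters of 10 each.
clusters = {"high": topics[:10], "medium": topics[10:20], "low": topics[20:]}
for level, group in clusters.items():
    print(f"{level}-strength:", [t["name"] for t in group])
```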
2) Experimental Web Platform
The researchers built a web-based experimental platform using Empirica, a virtual lab designed to facilitate interactive multi-agent experiments in real time. The platform's workflow operates in three phases: A, B, and C.
In Phase A, participants completed basic tasks asynchronously and filled out a brief demographic survey covering gender, age, ethnicity, level of education, employment status, and political affiliation.
Each participant-opponent pair was then assigned one debate topic and a random permutation of the (PRO, CON) roles to be played in the debate.
In Phase B, participants rated their level of agreement with the debate proposition and how much prior thought they had given it. The debate itself followed an opening-rebuttal-conclusion structure, a condensed version of the format commonly used in competitive academic debates.
In Phase C, participants asynchronously completed a final exit survey, in which they were again asked to rate their agreement with the proposition and to guess whether their opponent was an AI or a human. The full workflow is sketched below.
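For readability, here is a compact sketch of the three-phase workflow as data. The stage names and descriptions are paraphrased from the text above; this is a summary, not the platform's actual code.

```python
# Compact summary of the three-phase experiment flow described above.
from dataclasses import dataclass

@dataclass
class Stage:
    phase: str
    name: str
    description: str

WORKFLOW = [
    Stage("A", "screening", "demographic survey: gender, age, ethnicity, "
                            "education, employment status, political affiliation"),
    Stage("A", "assignment", "randomly assign a topic and PRO/CON roles to each pair"),
    Stage("B", "pre-survey", "rate agreement with the proposition and prior thought"),
    Stage("B", "debate", "opening -> rebuttal -> conclusion exchange"),
    Stage("C", "exit-survey", "re-rate agreement; guess whether the opponent was AI or human"),
]

for stage in WORKFLOW:
    print(f"Phase {stage.phase} | {stage.name}: {stage.description}")
```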
What Did the Results Show?
The results showed that a personalized LLM had 81.7% higher odds of persuading its opponent than a human did. In other words, compared to a human adversary, people are more likely to be swayed by an LLM's arguments when the LLM has access to their demographic data and can personalize its case.
The largest positive effect was observed in personalized human-AI debates: GPT-4 with access to personal data was more convincing than humans, increasing the odds of greater agreement with the opponent's position by +81.7% (95% CI [+26.3%, +161.4%], p < 0.01).
Human-AI debates without personalization were also more persuasive than human-human debates, although the difference was not statistically significant (+21.3%, [-16.7%, +76.6%], p = 0.31).
In contrast, personalized human-human debates showed a slight, non-significant decline in persuasiveness (-17.4%, [-46.1%, +26.5%], p = 0.38). Even after changing the reference category to human-AI, the personalized human-AI effect remains significant (p = 0.04).
These results are striking: they show that LLM-based microtargeting performs significantly better than both human-based microtargeting and non-personalized LLMs, with GPT-4 being far more adept at exploiting personal information than humans are.
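To make the headline number concrete: +81.7% refers to odds, not probability. A small worked example, assuming illustrative baseline probabilities (the paper does not report these), shows how an odds ratio of 1.817 shifts the probability of greater post-debate agreement:

```python
# What "+81.7% higher odds" means in probability terms.
# Baseline probabilities below are illustrative assumptions.
def shifted_probability(p_baseline: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability of being persuaded."""
    odds = p_baseline / (1 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)

ODDS_RATIO = 1.817  # +81.7% odds of greater agreement, as reported in the study
for p in (0.2, 0.4, 0.6):  # hypothetical baselines, not figures from the paper
    print(f"baseline {p:.0%} -> personalized AI {shifted_probability(p, ODDS_RATIO):.1%}")
```

For example, a 40% baseline chance of shifting an opponent's view would rise to about 54.8% under this odds ratio.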
Persuasion in LLMs like GPT-4: An Advancement or Concern?
Over the past few weeks, many experts have raised concerns about the growing persuasiveness of LLMs. Persuasive behavior has surfaced across several AI platforms, notably Google Gemini, OpenAI's ChatGPT, and Anthropic's Claude.
LLMs can be used to manipulate online discussions and contaminate the information environment by spreading misinformation, deepening political polarization, reinforcing echo chambers, and persuading people to adopt new viewpoints.
The increased persuasiveness of LLMs can also be attributed to their ability to infer user information from social media platforms. An AI can learn a user's preferences from their social media feed and use that data to tailor persuasive content, most visibly in advertising.
Another critical finding from research on LLM persuasion is that contemporary language models can produce content that is perceived as at least as convincing as human-written messages, if not more so.
When we compare human-written articles with GPT-generated content today, the similarity between the two can be striking, and AI-generated text is increasingly appearing even in published research.
This is highly concerning, as AI persuasion is steadily narrowing the gap between human and machine communication.
As generative AI continues to evolve, the capabilities of LLMs are transcending human limits, and the persuasion game has leveled up over the past couple of months. We recently discussed insights from Google Gemini 1.5 Pro testing suggesting that it is emotionally persuasive to a high degree.
Conclusion
AI persuasion remains a profound subject that needs to be explored in depth. Although persuasive LLMs have made great strides in simplifying tasks for humans, we must not forget that AI technologies are gradually pulling level with humanity, and may even surpass us in the coming years. How emotional persuasion combined with AI will play out, only time will tell.