A group of researchers from Purdue University presented a study revealing that roughly half of ChatGPT's responses to programming prompts were inaccurate.
Highlights:
- 52% of the programming responses produced by ChatGPT are inaccurate, according to a study by a group of Purdue University researchers.
- Three aspects of programming questions were taken into consideration: posting time, question kind, and popularity.
- Even when the answers were incorrect, users still preferred and blindly trusted ChatGPT responses simply because they sounded articulate and comprehensive.
52% of ChatGPT Answers for Coding Were Wrong
With the recent rise of generative AI technologies worldwide, dependency on these tools for solving programming-related queries has skyrocketed. However, a recent study may give developers reason to pause before relying on them.
52% of the programming responses produced by ChatGPT are inaccurate, according to a study that a group of Purdue University academics presented at the Computer-Human Interaction conference.
In order to conduct the study, the researchers reviewed 517 Stack Overflow questions and examined the responses.
“To bridge the gap, we conducted the first in-depth analysis of ChatGPT answers to 517 programming questions on Stack Overflow and examined the correctness, consistency, comprehensiveness, and conciseness of ChatGPT answers. Our analysis shows that 52% of ChatGPT answers contain incorrect information and 77% are verbose.”
Samia Kabir, researcher at Purdue University
That echoes what authors and teachers have been experiencing: it is an astonishingly high proportion for a tool that people rely on to be correct and precise. AI systems such as ChatGPT frequently generate completely erroneous responses out of thin air.
For a variety of Stack Overflow question posts, the researchers thoroughly examined the accuracy and calibre of responses across four different quality criteria.
The researchers further explored how real programmers weigh answer quality, linguistic aspects, and correctness when deciding between ChatGPT and Stack Overflow.
Looking inside the Study
Three aspects of programming questions were taken into consideration by the researchers: posting time, question kind, and popularity. They ended up with 517 sampled questions. Let’s take a look at them.
Initially, they gathered every question from the March 2023 Stack Overflow data dump and arranged them according to the number of views.
Within each popularity category, the researchers divided the questions into two recency categories: Old questions, which were uploaded before ChatGPT’s introduction on November 30, 2022, and New questions, which were posted after that date.
Then, the researchers concentrated on three typical question types drawn from the literature: conceptual, how-to, and debugging.
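To make that sampling scheme concrete, here is a minimal, hypothetical Python sketch of how questions could be bucketed by popularity, recency, and question type before sampling. The field names, view-count thresholds, and data structures are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# ChatGPT's public release date, used as the old/new cut-off in the study.
CHATGPT_RELEASE = datetime(2022, 11, 30, tzinfo=timezone.utc)

@dataclass
class Question:
    # Hypothetical fields; the actual Stack Overflow data dump has more columns.
    title: str
    views: int
    posted_at: datetime  # assumed timezone-aware
    kind: str            # "conceptual", "how-to", or "debugging"

def popularity_bucket(q: Question) -> str:
    # Placeholder view-count thresholds; the paper's exact cut-offs may differ.
    if q.views >= 10_000:
        return "high"
    if q.views >= 1_000:
        return "medium"
    return "low"

def recency_bucket(q: Question) -> str:
    # "old" = posted before ChatGPT's release, "new" = posted after.
    return "old" if q.posted_at < CHATGPT_RELEASE else "new"

def stratify(questions: list[Question]) -> dict[tuple[str, str, str], list[Question]]:
    """Group questions into (popularity, recency, type) strata for sampling."""
    strata: dict[tuple[str, str, str], list[Question]] = {}
    for q in questions:
        key = (popularity_bucket(q), recency_bucket(q), q.kind)
        strata.setdefault(key, []).append(q)
    return strata
```

Sampling evenly from each stratum is one way to end up with a balanced set such as the 517 questions used in the study; the researchers' actual selection procedure is described in the paper itself.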
The results show that, among the 517 ChatGPT answers labelled by the researchers, 52% of them contain incorrect information, 78% are inconsistent with human answers, 35% lack comprehensiveness, and 77% contain redundant, irrelevant, or unnecessary information.
There were four types of incorrectness in the answers: Conceptual (54%), Factual (36%), Code (28%), and Terminology (12%) errors. Some answers had more than one of these errors.
Factual errors occur when an answer states fabricated or untruthful information about existing knowledge; conceptual errors occur when the answer reflects a failure to understand the question.
Users still Preferred ChatGPT Responses
The fact that many human programmers appear to prefer the ChatGPT answers is particularly concerning. After surveying 12 programmers (a rather small sample size), the Purdue researchers found that participants preferred the ChatGPT answers 35% of the time and failed to catch the AI-generated errors in them 39% of the time.
In particular, users frequently missed the misinformation and underestimated the level of incorrectness when it was not easily verifiable; they were swayed instead by the textbook-style responses, polite language, and comprehensiveness.
Conclusion
Just like all other AI models, ChatGPT is prone to mistakes. Perhaps this study will serve as a reality check for developers who rely heavily on LLMs to solve their coding-related tasks.