A study by the University of Oxford revealed that nearly half of the major news websites around the globe are blocking AI web crawlers. Let’s dive deeper into the heart of the matter.
Highlights:
- 48% of major news websites are blocking OpenAI's web crawlers, according to a study.
- The share is higher among legacy print publications than among new-age digital outlets.
- AI companies use web crawlers to collect data from these news websites for training their LLMs, which publishers see as a threat to the integrity of their news data.
Study says Half of the News Websites Don’t Want AI
According to the study by the University of Oxford’s Reuters Institute for the Study of Journalism, 48% of the most popular news websites across 10 countries are blocking the crawlers of OpenAI, the company behind ChatGPT.
Web crawlers, sometimes known as “spiders” or “bots,” automatically search the internet and systematically gather data. They are useful for many different things. They are the main building blocks behind search engines like Google, which rely on the information gathered by their crawlers to index web pages and return search results quickly.
AI models like ChatGPT are also highly dependent on such crawlers. These tools are built on large language models (LLMs), which must be trained on data collected from websites across the internet. Once trained, LLMs like GPT can generate outputs and answer users' questions through ChatGPT-style interfaces.
All in all, AI chatbots and models need access to news websites to stay up to date with the latest information. This applies not just to ChatGPT: Google’s Bard (now Gemini) also wants access to news. Without it, their answers risk becoming outdated.
Major Insights from the Technical Report
The following facts and figures come from the technical report by the Director of Research, Dr. Richard Fletcher, which presents the insights in detail.
By the end of 2023, 48% of the most widely used news websites across ten countries were blocking OpenAI’s crawlers. A smaller number, 24%, were blocking Google’s AI crawler.
Almost every website (97%) that decided to block Google’s AI crawler was also blocking OpenAI’s crawlers.
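The article does not spell out how a site blocks a crawler, but a common mechanism is a robots.txt file that disallows a crawler's user-agent token. The sketch below assumes that mechanism and uses Python's standard `urllib.robotparser` to check which crawlers a robots.txt file turns away. `GPTBot` and `Google-Extended` are the tokens OpenAI and Google document for their AI crawlers; the sample file, the `news.example.com` address, and the `blocked_agents` helper are hypothetical and only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site (not taken from the study).
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://news.example.com/"):
    """Return the user-agent tokens that may not fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


result = blocked_agents(SAMPLE_ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"])
print(result)
```

In this sample, the two AI crawlers are blocked while Google's regular search crawler is still allowed, which mirrors the pattern of blocking AI crawlers without giving up search indexing. Note that robots.txt is a request, not an enforcement mechanism: it only works when the crawler chooses to honor it.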
There is much more to discuss, so let’s break it down.
Country-wise Proportions
The results also show wide variation by country: the share of news websites blocking OpenAI ranges from just 20% in Mexico to 79% in the USA.
Denmark is second at 67%, with 60% of news outlets in Germany, India, and Norway also blocking AI crawlers. Interestingly, the UK ranks behind them at only 53%.
Google wins over OpenAI
For Google, the blocking rate is significantly lower: only 24%, half of OpenAI's 48%. This might be because OpenAI is better known worldwide, and also because news websites don't want to upset Google. The concern is that banning Google's crawler might hurt their search ranking.
Almost everyone blocking Google is also blocking OpenAI, but not vice versa. However, this might change as Google's AI models become more popular.
Which Sites Are Blocking the Most?
While news sites in general are still not friendly toward AI, some organizations are more cautious than others, especially the older ones. Websites of established print media are more likely to block AI crawlers than digitally native outlets.
On average, 57% of print publications are blocking OpenAI's access, compared with only 31% of digital-born websites. We believe this is because older organizations are more aware of the effects that AI can have on their revenue. They also have technical teams that know how to block crawlers.
On the other hand, modern media outlets are more open to such changes and might think that they can ultimately benefit from AI. However, if AI companies make major changes that hurt their business, they might follow the lead of print and broadcast media.
Is the future of Generative AI in question?
As we mentioned, these news outlets and websites are concerned about the privacy and integrity of their data online. A growing number of news companies also worry about copyright and fair remuneration, the loss of direct traffic to their websites, and the spread of false information.
However, this puts the working of LLMs in jeopardy. LLMs are the main building blocks of Generative AI tools such as ChatGPT, Google Gemini, Groq AI, etc. These companies use crawlers to scrape data heavily from news websites.
Without new data, AI models will become stuck in the past, unable to reflect developments in the real world. Model collapse may also occur if the models consume too much low-quality, AI-generated data.
But there is some hope! Some news websites don’t seem to have any issues with Generative AI companies accessing their content, as they look to explore the benefits of having their journalism featured on these companies' platforms. The benefit is mutual, since people using Generative AI tools will also be referred to these websites for news.
This has led to some mutually beneficial deals between AI companies and news websites, adhering to all the policies involved and the code of conduct. One of these deals involved Axel Springer partnering with OpenAI, allowing OpenAI to respond to user queries using news content from Axel Springer's websites.
Conclusion
For news websites, blocking AI web crawlers is an important step in protecting their data and their profits. However, it has serious implications for Generative AI tools. The most effective way out of this predicament would be for AI companies either to strike deals with publishers or to improve their models so that they depend less on news content.