Amid recent strides in the field of LLMs, the quest for ever larger datasets has led companies to resort to controversial data acquisition methods. A new report points out that many top tech companies are now facing lawsuits over their use of copyrighted content for training purposes.
Highlights:
- The latest New York Times report exposes AI companies resorting to controversial methods to obtain vast amounts of data.
- Many tech companies are facing lawsuits over their use of copyrighted and private content.
- Leading tech company Google altered its privacy policy to gain more data for training its generative AI models.
Why do AI companies need massive amounts of data?
For many years, sites like Wikipedia and Reddit seemed like endless sources of information. But as AI has advanced, ever more data is needed to expand the capabilities of models.
The first generation of Large Language Models (LLMs) is also now in need of refinement. Companies like OpenAI, Google, and Meta are looking for new ways to source data for training purposes, but they have been largely limited by privacy laws and internal policies.
This makes data gathering a matter of utmost importance and priority. According to Epoch, a research institute, tech companies could run through the high-quality data on the internet as soon as 2026.
In response, Google recently revised its terms of service to allow publicly available content, such as Google Docs and restaurant reviews on Google Maps, to be used for its AI products. This widens the pool of data that can, in turn, be used to improve its LLMs.
Let’s take a look at a few recent controversies related to how AI giants are training their models.
OpenAI
According to a report by The New York Times, OpenAI used over a million hours of YouTube videos to train its GPT-4 model. This adds to the many questions OpenAI already faces about the training data used for its various models.
The report states that, to increase the amount of data available for training GPT-4, OpenAI took the risky decision to gather data from YouTube videos. In late 2021, OpenAI found itself running short of data for training the model.
It had used up nearly every readily available data source on the internet and needed new ones. So it designed Whisper, a speech recognition tool, to transcribe YouTube videos and podcasts.
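For context, here is a minimal sketch of what that kind of transcription looks like using the open-source whisper package. This is purely illustrative, not OpenAI's internal pipeline; the file path and model size are placeholders.

```python
# Purely illustrative sketch, not OpenAI's internal pipeline: transcribing a
# locally saved audio file with the open-source `whisper` package
# (pip install openai-whisper). File path and model size are placeholders.
import whisper

# Load a pretrained checkpoint; "base" is fast, larger checkpoints
# ("small", "medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Run speech recognition on the audio file and print the recognized text.
result = model.transcribe("downloaded_podcast_episode.mp3")
print(result["text"])
```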
Transcribing videos this way helped OpenAI gather vast amounts of text from YouTube to expand GPT-4's capabilities. However, YouTube prohibits users from using its content for independent applications or from accessing it through bots or scrapers.
The report further says that OpenAI employees, including its president, Greg Brockman, knew they were entering a legal grey area but believed that training GPT-4 with YouTube data was fair use. When asked about the data used for GPT-4, Brockman commented that they used “numerous sources.”
A month earlier, concerns were also raised about how Sora, OpenAI’s groundbreaking text-to-video generator, was trained.
In an interview with the Wall Street Journal (WSJ), OpenAI CTO Mira Murati offered vague responses when asked about the source of the videos Sora was trained on. Murati said that the company only used sources that are publicly available and licensed.
However, when asked for further clarification on whether Sora had been trained with data from platforms like YouTube, Instagram, or Facebook, Murati said, “I’m actually not sure about that,” before adding, “You know, if they were publicly available — publicly available to use. But I’m not sure. I’m not confident about it.”
Things may become clearer when Sora is released, but the questions surrounding its training data will remain a significant challenge for the company.
Google
The report also states that some Google employees were aware of OpenAI’s use of YouTube videos as training data. However, they were in no position to stop OpenAI, as Google had also been using these videos to train its own models.
Raising an issue over this would have invited allegations about how Google itself handles YouTube data under its own terms and conditions.
Google’s terms and conditions state that it can access YouTube data to develop new features for the platform. However, doubts have been raised over whether Google may use this data for its own AI services rather than for improving the YouTube platform.
Google further said that its AI models were trained on some content that was permitted under agreements with certain creators and that it did not misuse data from the platform.
In June, Google’s legal department instructed the privacy team to draft language aimed at expanding the permissible use of consumer data. The proposed changes would allow Google to utilize publicly available content from platforms like Google Docs and Google Sheets for various AI products.
While Google’s previous privacy policy limited the use of such data to training language models and enhancing features like Google Translate, the updated terms would enable its application across a broader spectrum of AI technologies, including Bard and Cloud AI capabilities.
Here is how Google changed its privacy policy last year. The policy initially read:
“Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google’s language models and build features like Google Translate.”
Now:
“Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google’s language AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
Meta
Ahmad Al-Dahle, Meta’s vice president of generative AI, revealed in internal meetings that his team had extensively utilized English-language books, essays, poems, and news articles available on the internet for training their models.
Expressing the need for more data to match ChatGPT’s capabilities, Al-Dahle emphasized the urgency of the situation, leading to frequent discussions among business development leaders, engineers, and lawyers in March and April 2023.
During these discussions, strategies such as acquiring full licensing rights to new titles or purchasing publishing companies like Simon & Schuster were deliberated.
The team also considered summarizing books and essays from the internet without permission, despite potential legal repercussions, prompting a lawyer to raise “ethical” concerns regarding intellectual property rights.
Additionally, Meta faced constraints due to privacy changes implemented after the 2018 Cambridge Analytica scandal.
During these meetings, Nick Grudin, Meta’s vice president of global partnerships and content, underscored the significance of data volume in matching ChatGPT’s level of performance, suggesting that Meta could adopt practices similar to OpenAI’s in leveraging copyrighted material.
Stability AI
Another issue involving the legality of training data led to the resignation of Ed Newton-Rex, Stability AI’s vice president of audio. He stated that he did not agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use’. He explained his resignation in the following post on X:
I’ve resigned from my role leading the Audio team at Stability AI, because I don’t agree with the company’s opinion that training generative AI models on copyrighted works is ‘fair use’.
— Ed Newton-Rex (@ednewtonrex) November 15, 2023
This further shows how leading figures within the industry are concerned about the use of copyrighted work to improve generative AI models.
Since July 2023, Stability AI, Midjourney, and DeviantArt have been caught up in legal battles over their AI image generators, facing allegations of copyright infringement.
In October, a federal judge dismissed the majority of the claims brought by a group of artists, including illustrator Sarah Andersen, against Midjourney and DeviantArt. However, he allowed the lawsuit against Stability AI to proceed.
Stable Diffusion
A report issued by the Stanford Internet Observatory revealed over 1,000 confirmed instances of child sexual abuse imagery within LAION-5B, a massive dataset used to train generative AI models like Stable Diffusion 1.5.
Illegal child sexual abuse material (CSAM) represents an extreme example of the wider problem of AI developers not having or sharing clear records of what material is used to train their models.
Additionally, it may take only a small selection of CSAM images to create many new and realistic synthetic images of child abuse. This alarming finding has sparked fears that such content could enhance AI image generators’ ability to produce realistic but fake images of child sexual exploitation, along with other harmful material.
Nowadays, tech companies building LLMs increasingly turn to synthetic data for reasons such as data scarcity, privacy concerns, and the sheer volume of data their models require.
By producing synthetic data, companies can augment their training sets, address biases, and improve the robustness and generalization of their LLMs without relying too heavily on real-world data, which may be limited or ethically sensitive to collect.
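To illustrate one common pattern, the sketch below asks a hosted model to paraphrase a seed example into several synthetic variants. It assumes the openai Python client with an API key in the environment; the model name, prompts, and seed text are placeholders rather than any company's actual pipeline.

```python
# Illustrative sketch of LLM-based data augmentation: generate synthetic
# paraphrases of a seed training example. Assumes the `openai` package and an
# OPENAI_API_KEY in the environment; model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def synthesize_paraphrases(example: str, n: int = 3) -> list[str]:
    """Ask the model for n reworded versions of one seed example."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Rewrite the user's text in different words, preserving its meaning."},
            {"role": "user", "content": example},
        ],
        n=n,              # request several completions per call
        temperature=0.9,  # higher temperature gives more varied rewrites
    )
    return [choice.message.content for choice in response.choices]

seed = "The quarterly report shows revenue grew by 12 percent."
for variant in synthesize_paraphrases(seed):
    print(variant)
```

In practice, generated examples like these are typically filtered and deduplicated before being mixed into a training set.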
Conclusion
The controversies surrounding the sources of training data for large language models underscore concerns regarding legality and ethicality. Companies must prioritize ethical considerations and ensure transparency in their data acquisition practices to avoid overstepping boundaries in their quest for enhanced model capabilities.