- OpenAI, Anthropic, and other AI firms are running out of quality data for training their models.
- This could impede AI development as companies race to build the best products in the booming space.
- Companies are now exploring other ways to train AI, like using synthetic data, per the Journal.
Companies like OpenAI and Anthropic are scrambling to get their hands on one of AI’s most valuable resources: reliable data. That deficit could hinder the development of large language models that power their chatbots as the race to build the best products in the growing sector intensifies.
Typically, OpenAI’s ChatGPT and its chatbot competitors are trained on troves of information like scientific papers, news articles, and Wikipedia posts scraped from the web to generate human-like responses. The higher the quality and greater the trustworthiness of the data these models use, the more capable they are of producing accurate, desirable outputs, or so the theory goes.
Based on that, a shortage could make it harder for companies to make their AI products smarter. And there’s more than a 50% chance that the demand for high-quality data will surpass the supply of available training material by 2028, Pablo Villalobos, an AI expert at research firm Epoch, told the Wall Street Journal.
So, why do tech firms appear to be scrambling for reliable information?
First, only a slice of online data is generally suitable for AI training. That’s because most public information on the web contains sentence fragments and other textual flaws that can prevent AI from producing conversational responses. The lack of usable data is compounded by the slew of AI-generated text already on the internet, which can degrade a model trained on it, a process experts call “model collapse.”
On top of that, major news outlets, social-media platforms, and other public sources of information have restricted access to their content for training AI over concerns around copyright, privacy, and fair compensation. People, too, don’t seem keen on making their iMessage conversations and other private text data accessible for training purposes.
That’s leaving companies scrambling to find new data sources to beef up their tools. OpenAI, for instance, is discussing training GPT-5, which would be its most advanced model, on YouTube video transcripts, sources told the Journal.
OpenAI has also discussed creating a data market where providers can get paid for content that the company considers valuable for model training, sources familiar with the matter told the Journal. Google is reportedly considering a similar method, per the Journal, though researchers have yet to build a system to carry it out properly.
Other firms are experimenting with what they call synthetic data to further their models. Anthropic has fed internally generated data into its AI chatbot family Claude, Jared Kaplan, chief scientist at the startup, said in an October 2023 Bloomberg interview. OpenAI, which created ChatGPT, is also looking into that tactic, a spokesperson told the Journal.
Concerns around data scarcity come as users complain about the quality of AI chatbots.
Some users of GPT-4, OpenAI’s most advanced model behind ChatGPT, claim they’ve encountered problems getting the bot to follow instructions and respond to queries. Google paused the AI image-generation feature on its model Gemini after users complained it produced historically inaccurate pictures of US presidents. AI models are generally prone to hallucinating, or confidently presenting false information as fact.
While companies figure out how to continue training their models, some seem open to limiting the size of their AI in the meantime.
“I think we’re at the end of the era where it’s going to be these giant, giant models,” Sam Altman, the CEO of OpenAI, said at an MIT conference event in 2023. “And we’ll make them better in other ways.”
OpenAI and Google didn’t immediately respond to a request for comment from Business Insider before publication. Anthropic declined to comment.