Nearly 400 US Newspapers File Joint Lawsuit Against OpenAI and Microsoft Over AI Training Data Scraping

A coalition representing nearly 400 US newspapers has filed a copyright lawsuit against OpenAI and Microsoft, alleging the companies systematically scraped news content without authorization to train their artificial intelligence models.

According to court documents made public on June 24 and first reported by Bloomberg, the publisher coalition filed the lawsuit on Wednesday in the US District Court for the Southern District of New York. The complaint accuses Microsoft and OpenAI of unauthorized scraping of news content to train the AI models powering Copilot and ChatGPT, alleging copyright infringement and violations of the Digital Millennium Copyright Act (DMCA).

The lawsuit claims the defendants “systematically and surreptitiously” crawled publisher websites, copying articles, stories, and other original works onto their own servers. The plaintiffs allege the companies used this content to train large language models while stripping copyright management information from the works.

“The generative AI products are built on content that publishers have long invested in, yet they generate billions of dollars in market value for the defendants while publishers have not received a single cent,” the complaint states. The plaintiffs further warned that if AI companies are permitted to exploit news content without accountability, the current AI boom could become the “death knell” for local journalism.

Matthew Platkin, former New Jersey Attorney General (Democrat), who represents the plaintiffs, described the lawsuit as the largest legal action ever mounted by local and regional newspapers.

OpenAI spokesperson Drew Pusateri responded by stating that the company’s models drive innovation and that their training data comes from publicly available sources, operating under the principle of fair use. Microsoft did not immediately respond to requests for comment.

The case adds to the growing legal pressure on AI companies over their data collection practices, as content creators and publishers worldwide challenge the unauthorized use of copyrighted material for training commercial AI systems.