OpenAI Wins Court Battle Over Data Scraping: Copyright Claims by News Outlets Rejected

Published
James Morales

Key Takeaways

  • Federal Judge Colleen McMahon has granted OpenAI’s motion to dismiss a copyright lawsuit brought by Raw Story and AlterNet.
  • The news outlets objected to OpenAI’s use of their content to train AI.
  • The latest decision could have implications for a string of similar lawsuits.

A recent decision by Judge Colleen McMahon of the Southern District of New York has established a crucial precedent for OpenAI and the wider AI industry.

By granting OpenAI’s motion to dismiss a lawsuit brought by the news outlets Raw Story and AlterNet, McMahon may have shut down a key avenue in publishers’ efforts to claim that AI training infringes their intellectual property rights.

News Outlets Dealt Legal Blow

As with a string of similar cases making their way through the courts, the crux of the plaintiffs’ lawsuit was OpenAI’s mass scraping of web content to create datasets for training its large language models.

In this case, the specific datasets were Common Crawl, a massive, publicly available archive of the internet generated by monthly web crawls since 2008, and two internal OpenAI datasets, WebText and WebText 2, which were compiled from web pages linked in Reddit posts made between 2005 and 2020.

In their complaint, Raw Story and AlterNet alleged that OpenAI’s ChatGPT “provided responses to users that regurgitate verbatim or nearly verbatim copyright-protected works of journalism without providing any author, title, or copyright information contained in those works.”

In their motion to dismiss the case, OpenAI’s lawyers argued that the plaintiffs lacked Article III standing to assert their claims, i.e., that the dispute did not belong in federal court.

Federal courts are the main venue for litigating intellectual property matters. In siding with OpenAI, however, McMahon ruled that ChatGPT’s outputs do not cause the kind of “concrete harm” required to bring a lawsuit under the Digital Millennium Copyright Act (DMCA).

AI Training and the Digital Millennium Copyright Act

Article III of the U.S. Constitution outlines the judicial authority of federal courts vis-à-vis the States.

Until recently, lawsuits alleging violation of an Act of Congress were broadly determined to belong in federal court. However, two Supreme Court decisions, Spokeo, Inc. v. Robins (2016) and TransUnion LLC v. Ramirez (2021), introduced key changes to the doctrine of Article III standing.

For AI copyright lawsuits under the DMCA, McMahon determined that plaintiffs must show specific financial or reputational harms to pursue damages or injunctive relief.

Other AI Copyright Claims

McMahon’s ruling has important implications for other AI copyright lawsuits that cite the DMCA, including cases brought against OpenAI by The New York Times, The Authors Guild, and Sarah Silverman et al.

A key part of Raw Story and AlterNet’s case was that OpenAI violated the DMCA by removing copyright management information from content scraped from the web before using it to train AI models. However, that argument has now been rejected twice by the courts: Judge Jon S. Tigar previously dismissed a similar argument in a lawsuit against GitHub.

The removal of copyright management information is also mentioned in the New York Times’ complaint. However, a DMCA claim is only one of seven made by the newspaper.
