
Nvidia Faces Legal Action from Authors Over AI’s Unauthorized Book Training

Last Updated March 11, 2024 3:41 PM

Stewart O'Nan is one of three authors suing Nvidia for copyright infringement. Photo by Francois G. Durand/Getty Images.

Key Takeaways

Three authors have filed a class action lawsuit against Nvidia.
The complaint alleges that Nvidia’s NeMo large language models (LLMs) were trained on copyright-protected materials.
Training datasets compiled from online book repositories are central to several ongoing AI copyright cases.

Does using copyrighted material as AI training data count as fair use? Or should AI developers require permission to train their models on someone else’s intellectual property (IP)?

Such questions sit at the heart of an ongoing legal debate that has pitted IP owners against the likes of OpenAI, Google and Adobe. And now, Nvidia has joined the list of firms facing copyright infringement lawsuits over their AI training data.

Nvidia Accused of Intellectual Property Theft

The proposed class action lawsuit against Nvidia was filed on Friday, March 8, by three authors who allege their works were used to train the company’s NeMo large language models (LLMs).

Specifically, Brian Keene, Abdi Nazemian and Stewart O’Nan have taken issue with Nvidia’s use of a dataset known as “the Pile.”

An 825GB corpus of English-language texts assembled for LLM training purposes, the Pile includes a sub-dataset referred to as Books3, which consists of 196,640 books downloaded from the Bibliotik BitTorrent tracker.

We suspect OpenAI's books2 dataset might be "all of libgen", but no one knows. It's all pure conjecture.

Nonetheless, books3, released above, is "all of bibliotik", which I imagine will be of interest to anyone doing NLP work. Or anyone who wants to read 196,640 books. :)

— Shawn Presser (@theshawwn) October 25, 2020

If the latest litigation is allowed to move ahead, hundreds of American authors whose works are included in Books3 could potentially join the complaint against Nvidia.

In the meantime, a similar case brought by the Authors Guild against Microsoft and OpenAI will also explore AI developers’ use of copyright-protected works as training data.

Echoes of Authors Guild vs. OpenAI

Like the proposed Nvidia suit, the Authors Guild’s complaint centers on a database of works structured for LLM training. Specifically, the complaint highlights a dataset referred to in OpenAI’s research papers as Books2.

We have organized a lawsuit against OpenAI for copyright infringement of their works of fiction on behalf of a class of authors whose works have been used to train GPT. Plaintiffs incl John Grisham, Jodi Picoult, Victor LaValle, George R.R. Martin & more. https://t.co/1laaRbRCyC

— The Authors Guild (@AuthorsGuild) September 20, 2023

While the team behind Books3 has been transparent about its use of Bibliotik, OpenAI has never revealed the source of the texts included in Books2. But there has been speculation that they were pulled from illegal torrent platforms.

According to the Authors Guild lawsuit, “AI researchers suspect that Books2 contains or consists of ebook files downloaded from large pirate book repositories such as Library Genesis.”

Similar allegations appear in other complaints against OpenAI that are currently making their way through the courts.

Implications for AI

In a ruling that could have implications for Nvidia and other AI firms embroiled in copyright litigation, Judge Araceli Martínez-Olguín last month partially dismissed a lawsuit filed against OpenAI by a group of authors including Sarah Silverman.

The judge was especially skeptical of the authors’ financial claims, concluding that the alleged injuries were too speculative to warrant consideration.

Crucially, however, she declined to dismiss the central charge of copyright infringement.

For firms like OpenAI and Nvidia, AI copyright litigation currently making its way through the US court system could ultimately determine how they source training data going forward. For authors and publishers, on the other hand, it could set the terms of their relationship with an industry that has become one of the most influential forces in global business.
