
Anthropic Lawsuit: AI Author Tim Boucher Hits Back at ‘Mischaracterization’ of Work

Last Updated September 2, 2024 5:54 PM
James Morales

Key Takeaways

  • Three authors have brought a class action lawsuit against Anthropic accusing it of “stealing hundreds of thousands of copyrighted books”.
  • As evidence of the alleged copyright infringement, the complaint cites the case of Tim Boucher, who used Claude to help author 97 books in less than a year.
  • However, Boucher has hit back at what he says is a mischaracterization of his work.

In a class action lawsuit against Anthropic, filed on Aug. 19, plaintiffs Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson (Bartz et al) accused the AI developer of “stealing hundreds of thousands of copyrighted books” compiled in an AI training dataset known as “The Pile”.

As evidence of the alleged copyright infringement, the complaint cites the case of Tim Boucher, who it observes used Claude and OpenAI’s ChatGPT to write 97 books in less than a year. However, in the latest development, Boucher has called the plaintiffs out for “mischaracterization” of his work.

Tim Boucher Objects to Characterization of AI-Authored Books

Boucher’s objections relate to two paragraphs in Bartz et al’s lawsuit.

The first, paragraph 51, makes a general statement that Amazon has been flooded with AI-generated “copycats,” “rip-offs,” and “garbage books.” The second, paragraph 52, explicitly refers to Boucher’s work, stating: “Claude could not generate this kind of long-form content if it were not trained on a large quantity of books, books for which Anthropic paid authors nothing.”

While Boucher expressed some sympathy for Bartz et al’s cause, in a letter to the judge overseeing the case, he objected to the plaintiffs using his books as an illustration of AI plagiarism:

“The books I produce involve a significant amount of original writing, creativity, collaboration, artistic direction, editorial, and curatorial work. While I sometimes use AI tools to generate sections, passages, sentences, or phrases, I typically rewrite or substantially edit these AI-generated outputs. This constitutes creative and intellectual labor that is absolutely ‘real.'”

In a separate communication appealing to the plaintiffs directly, Boucher lamented that the complaint “mischaracteriz[es] my work in a way that has led to real reputational harm,” including causing a major media outlet to label him a fraudster.

Boucher has requested that Bartz et al remove all references to his works, a collection of graphic novels he created using Claude, ChatGPT, and Midjourney, from their lawsuit.

Anthropic vs Bartz et al Explained

At the heart of Bartz et al’s legal challenge lies an 825GB corpus of English-language texts known as “The Pile”.

Because it is compiled from a collection of pirated books downloaded from torrent websites, the class action lawsuit argues that Anthropic’s use of The Pile amounts to intellectual property theft. Nor is Anthropic the only AI developer under fire for using the controversial dataset.

In separate litigation brought by Brian Keene, Abdi Nazemian and Stewart O’Nan, Nvidia has been accused of using The Pile to train its NeMo model.

Meanwhile, a lawsuit brought by the Authors Guild against Microsoft and OpenAI alleges that a similar dataset known as Books2 was used to train OpenAI’s GPT models.

Alongside less contested data sources, The Pile contains a subset known as Books3, which comprises 196,640 books downloaded from the Bibliotik BitTorrent tracker. It is these that the plaintiff authors take issue with.

“It is apparent that Anthropic downloaded and reproduced copies of The Pile and Books3, knowing that these datasets were comprised of a trove of copyrighted content sourced from pirate websites like Bibiliotik,” Bartz et al’s lawsuit claims.

Anthropic Scientists Admit to Using Pirated Content

In a 2021 research paper, Anthropic scientists acknowledged using the controversial dataset. The paper, which describes work that would culminate in the development of Claude, admits that “the training dataset is composed of […] 32% internet books […] most of which we sourced from The Pile.”

The latest legal challenge creates a fresh headache for Anthropic, which is also being sued by a group of large American record labels who argue that using copyrighted recordings as training data amounts to copyright infringement.

Anthropic vs. Music Publishers

In the initial lawsuit, filed in Oct. 2023, plaintiffs including Universal, Concorde, and other major music publishers (Concorde et al) alleged that Anthropic’s use of music lyrics to train its AI models without the intellectual property (IP) holders’ consent amounts to copyright infringement.

As the complaint argued:

“Although the AI technology involved in this case may be complex and cutting-edge, the legal issues presented here are straightforward and long-standing. A defendant cannot reproduce, distribute, and display someone else’s copyrighted works to build its own business unless it secures permission from the rightsholder.”

The case rests on the plaintiffs’ charge that: “As a result of Anthropic’s mass copying and ingestion of Publishers’ song lyrics, Anthropic’s AI models generate identical or nearly identical copies of those lyrics, in clear violation of Publishers’ copyrights.”

But in a recent motion to dismiss, the AI developer denied this allegation.

Copyright Protection and AI

At the crux of many ongoing AI copyright disputes is an important and as-yet unanswered legal question: does training AI models itself infringe upon IP owners’ copyrights? Or does the resultant AI model need to violate copyright protections for a breach to have occurred?

AI developers like Anthropic have typically emphasized the latter interpretation. They argue that training counts as fair use and that a violation can only occur if models are found to distribute copyrighted materials.

Preempting this argument, Concorde et al cited evidence that they were able to prompt Anthropic’s Claude to provide copyright-protected song lyrics.

However, Anthropic’s motion to dismiss counterargues that “the Complaint does not identify any instances of ordinary Claude users inducing this alleged behavior.”

Publishing Industry Parallels

The disagreement between Anthropic and Concorde et al mirrors a similar one in the case of New York Times vs. OpenAI. In that instance, the Times alleged that ChatGPT generated “near-verbatim” excerpts from its articles. Responding to the accusation, OpenAI claimed the publisher “intentionally manipulated prompts” in a way no ordinary user ever would.

While he hasn’t publicly commented on specific lawsuits, Anthropic CEO Dario Amodei has previously outlined his belief that AI training counts as fair use as long as models don’t regurgitate copyright-protected content.

“I think everyone agrees the models shouldn’t be verbatim outputting copyrighted content,” he said in an interview earlier this year. “For things that are available on the web, […] we don’t think it’s just hoovering up content and spitting it out, or it shouldn’t be spitting it out.”

For Concorde et al, Claude’s alleged distribution of song lyrics forms an important pillar of their case. But for the latest class action, Anthropic crossed a line the moment it accessed IP originating from a torrent site that is widely regarded as illegal.
