Image: Unsplash

OpenAI Named in Copyright Lawsuit by Authors Over Use of Books as Training Data

July 3, 2023 - By TFL

Two authors are the latest to file suit against ChatGPT developer OpenAI, joining a swiftly growing list of plaintiffs that have filed lawsuits against the tech giant. According to the complaint that they filed with a federal court in Northern California on June 28, Paul Tremblay and Mona Awad (the “plaintiffs”) assert that in training the large language model that powers its generative artificial intelligence (“AI”) chatbot ChatGPT, OpenAI has made use of large amounts of data, including the text of books that they authored, without their authorization, thereby engaging in direct copyright infringement, violations of the Digital Millennium Copyright Act, and unfair competition.

Setting the stage in their proposed class action complaint against OpenAI, Inc. and a handful of affiliated entities, Tremblay, who authored “The Cabin at the End of the World,” and Awad, who authored “13 Ways of Looking at a Fat Girl” and “Bunny,” assert that ChatGPT consists of “a large language model [that] is ‘trained’ by copying massive amounts of text and extracting expressive information from it.” As a result, ChatGPT can “emit convincingly naturalistic text outputs in response to user prompts.”

While “many kinds of material have been used to train large language models,” such as ChatGPT, the plaintiffs argue that books “have always been a key ingredient in training datasets for large language models because books offer the best examples of high-quality longform writing.” (They cite a June 2018 paper introducing GPT-1, in which OpenAI revealed that it trained GPT-1 on “over 7,000 unique unpublished books,” and claim that the company has since expanded the training materials to include an additional 350,000+ books.)

Against this background and despite “not consenting to the use of their copyrighted books as training material for ChatGPT,” Tremblay and Awad allege that the text of their books has been “ingested and used to train ChatGPT.” In fact, they allege that “when ChatGPT was prompted to summarize books written by [them], it generated very accurate summaries,” something that Tremblay and Awad claim is “only possible if ChatGPT was trained on [their] copyrighted works.” Since the data that is used to train ChatGPT is “copied by OpenAI without consent, without credit, and without compensation,” the plaintiffs argue that OpenAI “benefit[s] commercially and profit[s] richly from the use of [others’] copyrighted materials.”

TLDR: The plaintiffs allege that OpenAI “knowingly designed ChatGPT to output portions or summaries of [their] copyrighted works without attribution,” and the company “unfairly profit[s] from and take[s] credit for developing a commercial product based on unattributed reproductions of those stolen writing and ideas.”

With the foregoing in mind, the plaintiffs set out claims of direct and vicarious copyright infringement, and violations of section 1202(b) of the Digital Millennium Copyright Act (“DMCA”), arguing on the latter front that “OpenAI copied [the] infringed works and used them as training data for the OpenAI language models.” By design, “the training process does not preserve any copyright management information (‘CMI’) in each of the plaintiffs’ infringed works, including the copyright notice, title, and other identifying information; the name or other identifying information about the owners of each book; terms and conditions of use; and identifying numbers or symbols referring to CMI.” As such, they claim that OpenAI “intentionally removed CMI from the infringed works” in violation of the DMCA.

The plaintiffs further claim that OpenAI is liable for unjust enrichment, violations of California statutory and common law unfair competition law, and negligence.

In addition to requesting that the court certify their proposed class action, the plaintiffs are seeking monetary damages and injunctive relief, including but not limited to “changes to ChatGPT to ensure that all applicable information set forth in [the DMCA] is included when appropriate.”

The case is Tremblay v. OpenAI, Inc., 3:23-cv-03223 (N.D. Cal.).