Image: Copilot

Microsoft, GitHub, and OpenAI are on the receiving end of an interesting new lawsuit, with a couple of plaintiffs accusing them of running afoul of the Digital Millennium Copyright Act (“DMCA”), and also engaging in breach of contract, tortious interference, fraud, false designation, unjust enrichment, and unfair competition in connection with Copilot, a subscription-based AI tool co-developed by GitHub and OpenAI. At the heart of the plaintiffs’ suit: Their claim that the defendants used their copyright-protected source code as training data for Copilot, which enables software developers to easily generate code by “turning natural language prompts into coding suggestions across dozens of languages.”

According to the plaintiffs, open-source code repository GitHub, OpenAI (which created the GPT-3 language model used to create Copilot), and GitHub owner and OpenAI investor Microsoft (collectively, the “defendants”) used data that they sourced from publicly accessible repositories on GitHub to train Copilot. The plaintiffs assert that they “posted such code or other works under certain open-source licenses on GitHub,” and that all of those licenses require attribution of the author’s name and copyright. The problem, they claim, is that in using their code, the defendants stripped the “attribution, copyright notice, and license terms from their code in violation of the licenses and the plaintiffs’ and the class’s rights.”

Now, the plaintiffs argue that Microsoft and co. are using Copilot to “distribute the now-anonymized code to Copilot users as if it were created by Copilot.” 

In addition to allegedly violating the attribution requirements of the open-source licenses, the plaintiffs contend that, among other things, the defendants breached GitHub’s own terms of service and privacy policies (giving rise to contract claims), violated DMCA § 1202, which forbids the removal of copyright-management information, and ran afoul of the California Consumer Privacy Act. They are seeking certification of the proposed class action case, injunctive relief, and damages, stating that their estimates of statutory damages for the defendants’ “direct violations of DMCA Section 1202, alone, will exceed $9 billion.” (That figure represents “minimum statutory damages ($2,500) incurred three times for each of the 1.2 million Copilot users Microsoft reported in June 2022,” the plaintiffs assert.)
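As a quick sanity check, the plaintiffs’ damages figure follows directly from the numbers they cite; a minimal sketch of that arithmetic (using only the per-violation minimum, the three-violations-per-user allegation, and the June 2022 user count stated in the complaint):

```python
# Reproduce the plaintiffs' statutory-damages estimate from the figures they cite.
MIN_STATUTORY_DAMAGES = 2_500   # minimum DMCA Section 1202 award, per violation
VIOLATIONS_PER_USER = 3         # plaintiffs allege three violations per user
COPILOT_USERS = 1_200_000       # Copilot users Microsoft reported in June 2022

total = MIN_STATUTORY_DAMAGES * VIOLATIONS_PER_USER * COPILOT_USERS
print(f"${total:,}")  # $9,000,000,000
```

The product comes to $9 billion, which is why the estimate is framed as a floor (“will exceed”): it assumes only the statutory minimum per violation and counts no other claims.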

Reflecting on the newly initiated lawsuit, Bristows LLP’s Toby Headdon, Anneke Pol, and Toby Crick stated in a note that this appears to be “the first U.S. class action challenging the training and output of an AI model and is therefore likely to be keenly watched by developers, data scientists, and lawyers in the field.” 

They state that the case is likely to involve a number of key issues, including whether “copyrights in the software hosted in GitHub [have] been infringed.” In that respect, they claim that the key inquiries are: (1) Has code been copied from GitHub in order to be used as training data for, or incorporated in, Copilot/Codex? (2) Are Copilot coding suggestions derivative works of GitHub code, or do they render such derivative works as outputs? And (3) is the use made of code hosted in GitHub permitted under the U.S. doctrine of fair use – for example, because it is transformative and used for a different purpose, namely, to train Copilot/Codex to generate code? 

Beyond that, a critical issue will be whether “the open-source license terms included with the code hosted in GitHub prevent the use made of the code by Microsoft/OpenAI and GitHub.” For example, they contend: “Do the terms preclude use of the code (i) as training data or (ii) which results in it being incorporated verbatim in proprietary code such as Copilot/Codex?” And “have those license terms otherwise been breached – for example, by a failure to attribute the author’s name and copyright in a manner required by the open-source licenses?”

The case is J. DOE 3, et al., v. GitHub, Inc. et al., 3:22-cv-07074 (N.D. Cal.).