Generative AI systems – such as ChatGPT, GitHub’s Copilot, and DALL-E – are becoming increasingly popular thanks to their applications in creating content, applying for jobs, and even eliminating recipe sagas. Trained on a vast body of copyright-protected data scraped from the web, these AI systems can create new works from users’ word prompts or “freestyle” based on images uploaded by users (like Lensa AI’s magic avatar feature does). However practical, innovative, and accessible, many of these AI projects may sometimes engage in copyright infringement at two levels: at the machine learning level (see our article on that) and at the output stage (if the AI’s creation is substantially similar to a copyright-protected piece of training data).

In the midst of legal uncertainty, it seems that most generative AI companies are willfully blind to the potential copyright infringement they may be committing or encouraging end users to commit. Most AI services’ terms and conditions attempt to shift the liability for any potential copyright infringement at the output stage onto the end user (see the Terms of Use for GitHub Copilot, for example), but would that stand in court? (Disclaimer: a defendant may successfully invoke the fair use doctrine in some copyright infringement cases. The factors outlined in our previous article can apply with necessary alterations to the infringing AI output data.)

Infringement in General

By way of background, under U.S. law, copyright infringement takes place if: (1) there is a valid copyright in the original work; and (2) there was unauthorized copying of the original work (meaning that at least one of the exclusive rights under copyright was violated). The copying component of the copyright infringement test is proven if: (1) there is evidence of factual copying; or (2) there is a “substantial similarity” between the original and the infringing work.

Factual copying could be proven by direct (rarely available) or circumstantial evidence. Circumstantial evidence may include proof of the AI’s access to the copyrighted work AND a “probative similarity” beyond independent creation between the original work and the AI’s output. A claimant in a copyright infringement case could obtain evidence that their copyrighted work was included in the machine training dataset. It may be readily available (there is a website that checks whether a popular text-image pair training dataset contains an image) or could be procured through court-ordered discovery. Absent evidence of access to the copyrighted work, a “striking similarity” is enough to prove copying.

The degree of similarity is a question of fact to be determined by the jury based on the evidence in the case, which may include expert evidence. In assessing the degree of similarity between the (allegedly) infringed and infringing works, courts consider whether the similar elements are unique, intricate, or unexpected; whether the two works contain the same errors; and whether there appear to have been blatant attempts to cover up the similarities. The existence of something closely resembling a particular claimant’s artist signature or a company’s watermark in an AI output could potentially be an example of such evidence. Courts can also use other criteria, such as “the total concept and feel,” which combines “objective” extrinsic and “subjective” intrinsic tests. All in all, the examination is factual and case-specific.

But Who is to Blame?

In general, under the doctrine of direct infringement, the actor committing copyright infringement is the one most proximately positioned to the cause of the infringing event. Secondary infringement, on the other hand, occurs when there is a direct infringer, but a second party induces, contributes to, encourages, or profits from the infringement. The latter type of infringement is rooted in case law and takes the forms of contributory and vicarious infringement: contributory infringement occurs when someone knows of the direct infringement and encourages, induces, causes, or materially contributes to it, while vicarious liability arises when someone has the authority and ability to control the direct infringer and directly profits from the infringement.

With most generative AI systems, end users do not make expressive choices but rather provide abstract verbal prompts, leaving the “creative” work to the AI. So, it appears the end user is unlikely to be the direct copyright infringer if the output is infringing.

Usually, the verbal prompts will take the form of ideas not subject to copyright protection (“Create a pop-art portrait of a blond actress”). On the other hand, users may either input requests that contain copyrightable material on which the output will be based (“combine these two actual paintings by Yayoi Kusama I uploaded”) or otherwise intentionally target a copyrighted work (“summarize Martin Luther King Jr.’s ‘I Have a Dream’ speech”). So, when the output turns out to be substantially similar to a copyrighted work or otherwise passes the copyright infringement threshold, the end user may or may not have caused the infringement and, thus, be liable.

In the case that the end user is directly liable, the AI company might be secondarily liable since: (1) it provided a product capable of producing infringing work; and (2) it benefits from the infringing activity (for example, if the service is subscription-based). Sometimes, however, AI may return infringing outputs even when the end user could not reasonably expect it. In such a case, the AI company would be the only actor capable of exercising control over the infringing AI system, since it conducted the machine learning and chose/built the datasets. Consequently, the AI company would likely be the direct infringer.

Enforceability of Liability-Shifting Terms of Use

Apparently aware of the AI’s ability to produce output containing recognizable portions of training data used at the ingestion phase, many generative AI services include provisions in their terms and conditions disclaiming the companies’ copyright ownership of the output data and shifting the risk of liability for any infringement onto the end user of the AI. But is that enforceable in court?

Generally, contract provisions shifting the risk of civil liability (i.e., exculpatory and indemnification clauses) are commonplace and enforceable – provided they are not ambiguous in scope and do not violate public policy. Exculpatory clauses are contract provisions that generally absolve one party (the AI company in this case) from claims by another party (the end user). In contrast, indemnification clauses obligate one party (the end user) to compensate another party (the AI company) for third-party claims (here, those would be the claims of the training data copyright owners).

As a public policy consideration, most states will not enforce an exculpatory/indemnification clause absolving a party from the results of its own gross negligence or willful misconduct. Consequently, shifting copyright infringement liability to end users in AI companies’ terms of use may not always be a get-out-of-jail-free card for the companies. A fact-specific analysis will need to examine whether the AI company willfully engaged in practices that led to copyright infringement and whether it could and should have technically prevented that.

Generative AI is a powerful tool helping human creators cut down on content-generating costs, save time for more complex work, and conceive of new ideas. If AI creates something closely resembling a copyrighted piece of data it trained on, absent fair use, the copyright holder has a case against the person or entity who caused the infringing act. 

Certain factors come into play when determining who is liable for AI output that infringes on a copyrighted work. End users are not precluded from liability if they have the power to cause the AI output to resemble a copyrighted work, but AI companies are the most likely to be liable because they are best positioned to design systems with quality controls that would or would not enable an AI to infringe. Terms of use shifting the legal risks to the end user may be enforceable. Still, the legal analysis will need to determine whether the AI company engaged in willful misconduct or gross negligence, and the outcome will depend on the applicable state law.

The intersections of generative AI and copyright are exciting new domains with a potential for policymaking. In the meantime, we must proceed cautiously and mitigate the legal risks for AI companies and end users.

Diana Bikbaeva is a tech and intellectual property law attorney. 

Artificial intelligence (“AI”) is still a relatively new, although rapidly evolving, technology, and some of its legal implications (especially in copyright law) remain a gray area, creating uncertainty about its use and development. AI and machine learning technology are not one-size-fits-all; they have diverse structures and algorithms specific to the tasks they are programmed to solve. So, any discussion of the legal implications of machine learning and the resulting artificial intelligence needs to avoid sweeping conclusions about the technology in general and should consider the underlying technology and its treatment of copyrighted materials on a case-by-case basis.

AI is a computer system designed to make predictions or decisions (almost) independently from a human programmer. It carries out a variety of functions, and some of its most cutting-edge types generate art, computer code suggestions, and even music (“generative AI”). To make the predictions, AI must go through machine learning, which involves processing incredible amounts of input training data to identify patterns. The more training data is input into the datasets, the more precise and valuable the output data. To give you an idea of how vast a training dataset can get, LAION-5B consists of 5.85 billion image-text pairs (this dataset is used by Stable Diffusion and Lensa AI, for example).

Often, those massive datasets contain copyrighted materials – photos, paintings, books, or computer source code. Even more often, copyright owners have no idea about (let alone consent to) the use of their material in machine learning. No longer rare are the cases of human artists or programmers discovering that someone used their works to train AI that produces output containing recognizable portions of their work. 

Copyright Implications of Machine Learning

In general, copyright grants its holder a bundle of exclusive rights to the copyrighted material, including the rights to: make copies of the work; prepare derivative works (create new matter based on the original copyrighted work); distribute copies of the work to the public; and perform or display the work publicly. Machine learning is most likely to implicate the first two: the right to make copies of the work and the right to create derivative works. Since creating or using a dataset often technically involves making copies of the copyrighted material, it may implicate the “reproduction of copies” aspect of the copyright. If the output data produced by the AI closely resembles one or several of the copyrighted materials in the training dataset (by incorporating them in some concrete form), that implicates the right to create derivative works. Unless there is an applicable exception (such as the “fair use” doctrine), those are acts of copyright infringement.

No case law in the US has directly addressed the use of copyrighted materials in machine learning/copyright implications of artificial intelligence yet.

The recently filed GitHub Copilot lawsuit signals that creating datasets consisting of open-source code for machine learning may violate the licenses accompanying that open-source code (which are enforceable legal rules for using copyrighted material). Most open-source licenses require that programmers who use the open-source code in creating their product attribute the authors of the underlying code and share the resulting code with the public for free. These principles are essential for the development of the open-source community and the state of the software art.

GitHub and Microsoft recently filed a motion to dismiss, arguing, in part, that the plaintiffs opted not to make a copyright infringement claim, which is “doubtless an attempt to evade […] the progress-protective doctrine of fair use.”

Fair Use & its Criteria

The purpose of the fair use doctrine is to balance the protections that copyright grants its owners with the greater social good and promote creativity, education, and free speech. Fair use is an exception from copyright allowing the use of copyrighted materials without the owner’s consent for criticism, comment, news reporting, teaching, scholarship, or research. Fair use is a mixed question of law and fact, which means that the finding of whether something constitutes fair use is case-specific. There are no areas where fair use is presumed. Procedurally, fair use is an affirmative defense (meaning that a defendant in a copyright infringement suit has the right to invoke it). The burden of proof of fair use is on the defendant.

In deciding fair use cases, courts must consider the following factors having equal weight: (1) The purpose and character of the use, including whether it is commercial, transformative, and non-expressive; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

Purpose and Character of Use

Commercial use, as opposed to not-for-profit, weighs against the finding of fair use. Courts presume commercial use if the purported infringer profits from exploiting the copyrighted material without paying the customary price to copyright owners. Hypothetically, this can occur if the AI owners charge end users money, host ads on the AI website/app, or otherwise profit from the AI (for example, by collecting and selling user data).

On the other hand, transformative use favors fair use. Use is transformative if it transforms the original work in some way, altering the original with new expression, meaning, or message. Transformative use may occur if the use has a different purpose than the original work or constitutes copying for analysis or reverse engineering (“intermediate” copying). For example, it was fair use to copy a competitor’s computer program code to understand its unprotected functional elements and ensure compatibility of the defendant’s new program with the competitor’s gaming console (Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992)). An important caveat is that the defendant did not use the creative elements of the competitor’s code in its own code for the new program.

Transformative use occurs if its result and the original copyrighted work serve different market functions (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994) 591). A transformative use offers something new and different from the original or expands its utility, thus serving copyright’s overall objective of contributing to public knowledge (Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015) 214). Search engines’ production of thumbnails or snippets of copyrighted books or images was transformative fair use because it served another function than the underlying creative content. For example, by providing image thumbnails in the image search feature, search engines did not engage in artistic expression, which is the prerogative of the underlying copyrighted content, but rather improved access to information on the Internet (Kelly v. Arriba Soft Corp., 336 F.3d 811 (9th Cir. 2003) 819).

In cases where the end goal of machine learning is new functionality, the use is likely transformative. Some examples could be using the learned ability to recognize faces or types of objects in the pictures for purposes other than generating art. Another example could be learning to understand text to find grammatical mistakes. 

OpenAI, a leading AI research company, contends that including copyrighted material in datasets for machine learning is fair use because it is “non-expressive intermediate copying.” According to OpenAI, the purpose and character of the use are transformative: unlike the original works’ “human entertainment” purpose, machine training has the purpose of learning “patterns inherent in human-generated media.”

Although AI is undoubtedly a valuable technology, we are not convinced that all machine learning on copyrighted data is inherently transformative for fair use purposes. Accordingly, AI learning to create material that serves the same (aesthetic/expressive/functional) purpose as the training data is likely not transformative. Importantly, it appears that all the functional transformation in most generative AI stays “inside” of it and goes unnoticed by the end user (the art goes in, and the art goes out). The end user does not use such AIs to learn the unprotected technical/factual information about the copyrighted training data. Instead, the end user employs such AIs to produce content (AI art, computer code, prose, music).

Amount and Substantiality of Portion Used in Relation to Full Copyrighted Work 

For there to be a finding of fair use, the amount and substantiality of the portion used in relation to the copyrighted work as a whole should be reasonable in relation to the copying’s purpose. It is detrimental to the finding of fair use if the defendant used so much of the original copyrighted work as to make a “competing substitute” available to the public (Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015) 214). An important factor is not just whether a lot was copied from a copyrighted work but whether much of the resulting product consists only of the copied material (Campbell v. Acuff-Rose Music, Inc.).

In the case of machine learning, the training datasets essentially contain complete works. On the surface, this may count against the finding of fair use. However, copying full corpora of copyrighted works into the training datasets is likely reasonable in relation to the purpose of machine learning, which requires the analysis of whole works to learn the targeted patterns. A crucial factual factor to consider is the amount of a single piece of training data made available verbatim to the end users (we will explore the copyright implications of generative AIs outputting recognizable portions of copyrighted data in our next article).

Effect of Use Upon Potential Market for or Value of Copyrighted Work

An essential prong in finding fair use is the effect of the use on (1) the potential market for or (2) the value of the copyrighted work and its derivative works. This means that fair use is unlikely if the use seeks to “substitute” the original work and compete for its market. The analysis examines whether the use harms the copyright owner’s ability to sell or license their work.

The goal is to strike a balance between the benefit gained by the copyright owner when the copying is found to be an “unfair use” and the benefit gained by the public when the use is held to be fair.

With some companies already outsourcing their digital art needs to AI, it is not merely a hypothesis that generative AI will compete with human creators. With said competition, products of generative AI may negatively affect the value of the copyrighted material on which it is trained. The factual inquiry will thus be into the relationship between the output data (does training on a text result in a new text or in a function of correcting grammatical mistakes?) and the copyrighted training data.

Market consequences of specific AI may be less obvious but more disruptive. The principal author of the GitHub class action complaint, Matthew Butterick, argues that a code-writing AI will potentially “starve” the open-source communities. According to him, it will remove the incentive for developers to discover and contribute to “traditional open-source communities” that made the creation and constant development of open-source code possible, stifling the growth and development of open-source software. It is unclear whether the potentially reduced demand for the open-source community will be enough for the finding of a negative market effect. However, it does seem to go against the ultimate goal of copyright – to expand public knowledge and understanding by giving potential creators exclusive control over copying of their works, thus giving them a financial incentive to create informative, intellectually enriching works for public consumption.


Did we leave you with more questions than you had before? We hope so. Hopefully, you also feel more confident in understanding the legal risks associated with generative AI and are better positioned to mitigate them. (Un)fortunately for all stakeholders, there may not be a definitive answer as to whether machine learning on copyrighted material is generally fair use. Courts will need to consider the factors outlined in this article in deciding each individual case.

Although it may not absolve AI developers from copyright infringement claims, developers of generative AIs could consider structuring the machine learning process not to make tangible copies of the training data and to analyze non-expressive structural elements (pixels, parts of speech) directly from the source. Datasets could be designed to contain links to the training data rather than reproductions of the copyrighted material. Finally, the output should not include recognizable portions of the expressive/creative elements of the training data, so as not to infringe the right to prepare derivative works.
