
Artificial intelligence (“AI”) is still a relatively new, though rapidly evolving, technology, and some of its legal implications (especially in copyright law) remain a gray area, creating uncertainty about its use and development. AI and machine learning technologies are not one-size-fits-all: they have diverse structures and algorithms specific to the tasks they are programmed to solve. Any discussion of the legal implications of machine learning and the resulting artificial intelligence should therefore avoid sweeping conclusions about the technology in general and should consider the underlying technology and its treatment of copyrighted materials on a case-by-case basis.

AI is a computer system designed to make predictions or decisions (almost) independently from a human programmer. It carries out a variety of functions, and some of its most cutting-edge types generate art, computer code suggestions, and even music (“generative AI”). To make the predictions, AI must go through machine learning, which involves processing incredible amounts of input training data to identify patterns. The more training data is input into the datasets, the more precise and valuable the output data. To give you an idea of how vast a training dataset can get, LAION-5B consists of 5.85 billion image-text pairs (this dataset is used by Stable Diffusion and Lensa AI, for example).

Often, those massive datasets contain copyrighted materials – photos, paintings, books, or computer source code. Even more often, copyright owners have no idea about (let alone consent to) the use of their material in machine learning. No longer rare are the cases of human artists or programmers discovering that someone used their works to train AI that produces output containing recognizable portions of their work. 

Copyright Implications of Machine Learning

In general, copyright grants its holder six exclusive rights in the copyrighted material, including the rights to make copies of the work; prepare derivative works (create new matter based on the original copyrighted work); distribute copies of the work to the public; and perform or display the work publicly. Machine learning is most likely to implicate the first two: the right to make copies of the work and the right to create derivative works. Since creating or using a dataset often technically involves making copies of the copyrighted material, it may implicate the “reproduction of copies” aspect of copyright. If the output data produced by the AI closely resembles one or several of the copyrighted materials in the training dataset (by incorporating them in some concrete form), that implicates the right to create derivative works. Unless an exception applies (such as the “fair use” doctrine), those are acts of copyright infringement.

No US case law has yet directly addressed the use of copyrighted materials in machine learning or the copyright implications of artificial intelligence.

The recently filed GitHub Copilot lawsuit warns that creating datasets consisting of open-source code for machine learning may violate the licenses accompanying that open-source code (which are enforceable legal rules for using copyrighted material). Most open-source code licenses require that programmers who use the open-source code in creating their product credit the authors of the underlying code and share the resulting code with the public for free. These principles are essential for the development of the open-source community and the state of the software art.

GitHub and Microsoft recently filed a motion to dismiss, arguing, in part, that the plaintiffs chose not to make a copyright infringement claim, which is “doubtless an attempt to evade […] the progress-protective doctrine of fair use.”

Fair Use & its Criteria

The purpose of the fair use doctrine is to balance the protections that copyright grants its owners with the greater social good and to promote creativity, education, and free speech. Fair use is an exception to copyright that allows the use of copyrighted materials without the owner’s consent for purposes such as criticism, comment, news reporting, teaching, scholarship, or research. Fair use is a mixed question of law and fact, which means that the finding of whether something constitutes fair use is case-specific. There are no areas where fair use is presumed. Procedurally, fair use is an affirmative defense (meaning that a defendant in a copyright infringement suit has the right to invoke it). The burden of proving fair use is on the defendant.

In deciding fair use cases, courts must consider the following factors, each carrying equal weight: (1) the purpose and character of the use, including whether it is commercial, transformative, and non-expressive; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

Purpose and Character of Use

Commercial use, as opposed to not-for-profit, weighs against the finding of fair use. Courts presume commercial use if the purported infringer profits from exploiting the copyrighted material without paying the customary price to copyright owners. Hypothetically, this can occur if the AI owners charge end users money, host ads on the AI website/app, or otherwise profit from the AI (for example, by collecting and selling user data).

On the other hand, transformative use favors fair use. Use is transformative if it transforms the original work in some way, altering the original with new expression, meaning, or message. Transformative use may occur if the use has a different purpose than the original work or constitutes copying for analysis or reverse engineering (“intermediate” copying). For example, it was fair use to copy a competitor’s computer program code to understand its unprotected functional elements and ensure compatibility of the defendant’s new program with the competitor’s gaming console (Sega Enterprises Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992)). An important caveat is that the defendant did not use the creative elements of the competitor’s code in its code for the new program.

Use is also transformative if its result and the original copyrighted work serve different market functions (Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 591 (1994)). A transformative use offers something new and different from the original or expands its utility, thus serving copyright’s overall objective of contributing to public knowledge (Authors Guild v. Google, Inc., 804 F.3d 202, 214 (2d Cir. 2015)). Search engines’ production of thumbnails or snippets of copyrighted books or images was transformative fair use because it served a different function than the underlying creative content. For example, by providing image thumbnails in the image search feature, search engines did not engage in artistic expression, which is the prerogative of the underlying copyrighted content, but rather improved access to information on the Internet (Kelly v. Arriba Soft Corp., 336 F.3d 811, 819 (9th Cir. 2003)).

In cases where the end goal of machine learning is new functionality, the use is likely transformative. Some examples could be using the learned ability to recognize faces or types of objects in the pictures for purposes other than generating art. Another example could be learning to understand text to find grammatical mistakes. 

OpenAI, a leading AI research company, contends that including copyrighted material in datasets for machine learning is fair use because it is “non-expressive intermediate copying.” According to OpenAI, the purpose and character of the use are transformative. Unlike the original works’ “human entertainment” purpose, machine training has the purpose of learning “patterns inherent in human-generated media.”

Although AI is undoubtedly a valuable technology, we are not convinced that any machine learning on copyrighted data is inherently transformative for fair use purposes. In particular, AI that learns to create material serving the same (aesthetic/expressive/functional) purpose as the training data is likely not transformative. Importantly, it appears that all the functional transformation in most generative AI stays “inside” of it and goes unnoticed by the end user (the art goes in, and the art goes out). The end user does not use such AIs to learn the unprotected technical/factual information about the copyrighted training data. Instead, the end user employs such AIs to produce content (AI art, computer code, prose, music).

Amount and Substantiality of Portion Used in Relation to Full Copyrighted Work 

For there to be a finding of fair use, the amount and substantiality of the portion used in relation to the copyrighted work as a whole should be reasonable in relation to the copying’s purpose. It weighs against a finding of fair use if the defendant used so much of the original copyrighted work that it effectively made a “competing substitute” available to the public (Authors Guild v. Google, Inc., 804 F.3d 202, 214 (2d Cir. 2015)). An important factor is not just whether a lot was copied from a copyrighted work but whether much of the resulting product consists only of the copied material (Campbell v. Acuff-Rose Music, Inc.).

In the case of machine learning, the training datasets essentially contain complete works. On the surface, this may count against the finding of fair use. However, copying full corpora of copyrighted works into the training datasets is likely reasonable in relation to the purpose of machine learning, which requires the analysis of whole works to learn the targeted patterns. A crucial factual factor to consider is the amount of a single piece of training data made available verbatim to the end users (we will explore the copyright implications of generative AIs outputting recognizable portions of copyrighted data in our next article).
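As a purely illustrative sketch of that factual question, the Python snippet below measures the longest passage that a generated output shares verbatim with a single training text. The function names and the character-level comparison are assumptions made for this example; this is not a legal test, and no court has endorsed any particular metric.

```python
from difflib import SequenceMatcher


def longest_verbatim_overlap(training_text: str, output_text: str) -> str:
    """Return the longest run of characters that appears in both texts."""
    # autojunk=False disables difflib's heuristic that ignores very common
    # characters in long inputs, so the comparison stays exact.
    matcher = SequenceMatcher(None, training_text, output_text, autojunk=False)
    match = matcher.find_longest_match(0, len(training_text), 0, len(output_text))
    return training_text[match.a:match.a + match.size]


def verbatim_share(training_text: str, output_text: str) -> float:
    """Share of the training text covered by the longest verbatim overlap."""
    if not training_text:
        return 0.0
    overlap = longest_verbatim_overlap(training_text, output_text)
    return len(overlap) / len(training_text)
```

A high share for one work would only flag that output for closer review. It says nothing about whether the shared passage is protected expression.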

Effect of Use Upon Potential Market for or Value of Copyrighted Work

An essential prong in finding fair use is the effect of the use on (1) the potential market for or (2) the value of the copyrighted work and its derivative works. It means that fair use is unlikely if the use seeks to “substitute” the original work and compete for its market. The analysis examines whether the use harms the copyright owner’s ability to sell or license their work.

The goal is to strike a balance between the benefit gained by the copyright owner when the copying is found to be an “unfair use” and the benefit gained by the public when the use is held to be fair.

With some companies already outsourcing their digital art needs to AI, it is not merely a hypothesis that generative AI will compete with human creators. Through that competition, products of generative AI may negatively affect the value of the copyrighted material on which the AI is trained. The factual inquiry will thus be into the relationship between the output data (does training on a text result in a new text or in a function of correcting grammatical mistakes?) and the copyrighted training data.

Market consequences of specific AI may be less obvious but more disruptive. The principal author of the GitHub class action complaint, Matthew Butterick, argues that a code-writing AI will potentially “starve” the open-source communities. According to him, it will remove the incentive for developers to discover and contribute to “traditional open-source communities” that made the creation and constant development of the open-source code possible, stifling the growth and development of open-source software. It is unclear whether the potentially reduced demand for the open-source community will be enough for the finding of a negative market effect. However, it does seem to go against the ultimate goal of copyright – to expand public knowledge and understanding by giving potential creators exclusive control over copying of their works, thus giving them a financial incentive to create informative, intellectually enriching works for public consumption.


Did we leave you with more questions than you had before? We hope so. Hopefully, you also feel more confident in understanding the legal risks associated with generative AI and are better positioned to mitigate them. (Un)fortunately for all stakeholders, there may not be a definitive answer as to whether machine learning on copyrighted material is generally fair use. Courts will need to consider the factors outlined in this article in deciding each individual case.

Although it may not absolve AI developers from copyright infringement claims, developers of generative AIs could consider structuring the machine learning process so that it does not make tangible copies of training data and instead analyzes the non-expressive structural elements (pixels, parts of speech) directly from the source. Datasets could be designed to contain links to the training data rather than reproductions of the copyrighted material. Finally, the output should not include recognizable portions of the expressive/creative elements of the training data, so as to avoid infringing on the right to produce derivative works.
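To make the link-based dataset idea more concrete, here is a minimal Python sketch of a manifest that records only where each work lives, plus a caption and license note, without storing the work itself. The `DatasetEntry` fields, the function names, and the one-JSON-object-per-line layout are hypothetical choices for illustration. Nothing here settles whether such a design avoids infringement.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DatasetEntry:
    """One training example described by reference, not by copy."""
    url: str       # where the work is hosted (hypothetical field)
    caption: str   # descriptive text paired with the work
    license: str   # license label recorded for later review


def to_manifest(entries: Iterable[DatasetEntry]) -> str:
    """Serialize entries as one JSON object per line; no work content is stored."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in entries)


def from_manifest(text: str) -> List[DatasetEntry]:
    """Rebuild entries from a manifest produced by to_manifest."""
    return [DatasetEntry(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping a license label next to each link also makes it easier to filter out works whose terms do not permit the intended use.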

Diana Bikbaeva is a tech and intellectual property law attorney.