Late on Friday, the U.S. Copyright Office issued its long-awaited report on AI training and fair use. It’s labeled “Pre-publication Version,” though it says no substantive changes will be made. UPDATE on May 8: President Trump reportedly fired Shira Perlmutter, the Register of Copyrights, which now casts a cloud of uncertainty over whether this report ever becomes official. It might not. It could be DOA.
Wow, this report will create a firestorm in the AI copyright lawsuits. The Copyright Office has taken sides on all four factors of fair use. On balance, the Office’s view seems to favor the copyright holders in the AI litigation, notwithstanding the Office’s agreement that AI training often serves a transformative purpose (especially for large AI models that require large and diverse datasets for training) under Factor 1 of fair use.
Perhaps the biggest surprise and most controversial aspect: the Copyright Office breaks new ground in admittedly “uncharted territory” and agrees with a new theory of market dilution advanced by copyright stakeholders under Factor 4, the effect of the use upon the potential market for or value of the copyrighted work. It’s a new (and untested) theory in the sense that it has yet to be recognized by any court in a copyright case. As discussed below, the Office’s example of AI-generated books flooding the market for the entire genre of romance novels as a cognizable market harm under Factor 4–even without proof of any infringing output or of any AI-generated book that actually infringes a plaintiff’s romance novel–shows just how expansive the theory is. It’s open to question whether the Copyright Office should be opining on a new theory of market harm that no court has recognized yet.
Granted, the theory sounds similar to the concern of market “obliteration” that Judge Chhabria voiced at the hearing in Kadrey v. Meta, although he also questioned whether the evidence in the summary judgment record supported it. (At the hearing, Judge Chhabria also said he was inclined to view AI training as transformative in purpose, even highly transformative. He added that Factor 4 is the most important factor in the AI lawsuits–and he may be right.) But the difference between the court and the Copyright Office is that Judge Chhabria has actual evidence submitted by the parties, briefs from both sides, and the law governing the burden of proof in a court of law. The Office, by contrast, has embarked on “uncharted territory” with no rules of evidence, relying merely on the comment–or hearsay example–presented by UMG Recordings, the plaintiff in two pending lawsuits against Suno and Udio, about how streaming of AI-generated music has distorted streaming royalties, pointing to one criminal case involving the alleged fraudulent streaming of AI-generated music by Michael Smith. Yet that criminal indictment itself shows that existing criminal laws can be, and already are being, used to address manipulated streaming. This example seems too tenuous a basis for the Copyright Office to endorse an expansive theory of market harm under fair use that no court has yet adopted. The incongruity between Factor 4 of fair use and the Office’s new theory of market dilution caused by AI-generated music is underscored by the fact that a court decision rejecting an AI company’s fair use defense is unlikely to stop fraudsters from trying to manipulate streaming royalties, a problem that predates AI. In any event, we should expect many of the plaintiffs in the AI litigation to advance this theory of market harm through dilution, provided their complaints sufficiently alleged it.
Highlights of important points from the U.S. Copyright Office Report (in the order they appear in the Report):
Memorization in weights after training
- “Third, the training process—providing training examples, measuring the model’s performance against expected outputs, and iteratively updating weights to improve performance—may result in model weights that contain copies of works in the training data. If so, then subsequent copying of the model weights, even by parties not involved in the training process, could also constitute prima facie infringement.” p. 28.
- “Whether a model’s weights implicate the reproduction or derivative work rights turns on whether the model has retained or memorized substantial protectable expression from the work(s) at issue. As discussed above, the use of those works in preparing a training dataset and training a model implicates the reproduction right, but copying the resulting weights will only infringe where there is substantial similarity.” p. 29.
Different uses during AI development v. AI deployment require separate consideration but fair use “must also be evaluated in the context of the overall use” (emphasis added)
- “The Office agrees that different uses during AI development and deployment require separate consideration. But while it is important to identify the specific act of copying during development, compiling a dataset or training alone is rarely the ultimate purpose. Fair use must also be evaluated in the context of the overall use.” pp. 36-37.
Factor 1: Training AI models will often be transformative in purpose, but how the models function or can be used, once they are deployed, will affect the degree of transformativeness.
- Copyright Office agrees “training a generative AI foundation model on a large and diverse dataset will often be transformative. The process converts a massive collection of training examples into a statistical model that can generate a wide range of outputs across a diverse array of new situations.” p. 45.
- But the degree of transformativeness depends ultimately on what the AI model does, once deployed after training. pp. 46-47.
- Most transformative: “On one end of the spectrum, training a model is most transformative when the purpose is to deploy it for research, or in a closed system that constrains it to a non-substitutive task. For example, training a language model on a large collection of data, including social media posts, articles, and books, for deployment in systems used for content moderation does not have the same educational purpose as those papers and books.”
- Least transformative: “On the other end of the spectrum is training a model to generate outputs that are substantially similar to copyrighted works in the dataset. For example, a foundation image model might be further trained on images from a popular animated series and deployed to generate images of characters from that series. Unlike cases where copying computer programs to access their functional elements was necessary to create new, interoperable works, using images or sound recordings to train a model that generates similar expressive outputs does not merely remove a technical barrier to productive competition. In such cases, unless the original work itself is being targeted for comment or parody, it is hard to see the use as transformative.”
- In between: “Many uses fall somewhere in between. The use of a model may share the purpose and character of the underlying copyrighted works without producing substantially similar content. Where a model is trained on specific types of works in order to produce content that shares the purpose of appealing to a particular audience, that use is, at best, modestly transformative. Training an audio model on sound recordings for deployment in a system to generate new sound recordings aims to occupy the same space in the market for music and satisfy the same consumer desire for entertainment and enjoyment. In contrast, such a model could be deployed for the more transformative purpose of removing unwanted distortion from sound recordings.”
- The use of guardrails on outputs of AI model: “Because generative AI models may simultaneously serve transformative and non-transformative purposes, restrictions on their outputs can shape the assessment of the purpose and character of the use. As described above, developers can apply training techniques or deployment guardrails so that the model rejects requests for excerpts of copyrighted works or even refuses to generate expressive works. Where such restrictions are effective, the system will be less capable of fulfilling the purpose of the original works, and their use in training may be more transformative.”
- RAG (or retrieval augmented generation) search is less likely to be transformative when the outputs are “summaries of retrieved copyrighted works, such as news articles, as opposed to hyperlinks.” p. 47.
Copyright Office rejects theories that AI engages in nonexpressive use and fair learning.
- “In providing this analysis, the Office rejects two common arguments about the transformative nature of AI training. As noted above, some argue that the use of copyrighted works to train AI models is inherently transformative because it is not for expressive purposes. We view this argument as mistaken. Language models are trained on examples that are hundreds of thousands of tokens in length, absorbing not just the meaning and parts of speech of words, but how they are selected and arranged at the sentence, paragraph, and document level—the essence of linguistic expression. Image models are trained on curated datasets of aesthetic images because those images lead to aesthetic outputs. Where the resulting model is used to generate expressive content, or potentially reproduce copyrighted expression, the training use cannot be fairly characterized as ‘non-expressive.’” p. 47
- “Nor do we agree that AI training is inherently transformative because it is like human learning. To begin with, the analogy rests on a faulty premise, as fair use does not excuse all human acts done for the purpose of learning. A student could not rely on fair use to copy all the books at the library to facilitate personal education; rather, they would have to purchase or borrow a copy that was lawfully acquired, typically through a sale or license. Copyright law should not afford greater latitude for copying simply because it is done by a computer. Moreover, AI learning is different from human learning in ways that are material to the copyright analysis. Humans retain only imperfect impressions of the works they have experienced, filtered through their own unique personalities, histories, memories, and worldviews. Generative AI training involves the creation of perfect copies with the ability to analyze works nearly instantaneously. The result is a model that can create at superhuman speed and scale. In the words of Professor Robert Brauneis, ‘Generative model training transcends the human limitations that underlie the structure of the exclusive rights.’” p. 48
Using “pirated” datasets should weigh against fair use without being determinative (aka “unlawful access”)
- “In the Office’s view, the knowing use of a dataset that consists of pirated or illegally accessed works should weigh against fair use without being determinative. Courts have expressed some uncertainty about whether good or bad faith generally is relevant to the fair use analysis. The cases in which they have done so, however, involved defendants who used copyrighted works despite the owners’ denial of permission. Training on pirated or illegally accessed material goes a step further. Copyright owners have a right to control access to their works, even if someone seeks to obtain them in order to make a fair use. Gaining unlawful access therefore bears on the character of the use.” p. 51
Factor 2: Using more expressive works in AI training weighs against fair use under Factor 2
- “Where the works involved are more expressive, or previously unpublished, the second factor will disfavor fair use.” p. 54 (Wow, very little explanation provided.)
Factor 3: Downloading entire works may weigh against fair use, but for some large AI models the amount copied may be reasonably necessary
- “Downloading works, curating them into a training dataset, and training on that dataset generally involve using all or substantially all of those works. Such wholesale taking ordinarily weighs against fair use.” p. 55
- “Nevertheless, the use of entire works appears to be practically necessary for some forms of training for many generative AI models. While for large, general-purpose models, there is no need to copy any amount of any specific work, research supports commenters’ assertions that internet-scale pre-training data, including large amounts of entire works, may be necessary to achieve the performance of current-generation models. To the extent there is a transformative purpose, the use of entire works on that scale could be reasonable.” p. 60.
- Use of guardrails: “But many generative AI companies with chatbot and other public-facing services employ guardrails and other methods to prevent potentially infringing outputs. These include input filters that block user prompts likely to result in generations that reproduce copyrighted content; training techniques designed to make infringing outputs less likely; internal system prompts that instruct it not to generate names of copyrighted characters or create images in the style of living artists; and output filters that block copyrighted content from being displayed. Although there are factual disputes over the efficacy of these guardrails, where they do prevent the generation of infringing content, the third factor will weigh less heavily against fair use.” p. 60.
Factor 4: Copyright Office analyzes potential market harm in (1) lost sales, (2) market dilution, and (3) lost licensing opportunities
- Lost sales: “There are instances, however, where the use of works in generative AI training can lead to a loss in sales. The use of pirated collections of copyrighted works to build a training library, or the distribution of such a library to the public, would harm the market for access to those works. And where training enables a model to output verbatim or substantially similar copies of the works trained on, and those copies are readily accessible by end users, they can substitute for sales of those works. A potential loss of sales is particularly clear in the case of works specifically developed for AI training. There is a thriving industry focused on developing training datasets that improve the ability of language models to follow instructions, format and structure outputs, use tools, act consistently with human values, or improve domain performance. Where the content of those datasets is copyrightable, or the datasets themselves evince human selection and arrangement of data, and the datasets are primarily or solely targeted at AI training, widespread unlicensed use would likely cause market harm. Uses involving the retrieval of copyrighted works by RAG can also result in market substitution. As described above, RAG augments AI model responses by retrieving relevant content during the generation process, resulting in outputs that may be more likely to contain protectable expression, including derivative summaries and abridgments. A user for whom the augmented response “satisf[ies] the . . . need” for the original work will not pay to obtain it in the marketplace.” pp. 63-64.
- Market dilution theory (new): “While we acknowledge this is uncharted territory, in the Office’s view, the fourth factor should not be read so narrowly. The statute on its face encompasses any ‘effect’ upon the potential market. The speed and scale at which AI systems generate content pose a serious risk of diluting markets for works of the same kind as in their training data. That means more competition for sales of an author’s works and more difficulty for audiences in finding them. If thousands of AI-generated romance novels are put on the market, fewer of the human-authored romance novels that the AI was trained on are likely to be sold. Royalty pools can also be diluted. UMG noted that ‘[a]s AI-generated music becomes increasingly easy to create, it saturates this already dense marketplace, competing unfairly with genuine human artistry, distorting digital platform algorithms and driving ‘cheap content oversupply’ – generic content diluting human creators’ royalties.’” p. 65.
- “Market harm can also stem from AI models’ generation of material stylistically similar to works in their training data. As the Office noted in Part 1 of this Report, many commenters raised concerns about AI outputs that imitate a creator’s style, which copyright does not protect as a separate element. Even when the output is not substantially similar to a specific underlying work, stylistic imitation made possible by its use in training may impact the creator’s market. In the words of the Writers Guild of America, because AI systems can be prompted to imitate a writer’s style, applying fair use would force writers “to compete with AI-generated scripts trained on their work, without their authorization, and without fair compensation.” This threat is more acute because of the technology’s ability to produce works so similar in style “that the average person cannot discern a difference in the marketplace[,] . . . creat[ing] direct competition with the creators whose works have been used to train the model.” p. 65.
- Lost licensing?: Unclear if licensing market exists at scale needed for some AI models. “Although licensing markets are still developing and factual contexts vary, available information shows that markets exist or are “reasonable” or “likely to be developed,” for certain copyright sectors, types of training or uses, and models. Direct licensing is most common and most promising with respect to corporate entities with catalogs of high-quality and easily identifiable content. For example, content controlled by large stock photography companies, national news outlets, and major record companies or film studios may be more easily licensable. Such content likely has a higher training value because it is high-quality and curated, and the centralization of rights makes it easier to license without incurring substantial volume-related transaction costs. Yet, it is also unclear that markets are emerging or will emerge for all kinds of works at the scale required for all kinds of models. There are copyright sectors where licensing infrastructure does not yet exist and may be difficult to build, and the amount of training data needed to produce state-of-the-art models may vary by content type or type of training.” p. 70.
Public benefits
- “In the Office’s view, there are strong claims to public benefits on both sides. Many applications of generative AI promise great benefits for the public, as does the production of expressive works. While the sheer volume of production itself does not necessarily serve copyright’s goals, commenters identified a wide range of potential benefits weighing in favor and against training on unlicensed copyrighted works. With regard to the fair use analysis, however, the Office cannot conclude that unlicensed use of copyrighted works for training offers copyright-related benefits that would change the fair use balance, apart from those already considered.” p. 73.
Weighing the four factors
- “We observe, however, that the first and fourth factors can be expected to assume considerable weight in the analysis. Different uses of copyrighted works in AI training will be more transformative than others. And given the volume, speed and sophistication with which AI systems can generate outputs, and the vast number of works that may be used in training, the impact on the markets for copyrighted works could be of unprecedented scale. As generative AI involves a spectrum of uses and impacts, it is not possible to prejudge litigation outcomes. The Office expects that some uses of copyrighted works for generative AI training will qualify as fair use, and some will not. On one end of the spectrum, uses for purposes of noncommercial research or analysis that do not enable portions of the works to be reproduced in the outputs are likely to be fair. On the other end, the copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available, is unlikely to qualify as fair use. Many uses, however, will fall somewhere in between.”
Conclusion
- “Various uses of copyrighted works in AI training are likely to be transformative. The extent to which they are fair, however, will depend on what works were used, from what source, for what purpose, and with what controls on the outputs—all of which can affect the market. When a model is deployed for purposes such as analysis or research—the types of uses that are critical to international competitiveness—the outputs are unlikely to substitute for expressive works used in training. But making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.”
DOWNLOAD THE COPYRIGHT OFFICE REPORT BELOW