Robbed by Robots? The backlash over AI’s ingestion of people’s content without consent

The Backlash against AI ingestion

Three new proposed class action lawsuits were filed this week against Alphabet, Meta, and OpenAI, adding to the 11 lawsuits already pending against AI companies. With these filings, media coverage of the backlash has intensified against the training of AI on databases of billions of items, including people’s online content scraped from the Internet, ranging from artistic works and literary works to social media posts.

The backlash isn’t too surprising, given that 61% of Americans think that AI is a threat to humanity, according to a Reuters/Ipsos poll in May. An ADL/USA Today poll showed even greater fears. AI’s ingestion of vast amounts of people’s content and information, without their permission, likely plays into the larger fears of AI displacing human workers, not to mention the possibility that tech companies may be the ones profiting the most from AI.

The New York Times headline called it an outright revolt: “‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I.” The article describes how fan fiction writers who post their content online, the social media companies Reddit and Twitter, the news organizations The New York Times and NBC News, and several book authors, including Sarah Silverman and Paul Tremblay, have objected to the use of their content to train AI without their permission or compensation.

The backlash against AI ingestion of people’s content started earlier this year, once people began to realize how OpenAI’s ChatGPT operated. A proposed copyright class action was filed in January 2023 by the artist Sarah Andersen against Midjourney, Stable Diffusion, and DreamUp.

Where do they get the data to train AI?

You may be wondering: where do tech companies get all the data they use to train AI?

It may be from a range of sources. For example, “OpenAI said ChatGPT was trained on ‘licensed content, publicly available content and content created by human A.I. trainers.’” One of the sources used by OpenAI was LAION-5B, “a nonprofit, publicly available database that indexes more than five billion images from across the Internet, including the work of many artists.”

Both Stable Diffusion’s Ben Brooks and Midjourney’s David Holz admitted that their companies are using vast amounts of copyrighted content without licenses from their creators.

Is ingesting content to train AI fair use?

For content that is copyrighted, such as the artworks, photographs, and books at issue in the several lawsuits against AI companies, the key question in the United States will be whether such (intermediate) use of the copyrighted content, including making internal copies and then tokenizing them, constitutes a permissible fair use.
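For readers unfamiliar with the term, “tokenizing” means converting ingested text into sequences of integer IDs before training. The toy sketch below illustrates the idea with simple whitespace splitting; real systems such as GPT-style models use more sophisticated schemes (e.g., byte-pair encoding over a large fixed vocabulary), and the function names here are purely illustrative.

```python
# Toy illustration of tokenization: text is split into pieces, and each
# piece is mapped to an integer ID. The model never stores the original
# text directly; it trains on these ID sequences.

def build_vocab(corpus):
    """Assign an integer ID to each unique word seen in the corpus."""
    vocab = {}
    for text in corpus:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert text into the list of integer token IDs."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

corpus = ["The quick brown fox", "the lazy dog"]
vocab = build_vocab(corpus)
print(tokenize("the quick dog", vocab))  # prints [0, 1, 5]
```

The legal significance is that the internal copies made during this step reproduce the copyrighted text verbatim, even if the final trained model does not.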

I wrote initial thoughts on the fair use question back in January. Since then, the Supreme Court handed down another major fair use decision in Andy Warhol Foundation v. Goldsmith, which narrowed the “transformative purpose” inquiry under the first fair use factor where the copying serves “substantially the same purpose” as the original, such as use on a magazine cover. This might bolster the argument that AI-generated output serves substantially the same purpose as the works on which the AI was trained, at least where the output is substantially similar in appearance to an original work.

On the other hand, the internal copies of works used to train AI arguably have a different purpose: to create a whole new AI technology. The AWF Court favorably cited its recent fair use decision in Google v. Oracle, which treated Google’s copying of Java API code to build the Android smartphone platform as transformative. By that analogy:

ChatGPT and other AI platforms are an entirely new computing environment, different in purpose from the underlying works on which they were trained. Plus, these AI platforms perform many functions that do not implicate copyrighted works.

Those are my initial impressions, post-Andy Warhol Foundation.

What are the best practices for AI?

However the lawsuits are resolved, we should also consider what the best practices for AI platforms should be.

Although some platforms allow artists to opt their materials out of the training databases, that approach puts the burden on artists to find out whether their works were used.

Google has proposed developing an opt-out mechanism comparable to the robots.txt instruction, which tells bots which webpages they may or may not crawl.
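An AI-specific opt-out might look much like today’s robots.txt convention. The sketch below is a hypothetical example of such a file; the `GPTBot` user agent is the crawler name OpenAI has published for this purpose, but whether any given AI crawler honors these directives is entirely up to that crawler.

```
# Block OpenAI's GPTBot crawler from the whole site
User-agent: GPTBot
Disallow: /

# Allow all other crawlers (e.g., ordinary search engines) as before
User-agent: *
Allow: /
```

Note the limitation this illustrates: like robots.txt generally, such a scheme is voluntary and site-wide, so it protects a site operator who configures it, not an individual artist whose work appears on sites they do not control.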

Beyond opt-outs, we should consider whether people deserve compensation for the use of their materials to train AI. The feasibility of a compensation scheme seems doubtful, though, given the sheer scale of the billions of works used.
