As part of their motion to intervene, book publishers Cengage Learning and Hachette Book Group filed a proposed complaint against Google alleging that “Google first illegally copied Plaintiffs’ and the Class’s copyrighted books—downloading them from pirated sources and extracting them from behind legitimate paywalls—to amass a massive corpus of source material.”
The nature of Google’s alleged downloading appears to rest on Google’s mere use of the Common Crawl dataset, which Google then allegedly curated into datasets called Infiniset and Google’s Colossal Clean Crawled Corpus (C4). Cengage Learning and Hachette’s Complaint alleges the following:

The book authors in the In re Google Generative AI Litigation make a similar allegation in their Second Amended Complaint:

In other words, the Complaints appear to allege that, in crawling the Web, Common Crawl swept up pirated copies of books, including copies from pirate digital libraries, and that those copies were later included in the Common Crawl dataset that Google used. That inference is apparent in Paragraph 48 of the proposed Complaint, which doesn’t allege that Google torrented files from the controversial Library Genesis but instead makes the vague allegation: “48. Many other well-known pirate collections like LibGen are widely available on the internet. Common Crawl includes these sites when it scrapes the internet for text content to be used for training materials.”
How Google differs from the Anthropic and Meta cases
This allegation by Cengage Learning, Hachette, and the book author plaintiffs is quite different from the allegations against other AI companies (e.g., Meta and Anthropic), which were accused of torrenting directly from shadow libraries. The Shadow Library Strategy requires direct downloading from a shadow library, not simply that pirated copies were swept up in an automated crawling of the Web.
In its Answer, Google admits creating the Infiniset and C4 datasets, but it states that it “lacks knowledge or information sufficient to form a belief about the truth of the allegations” as to the other allegations, such as the claim that the Common Crawl dataset swept in pirated books through crawling.
Does Google Books provide the reason?
I had long assumed that Google didn’t acquire copies of books from Library Genesis or other shadow libraries. Why? Google already has a vast database of 40 million books it manually scanned from physical books as part of its Google Books search (and Judge Alsup later held in Bartz v. Anthropic that such manual scanning is a fair use and that training on books is a fair use). I have no confirmation my hunch is accurate, but it seems like a reasonable assumption.
And the same goes for YouTube videos.
