From a small seed a mighty trunk may grow. – Aeschylus
In the United States, 38 copyright lawsuits have been filed against AI companies over the use of copyrighted materials to train or operate AI. To our knowledge, until this week, none alleged that an AI company actually used file-sharing software under the BitTorrent protocol not only to download datasets allegedly containing pirated copies of books, but also to “seed” those same files, allowing others to download them as well. Apparently, there’s a norm in the BitTorrent community that, if you download files via torrents, you should also seed the same files for others to download. And if you don’t share roughly the same amount as you download, you are labeled a “leecher.”
But that BitTorrent norm runs headfirst into copyright law and the potential for blatant copyright infringement: copying and publicly distributing copyrighted works without permission, provided that third parties actually downloaded copies from the seeding.
For that very reason, it appears, based on the Kadrey book author plaintiffs’ briefs, that Meta employees hesitated to use torrents, especially on Meta laptops. The plaintiffs also allege that Meta knew the LibGen dataset it wanted to torrent included pirated copies of works. Here are some of the plaintiffs’ allegations, based on the documents and testimony they obtained from Meta, as cited in their briefs in support of their motion for leave to file a Third Amended Consolidated Complaint:
- These documents concern Meta’s torrenting and processing of pirated copyrighted works, including that: Meta’s CEO, Mark Zuckerberg, approved Meta’s use of the LibGen dataset notwithstanding concerns within Meta’s AI executive team (and others at Meta) that LibGen is “a dataset we know to be pirated,” Stein Reply Decl. (“Reply Ex.”), Ex. A at 211699, 211702;
- top Meta engineers discussed accessing and reviewing LibGen data but hesitated to get started because “torrenting from a [Meta-owned] corporate laptop doesn’t feel right [smiley emoji],” Reply Ex. B at 204224;
- one of those engineers “filtered . . . copyright lines” and other data out of LibGen to prepare a CMI-stripped version of it to train Llama, Reply Ex. C at 204220-21;
- by January 2024, Meta had already torrented (both downloaded and distributed) data from LibGen, Reply Ex. D. [n.1: According to their “File Paths,” Meta collected these and hundreds of other documents months ago, some as early as June 2024. Yet Meta withheld them until the last hours of fact discovery.]
- when asked about the type of piracy described in the TACC [Third Amended Consolidated Complaint], Mr. Zuckerberg testified that such activity would raise “lots of red flags” and “seems like a bad thing.” Reply Ex. E (Zuckerberg Dep. Tr.) at 102:10–14; 98:24–99:2.
- internal Meta records show every relevant decision-maker at Meta, up to and including its CEO, Mark Zuckerberg, knew LibGen was “a dataset we know to be pirated.” Reply Ex. A at 211699 (memo to Meta’s AI decision-makers noting that after “escalation to MZ,” Meta’s AI team “has been approved to use LibGen”), 211702 (“[M]edia coverage suggesting we have used a dataset we know to be pirated, such as LibGen, [] may undermine our negotiating position with regulators.”).
Of course, as we have cautioned in earlier posts, we need to see these references to Meta documents and employees’ testimony in their full context. We also need to know (1) which works Meta allowed to be seeded, (2) whether they included any of the plaintiffs’ works, (3) the scope and duration of the seeding, and (4) what evidence exists of actual downloads from the seeding. According to the Meta employee Stein declaration, Meta tried to allow “seeding” of only the smallest amount necessary for Meta’s use of torrents to occur: “That’s what we did and the library that we used [was called] Lib Torrent for downloading LibGen, [Meta employee] Bashlykov configured the configure setting so the smallest amount of seeding could occur.”
But if Judge Chhabria grants the plaintiffs’ request to file a Third Amended Consolidated Complaint, which we fully expect he will do soon, the seeding issue will no doubt be a major part of the plaintiffs’ case, as the plaintiffs’ own briefing indicates.
And it could prove to be most damaging to Meta.