The pirated books dataset Library Genesis is turning into the mother of all disputes in the copyright litigation against AI companies.
In the MDL litigation against Microsoft, Plaintiffs seek discovery of the Bing Index: “What Class Plaintiffs are challenging, however, is the transfer of a copy (or copies of portions) of the Bing Index data to OpenAI, ostensibly for use in training LLMs or some other purpose, which Class Plaintiffs allege is a separate use of that copyright-protected data that would potentially go beyond the recognized limits of those fair use decisions.”
Somehow, the Bing Index might contain a URL related to LibGen: “Although there are certainly differences between the past versions of the Bing Index and the current version, Microsoft’s 30(b)(6) deponent testified that Microsoft could use an internal tool called the ‘Bingdex’ to search the Bing Index for metadata associated with certain URLs currently in the Bing Index, including when a particular URL was crawled and added. (See ECF 546-2 at 8). This means that if a URL related to LibGen is currently in the Bing Index, Microsoft can use Bingdex to determine whether that link was present during the relevant data transfer time period. (See Sept. 25 Tr. at 94).”
But for now, Judge Wang has denied the motion to compel without prejudice.
Judge Wang concluded: “At this stage, there is insufficient information to justify either a review of all LibGen-related links currently in the Bing Index by Microsoft or a full-scale production of the entirety of the Bing Index to Class Plaintiffs for their own review of such links. However, Microsoft has also failed to articulate the burden associated with such review or production (other than asserting that one exists), and thus there is insufficient information to show that such discovery is not proportional. The Court will address the proportionality of Class Plaintiffs’ request at tomorrow’s discovery status conference.”
Ominous for both OpenAI and Microsoft, however, is the discussion by Magistrate Judge Wang that appears at least somewhat supportive of Judge Alsup’s treatment of downloading from shadow libraries as a separate use and basis for infringement in Bartz v. Anthropic:

Of course, neither Judge Stein nor Magistrate Judge Wang considers whether downloading should be analyzed separately from or as part of training the AI model. So we shouldn’t read too much into these statements. However, Judge Stein’s decision yesterday (allowing a separate claim for downloading from shadow libraries) opens the door for the plaintiffs to pursue their separate claim.