Bartz v. Anthropic: parties fight over production of a datasets spreadsheet and a books dataset outside the inspection environment. Anthropic reveals it spent tens of millions of dollars to compile its own scanned books dataset.

While Judge Alsup deliberates over Anthropic’s motion for summary judgment on fair use, the parties continue to fight over discovery.

One of the more fascinating discovery disputes is whether the Bartz plaintiffs should get production of (1) a spreadsheet of the datasets used by Anthropic, which it calls the Tokens Formula Log, and (2) a Scanned Books Dataset that Anthropic says it compiled itself at a cost of tens of millions of dollars: “it reflects a proprietary composition of books that Anthropic sourced, scanned, and created itself—it is available to no other AI company in the world.”

Wow, that’s the first I’ve heard of Anthropic’s own compiled and scanned books dataset. Most of the attention in the various books AI lawsuits has focused on the use of so-called shadow libraries. (OpenAI used the Books 1 and 2 datasets, which it later destroyed. But I haven’t heard much about how they were compiled.)

Anthropic’s opposition