After Disney sues Midjourney, Sarah Andersen seeks all datasets used by Midjourney

Two days after Disney sued Midjourney, Sarah Andersen and other artists are asking Magistrate Judge Cisneros to order Midjourney to produce all datasets it used to train its AI models.

Disney and Universal Studios sue Midjourney for recreating Disney characters. Copyright AI suits hit 42 in U.S. (Complaint PDF)

Andersen made the original discovery request in November 2024, and the parties have now turned to Judge Cisneros to resolve their dispute. Plaintiffs argue they should get all of the datasets. Midjourney argues that the plaintiffs are limited to the two datasets specified in their complaint: LAION-5B and LAION-400M. Midjourney also objects to the sheer size of the datasets it used, which allegedly span petabytes of data, with each petabyte allegedly “the rough equivalent of 500 billion pages of text, or 20 million tall filing cabinets, or 13.3 years’ worth of high-definition video.”

I would expect the Judge to find that any dataset that might reasonably be expected to contain the plaintiffs’ images should be produced.

I am more intrigued by how the production of any datasets from Midjourney will occur. The Stipulated Protective Order does not appear to require that the datasets be viewed at Midjourney’s offices, and I can’t find an order specific to training data (but I may be missing it). By contrast, in Bartz v. Anthropic, the datasets must be viewed at Anthropic.