The parties in In re Google Gen AI Litigation are disputing whether plaintiffs can obtain discovery on certain datasets held at Google.
According to Google, there are seven: FineWeb, FineWeb2, FineWeb-edu, Common Crawl, The Pile, RedPajama, and RedPajama2. But Google said it didn’t use these datasets to train the models at issue in the lawsuit.

By contrast, the plaintiffs argue that Google used at least some of these datasets and that the others are relevant:

