There’s a discovery battle brewing in the In re Mosaic LLM Litigation. Databricks opposes the plaintiffs’ discovery requests for evidence related to its new model, DBRX. Those requests go beyond the complaint, which targets the Mosaic LLM model (to which Databricks acquired the rights).
Apparently, the plaintiffs are seeking information about the datasets used to train DBRX. Databricks said it didn’t use the controversial Books3 dataset. But the plaintiffs contend that DBRX was trained on 12 trillion tokens, a volume they argue couldn’t have been reached without using pirated book datasets of some kind.