
Darius H. James seeks leave to file First Amended Complaint v. Cerebras Systems after adding Lieff Cabraser lawyers

Author Darius H. James has filed for leave to file a First Amended Complaint. James recently added lawyers from the Lieff Cabraser law firm, who are heavily involved in representing book authors in multiple suits against AI companies.

The proposed First Amended Complaint adds a claim for contributory infringement.

The claims relate to Cerebras Systems’ creation of the SlimPajama dataset and sharing it online.

Cerebras described this dataset on its website:

“Today we are releasing SlimPajama – the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. SlimPajama was created by cleaning and deduplicating the 1.21T token RedPajama dataset from Together. By filtering out low quality data and duplicates, we were able to remove 49.6% of bytes, slimming down the dataset from 1210B to 627B tokens. We believe SlimPajama offers the highest quality and most compute efficient data to train on for runs up to 627B tokens. When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale.

“In addition to the data, we are also releasing the tools we built to create SlimPajama. Applying MinHashLSH (Leskovec et al. 2014) deduplication to trillion token datasets like RedPajama was not possible with off-the-shelf open-source code. We made several improvements to existing solutions to produce an infrastructure that can perform MinHashLSH deduplication on trillion token datasets in a distributed, multi-threaded, and memory efficient fashion. Today we are open-sourcing this infrastructure to enable the community to easily create higher quality, extensively deduplicated datasets in the future.

“Our contributions are as follows:

  1. SlimPajama 627B – the largest extensively deduplicated, multi-corpora, open dataset for LLM training. We release it under the Apache 2.0 license at https://huggingface.co/datasets/cerebras/SlimPajama-627B
  2. Releasing validation and test sets, 500M tokens each, which have been decontaminated against the training data
  3. Library of methods to replicate or pre-process from scratch other datasets. To the best of our knowledge these are the first open-source tools to enable cleaning and MinHashLSH deduplication of text data at trillion token scale.”
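For readers unfamiliar with the deduplication technique Cerebras describes, MinHash approximates the Jaccard similarity between documents, and locality-sensitive hashing (LSH) banding groups likely near-duplicates into shared buckets so that only candidate pairs need comparing. The following is a minimal, self-contained sketch of that general idea, not Cerebras's actual open-source tooling (the document names, parameters, and helper functions below are illustrative assumptions):

```python
import hashlib
from collections import defaultdict

def shingles(text, k=5):
    """Lowercased character k-grams of a document (the set MinHash compares)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shings, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over all shingles; equal positions estimate Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shings
        ))
    return sig

def lsh_candidates(docs, num_hashes=64, bands=32):
    """Banding LSH: split each signature into bands; any two docs that
    collide on at least one band become a candidate duplicate pair."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",  # near-duplicate of "a"
    "c": "large language models train on deduplicated corpora",
}
candidates = lsh_candidates(docs)
print(candidates)  # the near-duplicate pair ("a", "b") lands in a shared bucket
```

Running MinHashLSH at the trillion-token scale Cerebras describes requires distributing this same pipeline across many workers, which is the infrastructure contribution their post highlights.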

The James Plaintiffs argue that Cerebras committed direct and contributory copyright infringement in creating and sharing this SlimPajama dataset.

DOWNLOAD THE REDLINED VERSION OF THE FIRST AMENDED COMPLAINT
