
Darius H. James seeks leave to file First Amended Complaint v. Cerebras Systems after adding Lieff Cabraser lawyers

Author Darius H. James has filed for leave to file a First Amended Complaint. James recently added lawyers from the Lieff Cabraser law firm, who are heavily involved in representing book authors in multiple suits against AI companies.

The proposed First Amended Complaint adds a claim for contributory infringement.

The claims relate to Cerebras Systems’ creation of the SlimPajama dataset and sharing it online.

Cerebras described this dataset on its website:

“Today we are releasing SlimPajama – the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. SlimPajama was created by cleaning and deduplicating the 1.21T token RedPajama dataset from Together. By filtering out low quality data and duplicates, we were able to remove 49.6% of bytes, slimming down the dataset from 1210B to 627B tokens. We believe SlimPajama offers the highest quality and most compute efficient data to train on for runs up to 627B tokens. When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale.

“In addition to the data, we are also releasing the tools we built to create SlimPajama. Applying MinHashLSH (Leskovec et al. 2014) deduplication to trillion token datasets like RedPajama was not possible with off-the-shelf open-source code. We made several improvements to existing solutions to produce an infrastructure that can perform MinHashLSH deduplication on trillion token datasets in a distributed, multi-threaded, and memory efficient fashion. Today we are open-sourcing this infrastructure to enable the community to easily create higher quality, extensively deduplicated datasets in the future.

“Our contributions are as follows:

  1. SlimPajama 627B – the largest extensively deduplicated, multi-corpora, open dataset for LLM training. We release it under the Apache 2.0 license at https://huggingface.co/datasets/cerebras/SlimPajama-627B
  2. Releasing validation and test sets, 500M tokens each, which have been decontaminated against the training data
  3. Library of methods to replicate or pre-process from scratch other datasets. To the best of our knowledge these are the first open-source tools to enable cleaning and MinHashLSH deduplication of text data at trillion token scale.”
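For readers unfamiliar with the deduplication technique Cerebras describes, MinHash approximates the Jaccard similarity between documents, and locality-sensitive hashing (LSH) banding groups likely near-duplicates into shared buckets so that only candidate pairs need comparing. The following is a minimal, self-contained sketch of that general idea, not Cerebras's actual open-source tooling (the document names, parameters, and helper functions below are illustrative assumptions):

```python
import hashlib
from collections import defaultdict

def shingles(text, k=5):
    """Lowercased character k-grams of a document (the set MinHash compares)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shings, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over all shingles; equal positions estimate Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shings
        ))
    return sig

def lsh_candidates(docs, num_hashes=64, bands=32):
    """Banding LSH: split each signature into bands; any two docs that
    collide on at least one band become a candidate duplicate pair."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",  # near-duplicate of "a"
    "c": "large language models train on deduplicated corpora",
}
candidates = lsh_candidates(docs)
print(candidates)  # the near-duplicate pair ("a", "b") lands in a shared bucket
```

Running MinHashLSH at the trillion-token scale Cerebras describes requires distributing this same pipeline across many workers, which is the infrastructure contribution their post highlights.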

The James Plaintiffs argue that Cerebras committed direct and contributory copyright infringement in creating and sharing this SlimPajama dataset.

DOWNLOAD THE REDLINED VERSION OF THE FIRST AMENDED COMPLAINT
