OpenAI’s proposed reply reveals more on Books 1, 2 datasets used to train GPT-3

In its proposed reply to support its Rule 72(a) objection to Magistrate Judge Wang’s ruling that OpenAI waived its attorney-client privilege, OpenAI gives us a fuller description of the timeline in which OpenAI used Books 1 and 2 datasets:

The Book Author Class Plaintiffs also describe their own timeline in their brief:

Our Own Timeline

Earlier, we created our own timeline. Nearly all of it appears to be accurate.

However, (1) if the Plaintiffs’ description is correct, the initial downloading in 2018 might have been done by someone other than Benjamin Mann, who is first named in the unredacted activity in October 2019 above. Based on his deposition in Bartz v. Anthropic, it still could have been Mann (we can’t see the redacted portions). (2) The training of GPT-3 must have occurred by May 2020, because OpenAI’s research paper was first posted online on Thu, 28 May 2020 17:29:03 UTC.

Timeline of Alleged Infringements by OpenAI and Key Events

2018 downloading Library Genesis dataset [Alleged Willful Infringement 1]:
- “It is undisputed that in 2018, an OpenAI employee downloaded pirated copies of books from Library Genesis (“LibGen”).” p. 1.
Mar. 2019: OpenAI nonprofit adds a for-profit arm.
Oct. 2019: OpenAI employee Benjamin Mann allegedly does something “with a huge quantity of books from LibGen dataset.” Mann testified in Bartz that he thought it was fair use to download and use the datasets to train AI models.
2020 training with Books 1 and 2 derived from LibGen [Alleged Willful Infringement 2]:
- OpenAI states in its research paper Language Models are Few-Shot Learners, posted in May 2020, that it used the Books 1 and Books 2 datasets to train its (early) AI model.

April 2021: attorney Jason Kwon hired by OpenAI.
2021: attorney Che Chang hired by OpenAI.
Late 2021: OpenAI stops training with Books 1 and 2 (per Gratz Mar. 22, 2024 Letter)
2022 deletion of Books 1 and 2 + Attorney Communications [2022 OpenAI counsel communications]: OpenAI deletes the Books 1 and 2 datasets, and in-house attorneys Kwon and Chang have communications related to the deletion.
Sept. 19, 2023: Authors Guild sues OpenAI and Microsoft
Sept. 2023: attorney Michael Trinh hired by OpenAI.
Mar. 22, 2024: In a letter to Plaintiffs’ attorney, OpenAI’s outside attorney Joseph Gratz said that, based on their understanding, Books 1 and 2 were discontinued for training models in late 2021 and were deleted in mid-2022: “We also understand that the use of books1 and books2 for model training was discontinued in late 2021, after the training of GPT-3 and GPT-3.5, and those datasets were then deleted in or around mid-2022 due to their non-use. We therefore have not yet located copies of books1 and books2 to make available for inspection, but are actively working on doing so, to the extent that they are still available. In parallel, we are gathering information about the composition of books1 and books2 that could be used to provide an appropriate substitute if we are unable to locate the datasets themselves.” [On Jun. 13, 2025, OpenAI attempted to retract the Gratz letter because of its mention of “due to their non-use,” and OpenAI later asserted that the reason for the deletion is protected by attorney-client privilege. ECF 188; see Op. p. 6. The revised letter omits “due to their non-use,” but is otherwise the same as the Mar. 22, 2024 letter.]

I have labeled the 2018 downloading “Alleged Willful Infringement 1” and the circa 2020 training “Alleged Willful Infringement 2” for clarity.

The training with Books 1 and 2 may have spanned the period from 2018 to late 2021, but the precise timeline is a bit unclear from the publicly filed briefs. According to the Gratz letter for OpenAI, the datasets were “deleted in or around mid-2022.”

This timeline shows that OpenAI’s deletion of Books 1 and 2 occurred some time after OpenAI downloaded the datasets and later used them to train GPT-3 in 2020. Indeed, Plaintiffs’ own brief states it was “[t]hree years later,” meaning after the datasets were downloaded and Mann started to use them:
