
AI training in the shadow of Geoffrey Hinton: was it fair use or not?

AI is one of the most disruptive technologies of the 21st century. It may go down as the most disruptive when history books are written.

AI’s disruption has triggered a wide spectrum of reactions among people, including hostility and outright vitriol. There’s probably no more triggering aspect of AI than the edifice on which it was built: training models on the copyrighted works of others without their permission, and, for some models, on the scale of many millions, if not billions, of works.

People have condemned this origin of AI training as the “original sin” of AI. Yet, as I explain in my forthcoming Houston Law Review article, “If this practice is blatant theft of copyrighted works, then this ‘original sin’ occurred in academia, not Silicon Valley.”

To borrow the metaphor, Adam and Eve didn’t live in Silicon Valley. They lived in universities.

We now have a case — Thomson Reuters v. ROSS Intelligence, the first federal appeal of a decision denying fair use in AI training — that will provide the Third Circuit Court of Appeals the opportunity to examine the history and origin of AI training.

The ROSS Intelligence co-founders (Andrew Arruda, Jimoh Ovbiagele, and Pargles Dall’Oglio) were students at the University of Toronto, where Geoffrey Hinton, one of the so-called “godfathers of AI,” made his pathbreaking discoveries in neural networks. Hinton would later be awarded the Nobel Prize for his research.

University of Toronto President Meric Gertler praised Andrew Arruda, Jimoh Ovbiagele, and Pargles Dall’Oglio when, as students, they founded ROSS Intelligence to apply AI techniques they learned at the university to legal research in 2017. As Gertler should. Which university president wouldn’t?

THE SHADOW OF GEOFFREY HINTON

As I explain in my law review article, the idea of using larger datasets that included copyrighted works didn’t originate at Big Tech companies in Silicon Valley.

Instead, it originated in universities when AI researchers made a major breakthrough: AI models dramatically improved simply by using larger and larger datasets. This technique is known as “scaling” and is now widely accepted as an important technique in developing AI models.

In their now-famous AlexNet paper (which has been cited nearly 150,000 times!), Hinton and his co-researchers (including then-University of Toronto students Alex Krizhevsky and Ilya Sutskever; Sutskever later co-founded OpenAI and still has a webpage at the University of Toronto) showed how AI models improved by scaling, that is, by exposing them to increasingly greater amounts of data.

As Hinton et al. concluded: “All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.”

It’s a pretty radical discovery. AI models improve on their own when given more and more data. Humans don’t have to write fancy instructions or programs for the AI to follow. Instead, the AI models learn more by “scaling up” the data they are trained on.
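
To make the scaling effect concrete, here is a minimal sketch (my own illustration, not code from the AlexNet paper): the same simple model, trained on progressively larger slices of scikit-learn’s bundled digits dataset, typically scores higher on held-out data without any change to its code.

```python
# Illustrative sketch of "scaling": identical model code, trained on
# progressively larger datasets, generally yields better accuracy.
# (Uses scikit-learn's small digits dataset, not the AlexNet setup.)
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for n in (100, 400, 1200):  # increasingly large training sets
    model = LogisticRegression(max_iter=2000)
    model.fit(X_train[:n], y_train[:n])
    print(f"trained on {n:>4} examples -> test accuracy "
          f"{model.score(X_test, y_test):.3f}")
```

No new instructions are written as the dataset grows; the only thing that changes is the amount of data the model sees.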

This seminal insight of “deep learning” from larger datasets by Hinton et al., along with other luminaries (e.g., Yoshua Bengio, Yann LeCun, Andrew Ng, Fei-Fei Li), spawned the dramatic advances in AI that we are witnessing today. If you’re curious about deep learning, take a look at Hinton, Bengio, and LeCun’s paper in Nature, cited 101,854 times!

In his 2023 annual report, even Chief Justice John Roberts recognized the importance of scaling (without mentioning it by name):

And now we face the latest technological frontier: artificial intelligence (AI). At its core, AI combines algorithms and enormous data sets to solve problems.

Chief Justice John Roberts

BUT WAS THIS AI TRAINING USING COPYRIGHTED WORKS ALL INFRINGEMENT?

Yet, this major breakthrough also sparked major controversy — and condemnation — as the practice of AI training migrated from university research to startups like OpenAI and Big Tech companies like Google, Microsoft, and Meta.

Why? Typically, no one at the universities or AI companies had obtained permission from the copyright holders whose works were being used to train AI models.

For example, as my article explains: “The BookCorpus dataset or derivatives from it were later used by other researchers in major projects yielding significant advances in AI, including the Google researchers’ BERT model, a joint project between University of Washington researchers and Facebook AI called the RoBERTa model, OpenAI’s GPT model, and the XLNet model.” (BookCorpus was an early books dataset compiled without permission of the authors.)

In other words, AI researchers at universities and AI companies alike were using lots of copyrighted works without permission.

Although no U.S. copyright lawsuits have been filed against university researchers, if the position of some copyright stakeholders is right — that AI training is not transformative and not a fair use — then AI researchers at universities are engaging in copyright infringement, too. And universities themselves may well be secondarily liable if their willful blindness to the mass copying being undertaken in their research labs is proven.

I disagree with that conclusion, as my article explains at length (I will spare you the details here). So far, two federal judges (Judges Alsup and Chhabria) have agreed that AI training is a highly transformative fair use, in cases against Anthropic and Meta, respectively.

But Judge Bibas in the case against ROSS Intelligence held that its AI training on Westlaw headnotes for judicial opinions was neither transformative nor fair use. That decision is now on appeal.

THE THIRD CIRCUIT SHOULD CONSIDER THE ORIGIN OF AI TRAINING — AND THE TECHNOLOGICAL REASON THAT UNDERLIES IT

When the Third Circuit reviews Judge Bibas’s decision on appeal, I think it would help for the court to consider the origin and history of AI training, tracing back to Hinton et al.’s seminal breakthrough.

I use the “shadow of Geoffrey Hinton” as a metaphor. It refers to the key recognition by Hinton and other AI researchers that larger and more diverse datasets used in AI training can yield major advances in AI models. Indeed, that recognition led to advances in AI that many considered pure fantasy only a decade ago.

That shadow is clearly in the background of this case involving former University of Toronto students, who drew on this knowledge and applied it to legal research, an area sorely in need of innovation.

Should the Third Circuit view all the unauthorized use of copyrighted works in university-based research in this “shadow” as categorically not transformative and not fair use, as some vehemently argue? Or should the Third Circuit consider it transformative for the further purpose of technological development and progress, with public benefits that may be profound?
