Two new models further demolish the 'necessity defense' some AI companies use to justify scraping copyrighted work
Large AI companies regularly claim they need to scrape copyrighted work, without permission, to train state-of-the-art generative AI models. But this argument, already weak, is continually losing credibility as more models emerge that don’t take this approach.
The 20 or so models we’ve certified at Fairly Trained have already shown this. And just in the last week, two new models have bolstered the argument that fairer training is possible. These models are particularly interesting because I think they’re the largest models in their respective domains trained without using unlicensed copyrighted work: a language model from Pleias, and an image generation model from Spawning.
Pleias is a French AI lab that recently coordinated the compilation and release of Common Corpus, a 2 trillion-token dataset of public domain text. They’ve now released three multilingual language models, Pleias-3b, Pleias-1b and Pleias-350m, all trained on Common Corpus.
And Spawning is an AI company that has released various products, including haveibeentrained.com and Kudurru, a tool to block web scrapers. They recently compiled a dataset of 12 million public domain images, and they’re now testing Public Diffusion, an image generation model trained entirely on that dataset.
Both efforts add to the already considerable evidence that training models without relying on unlicensed copyrighted work is possible.
Since 2022, a huge number of generative AI companies have released models trained on copyrighted work without licensing that work. They’ve done this because licensing is expensive and takes time, not because it’s impossible.
When companies say that licensing their training data would have been impossible, point them to the growing number of AI companies that don’t exploit copyrighted work without permission. And ask them: did you even try to license your data? You’ll find that many did not.
It is, definitively, possible to build generative AI models without scraping copyrighted work. Over time, absent major legal overhauls, we’ll see more companies going this route. And the ‘necessity defense’ will get even weaker.
This is an interesting article, Ed. The glitch I'm seeing is that publishers are licensing authors' work without their permission, with no opt-out AND no royalties. The contracts were written before generative AI, so nothing remotely similar was covered in past agreements - yet there's been no renegotiation of those contracts.
In my experience, Taylor & Francis/Routledge had zero communication with authors - I found out about the deal on social media and immediately wrote my editor requesting an opt-out. They said: too late! And anyway, no. I've asked about royalties... crickets. I guess they need a lot of champagne-drinking time with the millions they got from selling out our work!
So... when Fairly Trained looks at prospective tools, do you look at the conditions of the license?