Apple has made a big deal out of paying for the data used to train Apple Intelligence, but one firm whose data it used is accused of ripping off YouTube videos.
All generative AI works by training Large Language Models (LLMs) on enormous datasets, and very often the source of that data is controversial. So much so that Apple has repeatedly claimed that its sources are ethical; it's known to have paid millions to publishers and to have licensed images from photo library firms.
According to Wired, however, one firm whose data Apple has used appears to have been less scrupulous about its sources. EleutherAI created a dataset called the Pile, which Apple has reported using for its LLM training.
Part of the Pile, though, is called YouTube Subtitles, and it consists of subtitles downloaded from YouTube videos without permission. Gathering them this way also appears to breach YouTube's terms and conditions, though that may be a grayer area than it should be.
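To illustrate how simple this kind of collection can be, here is a minimal, hypothetical Python sketch that pulls one video's caption track, assuming the third-party youtube-transcript-api package and its get_transcript call. It is not the Pile's actual collection pipeline, which is not described here, and the video ID is a placeholder.

# Hypothetical illustration only -- not the Pile's actual pipeline.
# Assumes: pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_subtitle_text(video_id: str) -> str:
    """Return a video's caption track as one plain-text string."""
    # get_transcript returns a list of segments shaped like
    # {"text": ..., "start": ..., "duration": ...}
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(segment["text"] for segment in segments)

if __name__ == "__main__":
    # Placeholder video ID for demonstration purposes
    print(fetch_subtitle_text("dQw4w9WgXcQ")[:300])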
Alongside Apple, firms that have used the Pile include Anthropic, whose spokesperson claimed that there is a difference between using YouTube subtitles and using the videos themselves.
"The Pile includes a very small subset of YouTube subtitles," said Jennifer Martinez. "YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset."
"On the point about potential violations of YouTube's terms of service," she continued, "we'd have to refer you to the Pile authors."
Salesforce also confirmed that it had used the Pile to build an AI model for "academic and research purposes." Salesforce's vice president of AI research stressed that the Pile is "publicly available."
Reportedly, developers at Salesforce also found that the Pile dataset includes profanity, plus "biases against gender and certain religious groups."
Salesforce and Anthropic are so far the only firms that have commented on their use of the Pile. Apple, Nvidia, Bloomberg, and Databricks are known to have used it, but they have not responded.
The organization Proof News claims to have found that subtitles from 173,536 YouTube videos from over 48,000 channels were used in the Pile. The videos used include seven by Marques Brownlee (MKBHD) and 337 from PewDiePie.
Proof News has produced an online tool to help YouTubers see whether their work has been used.
However, it's not only YouTube subtitles that have been gathered without permission. It's claimed that Wikipedia content has been used, as has documentation from the European Parliament.
Academics, and even mathematicians, have previously used thousands of Enron staff emails for statistical analysis. Now, it's claimed that the text of those emails is included in the Pile as training data.
It's previously been argued that Apple's generative AI might be the only one that was trained legally and ethically. But despite Apple's intentions, Apple Intelligence has seemingly been trained on YouTube subtitles it had no right to use.
3 Comments
In March, Apple added a feature to Podcasts: automatic transcriptions. The podcaster doesn't even have to request it; it's just done automatically. Of course, by doing this Apple was training its AI. Plus, the transcripts are open for anyone to copy and paste, so they can be ripped off by anyone else as well.
Don’t talk to me about how Apple’s AI systems are ‘legal and ethical’.
As long as there aren't any changes to copyright laws that make it illegal for companies to ingest content creators' data for the purpose of LLM training, this will just happen over and over.
All we do when we go to school is digest material someone else produced, someone who probably 'ripped off' information from someone else, and so on.
Isn't education, be it for humans or AI, absorbing existing information to learn and, in some cases, produce original ideas?
Why do we discriminate against AI entities for doing the same? /s