Supplier used controversial sources for training Apple Intelligence

Jul 16, 2024

Apple has made a big deal out of paying for the data used to train Apple Intelligence, but one firm whose data it used is accused of ripping off YouTube videos.

Apple Intelligence may have been trained less legally and ethically than Apple believed

Generative AI works by training Large Language Models (LLMs) on enormous datasets, and very often, the source of that data is controversial. So much so that Apple has repeatedly claimed that its sources are ethical, and it's known to have paid millions to publishers and to have licensed images from photo library firms.

According to Wired, however, one firm whose data Apple has used appears to have been less scrupulous about its sources. EleutherAI reportedly created a dataset it calls the Pile, which Apple has reported using for its LLM training.

Part of the Pile, though, is called YouTube Subtitles, and it consists of subtitles downloaded from YouTube videos without permission. That is apparently also a breach of YouTube's terms and conditions, but it may be more of a gray area than it should be.

Alongside Apple, firms that have used the Pile include Anthropic, whose spokesperson claimed that there is a difference between using YouTube subtitles and using the videos.

"The Pile includes a very small subset of YouTube subtitles," said Jennifer Martinez. "YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset."

"On the point about potential violations of YouTube's terms of service," she continued, "we'd have to refer you to the Pile authors."

Salesforce also confirmed that it had used the Pile to build an AI model for "academic and research purposes." Salesforce's vice president of AI research stressed that the Pile dataset is "publicly available."

Reportedly, developers at Salesforce also found that the Pile dataset includes profanity, plus "biases against gender and certain religious groups."

Salesforce and Anthropic are so far the only firms that have commented on their use of the Pile. Apple, Nvidia, Bloomberg, and Databricks are known to have used it, but they have not responded.

Apple Intelligence is Apple's version of AI

The organization Proof News claims to have found that subtitles from 173,536 YouTube videos from over 48,000 channels were used in the Pile. The videos used include seven by Marques Brownlee (MKBHD) and 337 from PewDiePie.

Proof News has produced an online tool to help YouTubers see whether their work has been used.

However, it's not only YouTube subtitles that have been gathered without permission. It's claimed that Wikipedia has been used, as has documentation from the European Parliament.

Academics, including mathematicians, have previously used thousands of Enron staff emails for statistical analysis. Now, it's claimed that the Pile includes the text of those emails as training data.

It's previously been argued that Apple's generative AI might be the only one that was trained legally and ethically. But despite Apple's intentions, Apple Intelligence has seemingly been trained on YouTube subtitles it had no right to use.
