Website owners have a simple mechanism to tell Apple Intelligence not to scrape the site for training purposes, and reportedly major platforms like Facebook and the New York Times are using it.
Future expansions to Apple Intelligence may involve more AI partners, paid subscriptions
Apple has been offering publishers millions of dollars for the right to scrape their sites, in contrast to Google, which holds that all data should be freely available to train AI large language models. As part of this approach, Apple honors a system where a site can state, in a particular file, that it does not want to be scraped.
That file is a simple text file called robots.txt, and according to Wired, many major publishers are choosing to use it to block Apple's AI training.
The robots.txt file is no technical barrier to scraping, nor even really a legal one, and some firms are known to ignore it.
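For illustration, a site that wanted to opt out of Apple's AI training while still allowing ordinary crawling could add entries like these to its robots.txt. (Applebot-Extended is the opt-out token Apple has documented for this purpose; the exact rules a given site uses will vary.)

```txt
# Allow Apple's regular crawler (used for Siri and Spotlight results)
User-agent: Applebot
Disallow:

# Opt out of having content used for AI training
User-agent: Applebot-Extended
Disallow: /
```

Because the opt-out is a separate user agent, a site can keep appearing in Apple's search features while still refusing to contribute training data.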
Reportedly, many news sites are blocking Apple Intelligence. Significant ones include:
- The New York Times
- Craigslist
- Tumblr
- Financial Times
- The Atlantic
- USA Today
- Conde Nast
In Apple's case, Wired says that two main studies in the last week have shown that around 6% to 7% of high-traffic websites are blocking Apple's AI scraping bot, called Applebot-Extended. A further study by Ben Welsh, also undertaken in the last week, says that just over 25% of the sites checked are blocking it.
The discrepancy is down to which sets of high-traffic websites were researched. The Welsh study, for comparison, found that OpenAI's bot is blocked by 53% of news sites checked, and Google's equivalent Google-Extended is blocked by almost 43%.
Wired concludes that while sites might not care whether Apple Intelligence is scraping them, the major reason for low blocking figures is that Apple's AI bot is too little known for firms to notice it.
Yet Apple Intelligence is not exactly hiding in the dark, and Applebot-Extended is an extension of Applebot. That crawler was first spotted by sites in November 2014, and officially revealed by Apple in May 2015.
So for ten years, Applebot has been searching and scraping websites in order to power Siri and Spotlight searches.
Consequently, it's less likely that website owners haven't heard of Apple Intelligence, and more likely that they have heard of Apple making deals worth millions. While negotiations are continuing, or just conceivably might start, some sites are consciously blocking Apple Intelligence.
That includes The New York Times, which is also suing OpenAI over copyright infringement because of its AI scraping.
"As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission" says the newspaper's Charlie Stadtlander. "Importantly, copyright law still applies whether or not technical blocking measures are in place."