Big-name publishers are refusing to let Apple Intelligence train on data

Future expansions to Apple Intelligence may involve more AI partners, paid subscriptions

Website owners have a simple mechanism to tell Apple Intelligence not to scrape the site for training purposes, and reportedly major platforms like Facebook and the New York Times are using it.

Apple has been offering publishers millions of dollars for the right to scrape their sites, as opposed to Google, which believes all data should be freely available to train AI large language models. As part of this, Apple honors a system where a site can state in a particular file that it does not want to be scraped.

That file is a simple text file called robots.txt, and according to Wired, many major publishers are choosing to use it to block Apple's AI training.

This robots.txt file is no technical barrier to scraping, nor even really a legal one, and there are firms that are known to ignore being blocked.
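
To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page. It uses Python's standard urllib.robotparser module. The sample file contents, the example.com URL, and the is_allowed helper are all assumptions made up for illustration; only the Applebot-Extended and Applebot agent names come from the reporting above. This is not Apple's actual implementation, and the check is purely advisory, since nothing stops a crawler from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
# It opts out of Applebot-Extended while leaving every other agent allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("Applebot-Extended", "Applebot"):
        print(agent, is_allowed(SAMPLE_ROBOTS_TXT, agent, url))
```

With this sample file, the check comes back negative for Applebot-Extended and positive for plain Applebot, which is the split a publisher would want if it is happy to appear in Siri and Spotlight results but not in AI training data.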

Reportedly, many news sites are blocking Apple Intelligence. Significant ones include:

  • The New York Times
  • Facebook
  • Instagram
  • Craigslist
  • Tumblr
  • Financial Times
  • The Atlantic
  • USA Today
  • Conde Nast

In Apple's case, Wired says that two main studies in the last week have shown that around 6% to 7% of high-traffic websites are blocking Apple's search tool, called Applebot-Extended. A further study by Ben Welsh, also undertaken in the last week, says that just over 25% of sites checked are blocking it.

The discrepancy is down to which sets of high-traffic websites were researched. The Welsh study, for comparison, found that OpenAI's bot is blocked by 53% of news sites checked, and Google's equivalent Google-Extended is blocked by almost 43%.

Wired concludes that while sites might not care whether Apple Intelligence is scraping them, the major reason for low blocking figures is that Apple's AI bot is too little known for firms to notice it.

Yet Apple Intelligence is not exactly hiding in the dark, and Applebot-Extended is a superset of Applebot. The original Applebot was first spotted by sites in November 2014, and officially revealed by Apple in May 2015.

So for ten years, Applebot has been searching and scraping websites in order to power Siri and Spotlight searches.

Consequently, it's less likely that website owners haven't heard of Apple Intelligence, and more likely that they have heard of Apple making deals worth millions. While negotiations are continuing, or just conceivably might start, some sites are consciously blocking Apple Intelligence.

That includes The New York Times, which is also suing OpenAI over copyright infringement because of its AI scraping.

"As the law and The Times' own terms of service make clear, scraping or using our content for commercial purposes is prohibited without our prior written permission," says the newspaper's Charlie Stadtlander. "Importantly, copyright law still applies whether or not technical blocking measures are in place."



14 Comments

gatorguy 14 Years · 24643 comments



Apple has been offering publishers millions of dollars for the right to scrape their sites, as opposed to Google, which believes all data should be freely available to train AI large language models. As part of this, Apple honors a system where a site can state in a particular file that it does not want to be scraped.

That file is a simple text file called robots.txt, and according to Wired, many major publishers are choosing to use it to block Apple's AI training.

From this article's link to a previous AppleInsider article:

"According to The Guardian, Google has presented a case to Australian regulators that it be allowed to do what it wants and, okay, maybe publishers should be able to say no. But that's on the publishers, not Google."

Now substitute Apple for Google in that quote. Isn't that what Apple is doing too, scraping unless the publisher says no?
As for paying, both companies have shown a willingness to if the data is important enough. For instance, Google this year alone has signed multi-million deals for access to training data with both Reddit and Stack Overflow.

Apple on the other hand appears to be low-balling potential training data partners, offering less in total than Google is paying Reddit alone, with no evidence yet that any sites are biting. 

3 Likes · 0 Dislikes
DAalseth 7 Years · 3084 comments

Good, all sites should. Not just Apple’s AI, block all of them, and sue them out of existence if they break in.

2 Likes · 0 Dislikes
Cesar Battistini Maziero 9 Years · 416 comments

New York times is Amazon, and Meta has a competing service.

I bet Meta, Open AI and google haven't asked permission to train on everyones data.

2 Likes · 0 Dislikes
Stabitha_Christie 4 Years · 593 comments

New York times is Amazon, and Meta has a competing service.

I bet Meta, Open AI and google haven't asked permission to train on everyone’s data.

What are you talking about? Amazon doesn’t own the New York Times. Jeff Bezos, founder and former CEO, owns The Washington Post.  The New York Times is owned by the New York Times Company which is a publicly owned company. You managed to get that 100% wrong. 
 


11 Likes · 0 Dislikes
gatorguy 14 Years · 24643 comments

New York times is Amazon, and Meta has a competing service.

I bet Meta, Open AI and google haven't asked permission to train on everyones data.

I'm not as familiar with Meta and OpenAI, but as for Google you would likely lose that bet.

Besides actually paying for particularly valuable training data, Google offers exactly the same mechanisms as Apple does for publishers to opt-out. If you believe Google is doing the wrong thing, then so is Apple. The opposite is also true of course. If you're proud of the way Apple is approaching it, then you're OK with Google too.  

That said, unlike the multi-million-dollar deals Google has made to license data from private sites, I'm not aware of Apple paying for any private site data for AI training. But they are free-scraping them for it if not blocked from doing so, just like OpenAI and Google will in the absence of a licensing agreement.

EDIT: After Google and Meta signed deals with Shutterstock for training data, I see Apple followed in their wake and came to an agreement with them as well. So that's one. 

3 Likes · 0 Dislikes