Apple is being sued by authors who claim their works were used to train the large language models behind Apple Intelligence, in a lawsuit that echoes the copyright claims that recently cost Anthropic dearly.

Apple has endeavored to be ethical in training the artificial intelligence models used for Apple Intelligence and other features in its operating systems. Despite going to great lengths to do things right, it has still become the target of a copyright lawsuit.

A proposed class action lawsuit filed by authors Grady Hendrix and Jennifer Roberson accuses Apple of using their copyrighted works to train its AI systems, reports Reuters. Filed on Friday in the U.S. District Court for the Northern District of California, the lawsuit says Apple knowingly used a dataset built from pirated works.

The suit hinges on whether Apple used the dataset known as "Books3." It alleges that Books3 was assembled from the contents of a "shadow library" website called Bibliotik, which allegedly hosted pirated copies of thousands of books.

The dataset was available on Hugging Face before being removed in October 2023, and it was also included in the RedPajama dataset. RedPajama was used to train OpenELM, the open-source models Apple released in 2024.

Because Apple used a dataset connected to pirated books for OpenELM, the suit argues that Apple likely used the same sources to train its Foundation Language Models.

The suit also alleges that Apple has made no attempt to compensate authors for the contents of the books.

The suit demands a trial by jury, and asks that Apple pay statutory and compensatory damages, restitution, and attorney's fees, and that it destroy Apple Intelligence and any other LLMs trained on the datasets.

A careful approach to training

The lawsuit has many parallels with another case involving AI training and piracy. In September, Anthropic agreed to pay $1.5 billion to authors to settle claims that it used pirated books to train its models.

This new lawsuit doesn't accuse Apple of directly pirating content itself, but of using a dataset that the suit alleges has questionable origins.

Apple has been outspoken about training its models as ethically as it can, and about securing legitimate data sources for that training.

Previously, Apple has offered publishers millions of dollars for access to their publications as training data. It also struck a deal with Shutterstock in 2024 to license millions of images, again for training purposes.

In July, Apple doubled down on its claims of ethical sourcing, including for content accessible on the open internet. In a research paper, it explained that if a publisher doesn't agree to its data being scraped for training, Apple will not scrape that content.

That includes adhering to the limitations outlined in robots.txt, a convention that not every AI company abides by.
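As a rough illustration of how that opt-out works in practice, a crawler that honors robots.txt checks the file's per-agent rules before fetching anything. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and URLs are hypothetical examples, though "Applebot-Extended" is the token Apple documents for publishers who want to opt out of AI training specifically.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to block AI-training
# crawls while still allowing ordinary crawlers.
robots_txt = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# A well-behaved crawler consults can_fetch() before downloading a page.
print(parser.can_fetch("Applebot-Extended", "https://example.com/article"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))       # True
```

Crucially, robots.txt is purely advisory: nothing technically stops a scraper from ignoring it, which is why adherence is a matter of policy rather than enforcement.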