Apple's Ferret LLM could help Siri understand the layout of apps on an iPhone's display, potentially expanding the capabilities of Apple's digital assistant.
Apple has been working on numerous machine learning and AI projects that it could tease at WWDC 2024. A just-released paper suggests that some of that work could give Siri the ability to understand what apps, and iOS itself, look like.
The paper, published to Cornell University's arXiv on Monday, is titled "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs." It describes a new multimodal large language model (MLLM) with the potential to understand the user interfaces of mobile displays.
The Ferret name originally comes from an open-source multimodal LLM released in October by researchers from Cornell University working with counterparts from Apple. At the time, Ferret was able to detect and understand different regions of an image for complex queries, such as identifying a species of animal in a selected part of a photograph.
An LLM advancement
The new paper for Ferret-UI explains that, while there have been noteworthy advancements in MLLM usage, they still "fall short in their ability to comprehend and interact effectively with user interface (UI) screens." Ferret-UI is described as a new MLLM tailored for understanding mobile UI screens, complete with "referring, grounding, and reasoning capabilities."
Part of the problem LLMs have in understanding a mobile interface is how the display is used in the first place. Phones are usually held in portrait orientation, which means icons and other details can occupy a very compact portion of the screen, making them difficult for machines to interpret.
To help with this, Ferret-UI has a magnification system that upscales images to "any resolution," making icons and text more readable.
For processing and training, Ferret-UI also divides the screen into two smaller sub-images, effectively cutting the screenshot in half. The paper states that other MLLMs tend to scan a single lower-resolution global image, which reduces their ability to adequately determine what icons look like.
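As a rough illustration of that split-and-upscale idea, here is a minimal sketch in Python using Pillow. The function name, the 336-pixel encoder input size, and the exact cropping logic are assumptions made for the example, not details taken from the paper.

```python
from PIL import Image

def build_views(screenshot: Image.Image, encoder_size: int = 336) -> list[Image.Image]:
    """Split a screenshot into two halves and upscale each half, alongside a
    low-resolution global view -- a rough approximation of the sub-image
    approach described for Ferret-UI (details here are assumed)."""
    width, height = screenshot.size
    if height >= width:
        # Portrait: cut horizontally into top and bottom halves.
        halves = [screenshot.crop((0, 0, width, height // 2)),
                  screenshot.crop((0, height // 2, width, height))]
    else:
        # Landscape: cut vertically into left and right halves.
        halves = [screenshot.crop((0, 0, width // 2, height)),
                  screenshot.crop((width // 2, 0, width, height))]

    # The global view preserves the overall layout; the upscaled halves give
    # small icons and text more pixels than a single downscaled image would.
    views = [screenshot.resize((encoder_size, encoder_size))]
    views += [half.resize((encoder_size, encoder_size)) for half in halves]
    return views
```

In the actual model each of these views would be passed through an image encoder, but the split alone shows why small UI elements become easier to resolve.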
Combined with significant curation of training data, the result is a model that can sufficiently understand user queries, recognize the nature of various on-screen elements, and offer contextual responses.
For example, a user could ask how to open the Reminders app and be told to tap the on-screen Open button. A further query asking whether a 15-year-old could use an app would check the age guidelines, if they're visible on the display.
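To make the referring-and-grounding idea concrete, here is a hypothetical query-and-response pair sketched in Python. The phrasing, field names, and bounding-box coordinates are invented for illustration and are not the paper's actual output format.

```python
# Hypothetical example of a grounded UI query. The answer points to a
# specific on-screen element, identified by an invented bounding box
# in screen-pixel coordinates (left, top, right, bottom).
query = "How do I open the Reminders app shown on this screen?"

response = {
    "answer": "Tap the 'Open' button next to the Reminders listing.",
    "grounded_element": {
        "label": "Open button",
        "bbox": [612, 884, 736, 948],
    },
}
```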
An assistive assistant
While we don't know whether it will be incorporated into systems like Siri, Ferret-UI opens the door to advanced control over a device like an iPhone. By understanding user interface elements, it raises the possibility of Siri performing actions for users within apps, selecting graphical elements in the app on its own.
There are also useful applications for the visually impaired. Such an LLM could be more capable of explaining in detail what is on screen, and could potentially carry out actions for the user without them needing to do anything more than ask.
Comments
This technology could be a great boon to the app review process and, in turn, our security and satisfaction with 3rd party apps.
This starts to get at where I think Apple's machine learning/artificial intelligence and Siri are going.
Despite it being front-and-center in the public consciousness for the past year, AI as currently implemented is a hot mess of questionable utility, privacy and security, and based on petabytes of stolen data and intellectual property. While everyone touts how cool their AI is and the peanut gallery throws shade at Apple for being late to the party, moving in "late" to supplant technological hot messes with something well thought out and useful is actually Apple's sweet spot.

The Ferret LLM described above could result in users being able to make voice commands that require complex interactions with apps on their device to yield a desired result. If Siri is able to interface with and read from on-device apps, this does two important things. First, it eliminates a requirement for special code within the apps to allow for things like the current Shortcuts app to drive certain tasks... if the user can figure out how to make Shortcuts do it properly. Second, it allows the digital assistant to carry out functions and draw information from sources the user already has legitimate permission to access.
Legally and functionally it would be very much like handing a human PA your iPhone and then asking him or her to use it to carry out various tasks on your behalf.
Imagine waking in the morning and asking Siri to implement various smart home functions based on current conditions like the weather and what you've got on your schedule for the day. Then imagine asking Siri what the morning news is, and it pulls information from your Apple News subscription, along with other news sources to which you have subscribed or otherwise have access, and verbally gives you a news summary, citing each source. Then you ask Siri to bookmark a few of the source articles so you can read them during breakfast. You could ask Siri to order lunch, or flowers, or an Uber, and it simply interfaces with on-device apps and accounts on your behalf. Apple could implement this sort of thing without running roughshod over copyrights, and without selling out the user's privacy and security.
This could be how Apple once again enters a space seemingly late, but then implements its vision of that thing so well that others must regroup and scramble to catch up.
So Apple's decades-long efforts and incremental improvements to accessibility will pay big dividends for AI, as they help the ML model learn how apps work.
It makes you wonder if that might have been a reason for pushing hard in that space all along. You know, not just to be nice or sell lots of hardware.