Apple's Ferret LLM could help Siri understand the layout of apps on an iPhone's display, potentially expanding the capabilities of Apple's digital assistant.
A ferret in the wild [Pixabay/Michael Sehlmeyer]
Apple has been working on numerous machine learning and AI projects that it could tease at WWDC 2024. A newly released paper now suggests that some of that work could enable Siri to understand what apps, and iOS itself, look like.
The paper, released by Cornell University on Monday, is titled "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs." It essentially explains a new multimodal large language model (MLLM) that has the potential to understand the user interfaces of mobile displays.
The Ferret name first surfaced with an open-source multimodal LLM released in October by researchers from Cornell University working with counterparts at Apple. At the time, Ferret could detect and understand different regions of an image for complex queries, such as identifying a species of animal in a selected part of a photograph.
An LLM advancement
The new paper for Ferret-UI explains that, while there have been noteworthy advancements in MLLM usage, they still "fall short in their ability to comprehend and interact effectively with user interface (UI) screens." Ferret-UI is described as a new MLLM tailored for understanding mobile UI screens, complete with "referring, grounding, and reasoning capabilities."
Part of the problem LLMs have in understanding a mobile display's interface is how the device gets used in the first place. Since screens are typically held in portrait orientation, icons and other details often occupy a very small portion of the display, making them difficult for machines to interpret.
To help with this, Ferret-UI has a magnification system that upscales images to "any resolution," making icons and text more readable.
An example of Ferret-UI analyzing an iPhone's display
For processing and training, Ferret-UI also divides the screen into two smaller sub-images, cutting it in half. The paper states that other LLMs tend to scan a lower-resolution global image, which reduces their ability to adequately determine what icons look like.
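As a rough illustration of that preprocessing idea, the sketch below shows how a screenshot might be turned into a low-resolution global view plus two upscaled half-screen views. This is a minimal Python/Pillow sketch based only on the description above; the function names, image sizes, and split logic are assumptions for illustration, not Apple's actual code.

```python
# Illustrative sketch of the "any resolution" preprocessing described in the
# Ferret-UI paper: keep a downscaled global view of the screen, then split the
# screenshot into two halves and upscale each so small icons and text remain
# legible to the model. Names and sizes here are hypothetical.
from PIL import Image


def preprocess_screenshot(path, global_size=(336, 336), sub_size=(336, 336)):
    screen = Image.open(path)
    width, height = screen.size

    # Low-resolution global image, as earlier MLLMs would use on its own.
    global_view = screen.resize(global_size)

    # Split along the longer edge so each half retains more detail.
    if height >= width:  # portrait: top and bottom halves
        halves = [screen.crop((0, 0, width, height // 2)),
                  screen.crop((0, height // 2, width, height))]
    else:                # landscape: left and right halves
        halves = [screen.crop((0, 0, width // 2, height)),
                  screen.crop((width // 2, 0, width, height))]

    # Magnify each half up to the model's input resolution.
    sub_views = [half.resize(sub_size) for half in halves]

    return [global_view] + sub_views
```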
Combined with significant curation of training data, the result is a model that can understand user queries, recognize the nature of various on-screen elements, and offer contextual responses.
For example, a user could ask how to open the Reminders app and be told to tap the on-screen Open button. A further query asking whether a 15-year-old could use an app could prompt the model to check age guidelines, if they are visible on the display.
An assistive assistant
While it is not known whether Ferret-UI will be incorporated into systems like Siri, it raises the possibility of advanced control over a device like an iPhone. By understanding user interface elements, it could allow Siri to perform actions for users within apps, selecting graphical elements on its own.
There are also useful applications for the visually impaired. Such an LLM could explain what is on screen in greater detail, and potentially carry out actions for the user without them needing to do anything but ask.