Researchers from Apple and Cornell University quietly released an open-source multimodal LLM in October, a research project called "Ferret" that can use regions of an image as part of a query.
A ferret in the wild [Pixabay/Michael Sehlmeyer]
The October debut on GitHub largely flew under the radar, with no announcement or fanfare. The code for Ferret was released alongside Ferret-Bench on October 30, with model checkpoints following on December 14.
While it didn't receive much attention at first, the release caught the notice of AI researchers on Saturday, reports VentureBeat. Bart De Witte, who runs an AI-in-medicine non-profit, posted to X about the "missed" release, calling it a "testament to Apple's commitment to impactful AI research."
Ferret is open-sourced under a non-commercial license, so it cannot be commercialized in its current state. However, there's always the possibility it could find its way into a future Apple product or service in some form.
An October tweet by Apple AI/ML research scientist Zhe Gan describes Ferret as a system that can "refer and ground anything anywhere at any granularity" in an image, and one that can do so using regions of any shape.
In simpler terms, the model can examine a region drawn on an image, identify the elements within it that are relevant to a user's query, and draw a bounding box around each detected element. It can then use those identified elements as context for the query and respond in typical conversational fashion.

For example, if a user highlights an animal in an image and asks the LLM what the animal is, the model could determine the creature's species and recognize that the user is referring to a single animal within a group. It could then use the context of other items detected in the image to offer further responses.
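To make that refer-and-ground flow concrete, here is a minimal sketch of what a region-based query could look like in code. It is purely illustrative: the class names, the `generate` method, the prompt format, and the response fields are all assumptions made for the sake of the example, not Ferret's actual API, which is documented in the project's GitHub repository.

```python
# Hypothetical sketch of a region-based "refer and ground" query.
# None of these names come from the Ferret repo; they only illustrate
# the flow described above: the user marks a free-form region, the model
# identifies what is inside it, grounds each element with a bounding
# box, and answers the question in context.

from dataclasses import dataclass


@dataclass
class Region:
    """A free-form region drawn on the image (here, a polygon)."""
    points: list[tuple[int, int]]  # (x, y) vertices in pixel coordinates


@dataclass
class Grounding:
    """One grounded element: a label plus the box the model drew for it."""
    label: str
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2)


def answer_region_query(model, image_path: str, region: Region, question: str):
    """Ask a question about a user-drawn region and get a grounded answer.

    `model` stands in for a loaded multimodal LLM; its `generate` method
    and the `text`/`groundings` fields on its response are assumed here.
    """
    # The region is passed alongside the text so the model knows which
    # part of the image the question refers to.
    response = model.generate(
        image=image_path,
        regions=[region],
        prompt=f"What is in the marked region? {question}",
    )

    # A grounded response pairs the free-text answer with a bounding box
    # for each element the model referenced while answering.
    answer: str = response.text
    boxes: list[Grounding] = response.groundings
    return answer, boxes
```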
Introducing Ferret, a new MLLM that can refer and ground anything anywhere at any granularity.
1. Ferret enables referring of an image region at any shape.
2. It often shows better precise understanding of small image regions than GPT-4V (sec 5.6).
-- Zhe Gan (@zhegan4) October 12, 2023
The release is significant to researchers because it shows Apple becoming more open with its AI work, a departure from its usual secretive stance.
There's also the issue of infrastructure. While Apple is working to increase the number of AI servers it owns, it may not currently have the scale to go toe-to-toe with ChatGPT, for example. Apple could partner with other firms to scale up its capabilities, but the other route is to do what it has just done: release an open-source model.
In one interesting detail from the GitHub release, Reddit's r/Apple spotted that Ferret is "trained on 8 A100 GPUs with 80GB memory." Given Apple's history with Nvidia GPU support, this was seen as a rare acknowledgment of the GPU maker.