Apple has released a research paper discussing what it calls HUGS, a generative AI technology that can create a digital human avatar from a brief video in about 30 minutes.
Released via Apple's Machine Learning Research page and shared by Apple researcher Anurag Ranjan on X, "HUGS: Human Gaussian Splats" describes techniques for creating digital avatars of humans. Drawing on machine learning and computer vision, the research details how such avatars can be built from relatively little source material.
Current neural rendering techniques are a marked improvement over earlier versions, but they are still best suited for "photogrammetry of static scenes and do not generalize well to freely moving humans in the environment," introductory paragraphs explain.
Human Gaussian Splats, or HUGS, uses a technique called 3D Gaussian Splatting to represent an animatable human within a scene.
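As a rough illustration of what that representation involves, the sketch below lists the kind of per-primitive parameters a 3D Gaussian Splatting model typically optimizes. It is a minimal sketch of the general technique, not Apple's code; the class and field names are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation) of the parameters
# each primitive in a 3D Gaussian Splatting scene typically carries.
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    mean: np.ndarray      # (3,) center of the Gaussian in world space
    scale: np.ndarray     # (3,) per-axis extent; with rotation, defines the covariance
    rotation: np.ndarray  # (4,) unit quaternion (w, x, y, z) orienting the Gaussian
    color: np.ndarray     # (3,) RGB (real systems often use spherical-harmonic coefficients)
    opacity: float        # blending weight used when splatting onto the image

def covariance(g: Gaussian3D) -> np.ndarray:
    """Covariance = R * S * S^T * R^T, built from the scale and rotation."""
    w, x, y, z = g.rotation
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(g.scale)
    return R @ S @ S.T @ R.T
```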
The method requires only a small amount of video of the subject, ideally moving within the scene so that as many surfaces as possible are visible to the system. In some cases it can work from very short monocular clips of as few as 50 to 100 frames, roughly two to four seconds of 24fps video.
Introducing HUGS: Human Gaussian Splats - capable of creating animatable (3DGS) avatars from a casual video (50-100 frames) in ~30 mins. Our avatars can easily be embedded into other (NeRF) scenes. (1/4)
-- Anurag Ranjan (@anuragranj) December 19, 2023
The system has been trained to "disentangle the static scene and a fully-animatable human avatar within 30 minutes," Apple claims.
The SMPL body model is used to initialize the human Gaussians, but it cannot capture every detail. The Gaussians are therefore allowed to deviate from the SMPL mesh to represent elements the body model doesn't cover, such as clothing and hair, filling in the gaps in what SMPL can express.
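A minimal sketch of how such an initialize-then-deviate scheme could look is below, assuming Gaussian centers placed at SMPL vertices plus a learnable offset per Gaussian; the function names and tensor shapes are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not Apple's code): start the human Gaussians on the
# SMPL body mesh, but let each one learn an offset so the representation can
# drift away from SMPL where the body model falls short (cloth, hair).
import torch

def init_human_gaussians(smpl_vertices: torch.Tensor):
    """smpl_vertices: (N, 3) vertex positions from a posed SMPL mesh."""
    means = smpl_vertices.clone()                           # start on the body surface
    offsets = torch.zeros_like(means, requires_grad=True)   # learned deviation from SMPL
    return means, offsets

def gaussian_centers(means: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    # Final centers = SMPL-derived positions plus whatever deviation the
    # training loss asks for (e.g. to cover a skirt or a ponytail).
    return means + offsets
```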
The paper also proposes optimizing linear blend skinning weights so that they coordinate the movement of the Gaussians during animation, improving how the model looks when it moves.
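The sketch below shows standard linear blend skinning with the per-point weights treated as learnable parameters, which is the general idea of optimizing skinning weights; it is an assumption-laden illustration, not the paper's implementation, and all names and shapes are hypothetical.

```python
# Illustrative sketch (not the paper's code) of linear blend skinning: each
# Gaussian center is moved by a weighted mix of bone transforms, and here the
# skinning weights themselves are parameters that training can refine.
import torch

def lbs(points: torch.Tensor, bone_transforms: torch.Tensor, weight_logits: torch.Tensor):
    """
    points:          (N, 3)    Gaussian centers in the rest pose
    bone_transforms: (B, 4, 4) rigid transform of each bone for the target pose
    weight_logits:   (N, B)    learnable; softmax gives per-point skinning weights
    """
    weights = torch.softmax(weight_logits, dim=-1)                        # (N, B), rows sum to 1
    homo = torch.cat([points, torch.ones(points.shape[0], 1)], dim=-1)    # (N, 4) homogeneous coords
    # Blend the bone transforms per point, then apply the blended transform.
    blended = torch.einsum("nb,bij->nij", weights, bone_transforms)       # (N, 4, 4)
    posed = torch.einsum("nij,nj->ni", blended, homo)                     # (N, 4)
    return posed[:, :3]
```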
In the end, going from the training video to a "state-of-the-art rendering quality" animation of the human model and the scene, rendered at 60fps at HD resolution, takes about half an hour. This is claimed to be roughly 100 times faster than other methods, including NeuMan and Vid2Avatar.
The research paper lists its authors as Muhammed Kocabas, Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan, and was produced in collaboration with the Max Planck Institute for Intelligent Systems.
Apple has been working on the idea of creating digital avatars for quite some time, with a highly detailed version appearing in the Apple Vision Pro. To enable FaceTime conversations, as well as an external view of the user's eyes, the headset creates a digital "Persona," which is used in various ways to represent the user.