
Apple isn't standing still on generative AI, and making human models dance is proof

Apple has released a research paper discussing what it calls HUGS, a generative AI technique that can create a digital human avatar from a brief video in about 30 minutes of training.

Released via Apple's Machine Learning Research page and shared by Apple researcher Anurag Ranjan on X, "HUGS: Human Gaussian Splats" describes techniques for creating digital avatars of humans. Using machine learning and computer vision, the paper details a process that builds those avatars from relatively little source material.

Current neural rendering techniques are a marked improvement over earlier versions, but they are still best suited for "photogrammetry of static scenes and do not generalize well to freely moving humans in the environment," the paper's introduction explains.

Human Gaussian Splats (HUGS) uses a technique called 3D Gaussian Splatting to create an animatable human within a scene.
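For readers unfamiliar with the representation, the sketch below shows the per-primitive parameters a 3D Gaussian Splatting scene typically carries: a center, an anisotropic covariance factored into rotation and scale, an opacity, and a color. It is a toy illustration with invented names, not code from the paper.

```python
# Toy sketch of a single 3D Gaussian "splat" and its density falloff.
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    mean: np.ndarray      # (3,) center of the splat in world space
    rotation: np.ndarray  # (3, 3) rotation matrix for the principal axes
    scale: np.ndarray     # (3,) standard deviation along each axis
    opacity: float        # base alpha in [0, 1]
    color: np.ndarray     # (3,) RGB; the real method stores spherical harmonics

    def covariance(self) -> np.ndarray:
        # Sigma = R S S^T R^T, which keeps the matrix positive semi-definite.
        S = np.diag(self.scale)
        return self.rotation @ S @ S.T @ self.rotation.T

    def falloff(self, x: np.ndarray) -> float:
        # Unnormalized Gaussian weight at point x, used to modulate opacity.
        d = x - self.mean
        return float(np.exp(-0.5 * d @ np.linalg.inv(self.covariance()) @ d))

# Rendering sorts the splats front-to-back per ray and alpha-composites them;
# that step is omitted here for brevity.
splat = GaussianSplat(np.zeros(3), np.eye(3), np.array([0.1, 0.1, 0.2]), 0.8, np.ones(3))
print(splat.falloff(np.array([0.05, 0.0, 0.0])))
```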

The method itself requires only a small amount of video of the subject, typically in motion within a scene and showing as many surfaces as possible for the system to work from. The technique can work from very short clips: in some cases, monocular video with as few as 50 to 100 frames, equating to roughly two to four seconds of 24fps footage.
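As a quick sanity check on those numbers, the arithmetic relating clip length to frame count at 24fps works out as follows; the helper below is purely illustrative.

```python
# Illustrative arithmetic only: mapping clip duration to frame count at 24fps.
def frame_count(duration_s: float, fps: int = 24) -> int:
    return round(duration_s * fps)

for seconds in (2, 3, 4):
    print(f"{seconds}s at 24fps -> {frame_count(seconds)} frames")
# 2s -> 48 frames and 4s -> 96 frames, in line with the 50-to-100-frame
# range quoted above.
```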

The system has been trained to "disentangle the static scene and a fully-animatable human avatar within 30 minutes," Apple claims.

While the SMPL body model is used to initialize the human Gaussians, it cannot capture every detail. The Gaussians are therefore allowed to deviate from the SMPL surface, filling in elements the body model doesn't represent, such as clothing and hair.
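As a rough sketch of that initialization idea, the Gaussian centers can start on the SMPL template surface (6,890 vertices in the standard model), with a learnable offset term that lets them drift to cover geometry the body model does not represent; the names and PyTorch parameterization below are assumptions, not the paper's code.

```python
# Hedged sketch: Gaussian centers initialized on the SMPL body surface, plus a
# learnable offset that lets the avatar deviate from the template for detail
# SMPL cannot express, such as clothing and hair.
import torch

smpl_vertices = torch.rand(6890, 3)  # placeholder for posed SMPL template vertices
offsets = torch.nn.Parameter(torch.zeros_like(smpl_vertices))  # learned deviation

def gaussian_centers() -> torch.Tensor:
    # During optimization, a photometric loss can push `offsets` away from zero
    # wherever the video shows geometry the SMPL template does not model.
    return smpl_vertices + offsets

print(gaussian_centers().shape)  # torch.Size([6890, 3])
```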

The paper also proposes optimizing the linear blend skinning weights that govern how the Gaussians move with the underlying skeleton during animation, improving the appearance of the deformed model.
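Linear blend skinning deforms each point by a weighted combination of the skeleton's joint transforms. The sketch below shows one plausible way to make those per-Gaussian weights learnable so they can be optimized alongside the splats; the shapes, names, and softmax parameterization are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of learnable linear blend skinning weights (not the paper's code).
import torch

num_gaussians, num_joints = 6890, 24            # 24 joints, as in the SMPL skeleton
rest_centers = torch.rand(num_gaussians, 3)     # Gaussian centers in the rest pose
weight_logits = torch.nn.Parameter(torch.zeros(num_gaussians, num_joints))

def skin(joint_transforms: torch.Tensor) -> torch.Tensor:
    """Deform every Gaussian center by a convex mix of rigid joint transforms.

    joint_transforms: (num_joints, 4, 4) transforms for the current pose.
    """
    weights = torch.softmax(weight_logits, dim=-1)                      # (G, J), rows sum to 1
    homo = torch.cat([rest_centers, torch.ones(num_gaussians, 1)], -1)  # (G, 4) homogeneous coords
    blended = torch.einsum("gj,jab->gab", weights, joint_transforms)    # (G, 4, 4) per-Gaussian transform
    posed = torch.einsum("gab,gb->ga", blended, homo)                   # (G, 4) transformed points
    return posed[:, :3]

identity_pose = torch.eye(4).expand(num_joints, 4, 4)
print(skin(identity_pose).shape)  # torch.Size([6890, 3])
```

Optimizing the skinning weights jointly with the splat parameters is what lets the avatar's surface follow the skeleton cleanly when the model is animated.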

In the end, the time from training video to a "state-of-the-art rendering quality" animation of both the human model and the scene, rendered at 60fps at HD resolution, is about half an hour. Apple claims this is roughly 100 times faster than other methods, including NeuMan and Vid2Avatar.

The research paper lists its authors as Muhammed Kocabas, Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan, and was produced in collaboration with the Max Planck Institute for Intelligent Systems.

Apple has been working on the idea of creating digital avatars for quite some time, with a high-detail version appearing in the Apple Vision Pro. To enable FaceTime conversations, as well as an external view of the user's eyes, the headset creates a digital "Persona," which is used in various ways to represent the user.



6 Comments

Marvin 19 Years · 15361 comments


This will help a lot in making 3D avatars quickly at high quality. They can do lighting control so the avatar will blend better with the virtual environment:

https://www.youtube.com/watch?v=1V85241UJmg

https://www.youtube.com/watch?v=s6Lz-qjs_mA

These don't have the uncanny valley feeling that CGI avatars do. Animating them (especially faces) can still be difficult but Meta's ones look good with animation.

3 Likes · 0 Dislikes
ddawson100 17 Years · 539 comments

Look, I have no idea how to assess the technical merits of this but it looks absolutely stunning and if training is that quick then 2024 is going to be an interesting year.

3 Likes · 0 Dislikes
9secondkox2 9 Years · 3188 comments

Sheesh that’s amazing. 

Sucks for those who’ve spent much of their lives getting good at these things. 


2 Likes · 0 Dislikes
FileMakerFeller 7 Years · 1561 comments

Is the next technique going to be called KISSES?

1 Like · 0 Dislikes
chasm 11 Years · 3641 comments

We’ve actually already seen an earlier version of HUGS in the demo of the digital persona on a FaceTime call when the user has on their Apple Vision Pro.

That demo in and of itself was impressive, now imagine the months that have been spent improving it since June.