Pose, gaze, facial expression, hand gestures, and other non-verbal signals, collectively called "body language," have long been the subject of academic investigation. Accurately recording, interpreting, and synthesizing these signals can greatly enhance the realism of avatars in telepresence, augmented reality (AR), and virtual reality (VR) settings.
Existing state-of-the-art avatar models, such as those in the SMPL family, can accurately depict diverse human body shapes in realistic poses. However, they are constrained by their mesh-based representations and the resolution of the underlying 3D mesh. Moreover, such models typically simulate only minimally clothed bodies and do not capture clothing or hair, which reduces the realism of the results.
Researchers at ETH Zurich and Microsoft introduce X-Avatar, an expressive implicit human avatar model designed to capture the complete range of human expression in digital avatars for realistic telepresence, AR, and VR environments. X-Avatar captures high-fidelity body and hand movements, facial expressions, and appearance details. The method can learn from either full 3D scans or RGB-D data, producing comprehensive models of bodies, hands, facial expressions, and appearance.
The researchers propose a part-aware learned forward skinning module controlled by the SMPL-X parameter space, enabling expressive animation of X-Avatars. To train the neural shape and deformation fields effectively, they present novel part-aware sampling and initialization strategies. To capture the avatar's appearance with high-frequency detail, they augment the geometry and deformation fields with a texture network conditioned on pose, facial expression, geometry, and the normals of the deformed surface. This yields higher-fidelity results, particularly for smaller body parts, while keeping training efficient despite the increased number of articulated bones. Empirically, the approach achieves superior quantitative and qualitative results on the animation task compared to strong baselines in both data domains.
To aid future research on expressive avatars, the researchers also present a new dataset, dubbed X-Humans, with 233 sequences of high-quality textured scans from 20 subjects, totaling 35,427 frames. X-Avatar proposes a human model characterized by articulated neural implicit surfaces that accommodate the diverse topology of clothed people and achieve improved geometric resolution and increased appearance fidelity. The authors define three distinct neural fields: one modeling geometry with an implicit occupancy network, one modeling deformation via learned forward linear blend skinning (LBS) with continuous skinning weights, and one modeling appearance as an RGB color value.
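The three-field design can be illustrated with a toy numpy sketch. This is not the authors' implementation: the network sizes, the bone count, and the simplified appearance conditioning (point plus normal only) are assumptions for illustration; the real model conditions appearance on pose, expression, and geometry as well.

```python
import numpy as np

def mlp(x, layers):
    """Tiny MLP: linear layers with ReLU between them, linear output."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU
    return x

rng = np.random.default_rng(0)

def init(sizes):
    """Random small weights for a stack of linear layers."""
    return [(rng.standard_normal((a, b)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

N_BONES = 5  # illustrative only; the full model articulates many more joints

occ_net = init([3, 32, 1])         # canonical point -> occupancy logit
lbs_net = init([3, 32, N_BONES])   # canonical point -> skinning-weight logits
rgb_net = init([3 + 3, 32, 3])     # point + surface normal -> RGB (simplified)

x_c = rng.standard_normal((4, 3))            # canonical query points
occ = 1 / (1 + np.exp(-mlp(x_c, occ_net)))   # occupancy in (0, 1)
w = np.exp(mlp(x_c, lbs_net))
w = w / w.sum(axis=1, keepdims=True)         # continuous skinning weights, rows sum to 1
normals = rng.standard_normal((4, 3))
rgb = 1 / (1 + np.exp(-mlp(np.concatenate([x_c, normals], axis=1), rgb_net)))
```

The occupancy field defines the surface as a level set, the skinning field ties each canonical point to the skeleton, and the appearance field colors the deformed surface.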
X-Avatar accepts either a posed 3D scan or an RGB-D image as input. Its design incorporates a shape network that models geometry in canonical space and a deformation network that uses learned linear blend skinning (LBS) to establish correspondences between canonical and deformed space.
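The core of such a deformation step, forward LBS, can be sketched as follows. This is a minimal sketch of the standard LBS formula, not the paper's code; in X-Avatar the per-point weights come from the learned skinning field rather than being given directly.

```python
import numpy as np

def forward_lbs(x_c, weights, bone_transforms):
    """Deform canonical points by a convex combination of bone transforms.

    x_c:             (N, 3) points in canonical space
    weights:         (N, B) skinning weights, each row summing to 1
    bone_transforms: (B, 4, 4) rigid transform per bone
    """
    x_h = np.concatenate([x_c, np.ones((len(x_c), 1))], axis=1)  # homogeneous coords
    # Blend the bone transforms per point, then apply the blended transform.
    T = np.einsum('nb,bij->nij', weights, bone_transforms)       # (N, 4, 4)
    x_d = np.einsum('nij,nj->ni', T, x_h)
    return x_d[:, :3]
```

With identity bone transforms the points stay in place; a translated bone drags the points assigned to it, weighted by its skinning weight.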
The researchers begin with the parameter space of SMPL-X, an extension of SMPL that captures the shape, pose, and deformations of full-body humans, with particular attention to hand poses and facial expressions, to produce expressive and controllable human avatars. A human model described by articulated neural implicit surfaces represents the varied topology of clothed people, while a part-aware initialization strategy considerably enhances the realism of the results by raising the sample rate for smaller body parts.
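The idea of raising the sample rate for smaller parts can be illustrated with a toy sampler. The part names, pool sizes, and boost factors below are hypothetical; the paper's actual sampling is tied to the SMPL-X part segmentation and its own ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-part point pools (e.g., scan points segmented by body part).
parts = {
    "body":  rng.standard_normal((10000, 3)),
    "hands": rng.standard_normal((500, 3)),
    "face":  rng.standard_normal((800, 3)),
}

def part_aware_sample(parts, n_total, boost=None):
    """Draw training points with small parts oversampled relative to their size."""
    boost = boost or {"hands": 10.0, "face": 5.0}  # illustrative boost factors
    names = list(parts)
    sizes = np.array([len(parts[p]) for p in names], dtype=float)
    w = sizes * np.array([boost.get(p, 1.0) for p in names])
    w /= w.sum()
    counts = np.floor(w * n_total).astype(int)
    counts[0] += n_total - counts.sum()  # assign the rounding remainder
    samples = [parts[p][rng.integers(0, len(parts[p]), c)]
               for p, c in zip(names, counts)]
    return np.concatenate(samples)

pts = part_aware_sample(parts, 2048)
```

Under uniform sampling, hands would receive only about 4% of the samples here; the boost raises their share severalfold, which is the intuition behind the improved fidelity on small body parts.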
The results show that X-Avatar can faithfully capture human body and hand poses as well as facial expressions and appearance, enabling the creation of more expressive and lifelike avatars. The team hopes the method will inspire further research on giving AIs more personality.
- High-quality textured scans with SMPL[-X] registrations
- 20 subjects, 233 sequences, 35,427 frames
- Body pose + hand gesture + facial expression
- A wide range of clothing, hairstyles, and ages
- X-Avatars can be trained in several ways.
- Top: training from 3D scans. Bottom: avatars driven by test poses.
- Top: training from RGB-D data. Bottom: avatars driven by test poses; quality is somewhat lower than with 3D scans.
- On the animation task, the approach recovers finer hand articulation and facial expressions than other baselines. X-Avatars can also be animated with motions recovered by PyMAF-X from monocular RGB videos.
X-Avatar has difficulty modeling loose clothing such as off-the-shoulder tops or skirts. Moreover, the researchers train only a single model per subject, so the method's capacity to generalize across individuals remains limited.
- X-Avatar is the first expressive implicit human avatar model that holistically captures body pose, hand pose, facial expressions, and appearance.
- Initialization and sampling procedures that account for the underlying body structure boost output quality while maintaining training efficiency.
- X-Humans is a new dataset of 233 sequences totaling 35,427 frames of high-quality textured scans of 20 people displaying a wide range of body and hand motions and facial expressions.
X-Avatar is unrivaled in capturing body pose, hand pose, facial expressions, and overall appearance. Using the newly released X-Humans dataset, the researchers have demonstrated the method's superior performance on the animation task compared to strong baselines.