How to Convert Verbal Descriptions Into Expressive 3D Avatars

The development of large language models and diffusion models has paved the way for fusing text-to-image models with differentiable neural 3D scene representations, the best-known examples of which are DeepSDF, NeRF, and DMTet. These have enabled the creation of 3D models from textual descriptions alone. Despite this progress in the Artificial Intelligence community, generated objects and characters frequently fall short in shape and texture, failing to produce realistic, high-quality 3D avatars. These characters may also not fit within conventional computer graphics workflows.

In recent research, a team of researchers has introduced TADA (Text to Animatable Digital Avatars), a simple but powerful method for converting verbal descriptions into expressive 3D avatars with striking geometry and realistic texture. These avatars are visually pleasing and can be animated with traditional graphics pipelines. Existing techniques for generating characters from text suffer from low geometry and texture quality, and they struggle to animate realistically because of misalignment between geometry and texture, especially on the face. TADA addresses these issues by forming a potent synergy between a 2D diffusion model and a parametric body model.

A sophisticated avatar representation is central to TADA. The team augments the SMPL-X body model with a displacement layer and a texture map, producing a high-resolution version of SMPL-X that can capture finer shapes and textures. A hierarchical rendering approach, combined with score distillation sampling (SDS), is then used to create complex, high-quality 3D avatars from textual input, ensuring the avatars carry detailed, comprehensive features.
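As a rough, hypothetical illustration of such a representation (not the paper's actual code), a base body mesh can be upsampled by midpoint subdivision and then refined with a learnable per-vertex displacement along the vertex normal:

```python
# Illustrative sketch only: a stand-in for upsampling an SMPL-X-like mesh
# and applying a displacement layer. Function names and shapes are assumptions.

def subdivide(vertices, faces):
    """One round of midpoint subdivision: each triangle becomes four."""
    verts = list(vertices)
    cache = {}  # dedupe shared-edge midpoints

    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            (ax, ay, az), (bx, by, bz) = verts[i], verts[j]
            verts.append(((ax + bx) / 2, (ay + by) / 2, (az + bz) / 2))
            cache[key] = len(verts) - 1
        return cache[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return verts, new_faces


def displace(vertices, normals, offsets):
    """Move each vertex along its (unit) normal by a learnable scalar offset."""
    return [(x + d * nx, y + d * ny, z + d * nz)
            for (x, y, z), (nx, ny, nz), d in zip(vertices, normals, offsets)]
```

In the real system, per-vertex displacements and the texture map are the parameters being optimized; the upsampled mesh simply gives regions like the face enough resolution to carry fine detail.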

To align the avatars’ geometry and texture, the team uses the latent embeddings of the generated character’s rendered normal and RGB images throughout the SDS optimization process. This alignment strategy eliminates the misalignment problems that plagued earlier techniques, especially in the facial region. In addition, multiple facial expressions are sampled during optimization to keep the characters’ expressions and semantics consistent. This ensures the final avatars retain the semantic integrity of the original SMPL-X model, allowing for realistic and naturally aligned animation.
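For reference, SDS was introduced in DreamFusion, and its gradient is commonly written as follows (a standard formulation; TADA's exact weighting and inputs may differ):

$$
\nabla_\theta \mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[\, w(t)\,\big(\hat\epsilon_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
$$

Here $x$ is a rendered image of the avatar (in TADA's case, both normal and RGB renderings), $x_t$ its noised version at timestep $t$, $y$ the text prompt, $\epsilon$ the injected noise, $\hat\epsilon_\phi$ the diffusion model's noise prediction, $w(t)$ a timestep weighting, and $\theta$ the avatar parameters being optimized.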

TADA is built on a technique called Score Distillation Sampling (SDS). The primary contributions are as follows:

  1. Hierarchical Optimization with Hybrid Mesh Representation, which allows for high-quality details, especially on the face.
  2. Consistent Alignment of Geometry and Texture, using an optimization process that deforms the generated character using predefined SMPL-X body poses and facial expressions.
  3. Semantic Consistency and Animation, ensuring that the generated character maintains semantic consistency with SMPL-X, allowing easy and accurate animation.

The team has performed both qualitative and quantitative evaluations showing that TADA outperforms the alternatives. Its capabilities go beyond producing individual avatars: it enables the large-scale creation of digital characters suitable for both animation and rendering, and it supports text-guided editing, giving users considerable power and customization.