Framework for Self-Supervised Learning to Design Stylized 3D Avatars Using a Combination of Continuous and Discrete Parameters

A visually appealing, animatable 3D avatar is a key entry point into the digital world, which is increasingly prevalent in modern life for socializing, shopping, gaming, and other activities. A good avatar should be attractive and customized to match the user’s appearance. Many well-known avatar systems, such as Zepeto and ReadyPlayer, use cartoonized and stylized looks because they are fun and user-friendly. However, choosing and adjusting an avatar by hand typically requires painstaking changes to many graphic elements, which is both time-consuming and challenging for novice users. In this work, the researchers investigate the automated generation of stylized 3D avatars from a single front-facing selfie.

Specifically, given a selfie image, their algorithm predicts an avatar vector: the complete configuration a graphics engine needs to generate a 3D avatar and render avatar images from predefined 3D assets. The avatar vector consists of parameters specific to the predefined assets, which can be either continuous (e.g., head length) or discrete (e.g., hair types). A naive solution is to annotate a set of selfie images and train a model to predict the avatar vector via supervised learning. However, large-scale annotations are needed to handle a large range of assets (usually in the hundreds). To reduce the annotation cost, self-supervised approaches have been proposed that train a differentiable imitator to replicate the graphics engine’s renderings, so that the rendered avatar image can be matched automatically with the selfie image using various identity and semantic segmentation losses.
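To make the idea of an avatar vector concrete, here is a minimal sketch of one that mixes continuous and discrete parameters. The parameter names, ranges, and asset counts (`head_length`, `eye_spacing`, `hair_type`, `glasses_type`) are hypothetical and chosen only for illustration; the actual engine's asset set is not described here.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical asset catalogue; names, ranges, and counts are illustrative only.
CONTINUOUS_RANGES = {"head_length": (0.0, 1.0), "eye_spacing": (0.0, 1.0)}
DISCRETE_CHOICES = {"hair_type": 4, "glasses_type": 3}  # number of options per asset


@dataclass
class AvatarVector:
    continuous: Dict[str, float]  # parameter name -> value within its range
    discrete: Dict[str, int]      # parameter name -> index of the chosen asset

    def validate(self) -> None:
        """Raise ValueError if any parameter falls outside the catalogue."""
        for name, (low, high) in CONTINUOUS_RANGES.items():
            if not low <= self.continuous[name] <= high:
                raise ValueError(f"{name} out of range")
        for name, count in DISCRETE_CHOICES.items():
            if not 0 <= self.discrete[name] < count:
                raise ValueError(f"{name} is not a valid asset index")

    def flatten(self) -> List[float]:
        """Continuous values followed by a one-hot block per discrete parameter."""
        self.validate()
        flat = [self.continuous[name] for name in CONTINUOUS_RANGES]
        for name, count in DISCRETE_CHOICES.items():
            one_hot = [0.0] * count
            one_hot[self.discrete[name]] = 1.0
            flat.extend(one_hot)
        return flat


example = AvatarVector(
    continuous={"head_length": 0.4, "eye_spacing": 0.7},
    discrete={"hair_type": 2, "glasses_type": 0},
)
flat = example.flatten()  # 2 continuous values + 4 + 3 one-hot entries
```

With hundreds of assets, the flattened vector grows quickly, which gives a sense of why annotating such vectors by hand for supervised training would be expensive.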

Their novel framework consists of three stages: Portrait Stylization, Self-supervised Avatar Parameterization, and Avatar Vector Conversion. As Fig. 1 shows, identity information (hairstyle, skin tone, eyeglasses, etc.) is retained throughout the pipeline while the domain gap is gradually closed across the three stages. The Portrait Stylization stage first addresses the 2D domain crossing from real to stylized visual appearance. This stage stays in image space, rendering the input selfie as a stylized avatar image. A naive application of existing stylization techniques would retain factors such as expression, which would unnecessarily complicate the pipeline’s later stages.

Figure 1

As a result, they developed a modified version of AgileGAN that enforces a uniform expression while preserving the user’s identity. The Self-Supervised Avatar Parameterization stage then handles the crossing from the pixel-based image to the vector-based avatar. They found that strictly enforcing parameter discreteness prevents the optimization from converging. To overcome this, they adopt a relaxed formulation called a relaxed avatar vector, in which discrete parameters are encoded as continuous one-hot vectors. To make training differentiable, they trained an imitator to mimic the non-differentiable engine. In the Avatar Vector Conversion stage, all discrete parameters are converted to one-hot vectors, crossing the domain from the relaxed avatar vector space to the strict avatar vector space. The graphics engine can then construct and render the final avatars from the strict avatar vector. For this conversion they use a novel search technique that produces better results than direct quantization. To assess how well their method preserves personal identity, they conduct a human preference study and compare the results with baseline approaches such as F2P and with manual creation. Their results score substantially higher than the baseline techniques and come close to those of manual creation.
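The difference between direct quantization and a search-based conversion can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the relaxed vector is modeled as a softmax over per-option scores, the search is a simple greedy pass over one parameter at a time, and `toy_loss` stands in for whatever image-matching loss the real pipeline evaluates. The article itself says only that a search outperforms direct quantization.

```python
import math
from typing import Callable, Dict, List

Relaxed = Dict[str, List[float]]  # parameter name -> soft weights over options
Strict = Dict[str, int]           # parameter name -> single chosen option


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into a continuous relaxation of a one-hot vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(relaxed: Relaxed) -> Strict:
    """Direct quantization: keep the highest-weighted option per parameter."""
    return {name: max(range(len(w)), key=w.__getitem__) for name, w in relaxed.items()}


def search(relaxed: Relaxed, loss_fn: Callable[[Strict], float]) -> Strict:
    """Greedy search (assumed strategy): start from the quantized vector and,
    for each parameter in turn, keep the option that gives the lowest loss."""
    strict = quantize(relaxed)
    for name, weights in relaxed.items():
        best_idx, best_loss = strict[name], loss_fn(strict)
        for idx in range(len(weights)):
            candidate_loss = loss_fn({**strict, name: idx})
            if candidate_loss < best_loss:
                best_idx, best_loss = idx, candidate_loss
        strict[name] = best_idx
    return strict


# Toy stand-in for an image-matching loss: count mismatches against a
# hypothetical "best" configuration.
TARGET = {"hair_type": 1, "glasses_type": 0}


def toy_loss(strict: Strict) -> float:
    return float(sum(strict[name] != TARGET[name] for name in TARGET))


relaxed = {
    "hair_type": softmax([1.0, 0.9, -1.0]),   # options 0 and 1 are nearly tied
    "glasses_type": softmax([2.0, 0.0, 0.0]),
}
quantized = quantize(relaxed)
searched = search(relaxed, toy_loss)
```

In this toy case the two highest hair options are nearly tied in the relaxed vector, so direct quantization picks option 0 while the search, which checks each candidate against the loss, settles on option 1.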

They also provide an ablation study to support their pipeline’s design decisions. In brief, their technical contributions are:

- A novel self-supervised learning framework to produce high-quality stylized 3D avatars with a combination of continuous and discrete parameters.

- A novel method to bridge the substantial style domain gap in the creation of stylized 3D avatars using portrait stylization.

- A cascaded relaxation and search pipeline to address the convergence problem in discrete avatar parameter optimization.

You can find a video demonstration of the paper on their site.