How a Semi-Supervised Image-to-Image Translation Framework Performs in Automatic High-Quality Anime Scene Rendering


Anime sceneries require a great deal of artistic talent and time to create, so learning-based methods for automatic scene stylization have clear practical and economic value. Recent advances in Generative Adversarial Networks (GANs) have significantly improved automatic stylization, yet most of this research has focused on human faces. Producing high-quality anime sceneries from intricate real-world scene photographs remains understudied despite its considerable research value. Converting real-world scene photographs into anime styles is difficult for several reasons:

1) Scene composition: As Figure 1 illustrates, scenes are frequently composed of many objects connected in complicated ways, with a clear hierarchy between foreground and background elements.

2) Anime characteristics: Figure 1 shows how pre-designed brush strokes are used on natural elements such as grass, trees, and clouds to create the distinctive textures and fine details that define anime. The organic, hand-drawn nature of these textures makes them considerably harder to imitate than the crisp edges and uniform color patches addressed in earlier work.

3) Data shortage and domain gap: There is a significant domain gap between real and anime scenes, and a high-quality anime scene dataset is crucial for bridging it. Existing datasets are of low quality because they contain many human faces and other foreground objects whose aesthetic differs from that of the background scenery.

Figure 1: Anime scene characteristics. A scene frame from Makoto Shinkai’s 2011 film “Children Who Chase Lost Voices” shows hand-drawn brush strokes for grass and stones (foreground) and trees and clouds (background), rather than clean edges and flat surfaces.

Unsupervised image-to-image translation is a popular approach to complex scene stylization when no paired training data are available. Despite promising results, existing techniques that target anime styles fall short in several areas. First, without pixel-wise correspondence in complex scenes, current approaches struggle to perform evident texture stylization while preserving semantic meaning, which can lead to unnatural outputs with noticeable artifacts. Second, some methods fail to reproduce the delicate details of anime scenes, because their hand-crafted anime-specific losses or pre-extracted representations enforce edge and surface smoothness.

To address these issues, researchers from S-Lab, Nanyang Technological University propose Scenimefy, a novel semi-supervised image-to-image (I2I) translation pipeline for producing high-quality anime-style renderings of scene photographs (Figure 2). Their key idea is to introduce a new supervised training branch into the unsupervised framework, using generated pseudo-paired data to address the shortcomings of unsupervised training. They exploit StyleGAN’s desirable properties by fine-tuning it to produce coarse paired data between the real and anime domains, i.e., pseudo-paired data.
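To make the two-branch idea concrete, the sketch below shows one training step that combines an unsupervised adversarial branch (unpaired photos and anime scenes) with a supervised branch driven by a StyleGAN-generated pseudo pair. This is a minimal illustration under stated assumptions, not the authors’ code: the generator G, discriminator D, loss weights, and the use of a plain L1 term for the supervised branch (standing in for the paper’s patch-wise contrastive style loss, sketched further below) are all simplifications.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for Scenimefy's networks:
# G: photo -> anime generator, D: anime-domain discriminator.
def semi_supervised_step(G, D, opt_G, opt_D,
                         real_photo, real_anime,      # unpaired samples
                         pseudo_photo, pseudo_anime,  # StyleGAN-generated pseudo pair
                         lambda_sup=1.0):
    # --- Discriminator update (unsupervised adversarial branch) ---
    fake_anime = G(real_photo).detach()
    d_loss = (F.softplus(-D(real_anime)).mean()   # real anime scored as real
              + F.softplus(D(fake_anime)).mean()) # translated photo scored as fake
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Generator update ---
    fake_anime = G(real_photo)
    adv_loss = F.softplus(-D(fake_anime)).mean()  # fool the discriminator

    # Supervised branch: the pseudo pair gives pixel-wise guidance.
    sup_loss = F.l1_loss(G(pseudo_photo), pseudo_anime)

    g_loss = adv_loss + lambda_sup * sup_loss
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```

In practice the supervised branch is what distinguishes this setup from purely unpaired training: the pseudo pair supplies a direct target for each input, while the adversarial branch keeps outputs in the anime domain.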

Figure 2: Anime scene renderings by Scenimefy, showing input photographs alongside their translated results.

They introduce a new semantic-constrained fine-tuning strategy that uses rich priors from pretrained models such as CLIP and VGG to guide StyleGAN in capturing intricate scene details and to reduce overfitting. They also propose a segmentation-guided data selection scheme to filter out low-quality data. Using the pseudo-paired data and a novel patch-wise contrastive style loss, Scenimefy learns effective pixel-wise correspondence between the two domains and generates fine details. Together with the unsupervised training branch, the semi-supervised framework strikes a desirable trade-off between the faithfulness and fidelity of scene stylization.
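A patch-wise contrastive loss between the translated output and its pseudo-paired anime reference can be sketched in PatchNCE style, as below. The feature extractor, temperature, and patch-sampling details here are assumptions for illustration, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_contrastive_style_loss(feat_out, feat_ref, num_patches=256, tau=0.07):
    """Contrast patches of the translated image against its pseudo-paired
    anime reference: the patch at the same spatial location is the positive,
    all other sampled locations act as negatives.

    feat_out, feat_ref: (B, C, H, W) feature maps from a shared encoder (assumed).
    """
    B, C, H, W = feat_out.shape
    # Sample the same random spatial locations in both feature maps.
    idx = torch.randperm(H * W, device=feat_out.device)[:num_patches]
    q = feat_out.flatten(2)[:, :, idx].permute(0, 2, 1)  # (B, N, C) queries
    k = feat_ref.flatten(2)[:, :, idx].permute(0, 2, 1)  # (B, N, C) keys
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)

    logits = torch.bmm(q, k.transpose(1, 2)) / tau        # (B, N, N) similarities
    targets = torch.arange(num_patches, device=logits.device).expand(B, -1)
    # Cross-entropy pulls each patch toward its same-location counterpart.
    return F.cross_entropy(logits.reshape(-1, num_patches), targets.reshape(-1))
```

The intuition is that matching patches should agree on style features while all other patches serve as negatives, which encourages fine, locally consistent texture transfer rather than global color matching.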

They also collected a high-quality dataset of pure anime scenes to aid training. Extensive experiments demonstrate Scenimefy’s effectiveness, outperforming state-of-the-art baselines in both perceptual quality and quantitative evaluation. Their main contributions are summarized as follows:

- They propose a novel semi-supervised scene stylization framework that transforms real photographs into sophisticated, high-quality anime scene images. The framework adds a novel patch-wise contrastive style loss to enhance stylization and fine details.

- A newly developed semantic-constrained StyleGAN fine-tuning technique with rich pre-trained prior guidance, followed by a segmentation-guided data selection scheme (see the sketch after this list), produces structure-consistent pseudo-paired data that serves as the basis for training supervision.

- They collected a high-resolution anime scene dataset to support future research on scene stylization.
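One plausible realization of the segmentation-guided data selection mentioned above is to keep only those pseudo pairs whose semantic layout survives the StyleGAN translation. The sketch below assumes a pretrained semantic segmentation network and a simple pixel-agreement threshold; the authors’ actual filtering criterion may differ.

```python
import torch

def select_pseudo_pairs(photos, anime_renders, seg_model, min_consistency=0.8):
    """Keep only pseudo pairs whose semantic layout is preserved.

    photos, anime_renders: lists of (3, H, W) tensors forming candidate pairs.
    seg_model: a pretrained semantic segmentation network (assumed) returning
               per-pixel class logits of shape (1, num_classes, H, W).
    """
    kept = []
    with torch.no_grad():
        for photo, render in zip(photos, anime_renders):
            seg_photo = seg_model(photo.unsqueeze(0)).argmax(dim=1)
            seg_render = seg_model(render.unsqueeze(0)).argmax(dim=1)
            # Fraction of pixels assigned the same semantic class in both images.
            consistency = (seg_photo == seg_render).float().mean().item()
            if consistency >= min_consistency:
                kept.append((photo, render))
    return kept
```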
