Meshes and points are the most common 3D scene representations because they are explicit and a good fit for fast GPU/CUDA-based rasterization. In contrast, recent Neural Radiance Field (NeRF) methods build on continuous scene representations, typically optimizing a Multi-Layer Perceptron (MLP) with volumetric ray-marching for novel-view synthesis of captured scenes. Similarly, the most efficient radiance field solutions build on continuous representations by interpolating values stored in, e.g., voxel grids, hash grids, or points. While the continuous nature of these methods helps optimization, the stochastic sampling required for rendering is costly and can result in noise.
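To make the cost of ray-marching concrete, the sketch below implements the standard NeRF quadrature that composites density and color samples along a single camera ray. The randomly drawn sample depths illustrate where the stochastic-sampling cost and noise come from: every pixel needs many such field queries. This is a minimal NumPy illustration, not any paper's implementation; `render_ray` and the toy density field `toy_field` are hypothetical, and a real NeRF would query an MLP or a grid at each sample.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray with the NeRF volume-rendering quadrature.

    sigmas: (N,) densities, colors: (N, 3) RGB values, deltas: (N,) sample spacings.
    """
    # Opacity of each ray segment: alpha_i = 1 - exp(-sigma_i * delta_i).
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance T_i = prod_{j<i} (1 - alpha_j): fraction of light reaching sample i.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

def toy_field(t):
    """Hypothetical scene: a fuzzy red surface centered at depth 4 along the ray."""
    sigmas = 50.0 * np.exp(-((t - 4.0) ** 2) / 0.01)
    colors = np.tile([1.0, 0.0, 0.0], (t.size, 1))
    return sigmas, colors

rng = np.random.default_rng(0)
# Random sample depths in [2, 6]; each one costs a field evaluation.
t = np.sort(rng.uniform(2.0, 6.0, size=128))
deltas = np.diff(t, append=t[-1] + 1e10)  # last segment treated as extending to infinity
sigmas, colors = toy_field(t)
print(render_ray(sigmas, colors, deltas))
```

Rerunning with a different seed changes the sample positions and therefore slightly changes the pixel value, which is the noise the paragraph above refers to.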
Researchers from Université Côte d’Azur and Max-Planck-Institut für Informatik introduce a new approach that combines the best of both worlds: their 3D Gaussian representation allows optimization with state-of-the-art (SOTA) visual quality and competitive training times, while their tile-based splatting solution ensures real-time rendering at SOTA quality at 1080p resolution on several previously published datasets (see Fig. 1). Their goal is to enable real-time rendering of scenes captured with multiple photos and to create the representations with optimization times as fast as the most efficient previous methods for typical real scenes. Recent methods achieve fast training but struggle to reach the visual quality of the current SOTA NeRF method, Mip-NeRF360, which requires up to 48 hours of training.
The fast but lower-quality radiance field methods can achieve interactive rendering times (10-15 frames per second, depending on the scene) but fall short of real-time rendering at high resolution. Their solution builds on three main components. First, they introduce 3D Gaussians as a flexible and expressive scene representation. They start with the same input as previous NeRF-like methods, i.e., cameras calibrated with Structure-from-Motion (SfM), and initialize the set of 3D Gaussians with the sparse point cloud produced for free as part of the SfM process. In contrast to most point-based solutions, which require Multi-View Stereo (MVS) data, they achieve high-quality results with only SfM points as input. Note that for the NeRF-synthetic dataset, their method achieves high quality even with random initialization.
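The initialization step can be sketched as one isotropic Gaussian per SfM point. This is a minimal NumPy illustration under stated assumptions: the function name and dictionary layout are hypothetical, the starting opacity of 0.1 is an assumed value, and the initial scale is taken as the mean distance to the three nearest SfM points, which is our reading of the paper's description. A real pipeline would use a spatial index instead of all pairwise distances; random initialization would simply replace `points` with uniformly sampled positions.

```python
import numpy as np

def init_gaussians_from_points(points, colors, k=3):
    """Hypothetical initializer: one isotropic 3D Gaussian per SfM point.

    points: (N, 3) SfM positions, colors: (N, 3) RGB values in [0, 1].
    """
    n = points.shape[0]
    if n <= k:
        raise ValueError("need more than k points")
    # Brute-force pairwise distances: O(N^2) memory, fine for a toy point cloud only.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    knn_mean = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    scale = np.maximum(knn_mean, 1e-7)             # avoid zero-size Gaussians on duplicates
    return {
        "position": points.copy(),
        "scale": np.repeat(scale[:, None], 3, axis=1),      # same extent on every axis
        "rotation": np.tile([1.0, 0.0, 0.0, 0.0], (n, 1)),  # identity quaternion (w, x, y, z)
        "opacity": np.full(n, 0.1),                         # assumed starting opacity
        "color": colors.copy(),
    }

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
g = init_gaussians_from_points(pts, np.zeros((4, 3)))
print(g["scale"][:, 0])
```

Tying the initial size to local point spacing means sparse regions start with large Gaussians and dense regions with small ones, so the first renders already cover the scene without large holes.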
They show that 3D Gaussians are an excellent choice because they are a differentiable volumetric representation that can nonetheless be rasterized very efficiently by projecting them to 2D and applying standard α-blending, using an image formation model equivalent to NeRF's. The second component of their method is the optimization of the 3D Gaussians' properties, namely 3D position, opacity α, anisotropic covariance, and spherical harmonic (SH) coefficients, interleaved with adaptive density control steps that add and occasionally remove 3D Gaussians during optimization. The optimization procedure produces a reasonably compact, unstructured, and precise representation of the scene (1-5 million Gaussians for all scenes tested). The third and final element of their method is a real-time rendering solution that uses fast GPU sorting algorithms inspired by tile-based rasterization, following recent work.
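The image formation described above can be sketched in three steps: build each anisotropic covariance from a scale vector and a rotation quaternion as Σ = R S Sᵀ Rᵀ, project it to screen space as Σ' = J W Σ Wᵀ Jᵀ (W is the viewing rotation, J the Jacobian of the perspective projection), and alpha-blend the resulting 2D Gaussians front to back. This is a simplified NumPy sketch following those formulas, not the authors' CUDA rasterizer; the function names and the early-exit threshold are assumptions, and the numerical safeguards of a production renderer are omitted.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); normalized first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance_3d(scale, q):
    """Sigma = R S S^T R^T, positive semi-definite for any scale and rotation."""
    M = quat_to_rotmat(q) @ np.diag(scale)
    return M @ M.T

def project_covariance(cov3d, mean_cam, W, fx, fy):
    """Screen-space covariance Sigma' = J W Sigma W^T J^T (2x2).

    W: 3x3 rotation part of the world-to-camera transform.
    mean_cam: Gaussian center in camera coordinates, with z > 0.
    """
    x, y, z = mean_cam
    # Jacobian of the pinhole projection (fx * x / z, fy * y / z) evaluated at the center.
    J = np.array([
        [fx / z, 0.0, -fx * x / z ** 2],
        [0.0, fy / z, -fy * y / z ** 2],
    ])
    T = J @ W
    return T @ cov3d @ T.T

def composite_pixel(pixel, means2d, covs2d, opacities, colors):
    """Front-to-back alpha blending of 2D Gaussians already sorted near-to-far."""
    color, T = np.zeros(3), 1.0  # accumulated color and transmittance
    for mu, cov, o, c in zip(means2d, covs2d, opacities, colors):
        d = pixel - mu
        alpha = o * np.exp(-0.5 * (d @ np.linalg.solve(cov, d)))
        color += T * alpha * np.asarray(c)
        T *= 1.0 - alpha
        if T < 1e-4:  # assumed early-exit threshold once the pixel is nearly opaque
            break
    return color

cov = covariance_3d(np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, 0.0, 0.0]))
cov2d = project_covariance(cov, np.array([0.0, 0.0, 1.0]), np.eye(3), fx=1.0, fy=1.0)
rgb = composite_pixel(np.zeros(2), [np.zeros(2)] * 2, [np.eye(2)] * 2,
                      [0.5, 0.5], [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(rgb)
```

Optimizing the scale and quaternion separately, instead of the covariance matrix directly, keeps Σ a valid covariance after every gradient step. The blending loop has the same form as the NeRF quadrature, with each α coming from a projected Gaussian instead of a ray sample.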
However, thanks to their 3D Gaussian representation, they can perform anisotropic splatting that respects visibility ordering (via sorting and α-blending) and enable a fast and accurate backward pass by tracking the traversal of as many sorted splats as required. To summarize, they provide the following contributions:
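The tile-based sorting idea can be sketched as follows: each Gaussian is duplicated once per screen tile its footprint overlaps, each copy gets a 64-bit key with the tile ID in the high bits and the view-space depth in the low bits, and a single sort then yields a depth-ordered list per tile. This NumPy version is an illustration under assumptions: the 16x16 tile size, the square bounding box of radius `r`, and the function names are hypothetical, and `np.argsort` stands in for a GPU radix sort.

```python
import numpy as np

TILE = 16  # tile size in pixels (assumed 16x16)

def build_sorted_instances(means2d, radii, depths, width, height):
    """Duplicate each Gaussian per overlapped tile and sort by (tile ID, depth)."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    keys, ids = [], []
    for i, ((u, v), r, z) in enumerate(zip(means2d, radii, depths)):
        # Tiles touched by the Gaussian's screen-space bounding square, clamped to the image.
        x0, x1 = max(int((u - r) // TILE), 0), min(int((u + r) // TILE), tiles_x - 1)
        y0, y1 = max(int((v - r) // TILE), 0), min(int((v + r) // TILE), tiles_y - 1)
        # Depth must be positive; for non-negative floats the raw IEEE-754 bits
        # sort in the same order as the values.
        depth_bits = int(np.array([z], dtype=np.float32).view(np.uint32)[0])
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                keys.append(((ty * tiles_x + tx) << 32) | depth_bits)  # tile ID in high bits
                ids.append(i)
    keys = np.array(keys, dtype=np.uint64)
    ids = np.array(ids, dtype=np.int64)
    order = np.argsort(keys, kind="stable")  # stands in for a single GPU radix sort
    return keys[order], ids[order], tiles_x * tiles_y

def tile_ranges(sorted_keys, num_tiles):
    """Start/end index of each tile's depth-sorted Gaussian list."""
    tile_of = (sorted_keys >> np.uint64(32)).astype(np.int64)
    tiles = np.arange(num_tiles)
    starts = np.searchsorted(tile_of, tiles, side="left")
    ends = np.searchsorted(tile_of, tiles, side="right")
    return starts, ends

means2d = [(8.0, 8.0), (8.0, 8.0), (16.0, 8.0)]
radii = [2.0, 2.0, 4.0]
depths = [5.0, 2.0, 1.0]
keys, ids, n_tiles = build_sorted_instances(means2d, radii, depths, width=32, height=16)
starts, ends = tile_ranges(keys, n_tiles)
print(ids.tolist(), starts.tolist(), ends.tolist())  # [2, 1, 0, 2] [0, 3] [3, 4]
```

Because one global sort orders every tile at once, each tile can afterwards blend its own contiguous slice independently, and the backward pass can walk the same slices in reverse.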
• The introduction of anisotropic 3D Gaussians as a high-quality, unstructured representation of radiance fields.
• An optimization method for 3D Gaussian properties, interleaved with adaptive density control, that creates high-quality representations of captured scenes.
• A fast, visibility-aware, differentiable rendering approach for the GPU that allows anisotropic splatting and fast backpropagation to achieve high-quality novel-view synthesis.
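The adaptive density control named in the second contribution can be sketched as a clone/split/prune step. As we understand the paper, Gaussians with a large average view-space positional gradient are densified: small ones are cloned, large ones are split into two children whose scales are divided by 1.6, and nearly transparent Gaussians are removed. This NumPy sketch is a simplified illustration under assumptions: the function name is hypothetical, `size_thresh` and `min_opacity` are assumed values, and the Gaussians are treated as axis-aligned (rotation ignored) to keep the sampling short.

```python
import numpy as np

def densify_and_prune(pos, scale, opacity, grad_norm, grad_thresh=2e-4,
                      size_thresh=0.01, min_opacity=0.005, phi=1.6, rng=None):
    """One hypothetical adaptive-density step for axis-aligned Gaussians.

    pos, scale: (N, 3) centers and per-axis standard deviations
    opacity: (N,) opacities in [0, 1]; grad_norm: (N,) mean positional-gradient norms
    """
    rng = np.random.default_rng() if rng is None else rng
    hot = grad_norm > grad_thresh              # regions the optimizer is still pulling on
    large = scale.max(axis=1) > size_thresh
    clone = hot & ~large                       # under-reconstruction: copy small Gaussians
    split = hot & large                        # over-reconstruction: break up large ones

    # Keep every non-split Gaussian, plus an unchanged copy of each cloned one.
    parts_pos = [pos[~split], pos[clone]]
    parts_scale = [scale[~split], scale[clone]]
    parts_op = [opacity[~split], opacity[clone]]
    for _ in range(2):
        # Each split child is sampled from its parent and shrunk by phi.
        parts_pos.append(pos[split] + rng.normal(size=pos[split].shape) * scale[split])
        parts_scale.append(scale[split] / phi)
        parts_op.append(opacity[split])

    pos = np.concatenate(parts_pos)
    scale = np.concatenate(parts_scale)
    opacity = np.concatenate(parts_op)

    # Prune Gaussians that have become nearly transparent.
    keep = opacity > min_opacity
    return pos[keep], scale[keep], opacity[keep]

pos = np.zeros((3, 3))
scale = np.array([[0.001] * 3, [0.1] * 3, [0.001] * 3])
opacity = np.array([0.5, 0.5, 0.001])
grad_norm = np.array([1e-3, 1e-3, 0.0])
p, s, o = densify_and_prune(pos, scale, opacity, grad_norm, rng=np.random.default_rng(0))
print(len(p))  # 4
```

In this toy call, the first Gaussian is cloned, the second is replaced by two smaller children, and the third is pruned for being almost transparent.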
Their results on previously published datasets show that they can optimize their 3D Gaussians from multi-view captures and achieve equal or better quality than the best previous implicit radiance field approaches. They can also achieve training speeds and quality similar to the fastest methods and, importantly, provide the first real-time rendering with high quality for novel-view synthesis.