An AI model called HOSNeRF can produce dynamic neural radiance fields from a single video.

Thanks to recent developments in 3D reconstruction techniques, immersive media has become a popular topic. More capable techniques have emerged, particularly in video reconstruction and free-viewpoint rendering, which enable greater user interaction and the creation of realistic environments. These techniques have been used in 3D animated films, virtual reality, telepresence, and the metaverse, among other fields.

Reconstructing videos, however, presents a number of difficulties. This is especially true when dealing with a single viewpoint and intricate interactions between people and their surroundings. In practice, interactions between people and their environment are highly unpredictable, which makes them difficult to handle.

Significant progress has been made in the field of view synthesis, with Neural Radiance Fields (NeRF) playing a pivotal role. NeRF was originally proposed to reconstruct static 3D scenes from multi-view images. Its success attracted wide attention, and it has since been extended to address the challenge of dynamic view synthesis. Researchers have proposed several approaches to incorporate dynamic elements, such as deformation fields and spatiotemporal radiance fields. Additionally, there has been a specific focus on dynamic neural human modeling, which leverages estimated human poses as prior information. While these advancements have shown promise, accurately reconstructing challenging monocular videos with fast and complex human-object-scene motions and interactions remains difficult.

What if we want to advance NeRFs further so that they can accurately reconstruct complex human-environment interactions? How can we utilize NeRFs in environments with complex object movement? Time to meet HOSNeRF.

Overview of HOSNeRF. Source: https://arxiv.org/pdf/2304.12281.pdf

Human-Object-Scene Neural Radiance Fields (HOSNeRF) is introduced to overcome the limitations of NeRF. HOSNeRF tackles the challenges associated with complex object motions in human-object interactions and the dynamic interaction between humans and different objects at different times. By incorporating object bones attached to the human skeleton hierarchy, HOSNeRF enables accurate estimation of object deformations during human-object interactions. Additionally, two new learnable object state embeddings have been introduced to handle the dynamic removal and addition of objects in the static background model and the human-object model.
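
To make the idea of object bones more concrete, here is a minimal toy sketch of a bone hierarchy in which an object is attached as a child of a human joint. It is not HOSNeRF's implementation: the bone names, the 2D simplification, and the `forward_kinematics` function are all assumptions made for illustration. The point is only that an object bone parented to the hand inherits every rotation applied further up the chain.

```python
import math

# Hypothetical toy skeleton: (name, parent index, local offset from parent joint).
# The last entry is an "object bone" parented to the hand, so the object
# inherits every transform applied further up the human chain.
BONES = [
    ("pelvis", -1, (0.0, 0.0)),
    ("shoulder", 0, (0.0, 1.0)),
    ("hand", 1, (1.0, 0.0)),
    ("object", 2, (0.5, 0.0)),  # object bone attached to the hand
]


def forward_kinematics(angles):
    """Return 2D world positions of all joints for per-bone rotations (radians)."""
    world = []  # (accumulated angle, x, y) for each bone, in BONES order
    for (_, parent, (ox, oy)), theta in zip(BONES, angles):
        if parent < 0:
            parent_angle, px, py = 0.0, 0.0, 0.0
        else:
            parent_angle, px, py = world[parent]
        # Rotate the local offset by the rotation accumulated along the parent chain.
        x = px + math.cos(parent_angle) * ox - math.sin(parent_angle) * oy
        y = py + math.sin(parent_angle) * ox + math.cos(parent_angle) * oy
        world.append((parent_angle + theta, x, y))
    return {name: (x, y) for (name, _, _), (_, x, y) in zip(BONES, world)}


rest_pose = forward_kinematics([0.0, 0.0, 0.0, 0.0])
raised_arm = forward_kinematics([0.0, math.pi / 2, 0.0, 0.0])
```

In this toy setup, rotating only the shoulder moves both the hand and the attached object, which is the behavior a shared skeleton hierarchy is meant to capture. The real model works in 3D and learns deformations rather than using fixed offsets.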

Overview of the proposed method. Source: https://arxiv.org/pdf/2304.12281.pdf

The development of HOSNeRF involved exploring and identifying effective training objectives and strategies. Key considerations included deformation cycle consistency, optical flow supervision, and foreground-background rendering. With these in place, HOSNeRF can achieve high-fidelity dynamic novel view synthesis. It also allows a monocular video to be paused at any moment and all scene details, including dynamic humans, objects, and backgrounds, to be rendered from arbitrary viewpoints. In other words, you could view something like the famous bullet-dodging scene from The Matrix from any angle.
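
As a rough illustration of what a deformation cycle consistency objective can look like, the sketch below warps points from observation space to a canonical space and back, then penalizes the round-trip error. This is a generic formulation under stated assumptions, not the exact loss used by HOSNeRF: the translation-only warps and the helper names `make_translation` and `cycle_consistency_loss` are hypothetical stand-ins for learned deformation fields.

```python
def make_translation(offset):
    """Return a toy warp that shifts a 3D point by a fixed offset."""
    def warp(point):
        return tuple(p + o for p, o in zip(point, offset))
    return warp


def cycle_consistency_loss(points, to_canonical, to_observation):
    """Mean squared distance between each point and its round-trip image."""
    total = 0.0
    for point in points:
        round_trip = to_observation(to_canonical(point))
        total += sum((a - b) ** 2 for a, b in zip(point, round_trip))
    return total / len(points)


points = [(0.0, 1.0, 2.0), (1.0, 1.0, 1.0)]
backward = make_translation((-0.5, 0.0, 0.25))      # observation -> canonical
good_forward = make_translation((0.5, 0.0, -0.25))  # exact inverse of backward
bad_forward = make_translation((0.5, 0.0, 0.0))     # leaves a drift along z

consistent = cycle_consistency_loss(points, backward, good_forward)
inconsistent = cycle_consistency_loss(points, backward, bad_forward)
```

When the forward warp is the exact inverse of the backward warp, the loss is zero. The mismatched pair yields a positive loss, which is the signal a training procedure would use to push the two deformations toward agreement.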

HOSNeRF presents a groundbreaking framework that achieves 360° free-viewpoint high-fidelity novel view synthesis for dynamic scenes with human-environment interactions, all from a single video. The introduction of object bones and state-conditional representations enables HOSNeRF to effectively handle the complex non-rigid motions and interactions between humans, objects, and the environment.