Open Dataset with More Than 10 Million 3D Objects

A recent breakthrough in AI has been the significance of scale in driving advances in various domains. Large models have demonstrated remarkable capabilities in language comprehension, generation, representation learning, multimodal tasks, and image generation. With an increasing number of learnable parameters, modern neural networks consume vast amounts of data. As a result, the capabilities exhibited by these models have seen dramatic improvements.

One example is GPT-2, which broke data barriers by consuming approximately 30 billion language tokens a few years ago. GPT-2 showcased promising zero-shot results on NLP benchmarks. However, newer models like Chinchilla and LLaMA have surpassed GPT-2 by consuming trillions of web-crawled tokens. They have easily outperformed GPT-2 in terms of benchmarks and capabilities. In computer vision, ImageNet initially consisted of 1 million images and was the gold standard for representation learning. But with the scaling of datasets to billions of images through web crawling, datasets like LAION5B have produced powerful visual representations, as seen with models like CLIP. The shift from manually assembling datasets to gathering them from diverse sources via the web has been key to this scaling from millions to billions of data points.

While language and image data have significantly scaled, other areas, such as 3D computer vision, still need to catch up. Tasks like 3D object generation and reconstruction rely on small handcrafted datasets. ShapeNet, for instance, depends on professional 3D designers using expensive software to create assets, making the process challenging to crowdsource and scale. The scarcity of data has become a bottleneck for learning-driven methods in 3D computer vision. 3D object generation still falls far behind 2D image generation, often relying on models trained on large 2D datasets instead of being trained from scratch on 3D data. The increasing demand and interest in augmented reality (AR) and virtual reality (VR) technologies further highlight the urgent need to scale up 3D data.

To address these limitations researchers from Allen Institute for AI, University of Washington, Seattle, Columbia University, Stability AI, CALTECH and LAION introduces Objaverse-XL as a large-scale web-crawled dataset of 3D assets. The rapid advancements in 3D authoring tools, along with the increased availability of 3D data on the internet through platforms such as Github, Sketchfab, Thingiverse, Polycam, and specialized sites like the Smithsonian Institute, have contributed to the creation of Objaverse-XL. This dataset provides a significantly wider variety and quality of 3D data than previous efforts, such as Objaverse 1.0 and ShapeNet. With over 10 million 3D objects, Objaverse-XL represents a substantial increase in scale, exceeding prior datasets by several orders of magnitude.

The scale and diversity offered by Objaverse-XL have significantly expanded the performance of state-of-the-art 3D models. Notably, the Zero123-XL model, pre-trained with Objaverse-XL, demonstrates remarkable zero-shot generalization capabilities in challenging and complex modalities. It performs exceptionally well on tasks like novel view synthesis, even with diverse inputs such as photorealistic assets, cartoons, drawings, and sketches. Similarly, PixelNeRF, trained to synthesize novel views from a small set of images, shows notable improvements when trained with Objaverse-XL. Scaling pre-training data from a thousand assets to 10 million consistently exhibits improvements, highlighting the promise and opportunities enabled by web-scale data.

The implications of Objaverse-XL extend beyond the realm of 3D models. Its potential applications span computer vision, graphics, augmented reality, and generative AI. Reconstructing 3D objects from images has long been challenging in computer vision and graphics. Existing methods have explored various representations, network architectures, and differentiable rendering techniques to predict 3D shapes and textures from images. However, these methods have primarily relied on small-scale datasets like ShapeNet. With the significantly larger Objaverse-XL, new levels of performance and generalization in zero-shot fashion can be achieved.

Moreover, the emergence of generative AI in 3D has been an exciting development. Models like MCC, DreamFusion, and Magic3D have shown that 3D shapes can be generated from language prompts with the help of text-to-image models. Objaverse-XL also opens up opportunities for text-to-3D generation, enabling advancements in text-to-3D modeling. By leveraging the vast and diverse dataset, researchers can explore novel applications and push the boundaries of generative AI in the 3D domain.

The release of Objaverse-XL marks a significant milestone in the field of 3D datasets. Its size, diversity, and potential for large-scale training hold promise for advancing research and applications in 3D understanding. Although Objaverse-XL is currently smaller than billion-scale image-text datasets, its introduction paves the way for further exploration on how to continue scaling 3D datasets and simplify capturing and creating 3D content. Future work can also focus on choosing optimal data points for training and extending Objaverse-XL to benefit discriminative tasks such as 3D segmentation and detection.

In conclusion, the introduction of Objaverse-XL as a massive 3D dataset sets the stage for exciting new possibilities in computer vision, graphics, augmented reality, and generative AI. By addressing the limitations of previous datasets, Objaverse-XL provides a foundation for large-scale training and opens up avenues for groundbreaking research and applications in the 3D domain.