AI method reconstructs 3D scene details from observed images using inverse rendering

Layout generation. a, Images for two scenes observed by a single camera. b, Test-time optimized inverse rendered objects. c, BEV layouts of the scenes. In the BEV layout (a common representation for autonomous driving tasks), black boxes represent the gro…

Over the past decades, computer scientists have developed many computational tools that can analyze and interpret images. These tools have proved useful for a broad range of applications, including robotics, autonomous driving, health care, manufacturing and even entertainment.

Most of the best-performing computer vision approaches employed to date rely on so-called feed-forward neural networks. These are computational models that process input images step by step, ultimately making predictions about them.

While these models often perform well on data resembling what they encountered during training, they frequently fail to generalize to new images and different scenarios. In addition, their predictions and the patterns they extract from images can be difficult to interpret.

Researchers at Princeton University recently developed a new inverse rendering approach that is more transparent and could also interpret a wide range of images more reliably. The new approach, introduced in a paper published in Nature Machine Intelligence, relies on a generative artificial intelligence (AI)-based method to simulate the process of image creation, while also optimizing it by gradually adjusting a model's internal parameters.

"Generative AI and neural rendering have transformed the field in recent years for creating novel content: producing images or videos from scene descriptions," Felix Heide, senior author of the paper, told Tech Xplore. "We investigate whether we can flip this around and use these generative models for extracting the scene descriptions from images."

Video of tracking results of the team's method. A demonstration of the performance of our proposed tracking method based on inverse neural rendering for a sample of diverse scenes from the nuScenes dataset and the Waymo Open Dataset. We overlay the observed image with the rendered objects through alpha blending with a weight of 0.4. Object renderings are defined by the averaged latent embeddings z_{k,EMA} and the tracked object state y_k.
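
The overlay described in the caption is a plain alpha blend of the rendering over the camera frame. As a minimal sketch (assuming both images are floating-point NumPy arrays scaled to [0, 1]; the function name is purely illustrative):

```python
import numpy as np

def overlay(observed: np.ndarray, rendered: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Blend rendered objects over the observed camera image.

    Both inputs are assumed to be H x W x 3 arrays with values in [0, 1];
    alpha = 0.4 matches the blending weight quoted in the caption.
    """
    return alpha * rendered + (1.0 - alpha) * observed
```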

The new approach developed by Heide and his colleagues relies on a so-called differentiable rendering pipeline. This is a process that simulates image creation, operating on compressed representations of images produced by generative AI models.

"We developed an analysis-by-synthesis approach that allows us to solve vision tasks, such as tracking, as test-time optimization problems," explained Heide. "We found that this method generalizes across datasets, and in contrast to existing supervised learning methods, does not need to be trained on new datasets."

Essentially, the method developed by the researchers works by placing models of 3D objects in a virtual scene depicting a real-world setting. These object models are produced by a generative AI model from random samples of 3D scene parameters.
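
In code terms, this initialization step amounts to sampling a random latent code and a rough pose for each object and decoding them into something renderable. The sketch below is purely illustrative: the toy generator class and all variable names are assumptions for demonstration, not the authors' actual model, which is a 3D generative network trained on synthetic assets.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained 3D object generator: it maps a compact latent
# code to a flat vector of "shape" parameters. It only illustrates the
# interface of decoding latents into object representations.
class ToyObjectGenerator(nn.Module):
    def __init__(self, latent_dim: int = 64, shape_dim: int = 128):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, shape_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

generator = ToyObjectGenerator()

# Each tracked object starts from a random latent code and a rough 3D pose
# (x, y, z, yaw); these are the unknowns refined at test time.
num_objects = 3
latents = torch.randn(num_objects, 64, requires_grad=True)
poses = torch.zeros(num_objects, 4, requires_grad=True)

objects = generator(latents)  # decoded object representations placed in the virtual scene
```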

Optimizing 3D models through inverse neural rendering. From left to right: the observed image, initial random 3D generations, and three optimization steps that refine these to better match the observed image. The observed images are faded to show the rend…

Generalization of 3D multi-object tracking with Inverse Neural Rendering. The method directly generalizes across datasets such as the nuScenes and Waymo Open Dataset benchmarks without additional fine-tuning and is trained on synthetic 3D models only.

"We then render all these objects back together into a 2D image," said Heide. "Next, we compare this rendered image with the real observed image. Based on how different they are, we backpropagate the difference through both the differentiable rendering function and the 3D generation model to update its inputs. In just a few steps, we optimize these inputs to make the rendered match the observed images better."

A notable advantage of the team's newly proposed approach is that it allows very generic 3D object generation models trained only on synthetic data to perform well across a wide range of datasets containing images captured in real-world settings. In addition, the renderings it produces are far more explainable than the predictions of conventional feed-forward machine learning models.

"Our inverse rendering approach for tracking works just as well as learned feed-forward approaches, but it provides us with explicit 3D explanations of its perceived world," said Heide.

"The other interesting aspect is the generalization capabilities. Without changing the 3D generation model or training it on new data, our 3D multi-object tracking through Inverse Neural Rendering works well across different autonomous driving datasets and object types. This can significantly reduce the cost of fine-tuning on new data or at least work as an auto-labeling pipeline."

This recent study could soon help to advance AI models for computer vision, improving their performance in real-world settings while also increasing their transparency. The researchers now plan to continue improving their method and start testing it on more computer vision-related tasks.

"A logical next step is the expansion of the proposed approach to other perception tasks, such as 3D detection and 3D segmentation," added Heide. "Ultimately, we want to explore if inverse rendering can even be used to infer the whole 3D scene, and not just individual objects. This would allow our future robots to reason and continuously optimize a three-dimensional model of the world, which comes with built-in explainability."

More information: Julian Ost et al, Towards generalizable and interpretable three-dimensional tracking with inverse neural rendering, Nature Machine Intelligence (2025). DOI: 10.1038/s42256-025-01083-x.

Journal information: Nature Machine Intelligence

Video can be accessed at source link below.

By Ingrid Fadelli

Ingrid is a freelance journalist and science enthusiast with a BSc in Psychology and an MA in International Journalism, both from City, University of London. She regularly writes articles for numerous online publications, media websites, and research outreach companies. Her primary interests include artificial intelligence, robotics, psychology, neuroscience, environmental science, and astrophysics. Ingrid started writing for Science X in 2018 and now covers a wide range of research-based topics.

    (Source: techxplore.com; August 23, 2025; https://tinyurl.com/3tf6m82c)