Inferring the 3D locations and shapes of multiple objects from a single 2D image is a long-standing objective of computer vision. Most existing works either predict only one of these 3D properties or solve both for a single object. One fundamental challenge lies in learning an image representation that is well-suited for both 3D detection and reconstruction. In this work, we propose to learn a regular grid of 3D voxel features from the input image, aligned with the 3D scene space via a 3D feature lifting operator. Based on the 3D voxel features, our novel CenterNet-3D detection head formulates 3D detection as keypoint detection in 3D space. Moreover, we devise an efficient coarse-to-fine reconstruction module, including coarse-level voxelization and a novel local PCA-SDF shape representation, which enables fine-detail reconstruction and inference that is one order of magnitude faster than prior methods. With complementary supervision from both 3D detection and reconstruction, the 3D voxel features become geometry- and context-preserving, benefiting both tasks. The effectiveness of our approach is demonstrated through 3D detection and reconstruction in single-object and multiple-object scenarios.


Figure 1. Given a single image as input, our proposed approach jointly predicts 3D object bounding boxes and surfaces.


Figure 2. Overview of our approach. The proposed joint framework is composed of three key modules: 3D voxel feature learning (consisting of a feature backbone and 2D-to-3D feature lifting), a CenterNet-3D detector, and coarse-to-fine 3D reconstruction. 2D feature maps are first generated from the input image I and then back-projected into voxel features G using a known camera projection matrix P. The voxel features are shared by our novel 3D object detection and reconstruction modules.
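The 2D-to-3D feature lifting step can be illustrated with a minimal sketch. It is not the released implementation: the function name `lift_features_to_voxels`, the nearest-neighbor sampling, and the zero-filling of voxels that fall outside the image are all assumptions made for illustration. Only the idea of back-projecting 2D features into a voxel grid G with a known 3x4 projection matrix P comes from the description above.

```python
import numpy as np

def lift_features_to_voxels(feat_2d, P, voxel_centers):
    """Back-project 2D features onto 3D voxel centers (illustrative sketch).

    feat_2d:       (C, H, W) feature map from the 2D backbone.
    P:             (3, 4) camera projection matrix (assumed known).
    voxel_centers: (N, 3) voxel center coordinates in scene space.
    Returns an (N, C) array; voxels that project outside the image or lie
    behind the camera receive zeros (an assumption of this sketch).
    """
    C, H, W = feat_2d.shape
    N = voxel_centers.shape[0]

    # Homogeneous coordinates, then perspective projection.
    homog = np.concatenate([voxel_centers, np.ones((N, 1))], axis=1)  # (N, 4)
    proj = homog @ P.T                                                # (N, 3)
    depth = proj[:, 2]
    valid = depth > 1e-6
    safe_depth = np.where(valid, depth, 1.0)  # avoid division by zero

    # Nearest-neighbor pixel lookup (bilinear sampling is a common alternative).
    ui = np.round(proj[:, 0] / safe_depth).astype(int)
    vi = np.round(proj[:, 1] / safe_depth).astype(int)
    valid &= (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)

    voxel_feats = np.zeros((N, C), dtype=feat_2d.dtype)
    voxel_feats[valid] = feat_2d[:, vi[valid], ui[valid]].T
    return voxel_feats
```

Reshaping the returned array to the grid resolution gives a voxel feature volume aligned with scene space. Every voxel along a camera ray receives the same 2D feature, so later 3D layers have to resolve depth.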

Shape Representation Comparisons

Figure 3. 2D examples of (a) DeepSDF, (b) DeepLS, and (c) our local PCA-SDF shape representation. DeepSDF describes surfaces with global shape codes. The SDF function f in DeepLS outputs a scalar value conditioned on the local latent code z_i and the local coordinate x. However, its inference is computationally expensive, since it requires a forward pass through f for every x. Our shape representation consists of coarse-level voxelization and fine-level local PCA-SDF. The coarse-level voxelization holistically represents the whole 3D surface with binary values. To further represent fine-level surfaces, we propose a novel local PCA-SDF model, which represents each occupied voxel as a linear combination of regular SDF function bases and enables a more efficient and accurate representation than DeepLS.
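The idea of representing a local SDF patch as a linear combination of bases can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the use of a mean patch, and fitting the bases by SVD on flattened training patches are assumptions. It shows why decoding is cheap: all occupied voxels are decoded with one matrix product, with no per-point network evaluation.

```python
import numpy as np

def fit_pca_basis(patches, num_bases):
    """Fit PCA bases to flattened local SDF patches.

    patches: (M, D) array, one flattened local SDF grid per row
             (for example D = 4*4*4 samples inside a voxel; size is hypothetical).
    Returns the mean patch (D,) and orthonormal bases (num_bases, D).
    """
    mean = patches.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:num_bases]

def encode(patches, mean, bases):
    """Project patches (N, D) onto the bases to get coefficients (N, K)."""
    return (patches - mean) @ bases.T

def decode(coeffs, mean, bases):
    """Reconstruct local SDF patches (N, D) from coefficients (N, K)."""
    return mean + coeffs @ bases
```

In a pipeline following the description above, a network would predict the coefficients for each occupied voxel of the coarse grid, and `decode` would turn them into fine-level SDF values from which a surface can be extracted.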

Reconstruction and Detection Results

Figure 4. Qualitative results on real images from ScanNet-MDR. Our reconstructions match the objects more closely than those of CoReNet. Moreover, our method reconstructs truncated objects better.

Reshape Comparisons

Figure 5. Qualitative comparison of our local PCA-SDF with DeepSDF and DeepLS on some shapes from the ShapeNet dataset.

Additional Visualizations

Source Code

The source code can be downloaded from here


  • Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image
    Feng Liu, Xiaoming Liu
    In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, Dec. 2021
    Bibtex | PDF | arXiv | Supplemental | Code | Video
  • @inproceedings{ voxel-based-3d-detection-and-reconstruction-of-multiple-objects-from-a-single-image,
      author = { Feng Liu and Xiaoming Liu },
      title = { Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image },
      booktitle = { Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems },
      address = { Virtual },
      month = { December },
      year = { 2021 },