Inferring the 3D structure of a generic object from a 2D image is a long-standing objective of computer vision. Conventional approaches either learn entirely from CAD-generated synthetic data, and so struggle at inference on real images, or generate a 2.5D depth image via intrinsic decomposition, which is limited compared to full 3D reconstruction. One fundamental challenge lies in how to leverage numerous real 2D images without any 3D ground truth. To address this issue, we take an alternative approach with semi-supervised learning. That is, for a 2D image of a generic object, we decompose it into latent representations of category, shape, albedo, lighting, and camera projection matrix, decode the shape and albedo representations to a segmented 3D shape and an albedo respectively, and fuse these components to render an image that closely approximates the input image. Using a category-adaptive 3D joint occupancy field (JOF), we show that complete shape and albedo modeling enables us to leverage real 2D images in both modeling and model fitting. The effectiveness of our approach is demonstrated through superior 3D reconstruction from a single image, whether synthetic or real, and through shape segmentation.
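The decomposition described above can be pictured with a toy forward pass. The sketch below is not the paper's network: the latent sizes, the single-layer "decoders", the number of parts `K`, and the use of max/argmax over per-part occupancies are all assumptions made for illustration, and the rendering step that would consume the lighting and camera codes is left out.

```python
import numpy as np

K = 4                                            # number of shape parts (assumed)
D_SHAPE, D_ALBEDO, D_LIGHT, D_CAM = 8, 8, 4, 6   # latent sizes (assumed)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def encode(image):
    """Toy stand-in for the image encoder: one feature vector split into
    shape, albedo, lighting, and camera codes."""
    total = D_SHAPE + D_ALBEDO + D_LIGHT + D_CAM
    feat = np.tanh(image.reshape(-1)[:total])
    return np.split(feat, np.cumsum([D_SHAPE, D_ALBEDO, D_LIGHT]))


def decode(code, points, weights):
    """Toy implicit decoder: one linear layer plus sigmoid on [point, code]."""
    codes = np.broadcast_to(code, (points.shape[0], code.size))
    return sigmoid(np.concatenate([points, codes], axis=1) @ weights)


def decompose(image, points, w_shape, w_albedo):
    """Query hypothetical shape and albedo fields at 3D points."""
    f_shape, f_albedo, light, cam = encode(image)
    part_occ = decode(f_shape, points, w_shape)   # (N, K) per-part occupancy
    albedo = decode(f_albedo, points, w_albedo)   # (N, 3) RGB albedo
    return {
        "occupancy": part_occ.max(axis=1),   # inside-ness of the whole shape
        "part": part_occ.argmax(axis=1),     # segmentation label per point
        "albedo": albedo,
        "light": light,                      # would feed a renderer (omitted)
        "camera": cam,                       # would feed a renderer (omitted)
    }


rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))                   # random stand-in for an input image
points = rng.uniform(-1.0, 1.0, size=(5, 3))      # 3D query points
w_shape = rng.normal(size=(3 + D_SHAPE, K))       # untrained toy weights
w_albedo = rng.normal(size=(3 + D_ALBEDO, 3))
out = decompose(image, points, w_shape, w_albedo)
```

The point of the sketch is only the data flow: a single image yields separate codes, and shape and albedo are queried as fields over 3D points rather than stored as a fixed mesh or voxel grid.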

Reconstruction Introduction

Figure 1. Our semi-supervised method learns a universal model of multiple generic objects. During inference, the jointly learnt fitting module decomposes a real 2D image into albedo, segmented full 3D shape, illumination, and camera projection.

Reconstruction Overview

Figure 2. Our semi-supervised analysis-by-synthesis framework jointly learns one image encoder (E) and two decoders (D_S, D_A), with a differentiable rendering layer. Training uses both synthetic and real images, with supervision from class labels and 3D CAD models (the ground truth of synthetic data) and from silhouette masks of real data, but no 3D ground truth for real data.
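One way to read the supervision described in this caption is as a loss whose terms depend on whether a sample is synthetic or real. The following is a minimal sketch under that reading, not the paper's objective: the dictionary keys, the use of L1 and cross-entropy, the equal weighting, and the choice to apply the class-label term to both kinds of data are all assumptions.

```python
import numpy as np


def l1(a, b):
    """Mean absolute difference between two arrays."""
    return float(np.mean(np.abs(np.asarray(a, float) - np.asarray(b, float))))


def cross_entropy(probs, label):
    """Negative log-probability of the true class."""
    return float(-np.log(probs[label] + 1e-12))


def training_loss(pred, target, is_synthetic):
    # Every sample: the rendered image should approximate the input image,
    # and the predicted category should match the class label.
    loss = l1(pred["image"], target["image"])
    loss += cross_entropy(pred["class_probs"], target["label"])
    if is_synthetic:
        # Synthetic data come with 3D ground truth from CAD models.
        loss += l1(pred["occupancy"], target["occupancy"])
        loss += l1(pred["albedo"], target["albedo"])
    else:
        # Real data have no 3D ground truth; only the silhouette is supervised.
        loss += l1(pred["mask"], target["mask"])
    return loss


# A real-image sample whose image and class are predicted perfectly,
# but whose silhouette is entirely wrong.
target = {"image": np.zeros((4, 4, 3)), "label": 1, "mask": np.ones((4, 4))}
pred = {
    "image": np.zeros((4, 4, 3)),
    "class_probs": np.array([0.0, 1.0]),
    "mask": np.zeros((4, 4)),
}
real_loss = training_loss(pred, target, is_synthetic=False)
```

In this example only the silhouette term contributes to `real_loss`; a synthetic sample would instead need `occupancy` and `albedo` entries and would not require a `mask`.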

Reconstruction Segmentation Results

Figure 3. Unsupervised co-segmentation across 13 categories.

Reconstruction Reconstruction Results

Figure 4. Qualitative comparison for single-view 3D reconstruction on (a) ShapeNet, (b) Pascal 3D+, and (c) Pix3D datasets.

Additional Visualizations

Fully Understanding Generic Objects Source Code

The source code can be downloaded from here


  • Fully Understanding Generic Objects: Modeling, Segmentation, and Reconstruction
    Feng Liu, Luan Tran, Xiaoming Liu
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, Jun. 2021
    Bibtex | PDF | arXiv | Supplemental | Code | Video
  • @inproceedings{ fully-understanding-generic-objects-modeling-segmentation-and-reconstruction,
      author = { Feng Liu and Luan Tran and Xiaoming Liu },
      title = { Fully Understanding Generic Objects: Modeling, Segmentation, and Reconstruction },
      booktitle = { Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition },
      address = { Nashville, TN },
      month = { June },
      year = { 2021 },