M3D-RPN: Monocular 3D Region Proposal Network for Object Detection

Understanding the world in 3D is a critical component of urban autonomous driving. Generally, the combination of expensive LiDAR sensors and stereo RGB imaging has been paramount for successful 3D object detection algorithms, whereas monocular image-only methods experience drastically reduced performance. We propose to reduce the gap by reformulating the monocular 3D detection problem as a standalone 3D region proposal network. We leverage the geometric relationship of 2D and 3D perspectives, allowing 3D boxes to utilize well-known and powerful convolutional features generated in the image-space. To help address the strenuous 3D parameter estimations, we further design depth-aware convolutional layers which enable location-specific feature development and consequently improved 3D scene understanding. Compared to prior work in monocular 3D detection, our method consists of only the proposed 3D region proposal network rather than relying on external networks, data, or multiple stages. M3D-RPN is able to significantly improve the performance of both monocular 3D Object Detection and Bird's Eye View tasks within the KITTI urban autonomous driving dataset, while efficiently using a shared multi-class model.

Figure 1. Overview of M3D-RPN. The proposed method consists of parallel paths for global (orange) and local (blue) feature extraction. The global features use regular spatially invariant convolution, while the local features use depth-aware convolution, as detailed on the right.
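To make the contrast between the two paths concrete, here is a minimal pure-Python sketch of a location-specific convolution. It assumes the feature map is divided into horizontal bands (row bins) and that each band has its own kernel, so rows at different image heights, which tend to correspond to different depths in driving scenes, are filtered differently. The function names `conv_at` and `depth_aware_conv` are hypothetical, the map is single-channel, and inputs are read across band boundaries for simplicity; this is an illustration, not the released PyTorch implementation.

```python
def conv_at(feature, kernel, y, x):
    """Zero-padded 2D cross-correlation response at one pixel (odd square kernel)."""
    h, w = len(feature), len(feature[0])
    r = len(kernel) // 2
    acc = 0.0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:  # positions outside the map count as zero
                acc += feature[yy][xx] * kernel[dy + r][dx + r]
    return acc


def depth_aware_conv(feature, kernels):
    """Filter each horizontal band of rows with its own kernel.

    `kernels` holds one kernel per row bin (assumption: equal-height bins).
    With a single kernel this reduces to an ordinary global convolution.
    """
    h, w = len(feature), len(feature[0])
    bins = len(kernels)
    out = []
    for y in range(h):
        kernel = kernels[y * bins // h]  # row index -> bin index
        out.append([conv_at(feature, kernel, y, x) for x in range(w)])
    return out
```

Passing a list with one kernel gives the spatially invariant (global) behaviour, while passing several kernels gives the location-specific (local) behaviour, which mirrors the two parallel paths in the figure.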

Figure 2. Anchor Formulation and Visualized 3D Anchors. We depict each parameter within the 2D / 3D anchor formulation (left). We visualize the precomputed 3D priors when 12 anchors are used, after projection into the image view (middle) and the Bird’s Eye View (right).
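The image-view and Bird’s Eye View panels both derive from the same 3D box. The sketch below illustrates that geometry under stated assumptions: a KITTI-style camera frame (x right, y down, z forward), a 3x4 pinhole projection matrix `P`, and a box whose length lies along its local x axis and is rotated by `yaw` about the vertical axis. The function names and any numeric values used with them are hypothetical and are not taken from the released code.

```python
import math


def project_point(P, point):
    """Project a 3D camera-frame point with a 3x4 matrix; returns (u, v, depth)."""
    x, y, z = point
    hom = (x, y, z, 1.0)  # homogeneous coordinates
    u, v, s = (sum(p * c for p, c in zip(row, hom)) for row in P)
    return u / s, v / s, s


def bev_footprint(center, width, length, yaw):
    """Four (x, z) ground-plane corners of a box rotated by `yaw` about the y axis."""
    cx, _, cz = center
    c, s = math.cos(yaw), math.sin(yaw)
    half_l, half_w = length / 2.0, width / 2.0
    corners = []
    for dx, dz in ((half_l, half_w), (half_l, -half_w),
                   (-half_l, -half_w), (-half_l, half_w)):
        # Rotation about the vertical (y) axis, then translation to the center.
        corners.append((cx + c * dx + s * dz, cz - s * dx + c * dz))
    return corners
```

For example, `project_point(P, center)` gives the pixel location and depth of a box center in the image view, and `bev_footprint(center, width, length, yaw)` gives the rectangle that would be drawn in a top-down view.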

Table 1. Bird’s Eye View. Comparison of our method to image-only 3D localization frameworks on the Bird’s Eye View task.

Table 2. 3D Detection. Comparison of our method to image-only 3D localization frameworks on the 3D Detection task.

Figure 3. Qualitative Examples. We visualize qualitative examples of our method for multi-class 3D object detection. We use yellow to denote cars, green for pedestrians, and orange for cyclists. All illustrated images are from the val1 split and not used for training.

Sample Results

Video 1. Demo Video. We process the raw KITTI image sequences and visualize both the image view (top) and the corresponding Bird’s Eye View (bottom). We encode 3D boxes for car as magenta, cyclist as blue, and pedestrian as green, consistent in each view.

The M3D-RPN implementation in Python and PyTorch may be downloaded from here.

If you use M3D-RPN code, please cite the ICCV 2019 paper.
