We present an autoregressive pedestrian detection framework with cascaded phases designed to progressively improve precision. The proposed framework utilizes a novel lightweight stackable decoder-encoder module which uses convolutional re-sampling layers to improve features while maintaining efficient memory and runtime cost. Unlike previous cascaded detection systems, our proposed framework is designed within a region proposal network and thus retains greater context of nearby detections compared to independently processed RoI systems. We explicitly encourage increasing levels of precision by assigning increasingly strict labeling policies to each consecutive phase, such that early phases develop features primarily focused on achieving high recall and later phases on achieving high precision. As a consequence, the final feature maps form peakier radial gradients emanating from the centroids of unique pedestrians. Our proposed autoregressive framework achieves new state-of-the-art performance on the reasonable and occlusion settings of the Caltech pedestrian dataset, and performance competitive with the state of the art on the KITTI dataset.
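The per-phase labeling idea can be illustrated with a minimal sketch. It assumes that "stricter" means a higher IoU threshold for calling an anchor foreground, which is only one plausible reading of the abstract. The threshold values, the function names, and the example boxes are all hypothetical and are not taken from the AR-Ped code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical thresholds, one per phase; the real values are an assumption.
PHASE_IOU_THRESHOLDS = (0.4, 0.5, 0.6)


def label_anchor(anchor, gt_boxes, phase):
    """Return 1 (foreground) if the best IoU meets this phase's threshold, else 0."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    return 1 if best >= PHASE_IOU_THRESHOLDS[phase] else 0


# An anchor that overlaps a pedestrian box only loosely (IoU = 140/260).
anchor = (0.0, 0.0, 10.0, 20.0)
gt_boxes = [(0.0, 6.0, 10.0, 26.0)]
labels = [label_anchor(anchor, gt_boxes, phase) for phase in range(3)]
```

Under these assumed thresholds, the loosely aligned anchor counts as foreground in the first two phases and as background in the last one, so later phases are trained to reject it.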

Overview AR-RPN

Figure 1. Overview of AR-RPN (left) and our de-encoder module (right). The de-encoder module consists of top-down and bottom-up pathways with inner-lateral convolution to produce diversified features, as well as convolutional re-sampling layers (s denotes convolutional stride) e_i and d_i for memory-efficient generation. We further condition predictions on the previous phase predictions through concatenation within f_k(·).

AR-Ped Phase Visualization

Figure 2. We visualize the prediction maps of each phase using the max foreground scores across all anchors at each location. We use scaled blue → yellow colors, where yellowness indicates high detection confidence. The detections of each phase become increasingly tight and better suited to non-maximum suppression. We analyze the prediction disagreements between phases ∆1 → 3, shown in the right column, where green represents foreground agreement and magenta represents the regions suppressed.
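The max-over-anchors reduction described in this caption can be sketched as follows. This is an illustrative sketch in plain Python, not the authors' Matlab/Caffe code: the function names are hypothetical, the scores are toy values, and min-max scaling is an assumed stand-in for the "scaled" color mapping.

```python
def prediction_map(fg_scores):
    """Collapse an H x W x A grid of anchor foreground scores to H x W
    by keeping the highest-scoring anchor at each location."""
    return [[max(cell) for cell in row] for row in fg_scores]


def scale_to_unit(pmap):
    """Min-max scale a 2D map to [0, 1] so it can drive a blue -> yellow colormap.
    (Assumed scaling; the caption only says the colors are scaled.)"""
    lo = min(min(row) for row in pmap)
    hi = max(max(row) for row in pmap)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in pmap]
    return [[(value - lo) / span for value in row] for row in pmap]


# Toy 2 x 2 grid with 3 anchors per location.
scores = [
    [[0.1, 0.2, 0.05], [0.9, 0.3, 0.4]],
    [[0.5, 0.5, 0.1], [0.0, 0.1, 0.1]],
]
pmap = prediction_map(scores)
scaled = scale_to_unit(pmap)
```

Here 0 would map to the blue end of the colormap and 1 to the yellow end.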

AR-Ped Results

Table 1. Comprehensive comparison of our framework and state-of-the-art on the Caltech and KITTI benchmarks, in both accuracy and runtime (RT). We show the Caltech miss rates at multiple challenging settings, with both the original (O) and new (N) annotations, and at various occlusion settings. Boldface/italic indicate the best/second best performance.

Sample Results

Video 1. Test sequences from the Caltech test set demonstrating the effects of AR-Ped. Row 1: matched detections with green true positive, yellow false positive, or magenta "ignored" / DontCare, following the Caltech reasonable setting. Row 2: prediction map using channel-wise max of the anchor foreground predictions (i.e., 45x60x9 --> 45x60x1), then visualized using scaled colors from blue --> yellow. Row 3: the disagreements in foreground between phase X as compared to future phase Y. We use green to encode agreement and red/blue to encode disagreement. Thus, magenta represents areas which are suppressed by the future phases.
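The color encoding of Row 3 can be sketched as below. This is a hedged illustration, not the authors' visualization code: the 0.5 foreground threshold, the function name, and the choice to leave every other location black (including locations that only the later phase marks as foreground) are assumptions.

```python
GREEN = (0, 255, 0)      # both phases predict foreground (agreement)
MAGENTA = (255, 0, 255)  # red + blue: foreground in phase X, suppressed by phase Y
BLACK = (0, 0, 0)        # everything else (assumed; not specified in the caption)


def disagreement_map(early, later, threshold=0.5):
    """Compare two H x W prediction maps and return an H x W grid of RGB tuples."""
    colored = []
    for early_row, later_row in zip(early, later):
        colored_row = []
        for early_score, later_score in zip(early_row, later_row):
            early_fg = early_score >= threshold
            later_fg = later_score >= threshold
            if early_fg and later_fg:
                colored_row.append(GREEN)
            elif early_fg:
                colored_row.append(MAGENTA)
            else:
                colored_row.append(BLACK)
        colored.append(colored_row)
    return colored


# Toy maps: phase X fires on three locations, phase Y keeps only one of them.
phase_x = [[0.9, 0.7], [0.6, 0.1]]
phase_y = [[0.95, 0.2], [0.3, 0.05]]
colors = disagreement_map(phase_x, phase_y)
```

In this toy example the top-left location stays green, while the two locations dropped by the later phase turn magenta.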

AR-Ped Source Code

AR-Ped implementation in Matlab and Caffe may be downloaded from here.

If you use AR-Ped code, please cite the CVPR 2019 paper:


  • Pedestrian Detection with Autoregressive Network Phases
    Garrick Brazil, Xiaoming Liu
    In Proceeding of IEEE Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, Jun. 2019
  • @inproceedings{ pedestrian-detection-with-autoregressive-network-phases,
      author = { Garrick Brazil and Xiaoming Liu },
      title = { Pedestrian Detection with Autoregressive Network Phases },
      booktitle = { In Proceeding of IEEE Computer Vision and Pattern Recognition },
      address = { Long Beach, CA },
      month = { June },
      year = { 2019 },