We present an autoregressive pedestrian detection framework with cascaded phases designed to progressively improve precision. The proposed framework utilizes a novel lightweight stackable decoder-encoder module which uses convolutional re-sampling layers to refine features while maintaining efficient memory and runtime costs. Unlike previous cascaded detection systems, our proposed framework is designed within a region proposal network and thus retains greater context of nearby detections compared to independently processed RoI systems. We explicitly encourage increasing levels of precision by assigning strict labeling policies to each consecutive phase, such that early phases develop features primarily focused on achieving high recall while later phases focus on accurate precision. In consequence, the final feature maps form more peaky radial gradients emanating from the centroids of unique pedestrians. Using our proposed autoregressive framework leads to new state-of-the-art performance on the reasonable and occlusion settings of the Caltech pedestrian dataset, and achieves competitive state-of-the-art performance on the KITTI dataset.
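The progressively stricter per-phase labeling can be sketched as below. The phase count and IoU thresholds here are illustrative placeholders, not the paper's exact training settings; the point is only that each successive phase labels fewer anchors as foreground, pushing early phases toward recall and later phases toward precision.

```python
import numpy as np

def label_anchors_per_phase(ious, fg_thresholds=(0.4, 0.5, 0.6)):
    """Assign foreground(1)/background(0) labels per cascade phase.

    ious: (N,) best IoU of each anchor with any ground-truth box.
    fg_thresholds: one IoU cutoff per phase; later phases are stricter.
    (Thresholds are hypothetical, not the paper's actual values.)
    """
    ious = np.asarray(ious, dtype=np.float32)
    # One row of labels per phase; stricter cutoffs mark fewer anchors fg.
    return np.stack([(ious >= t).astype(np.int64) for t in fg_thresholds])

labels = label_anchors_per_phase([0.35, 0.45, 0.55, 0.7])
# rows: phase 1 keeps 3 anchors fg, phase 2 keeps 2, phase 3 keeps 1
```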

Figure 1. Overview of AR-RPN (left) and our de-encoder module (right). The de-encoder module consists of top-down and bottom-up pathways with inner-lateral convolution to produce diversified features, as well as convolutional re-sampling layers (s denotes convolutional stride) e_i and d_i for memory-efficient generation. We further condition predictions on the previous phase predictions through concatenation within f_k(·).
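A minimal sketch of the autoregressive conditioning described in the caption: phase k concatenates the previous phase's prediction map onto the feature map before applying its head f_k. The shapes and the 1x1-conv head (written as a per-location matrix multiply) are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def condition_on_previous_phase(features, prev_preds, w, b):
    """Concatenate prior-phase predictions onto the features, then
    apply a hypothetical 1x1-conv head as phase k's classifier f_k.

    features:   (H, W, C)  shared backbone features
    prev_preds: (H, W, A)  previous phase's per-anchor fg scores
    w: (C + A, A), b: (A,) assumed 1x1-conv head weights
    """
    x = np.concatenate([features, prev_preds], axis=-1)  # (H, W, C + A)
    return x @ w + b                                     # (H, W, A)

H, W, C, A = 4, 5, 8, 9
rng = np.random.default_rng(0)
out = condition_on_previous_phase(
    rng.standard_normal((H, W, C)),
    rng.standard_normal((H, W, A)),
    rng.standard_normal((C + A, A)),
    np.zeros(A),
)
# out.shape == (4, 5, 9): one refined score per anchor per location
```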

Figure 2. We visualize the prediction maps of each phase using the max foreground scores across all anchors at each location. We use scaled blue → yellow colors, where yellowness indicates high detection confidence. The detections of each phase become increasingly tighter and more amenable to non-maximum suppression. We analyze the prediction disagreements between phases 1 → 3, shown in the right column, where green represents foreground agreement and magenta represents the regions suppressed.
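The disagreement maps in the right column can be computed as in the sketch below: binarize each phase's foreground scores and compare. The 0.5 threshold is an illustrative assumption; green corresponds to locations both phases call foreground, magenta to locations the later phase suppresses.

```python
import numpy as np

def phase_disagreement(p_early, p_late, thresh=0.5):
    """Compare binarized foreground maps of an early and a late phase.

    Returns boolean masks: 'agree' where both phases predict foreground
    (rendered green) and 'suppressed' where the early phase predicts
    foreground but the later phase rejects it (rendered magenta).
    The 0.5 threshold is a hypothetical choice for illustration.
    """
    fg_early = np.asarray(p_early) >= thresh
    fg_late = np.asarray(p_late) >= thresh
    return fg_early & fg_late, fg_early & ~fg_late

agree, suppressed = phase_disagreement([0.9, 0.8, 0.2], [0.9, 0.3, 0.1])
# agree → [True, False, False]; suppressed → [False, True, False]
```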

Table 1. Comprehensive comparison of our framework and state-of-the-art on the Caltech and KITTI benchmarks, in both accuracy and runtime (RT). We show the Caltech miss rates at multiple challenging settings, with both the original (O) and new (N) annotations, and at various occlusion settings. Boldface/italic indicate the best/second best performance.
Sample Results
Video 1. Test sequences from the Caltech test set demonstrating the effects of AR-Ped. Row 1: matched detections with green true positive, yellow false positive, or magenta "ignored" / DontCare, following the Caltech reasonable setting. Row 2: prediction map using channel-wise max of the anchor foreground predictions (i.e., 45x60x9 → 45x60x1) then visualized using scaled colors from blue → yellow. Row 3: the disagreements in foreground between phase X as compared to future phase Y. We use green to encode agreement and red/blue to encode disagreement. Thus, magenta represents areas which are suppressed by the future phases.
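The Row 2 reduction described above is a channel-wise max over anchors at each spatial location, e.g. collapsing a 45x60x9 score tensor to 45x60x1 before color-mapping. A minimal sketch:

```python
import numpy as np

def prediction_heatmap(anchor_scores):
    """Collapse per-anchor foreground scores to one value per location
    via a channel-wise max, e.g. (45, 60, 9) -> (45, 60, 1); the result
    is then color-mapped from blue to yellow for display.
    """
    return np.max(anchor_scores, axis=-1, keepdims=True)

scores = np.random.default_rng(1).random((45, 60, 9))
heatmap = prediction_heatmap(scores)
# heatmap.shape == (45, 60, 1)
```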