Understanding the world around us and making decisions about the future is a critical component of human intelligence. As autonomous systems continue to develop, their ability to reason about the future will be key to their success. Semantic anticipation is a relatively under-explored area that autonomous vehicles could take advantage of (e.g., forecasting pedestrian trajectories). Motivated by the need for real-time prediction in autonomous systems, we propose to decompose the challenging semantic forecasting task into two subtasks: current-frame segmentation and future optical flow prediction. Through this decomposition, we build an efficient, effective, low-overhead model with three main components: a flow prediction network, a feature-flow aggregation LSTM, and an end-to-end learnable warp layer. Our proposed method achieves state-of-the-art accuracy on short-term and moving-object semantic forecasting while simultaneously reducing model parameters by up to 95% and increasing efficiency by greater than 40x.

Figure 1. Our proposed approach aggregates past optical flow features using a convolutional LSTM to predict future optical flow, which is used by a learnable warp layer to produce the future segmentation mask.
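The warp layer in Figure 1 can be illustrated with a minimal sketch: given a current-frame segmentation and a predicted flow field, each future pixel is filled by sampling the current frame at the location the flow points back to. The function below is a hypothetical, simplified stand-in (nearest-neighbor sampling in NumPy, assuming `flow[y, x]` holds the `(dx, dy)` displacement from a future pixel back to its source in the current frame); the actual layer is differentiable and learned end-to-end.

```python
import numpy as np

def warp_segmentation(seg, flow):
    """Backward-warp a segmentation mask with a predicted flow field.

    seg:  (H, W) integer class labels for the current frame.
    flow: (H, W, 2) flow in pixels; flow[y, x] = (dx, dy) points from
          the future pixel (x, y) back to its source in the current frame
          (an assumed convention for this sketch).
    Returns an (H, W) estimate of the future segmentation.
    """
    h, w = seg.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # For each future pixel, look up its source location in the current
    # frame; nearest-neighbor rounding keeps class labels discrete.
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return seg[src_y, src_x]
```

For example, an object occupying column 1 that moves one pixel to the right is reproduced at column 2 when every future pixel carries a backward flow of `dx = -1`.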

Table 1. Computational complexity analysis with respect to previous work. Models are measured without the SegCNN included (only SegPred). Runtime estimates were calculated by averaging 100 forward passes with each model. "Single vs. sliding" testing refers to a single forward pass at 512 × 1,024 resolution versus the costly sliding-window approach of eight overlapping 713 × 713 full-resolution crops.

Table 2. Comparison of available baselines for the short-term (t = 3) and mid-term (t = 9) semantic forecasting tasks across all nineteen classes in Cityscapes. We further emphasize our model's capability on the eight foreground, moving-object (MO) classes.

Table 3. Comparison of available baselines for short-term (t = 1) and mid-term (t = 10) forecasting. † indicates a model trained without recurrent fine-tuning. We compare our model with FlowNet2-c and FlowNet2-C backbones, where C contains approximately 8/3× as many feature channels.
Sample Results
