Feature fusion plays a crucial role in unconstrained face recognition, where inputs (probes) comprise a set of N low-quality images whose individual qualities vary. Advances in attention and recurrent modules have led to feature fusion that can model the relationship among the images in the input set. However, attention mechanisms cannot scale to large N due to their quadratic complexity, and recurrent modules suffer from input-order sensitivity. We propose a two-stage feature fusion paradigm, Cluster and Aggregate, that can both scale to large N and maintain the ability to perform sequential inference with order invariance. Specifically, the Cluster stage is a linear assignment of the N inputs to M global cluster centers, and the Aggregate stage is a fusion over the M clustered features. The clustered features play an integral role when the inputs are sequential, as they can serve as a summarization of past features. By leveraging the order invariance of the incremental averaging operation, we design an update rule that achieves batch-order invariance, which guarantees that the contributions of early images in the sequence do not diminish as time steps increase. Experiments on the IJB-B and IJB-S benchmark datasets show the superiority of the proposed two-stage paradigm in unconstrained face recognition. Code and pretrained models are available.
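The batch-order-invariant update rule can be illustrated with a running weighted average: if we carry the cumulative weighted sum and the cumulative weight across batches, the fused result is the same no matter in which order the batches arrive, and early images keep their full contribution. This is a minimal sketch of that principle only; the actual CAFace update operates on clustered representations, and the names below are illustrative.

```python
import numpy as np

def update(state, F_batch, w_batch):
    """Merge a new batch into the running state via incremental
    weighted averaging; the result is independent of batch order."""
    s, w = state  # cumulative weighted feature sum, cumulative weight
    return s + w_batch @ F_batch, w + w_batch.sum()

# Toy data: 6 image features (dim 4) split into two batches.
rng = np.random.default_rng(0)
F = rng.standard_normal((6, 4))
w = rng.random(6)

# Process the two batches in opposite orders.
s1 = update(update((np.zeros(4), 0.0), F[:3], w[:3]), F[3:], w[3:])
s2 = update(update((np.zeros(4), 0.0), F[3:], w[3:]), F[:3], w[:3])

fused1 = s1[0] / s1[1]
fused2 = s2[0] / s2[1]
assert np.allclose(fused1, fused2)  # batch-order invariance holds
```

Because the state is a sum, each image's weight is fixed at the time it is seen and never decays, unlike a recurrent update where early inputs are repeatedly transformed.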

Problem Definition


Figure 1. • A probe (or gallery) video can 1. contain faces of varied identifiability, 2. vary in the number of images, and 3. arrive sequentially in the input stream. We propose a feature fusion algorithm that can be used in such scenarios.

Comparison of Feature Fusion Methods


Figure 2. • Comparison of feature fusion paradigms. a) In the individual paradigm, each probe sample’s weight is determined independently. b) In the intra-set paradigm, the sample weight is determined based on all inputs. However, when N is large or the input is sequential, intra-set calculations become infeasible. c) In the Cluster and Aggregate paradigm, the intermediate representation F′ (green) can be updated across batches, allowing for large-N intra-set modeling and sequential inference. Sharing universal cluster centers C ensures consistency of F′ across batches. Unlike an RNN, the update rule is batch-order invariant.

Overall Pipeline (Loss Function)


Figure 3. • An overview of CAFace with the Cluster and Aggregate paradigm. The task is to fuse a sequence of images into a single feature vector f for face recognition. SIM is responsible for decoupling facial identity features F from image styles S that carry information for feature fusion (Sec. 3.1). The Cluster Network (CN) calculates the affinity of S to the global centers C and produces an assignment map A, which is used to map F and S to fixed-size representations F′ and S′. Note that F′ and S′ are linear combinations of the raw inputs F and S, respectively. This property ensures that the previous and current batch representations can be combined using a weighted average, which is order-invariant. Lastly, AGN computes the intra-set relationship of S′ to estimate the importance of F′ for fusion. For interpretability, AGN produces the weights for averaging F′ to obtain f.
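The dataflow in the figure can be sketched end to end with placeholder operations: a soft assignment of N inputs to M shared centers, linear pooling into fixed-size F′ and S′, and a weighted average to produce f. This is a simplified sketch under assumed shapes; the learned SIM, CN, and AGN networks are replaced here by random features, a dot-product softmax, and a mean-based weighting, respectively.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: N input images, M global cluster centers, d dims.
N, M, d = 8, 4, 16
rng = np.random.default_rng(1)
F = rng.standard_normal((N, d))   # identity features (stand-in for SIM output)
S = rng.standard_normal((N, d))   # style features (stand-in for SIM output)
C = rng.standard_normal((M, d))   # shared global cluster centers

# Cluster stage (simplified CN): soft-assign each input to the centers.
A = softmax(S @ C.T, axis=1)          # (N, M) assignment map
A = A / A.sum(axis=0, keepdims=True)  # normalize contributions per center
F_prime = A.T @ F                     # (M, d) clustered identity features
S_prime = A.T @ S                     # (M, d) clustered style features

# Aggregate stage (simplified AGN): importance weights from clustered styles.
w = softmax(S_prime.mean(axis=1))     # (M,) fusion weights
f = w @ F_prime                       # single fused feature vector
assert f.shape == (d,)
```

Because F′ and S′ are linear combinations of the raw inputs, the clustered representations from successive batches can be merged by weighted averaging, which is what enables the sequential inference described above.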

NeurIPS 2022 Presentation

CAFace Source Code

The source code can be downloaded from here


  • Cluster and Aggregate: Face Recognition with Large Probe Set
    Minchul Kim, Feng Liu, Anil Jain, Xiaoming Liu
In Proceedings of the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, Dec. 2022
    Bibtex | PDF | arXiv | Supplemental
  • @inproceedings{ cluster-and-aggregate-face-recognition-with-large-probe-set,
      author = { Minchul Kim and Feng Liu and Anil Jain and Xiaoming Liu },
      title = { Cluster and Aggregate: Face Recognition with Large Probe Set },
      booktitle = { Proceedings of the Thirty-sixth Conference on Neural Information Processing Systems },
      address = { New Orleans, LA },
      month = { December },
      year = { 2022 },
    }