Indoor video surveillance systems often use the face modality to establish the identity of a person of interest. However, the face image may not offer sufficient discriminatory information in many scenarios due to substantial variations in pose, illumination, expression, resolution and distance between the subject and the camera.

In such cases, the inclusion of an additional biometric modality can benefit the recognition process. In this regard, we consider the fusion of voice and face modalities for enhancing the recognition accuracy. The main contribution of this work is assembling a multimodal (face and voice), semi-constrained, indoor video surveillance dataset referred to as the MSU Audio-Video Indoor Surveillance (MSU-AVIS) dataset. We use a consumer-grade camera with a built-in microphone to acquire data for this purpose. We use current state-of-the-art deep-learning-based methods to perform face and speaker recognition on the collected dataset to establish baseline performance. We also explore multiple fusion schemes that combine face and speaker recognition for effective person recognition on audio-video surveillance data. Experiments demonstrate the efficacy of the proposed multimodal fusion scheme (face and voice) over unimodal approaches in surveillance scenarios. The collected dataset is being made available for research purposes.
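To make the idea of combining the two modalities concrete, here is a minimal sketch of one common score-level fusion approach: min-max normalization followed by a weighted sum. This is an illustration under stated assumptions and not necessarily one of the fusion schemes evaluated in the paper. The function names, the `face_weight` parameter, and the example scores are all hypothetical.

```python
from typing import Dict, List


def min_max_normalize(scores: List[float]) -> List[float]:
    """Rescale match scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(
    face: Dict[str, float],
    voice: Dict[str, float],
    face_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> Dict[str, float]:
    """Weighted-sum fusion of per-identity face and voice match scores."""
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    # Only identities scored by both matchers can be fused.
    ids = sorted(face.keys() & voice.keys())
    if not ids:
        raise ValueError("no identities are shared by the two matchers")
    face_n = min_max_normalize([face[i] for i in ids])
    voice_n = min_max_normalize([voice[i] for i in ids])
    return {
        i: face_weight * f + (1.0 - face_weight) * v
        for i, f, v in zip(ids, face_n, voice_n)
    }


def identify(scores: Dict[str, float]) -> str:
    """Return the identity with the highest score."""
    return max(scores, key=scores.get)


# Made-up similarity scores for one probe clip (higher means more similar).
face_scores = {"subject_1": 0.40, "subject_2": 0.45, "subject_3": 0.20}
voice_scores = {"subject_1": 0.90, "subject_2": 0.30, "subject_3": 0.25}

fused = fuse_scores(face_scores, voice_scores)
print("face only:", identify(face_scores))
print("fused:    ", identify(fused))
```

In this made-up example the face matcher alone narrowly prefers the wrong identity, while the fused score favors `subject_1` because the voice evidence is strong. In practice the weight would be tuned on held-out data, and other normalization or fusion rules may work better.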


Figure 1: Examples of video clips where face recognition fails. The larger images are sub-regions from a frame obtained from the clip. The smaller images are some of the other faces obtained from the same clip.

Probe data example

We used a webcam (Logitech C920 HD Pro), fixed at a height of 240 cm from the ground, to collect probe data. Its built-in dual stereo microphones record the audio in the room. The target was asked to speak freely (text-independent) while the probe surveillance videos were acquired. Below are three probe videos of subject 1 in the MSU-AVIS dataset:

Subject 1 - probe 1 (small face size, and degraded voice quality)

Subject 1 - probe 2 (medium face size, and better voice quality)

Subject 1 - probe 3 (one person in video, closer face image, and good voice quality)

MSU-AVIS Dataset

You can download the MSU-AVIS dataset from here.

Publications

  • MSU-AVIS dataset: Fusing Face and Voice Modalities for Biometric Recognition in Indoor Surveillance Videos
    Anurag Chowdhury, Yousef Atoum, Luan Tran, Xiaoming Liu, Arun Ross
    In Proceedings of the International Conference on Pattern Recognition (ICPR 2018), Beijing, China, Aug. 2018
    @inproceedings{ msu-avis-dataset-fusing-face-and-voice-modalities-for-biometric-recognition-in-indoor-surveillance-videos,
      author = { Anurag Chowdhury and Yousef Atoum and Luan Tran and Xiaoming Liu and Arun Ross },
      title = { MSU-AVIS dataset: Fusing Face and Voice Modalities for Biometric Recognition in Indoor Surveillance Videos },
      booktitle = { Proceedings of the International Conference on Pattern Recognition },
      address = { Beijing, China },
      month = { August },
      year = { 2018 },
    }