MSU-AVIS dataset: Fusing Face and Voice Modalities for Biometric Recognition in Indoor Surveillance Videos

Indoor video surveillance systems often use the face modality to establish the identity of a person of interest. However, the face image may not offer sufficient discriminatory information in many scenarios due to substantial variations in pose, illumination, expression, resolution and distance between the subject and the camera.

In such cases, the inclusion of an additional biometric modality can benefit the recognition process. In this regard, we consider the fusion of voice and face modalities to enhance recognition accuracy. The main contribution of this work is a multimodal (face and voice), semi-constrained, indoor video surveillance dataset referred to as the MSU Audio-Video Indoor Surveillance (MSU-AVIS) dataset. We acquire the data with a consumer-grade camera that has a built-in microphone. To establish baseline performance, we apply current state-of-the-art deep-learning-based methods for face and speaker recognition to the collected dataset. We also explore multiple fusion schemes that combine face and speaker recognition for effective person recognition on audio-video surveillance data. Experiments demonstrate the efficacy of the proposed multimodal fusion scheme (face and voice) over unimodal approaches in surveillance scenarios. The collected dataset is being made available for research purposes.
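As an illustration of what combining the two modalities can look like, the following is a minimal sketch of score-level fusion. It is not the fusion scheme evaluated on MSU-AVIS, which this page does not specify. It assumes each matcher returns one similarity score per gallery subject, rescales each score list with min-max normalization, and takes a weighted sum. The function names and the `face_weight` parameter are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(
    face_scores: Sequence[float],
    voice_scores: Sequence[float],
    face_weight: float = 0.5,  # assumed weight, not a value from the dataset authors
) -> List[float]:
    """Weighted-sum fusion of per-subject face and voice similarity scores."""
    if len(face_scores) != len(voice_scores):
        raise ValueError("both matchers must score the same gallery subjects")
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    face = min_max_normalize(face_scores)
    voice = min_max_normalize(voice_scores)
    return [face_weight * f + (1.0 - face_weight) * v for f, v in zip(face, voice)]


def identify(fused_scores: Sequence[float]) -> int:
    """Return the index of the gallery subject with the highest fused score."""
    return max(range(len(fused_scores)), key=lambda i: fused_scores[i])


# Made-up scores for three gallery subjects (higher means more similar).
face_example = [1.0, 3.0, 2.0]
voice_example = [10.0, 20.0, 40.0]
best_match = identify(fuse_scores(face_example, voice_example))
```

With these made-up scores the face matcher alone prefers the second subject, while the fused score prefers the third, which shows how a strong voice match can override a weak face match. Normalizing before summing matters because the two matchers may report scores on very different scales.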

Figure 1: Examples of video clips where face recognition fails. The larger images are sub-regions from a frame obtained from the clip. The smaller images are some of the other faces obtained from the same clip.

Gallery data example

For assembling the gallery face images and audio, we used the same web-cam, with the height of the camera set to match the height of the target subject. Each subject was asked to turn their head slowly to the right to capture various face poses.

The gallery audio was scripted (text-dependent): each subject was required to select a short script at random from a list of five scripts.

Probe data example

We collected probe data with a web-cam (Logitech C920 HD Pro) fixed at a height of 240 cm from the ground. Its built-in dual stereo microphones recorded the audio in the room. The target was asked to speak freely (text-independent) while the probe surveillance videos were acquired. Below are three probe videos of subject 1 from the MSU-AVIS dataset:

Subject 1 - probe 1 (small face size, and degraded voice quality)

Subject 1 - probe 2 (medium face size, and better voice quality)

Subject 1 - probe 3 (one person in video, closer face image, and good voice quality)

You can download the MSU-AVIS dataset from here.
