Indoor video surveillance systems often use the face modality to establish the identity of a person of interest. However, the face image may not offer sufficient discriminatory information in many scenarios due to substantial variations in pose, illumination, expression, resolution and distance between the subject and the camera.
In such cases, the inclusion of an additional biometric modality can benefit the recognition process. In this regard, we consider the fusion of voice and face modalities to enhance recognition accuracy. The main contribution of this work is a multimodal (face and voice), semi-constrained, indoor video surveillance dataset, referred to as the MSU Audio-Video Indoor Surveillance (MSU-AVIS) dataset. We acquire the data with a consumer-grade camera that has a built-in microphone. To establish baseline performance, we apply current state-of-the-art deep-learning-based methods for face and speaker recognition to the collected dataset. We also explore multiple fusion schemes that combine face and speaker recognition for effective person recognition on audio-video surveillance data. Experiments demonstrate the efficacy of the proposed multimodal fusion scheme (face and voice) over unimodal approaches in surveillance scenarios. The collected dataset is being made available for research purposes.
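The fusion schemes themselves are not detailed here. As an illustration only, the sketch below shows one common option, score-level fusion: each modality's match scores against the gallery are min-max normalized and then combined with a weighted sum. The function names, the `face_weight` parameter, and the example scores are all hypothetical and are not taken from the MSU-AVIS experiments.

```python
def min_max_normalize(scores):
    """Rescale a list of match scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are identical, so they carry no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(face_scores, voice_scores, face_weight=0.5):
    """Weighted-sum fusion of per-identity face and voice match scores.

    Both lists must be ordered by the same gallery identities.
    face_weight is an assumed tuning parameter in [0, 1].
    """
    if len(face_scores) != len(voice_scores):
        raise ValueError("score lists must cover the same gallery identities")
    face_n = min_max_normalize(face_scores)
    voice_n = min_max_normalize(voice_scores)
    return [
        face_weight * f + (1.0 - face_weight) * v
        for f, v in zip(face_n, voice_n)
    ]


def identify(face_scores, voice_scores, gallery_ids, face_weight=0.5):
    """Return the gallery identity with the highest fused score."""
    fused = fuse_scores(face_scores, voice_scores, face_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return gallery_ids[best], fused[best]


# Hypothetical similarity scores of one probe against three gallery subjects.
gallery_ids = ["subject_1", "subject_2", "subject_3"]
face_scores = [0.30, 0.35, 0.20]   # low-resolution face: weakly separated
voice_scores = [0.10, 0.80, 0.40]  # voice separates the identities better
print(identify(face_scores, voice_scores, gallery_ids))
```

With these made-up scores, the face matcher alone barely distinguishes the first two subjects, while the fused score clearly favors `subject_2`. In practice the weight would be tuned on held-out data, and other schemes (for example, rank-level or feature-level fusion) are equally plausible.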
Gallery data example
To assemble the gallery face images and audio, we used the same webcam as for the probe data, with the camera height set to match the height of the target subject. Each subject was asked to turn their head slowly to the right so that various face poses could be captured.
The gallery audio was scripted (text-dependent): each target was required to select a short script at random from a list of five scripts.
Probe data example
We collected the probe data with a webcam (Logitech C920 HD Pro) fixed at a height of 240 cm above the ground. Its built-in dual stereo microphones were used to record the audio in the room. The target was asked to speak freely (text-independent) while the probe surveillance videos were acquired. Below are three probe videos of subject 1 in the MSU-AVIS dataset:
Subject 1 - probe 1 (small face size and degraded voice quality)
Subject 1 - probe 2 (medium face size and better voice quality)
Subject 1 - probe 3 (one person in video, closer face image, and good voice quality)