With the booming development of artificial intelligence, biometric recognition technologies such as face, iris, fingerprint, and voice recognition have been widely applied in daily life. Among them, face recognition and speaker recognition in particular offer high user acceptance and low sampling cost, and their use on mobile devices has become common. In practical applications, however, audiovisual recognition is easily affected by complex real-world scenes such as variety shows, interviews, singing, movies, and TV dramas. For speaker recognition, these scenes introduce problems such as multiple speakers, unclear speech caused by varying speaker-microphone distance, and environmental noise; for face recognition, they introduce multiple faces, profile views or occlusion, and variable lighting. An effective approach in current research is to exploit the complementarity of the two modalities and fuse multimodal information for identity verification. Against this background, this thesis studies audiovisual multimodal identity verification in natural real-world scenes, combining speaker recognition and face recognition. The main research contents and contributions are as follows:

(1) Dataset construction. There is currently no audiovisual multimodal recognition dataset covering multiple complex scenes, which makes it difficult to study recognition performance in real scenes. This thesis provides an online data collection platform and constructs an audiovisual dataset for multimodal biometric recognition containing 3,485 videos of 250 target individuals from 11 different scenes, including multiple complex ones. The videos are divided into 316,832 fully parallel audio/video segments. The dataset thus offers considerable research value for multimodal biometric recognition tasks.

(2) Speaker verification and face verification. Using the existing MOBIO, MSU-AVIS, AveRobot, and VoxCeleb1 datasets together with the CN-Celeb3 dataset constructed in this thesis, single-modal speaker verification and face verification are performed with models of currently leading performance; the verification pairs used in the two tasks are fully parallel. The baseline for speaker verification is ECAPA-TDNN, while the face verification pipeline uses RetinaFace for face detection and ArcFace for face recognition. The experimental results show that the equal error rates (EER) are very low, all below 2.5%, on single-scene datasets such as MOBIO and VoxCeleb1. The video modality of MSU-AVIS suffers more severe information loss than the audio modality, and both modalities suffer severe loss in AveRobot. The CN-Celeb3 dataset, with its multiple complex scenes, yields EERs of 19.72% and 15.43% on the speaker recognition and face recognition tasks respectively; by contrast with the lossy datasets above, its modality information is relatively well preserved, so the difficulty stems from scene complexity, giving it high research value.
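To make the single-modal verification protocol concrete, the following is a minimal sketch of how a verification trial can be scored and how the EER is computed. The embeddings here are random stand-ins for ECAPA-TDNN speaker embeddings or ArcFace face embeddings, and the cosine scoring and threshold sweep are standard practice, not the thesis's exact implementation.

```python
import numpy as np

def cosine_score(emb1, emb2):
    """Cosine similarity between two identity embeddings (higher = same person)."""
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

def compute_eer(scores, labels):
    """EER: the operating point where the false acceptance rate (FAR)
    equals the false rejection rate (FRR). labels: 1 = genuine, 0 = impostor."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.unique(scores)  # sweep every observed score as a threshold
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # point where the two error curves cross
    return (far[idx] + frr[idx]) / 2.0

# Hypothetical trial list: pairs of 192-dim embeddings with genuine/impostor labels.
rng = np.random.default_rng(0)
trials = []
for _ in range(50):
    e = rng.normal(size=192)
    trials.append((e, e + 0.1 * rng.normal(size=192), 1))          # same identity
    trials.append((rng.normal(size=192), rng.normal(size=192), 0))  # different identities

scores = [cosine_score(a, b) for a, b, _ in trials]
labels = [lab for _, _, lab in trials]
print(f"EER = {compute_eer(scores, labels):.2%}")
```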
(3) Audiovisual multimodal identity verification. Based on each dataset's verification scores in the two single modalities, three different weighting methods are used to fuse the two modalities in the score domain. The experimental results show that score fusion improves performance to some extent on every dataset used in this study, indicating that multimodal score fusion can indeed yield effective gains in identity verification. On the CN-Celeb3 dataset, the maximum-value fusion method improves the EER to 8.96%, demonstrating that modality fusion can effectively exploit complementary information.
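As an illustration of the score-domain fusion step, here is a minimal sketch, assuming the per-trial speaker and face scores have already been produced as above and min-max normalized to a common range. Only maximum-value fusion is named explicitly in this thesis's summary; the fixed-weight sum shown alongside it is a common alternative weighting included for illustration, not the thesis's exact formulation.

```python
import numpy as np

def minmax_norm(scores):
    """Map raw scores to [0, 1] so the two modalities are directly comparable."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def weighted_fusion(spk, face, w=0.5):
    """Fixed-weight score fusion: w * speaker + (1 - w) * face."""
    return w * spk + (1.0 - w) * face

def max_fusion(spk, face):
    """Maximum-value fusion: per trial, keep the more confident modality."""
    return np.maximum(spk, face)

# Hypothetical per-trial scores for the same (fully parallel) verification pairs.
spk_scores = minmax_norm([0.61, 0.12, 0.88, 0.47])
face_scores = minmax_norm([0.75, 0.30, 0.52, 0.10])

fused_sum = weighted_fusion(spk_scores, face_scores, w=0.5)
fused_max = max_fusion(spk_scores, face_scores)
# Either fused score vector is then evaluated with the same EER routine
# used for the single-modal systems.
```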