3D Convolutional Neural Networks Based Speaker Identification And Authentication

Posted on: 2020-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: J G Liao
Full Text: PDF
GTID: 2428330623463752
Subject: Electronic and communication engineering
Abstract/Summary:
Human biometric features such as fingerprints, faces and irises are widely used for identity recognition and authentication because of their convenience and security, which greatly facilitates people's lives. Recent research shows that lip features encode both the speaker's unique lip physiology and his or her speaking habits, and can therefore be used for speaker identification and authentication. In addition, lip features can serve as an effective complement to other biometric features: for example, combining face recognition with lip characteristics strengthens liveness detection, and combining speech recognition with lip characteristics improves recognition in noisy environments. Studying the extraction and application of lip features is therefore of great significance.

The difficulty of lip-based speaker recognition and authentication lies in extracting lip features, which must capture both the static appearance of the lips and their dynamic deformation during speech. Traditional methods such as lip contour extraction, texture feature extraction and sparse coding can extract the speaker's identity information, but their performance is unsatisfactory under varying lighting, viewing angles and distances.

In this thesis, a novel end-to-end method based on a 3D convolutional neural network (3DCNN) is proposed to extract discriminative spatiotemporal features directly from raw lip video streams. The lip video is first divided into a series of overlapping clips. For each clip, a lip-characteristics network is proposed to characterize the fine detail of the lip region and its movement. The entire lip video is then represented by the set of sub-features corresponding to its clips. Experiments on a dataset of 200 speakers show that the proposed method achieves a high identification accuracy of 99.18% and a very low authentication error (half total error rate, HTER, of 0.15%, i.e. the mean of the false acceptance and false rejection rates). Compared with several state-of-the-art methods, the proposed approach achieves better performance and is more robust to variations in speaker pose and position. The method also achieves satisfactory results on the in-the-wild VoxCeleb2 dataset, which contains nearly 3,000 speakers.
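To make the clip-based pipeline concrete, the following is a minimal PyTorch sketch of the idea described above: a lip video is split into overlapping clips along the time axis, and a small 3D-convolutional encoder produces one spatiotemporal embedding per clip. The layer sizes, clip length, stride and embedding dimension are illustrative assumptions; the thesis does not publish the exact architecture of its lip-characteristics network.

```python
# Sketch only: clip splitting + per-clip 3DCNN embedding.
# All hyperparameters below are hypothetical, not the thesis's actual values.
import torch
import torch.nn as nn


def split_into_clips(video: torch.Tensor, clip_len: int = 16, stride: int = 8):
    """Divide a lip video (C, T, H, W) into overlapping clips along time."""
    c, t, h, w = video.shape
    clips = [video[:, s:s + clip_len] for s in range(0, t - clip_len + 1, stride)]
    return torch.stack(clips)  # (num_clips, C, clip_len, H, W)


class LipClipNet(nn.Module):
    """Toy 3D-convolutional encoder: one spatiotemporal embedding per clip."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),  # joint space-time conv
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                     # pool space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d(2),                             # pool time and space
            nn.AdaptiveAvgPool3d(1),
        )
        self.embed = nn.Linear(64, embed_dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (num_clips, 3, clip_len, H, W) -> (num_clips, embed_dim)
        x = self.features(clips).flatten(1)
        return self.embed(x)


if __name__ == "__main__":
    video = torch.randn(3, 48, 64, 64)              # 48-frame RGB lip sequence
    clip_feats = LipClipNet()(split_into_clips(video))
    print(clip_feats.shape)                          # torch.Size([5, 128])
```

In such a setup, identification would compare the set of clip embeddings against enrolled speakers, and authentication would threshold a similarity score, which is how an HTER can be measured; the exact matching scheme used in the thesis is not specified here.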
Keywords/Search Tags:Visual speaker identification, Visual speaker authentication, 3DCNN, Lip feature