| Lip-reading technology can serve as a supplementary modality in multi-modal speech recognition systems, enhancing their robustness and accuracy and extending the range of scenarios in which they can be applied; it can also help hearing-impaired people communicate normally and recover language function, and can serve as a new type of coding in certain specific scenarios. Traditional lip-reading research based on two-dimensional video has made great progress. With the development of three-dimensional imaging technology, lip-reading research has broader prospects. This paper studies real-time lip reading using the Kinect sensor to acquire 3D data of speakers’ faces. The system comprises four modules: data acquisition, lip detection and localization, feature extraction, and speech recognition. First, corpus data are collected with the Kinect sensor. Second, in the data preprocessing phase, a three-dimensional lip-motion model of the face is constructed from the 3D face coordinates acquired through the Kinect Face Tracking SDK. According to the correspondence between the CANDIDE-3 and MPEG-4 standard face models, the locations of 18 feature points in the lip region are then determined. In addition, 19 feature points around the lip region are added, and together these form the Region of Interest (ROI). Then, for the 37 feature points in the ROI, four kinds of 3D spatial features are extracted: coordinate vector features formed by the coordinate origin and each feature point; geometric proportion features calculated from the lip contour; lip angle features selected using the K-Nearest Neighbors (KNN) classification algorithm; and spatial angle features, selected from the standard face model and customized to the characteristics of lip motion. These features express lip movement information more comprehensively and can reduce the 
impact of speaker posture and orientation effectively during the data acquisition phase. Next, the four spatial features are normalized by piecewise linear interpolation, and further feature selection with the KNN classification algorithm yields the most representative features, which are combined to form the final lip-reading feature. Finally, the KNN classification algorithm and ensemble learning methods are used in the classification experiments. The KNN classification results verify the high efficiency and good real-time performance of the spatial lip-reading feature. Compared with the Bagging ensemble learning method, the KNN ensemble learning method achieves better classification accuracy and is much more suitable for a real-time lip-reading system. |
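
The feature pipeline summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "normalization by piecewise linear interpolation" means resampling each utterance's per-frame feature sequence to a fixed number of frames, that an angle feature is the angle at one 3D feature point between the rays to two others, and that KNN uses Euclidean distance with majority voting. The function names `resample_sequence`, `angle_at`, and `knn_predict` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def resample_sequence(seq, target_len):
    """Resample a list of per-frame feature vectors to target_len frames
    by piecewise linear interpolation along the time axis (assumed
    interpretation of the normalization step)."""
    n = len(seq)
    if n == 1 or target_len == 1:
        # Degenerate case: nothing to interpolate, repeat the first frame.
        return [list(seq[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # fractional source frame index
        lo = min(int(pos), n - 2)             # left neighbouring frame
        t = pos - lo                          # weight of the right neighbour
        out.append([(1 - t) * a + t * b for a, b in zip(seq[lo], seq[lo + 1])])
    return out


def angle_at(p, a, b):
    """Angle in radians at 3D point p between the rays p->a and p->b."""
    u = [ai - pi for ai, pi in zip(a, p)]
    v = [bi - pi for bi, pi in zip(b, p)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos)))


def knn_predict(train_x, train_y, query, k=3):
    """Label of the majority among the k training vectors nearest to query
    (squared Euclidean distance gives the same ordering as Euclidean)."""
    def sq_dist(x):
        return sum((xi - qi) ** 2 for xi, qi in zip(x, query))

    order = sorted(range(len(train_x)), key=lambda i: sq_dist(train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

In such a sketch, each utterance would be resampled to a common length, flattened into one vector, and passed to `knn_predict` together with labeled training vectors. The choice of `k` and of the target frame count are free parameters here, not values taken from the paper.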