With the development of deep learning, speech recognition has been widely applied in human-computer interaction. Speech recognition relies on audio input, which is susceptible to noise, leading to reduced accuracy and misrecognition. Lipreading relies on the visual modality and is not affected by acoustic noise, so it can complement speech recognition. Fusing lipreading and speech recognition into a multimodal recognizer can therefore improve recognition accuracy. This paper studies a multimodal fusion recognition system that combines lipreading and speech recognition. Multimodal fusion recognition requires a multimodal deep learning model. The deep learning models for lipreading and for speech recognition are studied separately, and each is built as a front-end feature extractor followed by a back-end classifier. After comparing and analyzing candidate fusion methods, feature-level fusion is adopted to combine the lip and speech modalities. The resulting multimodal fusion recognition model must be trained on a dataset. In addition to a public dataset, lipreading and speech data of the experimenter were recorded with a computer camera. The effective part of each recording was extracted by audio analysis, the positions of the face and lips were located with the dlib library, and the lip region was scaled so that the lip videos have the same size as those in the processed dataset. Processing the collected audio and video in this way yields a dataset of automobile air-conditioning commands and a dataset with varying illumination angles. Each dataset contains video of the experimenter's lips and the corresponding audio. To address the drop in lipreading accuracy caused by uneven and dark illumination, an illumination preprocessing method is proposed. The method adopts the idea of normalization: it preprocesses unevenly lit and dark data to obtain uniformly illuminated images. Training the lipreading model on data preprocessed in this way effectively improves its robustness to illumination interference. Taking auxiliary control of automobile air conditioning as the application background, an experimental platform is built and a communication protocol between the upper computer and the lower computer is designed. The upper computer recognizes commands in real time and sends the prediction result to the lower computer according to the communication protocol, so that the upper computer communicates with and controls the lower computer. The system thus achieves real-time acquisition, recognition, transmission, and control based on lip and speech data.
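
The abstract states only that the illumination preprocessing "adopts the idea of normalization" and gives no further detail. As an illustration of that general idea, and not the method proposed in the paper, the following minimal sketch rescales a grayscale lip frame to a fixed mean and spread, so that a dark and a bright recording of the same scene map to the same image. The function name `normalize_illumination`, the target values, and the representation of a frame as rows of floats in [0, 1] are all assumptions made for this example.

```python
from statistics import fmean, pstdev


def normalize_illumination(frame, target_mean=0.5, target_std=0.2, eps=1e-6):
    """Rescale a grayscale frame to a target mean and standard deviation.

    `frame` is a list of rows, each a list of floats in [0, 1].
    All names and default values are hypothetical, chosen for illustration.
    """
    pixels = [p for row in frame for p in row]
    mean = fmean(pixels)
    std = pstdev(pixels)
    # Guard against division by zero on a perfectly flat frame.
    scale = target_std / max(std, eps)
    return [
        # Shift and scale each pixel, then clip back into [0, 1].
        [min(1.0, max(0.0, (p - mean) * scale + target_mean)) for p in row]
        for row in frame
    ]


# The same 2x2 pattern recorded under dark and under bright lighting.
dark = [[0.02, 0.04], [0.06, 0.08]]
bright = [[0.52, 0.54], [0.56, 0.58]]
dark_norm = normalize_illumination(dark)
bright_norm = normalize_illumination(bright)
```

This kind of global normalization only removes a uniform gain and offset. Compensating uneven illumination within a frame would need a local brightness estimate, which this sketch does not attempt.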