| Multi-speaker speech recognition is a speech recognition technology for a special class of scenarios: when several speakers talk at the same time, it separates the speech of each speaker and transcribes it into text. This technology can serve as a novel solution for needs such as conference transcription and written record keeping. There are generally two scenarios for multi-speaker speech recognition. The first is the multi-channel scenario, in which several microphones collect sound at the same time; the task becomes easier here because the multiple channels implicitly carry microphone position information. The second, which is more common in real life, is the single-channel scenario, in which only one microphone collects sound; this greatly increases the difficulty of the task. The goal of this paper is to design a deep-learning-based multi-speaker speech recognition system that works in the single-channel scenario and whose recognition performance is comparable to that of a single-speaker speech recognition system. The paper takes speech separation as its research focus. Its main contents are as follows: (1) Building on recent research results, we implement a single-channel speech separation system based on deep clustering, combine it with a graph convolutional network (GCN), and propose a sliding window algorithm to solve the difficulties encountered during training, thereby improving the performance of the deep clustering system. (2) Building on recent research results, we implement an end-to-end single-channel speech separation system based on permutation invariant training, and use a one-dimensional convolutional network to couple it with the speech separation system described in (1) for training with multiple loss functions, yielding a composite system with excellent separation performance. (3) We implement an end-to-end speech recognition system based on the Transformer and combine it with the speech separation system, finally realizing a complete multi-speaker speech recognition system. The resulting system, which combines separation and recognition, achieves good recognition results when tested on the LibriMix dataset. |
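
The permutation invariant training mentioned in (2) can be illustrated with a minimal sketch. It is not the system described in this paper: the function name `pit_mse_loss`, the use of mean squared error on raw signals, and the NumPy implementation are all assumptions chosen for illustration. The idea shown is only that the loss is evaluated for every assignment of estimated sources to reference speakers, and the smallest value is kept.

```python
from itertools import permutations

import numpy as np


def pit_mse_loss(estimates, targets):
    """Hypothetical helper: permutation-invariant MSE.

    estimates, targets: arrays of shape (num_speakers, num_samples).
    Returns the smallest MSE over all speaker assignments and the
    permutation that achieves it.
    """
    estimates = np.asarray(estimates, dtype=float)
    targets = np.asarray(targets, dtype=float)
    num_speakers = targets.shape[0]

    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_speakers)):
        # perm[i] is the index of the target paired with estimate i.
        loss = float(np.mean((estimates - targets[list(perm)]) ** 2))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Usage: the estimates are the targets in swapped order, so the
# swapped assignment gives zero loss.
targets = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
estimates = targets[::-1]
loss, perm = pit_mse_loss(estimates, targets)
```

Enumerating all permutations costs factorial time in the number of speakers, which is acceptable for the two- or three-speaker mixtures typical of this task.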