| With the widespread adoption and continued expansion of speech assistant devices, full-duplex continuous conversation technology is becoming increasingly important. Non-speech interference rejection is a key component of full-duplex continuous conversation. It determines whether a voice signal is directed at the speech assistant, filtering out environmental noise, interfering audio, and any other non-device-directed speech. This lets users issue commands continuously without repeating the wake-up word, improving the continuous conversation experience between the user and the speech assistant. Academic and industrial research on non-speech interference rejection is currently insufficient, which restricts the promotion and application of full-duplex continuous conversation technology. To address this issue, this work was carried out as a joint project with NIO to help build a complete chain of non-speech interference rejection models from the cloud to the vehicle end. This chain has now been integrated into NIO vehicles, where it has handled over 360 million calls for NIO users and shown industry-leading results in various evaluations. The main research focus of this paper is non-speech interference rejection that integrates the speech and text modalities, covering both cloud and vehicle-end rejection models, as detailed below. A multi-modal cloud rejection model based on DSCLAP and CM-RDrop is constructed. To address the domain mismatch between pre-trained models and the in-vehicle target domain, as well as the lack of inter-modal alignment in pre-training tasks, this paper proposes DSCLAP, a contrastive text-speech joint pre-training model for the in-vehicle domain. DSCLAP is pre-trained with contrastive learning on the audio-ASR text pairs naturally produced by in-vehicle voice assistants, ensuring consistency between the pre-training domain and the target domain and providing aligned modal encoding 
representations. Building on the DSCLAP pre-trained model, this paper proposes a regularization strategy called CM-RDrop, which has two regularization terms: intra-modal RDrop and inter-modal RDrop. It helps the multi-modal rejection model better distinguish the distinctive features of each modality during training while capturing the semantics shared across modal signals. DSCLAP was trained on over 40 TB of in-vehicle audio-ASR text pairs and, with CM-RDrop, achieved a recognition accuracy of 96.88% on an unrelated human voice rejection task, significantly outperforming publicly available contrastive-learning pre-trained models and other optimization methods. A multi-modal rejection model for in-vehicle systems based on chain knowledge distillation is also proposed. Transferring the cloud-based rejection model to in-vehicle systems with end-to-end knowledge distillation faces three challenges: weakened feature extraction ability of the modality encoders, collapse of modality alignment, and poor generalization of the multi-modal rejection model. To overcome them, a chain knowledge distillation strategy is developed. First, independent in-vehicle domain knowledge distillation is designed for each modality to improve the semantic representation ability of each encoder. Then, the distilled end-side encoders are pre-trained with contrastive learning under the DSCLAP architecture to strengthen the alignment of the multi-modal encoding representations in semantic space. Finally, a multi-perspective pseudo-label knowledge distillation algorithm is proposed so that the end-side model learns from tens of millions of pseudo-labeled samples, improving its generalization ability. Experimental results show that the proposed chain knowledge distillation strategy reduces the computational complexity of the cloud-based model by 98.90% while retaining 98.53% of its multi-modal rejection accuracy, indicating strong practical value. The 
cloud-based rejection recognition model and the in-car rejection recognition model studied in this paper were also tested in a real-world online environment. The results show that both models met their production and research goals, providing NIO’s users with a stable irrelevant voice rejection service with a false rejection rate below 2% in continuous conversation scenarios and a pleasant full-duplex continuous conversation experience. Overall, the results of this paper have reference value for practical applications and also provide new ideas and methods for the development of irrelevant voice rejection technology. |
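
The abstract names two training ingredients without giving their formulas: contrastive pre-training on audio-ASR text pairs (DSCLAP) and RDrop-style regularization (CM-RDrop). As a rough illustration only, the sketch below shows the generic forms these techniques usually take: a symmetric InfoNCE loss over paired embeddings, and the symmetric KL consistency term used in R-Drop. The function names, the temperature value, and the toy embeddings are all assumptions made for this example. How DSCLAP and CM-RDrop actually pair distributions within and across modalities is not specified here and may differ.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (audio, text) embeddings.

    Row i of audio_emb is assumed to be paired with row i of text_emb;
    every other row in the batch acts as a negative. The temperature is
    an illustrative default, not a value taken from the paper.
    """
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    # sim[i][j]: temperature-scaled cosine similarity of audio i and text j
    sim = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # audio -> text direction: the matching text is the "correct class"
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # text -> audio direction: same computation on the transposed matrix
    sim_t = [list(col) for col in zip(*sim)]
    t2a = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


def symmetric_kl(logits_p, logits_q):
    """R-Drop-style consistency term: symmetric KL between two predictions."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    kl_pq = sum(math.exp(p) * (p - q) for p, q in zip(lp, lq))
    kl_qp = sum(math.exp(q) * (q - p) for p, q in zip(lp, lq))
    return 0.5 * (kl_pq + kl_qp)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]]))  # aligned pairs
    print(contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
    print(symmetric_kl([2.0, 0.0], [0.0, 2.0]))
```

In this toy setup, correctly paired embeddings give a much lower contrastive loss than swapped ones, and the consistency term is zero when the two predictions agree. In R-Drop as usually described, the two predictions come from two dropout-perturbed forward passes of the same input. An inter-modal variant would presumably compare predictions from different modalities, but that reading is an assumption.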