With the rapid development of endoscopy technology, more and more operations are performed endoscopically, and recording the operation video during surgery is very convenient. However, an untrimmed endoscopic surgery video contains many clips unrelated to the operation itself, such as blur, camera jitter, blood stains on the lens, or lens cleaning. When doctors later use surgical videos for medical records, preoperative education, medical teaching, and academic exchange, these irrelevant segments make the videos cumbersome to use, can lead doctors to obtain inconsistent surgical information, and reduce the efficiency of locating key surgical steps. In addition, marking and editing the surgical video by operation stage is of great significance for monitoring the surgical process, subdividing the surgical procedure, and training junior doctors. Editing operation videos manually after surgery is undoubtedly difficult, so the need for intelligent editing of endoscopic surgery videos has become urgent. In endoscopic surgery videos, the coexistence of continuity and discontinuity, together with large intra-class differences and small inter-class differences, makes research on intelligent surgical video editing challenging. To address these problems, the main research work of this thesis is as follows:

(1) To remove invalid clips from endoscopic surgery videos, a Multi-granular Hierarchical semantic analysis Network (MHN) is proposed. In the coarse-grained module, ResNet-50 is used as the backbone network and an attention mechanism is added so that the network automatically selects informative spatial features of the endoscopic video; an LSTM network then extracts the temporal features of the video, completing the spatio-temporal analysis of the surgical video (a minimal architectural sketch is given after (2) below). In the fine-grained module, a self-correction module is proposed which, starting from the coarse-grained results, iteratively refines the boundaries of the valid surgical video to make the editing more accurate. Experimental results show that MHN performs well in both accuracy and efficiency: on the nasal endoscopic surgery video dataset, the accuracy reaches 89%, which is 8% higher than other popular networks.

(2) To address the relatively low accuracy of MHN on blurred video clips, an intelligent editing method based on hard frame detection, HFD-ConvLSTM (Hard Frame Detection method using a Convolutional LSTM network), is proposed. The core idea is to transform a three-class problem (Clear, Background, Fuzzy) into two binary classification problems solved in two stages (Background vs. Non-Background, then Clear vs. Fuzzy); a sketch of this cascade follows below. First, a new separator based on the coarse-grained classifier is defined to remove invalid frames, and hard frames are detected by measuring the blurring score of each video frame. Then, squeeze-and-excitation is used to select informative spatio-temporal features of the endoscopic videos, and the video frames are further classified with a fine-grained ConvLSTM trained on a training set reconstructed with the hard frames. Experiments are performed on both hard frame detection and video frame classification: nearly 88.3% of fuzzy frames are detected, the classification accuracy is boosted to 95.2%, and HFD-ConvLSTM achieves superior performance compared with other methods.
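To make the coarse-grained module of (1) concrete, the following is a minimal sketch assuming a PyTorch implementation: a ResNet-50 backbone, a simple 1x1-convolution spatial attention gate, and an LSTM over per-frame features. The class name, the specific form of the attention, and all hyperparameters are illustrative assumptions rather than the thesis implementation.

```python
# Minimal sketch of a coarse-grained spatio-temporal module in the spirit of MHN:
# ResNet-50 trunk -> spatial attention -> LSTM over frame features.
# All names and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torchvision.models as models


class CoarseGrainedMHN(nn.Module):
    def __init__(self, num_classes: int = 2, hidden_size: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Keep the convolutional trunk, drop the average pool and FC head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # (B*T, 2048, 7, 7)
        # Simple spatial attention: a 1x1 conv produces a per-location weight map.
        self.attention = nn.Sequential(nn.Conv2d(2048, 1, kernel_size=1), nn.Sigmoid())
        self.pool = nn.AdaptiveAvgPool2d(1)
        # The LSTM models the temporal relationship between frame-level features.
        self.lstm = nn.LSTM(2048, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, 224, 224) -- a short sequence of video frames.
        b, t = clip.shape[:2]
        x = self.features(clip.flatten(0, 1))            # (B*T, 2048, 7, 7)
        x = x * self.attention(x)                        # reweight spatial locations
        x = self.pool(x).flatten(1).reshape(b, t, -1)    # (B, T, 2048)
        out, _ = self.lstm(x)                            # (B, T, hidden)
        return self.classifier(out[:, -1])               # per-clip logits


if __name__ == "__main__":
    logits = CoarseGrainedMHN()(torch.randn(2, 8, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 2])
```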
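The two-stage decision in (2) can be sketched as follows. The variance-of-Laplacian blur score, the threshold, and the classifier callables are assumptions made for illustration; the thesis defines its own separator, blurring score, and SE-ConvLSTM fine-grained classifier.

```python
# Illustrative sketch of contribution (2): a two-stage decision that replaces a
# single three-way (Clear / Background / Fuzzy) classifier, plus a simple blur
# score used to flag "hard" frames when rebuilding the fine-grained training set.
from typing import Callable

import cv2
import numpy as np


def blur_score(frame_bgr: np.ndarray) -> float:
    """Higher = sharper; variance of the Laplacian of the grayscale frame (assumed metric)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())


def is_hard_frame(frame_bgr: np.ndarray, threshold: float = 60.0) -> bool:
    """Hard (fuzzy-looking) frames are re-injected into the fine-grained training set."""
    return blur_score(frame_bgr) < threshold


def classify_frame(
    frame: np.ndarray,
    is_background: Callable[[np.ndarray], bool],  # stage 1: separator from the coarse classifier
    is_clear: Callable[[np.ndarray], bool],       # stage 2: fine-grained ConvLSTM classifier
) -> str:
    """Background vs. Non-Background first, then Clear vs. Fuzzy on the remainder."""
    if is_background(frame):
        return "Background"
    return "Clear" if is_clear(frame) else "Fuzzy"
```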
(3) For surgical phase recognition, a method based on the Vision Transformer (ViT) network is developed on the laparoscopic surgery video dataset Cholec80. A single image is divided into patches and serialized according to fixed rules to establish a sequence relationship within the image. The core idea is to build an Encoder network by stacking multiple identical multi-head self-attention modules and to use residual connections to prevent training degradation (a sketch of one encoder block follows below). Experimental results show that the average classification accuracy over the 32 test videos reaches 80.6% with a standard deviation of 8%, improving accuracy while maintaining the stability of the model. Compared with a network using only a CNN structure, the average classification accuracy is increased by 4.9%; compared with a network combining CNN and RNN, it is increased by 1.6%.
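The encoder described in (3) stacks identical blocks of multi-head self-attention and an MLP, each wrapped in a residual connection with layer normalization. Below is a minimal sketch of one such block in PyTorch; the dimensions, depth, and patch/token counts are illustrative assumptions, not the thesis configuration.

```python
# Minimal sketch of one Transformer encoder block of the kind stacked in the
# ViT-based phase recognition model. Dimensions and depth are assumptions.
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connections prevent degradation when many blocks are stacked.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


if __name__ == "__main__":
    # A 224x224 image cut into 16x16 patches gives 196 patch tokens (+1 class token).
    tokens = torch.randn(2, 197, 768)
    encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
    print(encoder(tokens).shape)  # torch.Size([2, 197, 768])
```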