| Continuous improvements in digital slide scanning technology and high-performance computing hardware have promoted the widespread application of deep learning methods in histopathology. Existing work is mainly based on the multiple instance learning framework and uses convolutional neural networks to analyze pathological slides. However, these models perform poorly on pathological slides for two main reasons: 1) the large number of parameters in convolutional neural networks; 2) the label noise introduced by using whole slide image annotations in multiple instance learning. To build an efficient, lightweight model that classifies pathological images while reducing the interference of noisy labels, this paper proposes an end-to-end multiple instance learning framework that integrates multi-scale information, as follows. (1) A lightweight pyramid framework for efficient feature fusion is proposed. As a commonly used deep learning method in histopathology, convolutional neural networks usually face two problems: 1) methods based on convolutional neural networks achieve high accuracy, but at the cost of more model parameters and higher computational complexity; 2) it is difficult to balance model accuracy against computational cost, that is, to keep the model lightweight while maintaining or improving its classification accuracy as much as possible. This paper proposes a novel lightweight multiple instance learning model based on the Vision Transformer (ViT), which addresses these problems by combining multiple instance learning with multiple receptive fields. Specifically, we first introduce the Tokens-to-Token Vision Transformer (T2T-ViT) in place of a convolutional neural network as the feature extractor, which reduces the number of model parameters. We then improve model performance by incorporating image pyramids with multiple receptive fields, which capture both local and global features of cellular structure. Experimental results show that our model greatly reduces the number of parameters and the computational complexity, and that its classification performance is significantly better than that of convolutional neural network methods. (2) A multi-scale end-to-end architecture based on feature loss is proposed. As a weakly supervised learning method commonly used in histopathology, multiple instance learning usually faces two problems: 1) it assumes that every instance in a slide image with a positive label is positive, which is not entirely true, so noisy labels interfere with training; 2) single-scale data does not allow image information to be considered from multiple perspectives. In this paper, we improve on the Pyramid Tokens-to-Token Vision Transformer proposed in the previous part and propose an end-to-end model that fuses multi-scale instance information. This model addresses the above problems by integrating feature loss and fusing multi-scale instance information. Specifically, we first build an end-to-end model that incorporates both slide-level and instance-level information to reduce the contamination caused by noisy labels. Then, to obtain richer instance-level information, the instance-level losses at the corresponding scales of the multi-receptive-field pyramid are summed to form the overall instance-level loss. Experimental results indicate that our model classifies significantly better than slide-level or instance-level approaches. |
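
The loss combination described in part (2) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary cross-entropy for both the slide-level and instance-level terms, assumes that each instance inherits its slide's label, and assumes the two terms are combined by a weighted sum. The function names, the `instance_weight` parameter, and the scale keys `"10x"` and `"20x"` are hypothetical and do not come from the paper.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def instance_loss_at_scale(instance_probs, slide_label):
    """Mean instance-level loss at one scale.

    Assumption: every instance inherits the slide label, which is the
    noisy-label setting the abstract describes.
    """
    losses = [binary_cross_entropy(p, slide_label) for p in instance_probs]
    return sum(losses) / len(losses)


def total_loss(slide_prob, instance_probs_by_scale, slide_label,
               instance_weight=1.0):
    """Slide-level loss plus the sum of instance-level losses over all scales.

    `instance_weight` is a hypothetical balancing factor, not a value
    taken from the paper.
    """
    slide_term = binary_cross_entropy(slide_prob, slide_label)
    instance_term = sum(
        instance_loss_at_scale(probs, slide_label)
        for probs in instance_probs_by_scale.values()
    )
    return slide_term + instance_weight * instance_term


# Usage with made-up probabilities for two hypothetical pyramid scales.
example = total_loss(
    slide_prob=0.8,
    instance_probs_by_scale={"10x": [0.9, 0.6], "20x": [0.7, 0.4, 0.8]},
    slide_label=1,
)
```

Summing the per-scale instance terms means every receptive field contributes its own supervision signal, while the slide-level term keeps the prediction tied to the only label that is actually annotated.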