| Biomacromolecules must reside in specific subcellular locations to function correctly. Studying the subcellular localization of biomacromolecules is therefore of great significance for revealing their functions and interactions in cells, and for understanding the mechanisms of disease occurrence and development. However, in the face of massive biological data, traditional wet-lab experimental methods are time-consuming and costly. There is thus an urgent need for predictors that can accurately identify the subcellular locations of biomacromolecules, improving research efficiency and saving research resources. Proteins and RNAs are important biomacromolecules that are widely distributed in cells. This study focuses on proteins and RNAs, explores the relationship between protein or RNA sequence distribution and subcellular location, and uses deep learning to build subcellular localization prediction models. The main work is as follows: (1) To address the insufficient extraction of chloroplast protein sequence features and the scarcity of labeled samples, a multi-location protein subchloroplast localization predictor, DaDL-SChlo, was proposed. First, the predictor characterized the latent information of chloroplast proteins with a protein language model to obtain deep learning features from the sequences. At the same time, to characterize the entire dataset more fully, traditional handcrafted features were introduced to capture the structural evolution information of chloroplast proteins, and a fused feature set was constructed. To speed up model prediction, the more discriminative features were selected from the fused feature set using extreme gradient boosting trees. A Wasserstein generative adversarial network with gradient penalty was then trained to generate high-quality pseudo-feature samples for data augmentation. Finally, a subchloroplast multi-location prediction model was constructed with a hybrid neural network 
combining a convolutional neural network and a Transformer. Experimental results show that the DaDL-SChlo predictor outperforms current state-of-the-art protein subchloroplast prediction models in both cross-validation and independent tests. (2) To address the low prediction accuracy caused by complex label relationships and insufficient feature extraction from mRNA sequences, a multi-location mRNA subcellular localization predictor, DRpred, was proposed. The predictor first divided the mRNA sequences into subsequences with a sliding window, then fed the subsequences into the word embedding model Word2vec and encoded them into feature vectors using the Skip-gram method. Bayesian networks were then used to model the relationships between the labels of different subcellular locations, capturing the dependencies among labels. Finally, these inter-label dependencies were treated as prior knowledge, concatenated with the feature vectors, and fed into a bidirectional long short-term memory network combined with an attention mechanism to classify and predict mRNA subcellular locations. Independent tests show that the DRpred predictor achieves good predictive performance in both multi-location and single-location subcellular prediction, surpassing the best existing mRNA prediction models. |
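
The sliding-window step described for DRpred can be illustrated with a minimal sketch. The abstract does not state the window length or stride, so the function name `sliding_window_kmers` and the defaults `k=3` and `stride=1` are assumptions chosen only for illustration, not the authors' settings. The resulting list of subsequences is the kind of token "sentence" that a Word2vec Skip-gram model would take as input.

```python
from typing import List


def sliding_window_kmers(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a sequence into overlapping subsequences of length k.

    k and stride are illustrative defaults (assumptions), not values
    taken from the source.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # A window starting at i covers seq[i:i + k]; stop once a full window no longer fits.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical toy mRNA fragment, used only to show the shape of the output.
tokens = sliding_window_kmers("AUGGCUA", k=3)
print(tokens)
```

A sequence shorter than the window yields an empty list, so callers would need to decide how to handle very short transcripts before embedding.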