| Proteins, which are chains of amino acids, are basic building blocks of cells. The habitat stability of an amino acid sequence refers to its ability to maintain biological activity under specific environmental conditions. Analyzing the habitat stability of amino acid sequences can assist in synthesizing proteins that meet specific environmental requirements and is therefore of scientific importance. The habitat stability of an amino acid sequence is closely related to its structure. Analysis of habitat stability based on protein spatial structure reduces labor and time costs compared with traditional biological methods, but acquiring structural information requires biological experiments such as X-ray crystallography and nuclear magnetic resonance, which are slow and expensive. Using sequence information alone for habitat stability analysis has therefore become an important research direction. Existing amino acid sequence analysis methods fall mainly into two categories. Feature engineering methods, which rely on expert knowledge, can integrate known biological features into the modeling process but have difficulty expressing complex implicit associations between amino acids. Implicit vector methods for amino acid fragments can address this limitation, but the existing practice of dividing an amino acid sequence into equal-length pieces may break up the actual stable amino acid fragments and the long-distance amino acid associations, such as alpha helices and beta sheets, which degrades the subsequent habitat stability analysis. In response to these challenges, this paper studies hidden vector learning and habitat stability prediction based on amino acid sequences. The main work is as follows: 1. For the problem of the diversity of amino acid fragments, we propose a statistics-based algorithm for segmenting amino acid sequences. There are many differences in the actual length and combination of stable amino
acid fragments, but there are also regularities. These regularities are embedded in many protein entities, and the statistics of the amino acid fragments in a dataset objectively reflect them. Based on the international NCBI dataset, this paper builds a statistics-based dictionary of amino acid fragments and segments amino acid sequences by maximum a posteriori probability, so that the segmentation results reflect these regularities. 2. For the problem of complex associations between amino acid fragments, a hidden space vector method is proposed. The diversity of the spatial structure of amino acid fragments reflects both explicit direct chemical relationships and implicit long-distance biological characteristics. In this paper, the hidden space vector method is used to model these complex associations. Based on the spatial proximity of conserved domains in the amino acid sequence, the semantic distance between vectors is modeled and representations of amino acid fragments are learned. 3. For the problem of habitat stability prediction, we propose a neural network model with an attention mechanism. The model captures long-distance dependencies between amino acid fragments with recurrent neural networks and learns correlations in different dimensions with convolutional neural networks. For the high-temperature and high-pressure environment targets, joint learning is used to share the underlying semantic information. On the NCBI dataset, model performance is evaluated in four respects: the effect of different parameter settings, the impact of the fine-tuning mechanism, a comparison with related work, and the contribution of each model component. |
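
The dictionary-based segmentation described in item 1 can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a simple unigram model in which a segmentation's score is the sum of the log-probabilities of its fragments, and it finds the highest-scoring segmentation by dynamic programming. The function names, the `max_len` limit, and the `unknown` penalty are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def build_fragment_dictionary(sequences, max_len=4):
    """Count every substring of length 1..max_len and return log relative frequencies.

    Assumption: raw substring frequency stands in for the fragment statistics;
    the paper's actual dictionary construction is not specified here.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(1, max_len + 1):
                if i + k <= len(seq):
                    counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {frag: math.log(c / total) for frag, c in counts.items()}


def segment(seq, log_prob, max_len=4, unknown=-20.0):
    """Return the segmentation of seq with the highest total log-probability.

    best[end] holds the best score of any segmentation of seq[:end];
    back[end] holds the start index of the last fragment in that segmentation.
    Fragments missing from the dictionary receive the (hypothetical) penalty `unknown`.
    """
    n = len(seq)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + log_prob.get(seq[start:end], unknown)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the sequence to recover the fragments.
    fragments = []
    end = n
    while end > 0:
        start = back[end]
        fragments.append(seq[start:end])
        end = start
    return fragments[::-1]
```

Calling `build_fragment_dictionary` on a list of sequences and passing the result to `segment` yields variable-length fragments rather than equal-length pieces, which is the property the abstract motivates. A real system would likely prune the dictionary and smooth the probabilities.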