| Misfolded amyloid proteins aggregate within the cell into insoluble, β-sheet-rich amyloid fibrils, which impede the normal function of neurons and lead to a variety of neurological disorders. Predicting amyloid proteins can provide a scientific basis for exploring the pathogenesis of related neurological diseases, developing targeted therapies and establishing effective prognostic protocols. Limited by experimental data, early amyloid prediction studies focused on predicting amyloidogenic regions and hot spots, and protein sequence-based amyloid prediction models have been lacking. The large difference between the numbers of validated amyloids and unvalidated proteins results in a widespread data imbalance problem. Traditional learning algorithms that optimize overall accuracy tend to assign test samples to the majority class. Improving the prediction accuracy for minority-class samples and reducing the significant loss caused by misclassification is therefore the core scientific problem that urgently needs to be solved. The commonly used oversampling algorithm SMOTE is prone to overlap between classes and to distribution marginalization: while it reduces data imbalance, it gradually blurs the boundary between positive and negative samples and makes classification harder. To address the limitations of SMOTE, this thesis proposes a new improved SMOTE method based on a density clustering algorithm and constructs amyloid prediction models that combine multi-source feature fusion and machine learning. The main research content and innovations are as follows: 1) To address the problem that existing models extract a single type of feature and use a single classification algorithm, a new amyloid prediction model, ECAmyloid, is constructed based on ensemble learning. In this model, protein sequence features are extracted by Pseudo Position-Specific Scoring Matrix, Split Amino Acid Composition, Solvent Accessibility and
Secondary Structure Information. The optimal feature subset is then obtained by the correlation-based feature subset selection method, and individual classifiers are selected by an incremental classifier selection strategy. The accuracy of ECAmyloid under cross-validation, on the independent test dataset and on the test dataset of existing models is better than that of the existing models in all three cases. Furthermore, the excellent performance of ECAmyloid on the test dataset of existing models demonstrates the strong generalization ability of the model. 2) To eliminate noisy samples and avoid losing boundary sample information, a new oversampling algorithm, OPTICS-SMOTE, is proposed based on the density clustering algorithm and the SMOTE algorithm, drawing on the core idea of the Borderline-SMOTE algorithm. The algorithm uses the clustering algorithm OPTICS to cluster the minority samples and divides the boundary samples in each cluster into dangerous and safe samples based on the number of majority samples in their neighborhood. The number of new samples to be generated for each cluster is derived from the cluster density distribution function. The dangerous samples are then selected for oversampling. Comparison experiments with other oversampling algorithms on commonly used imbalanced datasets show that the algorithm has a strong capability for handling imbalanced data. 3) A new amyloid protein prediction model, Amy_Fuse, is constructed based on the OPTICS-SMOTE algorithm and multi-view features. The model achieves better prediction performance by processing the imbalanced dataset with the OPTICS-SMOTE algorithm. Sequence features favorable for amyloid prediction are combined with the probabilistic features and class features provided by the baseline model in feature representation learning to yield multi-view features that fully exploit protein sequence information. On the independent test dataset, the prediction performance of Amy_Fuse is better than that of other existing models
except ECAmyloid. Compared with ECAmyloid, the model improves the identification of positive samples, which indicates that the impact of the class imbalance problem is somewhat reduced. Moreover, a comparative analysis of the predictive performance of probabilistic features, class features and combined features demonstrates the validity of the multi-view features. |
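
The danger/safe split and the oversampling of dangerous samples described for OPTICS-SMOTE can be illustrated with a minimal sketch. It is not the thesis implementation: the OPTICS clustering step and the cluster density distribution function are omitted because the abstract does not specify them, so the sketch treats all minority samples as one cluster. The function names `label_minority` and `smote_from_seeds`, the parameter `k`, and the thresholds (noise when all `k` neighbors are majority samples, danger when at least half are) are assumptions borrowed from the usual Borderline-SMOTE rule.

```python
import math
import random


def label_minority(minority, majority, k=3):
    """Label each minority sample as 'noise', 'danger' or 'safe'.

    Assumed Borderline-SMOTE rule: look at the k nearest neighbors among
    all other samples and count how many belong to the majority class.
    """
    labels = []
    for i, p in enumerate(minority):
        # (distance, is_majority) for every other sample
        pool = [(math.dist(p, q), 0) for j, q in enumerate(minority) if j != i]
        pool += [(math.dist(p, q), 1) for q in majority]
        pool.sort(key=lambda t: t[0])
        n_major = sum(flag for _, flag in pool[:k])
        if n_major == k:
            labels.append("noise")    # surrounded by majority samples
        elif 2 * n_major >= k:
            labels.append("danger")   # close to the class boundary
        else:
            labels.append("safe")
    return labels


def smote_from_seeds(minority, seed_idx, n_new, k=2, rng=None):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    Each new point lies on the segment between a randomly chosen seed
    (for example a 'danger' sample) and one of its k nearest minority
    neighbors.
    """
    if not seed_idx:
        return []
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(seed_idx)
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(minority[i], minority[j]))
        j = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic
```

In a fuller version, `label_minority` would presumably be applied within each OPTICS cluster, and `n_new` would be set per cluster from the density function that the thesis defines.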