| The cell is the basic building block of all living organisms. An adult human body comprises approximately 37 trillion cells, and each cell contains around 10^9 protein molecules that perform a wide variety of functions in many different organelles or compartments. Each cell can grow and reproduce self-sufficiently, whereas the organelles or compartments within the cell are specialized to perform different functions. Thus, a protein's subcellular location is often closely associated with its biological function. To function normally, a protein needs to interact with its corresponding partner molecules at the right location and at the right time. Mislocalization of a protein can disrupt these biological functions, which may lead to serious diseases such as blindness, diabetes, and even cancer. The accurate identification of protein subcellular localization can provide important clues for understanding the mechanisms of biomolecular interaction and for drug discovery. Because of the explosive growth in newly discovered proteins, theoretical techniques and wet-lab experiments and technologies are costly, time-consuming, and resource-intensive. Therefore, the demand for effective, fast, and intelligent automatic tools based on computational intelligence to identify the subcellular localization of uncharacterized proteins is growing day by day. The aim of this dissertation is to characterize large-scale image-based protein subcellular localization using intelligent computational models. Numerous computational methods have been proposed to predict the subcellular location of proteins. However, most existing methods are limited in overall accuracy and generalization power. The reasons are that the majority of existing methods use few feature extraction schemes, which might not be able to capture all the unique distributions in bioimages, together with simple feature selection algorithms, only a serial combination method, and
conventional classifiers. To address these problems, we design three pipelines, named PScL-HDeep, PScL-DDCFPred, and PScL-2LSAESM, for accurate and efficient image-based prediction of protein subcellular location in human tissues, as follows. In PScL-HDeep, we focused on collecting updated datasets, learning multi-view features, and optimizing these learned features via our proposed two-layer feature selection method. Specifically, we extracted different handcrafted and deep-learned features (the latter by employing a pretrained deep learning model) from different viewpoints of the image. The stepwise discriminant analysis (SDA) algorithm is applied to generate the optimal feature set from each original raw feature set. To obtain a still more informative feature subset, the support vector machine based recursive feature elimination with correlation bias reduction (SVM-RFE+CBR) feature selection algorithm is applied to the integrated feature set. Finally, the classification models, namely a support vector machine with radial basis function kernel (SVM-RBF) and a support vector machine with linear kernel (SVM-LNR), are learned on the final selected feature set. To evaluate the performance of the proposed method, a new gold-standard benchmark training dataset is constructed from the Human Protein Atlas (HPA) databank. PScL-HDeep achieved the highest performance in a 10-fold cross-validation test on this dataset and proved more effective than existing predictors. Furthermore, we illustrated the generalization ability of the proposed method by conducting a stringent independent validation test. In the next project on protein subcellular localization prediction, along with the collection of large-scale data, we aimed to develop more robust feature optimization and multiclass classification algorithms. In particular, PScL-DDCFPred first extracts multiview image features, including global and local features, as base or pure features. Next, it applies a new integrated feature selection method based on stepwise discriminant analysis and generalized
discriminant analysis to identify the optimal feature sets from the extracted pure features. Finally, a classifier based on a deep neural network (DNN) and a deep-cascade forest (DCF) is constructed. Stringent ten-fold cross-validation tests on the new protein subcellular localization training dataset, constructed from the Human Protein Atlas databank, illustrate that the developed PScL-DDCFPred method achieves better predictive performance than several existing state-of-the-art methods. The independent test set further illustrates the generalization capability and superiority of PScL-DDCFPred over existing predictors. In-depth analysis shows that the excellent performance of PScL-DDCFPred can be attributed to three critical factors, namely the effective combination of the DNN and DCF models, the complementarity of the global and local features, and the use of the optimal feature sets selected by the integrated feature selection algorithm. A further common problem in the design and development of in silico methods is how to make effective use of the heterogeneous feature sets extracted from bioimages. Little effort has been devoted to this problem. Therefore, in this research, we boost the efficiency of integrating these heterogeneous feature sets by developing a new two-level stacked autoencoder network (2L-SAE-SM). In particular, in the first level of 2L-SAE-SM, each optimal heterogeneous feature set is used to train our designed stacked autoencoder network (SAE-SM). All the trained SAE-SMs in the first level output decision sets based on their respective optimal heterogeneous feature sets, which are known as “intermediate decision” sets. These “intermediate decision” sets are then ensembled by the mean ensemble (ME) method to generate the “intermediate feature” set for the second-level SAE-SM. Based on the proposed framework, we developed a predictor, named PScL-2LSAESM, to characterize image-based protein subcellular localization. Our experimental results on the latest benchmark training and independent
datasets collected from the Human Protein Atlas databank indicate the effectiveness of the proposed 2L-SAE-SM for integrating heterogeneous feature sets. In addition, a detailed comparison of the proposed PScL-2LSAESM against current state-of-the-art protein subcellular localization methods demonstrates that PScL-2LSAESM competes with and outperforms the existing state-of-the-art methods. |
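
The mean ensemble (ME) step that combines the first-level “intermediate decision” sets can be sketched as follows. This is a minimal illustration and not the dissertation's implementation: it assumes that each first-level SAE-SM outputs one per-class score vector per sample and that ME is a plain arithmetic mean over the models. The function name `mean_ensemble` and the toy scores are hypothetical.

```python
from typing import List, Sequence


def mean_ensemble(
    decision_sets: Sequence[Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Average per-class scores across first-level models.

    decision_sets[m][i][c] is the score that model m assigns to class c
    for sample i. Returns one averaged score vector per sample.
    (Assumed interpretation of the mean ensemble; names are hypothetical.)
    """
    if not decision_sets:
        raise ValueError("need at least one decision set")
    n_models = len(decision_sets)
    n_samples = len(decision_sets[0])
    n_classes = len(decision_sets[0][0]) if n_samples else 0
    # Every model must score the same samples over the same classes.
    for ds in decision_sets:
        if len(ds) != n_samples or any(len(row) != n_classes for row in ds):
            raise ValueError("all decision sets must share the same shape")
    return [
        [sum(ds[i][c] for ds in decision_sets) / n_models for c in range(n_classes)]
        for i in range(n_samples)
    ]


# Toy example: three hypothetical first-level models, two samples, three classes.
decisions = [
    [[0.75, 0.25, 0.0], [0.0, 0.5, 0.5]],
    [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5]],
    [[0.25, 0.75, 0.0], [0.5, 0.0, 0.5]],
]
intermediate_features = mean_ensemble(decisions)
```

Under these assumptions, `intermediate_features` holds one averaged vector per sample, which would then serve as the input to a second-level model.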