The rapid development of proteomics has provided unprecedented opportunities for the discovery of cancer biomarkers. To address the high dimensionality and sparsity of mass spectrometry data, this study established a protein biomarker discovery workflow based on data mining, which we named PBMiner. The workflow consists of four modules: data preprocessing, feature selection, modeling and classifier assessment, and final lockdown of the biomarker panel. We applied PBMiner to a mass spectrometry dataset of diffuse large B cell lymphoma (DLBCL) and identified a diagnostic panel of 6 proteins (PALD1, TBC1D4, TNFAIP8, CMAS, MME and PTPN1), which we used to construct a random forest model and a non-linear SVM model. The area under the receiver operating characteristic (ROC) curve (AUC) was equal to 1 on both the training set and the testing set, completely distinguishing the two subtypes of DLBCL. We also applied PBMiner to a recent mass spectrometry dataset of lung adenocarcinoma (LUAD) and identified a diagnostic panel of 19 proteins (ABCF1, LAMC1, SRP72, AGER, etc.). The random forest model constructed from this panel had an AUC of 1 on the training set and 0.99 on the testing set, and distinguished tumor from para-tumor tissue with at least 97.5% accuracy. In conclusion, PBMiner provides a fast and effective pipeline for exploring protein markers for diagnosis and molecular stratification.
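To make the AUC figures reported above concrete, the following is a minimal sketch of how an AUC can be computed from classifier scores. It is not taken from PBMiner: the function name `roc_auc` and the toy labels and scores are hypothetical, and the pairwise (Mann-Whitney) formulation is only one standard way to obtain the area under the ROC curve.

```python
def roc_auc(labels, scores):
    """Return the AUC as the fraction of (positive, negative) pairs in which
    the positive sample receives the higher score; ties count as half.

    Hypothetical helper for illustration, not part of PBMiner.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented): 1 = one class, 0 = the other.
labels = [1, 1, 1, 0, 0, 0]
perfect_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # classes fully separated
overlap_scores = [0.9, 0.8, 0.25, 0.3, 0.2, 0.1]  # one positive scored below a negative

auc_perfect = roc_auc(labels, perfect_scores)
auc_overlap = roc_auc(labels, overlap_scores)
print(auc_perfect, auc_overlap)
```

An AUC of 1 means every sample of one class is scored above every sample of the other, which is the sense in which a panel distinguishes two groups completely; any overlap in scores lowers the value below 1.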