| Cancer is a major public health problem worldwide. Its morbidity and mortality are increasing year by year, and treatment outcomes remain poor. It is among the common, frequently occurring diseases that do the greatest harm to human health. Cancer can be cured; the key is the “three early” principle. Extensive clinical practice at home and abroad has shown that some cancers can be cured through early detection, early diagnosis and early treatment, whereas once the disease reaches a late stage, modern medicine cannot cure it. Cancer care should therefore follow a prevention-based policy, achieving early detection and early diagnosis to provide a reliable basis for treatment, which is an important factor in reducing the death rate. In recent years, advances in Next-Generation Sequencing (NGS) technology and the emergence of large volumes of cancer-related genomic and transcriptomic data have laid a foundation for the diagnosis and treatment of cancer. Machine learning provides opportunities for automated cancer diagnosis, discovery and detection of cancer-specific biomarkers, and optimization of overall treatment strategies from NGS data. Cell-free DNA (cfDNA), as an important source of material for liquid biopsy, has important clinical application value. Therefore, based on omics data, cfDNA sequencing and machine learning, this study explored new markers and technologies for the non-invasive diagnosis of cancer. The main research contents are as follows: 1. Multi-label learning for the diagnosis of cancer and identification of novel biomarkers. In this study, we used high-throughput omics data and machine learning models to classify normal and digestive cancer samples in order to find potential biomarkers for effective diagnosis and prognosis. We downloaded large-scale RNA expression data for normal tissue samples and digestive cancer tissue samples from The Cancer Genome
Atlas (TCGA) database, and used the Mean Decrease in Accuracy (MDA) and Mean Decrease in Gini (MDG) to identify novel biomarkers. Genes dysregulated early in digestive cancers were selected to produce the final biomarker sets (biosignatures). Two-layer multi-label classifiers were then built and validated for the two biosignatures. The first layer identifies whether a sample is cancerous, and the second layer addresses the multi-type classification problem by identifying the cancer type. Different classification models were compared, and the results were discussed and analyzed. All models performed satisfactorily, with Multilabel-RF performing best: the Multilabel-RF-based model achieved an accuracy of 83.12%, with precision, recall, F1 and Hamming loss of 79.70%, 68.31%, 0.7357 and 0.1688, respectively. To our knowledge, this is the first report of cancer diagnosis prediction based on multi-label learning. The proposed biomarker signatures were highly associated with multiple hallmarks of cancer. Functional enrichment analysis and the impact of the candidate biomarkers on patient prognosis were also examined, revealing that the identified biomarkers (genes) have potential clinical application value in digestive cancers. Our analysis of gene expression patterns introduced and validated novel biomarkers using statistical learning. The newly proposed biomarkers showed strong classification power and might be applied to support diagnostic decision-making in clinical trials, which could improve the management of digestive cancer patients. 2. Method and application of analyzing individual physiological state based on cfDNA high-throughput sequencing. The detection and analysis of plasma cfDNA by high-throughput sequencing provides new opportunities for the non-invasive diagnosis of cancer, especially early diagnosis. However, these non-invasive cancer detection and analysis technologies for cfDNA still have
shortcomings, such as tedious DNA fragment selection and enrichment before sequencing, enrichment and separation of methylated DNA fragments, and the search for and identification of preferred fragment end coordinates, so great obstacles to practical application remain. To address these problems, this study provides a method for analyzing the source of cfDNA based on high-throughput sequencing of plasma cfDNA. The method simplifies cfDNA detection and analysis: it requires no preprocessing of cfDNA before sequencing, such as fragment selection or methylation enrichment, and no complex bioinformatics modeling or end-coordinate search after sequencing. The method detects the genome-wide distribution characteristics of cfDNA around transcription start sites (TSS), which can be used to judge the physiological state of the individual from whom the cfDNA derives. NGS analysis of cfDNA from esophageal cancer (ESCA) patients and normal individuals showed that the genomic distribution of cfDNA differed between the two sources, especially around TSS: normal individuals had an abundant read distribution, whereas ESCA patients had a significantly decreased read distribution. This difference provides a basis for judging the individual physiological state reflected by cfDNA and offers a new approach to the non-invasive detection and analysis of that state. Analysis of promoter regions across the whole genome identified 49 ESCA-related genes, which can be used as biomarkers for cancer diagnosis and prognosis. 3. Identification of cancer diagnostic/prognostic markers from genome-wide chromatin accessibility based on cfDNA. In this study, we used adapted SALP-seq (Single strand adaptor library preparation-sequencing) combined with machine learning to search for ESCA epigenetic and genetic biomarkers in cfDNA. SALP-seq developed in our
laboratory is a new single-stranded DNA library preparation technology. As a single-stranded DNA library preparation and sequencing technique, SALP-seq is particularly suited to constructing NGS libraries from highly degraded DNA samples such as cfDNA. Moreover, by using barcode T adaptors, the technique can analyze many cfDNA samples in a high-throughput format. Combining SALP-seq and machine learning, we analyzed cfDNA samples from ESCA patients and normal participants, identified epigenetic and genetic biomarkers of ESCA, and extended cfDNA detection to the chromatin openness state. NGS libraries of 20 cfDNA samples, from 11 pre-operation ESCA patients, 5 post-operation ESCA patients and 4 healthy people, were constructed using SALP-seq. Based on bioinformatic analysis of the sequencing data, we identified 54 ESCA epigenetic markers and 37 ESCA genetic markers, which may ultimately contribute to the development of effective diagnostic and therapeutic approaches for ESCA. These markers were further verified by analyzing 10 new cfDNA samples from pre-operation ESCA patients. The biomarkers also offer new insights into the potential regulatory and molecular mechanisms of ESCA tumorigenesis. 4. Comparative analysis of chromatin accessibility between cancer tissue and cell-free DNA. In this study, we used SALP-seq, developed by our laboratory, to construct and sequence NGS libraries from tissue and cfDNA samples, including 10 esophageal cancer tissues, 3 matched adjacent tissues and 10 matched whole blood samples. This work further illustrates that cfDNA can be used to detect the chromatin openness state. By analyzing the signal strength of the read distribution around peak summits identified in paired cancer tissues, we found that the signal formed a peak in cancer tissue and a valley in matched cfDNA. We also analyzed the signal strength of the read distribution around peak
summits identified from cfDNA; the results showed a peak in cfDNA, a valley in matched cancer tissue, and a peak in matched adjacent tissue. Moreover, we visually compared the peaks of cancer tissue, matched cfDNA and ESCA_ATAC (results of ESCA tissue-based ATAC-seq downloaded from a public database) in the UCSC Genome Browser to further verify the reliability of our results. By mutation analysis of paired cancer tissue, paracancer tissue and cfDNA, this study identified 17 ESCA-related mutation sites, illustrating that cfDNA can be a reliable source for cancer mutation analysis, consistent with literature reports. We annotated these loci to obtain 22 target genes, 4 of which are also among the MSK-IMPACT panel genes (a U.S. Food and Drug Administration-approved gene panel for mutation detection). Functional enrichment analysis showed that these genes are closely related to the occurrence and development of cancer. In contrast with current paradigms for analyzing cfDNA, we extended the application of cfDNA to the detection of chromatin accessibility. To the extent that cfDNA composition is affected by cell death resulting from malignancy, acute or chronic tissue damage, or other conditions, this method may expand the range of clinical scenarios in which cfDNA sequences serve as a clinically useful biomarker, making cfDNA a more powerful tool for clinical diagnosis. 5. Chromatin openness index of cell-free DNA. cfDNA is an important source of material for liquid biopsies and is currently used mainly for cancer mutation detection. In previous work, the use of cfDNA was extended to detection of the chromatin openness state. As important epigenetic information, chromatin accessibility is a key feature of cancer progression. Studying chromatin accessibility in cancer can aid the early diagnosis, treatment and prognosis of cancer patients. In current studies, researchers generally screen
some cancer-related regions for study, an approach that has limitations and discards much important epigenetic information. To solve this problem, this study evaluated the genome-wide chromatin openness of cfDNA and proposed a new concept, the Chromatin Openness Index (COI), which is helpful for the clinical application of cfDNA. Based on SALP-seq and COI, cancer and normal samples were accurately classified using machine learning. The Probabilistic PCA-RF(3)-based model achieved an accuracy of 96.88%, with sensitivity, specificity and MCC of 96.30%, 100.00% and 0.8958, respectively. In conclusion, based on omics data, cfDNA sequencing and machine learning, this study discovered new markers and new technologies for the non-invasive diagnosis of cancer, providing new ideas and directions for early cancer diagnosis. |
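
For readers less familiar with the binary classification metrics reported for the COI-based model (accuracy, sensitivity, specificity and MCC), the following is a minimal sketch of how they are conventionally computed from a confusion matrix. The function name `binary_metrics` and the example counts are illustrative assumptions: the abstract does not state the underlying counts, and the values below are simply one hypothetical confusion matrix whose metrics agree with the reported figures after rounding.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate (recall on cancer samples)
    specificity = tn / (tn + fp)   # true negative rate (recall on normal samples)
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts (an assumption, not taken from the study):
# 27 cancer samples with 1 missed, 5 normal samples all correctly identified.
metrics = binary_metrics(tp=26, fn=1, tn=5, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC is a useful complement to accuracy here because it accounts for all four cells of the confusion matrix and is therefore less flattering than accuracy when the cancer and normal classes are imbalanced.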