
Network Integration Analysis Method Based On SmCCNet And Its Application In Sweet Potato Multi-omics Data

Posted on: 2022-03-21; Degree: Master; Type: Thesis
Country: China; Candidate: S T He
GTID: 2553306344993799; Subject: Economic statistics
Abstract/Summary:
Sweet potato (Ipomoea batatas L.) is a vine plant in the genus Ipomoea of the family Convolvulaceae and is the sixth most important food crop in the world. It is cultivated worldwide for its high yield, stress resistance and high nutritional value. It is both an important guarantee of national food security and a vital raw material for the starch industry, so ensuring stable production, high yield and good quality has long been an important issue for biologists. The storage root is the crucial organ of sweet potato because of its high economic value, and it is the agronomic trait that determines yield. Its formation and development are complex biological processes coordinated by several internal and external factors, involving regulation at the genomic, transcriptional, protein and metabolic levels, all of which interact extensively with one another. In recent years, with the development of functional genomics, sweet potato breeding has gradually shifted from traditional breeding to molecular breeding, and understanding the molecular mechanism of storage root development is the basis of molecular breeding. In addition, various omics technologies have been developed, and increasingly rich multi-omics data make it possible to explore the molecular mechanism of storage root development more comprehensively and to screen out key regulatory factors for precision breeding. Integrated analysis of multi-omics data has therefore become a research hotspot.

However, biological omics data generally have few samples and a large number of genes, and only a few of these genes are closely related to a specific phenotype, which strongly affects the accuracy of key gene selection. For such high-dimensional multi-omics data, variable selection and dimensionality reduction are needed to filter out genes unrelated to the phenotype, so as to select the few key genes that are truly associated with it, simplify the model and improve its accuracy. In this study, an integrated analysis of mRNA and miRNA omics data from sweet potato roots was conducted. Taking the correlation between these omics data into account, a multi-omics integrated network specific to the root weight phenotype was first constructed using SmCCNet, an algorithm based on canonical correlation analysis. To select the key variables related to the phenotype from the constructed network groups, a two-level (bi-level) variable selection was carried out to select important groups and important variables within them simultaneously. This was achieved with a two-level penalty: in the inner layer, regularization was used to filter the variables within each group, while a concave penalty was used to give the model a group structure. Additionally, biological omics data are usually very noisy, which makes the model structure difficult to estimate. To obtain a more accurate model, stability selection was used to improve the regularized variable selection. Stability selection combines sub-sampling with a variable selection algorithm; it controls the error rate of variable selection and improves model accuracy by controlling the sample size. Here, sub-sampling was repeated 100 times, each time drawing 2/3 of the total samples at random. The bi-level variable selection method was applied to each sub-sample, and the variables selected with high frequency were used for linear regression modeling.

In summary, SmCCNet, bi-level Lasso and stability selection were combined into a multi-omics integrated analysis model, which was applied in an empirical analysis of sweet potato mRNA and miRNA omics data. Most of the omics variables selected with high frequency had important biological significance. Linear regression models were then fitted with the selected variables. Compared with single-omics Lasso, multi-omics Lasso and multi-omics GLasso, the regression model based on SmCCNet, bi-level Lasso and stability selection had the smallest prediction error and the best prediction performance.
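
The sub-sampling procedure described in the abstract (100 random draws of 2/3 of the samples, variable selection on each draw, then linear regression on the high-frequency variables) can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: a plain Lasso solved by coordinate descent stands in for the bi-level penalty, the SmCCNet network construction step is omitted, and the function names, the penalty value `lam`, the frequency threshold of 0.6 and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Plain Lasso by coordinate descent, minimising
    (1/2n)||y - X b||^2 + lam * ||b||_1.
    Assumption: used here only as a stand-in for the bi-level penalty."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()  # residual y - X @ beta, with beta = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the partial residual
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def stability_selection(X, y, lam, n_subsamples=100, frac=2 / 3,
                        threshold=0.6, seed=0):
    """Selection frequency of each variable over random sub-samples.
    The threshold of 0.6 is a hypothetical choice for this sketch."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(round(frac * n))
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=m, replace=False)  # draw without replacement
        Xs = X[idx] - X[idx].mean(axis=0)           # centre within the sub-sample
        ys = y[idx] - y[idx].mean()
        counts += (lasso_cd(Xs, ys, lam) != 0)
    freq = counts / n_subsamples
    return freq, np.flatnonzero(freq >= threshold)

def refit_ols(X, y, selected):
    """Ordinary least squares (with intercept) on the selected variables."""
    A = np.column_stack([np.ones(len(y)), X[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Synthetic data for illustration only: 60 samples, 20 variables,
# of which the first three carry signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 2.5 * X[:, 2] + 0.3 * rng.standard_normal(60)

freq, selected = stability_selection(X, y, lam=0.5)
coef = refit_ols(X, y, selected)
print("selected variables:", selected)
```

In this sketch the selection frequency plays the role of the "high-frequency" criterion in the abstract; in practice the penalty strength and the frequency threshold would have to be tuned, and the Lasso step would be replaced by the bi-level selection applied to the SmCCNet network groups.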
Keywords/Search Tags:SmCCNet, bi-level Lasso, stability selection, multi-omics integrated analysis, network analysis