| A genome is an organism’s complete set of genetic instructions, containing all the information needed to build the organism and allow it to grow and develop. Studying differences in genomic data is therefore crucial for understanding the differences between species, and between phenotypes of the same species, and thus for gaining insight into life. This thesis addresses two forms of genomic data: genome sequence data and gene expression level data. Based on the phylogenetic relationships between species or on cellular development, we propose a data-driven dissimilarity measuring framework built on the Siamese triplet network. Differences between genomes are analyzed at the level of species differences and at the level of cell development, in order to explore the phylogeny and the developmental processes of species. (1) For genome sequence data, alignment-based methods are computationally expensive and depend on a reference genome sequence. In contrast, sequence comparison models based on k-mer frequency information need neither alignment nor a reference database, and they greatly reduce computing time and resources. The Siamese triplet network therefore takes k-mer (6 bp) frequencies as features and learns a distance measure targeted at the taxonomy of species. We applied the network to compare genome sequences from the bacterial domain and from primate species, so that unknown species can be located on, and used to extend, the biological taxonomic tree, with a localization accuracy of more than 84%. In addition, both qualitatively and quantitatively, performance is significantly improved over the commonly used Manhattan distance. (2) The gene expression level reflects the abundance of the gene transcription product, mRNA, in the cell, and thus reflects the process of cell development and functional development. In this thesis, single-cell data from the developmental process are clustered and analyzed by using the Siamese triplet network to learn a temporal scale of cell development, with gene expression abundance during single-cell development as features. Gene expression data from different stages of mouse embryonic development, obtained by four sequencing methods, are reduced in dimension by the Siamese triplet network. The reduced-dimensional data can be used not only to predict the developmental stage of unknown cell expression data, but also to recover the developmental process in the low-dimensional space. More importantly, the model is extensible and can be used to predict data from different sequencing methods, which is of great significance for comparing gene expression level data. We propose a weakly supervised, data-based, scenario-oriented sequence comparison framework built on the Siamese triplet network, which requires no extra work to collect accurate baseline distances for training. The embedding function is learned automatically from easily acquired triplets and can be applied to scenarios with different data characteristics, including single-level distributions and multi-level category distributions. |
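
As a rough illustration of the ingredients named in the abstract, the sketch below computes a normalized 6-mer frequency vector, the Manhattan distance used as the baseline, and a hinge-style triplet loss. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the A/C/G/T alphabet, the skipping of windows with ambiguous bases, the margin value, and the exact form of the loss are all assumptions, and in the thesis the loss would act on distances between learned embeddings rather than on raw frequencies.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Assumption: alphabet is A/C/G/T, k-mers are ordered lexicographically,
    and windows containing any other symbol (e.g. N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:
            counts[pos] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def triplet_margin_loss(d_anchor_pos, d_anchor_neg, margin=1.0):
    """Hinge-style triplet loss on two precomputed distances (assumed form).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    return max(0.0, d_anchor_pos - d_anchor_neg + margin)
```

A triplet here would consist of an anchor genome, a genome from the same taxonomic group (positive), and one from a different group (negative). This matches the weakly supervised setting described above, in which only relative similarity is needed and no accurate baseline distances have to be collected.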