Analysis Of The Genetic Resources Of World Cattle Breeds And Construction Of Pan-genome Using Non-reference Genome Sequences

Posted on: 2023-09-21
Degree: Master
Type: Thesis
Country: China
Candidate: X T Han
GTID: 2543306842969209
Subject: Agriculture
Abstract/Summary:
Genomic sequence differences among cattle breeds and individuals are the main reason they develop different phenotypes and hold different economic value. The long-standing use of the genome of a single breed and individual (Hereford cattle) as the reference for cattle research has severely limited the exploitation of superior genetic resources in other breeds and individuals. In this study, we used high-depth sequencing data (read depth > 15×) to identify sequence differences (insertion variants) in 450 cattle of 31 breeds relative to the Hereford reference genome, evaluated the biological effects of these unknown-sequence insertion variants, and further explored the genetic resources they represent. Finally, we used second-generation high-throughput sequencing data from 898 individuals of 57 breeds to construct a representative multi-breed pan-genome based on the Hereford cattle reference genome, annotated the unknown sequences in the pan-genome that have coding ability, and produced a preliminary representative cattle reference genome. The main results are as follows.
(1) A total of 58,862 unknown-sequence insertion variants were detected in the second-generation high-throughput sequencing dataset of 450 cattle from 31 breeds with read depth > 15×. The same insertion variant site was detected up to 791 times, the total length of inserted sequence was about 194 Mb, and the average insertion length was 315 bp; insertion lengths were concentrated in the range of 50-1000 bp. Notably, insertion variants in the Bos indicus populations were more numerous (1,886) and longer (343 bp) than those in the Bos taurus populations, and markedly more numerous and longer than those in Hereford cattle (1,110; 305 bp). The limitations of the Hereford reference genome were thus tentatively demonstrated. Separate annotation of the insertion sites against gene functional elements showed that insertion sites were enriched in non-genic regions of the bovine genome: 23,423 insertion sites occurred in genic regions and only 22% in coding regions, where they may affect gene coding sequences.
(2) Population structure analysis was then carried out using the insertion variants. Principal component analysis showed that the first principal component clearly distinguished the three groups of Bos indicus, crossbred cattle and Bos taurus, and pedigree analysis yielded the same result, demonstrating that unknown-sequence insertion variation differs selectively among cattle breeds. Calculating the population differentiation index (Fst) of the insertion variant loci between the Bos indicus and Bos taurus populations identified 333 significantly differentiated loci (top 1%). These loci affected a total of 190 genes. Functional enrichment analysis (GO and KEGG) of the affected genes showed significant enrichment (p < 0.05) of insertion variant loci in both populations in terms related to olfaction, immunity and substance metabolism, indicating that the bovine reference genome lacks sequence information related to these biological functions, in agreement with previously reported results.
(3) Based on second-generation high-throughput sequencing data of 898 cattle from 57 breeds (read depth > 5×), a total of 4,285,821,838 sequences that failed to align to the reference genome were extracted and assembled into 2,791,151 contigs; removing sequences shorter than 1000 bp and those classified as contaminants left 543,702 sequence fragments. Sequence similarity matching then identified 38,980 representative sequences with a total length of about 74 Mb. A bovine pan-genome representing the genomic information of these 898 cattle was constructed by combining the latest bovine reference genome (ARS-UCD1.2) as the framework with the non-reference sequences missing from it. These non-reference sequences accounted for 2.662% of the bovine pan-genome.
(4) One-end-anchored reads, indirect BLAST and EST sequences were used to locate 300, 1,256 and 23 non-reference sequences, respectively. Comparing the localization information with known annotation files annotated 170 protein-coding genes, 13 long non-coding RNAs, 7 pseudogenes and 309 exons. Of all 38,980 non-reference sequences, 16,078 (41.25%) were predicted to be able to encode proteins and transcripts.
Starting from insertion variants in the genome, this study analyzed the distribution characteristics and biological effects of insertion variants in different cattle populations, which fully illustrates the limitations and biases of a single reference genome. On this basis, a preliminary cattle pan-genome representing 898 individuals was constructed, providing necessary reference sequence information for genomic selection breeding, for mining genetic information in the cattle breeding industry, and even for the treatment and prevention of diseases caused by structural variants.
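The Fst-based screening described in result (2) can be illustrated with a minimal sketch. The abstract does not say which Fst estimator or software was used, so the code below assumes Wright's heterozygosity-based definition for a biallelic insertion (present or absent) with sample-size weighting. The function names `wright_fst` and `top_fraction` are hypothetical, and the allele frequencies are made-up illustration values, not thesis data.

```python
import math
from typing import List, Sequence


def wright_fst(p1: float, p2: float, n1: int, n2: int) -> float:
    """Wright's Fst at one biallelic locus (assumed estimator).

    p1, p2: insertion-allele frequencies in the two populations.
    n1, n2: sample sizes used as weights.
    """
    total = n1 + n2
    p_bar = (n1 * p1 + n2 * p2) / total
    # Expected heterozygosity of the pooled population.
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    if h_t == 0.0:
        # Locus is fixed in both populations: no differentiation to measure.
        return 0.0
    # Weighted mean of within-population expected heterozygosity.
    h_s = (n1 * 2.0 * p1 * (1.0 - p1) + n2 * 2.0 * p2 * (1.0 - p2)) / total
    return (h_t - h_s) / h_t


def top_fraction(scores: Sequence[float], fraction: float = 0.01) -> List[int]:
    """Indices of the highest-scoring loci, keeping ceil(fraction * n), at least 1."""
    k = max(1, math.ceil(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Illustrative frequencies (hypothetical): (Bos indicus, Bos taurus) per locus.
freqs = [(0.90, 0.10), (0.50, 0.50), (0.30, 0.20)]
fst_values = [wright_fst(p1, p2, 50, 50) for p1, p2 in freqs]
outliers = top_fraction(fst_values, fraction=0.01)
```

Ranking loci and keeping a fixed top fraction is an empirical-outlier approach: it flags the most differentiated loci without assuming a null distribution for Fst.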
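The contig filtering in result (3) and the proportions reported in results (3) and (4) can be checked with a short sketch. The helper names are hypothetical and the toy contigs are not thesis data. The "implied" pan-genome size is only back-calculated from the rounded figures in the abstract (about 74 Mb and 2.662%), so it is approximate.

```python
from typing import Dict, FrozenSet


def filter_contigs(
    contigs: Dict[str, str],
    contaminants: FrozenSet[str] = frozenset(),
    min_len: int = 1000,
) -> Dict[str, str]:
    """Keep contigs of at least min_len bp that are not flagged as contaminants."""
    return {
        name: seq
        for name, seq in contigs.items()
        if len(seq) >= min_len and name not in contaminants
    }


def percent(part: float, whole: float) -> float:
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


def implied_total_mb(nonref_mb: float, nonref_percent: float) -> float:
    """Total pan-genome size implied by a non-reference length and its share."""
    return nonref_mb / (nonref_percent / 100.0)


# Toy input (hypothetical): one short contig, one kept contig, one contaminant.
toy = {"ctg1": "A" * 999, "ctg2": "C" * 1000, "ctg3": "G" * 2000}
kept = filter_contigs(toy, contaminants=frozenset({"ctg3"}))

# Figures taken from the abstract.
coding_share = percent(16078, 38980)      # share of sequences predicted to be coding
total_mb = implied_total_mb(74.0, 2.662)  # approximate, from rounded inputs
```

With the abstract's figures, `coding_share` rounds to the reported 41.25%, and the implied total of roughly 2,780 Mb follows from non-reference sequence of about 74 Mb making up 2.662% of the pan-genome.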
Keywords/Search Tags: insertional variation, population structure analysis, population differentiation index, pan-genome, gene annotation