
A Reference Genome Sequence of Gossypium hirsutum TM-1 and Its Uses in Cotton Comparative Genomics Analysis

Posted on: 2017-09-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J D Chen
GTID: 1363330518480185
Subject: Crop Genetics and Breeding
Abstract/Summary:
Cotton is one of the world's most important cash crops: cotton fiber is the most important natural raw material for the textile industry, and cottonseed is also used to produce edible oil. Allotetraploid Upland cotton (G. hirsutum L.) accounts for more than 90% of cultivated cotton worldwide and is the main source of renewable textile fiber, while Sea Island cotton (G. barbadense L.) accounts for 5%-8%. Analysis of the Upland cotton genome sequence is the foundation of cotton genome research: it not only sheds light on the structural features of the genome but also lays the groundwork for systematic investigation of genome structure and evolution, comparative genomics, molecular design breeding and fine mapping of important traits. The closest extant relatives of the original tetraploid progenitors are the A-genome species G. herbaceum and the D-genome species G. raimondii. A-genome species produce natural fiber used by the textile industry whereas D-genome species do not, although the D genome contains many genes and regulatory elements related to fiber development. G. hirsutum and G. barbadense are the most important cultivated allotetraploid species in the world, and they differ markedly in yield and quality: G. hirsutum provides high yield, and G. barbadense provides good fiber quality. Identifying variants through comparative genomics lays the genetic foundation for elucidating the molecular mechanisms behind the differences in fiber quality, yield, disease resistance and stress tolerance.

In this study, we updated our ultra-dense inter-specific genetic map and used it to assist allotetraploid cotton genome sequencing. We then completed the draft genome of allotetraploid cotton G. hirsutum L. acc. TM-1 and analyzed its structural features. Finally, variants were identified by comparing the TM-1 genome sequence with other completed cotton genome sequences. The main findings are as follows.

An ultra-dense inter-specific genetic map was constructed, consisting of 4,049 recombination bins (4,999,048 SNPs) in 26 linkage groups and covering 4,042 cM, with an average inter-bin distance of 1.0 cM. Compared with the previous ultra-dense inter-specific genetic map, the number of SNPs increased by 3,329,284, the number of bins increased by 1,158, the total map length increased by 261.2 cM and the average inter-bin distance decreased by 0.31 cM. When the ultra-dense allotetraploid genetic map was aligned with the G. raimondii genome, two reciprocal translocations, 15 simple translocations and 19 possible inversions were identified. We also used this map to assess and validate the genome assemblies of tetraploid cottons. In TM-1 genome assembly versions v0.1 and v1.0, 90 and 128 misassembled scaffolds were detected, accounting for 36.0 Mb and 406.2 Mb, respectively. After these misassembled scaffolds were broken, 6,146 scaffolds (2.3 Gb, 94.6%) were anchored to the chromosomes. Similarly, 717 and 1,941 misassembled scaffolds were detected in Xinhai21 genome assembly versions v2.0 and v2.1, accounting for 208.0 Mb and 1.4 Gb, respectively. After the misassembled scaffolds in version v2.1 were broken, the scaffolds anchored to the chromosomes accounted for 1.95 Gb (88.0%).

Short-insert paired-end libraries (180, 300 and 500 bp) and large-insert mate-pair libraries (2, 5, 10 and 20 kb) were prepared for Illumina sequencing. In total, 612.4 Gb (245-fold coverage) of clean DNA sequencing data were generated for the genome assembly. The resulting scaffolds were integrated using BAC-end data and the ultra-dense genetic map and assembled into the TM-1 genome sequence (v1.1), which comprised 265,279 contigs (N50 = 34.0 kb) and 40,407 scaffolds (N50 = 1.6 Mb). The total scaffold length (2.4 Gb) spanned approximately 96% of the estimated allotetraploid genome (2.5 Gb), including 4,635 scaffolds (1.5 Gb) in the A subgenome and 1,511 scaffolds (0.8 Gb) in the D subgenome. Furthermore, 1,456 scaffolds (1.9 Gb, 79.2%) were oriented based on linkage maps, including 955 scaffolds (1.2 Gb) in the A subgenome and 501 scaffolds (769.5 Mb) in the D subgenome.

At least 64.8% of the assembled genome consists of transposable elements (TEs), including 1,081.3 Mb (52.3%) of retroelements, 22.4 Mb (1.1%) of DNA transposons and others. The assembled A subgenome (1,477 Mb) is nearly twice the size of the D subgenome (831 Mb), primarily because the A subgenome contains more TEs (at least 843.5 Mb) than the D subgenome (at least 433 Mb). Among them, Gypsy retroelements (25.33%, 523.85 Mb) were about threefold more abundant in the A subgenome (362 Mb) than in the D subgenome (136 Mb). The allotetraploid cotton sequence contains 70,478 predicted protein-coding genes with an average length of 1,179 bp. By comparing genetic and physical distances between adjacent markers, we found 26 recombination suppression regions in the 26 linkage groups, which might correspond to heterochromatic regions. Based on the SNP frequency across the chromosomes between TM-1 and Hai7124, nine SNP-poor regions were identified.

In the comparative genomic analysis, the genomic identity peak was 90.0% between G. arboreum and G. raimondii and 99.0% between G. hirsutum TM-1 and G. barbadense Xinhai21. The identity peak between the TM-1 A subgenome and G. arboreum (98.0%) was higher than that between the TM-1 D subgenome and G. raimondii. Three reciprocal translocations (A04 and A05; A01, A02 and A03) were identified between the TM-1 genome and the G. arboreum genome, which suggests that the G. herbaceum genome is more similar to the TM-1 A subgenome. When the TM-1 genome was aligned to the G. raimondii genome, several large structural variants (>1 Mb) were detected, including 28 possible inversions and nine translocations. Five of the possible inversions (A02:38.03-53.01 Mb, A02:60.09-65.93 Mb, A02:73.29-78.29 Mb, A03:36.98-43.98 Mb, A05:67.01-76.01 Mb) were located in the reciprocal translocation regions, indicating that complex structural rearrangements occurred in the tetraploid. Overall, the number of rearrangements was similar in the A and D subgenomes (19 versus 18). However, the total length of rearrangements was larger in the A subgenome (372.6 Mb) than in the D subgenome (82.6 Mb). The average length per rearrangement was 19.6 Mb in the A subgenome, significantly larger than the 4.6 Mb in the D subgenome (t-test, P = 0.0064).

To identify variations between G. hirsutum and G. barbadense, we aligned the de novo assembly of Xinhai21 to the reference TM-1 genome and detected 1,478,026 SNPs and 1,084,758 INDELs. Of these SNPs, 63.2% were novel compared with a previous study. A total of 1,033,359 primer pairs were designed from the flanking sequences of the INDELs, and only 117,551 (11.4%) of them were redundant with previous research. The genome-wide frequencies of SNPs and INDELs were 4.5 SNP/kb and 0.43 INDEL/kb, respectively. Based on the SNP frequency across the chromosomes, we identified 15 SNP-poor regions and five SNP-rich regions.
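The summary statistics quoted in this abstract (N50, fold coverage, average rearrangement length) follow standard definitions. The Python sketch below illustrates those definitions only; it is not the pipeline used in the dissertation. The function names and the toy scaffold lengths are hypothetical, and the other inputs are the rounded totals reported in the abstract.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def fold_coverage(sequenced_bases: float, genome_size: float) -> float:
    """Sequencing depth: total sequenced bases divided by genome size."""
    return sequenced_bases / genome_size


def mean_length(total_length: float, count: int) -> float:
    """Average length per event, e.g. Mb per rearrangement."""
    return total_length / count


if __name__ == "__main__":
    # Hypothetical toy scaffold lengths, not data from the dissertation.
    toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
    print(n50(toy_scaffolds))  # 70

    # Totals below are the rounded values reported in the abstract.
    print(f"{fold_coverage(612.4e9, 2.5e9):.0f}-fold")  # 245-fold
    print(f"A: {mean_length(372.6, 19):.1f} Mb")  # A: 19.6 Mb
    print(f"D: {mean_length(82.6, 18):.1f} Mb")  # D: 4.6 Mb
```

The reported t-test (P = 0.0064) cannot be recomputed from these totals, because it requires the individual rearrangement lengths, which the abstract does not list.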
Keywords/Search Tags: cotton genome, Illumina sequencing, genetic map, physical map, genome analysis, comparative genomics, bioinformatics, SNP, INDEL