| Chloroplast sequences have been widely used in phylogenomic studies because of the simple structure of the circular chloroplast genome, its predominantly maternal inheritance, and its rate of evolution. Whole chloroplast genomes can be readily obtained from plants using next-generation sequencing (NGS) methods, providing invaluable data for species delimitation and systematics. Moreover, these genomes have been widely used in agricultural, evolutionary, and ecological studies and in food identification, and they are currently the most frequently deposited eukaryotic genomes in genetic sequence databases. Given the high availability of these genetic resources, whether generated with NGS approaches or retrieved directly from online platforms, many high-throughput methods are now used to characterize the genomic states of biological samples. Nevertheless, the rapid development of sequencing and bioinformatic approaches for recovering plastome sequences can make it difficult to choose the most effective approaches and settings. The trend towards massive sequencing of complete plastid genomes highlights the need for standardized, well-documented bioinformatic workflows, and efficient workflows that allow precise genome sequences to be reproduced are particularly important. Yet a majority of published studies still fail to provide the details needed to replicate their bioinformatic analyses and often merely list the names and version numbers of the software tools used. Furthermore, with the rapid increase in genomic resources made available by NGS technologies, efficient and standardized metadata curation approaches have become increasingly critical for the post-processing stages of biological data. Especially in studies using chloroplast genome datasets, the assembly of the main structural regions in random order and orientation is a major limitation on our ability to easily generate “ready-to-align” datasets for phylogenetic reconstruction at narrow taxonomic scales. In
addition, current practices discard the most variable regions of the genomes to facilitate alignment of the remaining coding regions. Nevertheless, no software is currently available that performs curation to this degree through simple detection, organization, and positioning of the main plastome regions, which makes curation a time-consuming and error-prone process. The present thesis offers an integrated approach, with both molecular and bioinformatic components, to addressing these issues in plant phylogenomics, and is structured into the following four chapters. Chapter 1 presents an introduction to the most commonly used current methods and workflows in chloroplast genome sequencing, assembly, annotation, alignment, and phylogenetic tree inference. Strong emphasis is placed on the standardization and reproducibility of bioinformatic workflows in NGS-driven genomic research. This emphasis is motivated by previous experience in several plant phylogenomic studies, as well as by the ways in which software tools can affect downstream analyses and, ultimately, the final results. Chapter 2 describes a practical application of current phylogenomic strategies to the de novo assembly of the chloroplast genome of Sinopora hongkongensis, a critically endangered endemic tree species restricted to Hong Kong. We describe the laboratory procedure, the bioinformatic workflow, and the results of sequencing and characterizing the plastid genome. Specifically, we provide a detailed description of the bioinformatic steps taken to assemble and annotate this plastid genome. After assembly and annotation, we compare the genome against representative species of Lauraceae and other families. Here, we aligned the whole cpDNA regions of eighteen species as a basis for subsequent phylogenetic analysis. Finally, we conducted a simple phylogenetic tree inference as a proof of concept for the completeness of our workflow. In Chapter 3, we introduce ECuADOR, a fast and user-friendly Perl
script specifically designed to automate the detection and reorganization of newly assembled plastomes obtained from any available source (NGS, Sanger sequencing, or assembler output). ECuADOR uses a sliding-window approach to detect long repeated sequences in draft sequences and thereby identifies the inverted repeat regions (IRs), even in the presence of artefactual breaks or sequencing errors, and then automates the rearrangement of the sequence into the widely used LSC-IRb-SSC-IRa order. This facilitates rapid post-editing steps such as the creation of genome alignments, detection of variable regions, SNP detection, and phylogenomic analyses. ECuADOR was successfully tested on plant families throughout the angiosperm phylogeny by curating 161 plastomes. For each dataset, ECuADOR first identified and reordered the main structural regions (LSC-IRb-SSC-IRa) and then produced a new annotation for the chloroplast sequences. The process took less than 20 minutes, with a maximum memory requirement of 150 MB and an accuracy of over 99%. ECuADOR is the only de novo, one-step recognition and reordering tool that facilitates the post-processing analysis of extranuclear genomes from NGS data. The program is available at https://github.com/BiodivGenomic/ECuADOR/. Finally, in Chapter 4 we synthesize the most important results of our research, highlighting our main motivations, the novelty of the work, our discoveries, and future directions. We have attempted to cover the wide range of issues facing modern phylogenomics and to summarize some of the most pressing challenges the field is experiencing as a consequence of the advent of molecular and genomic data. We emphasize that automated data-mining approaches remain incomplete and continue to change over time, and that they can yield problematic datasets and results. Such errors could compromise phylogenetic accuracy and could go undetected in the absence of expert knowledge. |
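
To make the detection and reordering step described for Chapter 3 more concrete, the following is a minimal Python sketch of the general idea; it is not ECuADOR's Perl implementation. It swaps the mismatch-tolerant sliding window for exact k-mer seeding, so it would not cope with sequencing errors or artefactual breaks. It further assumes an uppercase A/C/G/T sequence whose linear start does not fall inside an IR copy, and it ignores the orientation of the single-copy regions. The names `revcomp`, `find_inverted_repeat`, and `reorder_plastome` and the seed length `k` are hypothetical choices made for illustration.

```python
from collections import defaultdict

_PAIR = {"A": "T", "C": "G", "G": "C", "T": "A"}
_TABLE = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_TABLE)[::-1]


def find_inverted_repeat(seq, k=20):
    """Return (a1, a2, b1, b2) for the longest pair found with
    seq[a1:a2] == revcomp(seq[b1:b2]) and a2 <= b1, or None."""
    n = len(seq)
    index = defaultdict(list)            # k-mer -> start positions
    for pos in range(n - k + 1):
        index[seq[pos:pos + k]].append(pos)

    best = None
    for i in range(n - k + 1):
        # Speed shortcut: skip seeds inside the best first copy so far.
        if best and best[0] <= i < best[1]:
            continue
        for j in index.get(revcomp(seq[i:i + k]), ()):
            if j < i + k:                # copies must not overlap
                continue
            # Extend away from the region between the two copies.
            left = 0
            while (i - left - 1 >= 0 and j + k + left < n
                   and seq[i - left - 1] == _PAIR.get(seq[j + k + left])):
                left += 1
            # Extend towards the region between the two copies.
            right = 0
            while (i + k + right < j - 1 - right
                   and seq[i + k + right] == _PAIR.get(seq[j - 1 - right])):
                right += 1
            hit = (i - left, i + k + right, j - right, j + k + left)
            if best is None or hit[1] - hit[0] > best[1] - best[0]:
                best = hit
    return best


def reorder_plastome(seq, k=20):
    """Rotate a circular plastome so it reads LSC-IRb-SSC-IRa."""
    hit = find_inverted_repeat(seq, k)
    if hit is None:
        raise ValueError("no inverted repeat found")
    a1, a2, b1, b2 = hit
    inner = seq[a2:b1]                   # single-copy region between the IRs
    outer = seq[b2:] + seq[:a1]          # single-copy region across the ends
    if len(outer) >= len(inner):         # the longer region is taken as LSC
        return outer + seq[a1:a2] + inner + seq[b1:b2]
    return inner + seq[b1:b2] + outer + seq[a1:a2]
```

Both return branches are rotations of the circular input, so no bases are added, lost, or reverse-complemented. A production tool would also need to tolerate mismatches within the IRs and handle a linearization point inside an IR copy, which this sketch deliberately leaves out.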