| The T-cell receptor (TCR) is a specific receptor on the surface of T cells that recognizes antigens presented by the major histocompatibility complex (MHC). The TCR is a heterodimer composed of an alpha chain and a beta chain, each of which is divided into variable, constant, and transmembrane regions. The TCR is formed through a process called V(D)J rearrangement, in which gene segments of the variable region genes on the chromosome are recombined to form the complete variable region sequence, resulting in a high degree of TCR diversity. The TCR diversity of an individual can reflect immune status, and thus TCRs can be used to predict disease progression and guide clinical treatment. In this paper, we developed a series of research methods and tools for TCR high-throughput sequencing (TCR-Seq) data, comprising the following three main parts: 1. CATT: an ultra-sensitive TCR detection method. Identifying and analyzing TCRs from high-throughput sequencing data has become the mainstream approach to studying TCRs. However, current tools still perform inadequately in TCR detection, especially on data with short read lengths or small data volumes, and most methods are not sensitive enough. Therefore, this study developed a novel TCR identification tool called CATT (CharActerizing TCR reperToires), which is applicable to a variety of TCR-containing sequencing data, such as single-cell and bulk RNA sequencing (RNA-Seq) or TCR-Seq. To identify TCR sequences, CATT adopts a highly specialized, greedy network-flow assembly algorithm and an adaptive sequence error correction method to achieve good performance and efficiency. On short-read or single-cell sequencing data, CATT performs substantially better than other methods, with an average improvement of 50% in recall. 2. TCRdb: the most comprehensive TCR database with a powerful search function. As TCR research becomes widespread, more and more TCR-Seq data are produced, covering different
tissues, disease states, and cell types. However, these data are not effectively utilized because they have not been integrated and processed. In this study, TCR-Seq data from over 8,000 samples covering 98 diseases, 32 tissues, and 14 cell types were collected from public databases. After data quality control and TCR detection, over 270 million high-quality TCR sequences were obtained, and TCRdb, the most comprehensive TCR-Seq database, was constructed. TCRdb allows users to retrieve and browse TCR-Seq samples and datasets under different conditions and to view the characteristics of their TCR repertoires, including the TCR CDR3 sequence length distribution, VJ gene association, TCR repertoire diversity, and conserved sequences in the TCR repertoire. TCRdb is the first database to support searching for specific TCR sequences in massive data, allowing users to query the distribution of specified TCR sequences across different samples. 3. DeRR: a single-cell sequencing-based tool for dual-TCR identification. Most T cells express only one kind of TCR, but a subset of T cells express two TCRs, a condition known as dual TCR. Dual TCRs have been found to be associated with autoimmune diseases, inflammation, and graft-versus-host disease. The lack of technology to isolate and identify dual-TCR cells has limited researchers' understanding of dual TCRs. However, single-cell sequencing technology enables the detection of dual TCRs. To this end, we developed DeRR, the first tool for identifying dual TCRs in single-cell sequencing data. Subsequently, single-cell sequencing data of over 600,000 T cells from public datasets were analyzed using DeRR to obtain the most comprehensive dual-TCR landscape. The results showed that approximately 15% to 25% of T cells expressed dual TCRs and that approximately 8% of T cells expressed two β chains, which is higher than previously estimated. The proportion of T cells expressing dual TCRs varied across conditions and was significantly higher in cancer samples. The dual-TCR-expressing
T cells were evenly distributed among T-cell subpopulations, without preference. TCR specificity analysis showed that the secondary TCR in dual-TCR cells exhibited stronger cross-reactivity and was more likely to dominate the specificity of the whole cell. In summary, this study developed CATT, a highly sensitive TCR repertoire identification algorithm. By using CATT to identify the TCR repertoires of large-scale TCR-Seq data, a comprehensive TCR sequence database, TCRdb, was constructed. Based on CATT, a dual-TCR identification algorithm, DeRR, was developed for single-cell sequencing data and used to provide a comprehensive landscape of dual-TCR-expressing T cells. These methods, database, and results offer important resources for TCR research. |
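
The repertoire features named above (CDR3 length distribution, repertoire diversity) and the notion of a dual-TCR cell can be illustrated with a minimal Python sketch. It is not the implementation used by CATT, TCRdb, or DeRR: the function names, the input layouts, the use of Shannon entropy as the diversity measure, and the rule "two or more distinct chains of the same type in one cell" are all assumptions made for illustration, and the sequences in the demo are made up.

```python
import math
from collections import Counter, defaultdict


def cdr3_length_distribution(clonotypes):
    """Return {CDR3 length: fraction of reads} from {cdr3_sequence: read_count}."""
    total = sum(clonotypes.values())
    lengths = Counter()
    for cdr3, count in clonotypes.items():
        lengths[len(cdr3)] += count
    return {length: n / total for length, n in sorted(lengths.items())}


def shannon_diversity(clonotypes):
    """Shannon entropy (natural log) of clonotype frequencies.

    Assumption: entropy stands in for "repertoire diversity"; other
    indices (clonality, Simpson, ...) could be used instead.
    """
    total = sum(clonotypes.values())
    return -sum(
        (c / total) * math.log(c / total)
        for c in clonotypes.values()
        if c > 0
    )


def classify_cells(chain_records):
    """Label each cell "dual" or "single".

    chain_records: iterable of (cell_id, chain, cdr3) tuples, where chain
    is a label such as "TRA" or "TRB" (hypothetical input layout).
    A cell is called "dual" here if it carries two or more distinct CDR3
    sequences for the same chain type; this is a simplification.
    """
    per_cell = defaultdict(lambda: defaultdict(set))
    for cell_id, chain, cdr3 in chain_records:
        per_cell[cell_id][chain].add(cdr3)
    return {
        cell_id: "dual" if any(len(s) >= 2 for s in chains.values()) else "single"
        for cell_id, chains in per_cell.items()
    }


if __name__ == "__main__":
    # Made-up clonotype counts and per-cell chain calls, for demonstration only.
    repertoire = {"CASSLGQGYEQYF": 6, "CASSPDRGNTEAFF": 3, "CASRTGGNEQFF": 1}
    print(cdr3_length_distribution(repertoire))
    print(shannon_diversity(repertoire))

    cells = [
        ("cell1", "TRA", "CAVRDSNYQLIW"),
        ("cell1", "TRB", "CASSLGQGYEQYF"),
        ("cell2", "TRA", "CAVRDSNYQLIW"),
        ("cell2", "TRB", "CASSLGQGYEQYF"),
        ("cell2", "TRB", "CASSPDRGNTEAFF"),
    ]
    print(classify_cells(cells))
```

In practice, a tool working on real single-cell data would also have to separate genuine second chains from sequencing errors and ambient contamination, which this sketch does not attempt.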