| Antibodies, the secreted form of the B cell receptor, are an essential part of the adaptive immune system. They have immense sequence diversity (about 10^13) and can resist invasion by many types of pathogens. The whole collection of antibodies in an individual is referred to as the antibody repertoire, whose dynamic changes reflect the individual’s immune history and current immune status. Adaptive immune receptor repertoire sequencing (AIRR-seq), or repertoire sequencing (Rep-seq), can obtain millions to tens of millions of antibody sequences at one time. It has dramatically boosted the application of antibody repertoires to the study of disease occurrence, development, and diagnosis, and a large number of Rep-seq datasets have accumulated. Integrating these vast datasets is valuable for vaccine design, exploring immune responses in autoimmune and infectious diseases, and monitoring cancer prognosis. However, a platform for effectively integrating Rep-seq datasets has not yet been established, which limits comparative analysis and reuse of datasets across studies and institutions. Therefore, we collected the raw reads of 2449 Rep-seq datasets and reanalyzed them with a standard analysis pipeline. We then extracted features such as gene usage, somatic hypermutation (SHM) patterns, and clone diversity, and finally constructed RAPID (Rep-seq dataset Analysis Platform with an Integrated antibody Database), a web server with both query and data analysis functions. When users analyze datasets with RAPID, they can select particular samples from the 2449 datasets as controls, which removes tedious preprocessing steps for data reuse and helps to find disease-associated repertoire features. RAPID also stores 306 million clones, 521 therapeutic antibodies, and 88059 functional antibodies. Based on the CDR3 sequence, the platform can annotate clones automatically, compute statistics on the disease composition of annotated antibodies, and conduct enrichment analysis. In addition, RAPID supports functional antibody and repertoire queries. Based on these 2449 datasets, we 
investigated the distribution and function of public clones at the population level. In this study, a public clone was defined as an antibody with the same CDR3 amino acid sequence that occurs in more than two individuals. We found 5.07 million public clones, accounting for about 10% of the individual antibody repertoire. The annotation results show that public clones include therapeutic antibodies and virus-neutralizing antibodies. In addition, functional antibodies are enriched in public clones compared with private clones. Thus, public clones are potential candidates for antibody screening. Furthermore, we selected 326 pathogen-infected samples and 276 healthy controls and used 1915 repertoire-level features and 160 sequence-level features to characterize each antibody repertoire. After feature selection, 547 repertoire-level features and four sequence-level features were retained to construct the infectious disease prediction model, DeepID (Deep learning method for infection diagnosis). This model significantly outperformed traditional machine learning methods, and its AUC on the internal validation set was 0.9883. When applied to the classification of COVID-19 patients, the AUC dropped to 0.8267 but remained higher than that of the reference models. Taken together, our study builds on Rep-seq datasets. First, we summarized the large body of published datasets and established a comparative analysis platform. Based on these datasets, we also investigated the features of public clones and constructed a prediction model for infectious disease. |
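
The public clone definition used above (the same CDR3 amino acid sequence occurring in more than two individuals) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the function name `find_public_clones`, the `(individual_id, cdr3_aa)` record layout, the `min_individuals` parameter, and the toy CDR3 strings are all assumptions made for illustration.

```python
from collections import defaultdict


def find_public_clones(records, min_individuals=3):
    """Group clones by CDR3 amino acid sequence and keep the shared ones.

    records: iterable of (individual_id, cdr3_aa) pairs (hypothetical layout).
    min_individuals: "more than two individuals" is read literally as >= 3;
    this threshold is an assumption and can be adjusted.
    """
    carriers = defaultdict(set)
    for individual_id, cdr3_aa in records:
        # A set ignores repeated observations of a CDR3 within one individual.
        carriers[cdr3_aa].add(individual_id)
    return {
        cdr3: ids for cdr3, ids in carriers.items() if len(ids) >= min_individuals
    }


# Toy data with made-up CDR3 sequences, not taken from the study.
records = [
    ("donor1", "CARDYW"), ("donor2", "CARDYW"), ("donor3", "CARDYW"),
    ("donor1", "CAKGGW"), ("donor2", "CAKGGW"),
    ("donor3", "CTRSSW"),
    ("donor1", "CARDYW"),  # duplicate within the same donor
]

public = find_public_clones(records)
print(sorted(public))  # ['CARDYW']
```

Counting distinct individuals rather than sequence occurrences keeps a clone that is highly expanded in a single donor from being labeled public.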