| Objective: (1) To analyze and characterize the molecular evolution of HIV-1 in Nanjing from 2015 to 2019, and to provide a theoretical basis for the prevention and control of the HIV-1 epidemic in Nanjing. (2) To construct HIV-1 molecular networks using the pairwise genetic distance method and the phylogenetic tree combined with maximum genetic distance method, to compare their accuracy and clustering rates, and to discuss the implementation method and appropriate parameters for constructing HIV-1 molecular networks in the local area. (3) To identify the dynamics of the HIV-1 molecular network in Nanjing, to analyze the correlation between the characteristics of people living with HIV-1 and these dynamics, and to explore the factors that affect the dynamics of the HIV-1 molecular network. (4) To construct machine-learning predictive models for the dynamics of the HIV-1 molecular network, in order to gain a deeper understanding of molecular network dynamics and to make better use of HIV-1 molecular network technology in the prevention and control of HIV/AIDS. Methods: (1) Survey data collection and analysis. All newly diagnosed HIV-positive individuals in Nanjing from 2015 to 2019 were recruited. Data were collected using a structured interviewer-administered questionnaire. Three categories of indicators were collected: sociodemographic characteristics, infection information, and HIV/AIDS-related knowledge and behavior. Questionnaire data were entered into a database using EpiData software and analyzed using SPSS statistical software (Version 24). Categorical data were expressed as rates (%), the chi-square test was used to analyze differences between groups, and the significance level α was 0.05. (2) Molecular evolution analysis of HIV-1 in Nanjing. We collected blood samples from people infected with HIV-1 in Nanjing from 2015 to 2019. HIV-1 RNA was extracted and then used in the subsequent reverse
transcription–polymerase chain reaction (RT-PCR) and nested PCR (nPCR) to generate the pol fragments. Rapid genotyping was performed using an HIV-BLAST search; MEGA X, FastTree, and BEAST were used to analyze the molecular evolution characteristics of the HIV-1 pol sequences in Nanjing. (3) Construction and analysis of the HIV-1 molecular network. The pol region gene sequences of CRF01_AE, CRF07_BC, CRF08_BC, and subtype B from China were downloaded from the Los Alamos HIV sequence database, and the pol sequences of spouses of people with HIV-1 infection were sorted out. Two molecular network construction methods were used to construct the HIV-1 molecular network, and the appropriate parameter range of each method was explored through the correct recognition rate of couples. We constructed the HIV-1 molecular network in Nanjing from 2015 to 2019 based on the two methods and performed a comprehensive analysis of its internal characteristics. (4) Construction and evaluation of machine learning predictive models. The baseline molecular network was constructed from the HIV-1 sequences of Nanjing from 2015 to 2017, and the end-point molecular network was constructed from the sequences of 2015 to 2019. The dynamics of the molecular network of Nanjing were identified according to the change of gene sequences between the two networks. Machine-learning algorithms were compared, according to accuracy, precision, recall, AUC, and F1 value, to build a predictive model for detecting the dynamics of the molecular network. Results: (1) A total of 1013 newly diagnosed HIV-positive individuals in Nanjing were investigated. The majority of the participants were men (958, 94.57%); most were infected through sexual intercourse, especially homosexual intercourse (750, 78.53%); almost half belonged to the floating population (45.86%); and participants with a college degree or above accounted for 67.33%. In total, 955 HIV-1 pol fragment sequences were successfully amplified from 1013
specimens. HIV-1 CRF01_AE and CRF07_BC were the predominant circulating strains in Nanjing, accounting for 40.84% and 33.61%, respectively. HIV-1 CRF01_AE strains prevailing in Nanjing were distributed in 5 branches, with an evolution rate of 2.96×10⁻³ [2.67×10⁻³–3.23×10⁻³]; HIV-1 CRF07_BC strains prevailing in Nanjing were mainly distributed on two branches, with an evolution rate of 3.36×10⁻³ [2.98×10⁻³–3.75×10⁻³]. Recombination analysis of 86 URF sequences showed four main branches: 01BC-like, 0107-like, 01C-like, and 01B-like strains. (2) CRF01_AE, CRF07_BC, CRF08_BC and subtype B showed the highest molecular clustering rate when the pairwise genetic distance was 0.007, 0.005, 0.008/0.009 and 0.010 substitutions/site, respectively. All subtypes had the highest molecular clustering rate under the criteria of 90 (node value) and 0.035 substitutions/site (maximum genetic distance) when using the phylogenetic tree combined with maximum genetic distance method. The clustered proportions for CRF01_AE, CRF07_BC, CRF08_BC and subtype B by the pairwise genetic distance method were lower than those by the phylogenetic tree combined with maximum genetic distance method (33.2% versus 55.3%, 45.6% versus 54.0%, 31.9% versus 36.2%, and 39.0% versus 42.5%, respectively; P<0.05). (3) Among 89 HIV-positive couples, 73 pairs were correctly recognized by the pairwise genetic distance method (genetic distance threshold 0.014), a correct recognition rate of 82.02%; seven sequences were incorrectly clustered, an incorrect clustering rate of 4.49%. With the phylogenetic tree combined with genetic distance method (90+0.045), 77 pairs were correctly recognized, a correct recognition rate of 86.25%; 69 clusters were formed, and no sequence was wrongly clustered. (4) The HIV-1 molecular network of Nanjing was constructed by each of the two methods. Most clusters in the molecular network were of size 2. There was a statistically significant difference in clustering rate between the two methods (χ²=12.55, P<0.001). A potential transmission association between
URFs and CRF01_AE sequences appeared in the molecular network constructed by the pairwise genetic distance method. (5) Multivariate analysis of the dynamics of the HIV-1 molecular network showed that being a student, belonging to the floating population, Han nationality, multiple sexual partners, casual sex, anal sex, and being single were independent risk factors, with OR values (95% CI) of 2.63 (1.54-4.47), 1.83 (1.17-2.84), 2.91 (1.09-7.79), 1.75 (1.06-2.90), 4.12 (2.48-6.87), 5.58 (2.43-12.80), and 2.10 (1.25-3.54), respectively. Compared with bisexuality, heterosexuality and homosexuality were protective factors for the dynamics of molecular networks, with OR values (95% CI) of 0.12 (0.05-0.32) and 0.26 (0.11-0.64). In addition, being qualified on the “National Eight Articles” on AIDS prevention and having had sexual education were protective factors, with OR values (95% CI) of 0.12 (0.05-0.32) and 0.26 (0.11-0.64), respectively. (6) The accuracy and the area under the receiver operating characteristic (ROC) curve (AUC) of the gradient boosted machines (GBM) model were the largest among the 8 models, at 0.78±0.0 and 0.81, respectively. The accuracy, precision, recall, F1 value and AUC value of the GBM model were larger than those of the logistic regression (LR) model. Permutation test analysis on the AUC values of these two models (LR and GBM) showed a significant difference (Z=0.03, P<0.001). The top six most important feature variables contributing to the GBM prediction model were multiple sexual partners, qualification on the “National Eight Articles” on AIDS prevention, anal sex, condom use, infection route, and casual sex. Conclusions: (1) The majority of participants were men, and sexual transmission among MSM was the main transmission route. Most of the cases were young and highly educated. This population was in a sexually active period, and risky sexual behaviors such as multiple sex partners, casual sex, and inconsistent condom use provided recombination opportunities for the emergence of new HIV-1 CRFs and URFs, greatly increasing the difficulty of HIV/AIDS prevention and control. (2) The subtypes of HIV-1
in Nanjing were diverse and complex. CRF01_AE and CRF07_BC were the primary genotypes. There were many new CRFs and URFs; the recombination forms of URFs were diverse, and most were related to CRF01_AE. (3) With the pairwise genetic distance method, different HIV-1 subtypes should use different thresholds for constructing the HIV-1 molecular network. In contrast, the phylogenetic tree combined with genetic distance method appeared more robust and better grounded in phylogenetic evolution, and can be used to construct HIV-1 molecular networks for different subtypes. (4) To our knowledge, this is the first study to analyze the dynamic characteristics of the HIV-1 molecular network in Nanjing and its influencing factors. Machine learning models can predict the dynamics of the HIV-1 molecular network well and provide new information for targeted HIV interventions based on the molecular network. |
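
The pairwise genetic distance method described in the abstract can be illustrated with a minimal sketch. It is not the study's pipeline: the abstract does not name the distance model or software used, so the uncorrected p-distance, the function names (`p_distance`, `build_clusters`, `clustering_rate`), the toy sequences, and the toy threshold of 0.06 are all assumptions made for illustration. The idea shown is only the general one: link two sequences when their distance is at or below a threshold, treat connected groups of two or more sequences as clusters, and report the clustering rate as the share of sequences that fall in any cluster.

```python
from __future__ import annotations

from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption: uncorrected p-distance; sites with a gap or N are skipped.
    """
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared if compared else float("nan")


def build_clusters(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    """Link pairs with distance <= threshold; return components of size >= 2."""
    parent = {name: name for name in seqs}

    def find(n: str) -> str:
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) >= 2]


def clustering_rate(seqs: dict[str, str], threshold: float) -> float:
    """Share of sequences that belong to any cluster."""
    clustered = sum(len(c) for c in build_clusters(seqs, threshold))
    return clustered / len(seqs)


# Toy aligned sequences (hypothetical, 20 sites each).
toy = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGTACGTACGA",  # one site differs from s1
    "s3": "TTTTACGTACGAACGTACGT",  # four sites differ from s1
    "s4": "GGGGGGGGGGCCCCCCCCCC",
}

print(build_clusters(toy, threshold=0.06))
print(clustering_rate(toy, threshold=0.06))
```

Because the clustering rate depends directly on the threshold, scanning a range of thresholds per subtype with a function like `clustering_rate` is one way to see why the abstract reports a different best threshold for each subtype.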