| Objective: Using computational biology methods and sequencing data from Han Chinese, this study calculates a range of population genetic parameters for single nucleotide polymorphisms (SNPs) in the Han population and then applies machine learning to screen for ancestry informative markers (AIMs) in the Han Chinese genome. The aim is to construct a Han-specific AIM set that distinguishes the biogeographic ancestry (BGA) of Han Chinese, filling a gap in ancestry inference for this population and enabling efficient inference of Han Chinese substructure. Methods: Following the geographical divisions used in population genetics, the Han Chinese were divided into six regional groups: Northeast Han (NEH), Northwest Han (NWH), Central China Han (CCH), Southwest Han (SWH), Southern coastal Han (SCH) and Southeast Han (SEH). Core individuals representing each region were then selected from the sequencing data of 100 k Han as those lying in the region of highest population density after principal component analysis. The data for further analysis comprised 3212 individuals and 899877 SNPs: 1010 individuals from CCH, 737 from SEH, 549 from SWH, 472 from NEH, 246 from SCH and 198 from NWH. Population genetic parameters (Fst, Lei, In, KL) were computed for these groups, and the 10000 most informative loci were selected as candidates. The candidates were classified with a naive Bayes classifier and a support vector machine, and the best-performing subset was selected as the Han-specific AIM system. Results: 1. Specific AIM sets were selected for inferring Han Chinese ancestry. At 95% classification accuracy, 271 SNP AIMs effectively infer the North-South stratification of Han Chinese; 457 AIMs are needed to resolve the Northern Han Chinese, and 356 AIMs to resolve the Southern Han Chinese. 2. The effectiveness of
the system was verified by PCA and ADMIXTURE. 3. The forensic parameters show that the cumulative matching probability (CMP) of the 272 sites in the North-South Han-AIMs is 1.344e-73 and the cumulative exclusion probability is 0.999999998192, indicating that the system has potential for forensic personal identification. 4. The Han-AIMs screening also identified highly differentiated genetic loci in the Han Chinese. The most differentiated locus between the Northern and Southern Han is rs9614158 (GRCh37), located in the HORMAD2 region. Within the Northern Han, the most differentiated locus is at position 154197126 on chromosome 1; within the Southern Han, it is at position 18759629 on chromosome 10. Conclusions: 1. In this study, sequencing data of Han Chinese were screened for ancestry informative sites by machine learning for the first time. The screening yielded an SNP set that identifies the substructure of Han Chinese; the selected sites are more Han-specific, have high forensic value for personal identification and paternity testing, and are useful for identifying population substructure and correcting for population stratification. 2. Han-AIMs are expected to provide a valuable reference and practical tool for researchers in forensic genetics, criminal investigation, population genetics, precision medicine and multi-omics. |
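
The candidate-locus step described in the Methods (scoring each SNP with a population genetic parameter and keeping the top-ranked loci) could look roughly like the sketch below. It assumes that the "In" parameter is Rosenberg's informativeness for assignment, which the abstract does not define, and it is not the authors' pipeline. The function names `informativeness` and `top_loci` and the allele frequencies are hypothetical, invented only for illustration.

```python
import math


def informativeness(freqs):
    """Informativeness for assignment (I_n) of one biallelic SNP.

    freqs: frequency of one allele in each of K populations.
    Assumption: I_n = sum over alleles of
        -mean_p * ln(mean_p) + (1/K) * sum_i p_i * ln(p_i)
    """
    k = len(freqs)
    total = 0.0
    # Loop over the two alleles: the given one and its complement.
    for allele_freqs in (freqs, [1.0 - p for p in freqs]):
        mean = sum(allele_freqs) / k
        if mean > 0:
            total -= mean * math.log(mean)
        # Terms with p == 0 contribute nothing (limit of p*ln(p) is 0).
        total += sum(p * math.log(p) for p in allele_freqs if p > 0) / k
    return total


def top_loci(freq_table, n):
    """Return the n locus ids with the highest I_n, best first."""
    ranked = sorted(
        freq_table,
        key=lambda locus: informativeness(freq_table[locus]),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical allele frequencies in two populations (not real data).
example = {
    "snp_a": [0.5, 0.5],  # identical frequencies: uninformative
    "snp_b": [0.9, 0.1],  # strongly differentiated
    "snp_c": [1.0, 0.0],  # fixed difference: maximally informative
}
ranked = top_loci(example, 2)
print(ranked)
```

In the study the same kind of ranking would be applied to all 899877 SNPs with a cutoff of 10000 rather than 2, and the retained loci would then be passed to the classifiers.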