| With the significant increase in trans-regional and transnational criminal activity, characterizing samples of unknown origin through in-depth mining of the genetic information in DNA from biological samples at crime scenes has become a research hotspot in recent years, and ethnic inference is an important research direction within it. Many systems for differentiating intercontinental, regional, and domestic populations have been published in China and abroad, and most of them can be used in criminal investigation and forensic identification. However, there are few reports on differentiating the populations of northern East Asia. This study therefore addressed the differentiation of the northern Chinese Han, Japanese, and Korean populations. The specific work is as follows: 1. A total of 428 SNP loci were typed in 307 samples: 103 from the Han population in northern China, 104 Japanese samples from the 1000 Genomes Project, and 100 Korean samples from the Asian Diversity Project. Screening by multiple linear regression and collinearity diagnostics selected a combination of 67 highly informative ancestry-informative SNP (AISNP) loci, and random forest mean decrease in accuracy analysis screened out a combination of 42 highly informative AISNP loci. 2. For the 67-AISNP combination, two ancestry inference models were constructed, one using Softmax logistic regression and one using a support vector machine; for the 42-AISNP combination, an ancestry inference model was constructed using the random forest algorithm. These three models were used to infer northern Chinese Han, Japanese, and Korean ancestry. 3. Model performance was evaluated on the 307 samples in two ways: a random training:testing split at a ratio of 8:2, and ten-fold cross-validation. The accuracies of the three models in the 8:2 random split test were 98.4%, 96.7%, and 96.7%, respectively, and their average accuracies over five rounds of ten-fold cross-validation were 95.19%, 95.77%, and 94.53%, respectively. 4. The three models were tested on a total of 31 samples from the HGDP and SGDP databases, with accuracies of 82.9%, 80.5%, and 82.9%, respectively, and on a total of 997 samples from Shandong and Shanxi typed in this study, with overall accuracies of 81.1%, 72.2%, and 76.1%, respectively. 5. The ancestry inference method created in Study 2 was applied to the 27 SNPs from this research group's earlier work: Softmax logistic regression, support vector machine, and random forest ancestry inference models were built for the 27 AISNPs. The accuracies of the three models in the 8:2 random split test were 98.85%, 97.3%, and 96.87%, respectively; their average accuracies over five rounds of ten-fold cross-validation were 98.16%, 98.26%, and 97.7%, respectively; and on 1287 test samples their overall accuracies were 95.96%, 96.97%, and 95.73%, respectively. The 67-plex and 42-plex AISNP prediction models established in this study can be used for ancestry inference among the three major populations of northern East Asia. Applying this method will significantly improve the efficiency of constructing multiplex amplification systems while also improving the effectiveness and reliability of such systems for forensic identification. The 42-AISNP combination has fewer loci and is therefore better suited to building a forensic detection system. It has good application prospects in domestic areas with concentrated Japanese and Korean populations and has high practical value. The ancestry inference method created in this study also achieved very good results when applied to the 27-plex SNP panel. |
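
The evaluation protocol described in item 3 (a random 8:2 split plus ten-fold cross-validation repeated five times) can be illustrated with a minimal standard-library sketch. Only the three sample counts come from the abstract. The function names, the random seeds, and the majority-label placeholder classifier are assumptions made for illustration; the placeholder stands in for the Softmax, support vector machine, and random forest models and does not reproduce the reported accuracies.

```python
import random
from collections import Counter

# Sample counts taken from the abstract; everything else is illustrative.
COUNTS = {"Northern Han": 103, "Japanese": 104, "Korean": 100}


def make_labels(counts):
    """Return one population label per sample (307 in total)."""
    return [pop for pop, n in counts.items() for _ in range(n)]


def split_80_20(n_samples, rng):
    """Randomly split sample indices into training and test sets at 8:2."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = round(n_samples * 0.2)
    return idx[n_test:], idx[:n_test]


def k_folds(n_samples, k, rng):
    """Shuffle indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_classifier(train_labels):
    """Placeholder model (assumption): always predicts the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _sample_index: most_common


def accuracy(model, labels, test_idx):
    """Fraction of held-out samples whose predicted label is correct."""
    hits = sum(model(i) == labels[i] for i in test_idx)
    return hits / len(test_idx)


def repeated_cv(labels, k=10, repeats=5, seed=0):
    """Mean accuracy over `repeats` independent runs of k-fold cross-validation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for held_out in k_folds(len(labels), k, rng):
            held = set(held_out)
            train = [labels[i] for i in range(len(labels)) if i not in held]
            model = majority_classifier(train)
            scores.append(accuracy(model, labels, held_out))
    return sum(scores) / len(scores)


labels = make_labels(COUNTS)
train_idx, test_idx = split_80_20(len(labels), random.Random(0))
mean_cv_accuracy = repeated_cv(labels)
```

In practice a library routine such as scikit-learn's `train_test_split` and `RepeatedStratifiedKFold` would normally replace the hand-written splitting, and a real classifier trained on the AISNP genotypes would replace `majority_classifier`.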