Screening And Validation Of Blood-based Age-related CpG Sites For Individual Age Estimation In Chinese Han Population

Posted on: 2020-05-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Xiao
GTID: 1364330590459104
Subject: Forensic medicine
Abstract/Summary:
Background: Age is an individual characteristic with a biological basis. When reference samples and DNA databases cannot provide matching information, accurately predicting an individual's age from biological samples can narrow the range of unknown suspects and provide additional information for better predicting externally visible characteristics. In theory, individual age estimation has forensic value in the following respects: i. guiding police investigations when there is no eyewitness testimony or DNA database record; ii. assisting the identification of unidentified bodies; iii. providing age information for legal affairs and fraud cases; iv. improving the prediction of age-related phenotypes. Currently, anthropologists and forensic technicians can infer an individual's age by measuring and analysing various age-related morphological changes in bones or teeth, but these morphological methods are applicable only to living individuals or intact remains. The biological evidence left by an offender, however, is most likely to be body fluids (such as blood, semen, and saliva) or hair. Therefore, age estimation methods that apply to these samples must be found. In the past two decades, researchers have reported a variety of age-related biomarkers, such as aspartic acid racemization, mitochondrial DNA deletion, signal-joint T-cell receptor excision circles, telomere length, advanced glycation end products, and messenger RNA. However, these markers still have shortcomings such as low prediction accuracy, poor accuracy and reproducibility of detection methods, and vulnerability to environmental factors. There is increasing evidence that the level of DNA methylation (DNAm) at specific sites in the human genome is significantly correlated with age, a phenomenon known as the "epigenetic clock". Recent studies have shown that the age-predictive ability of age-related CpG sites (AR-CpGs) is significantly better than that of mRNA, signal-joint T-cell
receptor excision circles, and telomere length. At present, several studies have developed age prediction models for blood, saliva, buccal swabs, semen, or a broader range of tissues based on different methylation analysis platforms, further confirming that DNAm is the most promising marker for forensic individual age estimation. Studies have confirmed that the methylation of CpG markers differs between populations, but most current reports are based on populations of European origin or Caucasians. The few studies on the Han population are not systematic. In view of this, this study aims to identify blood-specific AR-CpGs in the Chinese Han population using Infinium MethylationEPIC arrays and to develop an age estimation method based on pyrosequencing technology. The study comprises the following three parts:

Part I: Screening of AR-CpGs

Objective: To screen for AR-CpGs in the Chinese Han population, including sites outside the 450K chip coverage, using Illumina Infinium MethylationEPIC (850K) methylation arrays.

Methods: Forty-two unrelated healthy Chinese Han volunteers were recruited on the basis of personal medical history or routine physical examination. The youth group (18–25 years old), the middle-aged group (35–45 years old), and the elderly group (55–65 years old) each comprised 14 volunteers, half male and half female. The methylation status of approximately 853,307 CpG sites in the genomic DNA of these volunteers was analysed using the 850K chips. Probes or samples were filtered out according to the following criteria: i. probes with signal intensity lower than the average background signal (detection p-value > 0.01); ii. probes with fewer than 3 beads in ≥5% of samples; iii. samples with an effective probe ratio of less than 98%; iv. SNP sites with a control probe. After the methylation β value of each effective CpG locus in the samples was calculated, the β values were normalized using the BMIQ (beta-mixture quantile dilation) method, and based on the
normalized β values, the p-values and adjusted p-values for comparisons between groups were calculated. Then, using a p-value of less than 0.01 or an adjusted p-value of less than 0.05 as the criterion, statistically significant differentially methylated positions (DMPs) were screened from the male samples, the female samples, and the total samples, respectively. The intersection of the comparisons between age groups was taken as the set of age-related CpG markers. Finally, male or female candidate CpG sites for subsequent validation were selected on the condition that the absolute difference in methylation β values between the elderly and young groups was greater than 0.15. In addition, the reliability of the data was confirmed by comparison with the literature and by using the 850K data to evaluate existing models.

Results: All 42 samples passed the quality control standard. For the total samples, 20,378, 56,584, and 5,281 statistically significant DMPs were screened between the middle-aged and young, elderly and young, and elderly and middle-aged groups, respectively, with p-value < 0.01 or adjusted p-value < 0.05 as the criterion. For the male samples, 1,030, 14,453, and 5,686 statistically significant DMPs were screened between the middle-aged and young, elderly and young, and elderly and middle-aged groups, respectively. For the female samples, 9,956, 1,291, and 3,626 statistically significant DMPs were screened between the middle-aged and young, elderly and young, and elderly and middle-aged groups, respectively. After intersection, 785, 68, and 151 AR-CpGs that were statistically significant in all three between-group comparisons were identified in the total, male, and female samples, respectively. The β values of all these CpG sites gradually increased or decreased with age. Further analysis revealed that AR-CpGs were present on all chromosomes except the Y chromosome. Notably, most age-associated CpG sites are not part of the 450K chip, with the proportion reaching up to 60% when male samples were analysed alone. About two-thirds of the age-related
CpG sites showed decreasing methylation levels with age. In particular, the methylation level of only one of the 68 male age-related CpG sites increased with age, while the remaining 67 showed a downward trend. The total, male, and female samples shared 5 CpG sites, namely cg16867657 (ELOVL2), cg10501210 (C1orf132), cg12899747, cg07504615, and cg21599943. More importantly, these five sites were the only intersection between the male and female samples. Considering possible gender differences, 25 male and 24 female age-related CpG sites were selected based on the absolute difference in methylation β values between the elderly and young groups. Apart from the three common CpG sites (cg16867657, cg10501210, cg12899747), these candidate sites comprised 18 850K chip-specific CpG sites and 28 450K sites. Correlation analysis showed that the absolute values of the Spearman correlation coefficients of these sites ranged from 0.75 to 0.95. A literature search showed that 12 of the 28 450K loci had been reported as age-associated CpG loci, 9 had not been reported, and the other 7 were associated with other diseases, smoking, or longevity. After comprehensive analysis, the smoking-related candidate site cg04885881 was excluded. More importantly, only 3 of these loci overlap with CpG loci examined in existing studies of the Han population. The validity of the 850K DNAm data was assessed using the age inference model reported by Park et al.: Age = 39.73167 + ZNF423 (cg04208403) × (−0.28914) + ELOVL2 (cg16867657) × 1.19242 + CCDC102B (cg19283806) × (−0.69994).

Conclusion: Additional AR-CpGs exist outside the 450K coverage. Twenty-five male candidate CpG sites and 23 female candidate CpG sites were selected for further validation. In addition, in chip analyses with a small sample size, between-group comparison can be used to screen for AR-CpGs.

Part II: Development of pyrosequencing assays and validation of candidate CpG
sites

Objective: To develop pyrosequencing assays for methylation analysis of the candidate CpG sites and to further identify the CpG sites to be used for establishing the age estimation model.

Methods: PCR primers and pyrosequencing primers were designed using PyroMark Assay Design software version 2.0 based on the flanking sequences of the candidate CpG sites and primer design principles. Genomic DNA extraction, bisulfite conversion of 1000 ng of genomic DNA, and PCR amplification were performed using the QIAamp DNA Blood Mini Kit, the EpiTect Fast DNA Bisulfite Kit, and the PyroMark PCR Kit, respectively, according to the manufacturers' instructions. Pyrosequencing assays for analysing the methylation levels of 137 CpG sites in 41 fragments were established by optimizing the annealing temperature of the PCR amplification and the sequencing primers for pyrosequencing. A total of 60 unrelated healthy Chinese individuals were selected. The youth group (18–25 years old), the middle-aged group (35–45 years old), and the elderly group (55–65 years old) each comprised 14 volunteers, half male and half female. Chronological age was calculated as the number of days between the date of birth recorded on the ID card, birth certificate, or household registration book and the date of sample collection, divided by 365 and rounded to two decimal places. Peripheral blood genomic DNA from 30 males and 30 females was analysed using the 41 established pyrosequencing assays. Methylation data were extracted using PyroMark Q24 Advanced 3.01 software, and statistical software was used to calculate the Spearman correlation coefficient between the methylation β value of each CpG locus and the individual's age. Finally, candidate CpG loci were further screened on the condition that the absolute value of the correlation coefficient was greater than 0.75.

Results: Because the female candidate site cg04875128 is located in a CpG-dense region, it was difficult to design suitable primers, and the
male candidate site cg13108341 failed multiple rounds of optimization. Therefore, pyrosequencing assays capable of detecting a total of 137 CpG sites in 41 genomic regions were established. Correlation analysis showed that, with an absolute correlation coefficient greater than 0.65 as the criterion, 14 of the 20 female candidate regions (including 22 candidate sites) contained at least one CpG site that met the requirement. Correspondingly, 16 of the 24 male candidate regions (including 24 candidate sites) met the criterion. When the correlation coefficient threshold was raised to 0.70, 0.75, and 0.80, 12, 8, and 5 female candidate regions and 11, 9, and 5 male candidate regions met the requirement, respectively. In addition, the correlation coefficients calculated from the 450K data were significantly larger than those calculated from the pyrosequencing data. Finally, with an absolute correlation coefficient greater than 0.75 as the screening criterion, 8 female candidate regions (F1 cg16867657, F2 cg22454769, F3 cg06279276, F4 cg07547549, F5 cg10501210, F9 cg27030854, F11 cg11584042, and F14 cg26947034) and 8 male candidate regions (M1 cg16867657, M2 cg02844688, M3 cg18738190, M4 cg03372207, M10 cg10501210, M12 cg13552692, M18 cg17675043, and M24 cg17740900) were selected for the subsequent large-sample validation and model construction.

Conclusion: The 41 established pyrosequencing assays can be used for methylation analysis of the corresponding regions. In addition, CpG loci located in 8 male and 8 female candidate regions correlate significantly with age and can serve as candidate loci for developing the subsequent age estimation models.

Part III: Development of multiple linear regression models

Objective: To select a set of CpG loci significantly associated with age from the candidate regions and to construct age estimation models.

Methods: The optimized pyrosequencing assays were used to quantitatively analyse the methylation of 51 CpG loci in 8 candidate
regions of 141 female individuals (3–80 years) and 41 CpG loci in 9 candidate regions of 167 male individuals (1–85 years). The Spearman correlation coefficients between each CpG locus and actual age were calculated from the DNAm data. Subsequently, the CpG locus with the largest correlation coefficient was selected from each candidate region to construct male, female, and gender-independent multiple linear regression models. First, based on the DNAm data of all samples, a multiple linear regression model containing all of the selected CpG sites was constructed, or stepwise regression was used, to test the relative importance of each marker. Then, the entire sample was randomly divided at a ratio of 7:3, with 70% of the samples as the training set and 30% as the test set. With the adjusted R², Mallows' Cp, and Bayesian information criterion (BIC) values as reference indices, the model was constructed using best subset selection, and the mean absolute deviation (MAD), mean square error (MSE), root mean square error (RMSE), and mean absolute percentage error (MAPE) were calculated to measure the predictive performance of the model. A prediction was counted as correct when the difference between the predicted age and the actual age was within ±5 years, and the prediction accuracy of the regression model was calculated separately for the training set and the test set. Finally, k-fold cross-validation was used to evaluate the model.

Results: The absolute values of the correlation coefficients between the methylation levels of CpG loci in the male regions M1, M2, M3, M4, M8, M10, M12, and M24 and the female candidate region F2 and actual age ranged from 0.8929 to 0.9570, 0.7963 to 0.8017, 0.5789 to 0.8256, 0.8379, 0.7748 to 0.8183, 0.8016 to 0.9228, 0.8928 to 0.9234, 0.9143, and 0.0639 to 0.9242, respectively. The absolute values of the correlation coefficients between the methylation levels of CpG loci in the female candidate regions F1, F2, F3, F4, F5, F9, F11, and F14 and actual age were
0.8742–0.9483, 0.0409–0.8982, 0.6360–0.7903, 0.4476–0.7682, respectively. The absolute values of the correlation coefficients between the methylation levels of CpG loci in the male and female common candidate regions ELOVL2 (M1 or F1), C1orf132 (M10 or F5), and FHL2 (F2) in the total sample and actual age ranged from 0.8960 to 0.9478, 0.7768 to 0.9151, and 0.0178 to 0.9136, respectively. In all three analyses, the CpG locus with the largest correlation coefficient was located in the ELOVL2 gene. Preliminary tests showed that the CpG locus with the largest correlation coefficient in some candidate regions played a relatively small role in the regression model. With the adjusted R², Mallows' Cp, and BIC values as reference indices, six (cg17740900, cg19283806, M2_1, M10_8, F2_3, and M1_6), four (F2_2, F4_6, F5_5, and F1_7), and three (ELOVL2_7, FHL2_2, and C1orf132_8) CpG loci were selected to construct the male model, the female model, and the combined model, respectively.

In the male model (adjusted R² = 0.9529), the MAD, MSE, RMSE, and MAPE of the training set were 2.6568, 12.0906, 3.4772, and 11.9565%, respectively, and those of the test set were 3.0826, 16.6841, 4.0846, and 17.3213%, respectively. The prediction accuracy within ±5 years was 87.07% in the training set and 86.27% in the test set; the Spearman correlation coefficients between the predicted age and the actual age were 0.98088 and 0.97622, respectively. The adjusted R², MAD, MSE, RMSE, and MAPE for 10 repetitions of 10-fold cross-validation were 0.9544 ± 5.8613E-5, 2.9026 ± 0.5555, 14.6893 ± 6.2751, 3.7495 ± 0.8011, and 0.1352 ± 0.0550, respectively.

In the female model (adjusted R² = 0.9373), the MAD, MSE, RMSE, and MAPE of the training set were 2.9627, 13.3577, 3.6548, and 2.1281%, respectively, and those of the test set were 3.0521, 17.2682, 4.1555, and 11.4948%, respectively. The prediction accuracy within ±5 years was 85.71% in the training set and 76.74% in the test set; the Spearman correlation
coefficients between the predicted age and the actual age were 0.96503 and 0.95681, respectively. The adjusted R², MAD, MSE, RMSE, and MAPE for 10 repetitions of 10-fold cross-validation were 0.9312 ± 6.6451E-5, 3.1103 ± 0.7211, 15.8586 ± 7.1785, 3.8925 ± 0.8451, and 0.1249 ± 0.0475, respectively.

In the combined model (adjusted R² = 0.9317), the MAD, MSE, RMSE, and MAPE of the training set were 3.1875, 16.2752, 4.0342, and 13.0524%, respectively, and those of the test set were 3.2506, 17.99997, 4.2426, and 13.7312%, respectively. The prediction accuracy within ±5 years was 77.67% in the training set and 78.49% in the test set; the Spearman correlation coefficients between the predicted age and the actual age were 0.96405 and 0.97026, respectively. The adjusted R², MAD, MSE, RMSE, and MAPE for 10 repetitions of 10-fold cross-validation were 0.9352 ± 2.3084E-5, 3.2483 ± 0.3998, 17.2531 ± 4.1733, 4.1233 ± 0.5043, and 0.1423 ± 0.0440, respectively.

When the three models were applied to the training and test sets, prediction accuracy decreased for elderly individuals (>50 years old). In addition, when the combined model was used for age prediction, the difference between actual and predicted age differed significantly between males and females (Mann-Whitney test: P = 0.00482; Kruskal-Wallis analysis of variance: P = 0.00481; Kolmogorov-Smirnov test: P = 0.01382), but including gender in the combined model did not significantly improve prediction accuracy. Notably, the male model contains a new CpG site, cg17740900.

Conclusion: Multiple CpG sites that can be used to develop age estimation models have been identified. Three age estimation models with MAD values of about 3.0 years were established, laying the foundation for follow-up research and practice.
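
The calculations named in the abstract can be illustrated with a minimal Python sketch. It is not the author's code. It shows the chronological-age rule from Part II (days divided by 365, two decimals), the Park et al. equation with the coefficients quoted in Part I, and the MAD, MSE, RMSE, MAPE, and ±5-year accuracy measures from Part III. The function names are hypothetical, and treating the methylation inputs as percentages (0–100) is an assumption, since the abstract does not state the scale.

```python
import math
from datetime import date


def chronological_age(birth, sampled):
    """Days between birth and sample collection, divided by 365, two decimals."""
    return round((sampled - birth).days / 365, 2)


def park_model_age(znf423, elovl2, ccdc102b):
    """Park et al. equation with the coefficients quoted in Part I.

    Assumption: each argument is a methylation level in percent (0-100)
    for cg04208403, cg16867657, and cg19283806, respectively.
    """
    return (39.73167
            - 0.28914 * znf423
            + 1.19242 * elovl2
            - 0.69994 * ccdc102b)


def prediction_metrics(actual, predicted, tolerance=5.0):
    """Return MAD, MSE, RMSE, MAPE (%) and accuracy (%) within +/- tolerance years.

    MAPE divides by the actual age, so every actual age must be non-zero.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / n
    accuracy = 100 * sum(abs(e) <= tolerance for e in errors) / n
    return {"MAD": mad, "MSE": mse, "RMSE": math.sqrt(mse),
            "MAPE": mape, "accuracy": accuracy}


if __name__ == "__main__":
    # Made-up illustrative values, not data from the study.
    print(chronological_age(date(1990, 3, 1), date(2019, 6, 15)))
    print(park_model_age(10.0, 50.0, 20.0))
    print(prediction_metrics([20, 40, 60], [22, 37, 66]))
```

The metrics are computed from plain lists so the sketch runs without third-party packages. In practice the same quantities would be computed separately for the training set, the test set, and each cross-validation fold, as the abstract describes.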
Keywords/Search Tags:Forensic genetics, Age estimation, Biomarkers, DNA methylation, Pyrosequencing, Multiple linear regression