Genotyping is the process of determining an individual's DNA sequence through biological experiments. With the rapid development of sequencing technology, a growing number of researchers analyze genotypes computationally, exploiting the characteristics of the genome, which improves the efficiency of genotyping. However, because variants are diverse and some regions yield complex structural variation (SV) detection results, existing genotyping methods still have room for improvement, and obtaining accurate genotype information remains a challenge. Based on third-generation sequencing technology, this paper analyzes in depth the characteristics of mutation sites in gene sequences and their relationship with genotypes, and proposes SVGLR (SV genotyping for long reads), a dynamic programming-based genotyping method for third-generation sequencing data. The main research work of this paper is as follows:

(1) Generating matching vector sequences. First, sequence information is extracted using the BAM file parsing method var Sig, and among all sequencing reads the one with the largest variation signal within the variation range is selected as the "seed sequence". Second, based on the Needleman-Wunsch sequence alignment algorithm, each remaining read within the variation range of the genome is aligned against the seed sequence: the alignment matrix is filled according to the scoring rules, and the backtracking direction of each cell is recorded at the same time. Once the matrix is filled, the score in its lower-right corner is the optimal score. The matrix is then traced back to the upper-left corner along the recorded path, and the matching vector sequence is derived from the traceback result.

(2) Optimizing vector sequences and genotyping. Building on the method for generating matching vector sequences, this paper considers that when sequence information is collected within the variation range, the bases in the fixed-length extensions to its left and right may contain invalid variation signals. When the matching vector sequences are classified, such signals may produce too many categories or even classification errors, resulting in incorrect genotype determination. A matching vector optimization algorithm is therefore proposed, which determines the validity of each matching vector according to the overlap between each signal and the variation range. Finally, genotyping criteria are set and the optimized matching vector sequences are classified; the vector sequences in each category and their counts are recorded, the genotype of the variant is determined by calculating the ratio, and the resulting genotype information is written to a VCF (Variant Call Format) file.
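The two steps above can be illustrated with a minimal Python sketch. It is not the SVGLR implementation: the function names, the scoring values (match 1, mismatch -1, gap -1), the 0/1 encoding of the matching vector, and the ratio thresholds (0.2 and 0.8) are all assumptions chosen for illustration, since the abstract does not specify them. The sketch shows a standard Needleman-Wunsch fill with recorded traceback directions, a per-column matching vector derived from the traceback, and a ratio-based genotype call.

```python
def needleman_wunsch(seed, read, match=1, mismatch=-1, gap=-1):
    """Globally align `read` to `seed`; return (score, aligned_seed, aligned_read).

    Scoring values are illustrative assumptions, not those of SVGLR.
    """
    n, m = len(seed), len(read)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    trace = [[None] * (m + 1) for _ in range(n + 1)]

    # First column and first row can only be reached through gaps.
    for i in range(1, n + 1):
        score[i][0], trace[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        score[0][j], trace[0][j] = j * gap, "left"

    # Fill the matrix and record the direction each cell was reached from.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = seed[i - 1] == read[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            best = max(diag, up, left)
            score[i][j] = best
            trace[i][j] = "diag" if best == diag else ("up" if best == up else "left")

    # Trace back from the lower-right corner to the upper-left corner.
    aligned_seed, aligned_read = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = trace[i][j]
        if step == "diag":
            aligned_seed.append(seed[i - 1])
            aligned_read.append(read[j - 1])
            i, j = i - 1, j - 1
        elif step == "up":
            aligned_seed.append(seed[i - 1])
            aligned_read.append("-")
            i -= 1
        else:
            aligned_seed.append("-")
            aligned_read.append(read[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_seed)), "".join(reversed(aligned_read))


def match_vector(aligned_seed, aligned_read):
    """Hypothetical encoding: 1 where the aligned columns agree, 0 otherwise."""
    return [1 if a == b else 0 for a, b in zip(aligned_seed, aligned_read)]


def call_genotype(n_variant, n_total, low=0.2, high=0.8):
    """Call a genotype from the fraction of variant-supporting reads.

    The thresholds `low` and `high` are assumed values for illustration.
    """
    if n_total == 0:
        return "./."  # no coverage: genotype unknown
    ratio = n_variant / n_total
    if ratio < low:
        return "0/0"
    if ratio < high:
        return "0/1"
    return "1/1"
```

For example, `needleman_wunsch("GATTACA", "GATACA")` aligns a read carrying a one-base deletion against the seed, and `call_genotype(5, 10)` classifies a site where half of the reads support the variant as heterozygous under the assumed thresholds.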