With the rapid growth of global data volume, existing storage media such as magnetic disks, optical disks, and tapes have difficulty meeting the increasing demand for data storage. Synthetic DNA offers high storage density and long-term durability as a storage medium, making it a promising next-generation storage technology. However, base errors may be introduced during DNA synthesis, amplification, storage, and sequencing, causing DNA sequences to be misidentified and data recovery to fail. Base errors in DNA data storage include insertions, deletions, and substitutions, which differ considerably from the error types found in traditional storage media. To deal effectively with insertion, deletion, and substitution errors in DNA data storage, this thesis proposes a method that merges multiple sequences to correct errors in the data part, and a highly robust molecular addressing strategy for the molecular tag part that allows a large number of molecules to be distinguished reliably. The specific work is as follows.

First, to address the low utilization of sequencing reads in DNA data storage and the presence of insertion and deletion errors, a highly robust data recovery method that can correct base insertions and deletions is proposed. The method first selects reads for data recovery by computing the edit distance over the overlapping region of paired-end reads, producing sample files. The sample files are then grouped according to the indexing information embedded in the DNA fragments, and a representative central reference sequence is selected for each group. Next, insertions and deletions in the remaining sequences of each group are identified by aligning them against the central sequence, which serves as the alignment standard. Finally, the reference and all corrected sequences are combined by majority voting to eliminate substitution errors, yielding the optimal sequence. Simulation results show that the proposed method can reliably recover DNA storage data under low coverage.

Second, to study the channel characteristics of oligonucleotide pool data storage and to solve the difficulty of identifying molecular numbers in the presence of insertion and deletion errors, the error characteristics of the primer pool are analyzed and a highly robust method for designing and identifying molecular numbers is proposed. A design method for primers and logical block numbers based on segment multiplexing is proposed first, followed by a recognition method based on a sliding window and dynamic programming that can effectively identify insertion and deletion errors in primers and locate primer boundaries. The molecular sequences are then sorted according to the decoded logical block numbers. Furthermore, for the sequence labels within a logical block, a molecular label design and identification method based on spread spectrum coding and sparse data superposition multiplexing is proposed. Simulation results show that the designed molecular numbering method extends the addressing range of oligo molecules, and that the proposed recognition method can effectively identify insertion, deletion, and substitution errors in the primer sequence and achieve highly robust file recovery under low coverage.
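
As an illustration of the last two steps of the data recovery method (aligning the remaining reads of a group against the central reference, then majority voting), the following is a minimal Python sketch. It is not the thesis implementation. It assumes the reads have already been grouped by index and that a central reference has already been chosen. The pairwise edit-distance alignment is a simple stand-in for whatever alignment procedure the thesis actually uses, and the function names `align_to_reference` and `consensus` are hypothetical.

```python
from collections import Counter


def align_to_reference(ref, read):
    """Project `read` onto the coordinates of `ref` with an edit-distance DP.

    Returns a list of len(ref) symbols: the read base aligned to each
    reference position, or '-' where the read has a deletion.
    Bases inserted in the read (relative to ref) are dropped.
    """
    n, m = len(ref), len(read)
    # dp[i][j] = edit distance between ref[:i] and read[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == read[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion in read
                           dp[i][j - 1] + 1)         # insertion in read

    # Trace back from the bottom-right corner to recover one optimal alignment.
    projected = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if ref[i - 1] == read[j - 1] else 1):
            projected.append(read[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            projected.append('-')  # reference base missing from the read
            i -= 1
        else:
            j -= 1                 # extra base in the read: skip it
    projected.reverse()
    return projected


def consensus(ref, reads):
    """Majority vote per reference position over ref and all aligned reads."""
    columns = [[base] for base in ref]  # the reference casts one vote too
    for read in reads:
        for pos, base in enumerate(align_to_reference(ref, read)):
            if base != '-':             # deletions cast no vote
                columns[pos].append(base)
    return ''.join(Counter(col).most_common(1)[0][0] for col in columns)


if __name__ == "__main__":
    # Toy group: the chosen reference itself carries a substitution (last base).
    reference = "ACGTACGA"
    group = ["ACGTACGT",   # error-free read
             "ACGACGT",    # one deletion
             "ACGTTACGT",  # one insertion
             "ACGTACGT"]   # error-free read
    print(consensus(reference, group))
```

In this sketch, insertions and deletions are removed by projecting every read onto the reference coordinates, so the vote only has to resolve substitutions, including one in the reference itself. One limitation is that a base missing from the reference cannot be restored.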