| With the development of gene sequencing technology and the increased awareness of precision medicine, the number of genomic studies has exploded. However, owing to confounding factors such as sample source and quality, experimental technique and operation, library quality, and sequence characteristics, genomics data suffer from small sample sizes, high incompleteness, systematic errors from many sources that are difficult to eliminate, and diverse and imbalanced sample types. Therefore, appropriate processing methods are vital to guarantee the accuracy of data analysis. Current approaches fall short in addressing the following three specific problems: (1) When dealing with missing data, zero values in real-world data generally consist of both biologically meaningful true zeros and false zeros derived from missing signals. Existing imputation methods have difficulty simultaneously satisfying biological significance preservation, imputation accuracy, and operational speed. (2) When handling imbalanced data, multiple factors make the actual data highly sparse, including low sequencing depth, high resolution requirements, and particular application scenarios such as single-cell sequencing, and established methods cannot normalize such data. (3) When processing imbalanced samples, the high dimensionality and heterogeneous characteristics of some real data render a single fixed coping strategy unable to meet the diverse needs caused by differences in data characteristics. To overcome these issues, three case studies were conducted in single-cell transcriptomics, three-dimensional (3D) genomics, and metagenomics, respectively. For the problem of missing values in single-cell transcriptomics data, an imputation method, FRMC, based on gene expression correlation in cellular taxa was proposed. The FRMC method first identifies sets of similar cells based on the Jaccard similarity index of cells to prejudge the true zeros and false zeros in the expression matrix, then imputes the
false zeros using an optimization model based on low-rank matrix recovery. This process introduces Lagrange multipliers to transform the matrix optimization problem with equality constraints into an unconstrained optimization problem, and finally applies an iterative singular value thresholding algorithm to find the optimal solution. The results of experiments conducted on five publicly available single-cell datasets demonstrated that the FRMC method was capable of efficiently imputing datasets with different experimental protocols and cell sizes. In terms of imputation accuracy, FRMC was able to correctly distinguish true zeros from false zeros and impute accurately, and had lower error values than the other four imputation methods, with a normalized root mean square error (NRMSE) of 0.522, which indicates that the data matrices imputed by FRMC are closer to the expected true matrices. Further biological verification also showed that FRMC can effectively enhance intracellular and intergenic connections and help achieve accurate clustering of cells. In terms of running performance, FRMC ran at least 8.7 times faster than 2DImpute when processing the test datasets under the same conditions. For the problem of imbalanced contact matrix data generated by high-throughput chromosome conformation capture (Hi-C) technology in 3D genomic studies, a Hi-C data normalization method, HCMB, based on equal visibility of genomic regions was proposed. The HCMB method uses the symmetry of the original Hi-C matrix to transform the matrix normalization problem into a matrix balancing problem, and is built on the Levenberg-Marquardt iterative equation, which can maintain the density of the coefficient matrix of the iterative system of equations during convergence; the transformed nonlinear equations are then solved by combining a line search strategy with a projection method to obtain the scaling factor vector and overcome the
imbalance problem of highly sparse matrices. The results of experiments performed on four simulated and four real public Hi-C datasets suggested that matrices balanced by HCMB are completely consistent with those of the Knight-Ruiz method, and that HCMB could eliminate the biases affecting biologically relevant signals in A/B compartment annotation, topologically associating domain (TAD) calling, and P(s) curve estimation. In addition, HCMB could efficiently solve highly sparse contact matrices that cannot be solved by the Knight-Ruiz method, and had a higher convergence rate and more stable performance, with a constant 5 iterations for Hi-C datasets with different sparsity and distribution characteristics. For the problem of classification prediction for imbalanced samples in metagenomics studies, a classification strategy selection method, MCMLI, oriented towards the high dimensionality of metagenomics data was proposed. After preprocessing the metagenomic sequence data, MCMLI first uses the maximum correlation and minimum redundancy method for feature selection to construct feature subsets with different numbers of features, then introduces two types of resampling techniques to address the imbalanced sample problem and seven machine learning algorithms for classification prediction, next cross-combines the feature subsets, resampling techniques, and classification methods to form multiple processing paths, and finally evaluates the paths using five-fold cross-validation to select the optimal strategy for classifying imbalanced samples. A public metagenomics dataset containing multiple phenotypes with imbalanced sample proportions was collected for validation. The best strategy identified by MCMLI used the SMOTEENN integrated resampling technique to balance the number of samples in each group, together with the logistic regression algorithm, to create a multi-class prediction model. The model's predictions had an F1
score of 0.9142 and an average area under the receiver operating characteristic curve (AUC) of 0.9475, outperforming the other combinatorial strategies. This study proposed processing methods for missing and imbalanced data problems in genomics in three cases, aiming to improve the accuracy of genomics data analysis, give researchers more references for choosing appropriate data processing methods according to their own analysis needs, and provide a methodological basis and theoretical rationale for subsequent studies. |
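
To make the matrix balancing formulation concrete, the following is a minimal sketch of the "equal visibility" idea: find a scaling vector x so that every row of diag(x)·A·diag(x) sums to 1 for a symmetric contact matrix A. It uses a plain damped fixed-point iteration, which is an assumption made for illustration only; it is not the Levenberg-Marquardt scheme of HCMB nor the Knight-Ruiz algorithm, and it does not handle the empty rows that arise in highly sparse matrices. The function name `balance_symmetric` and the toy matrix `contacts` are hypothetical.

```python
import math


def balance_symmetric(matrix, tol=1e-10, max_iter=1000):
    """Return (x, iterations) such that x[i] * sum_j(matrix[i][j] * x[j]) ~= 1 for all i.

    Illustrative damped fixed-point iteration; assumes a symmetric,
    non-negative matrix in which every row has a positive sum.
    """
    n = len(matrix)
    if any(sum(row) <= 0 for row in matrix):
        raise ValueError("every row must have a positive sum")
    x = [1.0] * n
    for iteration in range(max_iter):
        # Row sums of the currently scaled matrix diag(x) * A * diag(x).
        row_sums = [
            x[i] * sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)
        ]
        if max(abs(s - 1.0) for s in row_sums) < tol:
            return x, iteration
        # Damped update, equivalent to x_i <- sqrt(x_i / (A x)_i).
        x = [x[i] / math.sqrt(row_sums[i]) for i in range(n)]
    raise RuntimeError("balancing did not converge")


# Toy symmetric "contact matrix" (hypothetical values).
contacts = [
    [10.0, 4.0, 1.0],
    [4.0, 8.0, 3.0],
    [1.0, 3.0, 6.0],
]

scale, n_iter = balance_symmetric(contacts)
balanced = [
    [scale[i] * contacts[i][j] * scale[j] for j in range(3)] for i in range(3)
]
print("iterations:", n_iter)
print("row sums:", [round(sum(row), 6) for row in balanced])
```

Because the same vector scales rows and columns, the balanced matrix stays symmetric, which is the property the abstract relies on when recasting normalization as a balancing problem.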