| Background: In recent years, Routine Data generated in medical practice, as an important part of Real World Data (RWD), have effectively contributed to the development of Real World Studies (RWS). Compared with traditional Randomized Controlled Trial (RCT) data, Routine Data are characterized by high dimensionality, large heterogeneity, and complex relationships among variables, and traditional statistical methods have limitations in handling such data. Machine learning (ML) algorithms are increasingly used to process such data because they can effectively identify complex relationships among variables. However, machine learning algorithms place high demands on sample size and subject labeling, and sufficient high-quality structured data are often difficult to obtain in medical studies. When the incidence of a disease is low or the study population is sparsely distributed, obtaining enough high-quality structured data is often labor-intensive and time-consuming. Data Scarcity has therefore become unavoidable in studies of disease diagnosis and prognosis, especially for rare diseases. According to 2019 statistics from the European Medicines Agency (EMA) on rare diseases worldwide, there are about 6,000-8,000 rare diseases affecting about 250 million patients, and it is often difficult to collect sufficient samples when studying them. To address these problems, conventional machine learning studies mostly rely on algorithm selection and parameter tuning, or relax the inclusion criteria for subjects to enlarge the sample size, in order to improve model performance on the validation set. However, in practice, improving model accuracy through algorithm selection and parameter tuning may reduce the model's ability to extrapolate, which in turn seriously degrades the prediction model's performance in subsequent application, making it difficult to meet the 
clinical requirements. In contrast, relaxing the inclusion criteria is likely to introduce more confounding factors and more complex relationships among variables, and the larger sample obtained this way does not necessarily meet the requirements of algorithm training or improve the prediction accuracy of the resulting models. Transfer Learning is a machine learning approach that has emerged in recent years to help address the low prediction accuracy and poor extrapolation that single machine learning algorithms show on small-sample data. The main idea is to use other data similar to the target population (i.e., the Target Domain) as the Source Domain, pre-train the model in the Source Domain to learn the General Feature among variables, and then transfer the pre-trained model to the Target Domain for further training. The final model is thus constructed in the Target Domain, which has a smaller sample size, reducing the sample size required compared with training directly on the Target Domain. At present, this approach has been used more widely on unstructured medical data such as images and text, but the similarity between the Source Domain and Target Domain of unstructured data mainly depends on structural similarity, and the theoretical basis of the transfer mainly comes from computer vision. For example, in image recognition, ordinary lung nodule data can be used as the Source Domain to improve the accuracy of lung cancer recognition in the Target Domain. In clinical structured data, however, the pathological states of patients with ordinary lung nodules and patients with lung cancer differ greatly, as do their clinical characteristics, so it may be difficult to obtain better prediction accuracy by using Transfer Learning to build 
models. Therefore, the performance of Transfer Learning on medical structured data and its clinical application value remain to be explored. In this study, we build a Transfer Learning algorithm based on Long Short-Term Memory (LSTM), introduce it into prognosis prediction based on medical structured data, explore its prediction performance on small-sample data, evaluate its application value, and provide a methodological reference for constructing prediction models for rare diseases or small samples. Objectives: To address low model prediction accuracy when the sample size is insufficient, we construct a Transfer Learning algorithm based on a Long Short-Term Memory artificial neural network and evaluate its performance through simulation studies under data scenarios with different Target Domain sample sizes, different Source Domain to Target Domain sample size ratios, and different types of outcome variables. Transfer Learning was also applied to the prediction of 30-day death and Length of ICU Stay (LOS-ICU) for patients with Moyamoya Disease (MMD) stroke to examine its practical application and provide a methodological reference for prognosis prediction studies of rare diseases based on medical structured data. Methods: Based on the above objectives, the study proceeds through model construction, data simulation, model evaluation, and a case application, as described below. 1. Simulation study. (1) Simulation study of Transfer Learning with continuous outcomes. A linear function-based data simulation scheme was used for continuous outcome data, in which the outcome variable is continuous and the predictors comprise 29 continuous variables following a normal distribution and 29 discrete variables following a Bernoulli distribution. Six different Target Domain 
sample sizes (N = 50, 100, 250, 500, 750, 1000) and 5 different Source Domain to Target Domain ratios (r = 5:1, 10:1, 20:1, 50:1, 100:1) were set, giving a total of 30 scenarios for systematically evaluating model prediction performance. With Transfer Learning as the study algorithm and Multiple Linear Regression, Random Forest Regression, Support Vector Machine Regression, and K-Nearest Neighbor Regression trained directly on the Target Domain training set as the control algorithms, the models were evaluated by their Mean Squared Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE) on the Target Domain validation set, in order to assess the prediction performance of the Transfer Learning algorithm for continuous outcomes in different scenarios. (2) Simulation study of Transfer Learning with dichotomous outcomes. A logit function-based data simulation scheme was used for dichotomous outcome data, in which the outcome variable is dichotomous and the predictors comprise 29 continuous variables following a normal distribution and 29 discrete variables following a Bernoulli distribution. Six different Target Domain sample sizes (N = 50, 100, 250, 500, 750, 1000), 5 different Source Domain to Target Domain ratios (r = 5:1, 10:1, 20:1, 50:1, 100:1), and 3 positive outcome percentages (P = 50%, 70%, 90%) were set, giving a total of 90 scenarios for systematically evaluating model prediction performance. With Transfer Learning as the study algorithm and Multivariate Logistic Regression, Random Forest, Support Vector Machine, and K-Nearest Neighbor algorithms trained directly on the Target Domain training set as the control algorithms, the models were evaluated on the Target Domain validation set by their Accuracy, Precision, Recall, F1 Score, and Area Under Curve (AUC) to assess the performance of the Transfer Learning algorithm in predicting binary outcomes in different scenarios. 2. Case study. A case study was 
conducted to construct prediction models for 30-day mortality and length of ICU stay (LOS-ICU) in patients with Moyamoya Disease (MMD). The model was pre-trained using patients with general ischemic stroke as the Source Domain, and the Transfer Learning model was then constructed using patients with MMD as the Target Domain. Multiple Logistic Regression, Multiple Linear Regression, Random Forest (regression), Support Vector Machine (regression), and K-Nearest Neighbor (regression) models were constructed on the MMD patient data according to the type of outcome. Differences in predictive performance between these algorithms and Transfer Learning were compared to comprehensively evaluate model performance in the case study. The study data were extracted from the Medical Information Mart for Intensive Care IV (MIMIC-IV), version 2.0, based on patient ICD codes; the predictors were selected from previous studies and extracted from the database. Results: 1. Simulation study results. (1) Simulation study of Transfer Learning with continuous outcomes: (1) When the Target Domain sample size is less than or equal to 500, Transfer Learning performs better than the other algorithms, and its prediction accuracy far exceeds theirs; (2) When the Target Domain sample size exceeds 500 and continues to increase, the prediction accuracy of the other models begins to improve; although Transfer Learning retains some accuracy advantage, the advantage shrinks gradually as the Target Domain sample size grows, and at a sample size of 1000 its accuracy is close to that of Random Forest Regression; (3) Across different Source Domain to Target Domain ratios, model accuracy does not increase consistently with the ratio or with the Source Domain sample size. When the ratio of Source Domain to Target Domain exceeds 50:1 or the Source Domain sample size is larger than 2000, the model prediction 
performance no longer improves and may even decrease; (4) When the Target Domain sample size is larger than 500, the model's requirement for Source Domain sample size falls accordingly, and increasing the Source Domain sample size has limited effect on prediction performance. It is therefore suggested that, in practice, the Source Domain sample size be chosen according to the Target Domain sample size before the optimal prediction model is constructed. (2) Simulation study of Transfer Learning with dichotomous outcomes: (1) When the Target Domain sample size is between 250 and 750, Transfer Learning performs better in most scenarios, with higher prediction accuracy than the other algorithms; when the Target Domain sample size is higher than 750, Transfer Learning outperforms the other algorithms only in some scenarios, and its prediction performance is similar to that of the Support Vector Machine algorithm in most scenarios; when the Target Domain sample size is below 250, Transfer Learning does not show outstanding prediction performance in most scenarios, and its results are similar to those of multiple logistic regression; (2) For different Source Domain to Target Domain ratios, when the Target Domain sample size is less than or equal to 500, prediction performance no longer improves, or even decreases, once the ratio exceeds 50:1; when the Target Domain sample size is greater than 500, prediction performance improves as the ratio increases; (3) With respect to the positive outcome ratio, Transfer Learning performs better at a 50% positive outcome ratio when 
the sample size in the Target Domain is less than or equal to 250; at a sample size of 500, Transfer Learning performs better at a 30% positive outcome ratio; at a sample size greater than or equal to 750, Transfer Learning performs better at a 10% positive outcome ratio; (4) Taken across all scenarios, Transfer Learning predicts better when the number of positive cases in the Target Domain is higher, and its performance improves as the Source Domain sample size increases. 2. Case study. In constructing prediction models for 30-day mortality and ICU length of stay for patients with MMD stroke, Transfer Learning showed advantages consistent with the simulation results. In the comparison of prediction models for ICU length of stay, the MSE, RMSE, and MAE of Transfer Learning were lower than those of the other algorithms, indicating better prediction accuracy for continuous outcomes. In the prediction model for 30-day death, all algorithms except Transfer Learning underestimated the risk of patient death, whereas Transfer Learning predicted the risk more accurately through the General Feature obtained by Source Domain pre-training and performed better on six metrics: Accuracy, Precision, Recall, F1 Score, AUC, and clinical benefit. Conclusions: As one of the important machine learning methods for addressing insufficient training sample size, Transfer Learning has gradually been developed and improved in recent years in both algorithm research and available code, and its applications have steadily increased. In this study, we compared the prediction performance of Transfer Learning with commonly used statistical models and machine learning algorithms under different covariate strengths, sample sizes, source-target ratios, and outcome types by simulating continuous and dichotomous outcomes, and found that Transfer Learning outperforms other 
algorithms for continuous outcomes when the Target Domain sample size is small and for dichotomous outcomes when positive outcomes are more numerous. These results provide a reference for method selection in different data scenarios. The study also validates the value of Transfer Learning in practical medical research by predicting the risk of death and the length of ICU stay for MMD stroke patients, and provides a reference for researchers selecting machine learning methods in disease prognosis prediction studies with limited sample sizes. |
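
The simulation design described in the Methods can be illustrated with a minimal sketch. The abstract gives only the predictor counts, the distributions, and the scenario grid; the coefficient range, the Bernoulli probability of 0.5, the unit noise scale, and the function names `make_coefficients` and `simulate` are all assumptions made here for illustration, and calibrating the intercept to reach a given positive outcome percentage is omitted.

```python
import math
import random

N_CONT, N_BIN = 29, 29  # predictor counts stated in the Methods


def make_coefficients(seed):
    """Draw hypothetical coefficients (the range is an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(N_CONT + N_BIN)]


def simulate(n, beta, outcome, seed):
    """Simulate n subjects with 29 normal and 29 Bernoulli predictors.

    outcome="continuous" uses a linear function plus Gaussian noise;
    outcome="binary" passes the linear predictor through a logit link.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(N_CONT)]
        # Bernoulli(0.5) is an assumed success probability.
        x += [1 if rng.random() < 0.5 else 0 for _ in range(N_BIN)]
        eta = sum(b * v for b, v in zip(beta, x))  # linear predictor
        if outcome == "continuous":
            y.append(eta + rng.gauss(0.0, 1.0))
        else:
            p = 1.0 / (1.0 + math.exp(-eta))
            y.append(1 if rng.random() < p else 0)
        X.append(x)
    return X, y


# Scenario grid as listed in the Methods.
TARGET_SIZES = [50, 100, 250, 500, 750, 1000]
RATIOS = [5, 10, 20, 50, 100]          # Source : Target
POSITIVE_RATES = [0.5, 0.7, 0.9]

continuous_scenarios = [(n, n * r) for n in TARGET_SIZES for r in RATIOS]
binary_scenarios = [(n, n * r, p) for n in TARGET_SIZES
                    for r in RATIOS for p in POSITIVE_RATES]

# Example: one continuous scenario, sharing coefficients across domains.
beta = make_coefficients(seed=0)
n_target, n_source = continuous_scenarios[0]
X_source, y_source = simulate(n_source, beta, "continuous", seed=1)
X_target, y_target = simulate(n_target, beta, "continuous", seed=2)
```

Sharing `beta` between the two domains is the simplest way to make the Source Domain resemble the Target Domain in this sketch; the study may have defined that similarity differently.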