| Drugs are special commodities that bear on everyone's health. Drug market supervision, which affects the national economy and people's livelihoods, is a problem that every country must confront. How to screen drugs widely, quickly, and effectively, and how to standardize the drug market, are problems to which governments worldwide attach great importance and which urgently need solving. Because near-infrared spectroscopy, a technology that is still developing and improving rapidly, offers incomparable advantages for rapid drug detection, China has for many years relied on near-infrared spectroscopy detection technology to supervise the drug market on a wide scale. However, given the new situation and the changing nature of drug crime, classification (or identification), the most important technique in near-infrared drug detection, has encountered several problems whose solution requires combining it with new technologies such as deep learning. First, the breadth and complexity of census and screening work, together with the current low rate of crime detection, require that near-infrared spectroscopy be applicable to scenes with many categories and many spectra and that its performance be further improved. Second, the sharp rise in the proportion of substandard drugs produced and sold, along with offenses that disrupt the order of drug market management such as illegal counterfeiting and import, has gradually become the mainstream of drug crime. Therefore, near-infrared spectroscopy must provide rapid and accurate multi-classification identification in scenes with small inter-class differences and large intra-class differences to support countermeasures. Third, complex discriminant classification with many categories and many spectra is invariably accompanied by the problems of "sample imbalance between classes" and "classification error cost sensitivity." Finally, and importantly, with the rapid
development of the pharmaceutical industry and the continuous development of new products, new forms of drug crime are also emerging. Therefore, to exploit the general prior knowledge contained in existing spectral big data and so support the identification of crimes involving samples "outside the categories of the modeling samples," new methods such as contrastive learning in deep learning should be introduced. Reducing the cost of introducing new samples and extending the range of application are also considered. In response to the new situation of illegal drug detection, and starting from the practical needs of drug quality supervision and of near-infrared rapid detection technology, this paper proposes four deep-learning-based multi-classification modeling methods for drug near-infrared spectral data in scenes with "many categories and spectra." The methods try to address small inter-class differences, large intra-class differences, data imbalance, and the difficulty of identifying unknown samples beyond the categories of the modeling samples. 1) To address small inter-class differences, and exploiting the fact that a variational autoencoder (VAE) is both a feature extractor and a data generator, a multi-classification modeling method for drug near-infrared spectra is proposed that handles feature extraction and classification together: it trains the VAE and the classification network simultaneously and uses samples generated from VAE features for classification. This method departs from the reliance on original samples in traditional classification algorithms and instead classifies only samples generated from VAE features. Accordingly, its cost function covers both auto-encoding feature extraction and multi-class classification, and in its network structure the classifier is connected directly to the generator. It relies mainly on the differences between the transformed samples to improve classification accuracy in scenes with small inter-class
differences. Compared with eight commonly used classification algorithms, the experimental results show that the proposed model achieves better results in most cases (when the training set accounts for more than 50% of the whole data set) in scenes with many categories and spectra and small inter-class differences. 2) To address data imbalance, and exploiting the ability of the generative adversarial network (GAN) to generate samples with appropriate authenticity and diversity, a drug multi-classification modeling method is proposed that modifies Bi-GAN to generate high-quality sample spectra of specified categories in specified quantities. First, a pre-trained BP-ANN classifier provides initial classification supervision. Then, the authenticity and diversity requirements for the data are jointly controlled by restricting the local random sampling of the original Bi-GAN algorithm. Finally, through alternating training of the generator, discriminator, and classifier, generated samples with high authenticity, appropriate diversity, and classification supervision replace the original imbalanced samples as the basis for classification, which achieves a good modeling effect at a relatively stable time cost. 3) To handle complex scenes in which several problems occur together, such as small inter-class differences, large intra-class differences, data imbalance, and the difficulty of recognizing samples beyond the categories of the modeling samples, a near-infrared spectral multi-class drug classification method is proposed that uses the strengths of the Siamese network in contrastive feature extraction and is based on the prior knowledge of category differences in a background spectrum database. This method takes full advantage of the Siamese network's ability to extract features that are "only related to the similarities and differences of the categories, but unrelated to specific category labels."
Using a balanced and rational sampling strategy, a well-structured 1D-CNN feature extraction sub-network is constructed. First, the Siamese network is adapted into a discrimination network to implement an algorithm for identifying genuine and counterfeit drugs, and then the identification algorithm is transformed into a multi-classification algorithm. It better achieves high-accuracy identification and classification in various complex scenes. The algorithm also attempts "modeling on one data set and testing with another data set," where the manufacturer names and drug names in the test data are unknown at modeling time. This experiment verified that the general prior knowledge about category similarities and differences contained in the background data still applies, with high probability, to data outside the categories used in modeling. Thus, the general prior knowledge provides a useful reference for further optimization. The method uses the near-infrared spectra of 32015 samples from 472 categories of drugs to form a data set, which is input incrementally for five modeling experiments. These experiments successively verified the role of the 1D-CNN in the model, the achievement of the recognition and classification objectives, the possibility of using prior knowledge of category similarities and differences to identify samples of unknown categories, and the accuracy of the model in simulated difficult real-world scenes compared with other algorithms. The experimental results show that, compared with six other common methods, this method achieves better recognition results in complex situations such as many categories, small differences between classes, large differences within classes, imbalanced data, and cost-sensitive classification errors. In most cases, its classification and identification accuracy exceeds 96%, and it shows good universality and generalization. 4) Based on the Siamese network and deep clustering, a method of
near-infrared spectral modeling of drugs by clustering is proposed. This method combines VAE, the Siamese network, DBSCAN, the Hungarian algorithm, and other algorithms. It uses the VAE to generate different samples around the features of the same template and uses the Siamese network to pull similar samples closer and push dissimilar samples apart in the feature space. The extracted features are clustered with DBSCAN to realize supervised clustering in the hidden feature space, where unknown samples can cluster under the supervised classification labels. The model can actively recommend several samples near the cluster center as typical samples. Once the recommended samples are correctly labeled, the drug samples to be tested can be effectively classified. The data set, composed of the near-infrared spectra of 32015 drug samples in 472 classes, is input incrementally to conduct three clustering and classification experiments, which analyze and verify the basic effectiveness of the model and show its clustering and classification performance in complex scenes. Compared with eight other common multi-classification methods, the experimental results show that the model has better classification performance, with classification accuracy above 96% in most cases. In summary, the methods proposed in this paper solve some of the multi-classification identification problems encountered in current drug supervision work and provide new technical ideas for developing near-infrared rapid detection technology. These attempts and explorations also provide clues and references for handling similar problems in other NIRS, infrared, Raman, and other molecular spectral analyses, and lay a foundation for future research. |
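
The abstract does not give the loss function behind the Siamese network, so the following is only a minimal sketch of one common choice, the margin-based contrastive loss, which pulls same-class embeddings together and pushes different-class embeddings at least a margin apart. It also illustrates, under the same assumptions, one way a pairwise identification model might be turned into a multi-class classifier: compare a query embedding with one reference embedding per class and pick the nearest. The function names, the `margin` value, and the one-reference-per-class scheme are hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Assumed form (not stated in the abstract):
      same class      -> d^2                (penalize any distance)
      different class -> max(0, margin-d)^2 (penalize only if too close)
    """
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def nearest_class(query, references):
    """Hypothetical multi-class step: return the label whose reference
    embedding is closest to the query embedding.

    `references` maps a class label to one reference embedding.
    """
    return min(references, key=lambda label: euclidean(query, references[label]))
```

Because the loss depends only on whether two spectra share a class, not on which class that is, a network trained this way could in principle be applied to categories absent from the training data by supplying new reference embeddings, which is consistent with the "model on one data set, test on another" experiment described above.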