| Sequential data are growing enormously and rapidly in the era of big data. Unlike non-sequential data, sequences contain long-term dependencies, and learning and mining these latent dependencies is critical for sequential data analysis. Sequential data analysis is now widely used in language, speech, video, finance, medicine, biology, the Internet of Things, and traffic, and it is a hot topic in big data intelligence research. Traditional approaches adapt poorly to the explosive growth of sequential data and cannot handle long-term dependencies effectively. Dependencies that span long ranges and are deeply hidden pose a further challenge. Recurrent neural networks (RNNs), including the standard RNN and long short-term memory (LSTM), can in theory learn long-term dependencies in sequences of arbitrary length and are remarkably effective models for sequential data. However, training RNNs on a large number of sequences involves a very large number of parameters. These parameters are updated and optimized step by step by a classical iterative method over a massive training set, which makes RNN training a combined problem of big data processing and high-performance computing. It is therefore important to design a distributed storage and computing system tailored to the characteristics of RNN training, in order to improve both training efficiency and the accuracy of sequential data analysis. After introducing the relevant studies and techniques, we identify the main challenges that affect model training efficiency and the accuracy of sequential data analysis. We then present a distributed storage and computing system for sequential data analysis and describe its architecture. Focusing on training efficiency and accuracy, we carry out the research from three aspects: the storage method on the individual node, the distributed data and metadata management method, 
and the training method based on distributed storage and computing for sequential data analysis. 1) We design a node storage method based on non-volatile memory (NVM), which comprises a fast file system and an asymmetric access algorithm for NVM. A prototype is implemented and evaluated with general-purpose testing tools. The results verify that the node storage method substantially increases the I/O performance of data access and reduces response time, which guarantees rapid access to the model parameters and the training set during sequential data analysis. 2) We design a distributed storage strategy for sequential data analysis. The metadata and the data in the distributed storage system are used to store and manage the model parameters and the training set, respectively. We then propose a hierarchical metadata management algorithm and an NVM-based data distribution management algorithm. A prototype is implemented and evaluated with general-purpose testing tools. The experimental results show that the hierarchical metadata management algorithm provides strong adaptability and reduces the space and time overhead of metadata search, and that the NVM-based data distribution management algorithm speeds up reads and writes and improves IOPS. The distributed storage strategy thus helps improve the efficiency of training the sequential data analysis model. 3) We propose distributed RNN training methods for sequential data analysis. An RNN is used to model the sequential data and, to speed up its training, the model parameters, the training samples, and the computations are distributed appropriately among the nodes of the distributed system. We present, respectively, an autonomous RNN based on distributed storage and computing, an efficient training algorithm based on dynamic neuron activation, and an adaptive LSTM with duration. A prototype is implemented and various evaluations are carried out. The experimental results validate that the proposed approaches can 
increase the training efficiency and accuracy of RNNs and improve model scalability for sequential data analysis. |
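
The idea of distributing training samples and computation among nodes can be illustrated with a minimal sketch. It is not the system or the algorithms proposed in this work: it assumes a toy scalar linear recurrence h_t = w·h_{t-1} + u·x_t, simulates the nodes with a plain loop, and uses synchronous gradient averaging as a stand-in for the distributed update. All names (`grad_one`, `shard_grad`, `train`, `make_data`) and hyperparameters are hypothetical.

```python
import random

def grad_one(w, u, xs, y):
    """Loss and gradients for one sequence of a scalar linear RNN, via BPTT."""
    hs = [0.0]                                # hs[t] is the state after step t
    for x in xs:
        hs.append(w * hs[-1] + u * x)
    err = hs[-1] - y                          # loss = 0.5 * err**2 on the last state
    dw = du = 0.0
    dh = err
    for t in range(len(xs), 0, -1):           # walk backward through time
        dw += dh * hs[t - 1]
        du += dh * xs[t - 1]
        dh *= w
    return 0.5 * err * err, dw, du

def shard_grad(w, u, shard):
    """Mean loss and mean gradients over one node's local shard."""
    totals = [0.0, 0.0, 0.0]
    for xs, y in shard:
        for i, v in enumerate(grad_one(w, u, xs, y)):
            totals[i] += v
    return tuple(v / len(shard) for v in totals)

def train(shards, w=0.1, u=0.1, lr=0.05, steps=200):
    """Synchronous data-parallel gradient descent over simulated nodes."""
    losses = []
    for _ in range(steps):
        # On a real cluster each shard would be processed on its own node.
        results = [shard_grad(w, u, s) for s in shards]
        k = len(results)
        losses.append(sum(r[0] for r in results) / k)
        w -= lr * sum(r[1] for r in results) / k   # averaged gradient step
        u -= lr * sum(r[2] for r in results) / k
    return w, u, losses

def make_data(n, length=4, seed=0):
    """Synthetic sequences whose targets come from w=0.5, u=1.0 (assumed values)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        xs = [rng.uniform(-1, 1) for _ in range(length)]
        h = 0.0
        for x in xs:
            h = 0.5 * h + 1.0 * x
        data.append((xs, h))
    return data

data = make_data(64)
shards = [data[i::4] for i in range(4)]       # four equal-sized shards
w, u, losses = train(shards)
```

Because the shards are equal in size, averaging the per-shard gradients gives the same update as a single node computing the mean gradient over the whole training set; the distribution changes where the work happens, not the result of each step.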