With the development of microelectronics and computer systems, wearable smart devices have gradually become popular around the world. These devices are usually equipped with various sensors, such as accelerometers and gyroscopes, which provide a wealth of information for Human Activity Recognition (HAR). Using the signals collected from these sensors, a HAR system can identify a wide range of human activities, from daily activities such as running and walking to more complex activities such as eating or assembly-line work. HAR systems have been used in many mobile sensing applications, such as medication intake monitoring, rehabilitation monitoring, and fitness tracking. Early activity recognition algorithms relied mainly on handcrafted feature extraction and supervised statistical machine learning algorithms such as support vector machines. Handcrafted feature extraction typically uses statistical features such as the mean and variance; these methods are limited by domain knowledge and can only capture low-level features. In recent years, deep learning methods have been successfully introduced into human activity recognition systems. Deep learning can extract features automatically and can be effectively applied to construct HAR systems. However, there are currently two problems with deep-learning-based HAR systems. The first is multimodal data fusion: the input of a HAR system is usually a multimodal signal collected from various parts of the human body, but only the modalities from body parts related to the activity provide valuable information for recognition. Irrelevant information usually interferes with recognition and reduces the recognizer's performance. Previous research fuses multimodal features by naive concatenation, which may limit the model's ability to select important modalities. The second is that deep learning requires massive amounts of labeled data to extract generalizable features from raw input; due to privacy protection and the high cost of data annotation, it is almost impossible for HAR systems to collect a large amount of labeled data in practice.

To solve the problem of multimodal data fusion, we propose an attention-based multimodal neural network model called AttnSense for multimodal human activity recognition. The model uses a convolutional network and a gated recurrent unit (GRU) to perform feature transformation, and fuses the multimodal signals and hidden states through an attention mechanism, so that features from different modalities are used selectively. Extensive experiments on three public HAR datasets show that AttnSense achieves very good performance. We also visualize which modality of the input signals receives more attention, which provides clues about the internal state of the model and improves its interpretability.

To address the difficulty of collecting labeled data, we propose a multi-task unsupervised representation learning model, which integrates an autoencoder reconstruction objective and a K-means objective to generate a hidden representation that preserves the main information of the original input and is cluster-friendly; we also train a classifier based on the clustering pseudo-labels. Experimental results on three HAR datasets show that the proposed model achieves state-of-the-art performance under both unsupervised learning and transfer learning settings. We also analyze the impact of the hidden representation size, the encoder network depth, and the training set ratio on model performance.
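The following is a minimal sketch of the attention-based multimodal fusion idea described above (per-modality convolutional feature extraction, attention-weighted fusion across modalities, a GRU over time, and attention over time steps). The layer sizes, window length, number of modalities, and two-level attention layout are illustrative assumptions, not the exact AttnSense configuration.

```python
# Sketch of attention-based multimodal fusion for HAR (assumed configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnFusionHAR(nn.Module):
    def __init__(self, num_modalities=3, channels=3, window=128,
                 conv_dim=64, hidden_dim=128, num_classes=6):
        super().__init__()
        # One small 1-D CNN per modality (e.g., accelerometer, gyroscope, ...).
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, conv_dim, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool1d(2),
            )
            for _ in range(num_modalities)
        ])
        # Attention over modalities: score each modality's feature vector.
        self.modal_attn = nn.Linear(conv_dim, 1)
        # GRU over time, followed by attention over time steps.
        self.gru = nn.GRU(conv_dim, hidden_dim, batch_first=True)
        self.time_attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, num_modalities, channels, window)
        feats = [conv(x[:, m]) for m, conv in enumerate(self.convs)]
        # Each element: (batch, conv_dim, time) -> stack to
        # (batch, time, num_modalities, conv_dim)
        feats = torch.stack([f.transpose(1, 2) for f in feats], dim=2)
        # Attention weights over modalities at every time step.
        modal_w = F.softmax(self.modal_attn(feats), dim=2)
        fused = (modal_w * feats).sum(dim=2)           # (batch, time, conv_dim)
        hidden, _ = self.gru(fused)                    # (batch, time, hidden_dim)
        # Attention weights over time steps.
        time_w = F.softmax(self.time_attn(hidden), dim=1)
        context = (time_w * hidden).sum(dim=1)         # (batch, hidden_dim)
        return self.classifier(context)


if __name__ == "__main__":
    model = AttnFusionHAR()
    dummy = torch.randn(8, 3, 3, 128)    # batch of 8 sliding windows
    print(model(dummy).shape)            # torch.Size([8, 6])
```

The attention weights (`modal_w`, `time_w`) are exactly the quantities one would visualize to see which modality or time step the model relies on for a given prediction.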
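Similarly, the sketch below illustrates the multi-task unsupervised objective: an autoencoder reconstruction loss combined with a K-means-style clustering loss on the latent code, plus a classifier trained on cluster pseudo-labels. The network dimensions, the weighting factor `lam`, and the single-step update are illustrative assumptions rather than the paper's exact training procedure.

```python
# Sketch of a joint autoencoder + K-means objective with pseudo-label training
# (assumed dimensions and loss weighting).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClusterAE(nn.Module):
    def __init__(self, input_dim=384, latent_dim=32, n_clusters=6):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))
        # Learnable cluster centroids in latent space (K-means objective).
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent_dim))
        # Classifier trained on clustering pseudo-labels.
        self.classifier = nn.Linear(latent_dim, n_clusters)

    def forward(self, x):
        z = self.encoder(x)
        recon = self.decoder(z)
        # Squared distance from each latent code to every centroid.
        dist = torch.cdist(z, self.centroids) ** 2
        pseudo = dist.argmin(dim=1)                    # hard cluster assignment
        return z, recon, dist, pseudo


def joint_loss(model, x, lam=0.1):
    z, recon, dist, pseudo = model(x)
    recon_loss = F.mse_loss(recon, x)                  # autoencoder objective
    kmeans_loss = dist.min(dim=1).values.mean()        # pull codes toward centroids
    cls_loss = F.cross_entropy(model.classifier(z.detach()), pseudo)
    return recon_loss + lam * kmeans_loss + cls_loss


if __name__ == "__main__":
    model = ClusterAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    batch = torch.randn(64, 384)         # flattened sensor windows
    loss = joint_loss(model, batch)
    loss.backward()
    opt.step()
    print(float(loss))
```

Detaching `z` before the pseudo-label classification loss is one possible design choice: it lets the classifier learn from the cluster assignments without distorting the representation that the reconstruction and clustering terms shape.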