| Video summaries condense the main content of a video and are a powerful means of improving the efficiency of video retrieval. However, existing work pays little attention to human-object interaction in video, or extracts only information about human actions and omits the objects involved in the interaction, which greatly limits the information conveyed by video summaries and reduces their usefulness for fast video browsing and retrieval. Radio Frequency Identification (RFID) opens up new possibilities for multi-target sensing and provides richer information for video summary generation. In this paper, we propose a human-object interaction log generation method based on the fusion of RFID and video. The research content and contributions of the paper cover three aspects. Firstly, the impact of human-object interaction on RFID signals is explored. By comparing RFID signals in static and dynamic environments, and in different environments where interaction occurs, we find that the RFID signal phase is sensitive to the environment, that the signal characteristics are highly correlated with the category of interaction behaviour, and that the signal characteristics of similar interaction behaviours are relatively stable. This establishes the feasibility of using RFID signals to sense interaction behaviours and generate interaction logs. Secondly, the signals of the two modalities are processed separately. The raw RFID phase signal is segmented to extract action segments, which after pre-processing are fed as features to the RFID neural network Rf_Net to obtain the interaction behaviour category; the human skeleton sequence is extracted from the video signal and fed as features to the skeleton behaviour recognition network Ske_Net to obtain the interaction behaviour category. Finally, data are collected in real scenarios, data sets are constructed, and the networks proposed in this paper are trained and used for prediction. The trained models are applied to the test data, and a multimodal information matching algorithm based on temporal synchronisation matches the outputs of the two modalities to generate interaction logs that jointly describe the human-object interaction behaviour. On the test set, the accuracy of interaction behaviour recognition using Rf_Net and Ske_Net is above 97%, and the accuracy of the multimodal matching algorithm for behaviour matching is above 96%, with a matching time error of less than 100 ms. The experimental results show that the proposed multimodal matching algorithm has high accuracy and robustness. |
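
The abstract does not specify how the temporal-synchronisation matching works, so the following is only a minimal sketch of one plausible reading: an RFID event and a video event are paired when they carry the same predicted behaviour category and overlap in time. The names `Event`, `match_events`, and `min_overlap`, the greedy pairing rule, and the log fields are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    """One recognised action segment (hypothetical structure)."""
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    label: str        # predicted interaction behaviour category
    source_id: str    # tag ID for RFID events, person ID for video events


def overlap(a: Event, b: Event) -> float:
    """Length in seconds of the time interval shared by two events."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def match_events(rfid_events: List[Event],
                 video_events: List[Event],
                 min_overlap: float = 0.1) -> List[Dict]:
    """Greedily pair each RFID event with the unused video event that has
    the same label and the largest temporal overlap (assumed rule)."""
    logs = []
    used = set()
    for r in rfid_events:
        best_idx, best_ov = None, min_overlap
        for i, v in enumerate(video_events):
            if i in used or v.label != r.label:
                continue
            ov = overlap(r, v)
            if ov >= best_ov:
                best_idx, best_ov = i, ov
        if best_idx is None:
            continue  # no synchronised video event, so no log entry
        used.add(best_idx)
        v = video_events[best_idx]
        logs.append({
            "person": v.source_id,
            "object": r.source_id,
            "action": r.label,
            "start": max(r.start, v.start),
            "end": min(r.end, v.end),
            # Offset between the two modalities' start times, in seconds.
            "start_offset": abs(r.start - v.start),
        })
    return logs
```

Each resulting entry combines who acted (from the video branch), which tagged object was involved (from the RFID branch), and when, which is the kind of joint description an interaction log is meant to provide. The `start_offset` field is one simple way to inspect how closely the two modalities agree in time.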