With the continued development of "metaverse" technology and the proliferation of XR devices, people can move freely between the physical and digital worlds through immersive technologies such as VR/AR, all of which rely on the synthesis of spatial audio. Binaural spatial audio technology requires an accurate head-related transfer function (HRTF), which describes the transmission process from a sound source to the two ears, to achieve realistic spatial sound effects. To meet the low computing power, low power consumption, and strict real-time requirements of XR end devices, this thesis designs a lightweight spatial audio algorithm library based on virtual sound source objects, supporting head tracking, pose perception, and spatial surround sound. The library achieves low PCM data frame processing latency on entry-level XR devices while generating spatial audio with lower average spectral distortion at different locations in space.

Spatial audio generated from generic HRTF data can lead to in-head localization, inaccurate positioning, and a poor sense of spatial orientation for different listeners. Since every listener's ears differ ("a thousand ears for a thousand people"), personalized generation of HRTF data, and thus personalized spatial audio, is an effective way to improve spatial orientation. This thesis proposes a hybrid model based on ensemble learning to realize personalized HRTF generation. The method first uses a Gaussian mixture model to build a common model of the HRTF components that are not affected by human physiological parameters; after stripping out these common components, it models the personalized components associated with human physiological parameters; finally, the two parts are integrated by ensemble learning to reconstruct the HRTF data and restore its localization information. In the personalized model, a deep learning approach generates HRTF data using the generic HRTF magnitude spectrum (with the shared components separated out) and the human physiological parameters as input features. A fully convolutional network (FCN) was designed to predict the HRTFs in all spatial directions from the key human physiological parameters and the generic HRTF magnitude spectrum, while the interaural time difference (ITD) was predicted by a Transformer module. An attention mechanism is used in the HRTF prediction model to better capture the relationship between HRTFs in two directions with large angular differences in space. Finally, the shared HRTF components, the predicted HRTFs, and the ITDs are integrated to obtain personalized HRTF and HRIR data. In addition to training the HRTF and ITD generation models separately, a jointly trained model that directly outputs the time-domain HRIR is considered and evaluated.

The spatial audio algorithm library designed in this thesis can be applied to XR terminal devices with low power consumption and high real-time performance, generating spatial audio for binaural playback according to the user's positional information. Meanwhile, the proposed personalized HRTF generation method can quickly generate personalized HRTF data from the user's physiological parameters; combined with the spatial audio algorithm library, it can effectively improve the listener's sense of spatial orientation.
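The final integration and rendering steps described above can be sketched in NumPy: reconstructing an HRIR from a predicted magnitude spectrum under the common minimum-phase assumption, inserting the predicted ITD as a far-ear delay, and convolving a mono PCM frame for binaural playback. This is a minimal illustration, not the thesis's actual algorithm library; all function and variable names (`min_phase_hrir`, `render_binaural`, `itd_samples`) are hypothetical.

```python
import numpy as np

def min_phase_hrir(mag):
    """Reconstruct a minimum-phase HRIR from a full-length
    (Hermitian-symmetric) magnitude spectrum via the real cepstrum."""
    n = len(mag)
    # Real cepstrum of the log-magnitude spectrum.
    cep = np.real(np.fft.ifft(np.log(np.maximum(mag, 1e-12))))
    # Fold the even cepstrum into a causal one (minimum-phase window).
    w = np.zeros(n)
    w[0] = 1.0
    w[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        w[n // 2] = 1.0
    return np.real(np.fft.ifft(np.exp(np.fft.fft(w * cep))))

def render_binaural(mono, hrir_l, hrir_r, itd_samples=0):
    """Convolve a mono PCM frame with left/right HRIRs and delay the
    far ear by the ITD, returning a stereo (2, N) frame."""
    left = np.convolve(mono, hrir_l)
    right = np.convolve(mono, hrir_r)
    if itd_samples > 0:
        right = np.concatenate([np.zeros(itd_samples), right])
        left = np.concatenate([left, np.zeros(itd_samples)])
    return np.stack([left, right])

# Usage: rebuild an HRIR whose magnitude spectrum was "predicted",
# then render a short frame with a 2-sample ITD on the right ear.
target_mag = np.abs(np.fft.fft(np.array([1.0, 0.5, 0.25, 0.1])))
h = min_phase_hrir(target_mag)  # |FFT(h)| matches target_mag
stereo = render_binaural(np.array([1.0, 0.0, 0.0]), h, h, itd_samples=2)
```

A real implementation would select or interpolate HRIRs per virtual source direction and use partitioned FFT convolution to keep per-frame latency low on entry-level hardware; the minimum-phase-plus-ITD decomposition shown here is the standard way to recombine a predicted magnitude spectrum and a predicted ITD into a playable HRIR pair.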