
Computational Simplification For Long Sequence Transformer

Posted on: 2024-04-22    Degree: Master    Type: Thesis
Country: China    Candidate: J C Lian    Full Text: PDF
GTID: 2568306932462124    Subject: Computer system architecture
Abstract/Summary:
The Transformer model has achieved great progress in many intelligent application fields such as natural language processing, computer vision, and audio processing. As computational tasks become more complicated, input sequences grow much longer and the computational cost of the Transformer rises rapidly. Specifically, the core operator of the Transformer is self-attention, whose complexity is O(N²) because a non-linear function is combined with matrix multiplication, where N is the length of the input sequence; this imposes a huge overhead on hardware. For hardware with limited computing power, the feasible input length of the model is also limited, which hinders the deployment of complex applications. Therefore, reducing the computational cost of self-attention while maintaining accuracy is the key to simplifying Transformer computation, especially for long input sequences, and it is the main research content of this thesis.

To address the high computational cost of the Transformer on long sequences, this thesis focuses on the following three tasks to simplify self-attention computation while keeping comparable accuracy.

First, to address the high computational complexity, this thesis proposes a linearized self-attention method called IJformer, which uses a ReLU activation function together with i/j coefficients to linearize the matrix operations, reducing the complexity of self-attention from quadratic in the sequence length to linear. For input lengths of 1000-2000, IJformer reduces the computation of softmax attention by 13-26 times and increases the inference speed of the Transformer by an average of 5.23 times without precision loss.

Second, to address the large computational scale, this thesis proposes a self-attention dimension-reduction method based on low-rank matrix factorization, which reduces the dimensions of the K and V matrices and thus further reduces the amount of matrix computation. After dimension reduction, IJformer reduces the amount of computation by about 17% and increases inference speed by 1.17 times, with precision only 0.5% lower than the Transformer.

Third, to address data redundancy, this thesis proposes a joint sparsity method based on both the input and the output, which prunes the matrix operations at a higher sparsity rate and can theoretically reduce the amount of computation. After sparsification, the average sparsity rate of self-attention across various scenarios increases from 69.2% to 97.6% while maintaining slightly higher precision than the Transformer.

Finally, the IJformer linearization method can be combined with the dimension-reduction method and with the sparsity method, respectively. Of these two combinations, IJformer with dimension reduction achieves faster inference, while IJformer with sparsity achieves higher precision.
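The quadratic-to-linear reduction described above can be illustrated with a minimal NumPy sketch of kernelized (linear) attention. This is not the thesis's IJformer implementation: the abstract does not specify the i/j coefficient scheme or the low-rank K/V projection, so a plain ReLU feature map stands in for the kernel and all function names here are illustrative assumptions.

```python
# Minimal sketch contrasting O(N^2) softmax attention with an O(N) kernelized
# variant using a ReLU feature map. The i/j coefficients and dimension
# reduction from the thesis are omitted; this only shows the reordering
# phi(Q) (phi(K)^T V) that avoids the N x N score matrix.
import numpy as np

def softmax_attention(Q, K, V):
    """Standard self-attention: the N x N score matrix makes this O(N^2 * d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # (N, N) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # (N, d)

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention: computing phi(K)^T V first gives O(N * d^2) cost,
    linear in the sequence length N."""
    phi_Q, phi_K = np.maximum(Q, 0.0), np.maximum(K, 0.0)   # ReLU feature map
    KV = phi_K.T @ V                             # (d, d), no N x N intermediate
    Z = phi_Q @ phi_K.sum(axis=0) + eps          # (N,) normalization term
    return (phi_Q @ KV) / Z[:, None]             # (N, d)

if __name__ == "__main__":
    N, d = 2000, 64                              # long sequence, small head dim
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    print(softmax_attention(Q, K, V).shape)      # (2000, 64)
    print(linear_attention(Q, K, V).shape)       # (2000, 64)
```

For N = 2000 and d = 64, the softmax path materializes a 2000 x 2000 score matrix, while the kernelized path only forms a 64 x 64 intermediate, which is the source of the large savings reported for long sequences.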
Keywords/Search Tags: Transformer, self-attention, linearization method, dimension reduction, sparsity