Self-Attention with Relative Position Representations
Sharing relative position representations across heads reduces their space complexity from O(hn^2 d_a) to O(n^2 d_a). Additionally, relative position representations can be shared across sequences. Therefore, the overall self-attention space complexity increases from O(bhn d_z) to O(bhn d_z + n^2 d_a). Given d_a = d_z, the size of the relative increase depends on n/(bh).
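The comparison above can be sketched numerically. The function name and toy sizes below are ours, not from the paper; the sketch just counts elements under the stated O(bhn d_z) and O(n^2 d_a) terms, with d_a = d_z:

```python
def attention_space(b, h, n, d_z, d_a, share_across_heads=True):
    """Element counts for self-attention activations.

    Returns (baseline, with_relative): the baseline O(b h n d_z) cost
    and the cost after adding relative position representations, which
    contribute n^2 d_a when shared across heads (else h n^2 d_a).
    """
    baseline = b * h * n * d_z
    rel = (n * n * d_a) if share_across_heads else (h * n * n * d_a)
    return baseline, baseline + rel

# Toy configuration (our choice): batch 32, 8 heads, length 128, d_a = d_z = 64.
b, h, n, d = 32, 8, 128, 64
base, total = attention_space(b, h, n, d, d)
print(total / base - 1)  # relative increase
print(n / (b * h))       # matches n / (b h), as the text states
```

With these sizes both printed values are 0.5, illustrating that the relative overhead of the shared representations shrinks as batch size or head count grows and grows with sequence length.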