[2003.09229] Learning to Encode Position for Transformer with Continuous Dynamical Model
https://arxiv.org/abs/2003.09229 · 13 Mar 2020

We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. Unlike RNN and LSTM, which contain an inductive bias by loading the input tokens sequentially, non-recurrent models are less sensitive to position. The main reason is that position information among input units is not inherently encoded, i.e., the …
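To make the point about position concrete, here is a minimal NumPy sketch, not the paper's proposed method, illustrating why non-recurrent attention needs explicit position information: plain self-attention is order-agnostic (permuting the input tokens merely permutes the output rows) until a position encoding is added to the token embeddings. The fixed sinusoidal scheme shown is the standard Transformer baseline (Vaswani et al.), used here as an assumed point of reference; `self_attention` and `sinusoidal_encoding` are hypothetical helper names for this demo.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with identity Q/K/V projections, for brevity."""
    scores = x @ x.T / np.sqrt(x.shape[-1])          # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
    return weights @ x                               # weighted sum of values

def sinusoidal_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding (standard Transformer baseline); the paper
    proposes replacing such a hand-designed scheme with a learnable one."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 tokens, model dim 8
perm = rng.permutation(5)

# Without positions: permuting the tokens just permutes the output rows,
# so token order carries no signal (permutation equivariance).
assert np.allclose(self_attention(x)[perm], self_attention(x[perm]))

# With positions added to the embeddings, the same permutation changes
# the result, so the model can now distinguish token order.
enc = sinusoidal_encoding(5, 8)
assert not np.allclose(self_attention(x + enc)[perm],
                       self_attention(x[perm] + enc))
```

The two assertions capture the abstract's claim directly: the first shows the model is insensitive to input order on its own, and the second shows that an additive position encoding is what breaks that symmetry.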