18.08.2019 · If a FloatTensor is provided (as attn_mask), it will be added to the attention weight. [src/tgt/memory]_key_padding_mask marks specified elements in the key to be ignored by the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be left unchanged.
02.06.2020 · Both src_mask and src_key_padding_mask are used in the MultiheadAttention mechanism. According to the documentation of MultiheadAttention: key_padding_mask – if provided, specified padding elements in the key will be ignored by the attention. attn_mask – 2D or 3D mask that prevents attention to certain positions.
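A minimal sketch (not from any of the quoted posts) showing both masks passed to nn.MultiheadAttention; the sizes and the causal attn_mask below are illustrative assumptions:

    import torch
    import torch.nn as nn

    S, N, E = 5, 2, 16                      # source length, batch size, embedding dim
    attn = nn.MultiheadAttention(embed_dim=E, num_heads=4)
    x = torch.rand(S, N, E)                 # sequence-first layout (S, N, E)

    # attn_mask: (S, S); with a bool mask, True positions are not allowed to
    # attend (a float mask would instead be added to the attention weights)
    attn_mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)

    # key_padding_mask: (N, S) bool mask, True marks padding keys to ignore
    key_padding_mask = torch.tensor([[False, False, False, True, True],
                                     [False, False, False, False, True]])

    out, weights = attn(x, x, x, attn_mask=attn_mask,
                        key_padding_mask=key_padding_mask)
    print(out.shape)                        # torch.Size([5, 2, 16])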
08.12.2019 · As for src_key_padding_mask, it has to be of shape (N, S), where N is the batch size and S is the source sequence length. I think its purpose is to exclude padded words when computing the representations of the other words. For example, if we want to ignore the third word in our source sequence when computing attention weights, then, with a batch size of 1, the mask would have a True in the third position.
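A minimal sketch of that mask, assuming a source length of 5 (the actual length is not given in the post):

    import torch

    # batch size 1, source length 5 -> shape (N, S) = (1, 5)
    # True at index 2 masks the third word
    src_key_padding_mask = torch.tensor([[False, False, True, False, False]])
    print(src_key_padding_mask.shape)       # torch.Size([1, 5])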
The documentation says to add an argument src_key_padding_mask to the forward function of the nn.TransformerEncoder module. This mask should be a tensor ...
12.08.2019 · src_key_padding_mask – the ByteTensor mask for src keys per batch (optional). In my opinion, src_mask's dimension is (S, S), where S is the max source length in the batch, so I would need to send an src_mask of shape (N, S, S) to the Transformer. I don't know if I understand that correctly.
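For reference, src_mask for nn.Transformer stays (S, S) and is shared across the whole batch, while per-sample padding is what src_key_padding_mask of shape (N, S) is for. A small sketch with made-up sizes:

    import torch
    import torch.nn as nn

    S, N, E = 6, 3, 32
    model = nn.Transformer(d_model=E, nhead=4)
    src = torch.rand(S, N, E)
    tgt = torch.rand(S, N, E)

    # (S, S) mask shared across the batch; True blocks attention to a position
    src_mask = torch.zeros(S, S, dtype=torch.bool)
    # (N, S) per-sample padding mask; here the last two tokens are padding
    src_key_padding_mask = torch.zeros(N, S, dtype=torch.bool)
    src_key_padding_mask[:, -2:] = True

    out = model(src, tgt, src_mask=src_mask,
                src_key_padding_mask=src_key_padding_mask)
    print(out.shape)                        # torch.Size([6, 3, 32])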
In PyTorch, this is done by passing src_key_padding_mask to the transformer. For the example, this looks like [False, False, False, False, False, False, ...
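A self-contained sketch of that pattern with nn.TransformerEncoder, building the boolean mask from a hypothetical PAD_IDX (the vocabulary size, token values, and dimensions below are invented for illustration):

    import torch
    import torch.nn as nn

    PAD_IDX = 0                                         # hypothetical padding index
    vocab_size, E, N, S = 100, 32, 2, 7

    embed = nn.Embedding(vocab_size, E, padding_idx=PAD_IDX)
    layer = nn.TransformerEncoderLayer(d_model=E, nhead=4)
    encoder = nn.TransformerEncoder(layer, num_layers=2)

    # two sequences of length 7, padded with PAD_IDX at the end
    tokens = torch.tensor([[5, 6, 7, 8, 0, 0, 0],
                           [9, 3, 4, 2, 6, 7, 0]])      # (N, S)

    # True wherever the token is padding -> those keys are ignored by attention
    src_key_padding_mask = tokens.eq(PAD_IDX)           # (N, S) bool

    src = embed(tokens).transpose(0, 1)                 # (S, N, E), sequence-first
    out = encoder(src, src_key_padding_mask=src_key_padding_mask)
    print(out.shape)                                    # torch.Size([7, 2, 32])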