MultiheadAttention — PyTorch 1.10.1 documentation
https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.htmlattn_mask – If specified, a 2D or 3D mask preventing attention to certain positions. Must be of shape (L, S) (L,S) or (N\cdot\text {num\_heads}, L, S) (N ⋅ num_heads,L,S), where N N is the batch size, L L is the target sequence length, and S S is the source sequence length.