
can't use masks in multi-head-attention layer #2336

Open
alerem18 opened this issue Sep 15, 2023 · 6 comments

alerem18 commented Sep 15, 2023

Motivation and description

Let's say we have an array of shape (embedding_size, seq_len, batch_size); our padding mask will then have shape (seq_len, batch_size), which can't be used as the multi-head-attention layer's mask. We can only use causal masking, which has shape (seq_len, seq_len).
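A rough sketch of the situation (the sizes and seq_lengths below are just illustrative):

using Flux, NNlib

embedding_size, seq_len, batch_size = 8, 5, 2
x = rand(Float32, embedding_size, seq_len, batch_size)

seq_lengths = [3, 4]   # illustrative per-sequence valid lengths
pad_mask = [t <= seq_lengths[b] for t in 1:seq_len, b in 1:batch_size]  # (seq_len, batch_size)

causal = NNlib.make_causal_mask(x)   # (seq_len, seq_len)

mha = MultiHeadAttention(embedding_size, nheads=2)
mha(x, mask=causal)       # works: (seq_len, seq_len) broadcasts against the scores
# mha(x, mask=pad_mask)   # fails: (seq_len, batch_size) is not broadcastable to
#                         # (kv_len, q_len, nheads, batch_size)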

Possible Implementation

No response

@CarloLucibello (Member)

The layer's documentation for the forward pass says:

     (mha::MultiHeadAttention)(q_in, k_in, v_in, [bias]; [mask])
...
mask: Input array broadcastable to size (kv_len, q_len, nheads, batch_size). 
      The mask is applied to the attention scores just before the softmax. 
      See NNlib.make_causal_mask for creating causal masks. Default nothing.

So I think you should reshape it, as either reshape(mask, (seq_len, 1, 1, batch_size)) or reshape(mask, (1, seq_len, 1, batch_size)); I'm not sure which of the two is correct.
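Something like this (a minimal sketch; pad_mask, seq_lengths and the sizes are illustrative):

seq_len, batch_size = 5, 2
seq_lengths = [3, 4]   # illustrative
pad_mask = [t <= seq_lengths[b] for t in 1:seq_len, b in 1:batch_size]  # (seq_len, batch_size)

# both reshapes are broadcastable to (kv_len, q_len, nheads, batch_size)
mask_keys    = reshape(pad_mask, seq_len, 1, 1, batch_size)  # masks along the key axis
mask_queries = reshape(pad_mask, 1, seq_len, 1, batch_size)  # masks along the query axis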

@alerem18 (Author)

Thanks, now it's working.

@CarloLucibello (Member)

@alerem18 which of the two reshapes is correct in your case?

@alerem18 (Author)

reshape(mask, (seq_len, 1, 1, batch_size))


alerem18 commented Sep 23, 2023

However, the masking is wrong: it should have shape (seq_len, seq_len, 1, batch_size), but with (1, seq_len, 1, batch_size) it returns NaN, so padding masking is not currently supported by the layer. I've already tried that:

using Flux

# two sequences of length 5; token 1 is used as padding in positions 4:5
l = reduce(hcat, [[5, 2, 3, 1, 1], [4, 5, 6, 1, 1]])

# mask of shape (kv_len, q_len, 1, batch_size); false marks padded positions
mask = fill(true, 5, 5, 1, 2)
mask[4:5, :, :, :] .= 0   # mask the padded keys
mask[:, 4:5, :, :] .= 0   # mask the padded queries

emb_layer = Embedding(10, 128)
emb = emb_layer(l)                         # (128, 5, 2)
attn = MultiHeadAttention(128, nheads=2)
attn(emb, mask=mask)[2]                    # second output: the attention scores

Result:

5×5×2×2 Array{Float32, 4}:
[:, :, 1, 1] =
 0.326395   0.362849  0.343025   NaN  NaN
 0.0660359  0.402627  0.0637925  NaN  NaN
 0.60757    0.234524  0.593183   NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN

[:, :, 2, 1] =
 0.486156  0.144888  0.532702   NaN  NaN
 0.2133    0.422068  0.0270071  NaN  NaN
 0.300544  0.433044  0.440291   NaN  NaN
 0.0       0.0       0.0        NaN  NaN
 0.0       0.0       0.0        NaN  NaN

[:, :, 1, 2] =
 0.0449472  0.396037  0.347837   NaN  NaN
 0.198215   0.455466  0.0415825  NaN  NaN
 0.756838   0.148497  0.610581   NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN

[:, :, 2, 2] =
 0.778366   0.164352  0.220597   NaN  NaN
 0.0780623  0.445108  0.702782   NaN  NaN
 0.143571   0.39054   0.0766214  NaN  NaN
 0.0        0.0       0.0        NaN  NaN
 0.0        0.0       0.0        NaN  NaN

alerem18 reopened this Sep 23, 2023
@alerem18 (Author)

Masking with shape (seq_len, 1, 1, batch_size) is OK, but with shape (1, seq_len, 1, batch_size) it returns NaN.
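A hedged guess at what's going on, with a workaround sketch (the sizes, seq_lengths, and the post-hoc zeroing are illustrative, not an official recipe): the attention scores have size (kv_len, q_len, nheads, batch_size) and the softmax runs over the key dimension, so a (1, seq_len, 1, batch_size) mask turns every key off for a padded query, and the softmax of an all -Inf column is NaN. A key-only mask of shape (seq_len, 1, 1, batch_size) always leaves the real keys visible, and the padded query positions can simply be zeroed after the layer:

using Flux

d, seq_len, batch_size = 128, 5, 2
seq_lengths = [3, 3]   # illustrative: positions 4:5 are padding
pad_mask = [t <= seq_lengths[b] for t in 1:seq_len, b in 1:batch_size]

x = rand(Float32, d, seq_len, batch_size)
mha = MultiHeadAttention(d, nheads=2)

# mask only the keys: every query still sees the real tokens, so the softmax
# never runs over an all-masked column and no NaN appears
key_mask = reshape(pad_mask, seq_len, 1, 1, batch_size)
y, α = mha(x, mask=key_mask)

# drop whatever was computed at the padded query positions afterwards
y = y .* reshape(pad_mask, 1, seq_len, batch_size)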
