
Add Positional Encoding in Conformer Implementation #3887

Open
Deep-unlearning opened this issue Feb 25, 2025 · 1 comment

@Deep-unlearning

Missing Relative Positional Encoding in Conformer Implementation

Issue Description

The current Conformer implementation in Torchaudio is missing the relative sinusoidal positional encoding scheme that is a key component of the original Conformer architecture as described in the paper "Conformer: Convolution-augmented Transformer for Speech Recognition".

Details

In the original paper, section 2.1 "Multi-Headed Self-Attention Module" specifically states:

"We employ multi-headed self-attention (MHSA) while integrating an important technique from Transformer-XL [20], the relative sinusoidal positional encoding scheme. The relative positional encoding allows the self-attention module to generalize better on different input length and the resulting encoder is more robust to the variance of the utterance length."

However, the current implementation in conformer.py uses standard PyTorch MultiheadAttention without implementing the relative positional encoding:

self.self_attn = torch.nn.MultiheadAttention(input_dim, num_attention_heads, dropout=dropout)

Reference Implementation

For reference, NVIDIA's NeMo library does properly implement the positional encoding in their Conformer implementation: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/modules/conformer_encoder.py
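For illustration only, a minimal sketch of a Transformer-XL-style relative sinusoidal positional encoding module is shown below. The class name, shapes, and interface are assumptions for this sketch, not part of the torchaudio API; a relative-position-aware attention module would still be needed to consume these embeddings alongside the query/key projections.

import math
import torch

class RelPositionalEncoding(torch.nn.Module):
    # Sketch (assumed names/shapes): produces sinusoidal embeddings for
    # relative offsets from (T - 1) down to -(T - 1), as used by
    # Transformer-XL-style relative attention.
    def __init__(self, d_model: int) -> None:
        super().__init__()
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) -> returns (1, 2 * time - 1, d_model)
        t = x.size(1)
        # Relative positions: t - 1, t - 2, ..., 0, ..., -(t - 1)
        positions = torch.arange(t - 1, -t, -1.0, device=x.device).unsqueeze(1)
        inv_freq = torch.exp(
            torch.arange(0, self.d_model, 2, device=x.device, dtype=torch.float32)
            * (-math.log(10000.0) / self.d_model)
        )
        pe = torch.zeros(2 * t - 1, self.d_model, device=x.device)
        pe[:, 0::2] = torch.sin(positions * inv_freq)
        pe[:, 1::2] = torch.cos(positions * inv_freq)
        return pe.unsqueeze(0)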

@Dannynis

+1
I was about to open an issue about this as well, a moment before I came across this one.
I don't see how it would work in any ASR scenario where temporal information should be preserved without the positional encoding.
Also note that this was already raised here:
https://discuss.pytorch.org/t/conformer-has-no-positional-encoding/207137
