Self-attn in decoder layers. #16

@TsingWei

Description

I noticed the paper has a section stating that DETA does not need self-attention in the decoder, and the results show that replacing the decoder self-attention with an FFN actually improves performance. Does the final model reported in the comparison-with-other-SOTAs table use this setting? I ask because the self-attention appears to be hard-coded in the decoder layer:

self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
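For context, this is roughly the kind of change I have in mind, i.e. making the query self-attention swappable for a per-query FFN as in the paper's ablation. This is only a minimal sketch of my own; the class name, the use_self_attn flag, and the hidden sizes are my guesses, not the repository's actual implementation:

    import torch
    from torch import nn

    class DecoderLayerOptionalSelfAttn(nn.Module):
        # Hypothetical sketch: self-attention over queries can be replaced
        # by a small per-query FFN, per the ablation in the paper.
        def __init__(self, d_model=256, n_heads=8, dropout=0.1, use_self_attn=True):
            super().__init__()
            self.use_self_attn = use_self_attn
            if use_self_attn:
                # standard query self-attention, as hard-coded in the repo
                self.self_attn = nn.MultiheadAttention(
                    d_model, n_heads, dropout=dropout, batch_first=True)
            else:
                # replace self-attention with a per-query FFN
                self.self_ffn = nn.Sequential(
                    nn.Linear(d_model, d_model * 4),
                    nn.ReLU(inplace=True),
                    nn.Dropout(dropout),
                    nn.Linear(d_model * 4, d_model),
                )
            self.dropout = nn.Dropout(dropout)
            self.norm = nn.LayerNorm(d_model)
            # cross-attention to image features and the usual FFN would follow here

        def forward(self, tgt):
            if self.use_self_attn:
                out, _ = self.self_attn(tgt, tgt, tgt)
            else:
                out = self.self_ffn(tgt)
            return self.norm(tgt + self.dropout(out))

    # quick shape check
    layer = DecoderLayerOptionalSelfAttn(use_self_attn=False)
    queries = torch.randn(2, 300, 256)   # (batch, num_queries, d_model)
    print(layer(queries).shape)          # torch.Size([2, 300, 256])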
