For decoder-only models, how is the causal mask applied?
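
To make the question concrete, here is a minimal sketch of what a causal mask is usually taken to mean, assuming standard scaled dot-product self-attention: before the softmax, the score for every position j > i is set to negative infinity so that token i can only attend to itself and earlier tokens. The function names `causal_mask` and `masked_softmax` and the uniform `scores` are made up for illustration and do not come from any particular library.

```python
import math

def causal_mask(n):
    """Return an n x n mask; entry [i][j] is True when position i may attend to j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, with disallowed positions set to -inf first."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # the diagonal is always allowed, so peak is finite
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Hypothetical raw attention scores (all equal) for a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With equal scores, each row of `weights` spreads its mass evenly over the current and earlier positions and gives exactly zero weight to later ones. In practice the same mask is typically applied as one lower-triangular matrix over the whole sequence during training, so all positions are computed in parallel.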