There are two possibilities:
- use convolution to aggregate information from adjacent input words
- apply convolution to the individual token embeddings
The goal here is to experiment with the second option, in order to a) counteract potential overfitting, and b) speed up processing thanks to the reduced embedding size. This could be particularly useful for large contextualized embeddings (e.g., ELMo), which already capture information about adjacent words.
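A minimal PyTorch sketch of the second option: each token embedding is compressed independently by sliding a 1D convolution along its embedding dimension, so no information from neighboring tokens is mixed in. The class name `EmbeddingCompressor` and the kernel/stride sizes are illustrative assumptions, not settled choices.

```python
import torch
import torch.nn as nn

class EmbeddingCompressor(nn.Module):
    """Compress each token embedding independently with a 1D convolution
    along the embedding dimension (option 2 above).
    Hypothetical sizes: kernel_size=8, stride=4 shrink a 1024-dim
    ELMo vector to 255 dims."""

    def __init__(self, kernel_size=8, stride=4):
        super().__init__()
        # A single-channel conv slides over the embedding dimension,
        # so every token is processed the same way regardless of position.
        self.conv = nn.Conv1d(in_channels=1, out_channels=1,
                              kernel_size=kernel_size, stride=stride)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, emb_dim)
        batch, seq_len, emb_dim = embeddings.shape
        # Fold tokens into the batch so the conv sees one embedding at a time.
        x = embeddings.reshape(batch * seq_len, 1, emb_dim)
        x = self.conv(x)                      # (batch*seq_len, 1, reduced_dim)
        return x.reshape(batch, seq_len, -1)  # (batch, seq_len, reduced_dim)

# Example: 2 sentences, 5 tokens each, 1024-dim embeddings
emb = torch.randn(2, 5, 1024)
compressed = EmbeddingCompressor()(emb)
print(compressed.shape)  # torch.Size([2, 5, 255])
```

Because the convolution never spans token boundaries, this stays orthogonal to the cross-word context already baked into contextualized embeddings; the only effect is dimensionality reduction per token.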