Congrats on your great work!
I am evaluating your method on vision tasks and have a small question about the influence of the `conv_kernel_size` of the 2D group convolution: I noticed that you choose relatively large values, such as 35.
In vision tasks, a convolution with such a large kernel is typically used to enlarge the receptive field. However, the proposed Nystrom attention should already be able to model long-range context, just like the original multi-head attention, so I am a little confused about the motivation for this design.
Another question: since image feature maps have a grid structure, should we set `num_landmarks` equal to the feature-map width?
It would be great if you could share your advice on the influence of this parameter!
Nystromformer/code/attention_nystrom.py, lines 23 to 28 in effde25:

```python
self.conv = nn.Conv2d(
    in_channels = self.num_head, out_channels = self.num_head,
    kernel_size = (config["conv_kernel_size"], 1), padding = (config["conv_kernel_size"] // 2, 0),
    bias = False,
    groups = self.num_head)
```
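If I read the paper correctly, this depthwise conv is applied to `V` along the sequence dimension and added as a skip connection on top of the Nystrom-approximated attention. Here is a minimal sketch of what I think the kernel size does (the shapes below are just an example I made up, not the repo's defaults):

```python
import torch
import torch.nn as nn

# Example shapes only: batch 2, 8 heads, a 32x32 feature map flattened to
# 1024 tokens, head dim 64.
B, num_head, H, W, head_dim = 2, 8, 32, 32, 64
seq_len = H * W
V = torch.randn(B, num_head, seq_len, head_dim)

# Same construction as above: a depthwise (grouped) conv whose kernel spans
# conv_kernel_size positions along the sequence axis only.
conv_kernel_size = 35
conv = nn.Conv2d(
    in_channels=num_head, out_channels=num_head,
    kernel_size=(conv_kernel_size, 1), padding=(conv_kernel_size // 2, 0),
    bias=False, groups=num_head)

out = conv(V)
print(out.shape)  # torch.Size([2, 8, 1024, 64]) -- sequence length preserved

# Each output token mixes only the ~35 neighbouring tokens of the *flattened*
# sequence (roughly one row of a 32x32 grid), i.e. a local 1D inductive bias
# rather than a large 2D receptive field.
```

So my understanding is that the conv is a cheap local complement to the approximated global attention, which is why I am wondering whether 35 is still the right choice on 2D feature maps.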
"extra_attn_config":{ |
|
"softmax":{"attention_grad_checkpointing":True}, |
|
"nystrom-32":{"attention_grad_checkpointing":False, "num_landmarks":32, "conv_kernel_size":35}, |
|
"nystrom-64":{"attention_grad_checkpointing":False, "num_landmarks":64, "conv_kernel_size":35}, |
|
"nystrom-128":{"attention_grad_checkpointing":False, "num_landmarks":128, "conv_kernel_size":35}, |
|
"nystrom-256":{"attention_grad_checkpointing":False, "num_landmarks":256, "conv_kernel_size":35}, |
|
"linformer-256":{"attention_grad_checkpointing":False, "linformer_k":256}, |