Influence of the "conv_kernel_size" within the proposed Nystrom Attention #5

@PkuRainBow

Description

Congrats on your great work!

I am verifying your method on vision tasks and have a small concern about the influence of the "conv_kernel_size" of the 2D group convolution in your implementation: I find that you choose relatively large values such as 35.

In vision tasks, a convolution with such a large kernel size is typically used to ensure a large receptive field. However, the proposed Nystrom attention should already be able to model long-range context, just like the original multi-head attention. In short, I am a bit confused about the motivation for this design.

Another important concern: should we set num_landmarks equal to the feature-map width, given that image feature maps have a grid structure?
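
For context, in the Nystromformer paper the landmarks are computed as segment means over the flattened token sequence, so num_landmarks is a fraction of the sequence length rather than something tied to a 2D grid dimension. A minimal sketch of that construction (shapes and values below are illustrative; it assumes seq_len is divisible by num_landmarks):

import torch

B, num_head, seq_len, head_dim = 2, 8, 4096, 64  # e.g. a flattened 64x64 feature map
num_landmarks = 64                               # illustrative value

q = torch.randn(B, num_head, seq_len, head_dim)

# Segment-means landmarks: split the sequence into num_landmarks contiguous
# segments and average each one (requires seq_len % num_landmarks == 0).
q_landmarks = q.reshape(
    B, num_head, num_landmarks, seq_len // num_landmarks, head_dim
).mean(dim=-2)

print(q_landmarks.shape)  # torch.Size([2, 8, 64, 64])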

It would be great if you could share your advice on the influence of these parameters!

For reference, here are the relevant convolution definition and config entries:

# Depthwise (grouped) convolution used by the Nystrom attention module.
# The kernel is (conv_kernel_size, 1): it slides along the sequence axis
# only, and padding of conv_kernel_size // 2 keeps the sequence length fixed.
self.conv = nn.Conv2d(
    in_channels = self.num_head, out_channels = self.num_head,
    kernel_size = (config["conv_kernel_size"], 1),
    padding = (config["conv_kernel_size"] // 2, 0),
    bias = False,
    groups = self.num_head)
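
For what it's worth, in the Nystromformer code this convolution is applied to the value tensor and its output is added to the approximate attention output as a residual skip connection, so it injects strictly local, token-neighbourhood information on top of the global Nystrom approximation. A minimal sketch of that usage (tensor names and sizes are illustrative, not the authors' exact code):

import torch
import torch.nn as nn

B, num_head, seq_len, head_dim = 2, 8, 512, 64
conv_kernel_size = 35

# Same construction as above: a depthwise (grouped) conv that slides only
# along the sequence axis; "same" padding preserves seq_len.
conv = nn.Conv2d(
    in_channels=num_head, out_channels=num_head,
    kernel_size=(conv_kernel_size, 1),
    padding=(conv_kernel_size // 2, 0),
    bias=False, groups=num_head)

attn_out = torch.randn(B, num_head, seq_len, head_dim)  # Nystrom-approximated attention output
v = torch.randn(B, num_head, seq_len, head_dim)         # value tensor

# Residual skip connection: each token also receives a learned mix of the
# values in its local window of conv_kernel_size tokens.
out = attn_out + conv(v)
print(out.shape)  # torch.Size([2, 8, 512, 64])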

"extra_attn_config":{
"softmax":{"attention_grad_checkpointing":True},
"nystrom-32":{"attention_grad_checkpointing":False, "num_landmarks":32, "conv_kernel_size":35},
"nystrom-64":{"attention_grad_checkpointing":False, "num_landmarks":64, "conv_kernel_size":35},
"nystrom-128":{"attention_grad_checkpointing":False, "num_landmarks":128, "conv_kernel_size":35},
"nystrom-256":{"attention_grad_checkpointing":False, "num_landmarks":256, "conv_kernel_size":35},
"linformer-256":{"attention_grad_checkpointing":False, "linformer_k":256},
