Understanding Q, K, V in SegVIT #24

@lapiceroazul4

Description

Hi! I'm new to working with Vision Transformers and currently exploring various approaches to image segmentation. While reading the paper, I found the approach taken here quite interesting.

However, I'm having some trouble fully understanding how the Attention-to-Mask (ATM) mechanism is implemented. Specifically, I'd like to understand how it modifies or reinterprets the roles of the queries, keys, and values compared to a standard attention mechanism.
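To make the question concrete, here is a toy NumPy sketch of my current (possibly wrong) understanding: the class tokens act as queries, the image-patch features act as keys and values, and the same query-key similarity map that drives the attention is also read directly, through a sigmoid instead of a softmax, as a per-class segmentation mask. All names below are my own, not from the repo's code.

```python
import numpy as np

def atm_attention(queries, keys, values):
    """Toy sketch of the Attention-to-Mask (ATM) idea as I understand it.

    queries: (N_cls, d) learnable class tokens (Q)
    keys:    (L, d)     flattened image-patch features (K)
    values:  (L, d)     image-patch features projected as V
    Returns the updated class tokens and a per-class spatial mask.
    """
    d = queries.shape[-1]
    # Query-key similarity map, shape (N_cls, L)
    scores = queries @ keys.T / np.sqrt(d)

    # Standard attention path: softmax over the spatial axis, aggregate V
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    updated = attn @ values  # refined class embeddings, shape (N_cls, d)

    # ATM twist (as I read it): the *same* similarity map, passed through
    # a sigmoid rather than a softmax, is interpreted as a segmentation
    # mask for each class token, one value per spatial location
    masks = 1.0 / (1.0 + np.exp(-scores))  # (N_cls, L), each entry in (0, 1)
    return updated, masks
```

Is this roughly what the ATM decoder does, i.e. the mask is a by-product of the attention scores rather than a separate prediction head? Corrections welcome.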

If anyone knows of any resources or explanations that could help clarify this, I’d really appreciate it. Thanks in advance!
