Understanding Q, K, V in SegVIT #24
Hi! I'm new to working with Vision Transformers and currently exploring various approaches to image segmentation. While reading the paper, I found the approach taken here quite interesting.
However, I'm having some trouble fully understanding how the Attention-to-Mask (ATM) mechanism is implemented: specifically, how it modifies or reinterprets the use of queries, keys, and values compared to a standard attention mechanism.
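For reference, here is my current mental model of the difference, as a toy numpy sketch. This is just my own illustration under my reading of the paper, not the repo's actual code; all names and shapes are mine. The idea I'm trying to confirm: standard attention softmaxes the query-key similarity over the spatial axis and uses it to weight the values, whereas ATM additionally reads that same similarity map, passed through a sigmoid, directly as a per-class segmentation mask.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy shapes (my own choices): N class queries, L spatial tokens, dim d.
N, L, d = 3, 16, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, d))   # learnable class queries
K = rng.normal(size=(L, d))   # keys projected from image features
V = rng.normal(size=(L, d))   # values projected from image features

sim = Q @ K.T / np.sqrt(d)    # (N, L) query-key similarity map

# Standard attention: softmax over the spatial axis, then weight the values;
# the output updates the class tokens (used for classification).
attn_out = softmax(sim, axis=-1) @ V   # (N, d)

# ATM, as I understand it: the *same* similarity map, squashed with a
# sigmoid instead, is read directly as a per-class mask over the L patches
# (reshaped to H x W at full scale).
masks = sigmoid(sim)                   # (N, L)

print(attn_out.shape, masks.shape)
```

If that reading is wrong (e.g. about where the sigmoid is applied, or which tensors the masks come from), corrections are very welcome.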
If anyone knows of any resources or explanations that could help clarify this, I’d really appreciate it. Thanks in advance!