Understanding Q, K, V in SegVIT #24

@lapiceroazul4

Description

Hi! I'm new to working with Vision Transformers and currently exploring various approaches to image segmentation. While reading the paper, I found the approach taken here quite interesting.

However, I'm having some trouble fully understanding how the Attention-to-Mask (ATM) mechanism is implemented. Specifically, I'd like to understand how it modifies or reinterprets the roles of the queries, keys, and values compared to a standard attention mechanism.
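To make the question concrete, here is a toy NumPy sketch of my current (possibly wrong) understanding: the class tokens act as queries, the image-patch features act as keys and values, and the same query-key similarity map that drives the attention is also read directly, through a sigmoid instead of a softmax, as a per-class segmentation mask. All names below are my own, not from the repo's code.

```python
import numpy as np

def atm_attention(queries, keys, values):
    """Toy sketch of the Attention-to-Mask (ATM) idea as I understand it.

    queries: (N_cls, d) learnable class tokens (Q)
    keys:    (L, d)     flattened image-patch features (K)
    values:  (L, d)     image-patch features projected as V
    Returns the updated class tokens and a per-class spatial mask.
    """
    d = queries.shape[-1]
    # Query-key similarity map, shape (N_cls, L)
    scores = queries @ keys.T / np.sqrt(d)

    # Standard attention path: softmax over the spatial axis, aggregate V
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    updated = attn @ values  # refined class embeddings, shape (N_cls, d)

    # ATM twist (as I read it): the *same* similarity map, passed through
    # a sigmoid rather than a softmax, is interpreted as a segmentation
    # mask for each class token, one value per spatial location
    masks = 1.0 / (1.0 + np.exp(-scores))  # (N_cls, L), each entry in (0, 1)
    return updated, masks
```

Is this roughly what the ATM decoder does, i.e. the mask is a by-product of the attention scores rather than a separate prediction head? Corrections welcome.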

If anyone knows of any resources or explanations that could help clarify this, I’d really appreciate it. Thanks in advance!
