Single head attention, decoupled LR, autoregressive auxiliary loss, and gradient accumulation#191
Open
lucidrains wants to merge 63 commits into nanoporetech:master from
Conversation
…which to place an SHA block
…iments being more stable with more aggressive clipping at 0.5 on 20 million chunks
lucidrains commented on Oct 22, 2021
```toml
sha_sandwich_norm = true

[aux_decoder]
loss_weight = 0.25
```
Author
Set this to 0 to turn off the auxiliary AR loss.
Author
The protocol should be to start off with 0.25 and search for higher values up to 1.0 if you see continued improvement.
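The auxiliary autoregressive loss is combined with the main loss through `loss_weight`. A minimal sketch of that combination (hypothetical function name, not bonito's actual API):

```python
def combined_loss(main_loss, aux_ar_loss, loss_weight=0.25):
    """Total training loss = main loss + loss_weight * auxiliary AR loss.

    With loss_weight = 0 the auxiliary decoder contributes nothing,
    which effectively turns the auxiliary AR loss off.
    """
    return main_loss + loss_weight * aux_ar_loss
```

Following the protocol above, one would sweep `loss_weight` upward from 0.25 toward 1.0 while validation metrics keep improving.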
…ing --batch * --accum
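With gradient accumulation, the effective batch size becomes `--batch * --accum`. A minimal sketch of the idea in PyTorch (hypothetical loop, not the actual bonito trainer): each micro-batch loss is scaled by `1/accum` so the accumulated gradient matches an average over the effective batch, and the optimizer only steps every `accum` micro-batches.

```python
import torch


def train_step(model, loss_fn, batches, optimizer, accum):
    """Accumulate gradients over `accum` micro-batches before stepping.

    Effective batch size = micro-batch size * accum.
    Returns the summed (scaled) loss over the processed micro-batches.
    """
    optimizer.zero_grad()
    total = 0.0
    for i, (x, y) in enumerate(batches):
        loss = loss_fn(model(x), y) / accum  # scale so gradients average
        loss.backward()                      # gradients accumulate in .grad
        total += loss.item()
        if (i + 1) % accum == 0:
            optimizer.step()
            optimizer.zero_grad()
    return total
```

This lets a small-memory GPU emulate a larger batch at the cost of more forward/backward passes per optimizer step.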
iiSeymour reviewed on Oct 25, 2021
… a command line flag. also add ability to turn off self attention in the AR decoder
…concatting feature dimension across layers
lucidrains commented on Nov 19, 2021
samgd reviewed on Nov 20, 2021
…b_attn flag in configs
… default head dimension to 64
lucidrains commented on Nov 26, 2021
```toml
ff_dropout = 0.1
num_attn_heads = 1

use_isab_attn = true
```
Author
When using ISAB attention, `num_attn_heads` above should be set to at least 4.
```toml
use_isab_attn = true
isab_num_latents = 6

weight_tie_attn_blocks = false
```
Author
For parameter saving when using ISAB blocks, which have twice the number of attention parameters of S(M)HA blocks.
lucidrains commented on Nov 26, 2021
```toml
num_attn_heads = 1    # number of attention heads; keep at 1 for single-head attention, or increase to > 1 to turn on multi-head attention
dim_attn_head = 64    # dimension per attention head; 64 is a good default, but can be lowered to 32 for a further efficiency / perf tradeoff
use_isab_attn = false # whether to use ISAB attention (induced-set attention block from the Set Transformer paper)
```
Author
If you were to set this to true, the number of attention heads needs to be increased to 4 or above. A good starting config would be:
```toml
num_attn_heads = 4
dim_attn_head = 64
use_isab_attn = true
isab_num_latents = 6
```
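For reference, an induced-set attention block routes attention through a small set of learned latents ("inducing points"), giving O(n·m) cost instead of O(n²) self-attention. A minimal PyTorch sketch of the idea (not the actual implementation in this PR; names are illustrative), using `isab_num_latents = 6` style latents:

```python
import torch
from torch import nn


class ISAB(nn.Module):
    """Induced-Set Attention Block (Set Transformer, Lee et al. 2019).

    Two cross-attention passes through m learned latents: the latents
    attend to the inputs, then the inputs attend to the updated latents.
    Using two attention modules is why an ISAB block carries roughly
    twice the attention parameters of a single-attention block.
    """

    def __init__(self, dim, heads=4, num_latents=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq, dim)
        b = x.shape[0]
        lat = self.latents.unsqueeze(0).expand(b, -1, -1)
        h, _ = self.attn1(lat, x, x)           # latents attend to inputs
        out, _ = self.attn2(x, h, h)           # inputs attend to latents
        return out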
clean PR