Labels: core (Improves core model while keeping core idea intact), research (Creative project that might fail but could give high returns)
Description
Our model spends a very large share of its parameters on the embedding and output layers: 2 * vocab_size * devices * features in total. With features = 256 and devices = 256 for the planned 20B model, this works out to 4.2B + 4.2B parameters with the GPT-2 tokenizer, purely for the two embedding matrices.
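A quick back-of-the-envelope check of those figures (plain Python); note they line up if vocab_size is rounded up to 2**16 = 65536, slightly above the GPT-2 tokenizer's 50257 tokens, which is an assumption on my part:

```python
# Sanity check of the parameter counts quoted above.
features = 256
devices = 256
hidden = features * devices   # model width per token: 65,536
vocab = 2 ** 16               # GPT-2 vocab (50,257) rounded up to 65,536

per_matrix = vocab * hidden   # one embedding matrix: ~4.29B parameters
total = 2 * per_matrix        # input embedding + output projection: ~8.59B
print(f"per matrix: {per_matrix / 1e9:.2f}B, total: {total / 1e9:.2f}B")
```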
For example, ALBERT used factorized embeddings, which here would reduce the parameter count from 2 * 256 * 256 * vocab ≈ 8.59B to 2 * 256 * 256 * sqrt(vocab) ≈ 33.5M (with vocab = 65536, sqrt(vocab) = 256).
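A minimal sketch of what such a factorization could look like in JAX; the names (E_FACTOR, embed_tokens) and the initialization are illustrative assumptions, not our codebase's actual API:

```python
# ALBERT-style factorized embedding: V x H becomes V x E plus E x H.
import jax
import jax.numpy as jnp

VOCAB = 65536          # GPT-2 vocab, rounded up to 2**16 as in the estimate above
HIDDEN = 256 * 256     # devices * features = 65,536
E_FACTOR = 256         # intermediate embedding size, sqrt(VOCAB); an assumption

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)

# Instead of one VOCAB x HIDDEN matrix (~4.29B parameters), store two factors:
tok_embed = jax.random.normal(k1, (VOCAB, E_FACTOR)) * 0.02   # ~16.8M parameters
up_proj = jax.random.normal(k2, (E_FACTOR, HIDDEN)) * 0.02    # ~16.8M parameters

def embed_tokens(token_ids: jnp.ndarray) -> jnp.ndarray:
    """Look up the small embedding, then project up to the full model width."""
    return tok_embed[token_ids] @ up_proj

hidden = embed_tokens(jnp.array([1, 2, 3]))
print(hidden.shape)  # (3, 65536)
```

With E = sqrt(vocab) = 256, each factor holds ~16.8M parameters, so even keeping separate factorized input and output matrices stays in the tens of millions rather than the billions.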