
Compact Loss #79

@ClashLuke

Description


Our model uses a lot of parameters for the output layer. Specifically, it uses 2 * vocab_size * devices * features, where features=256 and devices=256 for the planned 20B model, which implies roughly 4.2B + 4.2B parameters with the GPT-2 tokenizer purely for the embedding matrices.
For comparison, ALBERT used factorized embeddings; applied here, that would reduce the parameter count from 2 * 256 * 256 * vocab = 8.59B to 256 * 256 * sqrt(vocab) * 2 = 33.5M.
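A minimal JAX sketch of what that ALBERT-style factorization could look like, assuming the vocab is padded to 2**16, a model width of devices * features = 65536, a bottleneck of sqrt(vocab) = 256, and weight tying between the input embedding and the output projection (which is what makes the 33.5M figure work out). The names are illustrative, not from the actual codebase:

```python
import jax
import jax.numpy as jnp

VOCAB = 65536        # GPT-2 vocab padded to 2**16 (assumption)
WIDTH = 256 * 256    # devices * features = 65536
BOTTLENECK = 256     # ~sqrt(VOCAB), the factorization rank

def init_params(key):
    k0, k1 = jax.random.split(key)
    scale = 0.02
    return {
        # vocab -> bottleneck and bottleneck -> width; reused (tied) for the
        # output projection below
        "vocab_to_bottleneck": jax.random.normal(k0, (VOCAB, BOTTLENECK)) * scale,
        "bottleneck_to_width": jax.random.normal(k1, (BOTTLENECK, WIDTH)) * scale,
    }

def embed(params, token_ids):
    # (batch, seq) int32 -> (batch, seq, WIDTH)
    return params["vocab_to_bottleneck"][token_ids] @ params["bottleneck_to_width"]

def logits(params, hidden):
    # (batch, seq, WIDTH) -> (batch, seq, VOCAB), tied to the input factors
    return hidden @ params["bottleneck_to_width"].T @ params["vocab_to_bottleneck"].T

params = init_params(jax.random.PRNGKey(0))
n_factorized = sum(p.size for p in params.values())
n_full = 2 * VOCAB * WIDTH  # separate full input + output embedding matrices
print(f"factorized (tied): {n_factorized / 1e6:.1f}M parameters")  # ~33.6M
print(f"full embeddings:   {n_full / 1e9:.2f}B parameters")        # ~8.59B
```

Without tying, the factorized variant would still only cost about 67M parameters, so either way the savings compared to the full 8.59B are dominated by the rank-256 bottleneck.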


Labels: core (Improves core model while keeping core idea intact), research (Creative project that might fail but could give high returns)
