Qwen 8B from scratch: context length setup + KL vs CE #17

@Walid-Ahmed

Description

Thanks for the great work!

I’m having issues training Qwen 8B from scratch using SpecForge: I’m getting a low acceptance length, and I want to confirm my context-length setup.
My understanding is that the training context length should be longer than the generation context length to account for the prompt plus the response. Right now, I regenerate data with a 2k context, train with a 4k context window, and evaluate benchmarks with max_new_tokens = 2k. Is this configuration correct, or should I change the training/eval context lengths?
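To make the arithmetic behind that setup explicit (the function and parameter names below are hypothetical, not SpecForge flags): the training window has to cover the prompt plus everything generation can append, which is what motivates training at 4k while generating 2k new tokens.

```python
def fits_in_train_window(prompt_len: int, max_new_tokens: int, train_ctx_len: int) -> bool:
    """True if a prompt plus the generated continuation fits in the training context window."""
    return prompt_len + max_new_tokens <= train_ctx_len

# With a 4k training window and max_new_tokens = 2k, prompts up to 2k tokens fit;
# longer prompts would overflow the window and get truncated during training.
```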

Also, in my initial runs the KL loss gave better results than CE. Has anyone seen similar behavior, and are there recommended KL settings (e.g., temperature/logit scaling)?
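For concreteness, here is a minimal pure-Python sketch of the kind of temperature-scaled KL loss I mean (this is an illustration of standard knowledge-distillation practice, not SpecForge's actual implementation; `T` is the temperature knob I'm asking about):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax of a list of logits."""
    z = [x / T for x in logits]
    m = max(z)                               # subtract max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def kl_loss(teacher_logits, student_logits, T=1.0):
    """KL(teacher || student) with both distributions softened by temperature T.

    The T**2 factor keeps gradient magnitudes comparable across temperatures,
    as is conventional in knowledge distillation.
    """
    p = softmax(teacher_logits, T)           # target (teacher) distribution
    q = softmax(student_logits, T)           # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T**2
```

Higher `T` flattens both distributions, so the student is trained to match the teacher's full distribution over tokens rather than just its argmax, which may be why KL behaves differently from CE here.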

Thanks!
