Qwen 8B from scratch: context length setup + KL vs CE #17

@Walid-Ahmed

Description

Thanks for the great work!

I’m having issues training Qwen 8B from scratch using SpecForge: I’m getting a low acceptance length, and I want to confirm my context-length setup.
My understanding is that the training context length should be longer than the generation context length to account for the prompt plus the response. Right now, I regenerate data with a 2k context, train with a 4k context window, and evaluate benchmarks with max_new_tokens = 2k. Is this configuration correct, or should I change the training/eval context lengths?
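To make the arithmetic behind that setup explicit (the function and parameter names below are hypothetical, not SpecForge flags): the training window has to cover the prompt plus everything generation can append, which is what motivates training at 4k while generating 2k new tokens.

```python
def fits_in_train_window(prompt_len: int, max_new_tokens: int, train_ctx_len: int) -> bool:
    """True if a prompt plus the generated continuation fits in the training context window."""
    return prompt_len + max_new_tokens <= train_ctx_len

# With a 4k training window and max_new_tokens = 2k, prompts up to 2k tokens fit;
# longer prompts would overflow the window and get truncated during training.
```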

Also, in my initial runs the KL loss gave better results than CE. Has anyone seen similar behavior, and are there recommended KL settings (e.g., temperature/logit scaling)?
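For concreteness, here is a minimal pure-Python sketch of the kind of temperature-scaled KL loss I mean (this is an illustration of standard knowledge-distillation practice, not SpecForge's actual implementation; `T` is the temperature knob I'm asking about):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax of a list of logits."""
    z = [x / T for x in logits]
    m = max(z)                               # subtract max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def kl_loss(teacher_logits, student_logits, T=1.0):
    """KL(teacher || student) with both distributions softened by temperature T.

    The T**2 factor keeps gradient magnitudes comparable across temperatures,
    as is conventional in knowledge distillation.
    """
    p = softmax(teacher_logits, T)           # target (teacher) distribution
    q = softmax(student_logits, T)           # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T**2
```

Higher `T` flattens both distributions, so the student is trained to match the teacher's full distribution over tokens rather than just its argmax, which may be why KL behaves differently from CE here.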

Thanks!
