Where can I find the uploaded training corpus? I read the paper, and it seems the pre-training corpus was a mixture of DCLM-Baseline and SlimPajama from Cerebras. Unfortunately, that does not give the full picture, since the recipe can vary.
Is there a reliable source for where the pre-training corpus can be found? I need it in order to use this model in my paper.