Conversation
```python
try:
    loss = model[0].eval_batch(data_iterator)  # average loss per sample per microbatch
    # difficult to know if it is the right way to get the total loss
    loss = loss * args.micro_batch_size * args.seq_length  # losses per token
```
Why do you want a total loss, and not an average loss?
I am not sure micro_batch_size is the correct one: it is the batch size per GPU; the effective batch size is macro_batch_size.
I would suggest saving the average loss per token AND the total number of tokens in the dataset (separately), so that we can choose between the stats (average / total) and run checks based on the number of tokens.
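For illustration, a minimal sketch of keeping the two statistics separately (the names `avg_loss_per_token` / `num_tokens` and the values are made up for this sketch, not taken from the PR):

```python
import math

# Hypothetical saved stats: average loss per token plus the raw token count.
stats = {'avg_loss_per_token': 2.31, 'num_tokens': 1_000_000}  # illustrative values

# Either statistic can then be derived on demand:
total_loss = stats['avg_loss_per_token'] * stats['num_tokens']
ppl = math.exp(stats['avg_loss_per_token'])
```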
To clarify, I suggest using

```python
loss = model[0].eval_batch(data_iterator)
loss_dicts = [{'lm loss': loss, 'num_batches': 1}]
```

and aggregating the losses and numbers of batches where relevant (I think it's around line 417). Then, at the very end, normalize using the number of batches.
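A minimal sketch of that scheme, assuming an outer loop over microbatch iterators (`iterators` and the loop shape are hypothetical; the "around line 417" aggregation point refers to perplexity.py, not this sketch):

```python
loss_dicts = []
for data_iterator in iterators:  # hypothetical: one entry per evaluated microbatch
    loss = model[0].eval_batch(data_iterator)
    loss_dicts.append({'lm loss': loss, 'num_batches': 1})

# Aggregate losses and batch counts where relevant.
total_loss = sum(d['lm loss'] for d in loss_dicts)
num_batches = sum(d['num_batches'] for d in loss_dicts)

# Normalize only at the very end.
avg_loss = total_loss / num_batches
```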
ablation/perplexity.py
```python
if is_last_rank():
    ...
    val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
```
Also, it seems to me (I may be wrong) that "is_last_rank" is only True on one GPU in the multi-GPU case.
Which would mean that in multi-GPU, we would ignore the results on the other n-1 GPUs?
ablation/perplexity.py
```python
if is_last_rank():
    ...
    val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
    ppl = math.exp(min(20, val_loss))
```
Suggested change, replacing `ppl = math.exp(min(20, val_loss))` with:

```python
dist.all_reduce(val_loss, op=ReduceOp.SUM)  # mean reduction is not supported
dist.all_reduce(ppl, op=ReduceOp.SUM)
dist.all_reduce(adjusted_ppl, op=ReduceOp.SUM)
dist.all_reduce(token_ratio, op=ReduceOp.SUM)
val_loss = val_loss / NB_SHARDS
token_ratio = token_ratio / NB_SHARDS
ppl = math.exp(min(20, val_loss))
adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
```
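One caveat before trying this: torch.distributed.all_reduce operates in place on tensors, and val_loss above is a plain Python float (it came from .item()). A hedged sketch of the wrapping that would be needed, assuming a NCCL backend and that NB_SHARDS is the number of participating ranks:

```python
import math
import torch
import torch.distributed as dist

# all_reduce mutates a tensor in place; wrap the float first (NCCL needs it on GPU).
val_loss_t = torch.tensor(val_loss, device='cuda')
dist.all_reduce(val_loss_t, op=dist.ReduceOp.SUM)  # SUM + divide emulates the mean
val_loss = val_loss_t.item() / NB_SHARDS  # NB_SHARDS: assumed number of ranks
ppl = math.exp(min(20, val_loss))
```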
Thanks, I'll try it out, hope it solves the synchronization problem ;)
- Add datasets: Pile (WIP) and Stac (tiny).
- Improve the folder organization a bit.
- Add zstandard to requirements (to read datasets in .jsonl.zst format).