
fix/nan loss #685

Merged
joellidin merged 4 commits into dev from fix/nan-loss
Jan 18, 2026

Conversation

@joellidin
Collaborator

@joellidin joellidin commented Jan 18, 2026

  • (neurons) Skip batches with all masked labels
  • (neurons) Switch anneal mode to shard 5
  • Bump run version

Description

Related Issue(s)

  • Closes #[issue number]

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

  • Bug Fixes

    • Skip fully-masked training batches to prevent NaN loss and emit warnings during training and evaluation.
    • Adjusted annealing behavior to start from a different shard when anneal mode is active.
  • Documentation

    • Updated partial-migration example to reflect the corrected shard used for copying.
  • Chores

    • Version bumped to 2.1.27 and annealing decay schedule tuned (outer steps increased).


Add guards in evaluate_model and inner_steps to prevent NaN loss when
all labels in a batch are masked (-100). This occurs when batches
contain only padding or special tokens.

- Check valid_labels count before forward pass
- Log warning and skip batch if valid_labels == 0
- Clean up tensors before continuing to next batch
- Prevent cross_entropy from receiving empty loss target
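
The guard described above can be sketched without the real trainer. This is a minimal, framework-free illustration, not the code in neurons/trainer.py: `count_valid_labels`, `mean_loss`, and `run_batches` are hypothetical names, and the per-token losses are plain floats standing in for what cross_entropy would compute. The only detail taken from the PR is that -100 marks a masked label and that a batch with zero valid labels is logged and skipped.

```python
import logging
import math

IGNORE_INDEX = -100  # label value the PR describes as "masked"

logger = logging.getLogger("nan_guard_sketch")  # hypothetical logger name


def count_valid_labels(labels):
    """Count label positions that are not masked out."""
    return sum(1 for row in labels for tok in row if tok != IGNORE_INDEX)


def mean_loss(per_token_losses, labels):
    """Mean loss over unmasked positions.

    With no unmasked positions the mean is 0/0, reported here as NaN to
    mirror what a mean-reduced loss does with an empty target set.
    """
    total, n = 0.0, 0
    for loss_row, label_row in zip(per_token_losses, labels):
        for loss, tok in zip(loss_row, label_row):
            if tok != IGNORE_INDEX:
                total += loss
                n += 1
    return total / n if n else math.nan


def run_batches(batches):
    """Average the loss over batches, skipping batches with no valid labels."""
    losses, skipped = [], 0
    for i, (per_token_losses, labels) in enumerate(batches):
        # Guard: check before computing the loss, as the commit message describes.
        if count_valid_labels(labels) == 0:
            logger.warning("Skipping batch %d: all labels are masked", i)
            skipped += 1
            continue
        losses.append(mean_loss(per_token_losses, labels))
    avg = sum(losses) / len(losses) if losses else math.nan
    return avg, skipped
```

Without the guard, the single NaN from the fully masked batch would propagate into the running average; with it, the batch is counted as skipped and the average stays finite.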

Update miner and validator to use anneal shard 5 instead of shard 4.
Update documentation to reflect the new shard number in rclone migration
examples.

- Change current_shard from 4 to 5 in miner.py
- Change current_shard from 4 to 5 in validator.py
- Update docs with anneal_000005.npy examples
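
For orientation, the shard number maps onto the file name used in the docs example. The helper below is a hypothetical sketch: the zero-padded six-digit pattern is inferred only from the single name `anneal_000005.npy` mentioned in this PR, and `anneal_shard_filename` is not a function from the repository.

```python
ANNEAL_START_SHARD = 5  # value this PR sets for current_shard in anneal mode


def anneal_shard_filename(shard_index: int) -> str:
    """Build a shard file name, assuming a six-digit zero-padded index."""
    if shard_index < 0:
        raise ValueError("shard_index must be non-negative")
    return f"anneal_{shard_index:06d}.npy"
```

Under that assumption, `anneal_shard_filename(ANNEAL_START_SHARD)` gives the name used in the updated rclone examples, and the previous shard 4 would have been `anneal_000004.npy`.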
@coderabbitai

coderabbitai bot commented Jan 18, 2026

Walkthrough

Updates anneal-mode starting shard from 4 → 5 in miner, validator, and docs; adds early skipping of batches whose labels are all masked (-100) in trainer to avoid NaN loss; bumps package version to 2.1.27; updates anneal hyperparameter decay_outer_steps from 120 → 150.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Anneal shard configuration: neurons/miner.py, neurons/validator.py, docs/shared_sharded_dataset.md | Changed anneal-mode starting dataset shard selection from shard 4 to shard 5 and updated example rclone copy commands in docs; initialization sets shard epoch to 0 in anneal mode. |
| Batch validation in trainer: neurons/trainer.py | Added early-skip logic for batches with all labels equal to -100 in three locations (compute valid label count, log warning, free temp tensors, continue) to avoid NaN from cross_entropy. |
| Anneal hyperparameters: hparams/hparams.json | Adjusted anneal_mode.decay_outer_steps from 120 to 150. |
| Version bump: src/tplr/__init__.py | Updated __version__ from "2.1.26" → "2.1.27". |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • v2.1.24 #680 — Modifies anneal-mode shard selection in miner and validator and updates docs (same code paths affected).
  • feat/new anneal shard #683 — Changes anneal-mode shard selection and related docs/hparams (overlaps on shard-index updates).
  • fix/shard switching at new run #637 — Adjusts shard-selection/rotation logic and tracking in miner/validator (related shard-state changes).

Suggested reviewers

  • amiiir-sarfi
  • shivam-MBZUAI

Poem

🐇 I hopped from four to five today,

Skipped masked batches on my way,
No NaNs to cloud the training run,
New steps stretch out — one-five-oh, fun!
Version bumped, the job is done. 🎉

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title 'fix/nan loss' is a branch name format rather than a proper PR title describing the main changes. It fails to clearly summarize the actual changes (NaN loss prevention, shard switching, version bump) that reviewers need to understand. | Replace with a clear, descriptive title like 'Prevent NaN loss by skipping fully masked batches' or 'Fix NaN loss and switch anneal shard to 5'. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description lists three bullet points summarizing changes but lacks depth on the 'why' and lacks detail on rationale. Most of the template checkboxes remain unchecked, and no proper description section explains the motivation or impact of these changes. | Add a fuller Description section explaining why NaN loss occurs and why the shard/version changes matter. Check relevant Type of Change and Code Quality checkboxes to clarify author's self-review status. |


@codecov

codecov bot commented Jan 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (57.69%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

@@           Coverage Diff           @@
##              dev     #685   +/-   ##
=======================================
  Coverage   57.69%   57.69%           
=======================================
  Files          27       27           
  Lines        4990     4990           
=======================================
  Hits         2879     2879           
  Misses       2111     2111           

| Files with missing lines | Coverage Δ |
|---|---|
| src/tplr/__init__.py | 100.00% <100.00%> (ø) |

Change from 120 to 150 to compensate for now always gathering the full 20
peers.
@joellidin joellidin merged commit b48cad2 into dev Jan 18, 2026
7 of 8 checks passed
@joellidin joellidin deleted the fix/nan-loss branch January 18, 2026 14:02
@coderabbitai coderabbitai bot mentioned this pull request Jan 18, 2026
21 tasks