While training LSTMiniNet, memory usage increases continuously with every epoch and is never released. On longer training runs, this eventually exhausts available memory and crashes the training process.
This occurs on the CPU build (C++20, no external ML/DL libraries) and appears related to how computational graphs or intermediate state are handled within each epoch.
✅ Expected Behavior
Memory usage should remain stable across epochs (aside from temporary batch allocations).
After an epoch completes, no leftover graph nodes, gradients, or matrices should persist.
📊 Actual Behavior
Memory usage grows linearly with the number of epochs.
On long runs this exhausts available memory and the process is terminated.
🔍 Possible Causes
- **Reverse AutoDiff Engine:** `shared_ptr` objects from previous epochs may not be released due to lingering references (e.g. reference cycles or cached back-pointers into the graph).
- **Gradient accumulation:** gradients may keep accumulating without a reset between epochs.
- **Matrix allocations:** intermediate matrices might not be destroyed after use.
📌 Additional Context
This issue makes it difficult to run training for more than a few epochs.
Fixing it likely requires ensuring that:
1. Graph nodes from one epoch don’t persist into the next.
2. Gradients and states are reset after each step/epoch.
3. Matrices are properly destructed when out of scope.
⚠️ Request: Could contributors suggest best practices for cleaning up autodiff graphs in C++ (shared_ptr cycle breaking, epoch cleanup, etc.)?