Skip to content

Prototype/incremental sim#1718

Open
bradNASA wants to merge 240 commits intodevelopfrom
prototype/incremental-sim
Open

Prototype/incremental sim#1718
bradNASA wants to merge 240 commits intodevelopfrom
prototype/incremental-sim

Conversation

@bradNASA
Copy link
Copy Markdown
Contributor

@bradNASA bradNASA commented Aug 4, 2025

  • Review: By file with walkthroughs
  • Merge strategy: Merge (no squash)

Description

Implementation of in-memory incremental simulation. By default, this will be turned off except for scheduling.

Details can be found in presentations (that should be downloaded as pptx instead of viewing on web):

Also see the Incremental Simulation discussion.

Merging these changes avoids maintaining this as a separate version, which is problematic for architecture changes that do not consider this version. Updating this branch for new Aerie versions developed in parallel has been almost as much work as the implementation.

Verification

Tests added: IncrementalSimTests, EdgeCaseTests, GeneratedTests

The EdgeCaseTests and GeneratedTests tests validate against a past version of the develop branch copied as a module (merlin-driver-develop). Other implementations may also be compared; and alternative version of incremental sim is included).

Passes all tests

Documentation

No documentation has been invalidated. Documentation is needed.

Future work

Work needed before merging

  • Need a mechanism for turning it on/off while keeping it on for all tests.
  • Ensure simulation is not incremental with the sim configuration changes.
  • Compare performance to develop branch

Future work potentially beyond this pull request

  • IncrementalSimulationFacade needs to be hooked in and debugged--scheduling currently does not use it.
  • Safeguard memory usage; offload in-memory simulations.
  • Store incremental simulation results; currently, the full simulation results are constructed and stored, a bottleneck.
  • Enable UI to update incremental simulation results.
  • UI visualization and controls for turning on/off incremental sim and applicability of incremental sim
  • Coalesce chains of SimulationEngines into one for performance/memory improvement.
  • Tests that validate understanding of how incremental sim works, e.g.. assertDirectiveNotRerun(ActivityDirectiveId)
  • Tests that verify that performance has not changed significantly
  • Performance history/trending across versions would be nice
  • Add back the Aerie 2.1(?) version of merlin-driver to compare. It was replaced with Aerie 3.0.1.
  • Combine with checkpoint restart (persistent checkpoints) to do incremental sim without blowing up memory.
  • Incremental sim for sim configuration changes
  • Others found in discussion

mattdailis and others added 30 commits February 13, 2023 10:59
Need to see if currentTime is a problem for tasks running in the past.
The TimePoint.Deltas in TemporalEventSource make dealing with old and new timelines too awkward.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SimulationResultsInterface was added so that the results of multiple SimulationEngines can be combined lazily.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lets you exclude running specific tests with something like this:
./gradlew test -PexcludeTests="**/IntegrationTest*"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Helps with debug

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to know why equals() and maybe hashCode() are necessary.

Copy link
Copy Markdown
Contributor Author

@bradNASA bradNASA Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added switch to optionally run newly added daemon tasks in the mission model since only some newly added tests need it.

@JoelCourtney
Copy link
Copy Markdown
Contributor

🫨

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Aug 9, 2025

Quality Gate Failed Quality Gate failed

Failed conditions
24.9% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

bradNASA and others added 9 commits December 30, 2025 10:59
The counter 'i' is used to differentiate old and new TemporalEventSource
instances during incremental simulation, helping track which event sources
belong to which simulation engine in the chain.
- Created INCREMENTAL_SIM_PR_SUMMARY.md: Comprehensive analysis of all
  changes in the incremental simulation PR, organized by module
- Added to .gitignore: docker-compose.override.yml, output*.json,
  report-tests.sh (local development/testing files)
Merged 136 commits from origin/develop to bring the incremental simulation branch up to date.

Conflicts resolved:
- orchestration-utils/.../SimulationResultsWriter.java: Updated to use SimulationResultsInterface with accessor methods (getRealProfiles(), getDiscreteProfiles(), etc.)
- stateless-aerie/.../CLIArgumentsTest.java: Accepted develop's test assertion fix
- stateless-aerie/.../simpleFooPlanResults.json: Accepted develop's test fixture with updated JSON structure

Key changes from develop:
- JUnit 5.10.0 → 6.0.1 upgrade
- New @description and @subSYStem annotations for mission models
- Workspace permissions and action-server improvements
- Database migrations (27-30)
- Various bug fixes and dependency updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Updated simpleFooPlanResults.json test fixture to match new JSON output format from develop
  - Field ordering changed in spans (directiveId, parentId, childIds now before type)
  - Activity and topics ordering changed
  - Removed extraneous blank line at start of file
- Added .claude/settings.local.json to .gitignore for local-only settings
- Added TEST_FAILURES_POST_MERGE.md documenting merge test results
  - stateless-aerie tests now pass (2 failures fixed)
  - merlin-driver-test has flaky tests (known issue, not caused by merge)

Test results:
- stateless-aerie: All tests passing ✓
- merlin-driver-test: 11 flaky test failures (pre-existing issue)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added comprehensive documentation of merlin-driver-test flaky behavior:
- Root cause analysis (race condition in TestContext static field)
- Two attempted fixes and why they failed (ThreadLocal approach broke sharing)
- Test failure patterns across multiple runs (0-13 failures, timing-dependent)
- Recommendations for future fixes if needed

Updated CLAUDE.md with documentation structure:
- docs/investigations/ for technical writeups
- .tmp/ directory for scratch work (already in .gitignore)

Updated .gitignore to exclude local development files:
- docker-compose override files
- output*.json test artifacts
- report-tests.sh helper script
- .claude/settings.local.json

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Merged 136 commits from origin/develop into prototype/incremental-sim.
This brings the prototype branch up to date with latest develop changes.

Key changes:
- Updated JUnit 5.10.0 → 6.0.1
- JSON serialization improvements (field ordering changes)
- Database migrations 27-30
- New @description and @subSYStem annotations

Test results after merge:
- stateless-aerie: 2 failures FIXED (updated test fixtures)
- merlin-driver-test: 0-1 flaky failures (pre-existing, not caused by merge)
- All other modules: PASSING

Flaky test investigation documented in:
docs/investigations/flaky-tests-investigation-2025-12-30.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Implemented ThreadLocal context isolation with explicit context passing
to eliminate race conditions in test execution.

Problem:
- TestContext was a static field shared across main and worker threads
- Worker threads executed tasks asynchronously, accessing context
- Race condition: worker could access context before it was set or after
  it was cleared, causing intermittent NullPointerExceptions
- Tests showed 0-11 flaky failures depending on timing

Solution:
1. Made TestContext use ThreadLocal for per-thread isolation
2. Added contextQueue to pass context from main thread to worker thread
3. Worker thread explicitly receives and sets context in its ThreadLocal
4. Context is updated on task resume (after delay/call/wait)

Changes:
- TestContext: Changed from static field to ThreadLocal, added
  setContext()/clearContext() for manual lifecycle management
- ThreadedTask: Added contextQueue to TaskThread record, modified step()
  to pass context to worker, updated start() to receive and set context
- Worker yield methods (delay/call/waitUntil): Update context on resume

Results:
- Before: 0-11 flaky failures (highly variable)
- After: Consistently 1 failure (IncrementalSimPropertyTests, unrelated)
- 10+ test runs show stable behavior with only property test failing

The remaining IncrementalSimPropertyTests failure is a separate issue
unrelated to TestContext race conditions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Extracted from jqwik property test (seed: -1474950772479154912).

test8: Minimal shrunk version showing activities deleted at T=3584
       and topics not being unmarked when re-added.

test9: Full scenario demonstrating EventGraph.Concurrently grouping
       differences between regular and incremental simulation.
       At T=4, events that should be concurrent (39493 | 39493)
       appear sequential (39493; 39493) in incremental sim.

These explicit test cases ensure the bug is reproducible in CI
regardless of jqwik seed randomization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants