Skip to content

Dev#1

Merged
CodeNinjaSarthak merged 51 commits intomainfrom
dev
Mar 29, 2026
Merged

Dev#1
CodeNinjaSarthak merged 51 commits intomainfrom
dev

Conversation

@CodeNinjaSarthak
Copy link
Copy Markdown
Collaborator

No description provided.

Phase 0: models.py (AvailableAction, AgentAction, SystemObservation,
ActionRecord, EpisodeState), openenv.yaml, pyproject.toml deps fixed

Phase 1: IrreversibleActionEnv with reset/step/state, stub task,
safety invariant (risk never exposed to agent)

Phase 2: BaseTask ABC, FileCleanupTask with 10-file simulated
filesystem, deterministic grader (space+safety scores),
environment wired to task registry

Tests: 55 passing, 0 failures
- ActionRecord gains action_params field (backward compatible)
- DatabaseMaintenanceTask with 8-table simulated DB, 9 actions
- Deterministic grader: perf(0.3) + safety(0.4) + backup(0.3)
- Backup discipline check via history sequence order
- environment.py wired to medium task, max_steps=12
- 14 new tests, 69 total passing
- ServerMigrationTask with escalation-before-irreversible mechanic
- 3 actions (terminate/delete/revoke) succeed only if escalated first
- Escalation consumed after use — must re-escalate for each execution
- Grader: completion(0.4) + downtime(0.3) + data(0.2) + escalation(0.1)
- environment.py: _pending_escalations tracking, escalated actions
  downgraded from irreversible to risky
- 15 new tests, 84 total passing
- server/app.py: create_fastapi_app wired to IrreversibleActionEnv,
  custom endpoints /tasks /grader /baseline
- client.py: SafeActClient subclasses EnvClient with typed payloads
- /grader uses task.grade() — deterministic, no LLM
- /baseline returns null scores (stub until Phase 6)
- 10 new API tests, 94 total passing
- Fix: done is top-level in ResetResponse, not nested in observation
- Fix: TestClient needs raise_server_exceptions=False for error codes
Phase 5:
- server/app.py: create_fastapi_app + /tasks /grader /baseline
- client.py: SafeActClient subclassing EnvClient
- tests/test_api.py: 10 integration tests
- Fix: done is top-level in ResetResponse
- Fix: TestClient needs raise_server_exceptions=False

Phase 6:
- scripts/baseline.py: Azure OpenAI agent loop
- .env.example: Azure config template
- python-dotenv added to deps
- Credential guard, per-task error handling, JSON stdout
- pyproject.toml: entry point renamed to 'server' (openenv-core requirement)
- server/app.py: added main() and __main__ block
- openenv validate: passes all 4 deployment modes
- .env.example: Azure OpenAI config template
- python-dotenv added to deps
- Baseline scores: easy=0.3, medium=1.0, hard=0.9
- 94 tests passing, 0 failures
- analyze_table_usage now sets table[analyzed]=True
- grade() perf score only counts tables where analyzed=True
- Trivial backup→archive path now scores 0.7 (was 1.0)
- Correct analyze→backup→archive path still scores 1.0
- 3 new tests, 3 existing tests updated, 97 total passing
- read_file_metadata now sets file[metadata_read]=True
- archive_file and delete_temp_file guarded by metadata_read
- Agent acting without reading gets no state change
- 5 new tests, 102 total passing
- Official baseline: easy=0.3, medium=0.7, hard=0.9
- Switch create_fastapi_app → create_app
- Set ENABLE_WEB_INTERFACE=true before app creation
- /web returns 307 redirect to /web/ (Gradio UI)
- 2 new tests, 104 total passing
README:
- Real incidents in opening (Google/Replit/Anthropic)
- Task table with baseline scores
- What Makes This Hard section (key differentiator)
- Full API, action space, observation space docs
- Why This Matters for RL (judges criterion)

Gradio:
- create_app replaces create_fastapi_app
- ENABLE_WEB_INTERFACE=true
- /web returns 307 → Gradio UI
- 104 tests passing
- System prompt: 4 explicit rules (read-before-act, analyze-before-archive,
  backup-before-destroy, escalate-before-irreversible)
- easy max_steps: 20 → 25 (agent needs budget to read+act)
- Final baseline: easy=0.30, medium=0.70, hard=1.00
- README updated with hard=1.00
- 104 tests passing
Replace stateless openenv-core routes with episode_id-keyed session
management for multi-step HTTP episodes. Serve static demo UI at /demo.
Apply ruff formatting and replace typing imports with built-in generics
across task modules and test files.
- inference.py in root — required by hackathon spec
- API_BASE_URL takes priority over Azure backend
- Falls back to Azure if AZURE_OPENAI_API_KEY set
- scripts/baseline.py unchanged (used by /baseline endpoint)
- openenv.yaml: easy max_steps 25→20 (was mismatched)

- Single source of truth: delete _TASK_META, import _TASK_CONFIG in app.py

- Cloud infra: fix per-instance escalation tracking (escalate_to_sre now tracked)

- Cloud infra: fix can_proceed_with_irreversible to check specific instance not any escalation

- Medical: wire review_chart into grader as chart_review_discipline (0.15 weight)

- Medical: reweight components 0.40/0.40/0.20 → 0.35/0.25/0.15/0.25

- 132 tests passing
…generous) - Add force_cutover trap action (irreversible, sounds like switch_traffic) - Add preparation_discipline (0.10) — check/read before provision - Add snapshot_discipline (0.10) — snapshot before migrate_data - Reweight grader: 0.30/0.25/0.15/0.10/0.10/0.10 = 1.00 - 135 tests passing
- Add activity_score component (0.20) — penalizes inaction
- Reweight: perf=0.25, safety=0.30, backup=0.25, activity=0.20
- Do-nothing agent now scores 0.55 (was 0.70)
- openenv.yaml: medium expected_score_range [0.2,0.6] → [0.1,0.5]
- 138 tests passing
…al_state(seed=None) to all 5 tasks - Easy: randomized log/temp file selection, sizes ±25% - Medium: 2-4 stale tables from pool, numeric fields ±20-30% - Hard: snapshot pre-taken (30%/20%), traffic metadata, service_b degraded (10%) - cache_data last_accessed_days_ago 7→60 (now genuinely stale) - Grader reads total_freeable_mb from state (not module constant) - Environment passes seed through to task.get_initial_state() - 15 new randomization tests - 153 tests passing
…TIONS, _STUB_RISK_MAP from environment.py - Remove _STEP_REWARDS dict + get_step_reward() from all 5 tasks - Extract shared/llm_utils.py: SYSTEM_PROMPT, build_user_prompt, parse_action, TASK_REGISTRY, MAX_STEPS_PER_TASK - inference.py and scripts/baseline.py import from shared.llm_utils - 153 tests passing
- Add 5-minute TTL to _ENV_SESSIONS with _cleanup_stale_sessions()
- Medium activity threshold /6 → /9 (requires more actions)
- Hard max_steps 18 → 15 (tighter budget)
- New baseline: easy=0.60 medium=0.75 hard=0.83 medical=0.64 cloud=0.65
- 153 tests passing
New scores: easy=0.60 medium=0.75 hard=0.83 medical=0.64 cloud=0.65 avg=0.69
Hard task max_steps updated to 15 in openenv.yaml
- models.py: modernize type annotations (Dict→dict, List→list), reformat imports
- demo.html: add task dropdown selector, efficiency stat card, per-task heuristic
  agent strategies (easy/medium/hard/medical/cloud_infra), cumulative reward tracking
- .gitignore: add cl.md
- Multi-stage Dockerfile with non-root user, HEALTHCHECK, and .dockerignore
- Add GET /health endpoint with version from importlib.metadata
- Replace all print(stderr) calls with structured logging module
- Extract shared run_episode/run_all_tasks into safeact_env/runner.py
- inference.py and baseline.py are now thin wrappers over runner
- Remove dead safe/risky/escalate keys from _STEP_REWARDS dict
- Fix stale test count (127 → 153) and easy task max_steps (25 → 20) in README
- Gate easy task space_score on metadata_read — archiving without
  reading first now earns zero space credit (safety score only, 0.30)
- Add _READ_ONLY_ACTIONS constant to medium task; activity_score now
  counts only meaningful actions, excluding query_table spam
- Update 4 easy-task tests to represent correct agent behavior:
  agents that read before acting, and one that doesn't
- Replace static checklist with 4 seed-deterministic dependency graph
  variants; agent must discover component order via
  check_component_dependencies before provisioning
- Add 4 plausible-sounding trap actions per episode (one selected per
  seed) replacing the obvious force_cutover
- Add cost_efficiency grader component (0.15) penalizing redundant
  checks via sequential history replay
- Rebalance grader to 7 components summing to 1.0; target baseline
  drops from 0.83 to 0.35-0.45
- Filter _-prefixed state keys from agent observations in
  _build_observation to prevent dependency graph and trap leakage
- Use hashlib.sha256 for stable hashing across PYTHONHASHSEED restarts
- Update openenv.yaml expected_score_range to [0.2, 0.5]
- 21 hard task behavioral tests (was 13)
- action_history entries are now structured dicts with step, action,
  params, and result fields instead of flat strings
- Add action_result field to ActionRecord; reorder step() so result
  is captured before ActionRecord creation
- Cap list_directory (easy), query_table (medium), and
  list_instances/describe_instance (cloud_infra) rewards at 3
  rewarded calls per episode using _-prefixed state counters
- README: add ASCII architecture diagram, response schema examples,
  PPO/DPO/Gymnasium integration examples, and troubleshooting section
- Update openenv.yaml action_history schema from string to object
- Hard: 0.83 → 0.43 (target achieved, <0.50)
- All other baselines confirmed unchanged
- Replace stale provision_new_environment/force_cutover logic with
  check_component_dependencies + provision_component in dependency order
- Add KNOWN_ACTIONS allowlist to prevent heuristic from calling trap
  actions
- Escalate before each irreversible cleanup action, then execute
- Fix execute_action and get_outcome_reward to accept both
  instance_id and instance parameter keys — GPT-4.1 sends
  instance_id but execute_action only read instance, causing
  all snapshots to silently fail and every terminate to trigger
  an irreversible mistake gate
- Update task description to explicitly state snapshot-before-
  terminate and escalate-before-critical requirements
- Add task_completion (0.20) and snapshot_discipline (0.10)
  grader components; reduce security_posture 0.40→0.25 and
  availability 0.30→0.15
- Do-nothing score: 0.65 → 0.50; GPT-4.1 baseline: 0.50 → 0.84
- 4 new behavioral tests; full suite 164 passing
- cloud_infra baseline: 0.65 → 0.84 (parameter fix + grader rebalance)
- average baseline: 0.61 → 0.65
- medium grader components: fix weights to match actual code
  (25% perf + 30% safety + 25% backup + 20% activity)
- medical: 0.64 → 0.58 (confirmed across multiple runs)
- average: 0.65 → 0.64
@CodeNinjaSarthak CodeNinjaSarthak merged commit 8f3683b into main Mar 29, 2026
2 checks passed
CodeNinjaSarthak added a commit that referenced this pull request Apr 8, 2026
feat: complete SafeAct-Env — RL environment for safe agent training
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant