diff --git a/CHANGELOG.md b/CHANGELOG.md
index 540b221..bfd3cce 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -18,6 +18,27 @@ The format is inspired by Keep a Changelog and this project follows Semantic Ver

 - (none yet)

+## 1.8.0 - 2026-04-05
+
+### Added
+
+- New `syslog-analysis` investigation template matching keywords: log, syslog, journal, event, audit. Runs `read_syslog` → `audit_account_changes` → `inspect_persistence_locations` (#141).
+- New `enumerate_ssh_keys` tool: cross-platform SSH key enumeration scanning `.ssh` directories for authorized_keys, private keys, and public keys (#141).
+- New `--task-template syslog-summary` maps to the `syslog-analysis` investigation template (#141).
+
+### Changed
+
+- **Severity calibration**: raised listener thresholds (Info <50, Low 50–149, Medium 150–249, High ≥250), lowered account severity (1 account → Low, 3–4 → Medium, ≥5 → High), raised persistence thresholds (Low <3, Medium 3–7, High ≥8). Normal desktops no longer trigger spurious high-severity findings (#139).
+- **Findings detail**: finding titles now include specifics — account names, persistence entry text, and SSH directory info — instead of bare counts (#140).
+- **Parameter estimation**: quantization-aware divisor replaces the hardcoded 2.2. Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. Detected from model filename conventions (#138).
+- **Template tool ordering**: `file-integrity-check` now leads with `hash_binary` (was `audit_account_changes`), `ssh-key-investigation` now leads with `enumerate_ssh_keys` (#141).
+
+### Fixed
+
+- **KV-cache attention mask**: prefill attention length now accounts for forced cache padding when the model lacks a `use_cache` toggle, preventing shape broadcast errors on models like Qwen2.5 and Llama 3.2 (#136).
+- **ReAct garbage output**: when the model produces a `` tag at step 0 without calling any tools, the agent falls back to template-driven execution. Quality guard now detects hallucinated `` tags and `[observation]` markers inside final answers and replaces them with a deterministic summary (#137).
+- **EP reporting**: `detect_execution_provider()` now recognises DirectML and CUDA backend overrides instead of always reporting CPU (#142).
+
 ## 1.7.1 - 2026-04-05

 ### Changed
diff --git a/Cargo.lock b/Cargo.lock
index d0c7726..7e4b97b 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -83,7 +83,7 @@ checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c"

 [[package]]
 name = "api_server"
-version = "1.7.1"
+version = "1.8.0"
 dependencies = [
   "anyhow",
   "axum",
@@ -314,7 +314,7 @@ checksum = "a6ef517f0926dd24a1582492c791b6a4818a4d94e789a334894aa15b0d12f55c"

 [[package]]
 name = "core_engine"
-version = "1.7.1"
+version = "1.8.0"
 dependencies = [
   "anyhow",
   "async-trait",
@@ -377,7 +377,7 @@ dependencies = [

 [[package]]
 name = "cyber_tools"
-version = "1.7.1"
+version = "1.8.0"
 dependencies = [
   "async-trait",
   "serde",
@@ -807,7 +807,7 @@ dependencies = [

 [[package]]
 name = "inference_bridge"
-version = "1.7.1"
+version = "1.8.0"
 dependencies = [
   "anyhow",
   "async-trait",
@@ -2139,7 +2139,7 @@ dependencies = [

 [[package]]
 name = "wraithrun"
-version = "1.7.1"
+version = "1.8.0"
 dependencies = [
   "anyhow",
   "api_server",
diff --git a/Cargo.toml b/Cargo.toml
index c2e9ed8..ea240df 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -10,7 +10,7 @@ resolver = "2"

 [workspace.package]
 edition = "2021"
-version = "1.7.1"
+version = "1.8.0"
 license = "MIT"

 [workspace.dependencies]
diff --git a/docs/RELEASE_PLAN.md b/docs/RELEASE_PLAN.md
index 54f4d76..404056f 100644
--- a/docs/RELEASE_PLAN.md
+++ b/docs/RELEASE_PLAN.md
@@ -91,31 +91,30 @@ Release should be blocked when:
 Define `ExecutionProviderBackend` trait, provider registry, provider-agnostic config, extract Vitis/CPU backends, provider-aware doctor, CLI `--backend` flag, and multi-backend test harness.
 - `v1.4.0` Concrete Hardware Backends (milestone #17, tracking: #55): DirectML (Windows GPU), CoreML (macOS/Apple Silicon), CUDA/TensorRT (NVIDIA), QNN (Qualcomm Hexagon), non-ONNX formats (GGUF/SafeTensors), and quantization-aware loading.
+- `v1.5.0` Concrete Hardware Backends (completed).
+- `v1.6.0` Agentic Investigation Engine (completed): ReAct agent loop, task-aware LLM synthesis, temperature-scaled sampling, EP-aware debug logs, session caching, KV-cache prefix reuse.
+- `v1.7.0` Live Evaluation Hardening (completed): per-tool timing, LLM reasoning capture, evidence-derived confidence, task-specific synthesis, expanded privilege/persistence checks, tokenizer discovery.
+- `v1.7.1` Dependency Bumps (completed): toml 1.1, thiserror 2.0, sha2 0.11, CI actions v6–v8.
+- `v1.8.0` Live Evaluation Fixes (completed): KV-cache attention mask fix, ReAct garbage fallback, quantization-aware param estimation, severity recalibration, findings detail, template/tool fixes, EP reporting, syslog-analysis template, enumerate_ssh_keys tool.

-## Immediate Next Steps for v1.0.0
+## Immediate Next Steps

 Use this runbook to execute the active next milestone end-to-end.

 1. Create a tracking issue from the Release Checklist template.
-2. Apply labels `release`, `milestone:v1.0.0`, and priority labels as needed.
-3. Run milestone bootstrap workflow:
-   - Workflow: `Milestone Bootstrap`
-   - Inputs:
-     - `seed_roadmap`: `true` (upserts canonical milestones for the active roadmap set)
-     - `title`: `v1.0.0`
-     - `description`: `Local API and Web UI MVP: local server endpoints, security baseline, durable local data model, and initial triage UI.`
-     - `due_date`: optional (`YYYY-MM-DD`)
-4. Verify quality gates locally:
+2. Apply labels `release` and the target milestone label.
+3. Verify quality gates locally:
    - `cargo check`
    - `cargo test --workspace`
+   - `cargo clippy --all-targets -- -D warnings`
    - `cargo check -p inference_bridge --features vitis`
-5. Verify GitHub Actions CI is green on latest `main`.
-6. Tag and publish:
-   - `git tag -a v1.0.0 -m "Release v1.0.0"`
-   - `git push origin v1.0.0`
-7. Confirm `Release` workflow completed and assets are attached.
-8. Close the milestone and open a follow-on milestone.
-9. Open planning issue for the next milestone scope.
+4. Verify GitHub Actions CI is green on latest `main`.
+5. Tag and publish:
+   - `git tag -a vX.Y.Z -m "Release vX.Y.Z"`
+   - `git push origin vX.Y.Z`
+6. Confirm `Release` workflow completed and assets are attached.
+7. Close the milestone and open a follow-on milestone.
+8. Open planning issue for the next milestone scope.

 ## Labels and Milestones

diff --git a/docs/cli-reference.md b/docs/cli-reference.md
index 5336a9b..2db168b 100644
--- a/docs/cli-reference.md
+++ b/docs/cli-reference.md
@@ -143,7 +143,7 @@ Behavior:

 `--list-tools` output includes tool names, descriptions, and JSON argument schemas.

-Current built-in coverage includes log tailing, listener inventory, file hashing, privilege vectors, persistence inventory, account-role snapshots, process-network correlation, and baseline capture for drift workflows.
+Current built-in coverage includes log tailing, listener inventory, file hashing, privilege vectors, persistence inventory, account-role snapshots, process-network correlation, SSH key enumeration, and baseline capture for drift workflows.

 Coverage tool argument highlights:

@@ -465,11 +465,12 @@ When a free-text `--task` is provided, the agent resolves a declarative investig
 Built-in investigation templates:

 - **broad-host-triage**: default fallback. Runs all host-level tools.
-- **ssh-key-investigation**: SSH key and account audit focus.
+- **ssh-key-investigation**: SSH key enumeration and account audit focus.
 - **persistence-analysis**: autorun and persistence mechanism checks.
 - **network-exposure-audit**: listener and network binding analysis.
 - **privilege-escalation-check**: privilege escalation indicator checks.
 - **file-integrity-check**: hash verification and file integrity analysis.
+- **syslog-analysis**: log review, account audit, and persistence checks. Matches keywords: log, syslog, journal, event, audit.

 List investigation templates via `--list-task-templates`.

diff --git a/docs/getting-started.md b/docs/getting-started.md
index 529ec94..a1fc07c 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -141,6 +141,8 @@ When running in live mode, WraithRun automatically probes the loaded model to cl
 - **Moderate**: medium models. Agent uses a ReAct (Reason + Act) loop, iteratively choosing tools based on observations, then synthesizes findings via LLM.
 - **Strong**: large models (≥10B params and ≤50ms latency). Agent uses a full ReAct loop with the complete evidence window for deep iterative reasoning and synthesis.

+Since v1.8.0, parameter estimation is quantization-aware: Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. This means Q4 models are classified more accurately — a 750 MB Q4 file now correctly estimates ~1.4B parameters instead of ~0.3B.
+
 Override automatic classification when you know your model's capability:

 ```powershell
diff --git a/docs/live-mode-operations.md b/docs/live-mode-operations.md
index 10ef20d..7ffc843 100644
--- a/docs/live-mode-operations.md
+++ b/docs/live-mode-operations.md
@@ -128,6 +128,10 @@ WraithRun caches the ONNX session and tokenizer across investigation steps withi

 The agent also tracks prompt prefix reuse across steps. When consecutive prompts share a common prefix (e.g., system prompt + prior context), the prefix hit/miss ratio is logged for observability. Full KV-state reuse is scaffolded for a future release.

+Since v1.8.0, the prefill attention mask correctly accounts for forced cache padding on models that lack a `use_cache` branch toggle (#136). Previously, models like Qwen2.5 and Llama 3.2 could crash with a shape broadcast error during prefill because the attention mask length did not include the initial cache dimension.
+
+Also since v1.8.0, execution provider reporting now detects DirectML and CUDA backend overrides (#142), so `model_capability.execution_provider` in JSON output accurately reflects the active backend instead of always showing `CPUExecutionProvider`.
+
 Temperature controls affect live inference behavior:

 - `--temperature 0` (or omit): greedy decoding — fastest, fully deterministic output.
diff --git a/docs/tool-reference.md b/docs/tool-reference.md
index bab6b20..197ca2e 100644
--- a/docs/tool-reference.md
+++ b/docs/tool-reference.md
@@ -142,6 +142,23 @@ Output fields:
 - `network_risk_level`
 - `records`

+## enumerate_ssh_keys
+
+Purpose:
+
+- Enumerates SSH key material across user home directories. Cross-platform: scans Windows `%USERPROFILE%\.ssh`, `ProgramData\ssh`, and other user profiles; on Linux/macOS scans `/root/.ssh` and `/home/*/.ssh`.
+
+Arguments:
+
+- none
+
+Output fields:
+
+- `directories` (array): per-directory summary including path, `has_authorized_keys`, `private_key_count`, and `public_key_count`.
+- `total_authorized_keys_files` (integer)
+- `total_private_keys` (integer)
+- `total_public_keys` (integer)
+
 ## capture_coverage_baseline

 Purpose:
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index 28af5ab..0e24d4b 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -168,6 +168,7 @@ wraithrun --task "Investigate ..." --live --model C:/models/llm.onnx --tokenizer
 ```

 - Tier thresholds: Basic ≤2B params or ≥200ms latency; Strong ≥10B params and ≤50ms latency; Moderate is everything in between.
+- Since v1.8.0, parameter estimation is quantization-aware. Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, FP32 uses 4.4. This may reclassify models that were previously under-estimated (e.g., a Q4 model that reported 0.5B may now correctly report ~2B and shift from Basic to Moderate).

 ## Final answer looks generic or templated

@@ -179,6 +180,7 @@ Fix:

 - This happens when the model is classified as Basic tier (deterministic summary) or when LLM output quality is detected as low.
 - Since v1.6.0, Moderate/Strong tiers use a ReAct loop that typically produces richer output. If output is still generic, try `--capability-override strong` or increase `--temperature` slightly (e.g., `0.1`).
+- Since v1.8.0, the quality guard also catches hallucinated `` tags and `[observation]` markers inside the final answer. When detected, the agent replaces the garbage with a deterministic summary built from real findings. This means even Moderate/Strong tier runs may show a structured summary if the model hallucinates.

 ## Agent not calling expected tools

@@ -191,6 +193,7 @@ Fix:
 - Moderate/Strong tiers use a ReAct loop where the LLM decides which tools to call. The model may not choose the same tools as the template-driven Basic tier.
 - Increase `--max-steps` if the agent is exhausting its step budget before reaching all relevant tools.
 - If the model is too small, it may produce a `` answer immediately. Try `--capability-override strong` to allow full iterative reasoning.
+- Since v1.8.0, if the model produces `` at step 0 without calling any tools, the agent automatically falls back to template-driven execution so that real host data is still collected.
 - Check `RUST_LOG=debug` output for `react_step` entries showing the agent's reasoning at each step.

 ## Task returned a scope-boundary finding instead of running
diff --git a/docs/upgrades.md b/docs/upgrades.md
index 0ed5855..5c55770 100644
--- a/docs/upgrades.md
+++ b/docs/upgrades.md
@@ -1,5 +1,22 @@
 # Upgrade Notes

+## v1.8.0
+
+### Breaking/visible changes
+
+- **Severity thresholds recalibrated** (#139): listener, account, and persistence findings now use higher thresholds. A normal desktop with ~100 listeners and 1 non-default admin account will report Low instead of High. Automation that keys on specific severity values should be reviewed.
+- **Finding titles include specifics** (#140): finding `title` fields now embed account names, persistence entry text, and SSH directory details (e.g., `"Non-default privileged accounts observed (1): shrey"` instead of `"Non-default privileged accounts observed (1)"`). Parsers matching on exact title strings must be updated.
+- **Parameter estimation changed** (#138): quantization-aware sizing means `estimated_params_b` in `model_capability` output may change. Q4 models now report ~4× higher param counts than before. This may reclassify some models into a higher capability tier.
+- **New tool and template** (#141): `enumerate_ssh_keys` tool added to the registry; `syslog-analysis` investigation template added. Template tool ordering changed for `file-integrity-check` and `ssh-key-investigation`.
+- **ReAct fallback behavior** (#137): Moderate/Strong tier runs may now produce a deterministic summary instead of LLM-generated text when the model hallucinates. The `final_answer` field will contain a structured SUMMARY block in these cases.
+
+### Migration
+
+- No TOML config changes required.
+- If you parse `RunReport` JSON `findings[].title` strings, update matchers — titles now include entity names and entry details.
+- If automation relies on severity thresholds, review the new calibration: listener counts below 50 are now Info (was Low at 25), single non-default admin accounts are Low (was Medium).
+- The `enumerate_ssh_keys` tool is automatically included in `ssh-key-investigation` template runs. No opt-in needed.
+
 ## v1.7.1

 - Dependency-only release. No breaking API changes.
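
Illustration for the parameter-estimation change (#138), not part of the patch: the docs above quote the divisors (Q4 0.55, Q8 1.1, FP16 2.2, FP32 4.4 bytes/param) and say the quantization is detected from model filename conventions, but the diff does not show the implementation. The sketch below is one way the arithmetic could look. The names `Quant`, `detect_quant`, and `estimate_params_b`, the filename substrings, and the fallback to the old 2.2 divisor are all assumptions made for illustration and may not match the code in `inference_bridge`.

```rust
/// Hypothetical quantization levels; the real project may model this differently.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Quant {
    Q4,
    Q8,
    Fp16,
    Fp32,
}

impl Quant {
    /// Bytes-per-parameter divisors as quoted in the 1.8.0 changelog entry.
    pub fn bytes_per_param(self) -> f64 {
        match self {
            Quant::Q4 => 0.55,
            Quant::Q8 => 1.1,
            Quant::Fp16 => 2.2,
            Quant::Fp32 => 4.4,
        }
    }
}

/// Guess the quantization from a model filename.
/// The substrings checked here are assumptions, not the project's actual rules.
pub fn detect_quant(file_name: &str) -> Quant {
    let name = file_name.to_ascii_lowercase();
    if name.contains("q4") || name.contains("int4") {
        Quant::Q4
    } else if name.contains("q8") || name.contains("int8") {
        Quant::Q8
    } else if name.contains("fp32") || name.contains("f32") {
        Quant::Fp32
    } else {
        // Assumed fallback: the pre-1.8.0 hardcoded divisor of 2.2 (FP16).
        Quant::Fp16
    }
}

/// Estimated parameter count in billions: file size divided by bytes per parameter.
pub fn estimate_params_b(file_size_bytes: u64, quant: Quant) -> f64 {
    file_size_bytes as f64 / quant.bytes_per_param() / 1e9
}
```

Under these assumptions, `estimate_params_b(750_000_000, detect_quant("model-q4.onnx"))` gives roughly 1.36, while the old fixed divisor (`Quant::Fp16`) gives roughly 0.34 for the same file. That matches the "~1.4B instead of ~0.3B" example in `docs/getting-started.md` and the "~4× higher" note in `docs/upgrades.md`.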