Merged
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -18,6 +18,27 @@ The format is inspired by Keep a Changelog and this project follows Semantic Ver

- (none yet)

## 1.8.0 - 2026-04-05

### Added

- New `syslog-analysis` investigation template matching keywords: log, syslog, journal, event, audit. Runs `read_syslog` → `audit_account_changes` → `inspect_persistence_locations` (#141).
- New `enumerate_ssh_keys` tool: cross-platform SSH key enumeration scanning `.ssh` directories for authorized_keys, private keys, and public keys (#141).
- New `--task-template syslog-summary` value, which maps to the `syslog-analysis` investigation template (#141).
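
The `enumerate_ssh_keys` output documented in `docs/tool-reference.md` (a `directories` array plus three totals) could be modeled roughly as below. This is only a sketch: the struct names and the `summarize` helper are hypothetical, and it assumes `total_authorized_keys_files` equals the number of directories where `has_authorized_keys` is true, which the source does not state.

```rust
/// Per-directory summary, mirroring the documented `directories` entries.
#[derive(Debug, Clone, PartialEq)]
pub struct SshDirectorySummary {
    pub path: String,
    pub has_authorized_keys: bool,
    pub private_key_count: u64,
    pub public_key_count: u64,
}

/// Top-level shape, mirroring the documented output fields.
#[derive(Debug, Clone, PartialEq)]
pub struct SshKeyReport {
    pub directories: Vec<SshDirectorySummary>,
    pub total_authorized_keys_files: u64,
    pub total_private_keys: u64,
    pub total_public_keys: u64,
}

/// Hypothetical helper: derive the totals from the per-directory summaries.
pub fn summarize(directories: Vec<SshDirectorySummary>) -> SshKeyReport {
    // Assumption: one authorized_keys file is counted per directory that has one.
    let total_authorized_keys_files: u64 =
        directories.iter().filter(|d| d.has_authorized_keys).count() as u64;
    let total_private_keys: u64 = directories.iter().map(|d| d.private_key_count).sum();
    let total_public_keys: u64 = directories.iter().map(|d| d.public_key_count).sum();
    SshKeyReport {
        directories,
        total_authorized_keys_files,
        total_private_keys,
        total_public_keys,
    }
}
```

A consumer that only needs the headline numbers can read the three totals and ignore `directories`.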

### Changed

- **Severity calibration**: raised listener thresholds (Info <50, Low 50–149, Medium 150–249, High ≥250), lowered account severity (1 account → Low, 3–4 → Medium, ≥5 → High), raised persistence thresholds (Low <3, Medium 3–7, High ≥8). Normal desktops no longer trigger spurious high-severity findings (#139).
- **Findings detail**: finding titles now include specifics — account names, persistence entry text, and SSH directory info — instead of bare counts (#140).
- **Parameter estimation**: a quantization-aware bytes-per-parameter divisor replaces the hardcoded 2.2. Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, and FP32 uses 4.4. The quantization level is detected from model filename conventions (#138).
- **Template tool ordering**: `file-integrity-check` now leads with `hash_binary` (was `audit_account_changes`), `ssh-key-investigation` now leads with `enumerate_ssh_keys` (#141).

### Fixed

- **KV-cache attention mask**: prefill attention length now accounts for forced cache padding when the model lacks a `use_cache` toggle, preventing shape broadcast errors on models like Qwen2.5 and Llama 3.2 (#136).
- **ReAct garbage output**: when the model produces a `<final>` tag at step 0 without calling any tools, the agent now falls back to template-driven execution. The quality guard also detects hallucinated `<call>` tags and `[observation]` markers inside final answers and replaces them with a deterministic summary (#137).
- **EP reporting**: `detect_execution_provider()` now recognises DirectML and CUDA backend overrides instead of always reporting CPU (#142).
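
The KV-cache attention mask fix above comes down to sizing the prefill mask as cache padding plus prompt length. A minimal sketch of that idea follows; the function name is hypothetical, and masking the padded cache slots with `0` is an assumption, since the source only says the mask length must include the forced cache padding.

```rust
/// Build a prefill attention mask whose length covers both the forced cache
/// padding and the prompt tokens (hypothetical helper, not the project's API).
pub fn prefill_attention_mask(forced_cache_len: usize, prompt_len: usize) -> Vec<i64> {
    let mut mask = Vec::with_capacity(forced_cache_len + prompt_len);
    // Assumption: padded cache slots hold no real tokens, so they are masked out.
    mask.extend(std::iter::repeat(0_i64).take(forced_cache_len));
    // Real prompt tokens are attended to.
    mask.extend(std::iter::repeat(1_i64).take(prompt_len));
    mask
}
```

With one forced cache slot and a three-token prompt, the mask has four entries, which is the length a model without a `use_cache` toggle would expect under this assumption.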

## 1.7.1 - 2026-04-05

### Changed
10 changes: 5 additions & 5 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -10,7 +10,7 @@ resolver = "2"

[workspace.package]
edition = "2021"
-version = "1.7.1"
+version = "1.8.0"
license = "MIT"

[workspace.dependencies]
33 changes: 16 additions & 17 deletions docs/RELEASE_PLAN.md
@@ -91,31 +91,30 @@ Release should be blocked when:
Define `ExecutionProviderBackend` trait, provider registry, provider-agnostic config, extract Vitis/CPU backends, provider-aware doctor, CLI `--backend` flag, and multi-backend test harness.
- `v1.4.0` Concrete Hardware Backends (milestone #17, tracking: #55):
DirectML (Windows GPU), CoreML (macOS/Apple Silicon), CUDA/TensorRT (NVIDIA), QNN (Qualcomm Hexagon), non-ONNX formats (GGUF/SafeTensors), and quantization-aware loading.
- `v1.5.0` Concrete Hardware Backends (completed).
- `v1.6.0` Agentic Investigation Engine (completed): ReAct agent loop, task-aware LLM synthesis, temperature-scaled sampling, EP-aware debug logs, session caching, KV-cache prefix reuse.
- `v1.7.0` Live Evaluation Hardening (completed): per-tool timing, LLM reasoning capture, evidence-derived confidence, task-specific synthesis, expanded privilege/persistence checks, tokenizer discovery.
- `v1.7.1` Dependency Bumps (completed): toml 1.1, thiserror 2.0, sha2 0.11, CI actions v6–v8.
- `v1.8.0` Live Evaluation Fixes (completed): KV-cache attention mask fix, ReAct garbage fallback, quantization-aware param estimation, severity recalibration, findings detail, template/tool fixes, EP reporting, syslog-analysis template, enumerate_ssh_keys tool.

-## Immediate Next Steps for v1.0.0
+## Immediate Next Steps

Use this runbook to execute the active next milestone end-to-end.

 1. Create a tracking issue from the Release Checklist template.
-2. Apply labels `release`, `milestone:v1.0.0`, and priority labels as needed.
-3. Run milestone bootstrap workflow:
-   - Workflow: `Milestone Bootstrap`
-   - Inputs:
-     - `seed_roadmap`: `true` (upserts canonical milestones for the active roadmap set)
-     - `title`: `v1.0.0`
-     - `description`: `Local API and Web UI MVP: local server endpoints, security baseline, durable local data model, and initial triage UI.`
-     - `due_date`: optional (`YYYY-MM-DD`)
-4. Verify quality gates locally:
+2. Apply labels `release` and the target milestone label.
+3. Verify quality gates locally:
    - `cargo check`
    - `cargo test --workspace`
    - `cargo clippy --all-targets -- -D warnings`
    - `cargo check -p inference_bridge --features vitis`
-5. Verify GitHub Actions CI is green on latest `main`.
-6. Tag and publish:
-   - `git tag -a v1.0.0 -m "Release v1.0.0"`
-   - `git push origin v1.0.0`
-7. Confirm `Release` workflow completed and assets are attached.
-8. Close the milestone and open a follow-on milestone.
-9. Open planning issue for the next milestone scope.
+4. Verify GitHub Actions CI is green on latest `main`.
+5. Tag and publish:
+   - `git tag -a vX.Y.Z -m "Release vX.Y.Z"`
+   - `git push origin vX.Y.Z`
+6. Confirm `Release` workflow completed and assets are attached.
+7. Close the milestone and open a follow-on milestone.
+8. Open planning issue for the next milestone scope.

## Labels and Milestones

5 changes: 3 additions & 2 deletions docs/cli-reference.md
@@ -143,7 +143,7 @@ Behavior:

`--list-tools` output includes tool names, descriptions, and JSON argument schemas.

-Current built-in coverage includes log tailing, listener inventory, file hashing, privilege vectors, persistence inventory, account-role snapshots, process-network correlation, and baseline capture for drift workflows.
+Current built-in coverage includes log tailing, listener inventory, file hashing, privilege vectors, persistence inventory, account-role snapshots, process-network correlation, SSH key enumeration, and baseline capture for drift workflows.

Coverage tool argument highlights:

@@ -465,11 +465,12 @@ When a free-text `--task` is provided, the agent resolves a declarative investig
Built-in investigation templates:

- **broad-host-triage**: default fallback. Runs all host-level tools.
-- **ssh-key-investigation**: SSH key and account audit focus.
+- **ssh-key-investigation**: SSH key enumeration and account audit focus.
- **persistence-analysis**: autorun and persistence mechanism checks.
- **network-exposure-audit**: listener and network binding analysis.
- **privilege-escalation-check**: privilege escalation indicator checks.
- **file-integrity-check**: hash verification and file integrity analysis.
- **syslog-analysis**: log review, account audit, and persistence checks. Matches keywords: log, syslog, journal, event, audit.
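
Keyword-based template resolution, as described for `syslog-analysis`, might look like the sketch below. Only the `syslog-analysis` keywords and the `broad-host-triage` fallback come from the list above. The function name, the whole-word matching rule, and the omission of the other templates (whose keywords and precedence are not listed here) are assumptions.

```rust
/// Keywords documented for the `syslog-analysis` template.
const SYSLOG_KEYWORDS: [&str; 5] = ["log", "syslog", "journal", "event", "audit"];

/// Hypothetical resolver: pick a template name for a free-text task.
/// Assumption: keywords match whole words, case-insensitively.
pub fn resolve_template(task: &str) -> &'static str {
    let lowered = task.to_lowercase();
    let matches_syslog = lowered
        .split(|c: char| !c.is_alphanumeric())
        .any(|word| SYSLOG_KEYWORDS.contains(&word));
    if matches_syslog {
        "syslog-analysis"
    } else {
        // Documented default fallback when nothing else matches.
        "broad-host-triage"
    }
}
```

Whole-word matching avoids treating "login" as a hit for "log". A substring rule would behave differently, and the source does not say which one the agent uses.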

List investigation templates via `--list-task-templates`.

2 changes: 2 additions & 0 deletions docs/getting-started.md
@@ -141,6 +141,8 @@ When running in live mode, WraithRun automatically probes the loaded model to cl
- **Moderate**: medium models. Agent uses a ReAct (Reason + Act) loop, iteratively choosing tools based on observations, then synthesizes findings via LLM.
- **Strong**: large models (≥10B params and ≤50ms latency). Agent uses a full ReAct loop with the complete evidence window for deep iterative reasoning and synthesis.

Since v1.8.0, parameter estimation is quantization-aware: Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, and FP32 uses 4.4. Q4 models are therefore classified more accurately: a 750 MB Q4 file now estimates ~1.4B parameters (750 MB ÷ 0.55 bytes/param) instead of the ~0.3B produced by the old fixed 2.2 divisor.
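
The arithmetic behind that estimate can be sketched as follows. The bytes-per-parameter values are the documented ones. The type and function names, the filename patterns, and the FP16 fallback for unrecognised names (chosen because 2.2 was the previous hardcoded divisor) are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Quantization {
    Q4,
    Q8,
    Fp16,
    Fp32,
}

impl Quantization {
    /// Documented bytes-per-parameter divisors.
    pub fn bytes_per_param(self) -> f64 {
        match self {
            Quantization::Q4 => 0.55,
            Quantization::Q8 => 1.1,
            Quantization::Fp16 => 2.2,
            Quantization::Fp32 => 4.4,
        }
    }
}

/// Hypothetical filename detection; the real naming conventions are not listed.
pub fn detect_quantization(filename: &str) -> Quantization {
    let name = filename.to_lowercase();
    if name.contains("q4") {
        Quantization::Q4
    } else if name.contains("q8") {
        Quantization::Q8
    } else if name.contains("fp32") {
        Quantization::Fp32
    } else {
        // Assumption: fall back to the old 2.2 divisor when nothing matches.
        Quantization::Fp16
    }
}

/// Estimated parameter count in billions for a model file of the given size.
pub fn estimate_params_billions(file_size_bytes: u64, quant: Quantization) -> f64 {
    file_size_bytes as f64 / quant.bytes_per_param() / 1e9
}
```

For a 750 MB file, the Q4 divisor gives roughly 1.36B parameters, while the FP16 divisor gives roughly 0.34B, matching the rounded figures quoted above.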

Override automatic classification when you know your model's capability:

```powershell
4 changes: 4 additions & 0 deletions docs/live-mode-operations.md
@@ -128,6 +128,10 @@ WraithRun caches the ONNX session and tokenizer across investigation steps withi

The agent also tracks prompt prefix reuse across steps. When consecutive prompts share a common prefix (e.g., system prompt + prior context), the prefix hit/miss ratio is logged for observability. Full KV-state reuse is scaffolded for a future release.

Since v1.8.0, the prefill attention mask correctly accounts for forced cache padding on models that lack a `use_cache` branch toggle (#136). Previously, models like Qwen2.5 and Llama 3.2 could crash with a shape broadcast error during prefill because the attention mask length did not include the initial cache dimension.

Also since v1.8.0, execution provider reporting now detects DirectML and CUDA backend overrides (#142), so `model_capability.execution_provider` in JSON output accurately reflects the active backend instead of always showing `CPUExecutionProvider`.

Temperature controls affect live inference behavior:

- `--temperature 0` (or omit): greedy decoding — fastest, fully deterministic output.
17 changes: 17 additions & 0 deletions docs/tool-reference.md
@@ -142,6 +142,23 @@ Output fields:
- `network_risk_level`
- `records`

## enumerate_ssh_keys

Purpose:

- Enumerates SSH key material across user home directories. Cross-platform: on Windows it scans `%USERPROFILE%\.ssh`, `ProgramData\ssh`, and other user profiles; on Linux/macOS it scans `/root/.ssh` and `/home/*/.ssh`.

Arguments:

- none

Output fields:

- `directories` (array): per-directory summary including path, `has_authorized_keys`, `private_key_count`, and `public_key_count`.
- `total_authorized_keys_files` (integer)
- `total_private_keys` (integer)
- `total_public_keys` (integer)

## capture_coverage_baseline

Purpose:
3 changes: 3 additions & 0 deletions docs/troubleshooting.md
@@ -168,6 +168,7 @@ wraithrun --task "Investigate ..." --live --model C:/models/llm.onnx --tokenizer
```

- Tier thresholds: Basic ≤2B params or ≥200ms latency; Strong ≥10B params and ≤50ms latency; Moderate is everything in between.
- Since v1.8.0, parameter estimation is quantization-aware: Q4 models use 0.55 bytes/param, Q8 uses 1.1, FP16 uses 2.2, and FP32 uses 4.4. This may reclassify models that were previously underestimated. For example, a Q4 model that reported 0.5B may now report ~2B, which moves it from Basic to Moderate once the estimate exceeds the 2B Basic threshold.
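
The tier thresholds listed above translate into a small decision function. The sketch below follows the documented rules literally; the enum and function names are hypothetical, and `latency_ms` stands for whatever latency figure the probe reports, which this section does not define further.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CapabilityTier {
    Basic,
    Moderate,
    Strong,
}

/// Classify a model from its estimated size (billions of params) and latency.
pub fn classify_tier(estimated_params_b: f64, latency_ms: f64) -> CapabilityTier {
    if estimated_params_b <= 2.0 || latency_ms >= 200.0 {
        // Basic: at most 2B params, or at least 200 ms latency.
        CapabilityTier::Basic
    } else if estimated_params_b >= 10.0 && latency_ms <= 50.0 {
        // Strong: at least 10B params and at most 50 ms latency.
        CapabilityTier::Strong
    } else {
        // Moderate: everything in between.
        CapabilityTier::Moderate
    }
}
```

Under these rules an estimate of exactly 2.0B still lands in Basic, so a model only moves to Moderate when the new estimate is above 2B and latency is under 200 ms.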

## Final answer looks generic or templated

@@ -179,6 +180,7 @@ Fix:

- This happens when the model is classified as Basic tier (deterministic summary) or when LLM output quality is detected as low.
- Since v1.6.0, Moderate/Strong tiers use a ReAct loop that typically produces richer output. If output is still generic, try `--capability-override strong` or increase `--temperature` slightly (e.g., `0.1`).
- Since v1.8.0, the quality guard also catches hallucinated `<call>` tags and `[observation]` markers inside the final answer. When detected, the agent replaces the garbage with a deterministic summary built from real findings. This means even Moderate/Strong tier runs may show a structured summary if the model hallucinates.
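
The quality-guard check described in the last bullet could be approximated as below. This is a sketch under assumptions: the function names are hypothetical, and plain case-sensitive substring matching on `<call>` and `[observation]` is a guess at the detection rule, which the source does not spell out.

```rust
/// True when a final answer contains markers that should only appear
/// in intermediate agent steps.
pub fn contains_hallucinated_markers(final_answer: &str) -> bool {
    final_answer.contains("<call>") || final_answer.contains("[observation]")
}

/// Return the model's answer, or the deterministic summary when the
/// answer contains hallucinated markers.
pub fn guard_final_answer(final_answer: &str, deterministic_summary: &str) -> String {
    if contains_hallucinated_markers(final_answer) {
        deterministic_summary.to_string()
    } else {
        final_answer.to_string()
    }
}
```

Replacing the whole answer, instead of stripping the markers, keeps fabricated tool output out of the report at the cost of a more templated summary.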

## Agent not calling expected tools

@@ -191,6 +193,7 @@ Fix:
- Moderate/Strong tiers use a ReAct loop where the LLM decides which tools to call. The model may not choose the same tools as the template-driven Basic tier.
- Increase `--max-steps` if the agent is exhausting its step budget before reaching all relevant tools.
- If the model is too small, it may produce a `<final>` answer immediately. Try `--capability-override strong` to allow full iterative reasoning.
- Since v1.8.0, if the model produces `<final>` at step 0 without calling any tools, the agent automatically falls back to template-driven execution so that real host data is still collected.
- Check `RUST_LOG=debug` output for `react_step` entries showing the agent's reasoning at each step.
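
The step-0 fallback mentioned above can be pictured as a small decision over the loop state. The sketch below is illustrative only: the enum, the function, and its parameters are hypothetical, and detecting the final answer by searching for the literal `<final>` tag is an assumption.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum NextAction {
    /// Keep iterating: the model has not produced a final answer yet.
    Continue,
    /// The model answered without gathering evidence: run the template instead.
    FallBackToTemplate,
    /// The model answered after at least one tool call.
    AcceptFinal,
}

pub fn next_action(step: usize, tool_calls_so_far: usize, model_output: &str) -> NextAction {
    let has_final = model_output.contains("<final>");
    if !has_final {
        NextAction::Continue
    } else if step == 0 && tool_calls_so_far == 0 {
        NextAction::FallBackToTemplate
    } else {
        NextAction::AcceptFinal
    }
}
```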

## Task returned a scope-boundary finding instead of running
17 changes: 17 additions & 0 deletions docs/upgrades.md
@@ -1,5 +1,22 @@
# Upgrade Notes

## v1.8.0

### Breaking/visible changes

- **Severity thresholds recalibrated** (#139): listener, account, and persistence findings now use higher thresholds. A normal desktop with ~100 listeners and 1 non-default admin account will report Low instead of High. Automation that keys on specific severity values should be reviewed.
- **Finding titles include specifics** (#140): finding `title` fields now embed account names, persistence entry text, and SSH directory details (e.g., `"Non-default privileged accounts observed (1): shrey"` instead of `"Non-default privileged accounts observed (1)"`). Parsers matching on exact title strings must be updated.
- **Parameter estimation changed** (#138): quantization-aware sizing means `estimated_params_b` in `model_capability` output may change. Q4 models now report ~4× higher param counts than before. This may reclassify some models into a higher capability tier.
- **New tool and template** (#141): `enumerate_ssh_keys` tool added to the registry; `syslog-analysis` investigation template added. Template tool ordering changed for `file-integrity-check` and `ssh-key-investigation`.
- **ReAct fallback behavior** (#137): Moderate/Strong tier runs may now produce a deterministic summary instead of LLM-generated text when the model hallucinates. The `final_answer` field will contain a structured SUMMARY block in these cases.
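
For parsers affected by the title change, matching on a stable prefix instead of the exact string is one option. The sketch below assumes the detail is appended after `": "`, as in the single example given above; other finding titles may use a different shape, and both function names are hypothetical.

```rust
/// Split a finding title into its leading part and an optional detail suffix.
/// Assumption: details follow the first ": " separator.
pub fn split_title(title: &str) -> (&str, Option<&str>) {
    match title.split_once(": ") {
        Some((prefix, detail)) => (prefix, Some(detail)),
        None => (title, None),
    }
}

/// Prefix match that accepts both pre-1.8.0 and 1.8.0-style titles.
pub fn title_matches(title: &str, expected_prefix: &str) -> bool {
    title.starts_with(expected_prefix)
}
```

Note that the count in parentheses is still part of the prefix, so matchers that should ignore the count need a shorter prefix such as `"Non-default privileged accounts observed"`.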

### Migration

- No TOML config changes required.
- If you parse `RunReport` JSON `findings[].title` strings, update matchers — titles now include entity names and entry details.
- If automation relies on severity thresholds, review the new calibration: listener counts below 50 are now Info (was Low at 25), single non-default admin accounts are Low (was Medium).
- The `enumerate_ssh_keys` tool is automatically included in `ssh-key-investigation` template runs. No opt-in needed.
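
When reviewing automation against the new calibration, it can help to restate the thresholds as code. The sketch below encodes only the listener and persistence ranges given in the changelog for this release. The names are hypothetical, account severity is left out because the two-account case is not specified, and treating a persistence count of zero as Low is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Info,
    Low,
    Medium,
    High,
}

/// Listener thresholds: Info <50, Low 50-149, Medium 150-249, High >=250.
pub fn listener_severity(count: u32) -> Severity {
    match count {
        0..=49 => Severity::Info,
        50..=149 => Severity::Low,
        150..=249 => Severity::Medium,
        _ => Severity::High,
    }
}

/// Persistence thresholds: Low <3, Medium 3-7, High >=8.
pub fn persistence_severity(count: u32) -> Severity {
    match count {
        0..=2 => Severity::Low,
        3..=7 => Severity::Medium,
        _ => Severity::High,
    }
}
```

A desktop with about 100 listeners maps to Low here, consistent with the example in the breaking-changes list.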

## v1.7.1

- Dependency-only release. No breaking API changes.