Malbox End-to-End Tracker #73

@Mjoyufull

Description

v0.1.0 - Open Issues

A. Task Model / Submission / Scheduler

  • A1 - Task plugin selection + config flow [broken] - Task creation stores plugins = ["0"]. The worker ignores tasks.plugins and iterates the registry snapshot instead. Worker-side plugin config building is still a TODO.
  • A2 - Submission API / runtime alignment [partial] - The task-create API accepts package, module, options, machine, custom, memory, and unique, but none of them is meaningfully honored at runtime.
  • A3 - Submission-path panic cleanup [partial] - The request path uses unwrap() / expect() where it should return proper API errors.
  • Timeout semantics [partial] - enforce_timeout is stored, but the worker timeout just wraps execute_task() in a timeout.
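
A minimal sketch of the A1/A3 direction, assuming plugins are looked up by name in a map-shaped registry snapshot. PluginInfo, SubmitError, and resolve_task_plugins are hypothetical names, not Malbox's actual types. The worker would run only the plugins the task named, in the order it named them, and an unknown name (including the placeholder "0") becomes an error the API layer could map to a client error instead of an unwrap() panic.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical registry entry; the real registry type is not shown in this tracker.
#[derive(Debug, Clone, PartialEq)]
pub struct PluginInfo {
    pub name: String,
    pub version: String,
}

/// Error returned to the API caller instead of panicking in the request path.
#[derive(Debug, PartialEq)]
pub enum SubmitError {
    EmptySelection,
    UnknownPlugin(String),
}

impl fmt::Display for SubmitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SubmitError::EmptySelection => write!(f, "task must name at least one plugin"),
            SubmitError::UnknownPlugin(p) => write!(f, "unknown plugin: {p}"),
        }
    }
}

impl std::error::Error for SubmitError {}

/// Resolve exactly the plugins the task asked for, preserving the requested order,
/// rather than iterating the whole registry snapshot.
pub fn resolve_task_plugins(
    requested: &[String],
    registry: &BTreeMap<String, PluginInfo>,
) -> Result<Vec<PluginInfo>, SubmitError> {
    if requested.is_empty() {
        return Err(SubmitError::EmptySelection);
    }
    requested
        .iter()
        .map(|name| {
            registry
                .get(name)
                .cloned()
                .ok_or_else(|| SubmitError::UnknownPlugin(name.clone()))
        })
        // Stops at the first unknown name and returns that error.
        .collect()
}
```

Because the function returns a Result, a submission handler can propagate it with ? instead of calling expect().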

B. Machine Lifecycle / Cleanup / Recovery

  • B1 - Timeout-safe machine ownership [broken] - Once execute_task() is canceled, the worker loses machine_id and the machine stays assigned until restart.
  • B1 - Post-acquire failure cleanup [broken] - Later failure paths can bubble out without a guaranteed release/repair. The path is not exception-safe in general.
  • B2 - Completed vs. reverted contract [partial] - Task Completed means "the worker reached the release path", not "the machine is confirmed clean". This needs an explicit contract.
  • B3 - Recovery and reprovision fidelity [partial] - Recovery is restart-centric, not self-healing while live. Interrupted Provisioning machines are restored to Ready with provider_config = None, which is lossy.
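
One way to make machine ownership timeout-safe (B1) is an RAII lease: the machine_id lives in a guard whose Drop hands it back, so every exit path (normal return, ? early exit, panic unwinding) releases the machine. This sketch uses only std and a synchronous execute_task(); MachinePool, MachineLease, and the simulated failure are hypothetical. Assuming a tokio-style timeout that drops the inner future when it expires, a lease held across an .await would be released the same way. Drop cannot await, so a real implementation would likely enqueue an async revert/repair job rather than just marking the machine free.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical pool: machine_id -> task currently holding it (None = free).
#[derive(Default)]
pub struct MachinePool {
    assigned: Mutex<HashMap<u64, Option<u64>>>,
}

impl MachinePool {
    pub fn with_machines(ids: &[u64]) -> Arc<Self> {
        let pool = MachinePool::default();
        {
            let mut map = pool.assigned.lock().unwrap();
            for id in ids {
                map.insert(*id, None);
            }
        }
        Arc::new(pool)
    }

    /// Assign the lowest-numbered free machine to `task_id` and return a lease owning it.
    pub fn acquire(self: &Arc<Self>, task_id: u64) -> Option<MachineLease> {
        let mut map = self.assigned.lock().unwrap();
        let id = map.iter().filter(|(_, t)| t.is_none()).map(|(id, _)| *id).min()?;
        map.insert(id, Some(task_id));
        Some(MachineLease { pool: Arc::clone(self), machine_id: id })
    }

    /// Which task, if any, currently holds `machine_id`.
    pub fn holder(&self, machine_id: u64) -> Option<u64> {
        self.assigned.lock().unwrap().get(&machine_id).copied().flatten()
    }
}

/// Owns the machine for as long as the task runs; dropping it returns the machine.
pub struct MachineLease {
    pool: Arc<MachinePool>,
    pub machine_id: u64,
}

impl Drop for MachineLease {
    fn drop(&mut self) {
        // Placeholder for "release + revert/repair"; here it only marks the machine free.
        if let Ok(mut map) = self.pool.assigned.lock() {
            map.insert(self.machine_id, None);
        }
    }
}

/// Simulated execute path that may fail after the machine has been acquired.
pub fn execute_task(pool: &Arc<MachinePool>, task_id: u64, fail: bool) -> Result<u64, String> {
    let lease = pool.acquire(task_id).ok_or("no free machine")?;
    if fail {
        // The early return drops `lease`, so the machine is released anyway.
        return Err(format!("guest RPC failed on machine {}", lease.machine_id));
    }
    Ok(lease.machine_id)
}
```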

C. Guest Access / Networking / Trust Model

  • C1 - Guest RPC contract + trust/auth [broken] - Guest sample execution hard-requires gRPC on port 50051. Disabling guest_access doesn't actually disable guest execution semantics. Guest gRPC is a trusted-lab transport assumption, not an authenticated boundary. Registration is reachability-based, not handshake-based.
  • C2 - Guest endpoint discovery / IP refresh [partial] - DHCP lease lookup works for libvirt, but the IP is persisted and trusted later without revalidation.
  • C3 - Transport config authority [partial] - ResolvedTransport::Grpc stores an address, but workers still dial http://<db_machine.ip>:50051 directly. network_isolated = false is passed unconditionally. Provider-side endpoint reconstruction hardcodes Windows in places.
  • C4 - Guest runtime config parity [partial] - The Windows guest-plugin deploy sets MALBOX_WORK_DIR, but the default guest runtime only reads MALBOX_PLUGIN_PORT, so the override path isn't wired.
  • Task lifetime ownership [ambiguous] - Sample execution is fire-and-observe. Open decision: should task lifetime be driven by sample exit, an observation window, plugin-controlled keepalive, an overall timeout, or a mix?
  • Sequential guest-plugin ports [partial] - 50051 + index is brittle if plugin ordering becomes dynamic. The first plugin on 50051 also acts as the base guest agent.
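
A sketch of treating the resolved transport as the only source of the guest address (C3) and of looking guest-plugin ports up by name instead of computing 50051 + index. Transport is a hypothetical stand-in for ResolvedTransport with only a gRPC variant modeled; its Disabled variant shows how turning off guest_access could refuse guest dialing outright (C1). The port table, the plugin names, and the assumption that all guest plugins share the guest's IP are illustrative only.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddr;

/// Hypothetical stand-in for ResolvedTransport.
#[derive(Debug, Clone)]
pub enum Transport {
    Grpc { addr: SocketAddr },
    Disabled,
}

#[derive(Debug, PartialEq)]
pub enum EndpointError {
    GuestAccessDisabled,
    NoPortForPlugin(String),
}

/// Dial URI for the base guest agent, taken from the resolved transport
/// rather than rebuilt from the DB machine IP plus a hardcoded port.
pub fn guest_endpoint(t: &Transport) -> Result<String, EndpointError> {
    match t {
        Transport::Grpc { addr } => Ok(format!("http://{addr}")),
        Transport::Disabled => Err(EndpointError::GuestAccessDisabled),
    }
}

/// Dial URI for a named guest plugin, using an explicit name -> port table so the
/// port does not change when plugin ordering changes.
pub fn plugin_endpoint(
    t: &Transport,
    ports: &BTreeMap<String, u16>,
    plugin: &str,
) -> Result<String, EndpointError> {
    match t {
        Transport::Grpc { addr } => {
            let port = ports
                .get(plugin)
                .ok_or_else(|| EndpointError::NoPortForPlugin(plugin.to_string()))?;
            let mut target = *addr;
            target.set_port(*port);
            Ok(format!("http://{target}"))
        }
        Transport::Disabled => Err(EndpointError::GuestAccessDisabled),
    }
}
```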

D. Plugin Runtime / Ordering / Lifecycle

  • D1 - Host plugin reality check [broken] - Persistent host plugins spawn in Starting and the manager acquires Ready/Busy, but the explicit host execution path returns empty results. The event-runtime fallback gives task handlers an empty sample path, while examples expect sample bytes.
  • D2 - Deterministic plugin ordering + execution contexts [broken] - Plugin selection and ordering are neither task-driven nor deterministic. Manifest contexts (exclusive, sequential, parallel) are parsed but not enforced.
  • D3 - Guest plugin lifecycle contract [partial] - Guest plugin lifetime is VM-boot scoped, not task scoped. The worker doesn't drive initialize() / shutdown() as a real lifecycle.
  • D4 - Event-hook / runtime alignment [partial] - Event-hook docs and manifests are ahead of reality. The worker only emits a small subset of task events over host IPC; guest event delivery doesn't appear to be used operationally.
  • Plugin acquisition tracking [partial] - Still records Busy { task_id: 0 } instead of the actual task ID.
  • Live reconcile wiring [partial] - Registry pending-change application and manager reconcile exist, but no top-level runtime loop feeds plugin watch changes into live reconcile.
  • Lifecycle types [partial] - Persistent and ephemeral are partially real; scoped is not implemented.
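
A sketch of deterministic ordering plus stage planning for D2, assuming a hypothetical per-plugin priority field and one possible reading of the manifest contexts: exclusive and sequential plugins each get a stage of their own, and adjacent parallel plugins share a stage. Ties break on name, so the plan never depends on registry iteration order. The tracker does not say how exclusive and sequential are meant to differ; a real scheduler might additionally hold a machine-wide lock for exclusive plugins.

```rust
/// Execution contexts named in plugin manifests; the stage semantics below are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Context {
    Exclusive,
    Sequential,
    Parallel,
}

/// Hypothetical plugin description; `priority` is not a documented manifest field.
#[derive(Debug, Clone)]
pub struct PluginSpec {
    pub name: String,
    pub priority: u32,
    pub context: Context,
}

/// Sort by (priority, name), then group into stages that run one after another.
pub fn plan_stages(mut plugins: Vec<PluginSpec>) -> Vec<Vec<String>> {
    plugins.sort_by(|a, b| a.priority.cmp(&b.priority).then_with(|| a.name.cmp(&b.name)));

    let mut stages: Vec<Vec<String>> = Vec::new();
    let mut parallel: Vec<String> = Vec::new();
    for p in plugins {
        match p.context {
            // Consecutive parallel plugins accumulate into one shared stage.
            Context::Parallel => parallel.push(p.name),
            // Exclusive/sequential plugins close any open parallel stage and run alone.
            Context::Exclusive | Context::Sequential => {
                if !parallel.is_empty() {
                    stages.push(std::mem::take(&mut parallel));
                }
                stages.push(vec![p.name]);
            }
        }
    }
    if !parallel.is_empty() {
        stages.push(parallel);
    }
    stages
}
```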

E. Results / Reports / Contributor Health

  • E1 - Result artifact read/download API [partial] - Result retrieval is metadata-only.
  • E2 - Result model consolidation [partial] - There are two result stories: the per-plugin artifact path and the older aggregate TaskResult shape. TaskStore::update_task_result() is still a TODO.
  • E3 - SQLx contributor workflow + CI [broken] - The task_results repo breaks clean contributor test runs unless DATABASE_URL is set or prepared SQLx metadata exists.
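
Once E1 grows a download endpoint, it will likely need to map a stored per-plugin artifact path onto disk without letting that path escape the task's result directory. A minimal sketch, assuming artifacts are stored as paths relative to a per-task directory under a results root; the layout, the example root path, and resolve_artifact are hypothetical. The check is purely lexical (it rejects .., absolute paths, and Windows prefixes) and does not resolve symlinks, so a real handler would also canonicalize and re-check the prefix.

```rust
use std::path::{Component, Path, PathBuf};

/// Map a stored relative artifact path to <root>/<task_id>/<relative>,
/// returning None for anything that could escape the task directory.
pub fn resolve_artifact(root: &Path, task_id: u64, relative: &str) -> Option<PathBuf> {
    let mut out = root.join(task_id.to_string());
    let mut pushed_any = false;
    for comp in Path::new(relative).components() {
        match comp {
            Component::Normal(part) => {
                out.push(part);
                pushed_any = true;
            }
            // "." segments are harmless.
            Component::CurDir => {}
            // "..", a leading "/", or a drive prefix could leave the task directory.
            Component::ParentDir | Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    // An empty or "."-only path names the directory itself, not an artifact.
    if pushed_any { Some(out) } else { None }
}
```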

F. Docs / Config / Operator Surfaces

  • F1 - Config / docs drift cleanup [doc-lag] - The sample config references the removed native provisioner, some operator-facing comments still reference "native", and CLI grouping is aspirational in places.
  • F2 - Front-end truthfulness [doc-lag] - The Svelte front end is mock-backed.
  • F3 - Security assumptions / lab-mode docs [missing] - There is no auth model across the HTTP API, CLI, and web UI.
  • Windows provisioning shortcuts [security] - Relies on Administrator / packer, WinRM, insecure cert validation, and guest plugin scheduled tasks running as SYSTEM.
  • Linux provisioning shortcuts [security] - Base images allow password SSH, and the malbox user has passwordless sudo.
  • Network isolation [partial] - Isolation controls exist in playbooks as image/provisioning policy but are not centrally enforced. The libvirt domain build hardcodes the default NAT network and an e1000 NIC.
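
A sketch of lifting the NIC choice out of the libvirt domain builder into config, so an operator could point guests at an isolated libvirt network instead of the hardcoded default NAT network and e1000 model. The defaults reproduce today's behaviour. NicConfig, interface_xml, and the "malbox-isolated" network name are hypothetical; the element shape (interface type='network' with source network='...' and model type='...') follows libvirt's domain XML format. This only makes the network selectable; it does not by itself enforce isolation centrally.

```rust
/// NIC settings the domain builder currently hardcodes.
#[derive(Debug, Clone)]
pub struct NicConfig {
    /// Name of a libvirt network that must already be defined on the host.
    pub network: String,
    /// libvirt NIC model, e.g. "e1000" or "virtio".
    pub model: String,
}

impl Default for NicConfig {
    fn default() -> Self {
        // Mirrors the current hardcoded choice so existing setups keep working.
        NicConfig { network: "default".to_string(), model: "e1000".to_string() }
    }
}

/// Render the <interface> element of a libvirt domain definition from config.
pub fn interface_xml(nic: &NicConfig) -> String {
    format!(
        "<interface type='network'><source network='{}'/><model type='{}'/></interface>",
        xml_escape(&nic.network),
        xml_escape(&nic.model)
    )
}

/// Escape XML special characters; '&' must be replaced first.
fn xml_escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('\'', "&apos;")
        .replace('"', "&quot;")
}
```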

Metadata

Assignees: No one assigned
Labels: No labels
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests