fix(ecs): surface failure reason on /logs when task fails to start #829

Merged
vieiralucas merged 1 commit into main from worktree-ecs-finalize-failure-diag
Apr 28, 2026

@vieiralucas commented Apr 28, 2026

Summary

Stabilize CI runner flakes (3/3) — make ECS task failures diagnosable.

The ECS secrets E2E test `ecs_task_resolves_secretsmanager_secret` has been failing intermittently in CI with `secret not injected: ` (note the empty string after the colon). Empty captured-logs means `finalize_failure` ran instead of `finalize_stopped` — i.e. the task died before the container started — but `finalize_failure` was not setting `task.captured_logs`, so the HTTP introspection endpoint `/_fakecloud/ecs/tasks/{id}/logs` returned an empty string and the assertion had nothing to attribute the failure to.
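
To illustrate why the message ended in a bare colon, here is a minimal sketch of a check that interpolates the fetched logs into its failure message. The function name `check_secret_injected` and its signature are hypothetical; the real assertion in `ecs_task_resolves_secretsmanager_secret` is not shown in this PR and may be shaped differently.

```rust
/// Hypothetical stand-in for the E2E assertion: `logs` is whatever the
/// `/_fakecloud/ecs/tasks/{id}/logs` endpoint returned, and `expected` is the
/// secret value the container was supposed to print.
pub fn check_secret_injected(logs: &str, expected: &str) -> Result<(), String> {
    if logs.contains(expected) {
        Ok(())
    } else {
        // The logs are the only diagnostic in the message. If they are empty,
        // the message ends right after the colon.
        Err(format!("secret not injected: {}", logs))
    }
}
```

With empty logs the error is exactly `secret not injected: `, which matches the CI symptom. With a reason in the logs, the same assertion carries it for free.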

This PR is a diagnostic surface fix:

  • `finalize_failure` (in `crates/fakecloud-ecs/src/runtime.rs`) now writes `[task failed to start]: <reason>` into `task.captured_logs`, mirroring the assignment that `finalize_stopped` already performs on the success path.
  • The `run_task` error handler also `eprintln!`s the reason so nextest's captured-output for a failed E2E surfaces it directly, not just inside the task state.
  • Adds a unit test `finalize_failure_writes_reason_into_captured_logs` that asserts the reason is present in `captured_logs` and prefixed correctly.
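
The three changes above can be sketched roughly as follows. This is not the code from `runtime.rs`: the `Task` struct, its fields other than `captured_logs`, the `"STOPPED"` status value, the `handle_start_error` helper, and the exact `eprintln!` wording are all assumptions made for illustration. Only the `[task failed to start]: ` prefix, the `finalize_failure` name, and the test name come from this PR.

```rust
/// Hypothetical, pared-down task record. The real struct has more fields,
/// and `captured_logs` may have a different type there.
#[derive(Debug, Default)]
pub struct Task {
    pub last_status: String,
    pub stopped_reason: Option<String>,
    pub captured_logs: String,
}

/// Prefix named in the PR description.
pub const FAILED_TO_START_PREFIX: &str = "[task failed to start]: ";

/// Sketch of the failure path: mark the task stopped and record the reason
/// where the logs endpoint can serve it.
pub fn finalize_failure(task: &mut Task, reason: &str) {
    task.last_status = "STOPPED".to_string(); // assumed status value
    task.stopped_reason = Some(reason.to_string());
    task.captured_logs = format!("{}{}", FAILED_TO_START_PREFIX, reason);
}

/// Hypothetical stand-in for the `run_task` error handler: print the reason
/// to stderr, then finalize the task.
pub fn handle_start_error(task: &mut Task, reason: &str) {
    eprintln!("ecs task failed to start: {}", reason);
    finalize_failure(task, reason);
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn finalize_failure_writes_reason_into_captured_logs() {
        let mut task = Task::default();
        finalize_failure(&mut task, "image pull error");
        assert!(task.captured_logs.starts_with(FAILED_TO_START_PREFIX));
        assert!(task.captured_logs.contains("image pull error"));
    }
}
```

Writing the reason into the same field that `finalize_stopped` populates means the logs endpoint needs no special case for failed tasks.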

The underlying intermittency is intentionally left untouched here: it could be a real `resolve_secret` race or a side effect of the disk pressure targeted by PRs #827/#828. Once those land and CI logs show the real reason text on a failure, we'll know which root cause to fix.

Sibling PRs: #827 (nextest retries + disk diagnostics), #828 (linker OOM via test profile debuginfo).

Test plan

  • `cargo test -p fakecloud-ecs --lib finalize_failure` passes locally
  • `cargo clippy -p fakecloud-ecs --all-targets -- -D warnings` clean
  • CI runs green first try
  • Next CI run that fails the ECS secrets test now contains the failure reason in the assertion message

Summary by cubic

Show ECS task start failure reasons on /_fakecloud/ecs/tasks/{id}/logs and stderr so failures no longer appear as empty logs.

  • Bug Fixes
    • Set task.captured_logs in finalize_failure to "[task failed to start]: <reason>".
    • Print the failure reason to stderr in run_task for clearer nextest output.
    • Add unit test finalize_failure_writes_reason_into_captured_logs.

Written for commit 2e33b98. Summary will update on new commits.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 1 file

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 98.48485% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/fakecloud-ecs/src/runtime.rs | 98.48% | 1 Missing ⚠️ |

@vieiralucas force-pushed the worktree-ecs-finalize-failure-diag branch from e2814f9 to 5dc1243 on April 28, 2026 20:46

Tasks that died before the container started (secret resolution miss,
image pull error, `docker run` failure) went through `finalize_failure`,
which did not set `task.captured_logs`. The HTTP introspection endpoint
`/_fakecloud/ecs/tasks/{id}/logs` therefore returned an empty string,
leaving E2E assertions like `ecs_task_resolves_secretsmanager_secret`
with `"secret not injected: "` and no diagnostic.

- `finalize_failure` now writes `[task failed to start]: <reason>` into
  `task.captured_logs`, mirroring the assignment `finalize_stopped`
  already does on the success path.
- The `run_task` error handler also `eprintln!`s the reason so nextest's
  captured-output for a failed E2E surfaces it directly.
- Add unit test `finalize_failure_writes_reason_into_captured_logs`.

This is a diagnostic surface fix only; the underlying intermittent
`resolve_secret` miss in CI (if real and not just disk pressure
side-effect) is left for a follow-up once the diagnostics expose its
root cause.

@vieiralucas force-pushed the worktree-ecs-finalize-failure-diag branch from 5dc1243 to 2e33b98 on April 28, 2026 21:26
@vieiralucas vieiralucas merged commit f6b1cec into main Apr 28, 2026
59 of 60 checks passed
@vieiralucas vieiralucas deleted the worktree-ecs-finalize-failure-diag branch April 28, 2026 23:05