Incident Mitigation: 500 errors and resource scaling for Container App octopetsapi #8

@dm-chelupati

Summary

  • Incident: INC0010020 – the octopetsapi Container App returned HTTP 500 errors, and View Details was slow or unresponsive.
  • Impacted resource: /subscriptions/ca5ce512-88e1-44b1-97c6-22caf84fb2b0/resourceGroups/rg-octopets-v2/providers/Microsoft.App/containerApps/octopetsapi (eastus2).
  • Timeline (UTC):
    • 06:46 – Incident opened in ServiceNow.
    • 06:50 – Ownership acknowledged; investigation initiated.
    • 06:51–06:56 – Baseline gathered: latest revision octopetsapi--0000005 at 0.5 vCPU/1Gi, minReplicas=2/maxReplicas=4. Logs show repeated System.OutOfMemoryException thrown in Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation() (ListingEndpoints.cs:18, invoked from line 53). Liveness probe timeouts (1s timeout) were also observed, likely a secondary effect of memory pressure.
    • 06:56 – Metrics analyzed: CPU low; memory sustained at ~63–78% of the 1Gi limit, with spikes during error bursts. Request traffic was present throughout the error interval.
    • 06:58 – Mitigation applied: scaled to 4 vCPU/8Gi (ephemeral storage 8Gi). New revision octopetsapi--0000006 provisioned.
    • 07:00–07:05 – Early post-mitigation checks: CPU low, rollout in progress; the health endpoint was initially unreachable while the revision settled.
    • 07:03 – Secondary mitigation: restarted the latest revision to clear transient faults.

Log and Metrics Highlights

  • Repeated unhandled exceptions during request processing:
    • System.OutOfMemoryException in Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation() at ListingEndpoints.cs:18; called from MapListingEndpoints (line 53).
    • Example pattern: "fail: Microsoft.AspNetCore.Server.Kestrel[13] ... An unhandled exception was thrown by the application. System.OutOfMemoryException ... at ... AReallyExpensiveOperation() ..."
  • Probes:
    • Liveness/Startup probe timeouts logged with "timeout in 1 seconds" during pressure/rollout.
  • Resource pressure:
    • CPU: mostly low (<= ~17%).
    • Memory: sustained at ~63–78% of the 1Gi limit before scaling; consistent with OOM exceptions and unbounded allocations.
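
The exception pattern above can be counted mechanically during triage. The following is a minimal sketch, assuming the console logs are available as plain text lines. It is written in Python purely as a neutral illustration (the service itself is ASP.NET Core). The helper name `count_oom_by_frame` and the sample input are hypothetical; only the exception type and the `AReallyExpensiveOperation` frame come from the logs quoted in this issue.

```python
import re
from collections import Counter

# Matches a .NET-style stack frame such as "at Namespace.Type.Method(".
FRAME_RE = re.compile(r"\bat\s+([\w.]+)\(")

def count_oom_by_frame(lines):
    """Count OutOfMemoryException occurrences, keyed by the first stack
    frame seen on or after each exception line (hypothetical triage helper)."""
    counts = Counter()
    pending = False  # True from an OOM line until its first frame is found
    for line in lines:
        if "System.OutOfMemoryException" in line:
            pending = True
        if pending:
            match = FRAME_RE.search(line)
            if match:
                counts[match.group(1)] += 1
                pending = False
    return counts

# Synthetic input shaped like the pattern quoted above, not real log output.
sample = [
    "fail: Microsoft.AspNetCore.Server.Kestrel[13]",
    "System.OutOfMemoryException",
    "   at Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation()",
]
print(count_oom_by_frame(sample))
```

Grouping by the first frame makes it easy to confirm whether every OOM originates in the same method or whether other call sites are also involved.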

Mitigations Implemented

  • Scaled compute from 0.5 vCPU/1Gi to 4 vCPU/8Gi to relieve memory pressure while a code fix is produced.
  • Restarted the latest revision after scaling to clear transient faults.
  • Ongoing monitoring for request/5xx signals and health after rollout stabilization.

Recommended Code Fixes

  • Focus: Octopets.Backend.Endpoints.ListingEndpoints.AReallyExpensiveOperation()
    • Replace unbounded allocations with streaming/pagination; avoid loading large datasets into memory; cap maximum result sizes.
    • Implement defensive checks on input parameters to prevent pathological workloads.
    • Ensure proper disposal of buffers/streams; consider pooled buffers.
    • Add cancellation and timeouts; return 429/503 or partial results instead of letting OOM occur.
    • Add telemetry around allocation sizes, operation duration, and GC pauses.
  • Additional cleanup:
    • EF Core warnings: add ValueComparer for Listing.AllowedPets and Listing.Amenities to avoid subtle tracking bugs.
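
The first, second, and fourth recommendations above (bounded results, input checks, and timeouts) can be sketched together. This is a language-neutral illustration in Python, not the service's C# code and not a proposed implementation of `AReallyExpensiveOperation()`, whose body is not shown in this issue. `MAX_PAGE_SIZE`, `TooBusy`, `clamp_page_size`, and `read_page` are hypothetical names, and the cap of 100 and the 2-second deadline are placeholder values.

```python
import time
from itertools import islice

MAX_PAGE_SIZE = 100  # hypothetical cap; derive the real value from payload sizes

class TooBusy(Exception):
    """Signals the caller to answer 429/503 instead of continuing to allocate."""

def clamp_page_size(requested):
    """Reject non-positive page sizes and cap oversized ones."""
    if requested < 1:
        raise ValueError("page size must be >= 1")
    return min(requested, MAX_PAGE_SIZE)

def read_page(source, offset, page_size, deadline_s=2.0, clock=time.monotonic):
    """Return one bounded page from an iterable without materializing the rest.

    The deadline is checked once per yielded item, so a single slow item can
    still overrun it; a real endpoint would also pass a cancellation token.
    """
    if offset < 0:
        raise ValueError("offset must be >= 0")
    size = clamp_page_size(page_size)
    started = clock()
    page = []
    for item in islice(source, offset, offset + size):
        if clock() - started > deadline_s:
            raise TooBusy("deadline exceeded while building page")
        page.append(item)
    return page

# A huge lazy source is fine: only one page is ever held in memory.
print(read_page(iter(range(10**9)), offset=5, page_size=3))  # prints [5, 6, 7]
```

The design point is that memory use is bounded by the page cap rather than by the dataset or by caller input, and that overload surfaces as an explicit, retryable error rather than an OutOfMemoryException.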

IaC Review and Drift

  • Observed runtime config before mitigation: 0.5 vCPU/1Gi; after mitigation: 4 vCPU/8Gi; minReplicas=2, maxReplicas=4; HTTP scaler concurrentRequests previously 10.
  • Probe settings observed in logs show 1s timeouts for startup/liveness, which are too aggressive for this workload. Suggest increasing the timeouts and setting initialDelaySeconds.
  • A scan of the repo for IaC files (.bicep/.tf/.json/.y*ml) found none via the IaC discovery utility. If IaC exists elsewhere, please update compute/probe parameters to match the scaled production settings and recommended probe thresholds.
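
If an IaC definition is located later, drift against the running configuration can be checked with a simple comparison. The sketch below is in Python for illustration only. The flat key names are hypothetical and do not follow the Container Apps resource schema, and the `declared` values simply reuse the pre-mitigation settings from this issue as a stand-in, since no IaC file was found.

```python
def find_drift(declared, observed):
    """Return {setting: (declared_value, observed_value)} for each mismatch."""
    keys = sorted(set(declared) | set(observed))
    return {
        key: (declared.get(key), observed.get(key))
        for key in keys
        if declared.get(key) != observed.get(key)
    }

# Stand-in for an IaC declaration (assumption: it matches the pre-mitigation state).
declared = {"cpu": 0.5, "memory": "1Gi", "minReplicas": 2, "maxReplicas": 4}
# Runtime values after the mitigation described in this issue.
observed = {"cpu": 4.0, "memory": "8Gi", "minReplicas": 2, "maxReplicas": 4}

print(find_drift(declared, observed))
# prints {'cpu': (0.5, 4.0), 'memory': ('1Gi', '8Gi')}
```

Any non-empty result means the next IaC deployment would silently revert the mitigation unless the declaration is updated first.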

Action Items for the Repository

  1. Fix AReallyExpensiveOperation to be memory-safe (pagination/streaming, bounded allocations) and add tests simulating large payloads.
  2. Add instrumentation (OpenTelemetry/Application Insights) around this endpoint and memory allocations.
  3. Review and adjust ASP.NET Core Kestrel limits and GC settings if applicable.
  4. Update IaC (if present) to:
    • Set resources for octopetsapi to at least 4 vCPU/8Gi (or a right-sized value post-fix),
    • Tune probes (startup/liveness) with more realistic timeouts and initial delays,
    • Keep autoscaling configuration consistent with desired concurrency.
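
For action item 1, a regression test can compare peak allocation between a materializing and a streaming code path. The sketch below shows the idea in Python with the standard `tracemalloc` module; a .NET test would use the platform's own allocation measurements instead. The functions `materialized`, `streamed`, and `peak_bytes`, the row shape, and the row count are all hypothetical.

```python
import tracemalloc

def peak_bytes(fn):
    """Run fn() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

def materialized(n):
    """Build every row up front, as an unbounded allocation would."""
    rows = [("listing", i) for i in range(n)]
    return len(rows)

def streamed(n):
    """Produce rows lazily so that only one is alive at a time."""
    return sum(1 for _ in (("listing", i) for i in range(n)))

def test_streaming_uses_less_peak_memory(n=50_000):
    # Both paths must process the same number of rows.
    assert materialized(n) == streamed(n) == n
    # The streaming path should peak well below the materializing path.
    assert peak_bytes(lambda: streamed(n)) < peak_bytes(lambda: materialized(n))

test_streaming_uses_less_peak_memory()
```

A relative comparison like this is less brittle than asserting an absolute byte budget, which varies by runtime version and platform.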

Follow-up and References

  • This issue tracks the code/IaC follow-up for the incident above. The service has been scaled and restarted to stabilize while the code fix is implemented. Please triage to the team owning ListingEndpoints and prioritize remediation to prevent recurrence.
  • Note: Unable to assign labels/assignees due to permissions; please triage to Copilot or relevant owners manually.

This issue was created by sreagent-octopets-007--70b460e3
Tracked by the SRE agent here
