[BUG] Deleting a Monitor while it is being executed orphans Alerts #2086

@toepkerd

Description

What is the bug?
Given a Monitor that has an ongoing ACTIVE Alert, if a user deletes that Monitor in the middle of one of its executions, a race condition duplicates and orphans the Alert: it continues to exist in the .opendistro-alerting-alerts index as ACTIVE, while a copy of the same Alert is written to the .opendistro-alerting-alert-history* index pattern as DELETED.

Separately, trying to acknowledge the orphaned ACTIVE Alert leads to a stuck "acknowledge alert" task.

The longer monitor executions run, the wider the race condition window, and the more likely this bug is to occur.
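
For reference, the duplicated pair is easy to surface through the Get Alerts API. A minimal Kotlin sketch, assuming an unauthenticated local cluster on localhost:9200 (the regex extraction is just for eyeballing the output, not proper JSON parsing):

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Fetches all Alerts via the Get Alerts API and prints only id/version/state,
// which makes the ACTIVE/DELETED pair sharing one Alert id easy to spot.
fun main() {
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/_plugins/_alerting/monitors/alerts"))
        .GET()
        .build()
    val body = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
        .body()
    Regex("\"(id|version|state)\"\\s*:\\s*(\"[^\"]*\"|\\d+)")
        .findAll(body)
        .forEach { println(it.value) }
}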

What is the expected behavior?
The presence of the DELETED Alert is correct; the ACTIVE Alert, however, should have been deleted. This points to AlertMover.kt's postDelete() flow, where the Alert is successfully copied to the history index but somehow never removed from the alerts index. Strangely, no "Failed to delete alerts" error logs or exceptions were found in the Elasticsearch logs.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Create an index and send continuous voluminous data to it (a hypothetical load-generator sketch follows these steps). Example mappings, grouped into basic, voluminous, and metadata mock fields; keyword subfields are added on status and host so the Monitor's terms aggregations work:
"mappings": {
    "properties": {
        "@timestamp": {"type": "date"},
        "status": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "application": {"type": "text"},
        "severity": {"type": "text"},
        "number": {"type": "integer"},

        "log_message": {"type": "text"},
        "stack_trace": {"type": "text"},
        "request_payload": {"type": "text"},
        "response_payload": {"type": "text"},

        "host": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "pod": {"type": "text"},
        "container": {"type": "text"}
    }
}
  2. Create a query-level Monitor on that index with an expensive query to make executions run longer; the longer an execution lasts, the easier the bug is to reproduce. Example config:
{
    "monitor_type": "query_level_monitor",
    "name": "Any Results Monitor",
    "enabled": true,
    "schedule": {"period": {"interval": 1, "unit": "MINUTES"}},
    "inputs": [
        {
            "search": {
                "indices": ["<index_name>"],
                "query": {
                    "size": 10000,
                    "query": {"wildcard": {"log_message": {"value": "*e*"}}},
                    "aggs": {
                        "nested_agg": {
                            "terms": {"field": "host.keyword", "size": 5000},
                            "aggs": {
                                "deep": {"terms": {"field": "status.keyword", "size": 1000}},
                                "percentiles": {"percentiles": {"field": "number"}}
                            }
                        }
                    }
                }
            }
        }
    ],
    "triggers": [
        {
            "query_level_trigger": {
                "name": "Results found",
                "severity": "1",
                "condition": {
                    "script": {
                        "source": "ctx.results[0].hits.total.value > 0",
                        "lang": "painless"
                    }
                },
                "actions": []
            }
        }
    ]
}
  3. Watch _cat/tasks and wait for a Monitor execution task to spawn
  4. The instant one spawns, delete the Monitor (a polling sketch that automates steps 3 and 4 follows below)
  5. Get Alerts should now show the same Alert in both ACTIVE and DELETED states
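
To keep step 1's data stream flowing, a load generator along these lines works; a hypothetical sketch, assuming an unauthenticated cluster on localhost:9200 and a placeholder index name "race-test-index":

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.time.Instant

// Bulk-indexes padded mock documents matching the step 1 mappings so each
// Monitor execution has plenty of data to chew through.
fun main() {
    val client = HttpClient.newHttpClient()
    val padding = "e".repeat(4_000) // fat text fields -> slower, longer executions
    while (true) {
        val bulkBody = buildString {
            repeat(500) { i ->
                appendLine("""{"index":{"_index":"race-test-index"}}""")
                appendLine("""{"@timestamp":"${Instant.now()}","status":"error","application":"mock-app","severity":"high","number":${i % 100},"log_message":"$padding","stack_trace":"$padding","request_payload":"$padding","response_payload":"$padding","host":"host-${i % 10}","pod":"pod-$i","container":"container-$i"}""")
            }
        }
        val request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_bulk"))
            .header("Content-Type", "application/x-ndjson")
            .POST(HttpRequest.BodyPublishers.ofString(bulkBody))
            .build()
        client.send(request, HttpResponse.BodyHandlers.ofString())
        Thread.sleep(1_000) // keep a steady stream of fresh documents
    }
}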
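
To automate steps 3 and 4, something like the following can poll the task list and fire the delete the moment an execution task appears; a sketch, again assuming localhost:9200, where REPLACE_WITH_MONITOR_ID is a placeholder and the "monitor" substring match on _cat/tasks output is a heuristic rather than an official task-name contract:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

const val MONITOR_ID = "REPLACE_WITH_MONITOR_ID" // placeholder: your Monitor's _id

// Polls _cat/tasks until a Monitor execution task shows up, then immediately
// deletes the Monitor to land inside the race window.
fun main() {
    val client = HttpClient.newHttpClient()
    val catTasks = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/_cat/tasks?detailed"))
        .GET()
        .build()
    while (true) {
        val tasks = client.send(catTasks, HttpResponse.BodyHandlers.ofString()).body()
        if (tasks.lineSequence().any { it.contains("monitor", ignoreCase = true) }) {
            val delete = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_plugins/_alerting/monitors/$MONITOR_ID"))
                .DELETE()
                .build()
            println(client.send(delete, HttpResponse.BodyHandlers.ofString()).body())
            break
        }
        Thread.sleep(100) // poll tightly to catch the execution early
    }
}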

What is your host/environment?

  • OS: Managed Service Domain
  • Version: OpenSearch 3.5
  • Plugins: All

Do you have any screenshots?
Here are the orphaned Alerts created during my reproduction (source: Get Alerts):

{
  "alerts": [
    {
      "id": "fYUxWp0BcHdHVvWiSqXZ",
      "version": 2,
      "monitor_id": "ziEwWp0BmYffgjIyGBRI",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 6,
      "monitor_version": 1,
      "monitor_name": "Any Results Monitor",
      "execution_id": "ziEwWp0BmYffgjIyGBRI_2026-04-04T20:30:55.462670165_1fdbbf84-8141-47f2-bef5-04cc31b35fc0",
      "trigger_id": "zSEwWp0BmYffgjIyGBQ-",
      "trigger_name": "Results found",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "DELETED",
      "error_message": null,
      "alert_history": [],
      "severity": "1",
      "action_execution_results": [],
      "start_time": 1775334673107,
      "last_notification_time": 1775334741149,
      "end_time": null,
      "acknowledged_time": null
    },
    {
      "id": "fYUxWp0BcHdHVvWiSqXZ",
      "version": 3,
      "monitor_id": "ziEwWp0BmYffgjIyGBRI",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 6,
      "monitor_version": 1,
      "monitor_name": "Any Results Monitor",
      "execution_id": "ziEwWp0BmYffgjIyGBRI_2026-04-04T20:30:55.462670165_1fdbbf84-8141-47f2-bef5-04cc31b35fc0",
      "trigger_id": "zSEwWp0BmYffgjIyGBQ-",
      "trigger_name": "Results found",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "ACTIVE",
      "error_message": null,
      "alert_history": [],
      "severity": "1",
      "action_execution_results": [],
      "start_time": 1775334673107,
      "last_notification_time": 1775334795674,
      "end_time": null,
      "acknowledged_time": null
    }
  ],
  "totalAlerts": 2
}

Proposed Solutions

  1. (Preferred) Locking Mechanism: add a per-Monitor lock that both Monitor execution and postDelete() must acquire before they proceed with their flows (sketched below)
  2. Alert Sweeper: add an independent scheduled job that scans for and cleans up orphaned Alerts
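
As an illustration of option 1, the lock could be modeled with create-only document writes: PUT <index>/_create/<id> returns 409 when the document already exists, so exactly one caller wins. A minimal sketch, assuming a hypothetical ".alerting-locks" index and omitting the expiry/renewal a real implementation (e.g. via the job scheduler's LockService) would need:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Illustrative per-Monitor lock; ".alerting-locks" is a hypothetical index.
class MonitorLock(private val client: HttpClient, monitorId: String) {
    private val createUri =
        URI.create("http://localhost:9200/.alerting-locks/_create/lock-$monitorId")
    private val docUri =
        URI.create("http://localhost:9200/.alerting-locks/_doc/lock-$monitorId")

    // Returns true only for the caller whose _create actually created the doc.
    fun tryAcquire(): Boolean {
        val request = HttpRequest.newBuilder()
            .uri(createUri)
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString("""{"owner":"${Thread.currentThread().name}"}"""))
            .build()
        return client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode() == 201
    }

    fun release() {
        val request = HttpRequest.newBuilder().uri(docUri).DELETE().build()
        client.send(request, HttpResponse.BodyHandlers.ofString())
    }
}

// Both the Monitor execution path and postDelete() would wrap their critical
// sections like this, so a delete can never interleave with a running execution:
fun <T> withMonitorLock(lock: MonitorLock, block: () -> T): T? {
    if (!lock.tryAcquire()) return null // another flow holds this Monitor
    try {
        return block()
    } finally {
        lock.release()
    }
}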
