What is the bug?
Given a Monitor that has an ACTIVE ongoing Alert, if a user deletes that Monitor in the middle of one of its executions, a race condition causes Alert duplication/orphaning where the ongoing Alert continues to exist in the .opendistro-alerting-alerts index as ACTIVE, but the same Alert is written to the .opendistro-alerting-alert-history* index pattern as DELETED.
Separately, attempting to acknowledge the orphaned ACTIVE Alert leads to a stuck "acknowledge alert" task.
The longer monitor executions run, the wider the race condition window, and the more likely this bug is to occur.
What is the expected behavior?
The presence of the DELETED Alert is correct; however, the ACTIVE Alert should no longer exist. This suggests an issue with AlertMover.kt's postDelete() flow, where the Alert is successfully copied to the history index but is never removed from the alerts index. Strangely, no "Failed to delete alerts" error logs or exceptions were found in the Elasticsearch logs.
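One plausible interleaving that would produce exactly this state is a deterministic read-delete-write race between the in-flight execution and postDelete(). The sketch below is an illustrative simulation, not plugin code; the actual write paths live in AlertMover.kt and the monitor runner, and the dict names are stand-ins for the two indices:

```python
# Deterministic simulation of the suspected interleaving. "alerts" stands in
# for .opendistro-alerting-alerts, "history" for .opendistro-alerting-alert-history*.
def simulate_race() -> tuple[dict, dict]:
    alerts = {"alert-1": "ACTIVE"}  # ongoing Alert for the Monitor
    history = {}

    # 1. Monitor execution starts and reads the current ACTIVE Alert.
    in_flight = dict(alerts)

    # 2. User deletes the Monitor mid-execution; postDelete() copies the
    #    Alert to history as DELETED and removes it from the alerts index.
    history["alert-1"] = "DELETED"
    alerts.pop("alert-1")

    # 3. The still-running execution finishes and writes its Alert back,
    #    resurrecting the ACTIVE copy that postDelete() just removed.
    alerts.update(in_flight)

    return alerts, history
```

If this is what happens, the delete itself succeeds (consistent with the absence of "Failed to delete alerts" logs), and the ACTIVE copy is re-created afterward by the execution's final write.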
How can one reproduce the bug?
Steps to reproduce the behavior:
- Create an index and send continuous voluminous data to it. Example mappings:
"mappings": {
"properties": {
# basic mock fields
"@timestamp": {"type": "date"},
"status": {"type": "text"},
"application": {"type": "text"},
"severity": {"type": "text"},
"number": {"type": "integer"},
# voluminous mock fields
"log_message": {"type": "text"},
"stack_trace": {"type": "text"},
"request_payload": {"type": "text"},
"response_payload": {"type": "text"},
# metadata mock fields
"host": {"type": "text"},
"pod": {"type": "text"},
"container": {"type": "text"},
}
}
- Create a query-level monitor on that index with an expensive query to make executions run longer. The longer you make a Monitor execution last, the easier the bug will be to reproduce. Example config:
```json
{
  "monitor_type": "query_level_monitor",
  "name": "Any Results Monitor",
  "enabled": true,
  "schedule": {"period": {"interval": 1, "unit": "MINUTES"}},
  "inputs": [
    {
      "search": {
        "indices": ["<index_name>"],
        "query": {
          "size": 10000,
          "query": {"wildcard": {"log_message": {"value": "*e*"}}},
          "aggs": {
            "nested_agg": {
              "terms": {"field": "host.keyword", "size": 5000},
              "aggs": {
                "deep": {"terms": {"field": "status.keyword", "size": 1000}},
                "percentiles": {"percentiles": {"field": "number"}}
              }
            }
          }
        }
      }
    }
  ],
  "triggers": [
    {
      "query_level_trigger": {
        "name": "Results found",
        "severity": "1",
        "condition": {
          "script": {
            "source": "ctx.results[0].hits.total.value > 0",
            "lang": "painless"
          }
        },
        "actions": []
      }
    }
  ]
}
```
- Watch `_cat/tasks` and wait for a Monitor execution task to spawn
- The instant one spawns, delete the Monitor
- You should now find both the ACTIVE and the DELETED copy of the Alert in Get Alerts
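The continuous voluminous feed in the first step can be scripted. Here is a minimal sketch that builds an NDJSON `_bulk` payload matching the mappings above; the helper names, document sizes, and index name are illustrative, and actually sending the payload (via curl or an OpenSearch client) is up to you:

```python
import json
import random
import string
from datetime import datetime, timezone

def mock_doc(seq: int) -> dict:
    """One mock document; the voluminous fields keep each doc large so queries stay slow."""
    blob = lambda n: "".join(random.choices(string.ascii_letters + " ", k=n))
    return {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "status": random.choice(["ok", "error", "timeout"]),
        "application": f"app-{seq % 5}",
        "severity": random.choice(["1", "2", "3"]),
        "number": random.randint(0, 10_000),
        "log_message": blob(2_000),
        "stack_trace": blob(2_000),
        "request_payload": blob(1_000),
        "response_payload": blob(1_000),
        "host": f"host-{seq % 10}",
        "pod": f"pod-{seq % 20}",
        "container": f"container-{seq % 3}",
    }

def bulk_payload(index: str, count: int) -> str:
    """NDJSON body for POST /_bulk: alternating action and document lines."""
    lines = []
    for i in range(count):
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(mock_doc(i)))
    return "\n".join(lines) + "\n"
```

For example, write `bulk_payload("<index_name>", 500)` to a file on a loop and POST it with `curl -s -H 'Content-Type: application/x-ndjson' --data-binary @payload.ndjson localhost:9200/_bulk` to keep the index growing during reproduction.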
What is your host/environment?
- OS: Managed Service Domain
- Version: OS 3.5
- Plugins: All
Do you have any screenshots?
Here are the orphaned Alerts created on my bug reproduction (source: Get Alerts):
```json
{
  "alerts": [
    {
      "id": "fYUxWp0BcHdHVvWiSqXZ",
      "version": 2,
      "monitor_id": "ziEwWp0BmYffgjIyGBRI",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 6,
      "monitor_version": 1,
      "monitor_name": "Any Results Monitor",
      "execution_id": "ziEwWp0BmYffgjIyGBRI_2026-04-04T20:30:55.462670165_1fdbbf84-8141-47f2-bef5-04cc31b35fc0",
      "trigger_id": "zSEwWp0BmYffgjIyGBQ-",
      "trigger_name": "Results found",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "DELETED",
      "error_message": null,
      "alert_history": [],
      "severity": "1",
      "action_execution_results": [],
      "start_time": 1775334673107,
      "last_notification_time": 1775334741149,
      "end_time": null,
      "acknowledged_time": null
    },
    {
      "id": "fYUxWp0BcHdHVvWiSqXZ",
      "version": 3,
      "monitor_id": "ziEwWp0BmYffgjIyGBRI",
      "workflow_id": "",
      "workflow_name": "",
      "associated_alert_ids": [],
      "schema_version": 6,
      "monitor_version": 1,
      "monitor_name": "Any Results Monitor",
      "execution_id": "ziEwWp0BmYffgjIyGBRI_2026-04-04T20:30:55.462670165_1fdbbf84-8141-47f2-bef5-04cc31b35fc0",
      "trigger_id": "zSEwWp0BmYffgjIyGBQ-",
      "trigger_name": "Results found",
      "finding_ids": [],
      "related_doc_ids": [],
      "state": "ACTIVE",
      "error_message": null,
      "alert_history": [],
      "severity": "1",
      "action_execution_results": [],
      "start_time": 1775334673107,
      "last_notification_time": 1775334795674,
      "end_time": null,
      "acknowledged_time": null
    }
  ],
  "totalAlerts": 2
}
```
Proposed Solutions
- (Preferred) Locking Mechanism: add a lock that Monitor execution and postDelete() must acquire before they proceed with their flows
- Alert Sweeper: add an independent scheduled job that scans for and cleans up orphaned Alerts
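The preferred option serializes Monitor execution and postDelete() on a shared per-Monitor lock, so cleanup can never interleave with an execution's final write. A language-agnostic sketch of the idea follows (the plugin itself is Kotlin; all names here are illustrative, not actual plugin APIs):

```python
# Illustrative sketch of the per-Monitor locking idea, not plugin code.
import threading
from collections import defaultdict

# One lock per Monitor ID; both flows must hold it before touching alerts.
monitor_locks: dict = defaultdict(threading.Lock)

def run_monitor(monitor_id: str, alerts: dict) -> None:
    with monitor_locks[monitor_id]:
        # ... execute queries, then write/refresh ACTIVE Alerts ...
        alerts[monitor_id] = "ACTIVE"

def post_delete(monitor_id: str, alerts: dict, history: dict) -> None:
    with monitor_locks[monitor_id]:
        # copy-to-history and delete run atomically w.r.t. executions,
        # so a concurrent execution cannot resurrect the ACTIVE copy
        if monitor_id in alerts:
            history[monitor_id] = "DELETED"
            del alerts[monitor_id]
```

In the real plugin this would presumably need to be a cluster-wide lock (e.g. via a lock document) rather than an in-process one, since executions and delete handling can run on different nodes.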