
add long gaps #313

Draft
nkhristinin wants to merge 1 commit into `main` from `long-gaps`

Conversation

@nkhristinin
Collaborator

No description provided.

nkhristinin added a commit to elastic/kibana that referenced this pull request Feb 27, 2026
…eduler creating excessively large saved objects. (#254788)

## Summary

Fixes OOM crashes and Kibana restarts caused by the gap auto-fill
scheduler creating excessively large saved objects.

### Problem

The backfill client creates one `AdHocRunSO` saved object per rule,
containing a `schedule` array with one entry per rule-interval step
across all gap ranges. There is no upper bound on the number of entries.

For rules with short intervals and long gaps, this array grows to tens
of thousands of entries. Each entry is ~70 bytes serialized, so a single
SO can reach multiple megabytes. When batched into `bulkCreate` requests
(previously chunks of 10, concurrency of 10), the combined payload
exceeds Elasticsearch's `http.max_content_length`, causing:

- `"payload too large"` errors from Elasticsearch `_bulk` requests
- V8 heap exhaustion during JSON serialization of large SO arrays
- Event loop blocking for minutes (preventing task timeout/cancellation)
- Repeated OOM crashes and pod restarts

Additionally, the scheduler previously processed up to 100 rules per
batch (`DEFAULT_RULES_BATCH_SIZE`), building all SOs in memory before
sending any, which amplified peak memory usage.
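
The numbers above can be combined into a back-of-envelope estimate. This sketch only restates figures from this description (~70 bytes per serialized entry, a 50-day gap, a 1-minute rule interval, and the old chunk/concurrency defaults); the variable names are illustrative:

```typescript
// Back-of-envelope payload estimate under the old defaults.
// Assumptions (from the description above): ~70 bytes per serialized
// schedule entry, a 50-day gap, one entry per 1-minute interval step.
const entryBytes = 70;
const entriesPerSO = 50 * 24 * 60; // 72,000 entries for a 50-day gap at 1m
const soBytes = entriesPerSO * entryBytes; // ~5 MB for a single AdHocRunSO
// Old defaults: chunks of 10 SOs per bulkCreate, 10 concurrent requests.
const inFlightBytes = soBytes * 10 * 10; // ~500 MB of serialized payload in flight
```

Roughly half a gigabyte of serialized saved objects in flight at once is consistent with both the `payload too large` errors and the V8 heap exhaustion described above.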

### Changes

**Cap schedule entries per SO (`calculateSchedule`)**

- Introduce `MAX_SCHEDULE_ENTRIES = 10,000`, bounding each SO to ~700 KB
regardless of rule interval.
- `calculateSchedule` now returns `{ schedule, truncated }`. When
truncated, the `end` field on the SO is derived from the last scheduled
entry (not the original range end), keeping the SO self-consistent.
- The unfilled remainder stays on the gap document and is picked up by
the next scheduler run.
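
A minimal sketch of the capped calculation. Only `MAX_SCHEDULE_ENTRIES` and the `{ schedule, truncated }` return shape come from this description; the parameter list and entry shape are illustrative assumptions:

```typescript
const MAX_SCHEDULE_ENTRIES = 10_000;

interface ScheduleEntry {
  runAt: string; // ISO timestamp of the end of one rule-interval step
}

// Hypothetical shape of the capped schedule calculation: walk the gap range
// one interval at a time, stopping at the entry cap.
function calculateSchedule(
  startMs: number,
  endMs: number,
  intervalMs: number
): { schedule: ScheduleEntry[]; truncated: boolean } {
  const schedule: ScheduleEntry[] = [];
  let cursor = startMs;
  while (cursor < endMs) {
    if (schedule.length >= MAX_SCHEDULE_ENTRIES) {
      // Stop here; the SO's `end` is derived from the last entry, and the
      // unfilled remainder stays on the gap document for the next run.
      return { schedule, truncated: true };
    }
    cursor = Math.min(cursor + intervalMs, endMs);
    schedule.push({ runAt: new Date(cursor).toISOString() });
  }
  return { schedule, truncated: false };
}
```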

**Reduce bulk request sizes**

- `bulkCreate` chunk size: `10` → `3` (each SO can be up to ~700 KB, so
a chunk of 3 stays well within payload limits).
- `bulkCreate` concurrency: `10` → `2` (reduces peak memory from
concurrent in-flight requests).
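
The chunking-plus-bounded-concurrency pattern can be sketched as below. This is a hand-rolled stand-in (the actual code likely uses an existing utility such as `pMap`), and `worker` stands in for the `bulkCreate` call:

```typescript
// Split an array into fixed-size chunks (e.g. 3 SOs per bulkCreate request).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Run `worker` over `tasks` with at most `limit` promises in flight,
// preserving result order.
async function runWithConcurrency<T, R>(
  tasks: T[],
  limit: number,
  worker: (task: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(tasks.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await worker(tasks[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, lane));
  return results;
}
```

Under the new defaults this would be invoked roughly as `runWithConcurrency(chunk(adHocRunSOs, 3), 2, (batch) => soClient.bulkCreate(batch))`: at most 2 requests of 3 SOs each in flight at a time.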

**Reduce rules per scheduler batch and gap page size**

- `DEFAULT_RULES_BATCH_SIZE`: `100` → `10`. Fewer rules per iteration
means less memory pressure and more frequent cancellation checkpoints.
- `DEFAULT_GAPS_PER_PAGE`: `5,000` → `DEFAULT_RULES_BATCH_SIZE * 50`
(`500`). Aligned with the smaller batch size to avoid fetching far more
gaps than can be processed in one batch.

**Limit backfill task concurrency**

- Set `maxConcurrency: 3` on the backfill ad-hoc task runner
registration, preventing Task Manager from running too many backfill
tasks in parallel and overwhelming the Kibana process.
- Added `'ad_hoc_run-backfill'` to the Task Manager
`CONCURRENCY_ALLOW_LIST_BY_TASK_TYPE` to enable the concurrency limit.

**Harden error handling**

- Catch blocks in `bulkQueue` now handle non-`Error` objects safely
(`error instanceof Error ? error.message : String(error)`).

### Both code paths covered

The scheduler task and the UI bulk-fill API both converge on
`processGapsBatch` → `scheduleBackfill` → `bulkQueue`, so the schedule
cap and chunk size changes apply to both.

### Performance comparison

Local benchmarks comparing `main` (default) vs this branch, using rules
with 1-minute interval:

| Scenario | Default (main) | This branch |
|---|---|---|
| 100 rules, 1,000 gaps/rule (~1–2 min per gap) | 2s | 22s |
| 100 rules, 1 long gap/rule (50-day duration) | **OOM crash** | 45s |
| 500 rules, 10 gaps/rule | 2s | 23s |

The new branch is slower due to smaller chunk sizes, lower concurrency,
and per-SO schedule caps. But it no longer crashes. The tradeoff is
intentional: safety over throughput.

## What is next

We should consider not storing the schedule array in the SO at all and instead calculating the next run dynamically during backfill execution. That would let us increase the chunk size again, since each SO would be much smaller.

### How to test

To reproduce the crash, we need ~100 rules with a short interval and long gaps.

I created this
[PR](elastic/security-documents-generator#313)
for a utility that generates rules with gaps, including long ones:

1. 100 rules with a 1m interval and 1 gap of 50 days duration:

   `npm run start -- rules --rules 100 -d 50 -i "1m" -c`

2. 100 rules with a 1m interval and 1,000 gaps per rule (~1m each):

   `npm run start -- rules --rules 100 -g 1000 -i "1m" -c`

Then enable the gap auto-fill scheduler.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Feb 27, 2026
…eduler creating excessively large saved objects. (elastic#254788)

(cherry picked from commit afc555f)
nkhristinin added a commit to elastic/kibana that referenced this pull request Mar 7, 2026
…re than 90 days" (#256507)

## Fix: gap auto-fill scheduler fails with "Backfill cannot look back
more than 90 days"

### Problem

Suppose we have gaps that are 100 days old.

Backfill scheduling enforces a 90-day validation limit. The gap
auto-fill scheduler fetches all gaps that overlap the `now-90d` range,
so these 100-day-old gaps can still be fetched because their interval
overlaps that window. Their interval is then clamped to 90 days.

Later, when `scheduleBackfill` validates the ranges, it computes its own
`now`, which is slightly later because some processing time has elapsed
since the task started. As a result, `now - startDate` can become
greater than 90 days, and the validation rejects the range with:

```text
Backfill cannot look back more than 90 days
```

### Fix

After parsing `gapFillRange`, clamp `startDate` so it stays at least 5
minutes inside the 90-day lookback window. This gives enough buffer for
processing delays and ensures the clamped ranges remain safely within
the validation limit.
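
A sketch of the clamp under the stated assumptions (90-day lookback, 5-minute buffer); the function name and millisecond-based signature are illustrative:

```typescript
const MS_PER_MINUTE = 60 * 1000;
const MS_PER_DAY = 24 * 60 * MS_PER_MINUTE;
const LOOKBACK_MS = 90 * MS_PER_DAY; // backfill validation limit
const BUFFER_MS = 5 * MS_PER_MINUTE; // headroom for processing delays

// Keep startDate at least 5 minutes inside the 90-day lookback window,
// so a later `now` computed during validation still accepts the range.
function clampStartDate(startMs: number, nowMs: number): number {
  const earliestAllowed = nowMs - LOOKBACK_MS + BUFFER_MS;
  return Math.max(startMs, earliestAllowed);
}
```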

### How to test


I created
elastic/security-documents-generator#313, a
utility that generates rules with gaps, including long ones:

100 rules with a 1m interval and 1 gap of 100 days duration:

`npm run start -- rules --rules 100 -d 100 -i "1m" -c`

On `main`, enable the gap auto-fill scheduler and observe that the execution
fails.

With this PR, the task should execute successfully.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Mar 7, 2026
…re than 90 days" (elastic#256507)

(cherry picked from commit c929b3e)
kibanamachine added a commit to elastic/kibana that referenced this pull request Mar 7, 2026
…ack more than 90 days" (#256507) (#256578)

# Backport

This will backport the following commits from `main` to `9.3`:
- [Fix: gap auto-fill scheduler fails with "Backfill cannot look back
more than 90 days"
(#256507)](#256507)


### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sorenlouv/backport)


Co-authored-by: Khristinin Nikita <nikita.khristinin@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
kapral18 pushed a commit to kapral18/kibana that referenced this pull request Mar 9, 2026
…re than 90 days" (elastic#256507)

qn895 pushed a commit to qn895/kibana that referenced this pull request Mar 11, 2026
…eduler creating excessively large saved objects. (elastic#254788)

qn895 pushed a commit to qn895/kibana that referenced this pull request Mar 11, 2026
…re than 90 days" (elastic#256507)
