fix(cron): make cron scheduler reliable under load and in containers by mmmeff · Pull Request #186 · spacedriveapp/spacebot

mmmeff · 2026-02-24T01:37:35Z

Summary

The cron scheduler has several compounding issues that cause jobs to silently miss their scheduled firings, particularly under load or in containerized environments:

Non-blocking execution: run_cron_job was awaited inline in the timer loop, so a slow or stuck job blocked all subsequent ticks. With MissedTickBehavior::Skip, blocked ticks were permanently lost. Each execution is now spawned as an independent tokio task with an AtomicBool lock to prevent overlapping runs of the same job. The circuit breaker and run-once logic run inside the spawned task and update the shared state; the timer loop picks up enabled = false on the next tick.
True wall-clock timeout: The timeout wrapped each individual recv() call rather than the total collection phase, so periodic non-terminal output (status updates, stream chunks) could extend runtime indefinitely. A single deadline is now computed up front, and each recv uses the remaining budget.
Timezone startup warning: When cron_timezone is unset, active_hours silently falls back to chrono::Local, which is often UTC in Docker/containerized environments. Scheduler::new now logs a warning explaining the fallback and how to configure it.
API validation parity: The CronTool (LLM-facing) validated ID format, minimum interval, prompt length, delivery target format, and active hour ranges, but the HTTP API create_or_update_cron had none of these checks. Added equivalent validation.
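The "true wall-clock timeout" item above can be sketched in isolation. This is a minimal illustration, not the PR's code: the scheduler itself is async on tokio, whereas this sketch uses a blocking `std::sync::mpsc` channel so it stays self-contained. `Event`, `Outcome`, and `collect_with_deadline` are hypothetical names introduced here.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical event type standing in for whatever a job streams back.
#[derive(Debug, PartialEq)]
pub enum Event {
    Chunk(String), // non-terminal output (status update, stream chunk)
    Done,          // terminal output
}

/// Hypothetical result of the collection phase.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed(Vec<String>),
    TimedOut,
    Disconnected,
}

/// Collect events until `Done`, bounded by a single wall-clock deadline.
pub fn collect_with_deadline(rx: &Receiver<Event>, timeout: Duration) -> Outcome {
    // The deadline is computed once, before the first recv.
    let deadline = Instant::now() + timeout;
    let mut chunks = Vec::new();
    loop {
        // Each recv only gets what is left of the budget, so periodic
        // chunks cannot push the end time further out.
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Outcome::TimedOut;
        }
        match rx.recv_timeout(remaining) {
            Ok(Event::Chunk(text)) => chunks.push(text),
            Ok(Event::Done) => return Outcome::Completed(chunks),
            Err(RecvTimeoutError::Timeout) => return Outcome::TimedOut,
            Err(RecvTimeoutError::Disconnected) => return Outcome::Disconnected,
        }
    }
}
```

With a per-recv timeout, a producer that emits a chunk every few seconds would keep the loop alive indefinitely; here the same producer hits `TimedOut` once the original budget is spent.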

Test plan

Create a cron job with a short interval (e.g. 60s) and a prompt that takes >60s to complete — verify subsequent ticks are not skipped
Create a cron job with timeout_secs: 10 and a prompt that produces periodic output beyond 10s — verify it is killed at the 10s mark, not extended
Run in a container without cron_timezone set and check logs for the timezone fallback warning
Attempt to create a cron job via the HTTP API with interval_secs: 5 — verify it is rejected with a validation error
Verify circuit breaker still disables jobs after 3 consecutive failures
Verify run-once jobs still auto-disable after first execution

Made with Cursor

Note

Automated Summary

Fixed critical reliability issues in the cron scheduler affecting job execution under load and in containerized deployments. Changes span two files and address non-blocking execution with atomic task locks, precise deadline-based timeouts (rather than per-recv timeouts), missing timezone configuration warnings, and HTTP API validation parity with the LLM-facing CronTool.

Written by Tembo for commit da4e65f (fix/cron-scheduler-reliability)

src/api/cron.rs

src/cron/scheduler.rs

The cron job system had several compounding issues that caused jobs to silently miss their scheduled firings:

1. Timer loop blocked on job execution: run_cron_job was awaited inline, so a slow or stuck job prevented subsequent ticks from firing. With MissedTickBehavior::Skip, every blocked tick was permanently lost. Fix: spawn each execution as an independent tokio task with an AtomicBool lock to prevent overlapping runs of the same job.
2. Timeout was not true wall-clock: the timeout wrapped each individual recv() call rather than the total collection phase, so periodic non-terminal output (status updates, stream chunks) could extend runtime indefinitely. Fix: compute a single deadline up front and use remaining budget for each recv.
3. Active-hours timezone silent fallback: when cron_timezone is unset, active_hours silently uses chrono::Local, which is often UTC in Docker/containerized environments. Jobs with active-hours constraints would be skipped with no obvious indicator. Fix: log a warning at scheduler startup when falling back to system local time.
4. API validation gap: the CronTool (LLM-facing) validated ID format, minimum interval, prompt length, delivery target format, and active hour ranges, but the HTTP API create_or_update_cron had none of these checks, allowing malformed entries to enter the scheduler. Fix: add equivalent validation to the API path.

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai · 2026-02-24T16:30:36Z

Walkthrough

The changes enhance cron request validation through a new validation layer in the API, restructure error responses with richer messaging, introduce concurrency guards to prevent overlapping job executions, add timezone awareness for cron scheduling, and refine active hours logic to handle edge cases in normalization.

Changes

Cohort / File(s)	Summary
Request Validation and Error Handling `src/api/cron.rs`	Added validate_cron_request function enforcing id format, interval bounds, prompt length, and delivery target validation. Changed create_or_update_cron return type to include structured error responses with Json payloads. Introduced MIN_CRON_INTERVAL_SECS and MAX_CRON_PROMPT_LENGTH constants. Replaced direct StatusCode errors with cron_err helper for consistent error messaging.
Scheduler Execution and Concurrency `src/cron/scheduler.rs`	Introduced ExecutionGuard RAII pattern to reset execution flags on panic. Added per-job AtomicBool execution locks to prevent overlapping runs. Implemented active_hours normalization via new normalize_active_hours function. Enhanced timeout handling with deadline tracking. Added Scheduler::cron_timezone_label public method. Expanded scheduler initialization logging for missing timezone configuration.
Store Layer Filtering Logic `src/cron/store.rs`	Modified active_hours mapping in load_all and load_all_unfiltered to yield None when start equals end, filtering out identity-range configurations.
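The walkthrough mentions an ExecutionGuard RAII pattern that resets the execution flag on panic. The sketch below shows one way such a guard could look; the struct layout, `new`, and `run_guarded` are assumptions for illustration and are not taken from the PR's source.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical guard: clears the "job is running" flag when dropped.
/// Drop also runs while unwinding from a panic, so the flag cannot stay stuck.
pub struct ExecutionGuard {
    flag: Arc<AtomicBool>,
}

impl ExecutionGuard {
    pub fn new(flag: Arc<AtomicBool>) -> Self {
        Self { flag }
    }
}

impl Drop for ExecutionGuard {
    fn drop(&mut self) {
        self.flag.store(false, Ordering::Release);
    }
}

/// Hypothetical wrapper: runs `job` while holding the guard.
/// The caller is assumed to have set the flag to `true` beforehand.
pub fn run_guarded<F: FnOnce()>(flag: Arc<AtomicBool>, job: F) {
    let _guard = ExecutionGuard::new(flag);
    job();
    // `_guard` is dropped here on normal return, or during unwinding on panic.
}
```

Without a guard, a panic inside the spawned task would leave the flag set to `true`, and every later tick for that job would be skipped as "still running".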

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Validation gates now guard the way,
Errors structured, come what may,
Locks ensure no race takes flight,
Timezones bloom in moonlit night,
Cron hops forward, strong and bright! 🌙⏰

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely summarizes the main objective of the PR: making the cron scheduler reliable under load and in containers, which directly reflects the substantive changes across validation, execution, timeout, and timezone handling.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, providing clear context for the four major issues being addressed: non-blocking execution, true wall-clock timeouts, timezone startup warnings, and API validation parity.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

src/api/cron.rs (2)

217-222: contains(':') accepts malformed targets like ":" or "discord:".

A bare contains(':') passes for strings with empty adapter or target segments. Validate both halves are non-empty.

Suggested fix

-    if !request.delivery_target.contains(':') {
+    if !matches!(request.delivery_target.split_once(':'), Some((adapter, target)) if !adapter.is_empty() && !target.is_empty()) {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/api/cron.rs` around lines 217 - 222, The current validation uses
request.delivery_target.contains(':') which accepts inputs like ":" or
"discord:" with empty halves; update the check to split the string on the first
colon (e.g., using splitn(2, ':')) and ensure both adapter and target segments
are non-empty before proceeding, returning the same BAD_REQUEST Err if either
part is empty (referencing request.delivery_target in this validation logic).
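The delivery-target finding above can be checked with a standalone helper. This is a sketch only: `is_valid_delivery_target` is a hypothetical name, and the real check lives inline in the request validation rather than in a function like this.

```rust
/// Returns true only for "adapter:target" where both halves are non-empty.
/// Splits on the first colon, so the target itself may contain colons.
pub fn is_valid_delivery_target(value: &str) -> bool {
    matches!(
        value.split_once(':'),
        Some((adapter, target)) if !adapter.is_empty() && !target.is_empty()
    )
}
```

Strings such as ":" and "discord:" pass a bare `contains(':')` but are rejected here, because one of the two segments is empty.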

188-189: is_alphanumeric() permits non-ASCII codepoints.

The error message says "alphanumeric" but Rust's is_alphanumeric() accepts Unicode letters/digits (e.g. é, ñ). Use is_ascii_alphanumeric() if IDs should be restricted to [a-zA-Z0-9\-_].

Suggested fix

-            .all(|c| c.is_alphanumeric() || c == '-' || c == '_')
+            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/api/cron.rs` around lines 188 - 189, The validation currently uses
chars().all(|c| c.is_alphanumeric() || c == '-' || c == '_'), but
is_alphanumeric() permits non-ASCII Unicode characters; replace it with
is_ascii_alphanumeric() so IDs are restricted to [a-zA-Z0-9\-_]. Update the
expression to chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
wherever this validation is used (the closure passed to .all in src/api/cron.rs)
to enforce the intended ASCII-only rule.
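The difference between the two predicates is easy to demonstrate. In this sketch `is_valid_cron_id` is a hypothetical helper, and the non-empty requirement is an assumption that the review comment does not state.

```rust
/// Hypothetical ID check: ASCII letters, ASCII digits, '-' and '_' only.
/// Rejecting the empty string is an assumption made for this sketch.
pub fn is_valid_cron_id(id: &str) -> bool {
    !id.is_empty()
        && id
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
}
```

`char::is_alphanumeric` returns true for letters such as é and ñ, so an ID like "café" would pass the original check while contradicting an error message that promises plain alphanumerics.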

🧹 Nitpick comments (1)

src/cron/store.rs (1)
70-93: Consider extracting the shared row-to-CronConfig mapping into a helper function.

The mapping closures in load_all (Lines 72–92) and load_all_unfiltered (Lines 162–182) are identical. Extracting a fn row_to_config(row: SqliteRow) -> CronConfig helper would eliminate this duplication and ensure the new s != e guard (and any future mapping changes) stays consistent across both paths.

Also applies to: 160-183
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cron/store.rs` around lines 70 - 93, The closure that maps DB rows to
CronConfig in load_all and load_all_unfiltered is duplicated; extract it into a
shared helper like fn row_to_config(row: &Row) -> CronConfig (or accept
SqliteRow) and replace both closures with calls to row_to_config(row). Ensure
the helper performs the same try_get calls and the active_hours guard (the s !=
e check) plus conversions for interval_secs, enabled, run_once, and timeout_secs
so behavior remains identical across load_all and load_all_unfiltered.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/api/cron.rs`:
- Around line 224-239: The validation currently accepts each active_start_hour
and active_end_hour independently but create_or_update_cron discards the pair if
only one is supplied; update validate_cron_request to detect the case where
exactly one of request.active_start_hour or request.active_end_hour is Some and
return an Err with StatusCode::BAD_REQUEST (e.g. "both active_start_hour and
active_end_hour must be provided together or omitted") so callers get immediate
feedback; keep the existing per-value range checks for 0-23 and reference the
validate_cron_request and create_or_update_cron functions to locate where to add
the joint-presence check.

In `@src/cron/scheduler.rs`:
- Around line 258-264: The current load/store on execution_lock (used around the
scheduled tick in the timer loop) has a TOCTOU gap allowing trigger_now to race
with the scheduled path; change the scheduled path to attempt to set the lock
atomically using execution_lock.compare_exchange(..., Ordering::Acquire,
Ordering::Relaxed or ::Acquire/::Release as appropriate) and only call
run_cron_job when compare_exchange succeeds, and update trigger_now to perform
the same atomic compare_exchange check before calling run_cron_job so both
manual and scheduled triggers respect the same lock; ensure the lock is released
(store false with Ordering::Release) when run_cron_job completes or on error.
- Around line 317-330: The run-once disable logic currently executed
unconditionally after job execution (checking job.run_once, mutating exec_jobs
and calling exec_context.store.update_enabled) causes failed run-once jobs to be
permanently disabled; change the flow so the block that sets j.enabled = false
and calls exec_context.store.update_enabled(&exec_job_id, false).await only runs
when the execution succeeded (i.e., inside the Ok arm of the execution result)
or, if the intention is to disable regardless, add a clear comment next to
job.run_once documenting that behavior and why failures should not be retried;
reference job.run_once, exec_jobs, exec_job_id, and
exec_context.store.update_enabled when making the change.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af095f3 and 3c7a1db.

📒 Files selected for processing (3)

src/api/cron.rs
src/cron/scheduler.rs
src/cron/store.rs

coderabbitai · 2026-02-24T16:33:43Z

src/api/cron.rs

+    if let Some(start) = request.active_start_hour {
+        if start > 23 {
+            return Err((
+                StatusCode::BAD_REQUEST,
+                "active_start_hour must be 0-23".into(),
+            ));
+        }
+    }
+    if let Some(end) = request.active_end_hour {
+        if end > 23 {
+            return Err((
+                StatusCode::BAD_REQUEST,
+                "active_end_hour must be 0-23".into(),
+            ));
+        }
+    }


⚠️ Potential issue | 🟡 Minor

Supplying only one of active_start_hour / active_end_hour is silently ignored.

Validation checks each hour independently (Lines 224–239), but create_or_update_cron maps the pair to None when only one is provided (Lines 271–274). A user who accidentally omits one field gets no feedback that their active-hours window was discarded. Consider validating that both or neither are provided.

Suggested addition inside `validate_cron_request`

+    if request.active_start_hour.is_some() != request.active_end_hour.is_some() {
+        return Err((
+            StatusCode::BAD_REQUEST,
+            "active_start_hour and active_end_hour must both be provided or both omitted".into(),
+        ));
+    }
+
     Ok(())
 }

Also applies to: 271-274
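The joint-presence rule combined with the existing range checks could look like the following. It is a sketch with plain types: `validate_active_hours` is a hypothetical name, and the real code returns a `(StatusCode, String)` error rather than a bare `String`.

```rust
/// Hypothetical validator: both hours or neither, each within 0-23.
/// Returns the normalized pair so the caller never has to re-derive it.
pub fn validate_active_hours(
    start: Option<u8>,
    end: Option<u8>,
) -> Result<Option<(u8, u8)>, String> {
    // Exactly one of the two being present is the case that was silently dropped.
    if start.is_some() != end.is_some() {
        return Err(
            "active_start_hour and active_end_hour must both be provided or both omitted"
                .to_string(),
        );
    }
    match (start, end) {
        (Some(s), Some(e)) => {
            if s > 23 {
                return Err("active_start_hour must be 0-23".to_string());
            }
            if e > 23 {
                return Err("active_end_hour must be 0-23".to_string());
            }
            Ok(Some((s, e)))
        }
        _ => Ok(None),
    }
}
```

Returning the pair from the validator is a design choice made for this sketch; it keeps the "both present" decision in one place instead of repeating it where the config is built.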

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/api/cron.rs` around lines 224 - 239, The validation currently accepts each active_start_hour and active_end_hour independently but create_or_update_cron discards the pair if only one is supplied; update validate_cron_request to detect the case where exactly one of request.active_start_hour or request.active_end_hour is Some and return an Err with StatusCode::BAD_REQUEST (e.g. "both active_start_hour and active_end_hour must be provided together or omitted") so callers get immediate feedback; keep the existing per-value range checks for 0-23 and reference the validate_cron_request and create_or_update_cron functions to locate where to add the joint-presence check.

coderabbitai · 2026-02-24T16:33:43Z

src/cron/scheduler.rs

+                if execution_lock.load(std::sync::atomic::Ordering::Acquire) {
+                    tracing::debug!(cron_id = %job_id, "previous execution still running, skipping tick");
+                    continue;
+                }

-                        let should_disable = {
-                            let mut j = jobs.write().await;
-                            if let Some(j) = j.get_mut(&job_id) {
-                                j.consecutive_failures += 1;
-                                j.consecutive_failures >= MAX_CONSECUTIVE_FAILURES
-                            } else {
-                                false
+                tracing::info!(cron_id = %job_id, "cron job firing");
+                execution_lock.store(true, std::sync::atomic::Ordering::Release);


⚠️ Potential issue | 🟡 Minor

TOCTOU gap between the lock check and the lock set.

The load on Line 258 and store on Line 264 are not performed atomically. While the timer loop is single-threaded (so only one iteration runs at a time), trigger_now (Line 401) calls run_cron_job directly without checking or setting execution_lock. This means a manual trigger can race with a scheduled tick, resulting in two concurrent executions of the same job.

If overlapping a manual trigger with a scheduled run is acceptable, this is fine as-is. If not, consider using compare_exchange here and having trigger_now also respect the lock.
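The compare_exchange approach suggested above can be sketched with the standard library alone. `try_acquire` and `release` are hypothetical helper names, and the memory orderings shown are one reasonable choice rather than the ones the PR necessarily adopted.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Atomically flips the lock from `false` to `true`.
/// Returns true only for the single caller that performed the flip.
pub fn try_acquire(lock: &AtomicBool) -> bool {
    lock.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

/// Marks the job as no longer running.
pub fn release(lock: &AtomicBool) {
    lock.store(false, Ordering::Release);
}
```

Because the check and the set happen in one atomic step, the scheduled tick and a manual trigger could both call `try_acquire` and only one of them would proceed to run the job.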

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cron/scheduler.rs` around lines 258 - 264, The current load/store on execution_lock (used around the scheduled tick in the timer loop) has a TOCTOU gap allowing trigger_now to race with the scheduled path; change the scheduled path to attempt to set the lock atomically using execution_lock.compare_exchange(..., Ordering::Acquire, Ordering::Relaxed or ::Acquire/::Release as appropriate) and only call run_cron_job when compare_exchange succeeds, and update trigger_now to perform the same atomic compare_exchange check before calling run_cron_job so both manual and scheduled triggers respect the same lock; ensure the lock is released (store false with Ordering::Release) when run_cron_job completes or on error.

coderabbitai · 2026-02-24T16:33:43Z

src/cron/scheduler.rs

+                    if job.run_once {
+                        tracing::info!(cron_id = %exec_job_id, "run-once cron completed, disabling");

-                    {
-                        let mut j = jobs.write().await;
-                        if let Some(j) = j.get_mut(&job_id) {
-                            j.enabled = false;
+                        {
+                            let mut j = exec_jobs.write().await;
+                            if let Some(j) = j.get_mut(&exec_job_id) {
+                                j.enabled = false;
+                            }
                        }
-                    }

-                    if let Err(error) = context.store.update_enabled(&job_id, false).await {
-                        tracing::error!(%error, "failed to persist run-once cron disabled state");
+                        if let Err(error) = exec_context.store.update_enabled(&exec_job_id, false).await {
+                            tracing::error!(%error, "failed to persist run-once cron disabled state");
+                        }
                    }


⚠️ Potential issue | 🟡 Minor

run_once disables the job even when execution failed.

The if job.run_once block runs unconditionally after both the Ok and Err arms. A failed run-once job will be permanently disabled without ever completing successfully. If this is intentional ("run once regardless of outcome"), a brief comment would clarify the design choice. If a failed run-once job should be retried on the next tick, this block should only run on success.
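If the intent is to retry failed run-once jobs, the disable step has to be conditional on the result. A sketch of that variant follows, using a hypothetical `Job` struct that carries only the two fields involved; persistence through the store is left out.

```rust
/// Hypothetical, reduced job state for illustration.
pub struct Job {
    pub run_once: bool,
    pub enabled: bool,
}

/// Disables a run-once job only when its execution succeeded,
/// so a failed run stays enabled and can fire again on the next tick.
pub fn apply_run_once_on_success<E>(job: &mut Job, result: &Result<(), E>) {
    if job.run_once && result.is_ok() {
        job.enabled = false;
    }
}
```

The alternative reading, "run once regardless of outcome", would drop the `result.is_ok()` condition; the review only asks that whichever behavior is chosen be made explicit.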

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cron/scheduler.rs` around lines 317 - 330, The run-once disable logic currently executed unconditionally after job execution (checking job.run_once, mutating exec_jobs and calling exec_context.store.update_enabled) causes failed run-once jobs to be permanently disabled; change the flow so the block that sets j.enabled = false and calls exec_context.store.update_enabled(&exec_job_id, false).await only runs when the execution succeeded (i.e., inside the Ok arm of the execution result) or, if the intention is to disable regardless, add a clear comment next to job.run_once documenting that behavior and why failures should not be retried; reference job.run_once, exec_jobs, exec_job_id, and exec_context.store.update_enabled when making the change.

mmmeff changed the title ~~fix: make cron scheduler reliable under load and in containers~~ fix(cron): make cron scheduler reliable under load and in containers Feb 24, 2026

tembo bot reviewed Feb 24, 2026

mmmeff force-pushed the fix/cron-scheduler-reliability branch from 6de5ef6 to e4919be Compare February 24, 2026 01:47

mmmeff force-pushed the fix/cron-scheduler-reliability branch from e4919be to 1328e65 Compare February 24, 2026 03:04

mmmeff closed this Feb 24, 2026

mmmeff reopened this Feb 24, 2026

Merge branch 'main' into fix/cron-scheduler-reliability

10effd0

jamiepine approved these changes Feb 24, 2026

Merge remote-tracking branch 'upstream/main' into fix/cron-scheduler-…

3c7a1db

…reliability

coderabbitai bot reviewed Feb 24, 2026

mmmeff wants to merge 3 commits into spacedriveapp:main from mmmeff:fix/cron-scheduler-reliability
