fix(cron): make cron scheduler reliable under load and in containers#186
fix(cron): make cron scheduler reliable under load and in containers#186mmmeff wants to merge 3 commits intospacedriveapp:mainfrom
Conversation
6de5ef6 to
e4919be
Compare
The cron job system had several compounding issues that caused jobs to silently miss their scheduled firings: 1. Timer loop blocked on job execution — run_cron_job was awaited inline, so a slow or stuck job prevented subsequent ticks from firing. With MissedTickBehavior::Skip, every blocked tick was permanently lost. Fix: spawn each execution as an independent tokio task with an AtomicBool lock to prevent overlapping runs of the same job. 2. Timeout was not true wall-clock — the timeout wrapped each individual recv() call rather than the total collection phase, so periodic non-terminal output (status updates, stream chunks) could extend runtime indefinitely. Fix: compute a single deadline up front and use remaining budget for each recv. 3. Active-hours timezone silent fallback — when cron_timezone is unset, active_hours silently uses chrono::Local, which is often UTC in Docker/containerized environments. Jobs with active-hours constraints would be skipped with no obvious indicator. Fix: log a warning at scheduler startup when falling back to system local time. 4. API validation gap — the CronTool (LLM-facing) validated ID format, minimum interval, prompt length, delivery target format, and active hour ranges, but the HTTP API create_or_update_cron had none of these checks, allowing malformed entries to enter the scheduler. Fix: add equivalent validation to the API path. Co-authored-by: Cursor <cursoragent@cursor.com>
e4919be to
1328e65
Compare
WalkthroughThe changes enhance cron request validation through a new validation layer in the API, restructure error responses with richer messaging, introduce concurrency guards to prevent overlapping job executions, add timezone awareness for cron scheduling, and refine active hours logic to handle edge cases in normalization. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (2)
src/api/cron.rs (2)
217-222:contains(':')accepts malformed targets like":"or"discord:".A bare
contains(':')passes for strings with empty adapter or target segments. Validate both halves are non-empty.Suggested fix
- if !request.delivery_target.contains(':') { + if !matches!(request.delivery_target.split_once(':'), Some((adapter, target)) if !adapter.is_empty() && !target.is_empty()) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/cron.rs` around lines 217 - 222, The current validation uses request.delivery_target.contains(':') which accepts inputs like ":" or "discord:" with empty halves; update the check to split the string on the first colon (e.g., using splitn(2, ':')) and ensure both adapter and target segments are non-empty before proceeding, returning the same BAD_REQUEST Err if either part is empty (referencing request.delivery_target in this validation logic).
188-189:is_alphanumeric()permits non-ASCII codepoints.The error message says "alphanumeric" but Rust's
is_alphanumeric()accepts Unicode letters/digits (e.g.é,ñ). Useis_ascii_alphanumeric()if IDs should be restricted to[a-zA-Z0-9\-_].Suggested fix
- .all(|c| c.is_alphanumeric() || c == '-' || c == '_') + .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/api/cron.rs` around lines 188 - 189, The validation currently uses chars().all(|c| c.is_alphanumeric() || c == '-' || c == '_'), but is_alphanumeric() permits non-ASCII Unicode characters; replace it with is_ascii_alphanumeric() so IDs are restricted to [a-zA-Z0-9]. Update the expression to chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_') wherever this validation is used (the closure passed to .all in src/api/cron.rs) to enforce the intended ASCII-only rule.
🧹 Nitpick comments (1)
src/cron/store.rs (1)
70-93: Consider extracting the shared row-to-CronConfigmapping into a helper function.The mapping closure in
load_all(Lines 72–92) andload_all_unfiltered(Lines 162–182) are identical. Extracting afn row_to_config(row: SqliteRow) -> CronConfighelper would eliminate this duplication and ensure the news != eguard (and any future mapping changes) stays consistent across both paths.Also applies to: 160-183
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cron/store.rs` around lines 70 - 93, The closure that maps DB rows to CronConfig in load_all and load_all_unfiltered is duplicated; extract it into a shared helper like fn row_to_config(row: &Row) -> CronConfig (or accept SqliteRow) and replace both closures with calls to row_to_config(row). Ensure the helper performs the same try_get calls and the active_hours guard (the s != e check) plus conversions for interval_secs, enabled, run_once, and timeout_secs so behavior remains identical across load_all and load_all_unfiltered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/api/cron.rs`:
- Around line 224-239: The validation currently accepts each active_start_hour
and active_end_hour independently but create_or_update_cron discards the pair if
only one is supplied; update validate_cron_request to detect the case where
exactly one of request.active_start_hour or request.active_end_hour is Some and
return an Err with StatusCode::BAD_REQUEST (e.g. "both active_start_hour and
active_end_hour must be provided together or omitted") so callers get immediate
feedback; keep the existing per-value range checks for 0-23 and reference the
validate_cron_request and create_or_update_cron functions to locate where to add
the joint-presence check.
In `@src/cron/scheduler.rs`:
- Around line 258-264: The current load/store on execution_lock (used around the
scheduled tick in the timer loop) has a TOCTOU gap allowing trigger_now to race
with the scheduled path; change the scheduled path to attempt to set the lock
atomically using execution_lock.compare_exchange(..., Ordering::Acquire,
Ordering::Relaxed or ::Acquire/::Release as appropriate) and only call
run_cron_job when compare_exchange succeeds, and update trigger_now to perform
the same atomic compare_exchange check before calling run_cron_job so both
manual and scheduled triggers respect the same lock; ensure the lock is released
(store false with Ordering::Release) when run_cron_job completes or on error.
- Around line 317-330: The run-once disable logic currently executed
unconditionally after job execution (checking job.run_once, mutating exec_jobs
and calling exec_context.store.update_enabled) causes failed run-once jobs to be
permanently disabled; change the flow so the block that sets j.enabled = false
and calls exec_context.store.update_enabled(&exec_job_id, false).await only runs
when the execution succeeded (i.e., inside the Ok arm of the execution result)
or, if the intention is to disable regardless, add a clear comment next to
job.run_once documenting that behavior and why failures should not be retried;
reference job.run_once, exec_jobs, exec_job_id, and
exec_context.store.update_enabled when making the change.
---
Duplicate comments:
In `@src/api/cron.rs`:
- Around line 217-222: The current validation uses
request.delivery_target.contains(':') which accepts inputs like ":" or
"discord:" with empty halves; update the check to split the string on the first
colon (e.g., using splitn(2, ':')) and ensure both adapter and target segments
are non-empty before proceeding, returning the same BAD_REQUEST Err if either
part is empty (referencing request.delivery_target in this validation logic).
- Around line 188-189: The validation currently uses chars().all(|c|
c.is_alphanumeric() || c == '-' || c == '_'), but is_alphanumeric() permits
non-ASCII Unicode characters; replace it with is_ascii_alphanumeric() so IDs are
restricted to [a-zA-Z0-9]. Update the expression to chars().all(|c|
c.is_ascii_alphanumeric() || c == '-' || c == '_') wherever this validation is
used (the closure passed to .all in src/api/cron.rs) to enforce the intended
ASCII-only rule.
---
Nitpick comments:
In `@src/cron/store.rs`:
- Around line 70-93: The closure that maps DB rows to CronConfig in load_all and
load_all_unfiltered is duplicated; extract it into a shared helper like fn
row_to_config(row: &Row) -> CronConfig (or accept SqliteRow) and replace both
closures with calls to row_to_config(row). Ensure the helper performs the same
try_get calls and the active_hours guard (the s != e check) plus conversions for
interval_secs, enabled, run_once, and timeout_secs so behavior remains identical
across load_all and load_all_unfiltered.
| if let Some(start) = request.active_start_hour { | ||
| if start > 23 { | ||
| return Err(( | ||
| StatusCode::BAD_REQUEST, | ||
| "active_start_hour must be 0-23".into(), | ||
| )); | ||
| } | ||
| } | ||
| if let Some(end) = request.active_end_hour { | ||
| if end > 23 { | ||
| return Err(( | ||
| StatusCode::BAD_REQUEST, | ||
| "active_end_hour must be 0-23".into(), | ||
| )); | ||
| } | ||
| } |
There was a problem hiding this comment.
Supplying only one of active_start_hour / active_end_hour is silently ignored.
Validation checks each hour independently (Lines 224–239), but create_or_update_cron maps the pair to None when only one is provided (Lines 271–274). A user who accidentally omits one field gets no feedback that their active-hours window was discarded. Consider validating that both or neither are provided.
Suggested addition inside `validate_cron_request`
+ if request.active_start_hour.is_some() != request.active_end_hour.is_some() {
+ return Err((
+ StatusCode::BAD_REQUEST,
+ "active_start_hour and active_end_hour must both be provided or both omitted".into(),
+ ));
+ }
+
Ok(())
}Also applies to: 271-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/api/cron.rs` around lines 224 - 239, The validation currently accepts
each active_start_hour and active_end_hour independently but
create_or_update_cron discards the pair if only one is supplied; update
validate_cron_request to detect the case where exactly one of
request.active_start_hour or request.active_end_hour is Some and return an Err
with StatusCode::BAD_REQUEST (e.g. "both active_start_hour and active_end_hour
must be provided together or omitted") so callers get immediate feedback; keep
the existing per-value range checks for 0-23 and reference the
validate_cron_request and create_or_update_cron functions to locate where to add
the joint-presence check.
| if execution_lock.load(std::sync::atomic::Ordering::Acquire) { | ||
| tracing::debug!(cron_id = %job_id, "previous execution still running, skipping tick"); | ||
| continue; | ||
| } | ||
|
|
||
| let should_disable = { | ||
| let mut j = jobs.write().await; | ||
| if let Some(j) = j.get_mut(&job_id) { | ||
| j.consecutive_failures += 1; | ||
| j.consecutive_failures >= MAX_CONSECUTIVE_FAILURES | ||
| } else { | ||
| false | ||
| tracing::info!(cron_id = %job_id, "cron job firing"); | ||
| execution_lock.store(true, std::sync::atomic::Ordering::Release); |
There was a problem hiding this comment.
TOCTOU gap between the lock check and the lock set.
The load on Line 258 and store on Line 264 are not performed atomically. While the timer loop is single-threaded (so only one iteration runs at a time), trigger_now (Line 401) calls run_cron_job directly without checking or setting execution_lock. This means a manual trigger can race with a scheduled tick, resulting in two concurrent executions of the same job.
If overlapping a manual trigger with a scheduled run is acceptable, this is fine as-is. If not, consider using compare_exchange here and having trigger_now also respect the lock.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cron/scheduler.rs` around lines 258 - 264, The current load/store on
execution_lock (used around the scheduled tick in the timer loop) has a TOCTOU
gap allowing trigger_now to race with the scheduled path; change the scheduled
path to attempt to set the lock atomically using
execution_lock.compare_exchange(..., Ordering::Acquire, Ordering::Relaxed or
::Acquire/::Release as appropriate) and only call run_cron_job when
compare_exchange succeeds, and update trigger_now to perform the same atomic
compare_exchange check before calling run_cron_job so both manual and scheduled
triggers respect the same lock; ensure the lock is released (store false with
Ordering::Release) when run_cron_job completes or on error.
| if job.run_once { | ||
| tracing::info!(cron_id = %exec_job_id, "run-once cron completed, disabling"); | ||
|
|
||
| { | ||
| let mut j = jobs.write().await; | ||
| if let Some(j) = j.get_mut(&job_id) { | ||
| j.enabled = false; | ||
| { | ||
| let mut j = exec_jobs.write().await; | ||
| if let Some(j) = j.get_mut(&exec_job_id) { | ||
| j.enabled = false; | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if let Err(error) = context.store.update_enabled(&job_id, false).await { | ||
| tracing::error!(%error, "failed to persist run-once cron disabled state"); | ||
| if let Err(error) = exec_context.store.update_enabled(&exec_job_id, false).await { | ||
| tracing::error!(%error, "failed to persist run-once cron disabled state"); | ||
| } | ||
| } |
There was a problem hiding this comment.
run_once disables the job even when execution failed.
The if job.run_once block runs unconditionally after both the Ok and Err arms. A failed run-once job will be permanently disabled without ever completing successfully. If this is intentional ("run once regardless of outcome"), a brief comment would clarify the design choice. If a failed run-once job should be retried on the next tick, this block should only run on success.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/cron/scheduler.rs` around lines 317 - 330, The run-once disable logic
currently executed unconditionally after job execution (checking job.run_once,
mutating exec_jobs and calling exec_context.store.update_enabled) causes failed
run-once jobs to be permanently disabled; change the flow so the block that sets
j.enabled = false and calls exec_context.store.update_enabled(&exec_job_id,
false).await only runs when the execution succeeded (i.e., inside the Ok arm of
the execution result) or, if the intention is to disable regardless, add a clear
comment next to job.run_once documenting that behavior and why failures should
not be retried; reference job.run_once, exec_jobs, exec_job_id, and
exec_context.store.update_enabled when making the change.
Summary
The cron scheduler has several compounding issues that cause jobs to silently miss their scheduled firings, particularly under load or in containerized environments:
Non-blocking execution:
run_cron_jobwas awaited inline in the timer loop, so a slow or stuck job blocked all subsequent ticks. WithMissedTickBehavior::Skip, blocked ticks were permanently lost. Each execution is now spawned as an independent tokio task with anAtomicBoollock to prevent overlapping runs of the same job. The circuit breaker and run-once logic run inside the spawned task and update the shared state; the timer loop picks upenabled = falseon the next tick.True wall-clock timeout: The timeout wrapped each individual
recv()call rather than the total collection phase, so periodic non-terminal output (status updates, stream chunks) could extend runtime indefinitely. A single deadline is now computed up front, and each recv uses the remaining budget.Timezone startup warning: When
cron_timezoneis unset,active_hourssilently falls back tochrono::Local, which is often UTC in Docker/containerized environments.Scheduler::newnow logs a warning explaining the fallback and how to configure it.API validation parity: The
CronTool(LLM-facing) validated ID format, minimum interval, prompt length, delivery target format, and active hour ranges, but the HTTP APIcreate_or_update_cronhad none of these checks. Added equivalent validation.Test plan
timeout_secs: 10and a prompt that produces periodic output beyond 10s — verify it is killed at the 10s mark, not extendedcron_timezoneset and check logs for the timezone fallback warninginterval_secs: 5— verify it is rejected with a validation errorMade with Cursor
Note
Automated Summary
Fixed critical reliability issues in the cron scheduler affecting job execution under load and in containerized deployments. Changes span two files and address non-blocking execution with atomic task locks, precise deadline-based timeouts (rather than per-recv timeouts), missing timezone configuration warnings, and HTTP API validation parity with the LLM-facing CronTool.
Written by Tembo for commit da4e65f (fix/cron-scheduler-reliability)