Skip to content

Comments

fix(cron): make cron scheduler reliable under load and in containers#186

Open
mmmeff wants to merge 3 commits intospacedriveapp:mainfrom
mmmeff:fix/cron-scheduler-reliability
Open

fix(cron): make cron scheduler reliable under load and in containers#186
mmmeff wants to merge 3 commits intospacedriveapp:mainfrom
mmmeff:fix/cron-scheduler-reliability

Conversation

@mmmeff
Copy link

@mmmeff mmmeff commented Feb 24, 2026

Summary

The cron scheduler has several compounding issues that cause jobs to silently miss their scheduled firings, particularly under load or in containerized environments:

  • Non-blocking execution: run_cron_job was awaited inline in the timer loop, so a slow or stuck job blocked all subsequent ticks. With MissedTickBehavior::Skip, blocked ticks were permanently lost. Each execution is now spawned as an independent tokio task with an AtomicBool lock to prevent overlapping runs of the same job. The circuit breaker and run-once logic run inside the spawned task and update the shared state; the timer loop picks up enabled = false on the next tick.

  • True wall-clock timeout: The timeout wrapped each individual recv() call rather than the total collection phase, so periodic non-terminal output (status updates, stream chunks) could extend runtime indefinitely. A single deadline is now computed up front, and each recv uses the remaining budget.

  • Timezone startup warning: When cron_timezone is unset, active_hours silently falls back to chrono::Local, which is often UTC in Docker/containerized environments. Scheduler::new now logs a warning explaining the fallback and how to configure it.

  • API validation parity: The CronTool (LLM-facing) validated ID format, minimum interval, prompt length, delivery target format, and active hour ranges, but the HTTP API create_or_update_cron had none of these checks. Added equivalent validation.

Test plan

  • Create a cron job with a short interval (e.g. 60s) and a prompt that takes >60s to complete — verify subsequent ticks are not skipped
  • Create a cron job with timeout_secs: 10 and a prompt that produces periodic output beyond 10s — verify it is killed at the 10s mark, not extended
  • Run in a container without cron_timezone set and check logs for the timezone fallback warning
  • Attempt to create a cron job via the HTTP API with interval_secs: 5 — verify it is rejected with a validation error
  • Verify circuit breaker still disables jobs after 3 consecutive failures
  • Verify run-once jobs still auto-disable after first execution

Made with Cursor

Note

Automated Summary

Fixed critical reliability issues in the cron scheduler affecting job execution under load and in containerized deployments. Changes span two files and address non-blocking execution with atomic task locks, precise deadline-based timeouts (rather than per-recv timeouts), missing timezone configuration warnings, and HTTP API validation parity with the LLM-facing CronTool.

Written by Tembo for commit da4e65f (fix/cron-scheduler-reliability)

@mmmeff mmmeff changed the title fix: make cron scheduler reliable under load and in containers fix(cron): make cron scheduler reliable under load and in containers Feb 24, 2026
@mmmeff mmmeff force-pushed the fix/cron-scheduler-reliability branch from 6de5ef6 to e4919be Compare February 24, 2026 01:47
The cron job system had several compounding issues that caused jobs to
silently miss their scheduled firings:

1. Timer loop blocked on job execution — run_cron_job was awaited inline,
   so a slow or stuck job prevented subsequent ticks from firing. With
   MissedTickBehavior::Skip, every blocked tick was permanently lost.
   Fix: spawn each execution as an independent tokio task with an
   AtomicBool lock to prevent overlapping runs of the same job.

2. Timeout was not true wall-clock — the timeout wrapped each individual
   recv() call rather than the total collection phase, so periodic
   non-terminal output (status updates, stream chunks) could extend
   runtime indefinitely. Fix: compute a single deadline up front and
   use remaining budget for each recv.

3. Active-hours timezone silent fallback — when cron_timezone is unset,
   active_hours silently uses chrono::Local, which is often UTC in
   Docker/containerized environments. Jobs with active-hours constraints
   would be skipped with no obvious indicator. Fix: log a warning at
   scheduler startup when falling back to system local time.

4. API validation gap — the CronTool (LLM-facing) validated ID format,
   minimum interval, prompt length, delivery target format, and active
   hour ranges, but the HTTP API create_or_update_cron had none of these
   checks, allowing malformed entries to enter the scheduler. Fix: add
   equivalent validation to the API path.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mmmeff mmmeff force-pushed the fix/cron-scheduler-reliability branch from e4919be to 1328e65 Compare February 24, 2026 03:04
@mmmeff mmmeff closed this Feb 24, 2026
@mmmeff mmmeff reopened this Feb 24, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 24, 2026

Walkthrough

The changes enhance cron request validation through a new validation layer in the API, restructure error responses with richer messaging, introduce concurrency guards to prevent overlapping job executions, add timezone awareness for cron scheduling, and refine active hours logic to handle edge cases in normalization.

Changes

Cohort / File(s) Summary
Request Validation and Error Handling
src/api/cron.rs
Added validate_cron_request function enforcing id format, interval bounds, prompt length, and delivery target validation. Changed create_or_update_cron return type to include structured error responses with Json payloads. Introduced MIN_CRON_INTERVAL_SECS and MAX_CRON_PROMPT_LENGTH constants. Replaced direct StatusCode errors with cron_err helper for consistent error messaging.
Scheduler Execution and Concurrency
src/cron/scheduler.rs
Introduced ExecutionGuard RAII pattern to reset execution flags on panic. Added per-job AtomicBool execution locks to prevent overlapping runs. Implemented active_hours normalization via new normalize_active_hours function. Enhanced timeout handling with deadline tracking. Added Scheduler::cron_timezone_label public method. Expanded scheduler initialization logging for missing timezone configuration.
Store Layer Filtering Logic
src/cron/store.rs
Modified active_hours mapping in load_all and load_all_unfiltered to yield None when start equals end, filtering out identity-range configurations.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Validation gates now guard the way,
Errors structured, come what may,
Locks ensure no race takes flight,
Timezones bloom in moonlit night,
Cron hops forward, strong and bright! 🌙⏰

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main objective of the PR: making the cron scheduler reliable under load and in containers, which directly reflects the substantive changes across validation, execution, timeout, and timezone handling.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, providing clear context for the four major issues being addressed: non-blocking execution, true wall-clock timeouts, timezone startup warnings, and API validation parity.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
src/api/cron.rs (2)

217-222: contains(':') accepts malformed targets like ":" or "discord:".

A bare contains(':') passes for strings with empty adapter or target segments. Validate both halves are non-empty.

Suggested fix
-    if !request.delivery_target.contains(':') {
+    if !matches!(request.delivery_target.split_once(':'), Some((adapter, target)) if !adapter.is_empty() && !target.is_empty()) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/cron.rs` around lines 217 - 222, The current validation uses
request.delivery_target.contains(':') which accepts inputs like ":" or
"discord:" with empty halves; update the check to split the string on the first
colon (e.g., using splitn(2, ':')) and ensure both adapter and target segments
are non-empty before proceeding, returning the same BAD_REQUEST Err if either
part is empty (referencing request.delivery_target in this validation logic).

188-189: is_alphanumeric() permits non-ASCII codepoints.

The error message says "alphanumeric" but Rust's is_alphanumeric() accepts Unicode letters/digits (e.g. é, ñ). Use is_ascii_alphanumeric() if IDs should be restricted to [a-zA-Z0-9\-_].

Suggested fix
-            .all(|c| c.is_alphanumeric() || c == '-' || c == '_')
+            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/cron.rs` around lines 188 - 189, The validation currently uses
chars().all(|c| c.is_alphanumeric() || c == '-' || c == '_'), but
is_alphanumeric() permits non-ASCII Unicode characters; replace it with
is_ascii_alphanumeric() so IDs are restricted to [a-zA-Z0-9]. Update the
expression to chars().all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_')
wherever this validation is used (the closure passed to .all in src/api/cron.rs)
to enforce the intended ASCII-only rule.
🧹 Nitpick comments (1)
src/cron/store.rs (1)

70-93: Consider extracting the shared row-to-CronConfig mapping into a helper function.

The mapping closure in load_all (Lines 72–92) and load_all_unfiltered (Lines 162–182) are identical. Extracting a fn row_to_config(row: SqliteRow) -> CronConfig helper would eliminate this duplication and ensure the new s != e guard (and any future mapping changes) stays consistent across both paths.

Also applies to: 160-183

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cron/store.rs` around lines 70 - 93, The closure that maps DB rows to
CronConfig in load_all and load_all_unfiltered is duplicated; extract it into a
shared helper like fn row_to_config(row: &Row) -> CronConfig (or accept
SqliteRow) and replace both closures with calls to row_to_config(row). Ensure
the helper performs the same try_get calls and the active_hours guard (the s !=
e check) plus conversions for interval_secs, enabled, run_once, and timeout_secs
so behavior remains identical across load_all and load_all_unfiltered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/api/cron.rs`:
- Around line 224-239: The validation currently accepts each active_start_hour
and active_end_hour independently but create_or_update_cron discards the pair if
only one is supplied; update validate_cron_request to detect the case where
exactly one of request.active_start_hour or request.active_end_hour is Some and
return an Err with StatusCode::BAD_REQUEST (e.g. "both active_start_hour and
active_end_hour must be provided together or omitted") so callers get immediate
feedback; keep the existing per-value range checks for 0-23 and reference the
validate_cron_request and create_or_update_cron functions to locate where to add
the joint-presence check.

In `@src/cron/scheduler.rs`:
- Around line 258-264: The current load/store on execution_lock (used around the
scheduled tick in the timer loop) has a TOCTOU gap allowing trigger_now to race
with the scheduled path; change the scheduled path to attempt to set the lock
atomically using execution_lock.compare_exchange(..., Ordering::Acquire,
Ordering::Relaxed or ::Acquire/::Release as appropriate) and only call
run_cron_job when compare_exchange succeeds, and update trigger_now to perform
the same atomic compare_exchange check before calling run_cron_job so both
manual and scheduled triggers respect the same lock; ensure the lock is released
(store false with Ordering::Release) when run_cron_job completes or on error.
- Around line 317-330: The run-once disable logic currently executed
unconditionally after job execution (checking job.run_once, mutating exec_jobs
and calling exec_context.store.update_enabled) causes failed run-once jobs to be
permanently disabled; change the flow so the block that sets j.enabled = false
and calls exec_context.store.update_enabled(&exec_job_id, false).await only runs
when the execution succeeded (i.e., inside the Ok arm of the execution result)
or, if the intention is to disable regardless, add a clear comment next to
job.run_once documenting that behavior and why failures should not be retried;
reference job.run_once, exec_jobs, exec_job_id, and
exec_context.store.update_enabled when making the change.

---

Duplicate comments:
In `@src/api/cron.rs`:
- Around line 217-222: The current validation uses
request.delivery_target.contains(':') which accepts inputs like ":" or
"discord:" with empty halves; update the check to split the string on the first
colon (e.g., using splitn(2, ':')) and ensure both adapter and target segments
are non-empty before proceeding, returning the same BAD_REQUEST Err if either
part is empty (referencing request.delivery_target in this validation logic).
- Around line 188-189: The validation currently uses chars().all(|c|
c.is_alphanumeric() || c == '-' || c == '_'), but is_alphanumeric() permits
non-ASCII Unicode characters; replace it with is_ascii_alphanumeric() so IDs are
restricted to [a-zA-Z0-9]. Update the expression to chars().all(|c|
c.is_ascii_alphanumeric() || c == '-' || c == '_') wherever this validation is
used (the closure passed to .all in src/api/cron.rs) to enforce the intended
ASCII-only rule.

---

Nitpick comments:
In `@src/cron/store.rs`:
- Around line 70-93: The closure that maps DB rows to CronConfig in load_all and
load_all_unfiltered is duplicated; extract it into a shared helper like fn
row_to_config(row: &Row) -> CronConfig (or accept SqliteRow) and replace both
closures with calls to row_to_config(row). Ensure the helper performs the same
try_get calls and the active_hours guard (the s != e check) plus conversions for
interval_secs, enabled, run_once, and timeout_secs so behavior remains identical
across load_all and load_all_unfiltered.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af095f3 and 3c7a1db.

📒 Files selected for processing (3)
  • src/api/cron.rs
  • src/cron/scheduler.rs
  • src/cron/store.rs

Comment on lines +224 to +239
if let Some(start) = request.active_start_hour {
if start > 23 {
return Err((
StatusCode::BAD_REQUEST,
"active_start_hour must be 0-23".into(),
));
}
}
if let Some(end) = request.active_end_hour {
if end > 23 {
return Err((
StatusCode::BAD_REQUEST,
"active_end_hour must be 0-23".into(),
));
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Supplying only one of active_start_hour / active_end_hour is silently ignored.

Validation checks each hour independently (Lines 224–239), but create_or_update_cron maps the pair to None when only one is provided (Lines 271–274). A user who accidentally omits one field gets no feedback that their active-hours window was discarded. Consider validating that both or neither are provided.

Suggested addition inside `validate_cron_request`
+    if request.active_start_hour.is_some() != request.active_end_hour.is_some() {
+        return Err((
+            StatusCode::BAD_REQUEST,
+            "active_start_hour and active_end_hour must both be provided or both omitted".into(),
+        ));
+    }
+
     Ok(())
 }

Also applies to: 271-274

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/api/cron.rs` around lines 224 - 239, The validation currently accepts
each active_start_hour and active_end_hour independently but
create_or_update_cron discards the pair if only one is supplied; update
validate_cron_request to detect the case where exactly one of
request.active_start_hour or request.active_end_hour is Some and return an Err
with StatusCode::BAD_REQUEST (e.g. "both active_start_hour and active_end_hour
must be provided together or omitted") so callers get immediate feedback; keep
the existing per-value range checks for 0-23 and reference the
validate_cron_request and create_or_update_cron functions to locate where to add
the joint-presence check.

Comment on lines +258 to +264
if execution_lock.load(std::sync::atomic::Ordering::Acquire) {
tracing::debug!(cron_id = %job_id, "previous execution still running, skipping tick");
continue;
}

let should_disable = {
let mut j = jobs.write().await;
if let Some(j) = j.get_mut(&job_id) {
j.consecutive_failures += 1;
j.consecutive_failures >= MAX_CONSECUTIVE_FAILURES
} else {
false
tracing::info!(cron_id = %job_id, "cron job firing");
execution_lock.store(true, std::sync::atomic::Ordering::Release);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

TOCTOU gap between the lock check and the lock set.

The load on Line 258 and store on Line 264 are not performed atomically. While the timer loop is single-threaded (so only one iteration runs at a time), trigger_now (Line 401) calls run_cron_job directly without checking or setting execution_lock. This means a manual trigger can race with a scheduled tick, resulting in two concurrent executions of the same job.

If overlapping a manual trigger with a scheduled run is acceptable, this is fine as-is. If not, consider using compare_exchange here and having trigger_now also respect the lock.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cron/scheduler.rs` around lines 258 - 264, The current load/store on
execution_lock (used around the scheduled tick in the timer loop) has a TOCTOU
gap allowing trigger_now to race with the scheduled path; change the scheduled
path to attempt to set the lock atomically using
execution_lock.compare_exchange(..., Ordering::Acquire, Ordering::Relaxed or
::Acquire/::Release as appropriate) and only call run_cron_job when
compare_exchange succeeds, and update trigger_now to perform the same atomic
compare_exchange check before calling run_cron_job so both manual and scheduled
triggers respect the same lock; ensure the lock is released (store false with
Ordering::Release) when run_cron_job completes or on error.

Comment on lines +317 to 330
if job.run_once {
tracing::info!(cron_id = %exec_job_id, "run-once cron completed, disabling");

{
let mut j = jobs.write().await;
if let Some(j) = j.get_mut(&job_id) {
j.enabled = false;
{
let mut j = exec_jobs.write().await;
if let Some(j) = j.get_mut(&exec_job_id) {
j.enabled = false;
}
}
}

if let Err(error) = context.store.update_enabled(&job_id, false).await {
tracing::error!(%error, "failed to persist run-once cron disabled state");
if let Err(error) = exec_context.store.update_enabled(&exec_job_id, false).await {
tracing::error!(%error, "failed to persist run-once cron disabled state");
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

run_once disables the job even when execution failed.

The if job.run_once block runs unconditionally after both the Ok and Err arms. A failed run-once job will be permanently disabled without ever completing successfully. If this is intentional ("run once regardless of outcome"), a brief comment would clarify the design choice. If a failed run-once job should be retried on the next tick, this block should only run on success.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cron/scheduler.rs` around lines 317 - 330, The run-once disable logic
currently executed unconditionally after job execution (checking job.run_once,
mutating exec_jobs and calling exec_context.store.update_enabled) causes failed
run-once jobs to be permanently disabled; change the flow so the block that sets
j.enabled = false and calls exec_context.store.update_enabled(&exec_job_id,
false).await only runs when the execution succeeded (i.e., inside the Ok arm of
the execution result) or, if the intention is to disable regardless, add a clear
comment next to job.run_once documenting that behavior and why failures should
not be retried; reference job.run_once, exec_jobs, exec_job_id, and
exec_context.store.update_enabled when making the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants