Idea: parallel sub-goal execution with a lightweight DAG scheduler #5

chernistry · 2026-04-12T08:57:00Z

chernistry
Apr 12, 2026

Hey! I saw your question on our project (bernstein#723) and spent some time reading through your architecture. Really solid setup - 28 lobsters on a single box is no joke.

One thing jumped out at me while reading ARCHITECTURE.md and crontab.example: goal execution appears to be sequential, driven by a 30-minute patrol cron. With 28 agents available, a 6-step goal takes 6 patrol cycles (potentially 3+ hours) even when some sub-goals have no dependency on each other.

Here's a pattern we use that might help - a file-based DAG executor that fits naturally into your shell-script architecture. The core idea: when you decompose a goal, declare dependencies between sub-goals explicitly, then let independent ones run in parallel.

The pattern

1. Goal decomposition with dependency declarations

When the controller decomposes a goal, instead of a flat list, emit a dependency graph:

// .corellis/goals/goal-042/dag.json
{
  "goal_id": "goal-042",
  "sub_goals": {
    "research-api": { "depends_on": [], "assigned_to": null, "status": "pending" },
    "research-auth": { "depends_on": [], "assigned_to": null, "status": "pending" },
    "design-schema": { "depends_on": ["research-api"], "assigned_to": null, "status": "pending" },
    "implement": { "depends_on": ["design-schema", "research-auth"], "assigned_to": null, "status": "pending" },
    "write-tests": { "depends_on": ["implement"], "assigned_to": null, "status": "pending" },
    "review": { "depends_on": ["write-tests"], "assigned_to": null, "status": "pending" }
  }
}

In this example, research-api and research-auth have no dependencies - they can run immediately and in parallel. design-schema only needs research-api. implement waits for both research streams. This turns a 6-step serial chain into 4 waves.

2. The scheduler (shell-native)

Replace the 30-min cron patrol with an event-driven loop:

#!/usr/bin/env bash
# scripts/goal-dag-scheduler.sh
set -euo pipefail

GOAL_DIR="$1"
DAG_FILE="$GOAL_DIR/dag.json"
LOCK_DIR="$GOAL_DIR/.locks"
mkdir -p "$LOCK_DIR"

get_status() { jq -r ".sub_goals[\"$1\"].status" "$DAG_FILE"; }

set_status() {
  local tmp
  tmp=$(mktemp)
  jq ".sub_goals[\"$1\"].status = \"$2\"" "$DAG_FILE" > "$tmp" && mv "$tmp" "$DAG_FILE"
}

all_deps_met() {
  local sg="$1"
  local deps
  deps=$(jq -r ".sub_goals[\"$sg\"].depends_on[]" "$DAG_FILE" 2>/dev/null)
  for dep in $deps; do
    [[ "$(get_status "$dep")" == "completed" ]] || return 1
  done
  return 0
}

find_ready() {
  jq -r '.sub_goals | to_entries[] | select(.value.status == "pending") | .key' "$DAG_FILE" \
    | while read -r sg; do
        all_deps_met "$sg" && echo "$sg"
      done
}

assign_to_lobster() {
  local sg="$1"
  # Acquire lock (atomic via mkdir)
  mkdir "$LOCK_DIR/$sg.lock" 2>/dev/null || return 1
  set_status "$sg" "running"
  # Find an idle lobster and dispatch
  local lobster
  lobster=$(find_idle_lobster)
  if [[ -n "$lobster" ]]; then
    dispatch_subgoal "$lobster" "$GOAL_DIR" "$sg" &
  else
    # No idle lobster - put back in queue
    set_status "$sg" "pending"
    rmdir "$LOCK_DIR/$sg.lock"
  fi
}

# Main loop - event-driven, not polling
while true; do
  remaining=$(jq '[.sub_goals[] | select(.status != "completed" and .status != "failed")] | length' "$DAG_FILE")
  [[ "$remaining" -eq 0 ]] && break

  for sg in $(find_ready); do
    assign_to_lobster "$sg"
  done

  # Wait for any child to finish (inotifywait on dag.json changes)
  # Falls back to short sleep if inotifywait unavailable
  if command -v inotifywait &>/dev/null; then
    inotifywait -qq -e modify "$DAG_FILE" -t 30
  else
    sleep 5
  fi
done

echo "Goal $(jq -r .goal_id "$DAG_FILE") complete."

The key details:

mkdir for atomic locking - no race conditions between lobsters claiming the same sub-goal
inotifywait makes it event-driven instead of polling (falls back to 5s sleep)
Parallel dispatch - all dependency-free sub-goals launch simultaneously
When a lobster finishes, it updates dag.json, which triggers the next wave immediately

3. Lobster completion callback

When a lobster finishes a sub-goal, it writes its output and signals completion:

# Called by lobster on sub-goal completion
complete_subgoal() {
  local goal_dir="$1" sg="$2" output_file="$3"

  # Store output for downstream sub-goals to consume
  cp "$output_file" "$goal_dir/outputs/$sg.md"

  # Update DAG atomically
  local tmp
  tmp=$(mktemp)
  jq ".sub_goals[\"$sg\"].status = \"completed\" | .sub_goals[\"$sg\"].completed_at = \"$(date -u +%FT%TZ)\"" \
    "$goal_dir/dag.json" > "$tmp" && mv "$tmp" "$goal_dir/dag.json"

  rmdir "$goal_dir/.locks/$sg.lock" 2>/dev/null
}

4. Context propagation between waves

The important part - downstream sub-goals need to see upstream outputs. When spawning a lobster for design-schema, inject the output from its dependency:

dispatch_subgoal() {
  local lobster="$1" goal_dir="$2" sg="$3"
  local context=""

  # Gather outputs from all completed dependencies
  for dep in $(jq -r ".sub_goals[\"$sg\"].depends_on[]" "$goal_dir/dag.json"); do
    if [[ -f "$goal_dir/outputs/$dep.md" ]]; then
      context+=$'\n'"--- Output from $dep ---"$'\n'"$(cat "$goal_dir/outputs/$dep.md")"$'\n'
    fi
  done

  # Inject into lobster prompt
  docker exec "$lobster" bash -c "cat >> /tmp/goal-context.md <<'CTX'
$context
CTX
/scripts/run-subgoal.sh '$sg' /tmp/goal-context.md"
}

This way, implement automatically receives the research outputs from both research-api and research-auth without any explicit message passing. File-based, simple, no message broker needed.

Impact estimate

For a typical 6-step goal with the dependency structure above:

Before (sequential patrol): 6 cycles x 30min = ~180 minutes
After (DAG with 4 waves): 4 waves x ~5min each = ~20 minutes

That's roughly an 8-9x speedup on multi-step goals, and it scales with lobster count. Independent research tasks, parallel test suites, concurrent reviews - anything without a true dependency runs immediately.

Bonus: result deduplication

One more thing that might save you significant API costs. If multiple goals ask similar research questions, you're paying for the same Claude calls repeatedly across 28 lobsters. A simple content-addressed cache:

# Before sending a research query to Claude
query_hash=$(echo -n "$query" | sha256sum | cut -d' ' -f1)
cache_file=".corellis/cache/$query_hash.md"

if [[ -f "$cache_file" ]] && [[ $(find "$cache_file" -mmin -60 -print) ]]; then
  cat "$cache_file"  # Cache hit (< 1 hour old)
else
  result=$(call_claude "$query")
  echo "$result" > "$cache_file"
  echo "$result"
fi

For fuzzy matching (not just exact queries), you can normalize queries before hashing - lowercase, strip whitespace, sort words. Won't catch everything but it's zero-dependency and handles the obvious duplicates.

We've been iterating on similar patterns in Bernstein (Python-based, but the file-state + DAG scheduling concepts are the same). Our orchestrator runs a tick-based control loop with tiered phase scheduling - fast ops every tick, heavy ops every 30th tick - which might be useful if you want to keep the cron approach but tier the frequency.

Happy to go deeper on any of this. Solid project you've got here.

CorellisOrg · 2026-04-13T04:06:30Z

CorellisOrg
Apr 13, 2026
Maintainer

Thanks for taking the time to dig into our architecture, and this is our first discussion too !!!!

this is a well-thought-out proposal and the DAG pattern itself is solid.You're right that the patrol cron creates unnecessary serialization. We've been thinking about this exact bottleneck, though our approach is heading in a slightly different direction than shell-level DAG scheduling.The short version: rather than adding a scheduler layer between the cron and the agents, we're working on making goal decomposition itself dependency-aware at the GoalOps framework level. The idea is that when a goal gets decomposed, the controller already understands which sub-goals are independent and can dispatch them in parallel through the existing session system — no separate scheduler process needed. Think of it as pushing the DAG logic up into the orchestration layer rather than down into shell scripts.That said, a few things from your proposal are genuinely useful:
The inotifywait event-driven approach vs polling — we have similar "tick vs event" decisions to make and your fallback pattern is clean
mkdir for atomic locking is elegant. We've been using file locks but the mkdir trick is more portable
The content-addressed cache for deduplication is something we haven't prioritized but should — 28 lobsters doing overlapping research is real waste
I took a look at Bernstein's orchestrator.py — the tiered phase scheduling (fast ops every tick, heavy ops every N ticks) is a nice middle ground. Curious how you handle the case where a downstream task's requirements shift based on upstream output? In our experience, the DAG often needs to be re-planned mid-execution, not just traversed.Appreciate the cross-pollination. Will keep watching Bernstein's progress.

0 replies

chernistry · 2026-04-13T20:51:40Z

chernistry
Apr 13, 2026
Author

glad this resonated. the mid-execution re-planning question is one we spent a lot of time on.

our answer is basically three layers:

retry with escalation - if a downstream task fails because upstream output doesnt match what it expected, the retry system escalates: bumps the model (haiku -> sonnet -> opus), expands the timeout, and injects the failure context into the retry prompt so the agent knows what went wrong. most "shifting requirements" cases resolve here without re-planning the whole DAG.
stall detection + auto-decomposition - if a task is stuck (progress snapshots identical for 3+ ticks), the orchestrator can auto-decompose it into smaller subtasks. so if "implement auth" stalls because the upstream schema changed, it gets broken into "update schema types" + "implement auth handler" + "update tests" - each one small enough to succeed with the new upstream state.
manager review trigger - after N completions or if stall count exceeds a threshold, a manager agent reviews the backlog and can re-prioritize, cancel, or create new tasks based on what actually shipped. this is the closest thing to "re-planning the DAG" but its event-driven rather than a full re-plan pass.

the key insight for us was that full re-planning is almost never needed if your tasks are small enough. a 15-minute task that fails and retries with fresh context costs less than a re-planning step that touches the whole graph. we basically chose to make failure cheap rather than prediction perfect.

the one case where this breaks down is when upstream output fundamentally changes the approach (eg the architect decides to use a different database). for that we have manual plan editing but no automatic detection yet - thats still a gap.

0 replies

kinthaiofficial · 2026-04-29T00:08:39Z

kinthaiofficial
Apr 29, 2026

Parallel sub-goal execution with a DAG scheduler is the right architecture. A few production lessons from running this at scale:

Cost estimation before scheduling. When you have 5 parallel sub-goals, the DAG scheduler should estimate the cost of each before committing to execute them in parallel. If sub-goals 3 and 4 together would exceed the remaining budget, the scheduler should serialize them or skip the lower-priority one, rather than discovering the budget problem mid-execution.

Dependency tracking must include "soft" dependencies. Hard dependencies (B can't start until A finishes) are easy. But agents also have soft dependencies — B's output quality improves if A's context is available, even though B could run without it. These soft dependencies often determine optimal serialization order even when parallelization is technically possible.

Budget propagation through the DAG. The root task has a budget. When it spawns 5 parallel sub-goals, how is the budget divided? Approaches:

Fixed split (20% each) — simple but inflexible
Estimated split (based on pre-estimated cost) — better but requires cost estimation
Dynamic pool (all sub-goals draw from shared pool) — flexible but risks one sub-goal consuming everything

We use dynamic pool with per-sub-goal soft limits (can use up to 40% of pool but can't exceed it). This gives flexibility while preventing one runaway sub-goal from starving the others.

Cancellation propagation. When the root task is cancelled, all in-flight sub-goals need to be cancelled too. In a DAG, this means cascade cancellation — any node whose parent is cancelled should also cancel. The DAG scheduler needs to track running sub-goals and send cancel signals.

More on coordination architecture: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idea: parallel sub-goal execution with a lightweight DAG scheduler #5

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Idea: parallel sub-goal execution with a lightweight DAG scheduler #5

Uh oh!

chernistry Apr 12, 2026

The pattern

1. Goal decomposition with dependency declarations

2. The scheduler (shell-native)

3. Lobster completion callback

4. Context propagation between waves

Impact estimate

Bonus: result deduplication

Replies: 3 comments

Uh oh!

CorellisOrg Apr 13, 2026 Maintainer

Uh oh!

chernistry Apr 13, 2026 Author

Uh oh!

kinthaiofficial Apr 29, 2026

chernistry
Apr 12, 2026

CorellisOrg
Apr 13, 2026
Maintainer

chernistry
Apr 13, 2026
Author

kinthaiofficial
Apr 29, 2026