Fix/document processing failures 353 #354

Open
Co-vengers wants to merge 3 commits into GetBindu:main from Co-vengers:fix/document-processing-failures-353

@Co-vengers
Contributor

Fix: Document Processing Failures (#353)

Branch: fix/document-processing-failures-353
Base: main

Problem

All tasks submitted via the A2A protocol (message/send) failed immediately without executing the agent. The tasks/get response returned tasks in failed state with no artifacts field.

Root Cause

Commit 1cc2a61 (fix(scheduler): resolve anio buffer deadlock, cpu burn loop, and trace serialization) refactored the scheduler to serialize OpenTelemetry trace context as primitive trace_id/span_id strings instead of passing a live Span object. However, the worker (bindu/server/workers/base.py) was not updated in that commit and still accessed task_operation["_current_span"].

This raised a KeyError on every task operation. The worker's broad exception handler caught it and marked the task as failed before any agent logic ran.

message/send → task submitted ✅ → scheduled ✅ → worker KeyError ❌ → task failed (no artifacts)
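The failure mode can be reproduced in isolation. The sketch below is not the Bindu worker code: handle_task_operation is a hypothetical stand-in that assumes only what is described above, namely a lookup of the removed _current_span key inside a broad exception handler.

```python
from typing import Any, Dict


def handle_task_operation(task_operation: Dict[str, Any]) -> str:
    """Hypothetical stand-in for the worker handler; returns the final task state."""
    try:
        # Pre-fix lookup: the scheduler no longer sends this key.
        span = task_operation["_current_span"]
        # Agent logic would run here; it is never reached with the new payload.
        return "completed"
    except Exception:
        # A broad handler hides the KeyError and only records a failed task.
        return "failed"


# Shape of the payload the refactored scheduler sends (example IDs).
new_payload = {
    "trace_id": "0af7651916cd43dd8448eb211c80319c",
    "span_id": "b7ad6b7169203331",
}
final_state = handle_task_operation(new_payload)
```

Because the handler catches Exception rather than a narrower type, the missing key looks like an ordinary task failure instead of a programming error.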

Issues Resolved

Issue 1: Trace Context Mismatch (CRITICAL)

The scheduler sent {trace_id: "...", span_id: "..."} but the worker expected {_current_span: <Span>}.

Fix: Updated bindu/server/workers/base.py to reconstruct a NonRecordingSpan from the serialized trace_id/span_id strings. Added a _reconstruct_span() helper that:

  • Parses hex-encoded trace/span IDs into a SpanContext
  • Wraps it in a NonRecordingSpan for trace correlation
  • Falls back to an invalid span context if IDs are missing or malformed
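A minimal sketch of what such a helper might look like, under stated assumptions: the IDs use the W3C Trace Context widths (32 hex characters for the trace ID, 16 for the span ID), the reconstructed context is marked remote, and the sampled flag is set. The names parse_hex_id and reconstruct_span are illustrative and may differ from the PR's actual _reconstruct_span().

```python
import string
from typing import Optional


def parse_hex_id(value: object, width: int) -> Optional[int]:
    """Return the ID as an int, or None if it is missing, malformed, or all zeros."""
    if not isinstance(value, str) or len(value) != width:
        return None
    if any(ch not in string.hexdigits for ch in value):
        return None
    # An all-zero ID is invalid in OpenTelemetry, so map 0 to None.
    return int(value, 16) or None


def reconstruct_span(trace_id: object, span_id: object):
    """Rebuild a NonRecordingSpan from serialized IDs (illustrative, not the PR's code)."""
    # Imported lazily so parse_hex_id stays usable without OpenTelemetry installed.
    from opentelemetry.trace import (
        INVALID_SPAN_CONTEXT,
        NonRecordingSpan,
        SpanContext,
        TraceFlags,
    )

    tid = parse_hex_id(trace_id, 32)  # assumed width: 128-bit trace ID
    sid = parse_hex_id(span_id, 16)   # assumed width: 64-bit span ID
    if tid is None or sid is None:
        # Graceful fallback: an invalid context instead of an exception.
        return NonRecordingSpan(INVALID_SPAN_CONTEXT)
    context = SpanContext(
        trace_id=tid,
        span_id=sid,
        is_remote=True,  # assumption: the span originated in the scheduler
        trace_flags=TraceFlags(TraceFlags.SAMPLED),  # assumption: sampled
    )
    return NonRecordingSpan(context)
```

Returning an invalid span on bad input keeps a tracing problem from failing the task itself, which is the class of failure this PR addresses.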

Issue 2: Missing artifacts in Response (CONSEQUENCE)

tasks/get returned tasks without the artifacts field because tasks never reached the completed state — they crashed before agent execution.

Fix: Resolved automatically by fixing Issue 1. Once tasks execute successfully, ManifestWorker._handle_terminal_state() generates artifacts via build_artifacts() and persists them with update_task().

Issue 3: Unbounded Scheduler Buffer (MINOR)

The InMemoryScheduler used math.inf as the anyio stream buffer size, which could accumulate tasks without backpressure during failures.

Fix: Replaced with a bounded buffer of 100, preserving the deadlock fix while preventing unbounded memory growth.
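The effect of a bounded buffer can be illustrated with the standard library. The sketch below uses asyncio.Queue as a stand-in for the anyio memory object stream, so it is not the scheduler's code; with anyio the bound would be passed as create_memory_object_stream(max_buffer_size=100). Only the value 100 comes from this PR.

```python
import asyncio
from typing import Tuple

BUFFER_SIZE = 100  # bound quoted in this PR


async def fill_buffer() -> Tuple[int, bool]:
    """Enqueue one item more than the bound and report what happened."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=BUFFER_SIZE)
    accepted = 0
    rejected = False
    for item in range(BUFFER_SIZE + 1):
        try:
            # Non-blocking put: succeeds while there is room in the buffer.
            queue.put_nowait(item)
            accepted += 1
        except asyncio.QueueFull:
            # Backpressure: the producer learns that the consumer is behind.
            rejected = True
    return accepted, rejected


result = asyncio.run(fill_buffer())
```

A non-zero bound still lets producers enqueue before a consumer is ready, while a full buffer surfaces a stalled worker instead of silently growing memory.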

Files Changed

| File | Change |
| --- | --- |
| bindu/server/workers/base.py | Added _reconstruct_span() helper; updated _handle_task_operation() to use trace_id/span_id |
| bindu/server/scheduler/memory_scheduler.py | Replaced math.inf buffer with bounded buffer (100) |
| tests/conftest.py | Added SpanContext, TraceFlags, NonRecordingSpan, INVALID_SPAN_CONTEXT stubs; registered opentelemetry.trace.span submodule |
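For readers unfamiliar with stubbing a submodule, the conftest change might resemble the sketch below. It is an assumption about the approach, not the PR's actual conftest.py: the stub classes are deliberately minimal, and they are registered only when the real package cannot be imported.

```python
import sys
import types

try:
    # Prefer the real package when it is available.
    import opentelemetry.trace.span  # noqa: F401
except ImportError:
    class SpanContext:
        """Minimal stub holding the fields the worker passes in."""

        def __init__(self, trace_id, span_id, is_remote, trace_flags=None):
            self.trace_id = trace_id
            self.span_id = span_id
            self.is_remote = is_remote
            self.trace_flags = trace_flags

    class TraceFlags(int):
        SAMPLED = 0x01

    class NonRecordingSpan:
        def __init__(self, context):
            self._context = context

        def get_span_context(self):
            return self._context

    INVALID_SPAN_CONTEXT = SpanContext(0, 0, False)

    root = types.ModuleType("opentelemetry")
    trace = types.ModuleType("opentelemetry.trace")
    span = types.ModuleType("opentelemetry.trace.span")
    # Expose the stubs on both modules so either import path resolves.
    for module in (trace, span):
        module.SpanContext = SpanContext
        module.TraceFlags = TraceFlags
        module.NonRecordingSpan = NonRecordingSpan
        module.INVALID_SPAN_CONTEXT = INVALID_SPAN_CONTEXT
    root.trace = trace
    trace.span = span
    # Registering the submodule explicitly makes
    # "from opentelemetry.trace.span import ..." resolve.
    sys.modules["opentelemetry"] = root
    sys.modules["opentelemetry.trace"] = trace
    sys.modules["opentelemetry.trace.span"] = span
```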

Verification

All 666 unit tests pass with 0 failures:

================= 666 passed, 18 skipped, 77 warnings in 5.99s =================

After this fix, the following flow works end-to-end:

# 1. Submit document for analysis
curl -X POST http://localhost:3773/ \
  -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "test-001",
    "method": "message/send",
    "params": {
      "message": {
        "messageId": "msg-001", "contextId": "ctx-001", "taskId": "task-001",
        "kind": "message", "role": "user",
        "parts": [
          {"kind": "text", "text": "Analyze the uploaded document and summarize."},
          {"kind": "file", "text": "paper.pdf", "file": {"name": "paper.pdf", "mimeType": "application/pdf", "bytes": "<base64>"}}
        ]
      }
    }
  }'

# 2. Poll task status — should reach "completed" with artifacts
curl -X POST http://localhost:3773/ \
  -H 'Content-Type: application/json' \
  -d '{
    "jsonrpc": "2.0",
    "id": "test-002",
    "method": "tasks/get",
    "params": {"taskId": "task-001"}
  }'
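
Step 2 can also be scripted. The sketch below builds the same tasks/get request and polls until the task settles. The response shape (result.status.state) and the helper names are assumptions, and the transport is injected as a send callable so that no particular HTTP client is implied.

```python
import time
from typing import Any, Callable, Dict

JsonDict = Dict[str, Any]


def tasks_get_payload(task_id: str, request_id: str) -> JsonDict:
    """Build the JSON-RPC body used in step 2 above."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tasks/get",
        "params": {"taskId": task_id},
    }


def poll_until_done(
    send: Callable[[JsonDict], JsonDict],
    task_id: str,
    attempts: int = 10,
    delay: float = 0.5,
) -> JsonDict:
    """Poll tasks/get until the task reports completed or failed."""
    for attempt in range(attempts):
        response = send(tasks_get_payload(task_id, f"poll-{attempt}"))
        task = response["result"]
        # Assumed response shape: the state lives under result.status.state.
        if task["status"]["state"] in ("completed", "failed"):
            return task
        time.sleep(delay)
    raise TimeoutError(f"task {task_id} did not settle after {attempts} polls")
```

With this fix applied, the returned task should be in the completed state and carry an artifacts field.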

…Bindu#353)

Worker accessed task_operation["_current_span"] but scheduler now sends
primitive trace_id/span_id strings. Add _reconstruct_span() helper to
rebuild a NonRecordingSpan from hex-encoded IDs with graceful fallback.

Replace math.inf buffer size with a constant of 100 to prevent
unbounded memory growth while still allowing task enqueue before the
worker loop is ready.

Add SpanContext, TraceFlags, NonRecordingSpan, and INVALID_SPAN_CONTEXT
mocks. Register opentelemetry.trace.span submodule so worker imports
resolve in the test environment.