Fix/document processing failures 353 #354
Open
Co-vengers wants to merge 3 commits into GetBindu:main from
…Bindu#353) Worker accessed task_operation["_current_span"] but scheduler now sends primitive trace_id/span_id strings. Add _reconstruct_span() helper to rebuild a NonRecordingSpan from hex-encoded IDs with graceful fallback.
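A minimal sketch of how such a helper might rebuild a span from the serialized IDs. This is not the PR's actual code: the names `parse_hex_id` and `reconstruct_span` are hypothetical, and it assumes the scheduler emits the standard OpenTelemetry widths (32 hex characters for the trace ID, 16 for the span ID) and that the "graceful fallback" is a span wrapping `INVALID_SPAN_CONTEXT`.

```python
import string
from typing import Optional

_HEX_DIGITS = set(string.hexdigits)


def parse_hex_id(value: object, hex_len: int) -> Optional[int]:
    """Return the integer for a fixed-width hex ID, or None if it is unusable."""
    if not isinstance(value, str) or len(value) != hex_len:
        return None
    if not set(value) <= _HEX_DIGITS:
        return None
    parsed = int(value, 16)
    # All-zero IDs are treated as invalid by OpenTelemetry.
    return parsed if parsed != 0 else None


def reconstruct_span(trace_id: object, span_id: object):
    """Hypothetical stand-in for the PR's _reconstruct_span() helper."""
    # Imported lazily so the parsing logic above works without OpenTelemetry.
    from opentelemetry.trace import (
        INVALID_SPAN_CONTEXT,
        NonRecordingSpan,
        SpanContext,
        TraceFlags,
    )

    trace_int = parse_hex_id(trace_id, 32)  # 128-bit trace ID
    span_int = parse_hex_id(span_id, 16)    # 64-bit span ID
    if trace_int is None or span_int is None:
        # Assumed fallback: an invalid span, so the task still runs untraced.
        return NonRecordingSpan(INVALID_SPAN_CONTEXT)

    context = SpanContext(
        trace_id=trace_int,
        span_id=span_int,
        is_remote=True,  # assumption: the context originated in the scheduler
        trace_flags=TraceFlags(TraceFlags.SAMPLED),
    )
    return NonRecordingSpan(context)
```

Returning a span in every case, rather than raising, means a malformed or missing ID costs only trace correlation and never fails the task.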
Replace math.inf buffer size with a constant of 100 to prevent unbounded memory growth while still allowing task enqueue before the worker loop is ready.
Add SpanContext, TraceFlags, NonRecordingSpan, and INVALID_SPAN_CONTEXT mocks. Register opentelemetry.trace.span submodule so worker imports resolve in the test environment.
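For illustration, one way a test fixture file could register such stubs. This is a sketch under assumptions, not the contents of `tests/conftest.py`: `install_stub` and the stub `NonRecordingSpan` class are hypothetical, and the optional `modules` argument exists only so the helper can be exercised without touching the real `sys.modules`.

```python
import sys
import types


class NonRecordingSpan:
    """Minimal stand-in: stores a span context and records nothing."""

    def __init__(self, context):
        self._context = context

    def get_span_context(self):
        return self._context


def install_stub(dotted_name, modules=None, **attrs):
    """Register a stub module, creating any missing parent packages."""
    modules = sys.modules if modules is None else modules
    parts = dotted_name.split(".")
    for depth in range(1, len(parts) + 1):
        name = ".".join(parts[:depth])
        module = modules.setdefault(name, types.ModuleType(name))
        if depth > 1:
            # Expose the child as an attribute of its parent package.
            parent = modules[".".join(parts[:depth - 1])]
            setattr(parent, parts[depth - 1], module)
    # After the loop, `module` is the stub for the full dotted name.
    for key, value in attrs.items():
        setattr(module, key, value)
    return module
```

Calling `install_stub("opentelemetry.trace.span", NonRecordingSpan=NonRecordingSpan)` before the worker module is imported lets `from opentelemetry.trace.span import NonRecordingSpan` resolve against the stub, because the import system consults `sys.modules` first.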
Fix: Document Processing Failures (#353)
Branch: `fix/document-processing-failures-353`
Base: `main`

Problem
All tasks submitted via the A2A protocol (`message/send`) failed immediately without executing the agent. The `tasks/get` response returned tasks in `failed` state with no `artifacts` field.

Root Cause
Commit `1cc2a61` (`fix(scheduler): resolve anio buffer deadlock, cpu burn loop, and trace serialization`) refactored the scheduler to serialize OpenTelemetry trace context as primitive `trace_id`/`span_id` strings instead of passing a live `Span` object. However, the worker (`bindu/server/workers/base.py`) was not updated in that commit and still accessed `task_operation["_current_span"]`.

This caused a `KeyError` on every task operation. The worker's broad exception handler caught it and marked the task as `failed` before any agent logic ran.

Issues Resolved
Issue 1: Trace Context Mismatch (CRITICAL)
The scheduler sent `{trace_id: "...", span_id: "..."}` but the worker expected `{_current_span: <Span>}`.

Fix: Updated `bindu/server/workers/base.py` to reconstruct a `NonRecordingSpan` from the serialized `trace_id`/`span_id` strings. Added a `_reconstruct_span()` helper that rebuilds a `SpanContext` from the hex-encoded IDs and wraps it in a `NonRecordingSpan` for trace correlation, with a graceful fallback.

Issue 2: Missing `artifacts` in Response (CONSEQUENCE)
`tasks/get` returned tasks without the `artifacts` field because tasks never reached the `completed` state: they crashed before agent execution.

Fix: Resolved automatically by fixing Issue 1. Once tasks execute successfully, `ManifestWorker._handle_terminal_state()` generates artifacts via `build_artifacts()` and persists them with `update_task()`.

Issue 3: Unbounded Scheduler Buffer (MINOR)
The `InMemoryScheduler` used `math.inf` as the anyio stream buffer size, which could accumulate tasks without backpressure during failures.

Fix: Replaced with a bounded buffer of 100, preserving the deadlock fix while preventing unbounded memory growth.
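The effect of a bounded buffer can be illustrated with a small standalone sketch. It uses the standard library's `asyncio.Queue` as a stand-in for the anyio memory object stream, so it is an analogy rather than the scheduler's code; `BUFFER_SIZE` and `demo` are hypothetical names, and only the bound of 100 comes from the PR.

```python
import asyncio

BUFFER_SIZE = 100  # the bound described in the PR; the queue type is a stand-in


async def demo():
    queue = asyncio.Queue(maxsize=BUFFER_SIZE)

    # Tasks can be enqueued before any consumer is running, up to the bound.
    for task_id in range(BUFFER_SIZE):
        queue.put_nowait(task_id)

    # One more item is rejected instead of growing memory without limit.
    try:
        queue.put_nowait(BUFFER_SIZE)
        overflowed = False
    except asyncio.QueueFull:
        overflowed = True

    return queue.qsize(), overflowed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With the awaiting `put()` instead of `put_nowait()`, a producer would pause once the buffer is full until a consumer drains an item, which is the backpressure an unbounded buffer never applies.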
Files Changed

| File | Change |
| --- | --- |
| `bindu/server/workers/base.py` | Added `_reconstruct_span()` helper; updated `_handle_task_operation()` to use `trace_id`/`span_id` |
| `bindu/server/scheduler/memory_scheduler.py` | Replaced `math.inf` buffer with bounded buffer (100) |
| `tests/conftest.py` | Added `SpanContext`, `TraceFlags`, `NonRecordingSpan`, `INVALID_SPAN_CONTEXT` stubs; registered `opentelemetry.trace.span` submodule |

Verification
All 666 unit tests pass with 0 failures:
After this fix, the following flow works end-to-end: