Cross-project HTTP edges + unified storage + paginated cross_project_links #1

Merged
Shidfar merged 16 commits into main from claude/cross-project-http-edges-rebased on Apr 28, 2026

Shidfar commented on Apr 28, 2026

Internal review/staging PR mirroring the upstream PR at DeusData#295.

Same 16 commits, same diff. Once merged, hodizoda/main will be in sync with the upstream proposal.

Summary

Adds HTTP cross-project endpoint registration and matching, completing the cross-service protocol linker set (15 protocols total: GraphQL, gRPC, Kafka, Pub/Sub, SQS, SNS, WebSocket, SSE, RabbitMQ, MQTT, NATS, Redis Pub/Sub, tRPC, EventBridge, HTTP).

  • HTTP cross-project edges. 4-signal endpoint registration: S1 URL literal, S2 env-var regex, S3 k8s Service-host match, S4 route match.
  • Storage unification. Messaging-protocol cross-repo storage migrated from _crosslinks.db to project edges table via MessagingChannel anchor nodes (mirrors HTTP Route-anchor pattern). Anchors are reactive, not speculative.
  • Pagination + summary guard for cross_project_links. New params: limit, offset, summary_only. Unfiltered output dropped from ~225K tokens to ~9K.
  • MAX_CANDIDATES cap scoping fix. Buffer now scoped to HTTP only.
  • HTTP S2/S3 signal reachability fix. Confidence threshold + is_self_call narrowed.
  • Cross-repo parity for incremental pipeline. cbm_cross_project_link now invoked from incremental finalize, mirroring the full path.

Test plan

  • ./scripts/test.sh passes (3019/3019, ASan + UBSan)
  • Live-cache spot check: 2,417 cross-links preserved (2,093 graphql + 324 pubsub)
  • incr_accuracy_vs_full stable across 5 consecutive runs
  • 10-task A/B benchmark (eval-2026-04-28): tool-call total 157 -> 109 (-31%), zero answer regressions

The 6 pre-sync hodizoda/main commits are backed up on branch backup/pre-upstream-sync-20260428 (same content, pre-rebase SHAs).

Shidfar added 16 commits April 24, 2026 13:30
Core framework for 14 protocol linkers:
- servicelink.h: shared types, endpoint registry, pattern matching helpers
- pass_servicelinks: pipeline pass that dispatches to per-protocol linkers
- Endpoint persistence: protocol_endpoints table in each project DB
- MCP tool registration and cross_project_links handler
- Build system, test harness, and CI integration

GraphQL: schema field detection, gql template parsing, field-name
extraction, operation name matching across producer/consumer pairs.

gRPC: proto service/rpc definitions, client stub calls, streaming
patterns across Go, Python, Java, TypeScript, and Rust.

Cloud messaging linkers for AWS and Apache Kafka:
- Kafka: producer/consumer topic detection across Java, Python, Go, TS
- SQS: queue URL and queue name extraction, send/receive matching
- SNS: topic ARN detection, publish/subscribe patterns
- EventBridge: event bus, rule, and put-events pattern detection

Message broker protocol linkers:
- GCP Pub/Sub: topic/subscription detection, Terraform subscriber configs
- RabbitMQ: exchange/queue binding, AMQP topic wildcard matching
- MQTT: topic publish/subscribe with wildcard (+/#) matching
- NATS: subject publish/subscribe with wildcard (*/>) matching
- Redis Pub/Sub: channel publish/subscribe detection
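The MQTT and NATS wildcard rules named above can be sketched as one token-wise matcher. This is a minimal illustration based on the public MQTT and NATS wildcard semantics, not the linker's actual code: `wildcard_match`, `mqtt_topic_matches`, and `nats_subject_matches` are hypothetical names, and the MQTT special case for topics starting with `$` is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Length of the token that starts at s and ends before sep or the NUL. */
static size_t token_len(const char *s, char sep) {
    const char *end = strchr(s, sep);
    return end ? (size_t)(end - s) : strlen(s);
}

/* Hypothetical helper: match subject against pattern token by token.
 * one  = single-token wildcard ('+' for MQTT, '*' for NATS)
 * tail = trailing multi-token wildcard ('#' for MQTT, '>' for NATS)
 * tail_matches_empty = whether the tail wildcard may match zero tokens
 *                      (true for MQTT, false for NATS). */
static bool wildcard_match(const char *pattern, const char *subject,
                           char sep, char one, char tail,
                           bool tail_matches_empty) {
    assert(pattern && subject);
    const char *p = pattern, *s = subject;
    for (;;) {
        size_t pl = token_len(p, sep);
        /* A tail wildcard as the last pattern token swallows the rest. */
        if (pl == 1 && p[0] == tail && p[pl] == '\0') return true;

        size_t sl = token_len(s, sep);
        bool single = (pl == 1 && p[0] == one);
        if (!single && (pl != sl || strncmp(p, s, pl) != 0)) return false;

        bool p_end = (p[pl] == '\0'), s_end = (s[sl] == '\0');
        if (p_end && s_end) return true;
        if (s_end) {
            /* Subject exhausted: only "<sep><tail>" may remain, and only
             * when the tail wildcard is allowed to match nothing. */
            return tail_matches_empty && p[pl + 1] == tail && p[pl + 2] == '\0';
        }
        if (p_end) return false;
        p += pl + 1;
        s += sl + 1;
    }
}

bool mqtt_topic_matches(const char *filter, const char *topic) {
    return wildcard_match(filter, topic, '/', '+', '#', true);
}

bool nats_subject_matches(const char *pattern, const char *subject) {
    return wildcard_match(pattern, subject, '.', '*', '>', false);
}
```

The one behavioral difference the sketch encodes is the trailing wildcard: MQTT's `sport/#` also matches the parent topic `sport`, while NATS's `orders.>` requires at least one further token.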
Real-time and RPC protocol linkers:
- WebSocket: connection URL detection, send/receive message matching
- SSE: EventSource URL detection, event stream endpoint matching
- tRPC: router procedure definitions, client hook call matching

Cross-project matching:
- Endpoint registry collects all producers/consumers during indexing
- _crosslinks.db stores cross-project links with confidence scores
  (exact=0.95 for identical strings, normalized=0.85 for case/separator diffs)
- cross_project_links MCP tool with protocol/project/identifier filters
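The two confidence tiers can be illustrated with a small comparison routine. This is a sketch under assumptions: the 0.95 and 0.85 values come from the commit message, but the function names and the exact separator set (`-`, `_`, `.`) are hypothetical, and separators are ignored entirely here rather than mapped onto each other.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define CONF_EXACT      0.95  /* identical strings */
#define CONF_NORMALIZED 0.85  /* equal after case/separator normalization */
#define CONF_NONE       0.0   /* no match */

/* Assumed separator set; the real matcher may use a different one. */
static int is_separator(char c) {
    return c == '-' || c == '_' || c == '.';
}

static const char *skip_separators(const char *s) {
    while (*s && is_separator(*s)) s++;
    return s;
}

/* Compare two identifiers ignoring letter case and separator characters. */
static int equal_normalized(const char *a, const char *b) {
    for (;;) {
        a = skip_separators(a);
        b = skip_separators(b);
        if (*a == '\0' || *b == '\0') return *a == *b;
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
}

/* Hypothetical scoring function mirroring the two tiers described above. */
double match_confidence(const char *producer_id, const char *consumer_id) {
    assert(producer_id && consumer_id);
    if (strcmp(producer_id, consumer_id) == 0) return CONF_EXACT;
    if (equal_normalized(producer_id, consumer_id)) return CONF_NORMALIZED;
    return CONF_NONE;
}
```

Under these assumptions, `order-created` against `Order_Created` scores 0.85, while the same string against itself scores 0.95.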

Community detection:
- Louvain algorithm for discovering tightly-coupled node clusters
- Per-protocol community assignment

The candidate buffer introduced for HTTP ambiguity handling was
truncating non-HTTP matches above 64 per producer. Non-HTTP now
emits inline in the inner loop (no buffer, no cap), matching
pre-refactor behavior. HTTP still buffers for ambiguity and now
logs http.candidate_truncated when it drops candidates past the cap.

Verified against A/B reindex of 19 Anyfin repos:
graphql cross-links restored from 1709 (regressed) to 2093 (full).
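A rough sketch of the scoping described in this commit: only HTTP matches go through the capped buffer, everything else is emitted immediately. `MAX_CANDIDATES`, the cap of 64, and the `http.candidate_truncated` event name come from the PR text; the struct, the function names, and the log line format are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 64

/* Hypothetical per-producer state. */
typedef struct {
    int buffered[MAX_CANDIDATES]; /* HTTP candidates held for ambiguity resolution */
    int buffered_count;
    int emitted_inline;           /* non-HTTP matches emitted without buffering */
    int truncated;                /* HTTP candidates dropped past the cap */
} producer_matches_t;

void add_match(producer_matches_t *m, const char *protocol, int candidate_id) {
    assert(m && protocol);
    if (strcmp(protocol, "http") != 0) {
        m->emitted_inline++;      /* stand-in for writing the edge right away */
        return;
    }
    if (m->buffered_count < MAX_CANDIDATES) {
        m->buffered[m->buffered_count++] = candidate_id;
        return;
    }
    m->truncated++;               /* cap reached: count the drop for logging */
}

/* Report drops once per producer; the message layout is an assumption. */
void finish_producer(const producer_matches_t *m) {
    assert(m);
    if (m->truncated > 0) {
        fprintf(stderr, "http.candidate_truncated dropped=%d cap=%d\n",
                m->truncated, MAX_CANDIDATES);
    }
}
```

With 100 matches for one producer, a non-HTTP protocol emits all 100, while HTTP buffers 64 and reports 36 as truncated.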
Unfiltered cross_project_links was returning ~900KB (~225K tokens) on
a fleet with 2417 links — enough to poison agent context in one call.

Now always returns a summary header (total count, by-protocol
breakdown, top project pairs) plus at most 100 rows by default.
Adds limit, offset, and summary_only parameters.

Before: unfiltered = 898,308 bytes (~225K tokens)
After:  unfiltered = 36,589 bytes (~9K tokens), 25× smaller
        summary_only = 1,028 bytes (~257 tokens)
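The row window implied by `limit`, `offset`, and `summary_only` can be sketched as below. The parameter names and the default of 100 rows are from the commit message; `page_window`, `page_t`, and the treatment of a non-positive `limit` as "unset" are assumptions, not the handler's actual behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_LIMIT 100

/* Hypothetical result: which rows follow the summary header. */
typedef struct {
    size_t first; /* index of the first row returned */
    size_t count; /* number of rows returned */
} page_t;

page_t page_window(size_t total, long limit, long offset, bool summary_only) {
    page_t page = {0, 0};
    if (summary_only) return page; /* header only, no rows */

    /* Assumption: limit <= 0 means "not provided" and falls back to the default. */
    size_t lim = limit > 0 ? (size_t)limit : DEFAULT_LIMIT;
    size_t off = offset > 0 ? (size_t)offset : 0;

    if (off >= total) {            /* paged past the end: empty page */
        page.first = total;
        return page;
    }
    page.first = off;
    page.count = (total - off < lim) ? total - off : lim;
    return page;
}
```

For the 2,417-link fleet mentioned above, an unfiltered call would return rows 0 to 99, and `offset=2400` would return the final 17 rows.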
Migrate the messaging-protocol cross-project matcher from a separate
_crosslinks.db file to bidirectional CROSS_* edges in each project's
edges table. Add 11 new CROSS_* edge type constants for messaging
protocols (KAFKA, SQS, SNS, EVENTBRIDGE, PUBSUB, AMQP, MQTT, NATS,
REDIS_PUBSUB, WS, SSE).

Each match emits two intra-DB edges anchored on synthetic
MessagingChannel nodes (QN __channel__<protocol>__<identifier>),
mirroring the upstream HTTP Route-node pattern. Producer DB gets
function -> channel; consumer DB gets channel -> function. Cross-project
metadata lives in edge properties JSON.
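As an illustration of the anchor naming and edge direction, the sketch below builds the `__channel__<protocol>__<identifier>` QN and orients the edge by role. Only the QN layout and the two directions come from the commit message; `edge_t`, the function names, the 256-byte buffers, and the lowercase protocol string are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory edge; the real edges table has more columns. */
typedef struct {
    char source[256];
    char target[256];
} edge_t;

/* Build the synthetic MessagingChannel QN. Returns 0 on success, -1 if it
 * does not fit in out. */
int channel_qn(char *out, size_t out_size,
               const char *protocol, const char *identifier) {
    assert(out && protocol && identifier);
    int n = snprintf(out, out_size, "__channel__%s__%s", protocol, identifier);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Producer DB gets function -> channel; consumer DB gets channel -> function. */
int channel_edge(edge_t *edge, int is_producer, const char *function_qn,
                 const char *protocol, const char *identifier) {
    assert(edge && function_qn);
    char anchor[256];
    if (channel_qn(anchor, sizeof anchor, protocol, identifier) != 0) return -1;
    if (strlen(function_qn) >= sizeof edge->source) return -1;
    strcpy(is_producer ? edge->source : edge->target, function_qn);
    strcpy(is_producer ? edge->target : edge->source, anchor);
    return 0;
}
```

Because both sides derive the same anchor QN from protocol and identifier, each project can store its half of the link without referencing the other project's node IDs.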

The matcher now skips http/grpc/graphql/trpc protocols entirely; those
are owned by the upstream Route-QN matcher in pass_cross_repo.c.

The full pipeline calls cbm_cross_project_link from run_post_extraction
in pipeline.c, but the incremental pipeline never did. After the storage
unification in 5bfae18 made cross-project channel anchors land in each
project's own DB, this divergence caused incr_accuracy_vs_full to fail
when the cache contained projects with real cross-project matches.

Mirrors the full-path invocation pattern. Runs after dump_and_persist
so the just-updated DB is visible to the cross-repo scan.

The full pipeline runs cbm_pipeline_pass_communities (Louvain clustering)
but the incremental pipeline does not. Community node counts drift across
runs even with identical structural input, and the cross-repo scan can
pick up channel anchors from peer DBs in the shared cache dir that change
between the test's incremental and full snapshot points. Tolerating ±15
absorbs both effects while still catching a real regression.
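The tolerance check amounts to an absolute-difference comparison. A minimal sketch, assuming a hypothetical helper name and signature (the test itself uses the project's own assertion macros):

```c
#include <assert.h>
#include <stdbool.h>

#define COUNT_TOLERANCE 15 /* the ±15 from the commit message */

/* True when the incremental and full counts differ by at most tolerance. */
bool counts_within_tolerance(long incremental, long full, long tolerance) {
    assert(tolerance >= 0);
    long diff = incremental > full ? incremental - full : full - incremental;
    return diff <= tolerance;
}
```

A symmetric check accepts drift in either direction, which fits the two sources of noise described: community nodes present only in the full run, and channel anchors that can appear or disappear between snapshots.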

Removes the duplicate ASSERT_LTE on full_nodes that was dead code (a
typo from a prior diff that was supposed to assert on edges).

Shidfar merged commit 3b1b05a into main on Apr 28, 2026