Open
Labels
bug: Something isn't working
Milestone
Description
Summary
A settlement on the newly created FEP network spec-13-op reverted on-chain with L2BlockNumberLessThanNextBlockNumber() (4byte: 0x541d595b). agglayer-node / proposer / proof pipeline produced a certificate and PP that passed node-side verification but was rejected by L1 contracts. After cleaning proposer/aggsender DBs, the certificate settled successfully — suggesting transient/inconsistent node state or input mismatch. We need to investigate why agglayer-node/PP did not detect an L1-semantic mismatch (L2 block number / aggchainData) and harden the pipeline.
Context / important facts
- Network: `spec-13-op` (network_id: 13), created with contracts v12.2.0.
- Certificate that failed: height 3, certificate_id `0x4b5792b9d57a41be620b1867f5be7073931437bf204d169f9c3b8e94cc16b26c`.
- Contract revert payload: `execution reverted, data: "0x541d595b"`; decode: `cast 4byte 0x541d595b` → `L2BlockNumberLessThanNextBlockNumber()`.
- After cleaning the `op-succinct-proposer` + `aggsender` DBs, the settlement succeeded for the same certificate.
- agglayer-node & PP treat `aggchainData`/`aggchain-params` mostly as opaque; L1 enforces semantic constraints (e.g., `nextBlockNumber`), so node-level checks currently miss some contract-level mismatches.
- There were also RPC trace rate-limit errors (HTTP 429) in proposer logs prior to switching RPC provider to Tenderly.
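To make the revert payload above easier to recognize in logs, a small lookup from 4-byte selector to error signature can help. This is a minimal sketch: the table holds only the one selector confirmed in this issue, and the helper name `decode_revert_selector` is hypothetical, not part of any existing tool.

```python
from typing import Optional

# Selector -> custom error signature. Only this one entry is confirmed by
# the `cast 4byte` decode in this issue; extend the table as others are verified.
KNOWN_REVERT_SELECTORS = {
    "0x541d595b": "L2BlockNumberLessThanNextBlockNumber()",
}


def decode_revert_selector(revert_data: str) -> Optional[str]:
    """Return the known error signature for a revert payload, else None."""
    data = revert_data.strip().lower()
    if not data.startswith("0x"):
        data = "0x" + data
    # A selector is "0x" followed by 4 bytes (8 hex characters).
    if len(data) < 10:
        return None
    return KNOWN_REVERT_SELECTORS.get(data[:10])


if __name__ == "__main__":
    print(decode_revert_selector("0x541d595b"))
```

Unknown or truncated payloads return `None`, so callers can fall back to `cast 4byte` for anything not in the table.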
Related links:
- PR that added spec-13-op: https://github.com/agglayer/gke-shared-dev-configs/pull/234
- Datadog logs (proposer + 429): (attach Datadog link)
- Related agglayer issue: Update the aggchain hash mismatch check to be for all path agglayer#1045
- Runbook PR (optimistic-mode fix): https://github.com/agglayer/runbooks/pull/73/changes
Observed behavior
- agglayer-node / aggchain-proof / pessimistic-proof pipeline allowed generation/verification of proofs/certificate that include `aggchainData` whose L2 block number is less than the contract's `nextBlockNumber()`.
- The settlement transaction reverted on-chain with `L2BlockNumberLessThanNextBlockNumber()` when calling the rollup contract.
- After DB cleanup, the settlement with the same certificate completed successfully.
Impact
- Certificates may be produced and PPs built that will fail on-chain, causing stuck settlements and operational disruption.
- Pipeline does not provide early detection of contract-level input semantics mismatches (L2 block number, signer set, threshold, aggchain_hash, etc.).
- Risk to Phase 2 correctness if similar mismatches occur undetected in production.
Investigation goals / questions
- Confirm whether the certificate contained `aggchainData` with `_l2BlockNumber < nextBlockNumber()` at time of revert.
- Identify where stale or inconsistent state came from (aggsender/proposer DB state, race conditions, or agglayer-node behavior).
- Understand why aggchain-proof and pessimistic-proof did not detect the mismatch.
- Determine how DB cleanup fixed the problem and which state changes mattered.
- Propose and test mitigations to detect such mismatches before on-chain settlement.
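For the first goal, the contract state at the block of the revert can be read with `cast call` pinned to that block. The sketch below only assembles and runs those commands; the contract address, RPC URL, and block number are placeholders to fill in, the `(uint256)` return types are an assumption about the getters, and the function names are hypothetical.

```python
import subprocess
from typing import Callable, Dict, List

# Getters named in this issue. The (uint256) return type is an assumption;
# check it against the deployed contract's ABI.
GETTERS = [
    "latestBlockNumber()(uint256)",
    "submissionInterval()(uint256)",
    "nextBlockNumber()(uint256)",
]


def build_cast_call(contract: str, getter: str, rpc_url: str, block: int) -> List[str]:
    """Build a `cast call` command that reads `getter` at a fixed block."""
    return ["cast", "call", contract, getter,
            "--rpc-url", rpc_url, "--block", str(block)]


def query_l1_state(contract: str, rpc_url: str, block: int,
                   run: Callable = subprocess.run) -> Dict[str, str]:
    """Run every getter at `block` and return the raw cast output per getter."""
    state = {}
    for getter in GETTERS:
        cmd = build_cast_call(contract, getter, rpc_url, block)
        # `run` is injectable so the function can be exercised without Foundry.
        done = run(cmd, capture_output=True, text=True, check=True)
        state[getter] = done.stdout.strip()
    return state
```

The output is kept as raw text rather than parsed into integers, since the exact formatting of `cast` output can differ between Foundry versions.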
Actionable investigation steps
- Collect artifacts from the failing run:
  - Attach `op_succinct_db.sql` and the `aggsender` DB snapshot taken before cleanup.
  - Export raw certificate calldata / `aggchainData`, `aggchain_params`, `aggchain_hash`, `l1_info_tree_leaf_count`, `prev_ler`, `prev_pp_root`, etc.
  - Proposer / agglayer-node logs around failure time (include any 429 traces).
  - Exact RPC responses and `cast` outputs used (4byte decode shown above).
- Query L1 contract state at the revert time: `latestBlockNumber()`, `submissionInterval()`, `nextBlockNumber()`, and any rollup-specific state.
- Replay the settlement:
  - Replay the settlement call (same calldata) against a local shadow-fork of L1 to reproduce the revert.
  - Replay the aggsender → aggkit-prover → proposer flow with the same DB snapshot to find where the outdated `_l2BlockNumber` got introduced.
- Inspect pipeline components:
  - Confirm aggsender's logic for choosing L2 block ranges and check for off-by-one or stale reads.
  - Verify aggkit-prover inputs and that proof generation uses the same block range.
  - Verify PP generation path and confirm which fields are treated as opaque.
- Check rate-limiting handling:
  - Review proposer usage of RPC endpoints and debug endpoints; ensure tracing fallbacks/backoffs are in place and don't cause inconsistent behavior.
- Evaluate mitigations and PoC:
  - Implement a pre-PP validation step in agglayer-node that verifies public input semantics (e.g., compare encoded L2 block number to contract `nextBlockNumber()` via read-only call).
  - Or run a dry-run on a shadow-fork with a mock verifier or contract helper (e.g., `checkInputsValidity(...)`) that early-fails for semantics mismatches.
  - Add detection for `aggchain_hash` mismatches before PP generation.
- Add tests and monitoring:
  - Unit/integration tests for L2 block number mismatches and aggchain_hash mismatches.
  - Alerting for certificates rejected on-chain with these revert signatures.
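The pre-PP validation idea from the mitigation step could look roughly like the sketch below. It assumes the L2 block number has already been decoded from `aggchainData` (the encoding is not described in this issue, so decoding is left out) and that the contract rejects any value below `nextBlockNumber()`, which is inferred from the error name rather than from the contract source. `PrecheckResult`, `precheck_l2_block_number`, and the `fetch_next_block_number` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PrecheckResult:
    ok: bool
    reason: Optional[str] = None


def precheck_l2_block_number(
    l2_block_number: int,
    fetch_next_block_number: Callable[[], int],
) -> PrecheckResult:
    """Compare the certificate's L2 block number with the contract's view.

    `fetch_next_block_number` stands in for a read-only call to
    nextBlockNumber() on the rollup contract (hypothetical callback).
    """
    next_block = fetch_next_block_number()
    # Assumed contract rule, inferred from the error name: values below
    # nextBlockNumber() are rejected.
    if l2_block_number < next_block:
        return PrecheckResult(
            ok=False,
            reason=(
                f"L2 block number {l2_block_number} is below "
                f"nextBlockNumber() {next_block}; settlement would likely revert "
                "with L2BlockNumberLessThanNextBlockNumber()"
            ),
        )
    return PrecheckResult(ok=True)
```

Passing the contract read as a callback keeps the check unit-testable without an RPC endpoint, which also fits the unit/integration tests listed above.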
Artifacts to attach
- Pre-cleanup DB snapshots:
  - op_succinct_db.sql (Thiago confirmed snapshot available)
  - aggsender DB snapshot
- Raw certificate calldata(s) for failing certificate(s)
- Proposer / agglayer-node logs (Datadog link + extracted logs)
- Any commands/outputs used (casts, 4byte decodes)
- Configs used to spin up spec-13-op (gke-shared-dev-configs PR #234)
Acceptance criteria
- Root cause identified and documented (e.g., stale aggsender state, race, missing input validation).
- Reproduction steps that consistently reproduce the revert in a local environment.
- Short-term mitigation implemented or PR opened (pre-PP validation or shadow-fork dry-run).
- Tests added to prevent regression (unit/integration).
- Improvements to proposer's RPC/tracing handling to avoid failures caused by rate-limiting.
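One possible shape for the rate-limiting improvement is a retry wrapper with capped exponential backoff and jitter around RPC calls that return HTTP 429. This is a generic sketch, not the proposer's actual code: `RateLimited` and `call_with_backoff` are hypothetical names, and the caller is assumed to translate a 429 response into the `RateLimited` exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Raised by the caller-supplied request function on HTTP 429."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Retry `request` on RateLimited with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            # Give up by re-raising once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Scale the delay into [50%, 100%] so concurrent retries spread out.
            sleep(delay * (0.5 + 0.5 * jitter()))
    raise AssertionError("unreachable")
```

Re-raising after the final attempt makes a persistent rate limit fail loudly instead of letting the pipeline continue with partial or stale trace data.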
Suggested checklist (to paste into the issue body)
- Attach DB snapshots (op_succinct, aggsender) taken before cleanup
- Attach proposer/agglayer-node logs and Datadog link
- Identify where stale/incorrect L2 block number was introduced
- Implement PoC mitigation (pre-PP validation or shadow-fork dry-run)
- Add unit/integration tests to prevent regressions
- Open PR(s) for mitigation changes
- Close issue when mitigation is merged and tests pass