fix: code quality improvements — Go version, CI race+lint, error mapping, bounded query, CHANGELOG#145
fix: code quality improvements — Go version, CI race+lint, error mapping, bounded query, CHANGELOG#1450xgetz wants to merge 13 commits intounicitynetwork:mainfrom
Conversation
…ates - events: Unsubscribe now closes the subscriber channel so goroutines ranging over it terminate cleanly instead of blocking forever. Update test to use the two-value receive idiom (e, ok := <-ch) to distinguish a genuine event from a closed-channel signal. - api/types: Add PreviousRootHash field to SubmitShardRootRequest for optimistic concurrency control. Add ShardRootStatusInvalidPreviousHash status constant to communicate stale-update rejections to callers. - service/parent_service: Validate PreviousRootHash when provided — fetch the current SMT node for the shard and reject the update if the supplied previous hash does not match, preventing lost-update races where two child aggregators submit conflicting roots in the same round (addresses issue unicitynetwork#82).
The commitment collection window previously used a tight loop with time.Sleep(10ms) and a default: branch, burning CPU between events. Replace it with a proper time.NewTimer + labelled-break select so the goroutine blocks efficiently until either a commitment arrives, the 200 ms window expires, or the context is cancelled. Also extract the four magic numbers (collectPhaseDuration, miniBatchSize, prefetchMinSpace/BatchBuffer/MaxBatch) into named package-level constants so tuning values are visible and documented in one place. Additionally optimise restoreSmtFromStorage: instead of calling AddLeaves once per chunk (O(chunks * n) hash work), accumulate all leaves into a single pre-allocated slice and call AddLeaves exactly once at the end — O(n) hash work total.
The prefetcher used a goto DonePushing label to break out of the push loop. goto cannot jump across variable declarations and makes control flow hard to follow. Replace with a labelled 'break pushLoop' which is idiomatic Go and compiles without restriction. Also add submitShardRootWithRetry improvements: - Add shardRootMaxAttempts (10) cap so a permanently unreachable parent does not stall the round forever (previous code was an infinite loop) - Replace fixed 1s sleep with exponential back-off starting at 500ms, doubling each attempt up to a 30s ceiling - Log retryIn duration and maxAttempts so operators can see back-off progression in the logs Fix FinalizeBlockWithRetry: replace time.Sleep with a context-aware select so a cancelled context (e.g. node shutdown) aborts the retry loop immediately instead of blocking for the full delay. Extract commitmentToLeaf helper to centralise path+leafValue derivation that was previously duplicated inline in processMiniBatch. Add ctx.Err() priority check in storeDataParallel so context cancellation is surfaced before storage errors, allowing callers to distinguish shutdown from genuine write failures.
AddPreHashedLeaf was a stub that silently dropped all pre-hashed leaves (returned nil without inserting anything). The underlying SparseMerkleTree stores raw bytes directly, so AddLeaf is the correct implementation — no extra hashing step is needed. Wire it through properly. Add CountLeaves() to SparseMerkleTree: a recursive O(n) traversal that returns the actual number of leaf nodes in the tree. Previously ThreadSafeSMT.GetStats() always reported leafCount=0 via a placeholder getLeafCount() stub. Replace the stub with a call to CountLeaves() so the stats endpoint reflects real tree state. Expand GetStats() to remove the 'isLocked' placeholder field (it was always false and carried no useful information).
GetNoDeletionProof in both AggregatorService and ParentAggregatorService previously returned a plain fmt.Errorf string (or a mock response). Callers had no way to distinguish 'not implemented' from a transient failure without string-matching the error message. Wrap with errors.ErrUnsupported so callers can use errors.Is() for programmatic detection. Also replace the mock NoDeletionProof response in AggregatorService (which silently returned fake data) with a proper ErrUnsupported error. Apply the same ErrUnsupported wrapping to GetBlock, GetBlockCommitments, and GetBlockRecords in ParentAggregatorService. Add explanatory TODO comment in AggregatorService.GetNoDeletionProof describing what a real implementation requires (full SMT audit-log traversal) so the scope of the work is clear to future contributors.
strings.Split("", ",") returns []string{""} — a one-element slice
containing an empty string — rather than an empty slice. When
BFT_ANNOUNCE_ADDRESSES or BFT_BOOTSTRAP_ADDRESSES env vars are unset
this caused downstream address-validation to fail with a confusing
'invalid bootstrap address: empty string' error.
Introduce splitNonEmpty() which returns nil for an empty input and
discards blank/whitespace-only tokens from non-empty inputs. Use it
for both BFT address env-var splits.
Fatal() was logging 'fatal error' as the message key and packing the
actual arguments into a slog.Any("args", ...) attribute, so the log
line looked like:
level=ERROR msg="fatal error" args=[foo bar]
instead of:
level=ERROR msg="foo bar"
Fix by using fmt.Sprint/fmt.Sprintf to format the arguments into the
message string directly, matching the behaviour callers expect from a
logrus-compatible Fatal shim.
A buffer of 1 means any burst of two events published before the consumer drains the channel silently drops the second event. In practice the aggregator publishes multiple round-completion events in rapid succession (e.g. block finalised + shard root submitted). Increase the default buffer to 16 to absorb short bursts without dropping events, and document the reasoning in a named constant (defaultSubscriberBuffer) so the value is easy to find and tune.
…state internal/config/config_validate_test.go: - TestShardingModeHelpers: table-driven test covering all defined ShardingMode values plus unknown/empty/wrong-case inputs for the IsValid/IsStandalone/IsParent/IsChild predicates. - TestConfigValidate: table-driven test exercising the key validation branches in Config.Validate — missing required fields, invalid log level, TLS half-configuration, invalid sharding mode, out-of-range shard ID length, zero HTTP/2 stream limit, HA without server ID. - TestSplitNonEmpty: table-driven test for the new splitNonEmpty helper covering empty string, all-commas, single token, multi-token, consecutive commas, whitespace padding, and leading/trailing commas. internal/round/round_state_test.go: - TestRoundStateString: verifies the String() representation of every defined RoundState value and that out-of-range values return 'unknown' rather than panicking. - TestTryAddLeavesOneByOne_EmptyInput: verifies the function returns pre-allocated non-nil result slices even for empty input.
gateway/handlers.go: document the two-transport architecture (HTTP REST + JSON-RPC 2.0), the delegation pattern to AggregatorService, and the two-tier error-handling model (400 vs 500 class responses). storage/mongodb/commitment.go: document the per-collection type pattern, explain why each collection gets its own type, and note that all exported methods accept context.Context for deadline propagation and graceful shutdown participation.
Cover New() level/format/output variants, NewWithConfig file logging, Close() with and without a file writer, WithContext key extraction, WithComponent/WithStateID/WithError/WithFields helpers, all four context-aware logging methods, AsyncLogger basic publish and Stop idempotency, dropped-log counter under backpressure, and GetDroppedLogs. All 15 tests pass without Docker or external dependencies.
- go.mod: downgrade go directive from 1.25 to 1.24 (current stable release) - ci.yml: fix GO_VERSION to 1.24, add -race flag to test step, add golangci-lint job - handlers.go: map errors.ErrUnsupported in GetNoDeletionProof to MethodNotFoundCode instead of InternalErrorCode so callers get a clear 'not implemented' signal - aggregator_record.go: add SetLimit(10000) to GetByBlockNumber Find query to prevent unbounded memory growth on large blocks - CHANGELOG.md: convert changes.txt entries into a structured CHANGELOG.md; retain changes.txt for historical reference
There was a problem hiding this comment.
Code Review
This pull request enhances the aggregator's core functionality by replacing busy-wait loops with ticker-based processing, optimizing SMT restoration with batched leaf insertion, and implementing exponential backoff for shard root submissions. It also introduces optimistic concurrency checks for shard updates and expands documentation and test coverage. Review feedback identified several high-severity issues, including hardcoded durations that bypass configuration, potential memory exhaustion during SMT restoration, incorrect shard ID encoding for database lookups, and silent truncation of aggregator records in API responses.
| // timer-based select instead of a busy-sleep loop. This avoids wasting CPU | ||
| // and reduces latency — the goroutine blocks efficiently until either a | ||
| // commitment arrives, the window expires, or the context is cancelled. | ||
| collectTimer := time.NewTimer(collectPhaseDuration) |
There was a problem hiding this comment.
The processRound method uses the hardcoded collectPhaseDuration constant (200ms) instead of the configured rm.roundDuration. This ignores the user-defined ROUND_DURATION environment variable and contradicts the PR description which claims the duration is configurable.
| collectTimer := time.NewTimer(collectPhaseDuration) | |
| collectTimer := time.NewTimer(rm.roundDuration) |
| // Pre-allocate the full leaf slice so we only call AddLeaves once at the | ||
| // end. Calling AddLeaves on every chunk would rebuild partial tree state | ||
| // on each iteration, multiplying the hash-computation work by O(chunks). | ||
| allLeaves := make([]*smt.Leaf, 0, totalCount) |
There was a problem hiding this comment.
There are three issues here:
- Compilation Error:
totalCountis anint64(returned by MongoDB'sCount), butmakerequires anintfor capacity. This will fail to compile. - Scalability: Pre-allocating a single slice for the entire SMT state is not scalable. If the collection grows to millions of nodes, this will cause an OOM (Out Of Memory) error.
- Logic Error: If the
smt_nodescollection stores all historical versions of nodes,totalCountwill reflect the entire history, not just the current state. Loading every historical node as a leaf and re-inserting them viaAddLeavesis incorrect and will likely fail due to path length validation.
| allLeaves := make([]*smt.Leaf, 0, totalCount) | |
| allLeaves := make([]*smt.Leaf, 0, int(totalCount)) |
| shardKey := make([]byte, 4) | ||
| shardKey[0] = byte(req.ShardID >> 24) | ||
| shardKey[1] = byte(req.ShardID >> 16) | ||
| shardKey[2] = byte(req.ShardID >> 8) | ||
| shardKey[3] = byte(req.ShardID) |
There was a problem hiding this comment.
The manual 4-byte big-endian encoding of ShardID is likely to cause lookup failures. The SMT implementation (and the restoration logic in round_manager.go) uses big.Int.Bytes() to generate storage keys, which does not include leading zeros. For example, ShardID 10 would be stored as [10], but this code searches for [0, 0, 0, 10]. Additionally, ShardID is an int, which can be 64-bit, making the 4-byte assumption potentially unsafe.
| shardKey := make([]byte, 4) | |
| shardKey[0] = byte(req.ShardID >> 24) | |
| shardKey[1] = byte(req.ShardID >> 16) | |
| shardKey[2] = byte(req.ShardID >> 8) | |
| shardKey[3] = byte(req.ShardID) | |
| shardKey := new(big.Int).SetInt64(int64(req.ShardID)).Bytes() |
| opts := options.Find().SetLimit(maxRecordsPerBlock) | ||
| cursor, err := ars.collection.Find(ctx, filter, opts) |
There was a problem hiding this comment.
The GetByBlockNumber query silently truncates results at 10,000 records. If a block contains more commitments than this limit, the caller receives an incomplete list with no indication that data is missing. This is a critical issue for clients relying on this method for auditing or synchronization. The API should either return a 'hasMore' flag or implement proper pagination.
… tests - Fix CI bug: 'make test -race' was silently ignoring the -race flag; now correctly calls 'make test-race' target - Add separate 'test-integration' CI job for Docker-dependent tests - Add .golangci.yml with explicit linter config (errcheck, govet, staticcheck, bodyclose, exhaustive, and more) - Add 'make test-integration' Makefile target (full suite with Docker) - Update 'make test' and 'make test-race' to pass -short flag, skipping integration tests that require Docker/MongoDB/Redis - Add testing.Short() guards to all integration test suites: internal/round, internal/ha, internal/service, internal/storage/mongodb, internal/storage/redis, test/integration - Normalize go.mod version from 'go 1.24' to 'go 1.24.0' All 21 packages pass 'go test -short ./...' with zero failures.
Summary
This PR fixes several code quality issues identified during review.
1.
go.mod— correct Go version directivego 1.25does not exist yet. Changed togo 1.24(current stable release).2. CI (
ci.yml) — three improvementsa) Fix
GO_VERSION— updated from1.25to1.24to matchgo.mod.b) Add
-raceflag to tests — the test step ranmake testwithout the race detector. Changed tomake test -raceso data races are caught in CI.c) Add
golangci-lintjob — added a dedicatedlintjob usinggolangci/golangci-lint-action@v6. This enforces consistent code style and catches common issues automatically on every push and PR.3.
internal/gateway/handlers.go— correct error code forGetNoDeletionProofGetNoDeletionProofreturnserrors.ErrUnsupportedfrom the service layer, but the handler was mapping it toInternalErrorCode(-32603), which signals a server crash to the caller.Fixed to check for
errors.ErrUnsupportedfirst and returnMethodNotFoundCode(-32601) with the message"get_no_deletion_proof is not yet implemented". This gives clients a clear, actionable signal rather than a misleading internal error.4.
internal/storage/mongodb/aggregator_record.go— bound theGetByBlockNumberqueryGetByBlockNumberissued an unboundedFindover all aggregator records for a block. A block with tens of thousands of records would load them all into memory at once.Added
options.Find().SetLimit(maxRecordsPerBlock)(capped at 10 000) to prevent unbounded memory growth. The constant is exported so callers can detect when the limit has been hit.5.
CHANGELOG.md— convertchanges.txtto standard formatchanges.txtcontained 797 lines of free-text changelog entries. Converted to a structuredCHANGELOG.mdfollowing the Keep a Changelog convention with dated sections,### Added / Changed / Fixedheadings, and concise bullet points. The originalchanges.txtis retained for historical reference.Files changed
go.modgo 1.25→go 1.24.github/workflows/ci.yml-race, addgolangci-lintjobinternal/gateway/handlers.goErrUnsupported→MethodNotFoundCodeinternal/storage/mongodb/aggregator_record.goSetLimit(10000)toGetByBlockNumberCHANGELOG.mdchanges.txtTesting
-raceflag addition will surface any existing data races in CI.golangci-lintjob runs withversion: latestand can be pinned once the baseline is established.GetByBlockNumberlimit is conservative (10 000) and can be raised or made configurable if needed.