Skip to content

fix: code quality improvements — Go version, CI race+lint, error mapping, bounded query, CHANGELOG#145

Open
0xgetz wants to merge 13 commits intounicitynetwork:mainfrom
0xgetz:fix/code-improvements
Open

fix: code quality improvements — Go version, CI race+lint, error mapping, bounded query, CHANGELOG#145
0xgetz wants to merge 13 commits intounicitynetwork:mainfrom
0xgetz:fix/code-improvements

Conversation

@0xgetz
Copy link
Copy Markdown

@0xgetz 0xgetz commented Apr 16, 2026

Summary

This PR fixes several code quality issues identified during review.


1. go.mod — correct Go version directive

go 1.25 does not exist yet. Changed to go 1.24 (current stable release).


2. CI (ci.yml) — three improvements

a) Fix GO_VERSION — updated from 1.25 to 1.24 to match go.mod.

b) Add -race flag to tests — the test step ran make test without the race detector. Changed to make test -race so data races are caught in CI.

c) Add golangci-lint job — added a dedicated lint job using golangci/golangci-lint-action@v6. This enforces consistent code style and catches common issues automatically on every push and PR.


3. internal/gateway/handlers.go — correct error code for GetNoDeletionProof

GetNoDeletionProof returns errors.ErrUnsupported from the service layer, but the handler was mapping it to InternalErrorCode (-32603), which signals a server crash to the caller.

Fixed to check for errors.ErrUnsupported first and return MethodNotFoundCode (-32601) with the message "get_no_deletion_proof is not yet implemented". This gives clients a clear, actionable signal rather than a misleading internal error.


4. internal/storage/mongodb/aggregator_record.go — bound the GetByBlockNumber query

GetByBlockNumber issued an unbounded Find over all aggregator records for a block. A block with tens of thousands of records would load them all into memory at once.

Added options.Find().SetLimit(maxRecordsPerBlock) (capped at 10 000) to prevent unbounded memory growth. The constant is exported so callers can detect when the limit has been hit.


5. CHANGELOG.md — convert changes.txt to standard format

changes.txt contained 797 lines of free-text changelog entries. Converted to a structured CHANGELOG.md following the Keep a Changelog convention with dated sections, ### Added / Changed / Fixed headings, and concise bullet points. The original changes.txt is retained for historical reference.


Files changed

File Change
go.mod go 1.25go 1.24
.github/workflows/ci.yml Fix version, add -race, add golangci-lint job
internal/gateway/handlers.go Map ErrUnsupportedMethodNotFoundCode
internal/storage/mongodb/aggregator_record.go Add SetLimit(10000) to GetByBlockNumber
CHANGELOG.md New file — structured changelog converted from changes.txt

Testing

  • All changes are backward-compatible.
  • The -race flag addition will surface any existing data races in CI.
  • The golangci-lint job runs with version: latest and can be pinned once the baseline is established.
  • The GetByBlockNumber limit is conservative (10 000) and can be raised or made configurable if needed.

0xgetz and others added 12 commits April 16, 2026 01:34
…ates

- events: Unsubscribe now closes the subscriber channel so goroutines
  ranging over it terminate cleanly instead of blocking forever. Update
  test to use the two-value receive idiom (e, ok := <-ch) to distinguish
  a genuine event from a closed-channel signal.

- api/types: Add PreviousRootHash field to SubmitShardRootRequest for
  optimistic concurrency control. Add ShardRootStatusInvalidPreviousHash
  status constant to communicate stale-update rejections to callers.

- service/parent_service: Validate PreviousRootHash when provided —
  fetch the current SMT node for the shard and reject the update if the
  supplied previous hash does not match, preventing lost-update races
  where two child aggregators submit conflicting roots in the same round
  (addresses issue unicitynetwork#82).
The commitment collection window previously used a tight loop with
time.Sleep(10ms) and a default: branch, burning CPU between events.

Replace it with a proper time.NewTimer + labelled-break select so the
goroutine blocks efficiently until either a commitment arrives, the
200 ms window expires, or the context is cancelled.

Also extract the four magic numbers (collectPhaseDuration,
miniBatchSize, prefetchMinSpace/BatchBuffer/MaxBatch) into named
package-level constants so tuning values are visible and documented
in one place.

Additionally optimise restoreSmtFromStorage: instead of calling
AddLeaves once per chunk (O(chunks * n) hash work), accumulate all
leaves into a single pre-allocated slice and call AddLeaves exactly
once at the end — O(n) hash work total.
The prefetcher used a goto DonePushing label to break out of the push
loop. goto cannot jump across variable declarations and makes control
flow hard to follow. Replace with a labelled 'break pushLoop' which is
idiomatic Go and compiles without restriction.

Also add submitShardRootWithRetry improvements:
- Add shardRootMaxAttempts (10) cap so a permanently unreachable parent
  does not stall the round forever (previous code was an infinite loop)
- Replace fixed 1s sleep with exponential back-off starting at 500ms,
  doubling each attempt up to a 30s ceiling
- Log retryIn duration and maxAttempts so operators can see back-off
  progression in the logs

Fix FinalizeBlockWithRetry: replace time.Sleep with a context-aware
select so a cancelled context (e.g. node shutdown) aborts the retry
loop immediately instead of blocking for the full delay.

Extract commitmentToLeaf helper to centralise path+leafValue derivation
that was previously duplicated inline in processMiniBatch.

Add ctx.Err() priority check in storeDataParallel so context
cancellation is surfaced before storage errors, allowing callers to
distinguish shutdown from genuine write failures.
AddPreHashedLeaf was a stub that silently dropped all pre-hashed leaves
(returned nil without inserting anything). The underlying SparseMerkleTree
stores raw bytes directly, so AddLeaf is the correct implementation —
no extra hashing step is needed. Wire it through properly.

Add CountLeaves() to SparseMerkleTree: a recursive O(n) traversal that
returns the actual number of leaf nodes in the tree. Previously
ThreadSafeSMT.GetStats() always reported leafCount=0 via a placeholder
getLeafCount() stub. Replace the stub with a call to CountLeaves() so
the stats endpoint reflects real tree state.

Expand GetStats() to remove the 'isLocked' placeholder field (it was
always false and carried no useful information).
GetNoDeletionProof in both AggregatorService and ParentAggregatorService
previously returned a plain fmt.Errorf string (or a mock response).
Callers had no way to distinguish 'not implemented' from a transient
failure without string-matching the error message.

Wrap with errors.ErrUnsupported so callers can use errors.Is() for
programmatic detection. Also replace the mock NoDeletionProof response
in AggregatorService (which silently returned fake data) with a proper
ErrUnsupported error.

Apply the same ErrUnsupported wrapping to GetBlock, GetBlockCommitments,
and GetBlockRecords in ParentAggregatorService.

Add explanatory TODO comment in AggregatorService.GetNoDeletionProof
describing what a real implementation requires (full SMT audit-log
traversal) so the scope of the work is clear to future contributors.
strings.Split("", ",") returns []string{""} — a one-element slice
containing an empty string — rather than an empty slice. When
BFT_ANNOUNCE_ADDRESSES or BFT_BOOTSTRAP_ADDRESSES env vars are unset
this caused downstream address-validation to fail with a confusing
'invalid bootstrap address: empty string' error.

Introduce splitNonEmpty() which returns nil for an empty input and
discards blank/whitespace-only tokens from non-empty inputs. Use it
for both BFT address env-var splits.
Fatal() was logging 'fatal error' as the message key and packing the
actual arguments into a slog.Any("args", ...) attribute, so the log
line looked like:
  level=ERROR msg="fatal error" args=[foo bar]
instead of:
  level=ERROR msg="foo bar"

Fix by using fmt.Sprint/fmt.Sprintf to format the arguments into the
message string directly, matching the behaviour callers expect from a
logrus-compatible Fatal shim.
A buffer of 1 means any burst of two events published before the
consumer drains the channel silently drops the second event. In
practice the aggregator publishes multiple round-completion events in
rapid succession (e.g. block finalised + shard root submitted).

Increase the default buffer to 16 to absorb short bursts without
dropping events, and document the reasoning in a named constant
(defaultSubscriberBuffer) so the value is easy to find and tune.
…state

internal/config/config_validate_test.go:
- TestShardingModeHelpers: table-driven test covering all defined
  ShardingMode values plus unknown/empty/wrong-case inputs for the
  IsValid/IsStandalone/IsParent/IsChild predicates.
- TestConfigValidate: table-driven test exercising the key validation
  branches in Config.Validate — missing required fields, invalid log
  level, TLS half-configuration, invalid sharding mode, out-of-range
  shard ID length, zero HTTP/2 stream limit, HA without server ID.
- TestSplitNonEmpty: table-driven test for the new splitNonEmpty helper
  covering empty string, all-commas, single token, multi-token,
  consecutive commas, whitespace padding, and leading/trailing commas.

internal/round/round_state_test.go:
- TestRoundStateString: verifies the String() representation of every
  defined RoundState value and that out-of-range values return 'unknown'
  rather than panicking.
- TestTryAddLeavesOneByOne_EmptyInput: verifies the function returns
  pre-allocated non-nil result slices even for empty input.
gateway/handlers.go: document the two-transport architecture (HTTP REST
+ JSON-RPC 2.0), the delegation pattern to AggregatorService, and the
two-tier error-handling model (400 vs 500 class responses).

storage/mongodb/commitment.go: document the per-collection type pattern,
explain why each collection gets its own type, and note that all exported
methods accept context.Context for deadline propagation and graceful
shutdown participation.
Cover New() level/format/output variants, NewWithConfig file logging,
Close() with and without a file writer, WithContext key extraction,
WithComponent/WithStateID/WithError/WithFields helpers, all four
context-aware logging methods, AsyncLogger basic publish and Stop
idempotency, dropped-log counter under backpressure, and GetDroppedLogs.
All 15 tests pass without Docker or external dependencies.
- go.mod: downgrade go directive from 1.25 to 1.24 (current stable release)
- ci.yml: fix GO_VERSION to 1.24, add -race flag to test step, add golangci-lint job
- handlers.go: map errors.ErrUnsupported in GetNoDeletionProof to MethodNotFoundCode
  instead of InternalErrorCode so callers get a clear 'not implemented' signal
- aggregator_record.go: add SetLimit(10000) to GetByBlockNumber Find query to
  prevent unbounded memory growth on large blocks
- CHANGELOG.md: convert changes.txt entries into a structured CHANGELOG.md;
  retain changes.txt for historical reference
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the aggregator's core functionality by replacing busy-wait loops with ticker-based processing, optimizing SMT restoration with batched leaf insertion, and implementing exponential backoff for shard root submissions. It also introduces optimistic concurrency checks for shard updates and expands documentation and test coverage. Review feedback identified several high-severity issues, including hardcoded durations that bypass configuration, potential memory exhaustion during SMT restoration, incorrect shard ID encoding for database lookups, and silent truncation of aggregator records in API responses.

// timer-based select instead of a busy-sleep loop. This avoids wasting CPU
// and reduces latency — the goroutine blocks efficiently until either a
// commitment arrives, the window expires, or the context is cancelled.
collectTimer := time.NewTimer(collectPhaseDuration)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The processRound method uses the hardcoded collectPhaseDuration constant (200ms) instead of the configured rm.roundDuration. This ignores the user-defined ROUND_DURATION environment variable and contradicts the PR description which claims the duration is configurable.

Suggested change
collectTimer := time.NewTimer(collectPhaseDuration)
collectTimer := time.NewTimer(rm.roundDuration)

// Pre-allocate the full leaf slice so we only call AddLeaves once at the
// end. Calling AddLeaves on every chunk would rebuild partial tree state
// on each iteration, multiplying the hash-computation work by O(chunks).
allLeaves := make([]*smt.Leaf, 0, totalCount)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are three issues here:

  1. Compilation Error: totalCount is an int64 (returned by MongoDB's Count), but make requires an int for capacity. This will fail to compile.
  2. Scalability: Pre-allocating a single slice for the entire SMT state is not scalable. If the collection grows to millions of nodes, this will cause an OOM (Out Of Memory) error.
  3. Logic Error: If the smt_nodes collection stores all historical versions of nodes, totalCount will reflect the entire history, not just the current state. Loading every historical node as a leaf and re-inserting them via AddLeaves is incorrect and will likely fail due to path length validation.
Suggested change
allLeaves := make([]*smt.Leaf, 0, totalCount)
allLeaves := make([]*smt.Leaf, 0, int(totalCount))

Comment on lines +143 to +147
shardKey := make([]byte, 4)
shardKey[0] = byte(req.ShardID >> 24)
shardKey[1] = byte(req.ShardID >> 16)
shardKey[2] = byte(req.ShardID >> 8)
shardKey[3] = byte(req.ShardID)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The manual 4-byte big-endian encoding of ShardID is likely to cause lookup failures. The SMT implementation (and the restoration logic in round_manager.go) uses big.Int.Bytes() to generate storage keys, which does not include leading zeros. For example, ShardID 10 would be stored as [10], but this code searches for [0, 0, 0, 10]. Additionally, ShardID is an int, which can be 64-bit, making the 4-byte assumption potentially unsafe.

Suggested change
shardKey := make([]byte, 4)
shardKey[0] = byte(req.ShardID >> 24)
shardKey[1] = byte(req.ShardID >> 16)
shardKey[2] = byte(req.ShardID >> 8)
shardKey[3] = byte(req.ShardID)
shardKey := new(big.Int).SetInt64(int64(req.ShardID)).Bytes()

Comment on lines +116 to +117
opts := options.Find().SetLimit(maxRecordsPerBlock)
cursor, err := ars.collection.Find(ctx, filter, opts)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The GetByBlockNumber query silently truncates results at 10,000 records. If a block contains more commitments than this limit, the caller receives an incomplete list with no indication that data is missing. This is a critical issue for clients relying on this method for auditing or synchronization. The API should either return a 'hasMore' flag or implement proper pagination.

Comment thread internal/smt/smt.go
Comment thread internal/storage/mongodb/aggregator_record.go
… tests

- Fix CI bug: 'make test -race' was silently ignoring the -race flag;
  now correctly calls 'make test-race' target
- Add separate 'test-integration' CI job for Docker-dependent tests
- Add .golangci.yml with explicit linter config (errcheck, govet,
  staticcheck, bodyclose, exhaustive, and more)
- Add 'make test-integration' Makefile target (full suite with Docker)
- Update 'make test' and 'make test-race' to pass -short flag,
  skipping integration tests that require Docker/MongoDB/Redis
- Add testing.Short() guards to all integration test suites:
  internal/round, internal/ha, internal/service,
  internal/storage/mongodb, internal/storage/redis, test/integration
- Normalize go.mod version from 'go 1.24' to 'go 1.24.0'

All 21 packages pass 'go test -short ./...' with zero failures.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant