
Hyperscaled backup rest api call #696

Open
brokespace wants to merge 115 commits into feat/hyperscaled from feat/hyperscaled_backup_call

Conversation

@brokespace
Collaborator

Taoshi Pull Request

Description

This adds a secondary service that runs alongside the primary Hyperliquid tracker and periodically calls the REST API to fetch recent trades on all accounts, so that if there's a hiccup in tracking an account, the missed activity is still measured.

Hitting the rate limit on the REST API endpoint does not affect the WebSocket connection, so the two can share the same IPs.

Related Issues (JIRA)

[Reference any related issues or tasks that this pull request addresses or closes.]

Checklist

  • I have tested my changes on testnet.
  • I have updated any necessary documentation.
  • I have added unit tests for my changes (if applicable).
  • If there are breaking changes for validators, I have (or will) notify the community in Discord of the release.

Reviewer Instructions

[Provide any specific instructions or areas you would like the reviewer to focus on.]

Definition of Done

  • Code has been reviewed.
  • All checks and tests pass.
  • Documentation is up to date.
  • Approved by at least one reviewer.

Checklist (for the reviewer)

  • Code follows project conventions.
  • Code is well-documented.
  • Changes are necessary and align with the project's goals.
  • No breaking changes introduced.

Optional: Deploy Notes

[Any instructions or notes related to deployment, if applicable.]

/cc @mention_reviewer

ward-taoshi and others added 30 commits March 3, 2026 14:08
* Simplify order sending for miners

* Display deprecated message on run receive signals server.

* Fix miner integration test

* Update docstring
@github-actions

github-actions bot commented Mar 12, 2026

🤖 Claude AI Code Review

Last reviewed on: 14:23:47

Summary

This PR adds a backup REST API polling mechanism to complement the primary WebSocket-based Hyperliquid tracker. The backup system periodically polls for missed fills using the REST API with exponential backoff for unhealthy proxy ports. It includes comprehensive test coverage and enhanced port health monitoring.


✅ Strengths

  1. Excellent Test Coverage: The PR includes 788 new lines of comprehensive unit tests covering edge cases, failure scenarios, and async behavior.

  2. Well-Structured Health Tracking: The _PortHealthRecord class uses __slots__ for memory efficiency and provides clean separation of concerns for port health state.

  3. Graceful Degradation: Falls back to direct connection when all proxy ports are unhealthy, maintaining service availability.

  4. Idempotent Watermarking: The backup poll system correctly prevents duplicate processing and only advances watermarks on successful polls.

  5. Proper Resource Cleanup: The orchestrator correctly cancels the backup task on shutdown and closes sessions after use.

  6. Backward Compatibility: The _unhealthy_ports property maintains API compatibility while transitioning to the new health record system.


⚠️ Concerns

CRITICAL

  1. Race Condition in Orchestrator Cleanup (lines 597-602)

    for _ in range(int(self.ADDRESS_REFRESH_INTERVAL_S)):
        if self._stop_event.is_set():
            break
        await asyncio.sleep(1.0)

    The loop breaks but doesn't execute cleanup in the finally block immediately. If stop is signaled during this sleep loop, cleanup is delayed up to ADDRESS_REFRESH_INTERVAL_S seconds.

  2. Missing Rate Limiting Enforcement (lines 1019-1041)

    min_delay_s = 60.0 / ValiConfig.HL_BACKUP_POLL_RATE_BUDGET

    While min_delay_s is calculated, there's no enforcement if processing a fill takes longer than this delay. Multiple addresses polled sequentially could exceed the 600 req/min budget.

  3. Thread Safety Concerns

    • _last_poll_time_ms dict is modified in async context without locks (lines 663, 1063)
    • _backup_fills_caught and _backup_polls_total counters are not atomic
    • While Python's GIL provides some protection, async tasks can interleave, potentially causing inconsistent state
  4. Unhandled Session Leaks on Exception (line 897)

    except Exception as e:
        bt.logging.error(f"[HL_TRACKER] REST error fetching account state for {hl_address}: {e}")
        self._report_rest_proxy_failure(proxy_port)
        return None
    finally:
        session.close()

    If an exception occurs during session creation in _make_proxied_session(), the session might not be assigned, and session.close() could fail.
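One way to address Critical #1 is to replace the per-second polling loop with a single interruptible wait on the stop event, so shutdown proceeds immediately rather than up to a second (or, if the check were missed, the full refresh interval) later. A minimal sketch, assuming an `asyncio.Event` stop flag; the names `wait_or_stop`, `stop_event`, and `refresh_interval_s` are illustrative, not taken from the PR:

```python
import asyncio

async def wait_or_stop(stop_event: asyncio.Event, refresh_interval_s: float) -> bool:
    """Wait up to refresh_interval_s; return True if stop was signaled first."""
    try:
        # Wakes as soon as stop_event is set, instead of polling once a second.
        await asyncio.wait_for(stop_event.wait(), timeout=refresh_interval_s)
        return True  # stop was signaled before the timeout elapsed
    except asyncio.TimeoutError:
        return False  # full interval elapsed with no stop signal
```

The caller's loop then becomes `if await wait_or_stop(...): break`, letting the `finally` cleanup run immediately on shutdown.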

MAJOR

  1. Infinite Cooldown Possible (line 439)

    ref_time = rec.last_probe_time or rec.unhealthy_since or 0

    If both last_probe_time and unhealthy_since are somehow None, using 0 could cause immediate probe retries in a tight loop.

  2. Memory Leak in Hash Tracking (line 1053)

    self._record_hash(fill_hash)

    _processed_hashes set grows unbounded. While fills have dedup, there's no TTL or size limit. Over time, this could consume significant memory.

  3. Blocking I/O in Async Context (lines 989-996)

    def _do_request():
        try:
            resp = session.post(api_url, json=payload, timeout=10)

    Using run_in_executor(None, ...) with the default thread pool executor is acceptable, but could be problematic under high load. Consider using an async HTTP client like aiohttp.

  4. Missing Validation on Backup Fill Processing (lines 1048-1059)
    The backup poll doesn't validate fill timestamps or check if fills are too old. Could process ancient fills if watermark is corrupted.
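For Major #2 (unbounded `_processed_hashes`), a bounded insertion-ordered set caps memory by evicting the oldest hashes once a size limit is reached. A sketch under the assumption that approximate LRU semantics are acceptable for dedup; the class name and `max_size` default are illustrative:

```python
from collections import OrderedDict

class BoundedHashSet:
    """Dedup set that evicts its oldest entries once max_size is exceeded."""

    def __init__(self, max_size: int = 100_000):
        self._max_size = max_size
        self._items: OrderedDict = OrderedDict()

    def add(self, h: str) -> None:
        self._items[h] = None
        self._items.move_to_end(h)  # treat re-adds as fresh
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the oldest hash

    def __contains__(self, h: str) -> bool:
        return h in self._items
```

A TTL-based variant (storing insertion timestamps and pruning on access) would work equally well if fills older than the watermark window can never recur.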


💡 Suggestions

Code Quality

  1. Extract Magic Numbers to Constants (line 1008)

    await asyncio.sleep(10.0)  # Let WS connections establish first

    Define HL_BACKUP_POLL_STARTUP_DELAY_S = 10.0 in ValiConfig.

  2. Simplify Proxy Port Collection (lines 919-926)

    all_ports = list(self._available_ports)
    for shard in self._shards.values():
        if shard.healthy and shard.port is not None:
            all_ports.append(shard.port)
    all_ports = sorted(set(all_ports))

    Could be:

    all_ports = set(self._available_ports)
    all_ports.update(s.port for s in self._shards.values() 
                     if s.healthy and s.port is not None)
  3. Add Type Hints (line 381)

    self._port_health: Dict[int, _PortHealthRecord] = {}

    Good! But _PortHealthRecord class itself should have return type hints on methods.

  4. Consolidate Logging Levels (lines 1049-1055)
    Mix of bt.logging.info and bt.logging.debug. Establish consistent criteria for when to use each level.

  5. Extract Fill Hash Logic (line 1042)

    fill_hash = fill.get("hash") or fill.get("tid")

    This pattern appears twice (lines 829, 1042). Extract to helper method _get_fill_hash(fill: dict) -> Optional[str].

Performance

  1. Optimize Port Health Status Building (lines 440-453)
    Creates a new list on every get_status() call. Consider caching with TTL if status is polled frequently.

  2. Batch REST Calls (lines 1019-1041)
    Currently polls addresses sequentially. If HL API supports batch requests, could significantly reduce latency.

  3. Lazy Initialization of Backup Poll (line 580)
    The backup task starts immediately in orchestrator. Consider delaying until first address is tracked to avoid unnecessary polling.
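The rate-budget enforcement missing in Critical #2 can be combined with the sequential polling above: measure how long each poll actually took and sleep only for the remainder of `min_delay_s`, so the loop never exceeds the configured request budget. A hedged sketch; `poll_fn` and `rate_budget_per_min` are illustrative stand-ins for the PR's poll routine and `ValiConfig.HL_BACKUP_POLL_RATE_BUDGET`:

```python
import asyncio
import time

async def poll_with_budget(addresses, poll_fn, rate_budget_per_min: float) -> None:
    """Poll each address sequentially, pacing to stay under the rate budget."""
    min_delay_s = 60.0 / rate_budget_per_min
    for addr in addresses:
        started = time.monotonic()
        await poll_fn(addr)
        elapsed = time.monotonic() - started
        if elapsed < min_delay_s:
            # Slow polls consume their own budget; fast polls absorb the gap.
            await asyncio.sleep(min_delay_s - elapsed)
```

This guarantees at most `rate_budget_per_min` requests per minute regardless of how long individual fills take to process.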

Maintainability

  1. Add Circuit Breaker Pattern
    If backup polls consistently fail for an address, consider temporarily blacklisting it to avoid wasting resources.

  2. Metrics Enhancement
    Track additional metrics:

    • Average backup poll latency per address
    • Miss rate (fills caught by backup vs total fills)
    • REST API error types breakdown

🔒 Security Notes

  1. Proxy Credential Exposure (lines 936, 776)

    proxy_url = self._make_shard_proxy_url(port)

    Ensure proxy URLs containing credentials are never logged. Review all logging statements that might output proxy_url.

  2. Input Validation Missing

    • _fetch_fills_by_time doesn't validate the structure of returned fills before processing (line 996)
    • Malformed API responses could cause exceptions or unexpected behavior
    • Add schema validation for critical fields: hash, coin, side, sz, px
  3. Timeout Configuration (line 992)

    resp = session.post(api_url, json=payload, timeout=10)

    10-second timeout seems reasonable, but should be configurable via ValiConfig for operational flexibility.

  4. DOS Vector via Watermark Reset
    If an attacker can manipulate _last_poll_time_ms, they could force re-processing of old fills. Ensure this state is protected and persisted securely.

  5. Rate Limit Bypass Risk
The PR claims "Hitting the rate limit on the REST API endpoint does not affect the WebSocket connection." However, there's no handling for 429 responses. Add explicit rate limit detection and backoff.
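Explicit 429 handling can be as small as a backoff-delay helper: honor `Retry-After` when present, otherwise fall back to jittered exponential backoff. A sketch assuming a `requests.Response`-style status code and header value; `retry_delay_s` is a hypothetical helper name:

```python
import random

def retry_delay_s(status_code: int, retry_after_header, attempt: int) -> float:
    """Return seconds to back off before retrying, or 0.0 for non-429 responses."""
    if status_code != 429:
        return 0.0
    if retry_after_header is not None:
        try:
            # Retry-After may carry a delay in seconds; trust it when parseable.
            return max(0.0, float(retry_after_header))
        except (TypeError, ValueError):
            pass  # malformed header: fall through to exponential backoff
    base = min(60.0, 2.0 ** attempt)  # cap the backoff at one minute
    return base + random.uniform(0.0, base / 4.0)  # jitter to avoid thundering herd
```

The caller would sleep for the returned delay and report the port as degraded rather than failed, since 429 is throttling, not an outage.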


📋 Additional Observations

Testing

  1. Missing Integration Tests: All tests are unit tests with mocked dependencies. Need at least one integration test hitting a real (or containerized mock) HL API.

  2. Race Condition Tests Missing: No tests verify thread safety of shared counters and dictionaries under concurrent access.

  3. Test Naming: Excellent descriptive test names. Consider grouping related tests into nested test classes for better organization.

Documentation

  1. Missing Docstring (line 56): _PortHealthRecord class needs a module-level docstring explaining the exponential backoff strategy.

  2. Config Documentation (lines 545-554): Add comments explaining why specific thresholds were chosen (e.g., why 600 req/min instead of closer to 1200 limit).

  3. Architecture Decision: Consider adding an ADR (Architecture Decision Record) documenting why REST backup was chosen over WebSocket redundancy or other alternatives.

Dependency Management

  1. Duplicate SOCKS Libraries (requirements.txt)
    +python-socks>=2.7.1
    +PySocks>=1.7.1
    Both python-socks and PySocks are added. The code uses requests.Session().proxies which works with PySocks. Verify if both are needed or if one can be removed.

🎯 Recommendation

APPROVE WITH REQUIRED CHANGES

This is solid engineering work with excellent test coverage and thoughtful design. However, the critical issues around race conditions, rate limiting, and memory leaks must be addressed before merge.

Required Before Merge:

  1. Fix race condition in orchestrator cleanup (break → return)
  2. Add rate limiting enforcement for backup polls
  3. Add bounded size to _processed_hashes (LRU cache or TTL)
  4. Add explicit session leak protection with context managers
  5. Validate fill structure before processing
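Required change #4 (session leak protection) can be sketched with `contextlib.closing`, which guarantees `close()` on every exit path once the session object exists; `make_proxied_session` and `do_request` below are illustrative stand-ins for the PR's `_make_proxied_session` and request logic:

```python
import contextlib

def fetch_with_session(make_proxied_session, do_request):
    """Run do_request with a freshly built session, closing it on any exit path."""
    with contextlib.closing(make_proxied_session()) as session:
        # If do_request raises, closing() still calls session.close();
        # if make_proxied_session itself raises, no half-built session leaks.
        return do_request(session)
```

This removes the failure mode where an exception during session construction leaves `session` unassigned and the bare `finally: session.close()` raises.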

Strongly Recommended:

  1. Add integration test with real API
  2. Document proxy credential handling in security guidelines
  3. Add 429 rate limit handling
  4. Review and consolidate SOCKS dependencies

Nice to Have:

  1. Convert to async HTTP client (aiohttp)
  2. Add circuit breaker for problematic addresses
  3. Extract common fill hash logic
  4. Add more observability metrics

Great work overall! The backup mechanism significantly improves reliability. 🚀

@brokespace brokespace force-pushed the feat/hyperscaled branch 2 times, most recently from 48461d5 to 3b5630d on March 13, 2026 19:41
@sli-tao sli-tao force-pushed the feat/hyperscaled branch 2 times, most recently from c2d55af to 595b86c on April 2, 2026 11:07
@sli-tao sli-tao force-pushed the feat/hyperscaled branch 2 times, most recently from b077a4b to be3cb98 on April 3, 2026 08:15