
Conversation

@mitchhs12 (Contributor) commented Jan 7, 2026

Summary

  • Added GET /datasets/{namespace}/{name}/versions/{revision}/sync-progress endpoint.
  • Returns per-table sync progress including current_block, start_block, job_status, and file stats.
  • Uses TableSnapshot::synced_range() with canonical_chain logic to accurately report sync progress, handling gaps and reorgs.

Tests

  • Endpoint returns correct structure for valid dataset
  • Returns 404 for non-existent dataset
  • Verifies RUNNING status while job is actively syncing
  • Verifies COMPLETED status when end block is reached
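
A minimal sketch of what the 404 test might look like, assuming an axum Router exercised with tower's oneshot helper and a hypothetical build_app() constructor; the repo's actual test harness may differ:

```rust
use axum::{
    body::Body,
    http::{Request, StatusCode},
};
use tower::ServiceExt; // provides `oneshot`

#[tokio::test]
async fn returns_404_for_missing_dataset() {
    // `build_app()` is a placeholder for however the test suite constructs the API router.
    let app = build_app().await;

    let response = app
        .oneshot(
            Request::builder()
                .uri("/datasets/ethereum/does-not-exist/versions/0.0.0/sync-progress")
                .body(Body::empty())
                .unwrap(),
        )
        .await
        .unwrap();

    assert_eq!(response.status(), StatusCode::NOT_FOUND);
}
```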

Response format:

{
  "dataset_namespace": "ethereum",
  "dataset_name": "mainnet",
  "revision": "0.0.0",
  "manifest_hash": "2dbf16e8a4d1c526e3893341d1945040d51ea1b68d1c420e402be59b0646fcfa",
  "tables": [
    {
      "table_name": "blocks",
      "current_block": 950000,
      "start_block": 0,
      "job_id": 1,
      "job_status": "RUNNING",
      "files_count": 47,
      "total_size_bytes": 2147483648
    }
  ]
}
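
A rough sketch of Rust types and route wiring that would serialize to the shape above, assuming axum and serde. Everything beyond the field names shown in the JSON (the handler skeleton, the router setup, the placeholder values) is an assumption; the real handler would populate tables from TableSnapshot::synced_range() over the canonical chain.

```rust
use axum::{extract::Path, routing::get, Json, Router};
use serde::Serialize;

#[derive(Serialize)]
struct SyncProgressResponse {
    dataset_namespace: String,
    dataset_name: String,
    revision: String,
    manifest_hash: String,
    tables: Vec<TableSyncProgress>,
}

#[derive(Serialize)]
struct TableSyncProgress {
    table_name: String,
    current_block: u64,
    start_block: u64,
    job_id: u64,
    job_status: String, // e.g. "RUNNING" or "COMPLETED"
    files_count: u64,
    total_size_bytes: u64,
}

// Skeleton only: the real handler resolves the dataset revision, walks its tables,
// and derives current_block from TableSnapshot::synced_range().
async fn get_sync_progress(
    Path((namespace, name, revision)): Path<(String, String, String)>,
) -> Json<SyncProgressResponse> {
    Json(SyncProgressResponse {
        dataset_namespace: namespace,
        dataset_name: name,
        revision,
        manifest_hash: String::new(), // placeholder
        tables: Vec::new(),           // placeholder
    })
}

fn routes() -> Router {
    // `{param}` captures assume axum 0.8-style path syntax.
    Router::new().route(
        "/datasets/{namespace}/{name}/versions/{revision}/sync-progress",
        get(get_sync_progress),
    )
}
```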

@mitchhs12 requested review from LNSD and leoyvens January 7, 2026 23:00
@mitchhs12 self-assigned this Jan 7, 2026
@mitchhs12 linked an issue Jan 7, 2026 that may be closed by this pull request
Contributor

Do we want to do this to get the sync progress? I find this ad hoc and not integrated with the Amp "data lake" (or a data store).

@mitchhs12 (Contributor, Author) Jan 8, 2026

Not sure. This approach does couple us quite tightly, though...

The problem with a query-based approach is that the query itself might take a while to execute. For example, SELECT MAX(block_number) FROM eth_rpc.logs will be pretty slow if there are a lot of files.

Do you think it would be better to update the worker to save its progress whenever it finishes writing a file? Then we could just query the new column (i.e., update the jobs table in the database).
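
For reference, that alternative might look roughly like the sketch below, assuming sqlx with Postgres and a hypothetical last_synced_block column on the jobs table; neither is part of this PR.

```rust
use sqlx::PgPool;

// Hypothetical hook the worker could call after flushing each file, so that reading
// progress becomes a cheap lookup on `jobs` instead of a MAX() scan over many files.
async fn record_progress(pool: &PgPool, job_id: i64, last_block: i64) -> sqlx::Result<()> {
    sqlx::query("UPDATE jobs SET last_synced_block = $1 WHERE id = $2")
        .bind(last_block)
        .bind(job_id)
        .execute(pool)
        .await?;
    Ok(())
}
```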

…p + reorg handling via canonical_chain logic

Development

Successfully merging this pull request may close these issues.

Job progress reporting

3 participants