Description
Background
As part of PR #181, we removed the test_cladetime_assign_clades_historical test from tests/integration/test_cladetime_integration.py (lines 59-93) due to Nextstrain's deletion of historical S3 data.
What the Test Was Testing
The removed test verified that CladeTime could:
- Initialize with a historical date (freeze_time("2024-10-30"))
- Retrieve historical reference tree metadata for clade assignments
- Assign clades using a historical reference tree (different from current tree)
- Verify metadata accuracy against known values from hub archives:
  - nextclade_dataset_version: "2024-10-17--16-48-48Z"
  - nextclade_version_num: "3.9.1"
  - assignment_as_of: "2024-10-30 00:00"
Test Code Reference
@pytest.mark.skipif(not docker_enabled, reason="Docker is not installed")
def test_cladetime_assign_clades_historical(tmp_path, demo_mode, patch_s3_for_tests):
    """
    Test clade assignment with historical date using real hub metadata.
    This test verifies that CladeTime can correctly retrieve and use historical
    metadata from variant-nowcast-hub archives. We use 2024-10-30 because we
    know the exact metadata that should be returned from the hub archive.
    """
    assignment_file = tmp_path / "assignments_historical.tsv"
    with freeze_time("2024-10-30"):
        ct = CladeTime()
        metadata_filtered = sequence.filter_metadata(
            ct.sequence_metadata,
            collection_min_date="2024-10-01"
        )
        # Assign clades using historical reference tree
        assigned_clades = ct.assign_clades(metadata_filtered, output_file=assignment_file)
        # Verify metadata reflects 2024-10-30 state
        assert assigned_clades.meta.get("sequence_as_of") == datetime(2024, 10, 30, tzinfo=timezone.utc)
        assert assigned_clades.meta.get("tree_as_of") == datetime(2024, 10, 30, tzinfo=timezone.utc)
        # Hard-code expected values from 2024-10-30 hub archive
        assert assigned_clades.meta.get("nextclade_dataset_version") == "2024-10-17--16-48-48Z"
        assert assigned_clades.meta.get("nextclade_version_num") == "3.9.1"
        assert assigned_clades.meta.get("assignment_as_of") == "2024-10-30 00:00"
Why It No Longer Works
Nextstrain S3 Data Deletion (October 2025)
Nextstrain implemented a ~7-week retention policy for ALL versioned S3 objects:
- Sequence data (sequences.fasta.zst): Only available back to 2025-09-29
- Genome metadata (metadata.tsv.zst): Only available back to 2025-09-29
- Reference tree metadata (metadata_version.json): Only available back to 2025-09-29 (but Hub fallback provides back to 2024-10-09)
The Problem
CladeTime requires three data sources to initialize:
- ✅ Reference tree metadata (available via Hub fallback to 2024-10-09)
- ❌ Sequence data (NOT available before 2025-09-29)
- ❌ Genome metadata (NOT available before 2025-09-29)
Without sequence data from S3, CladeTime cannot initialize for dates before 2025-09-29, even if reference tree metadata is available via Hub fallback.
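The availability gap above can be sketched as a small lookup. This is only an illustration: the constant and function names are hypothetical and not part of CladeTime, and the cutoff dates are the ones quoted in this issue, which may shift as retention changes.

```python
from datetime import date

# Assumed cutoffs, copied from the dates quoted in this issue (names are hypothetical).
S3_MIN_DATE = date(2025, 9, 29)            # earliest versioned S3 objects still retained
HUB_FALLBACK_MIN_DATE = date(2024, 10, 9)  # earliest reference tree metadata in Hub archives


def missing_sources(as_of: date) -> list[str]:
    """Return the data sources that cannot be retrieved for `as_of`."""
    missing = []
    # Reference tree metadata can come from S3 or, failing that, from the Hub archives.
    if as_of < min(S3_MIN_DATE, HUB_FALLBACK_MIN_DATE):
        missing.append("reference tree metadata")
    # Sequence data and genome metadata have no fallback.
    if as_of < S3_MIN_DATE:
        missing.extend(["sequence data", "genome metadata"])
    return missing


print(missing_sources(date(2024, 10, 30)))
```

For the removed test's date (2024-10-30), the tree metadata is covered by the Hub fallback, but the other two sources are still missing, which is why initialization fails.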
Investigation Details
See S3_AVAILABILITY_FINDINGS.md for comprehensive investigation results showing uniform 7-week retention across all file types.
Current Workaround
The test was temporarily kept working with the patch_s3_for_tests fixture, which mocks S3 responses. That mock does not reflect what S3 actually serves today, so the test was removed per review feedback to simplify the suite and avoid testing non-functional behavior.
What We Need
Short-term Solution (Implemented in PR #181)
- ✅ Add validation that raises clear errors for dates outside data availability
- ✅ Add negative tests verifying errors are thrown appropriately
- ✅ Document the limitation clearly in README and docstrings
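As a rough sketch of what the validation and a matching negative test might look like (not the code merged in PR #181): the constant, exception class, and function names below are hypothetical, and the cutoff is the 2025-09-29 date quoted above.

```python
from datetime import date

# Hypothetical constant; the project's actual min-date configuration may differ.
MIN_SEQUENCE_DATE = date(2025, 9, 29)


class DataUnavailableError(ValueError):
    """Hypothetical error type for dates outside data availability."""


def validate_sequence_as_of(as_of: date) -> date:
    """Return `as_of` unchanged, or raise a descriptive error if it predates available data."""
    if as_of < MIN_SEQUENCE_DATE:
        raise DataUnavailableError(
            f"Sequence data is not available for {as_of.isoformat()}; "
            f"the earliest supported date is {MIN_SEQUENCE_DATE.isoformat()}."
        )
    return as_of


def test_historical_date_is_rejected():
    # Negative test: the date used by the removed test must now fail loudly.
    try:
        validate_sequence_as_of(date(2024, 10, 30))
    except DataUnavailableError as err:
        assert "2025-09-29" in str(err)
    else:
        raise AssertionError("expected DataUnavailableError")


test_historical_date_is_rejected()
```

Subclassing ValueError is one possible design choice here; it keeps existing `except ValueError` handlers working while still allowing callers to catch the specific case.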
Long-term Solution (This Issue)
We need to restore test coverage for historical clade assignments. Possible approaches:
Option 1: Test with Recent Historical Dates
- Use dates within S3 retention window (e.g., 3-4 weeks ago)
- Tests real functionality but limited historical range
- Fragile: breaks if Nextstrain reduces retention below test age
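One way to make Option 1 less brittle is to compute the test date relative to "today" and refuse offsets that fall outside the assumed retention window. The helper below is a hypothetical sketch; the 7-week default reflects the retention described in this issue and is not guaranteed.

```python
from datetime import date, timedelta

RETENTION_WEEKS = 7  # assumed from the ~7-week window described above; may change


def recent_test_date(today: date, weeks_back: int = 3,
                     retention_weeks: int = RETENTION_WEEKS) -> date:
    """Pick a test date `weeks_back` weeks before `today`, inside the retention window."""
    if not 0 <= weeks_back < retention_weeks:
        raise ValueError(
            f"weeks_back={weeks_back} is outside the assumed {retention_weeks}-week retention window"
        )
    return today - timedelta(weeks=weeks_back)


print(recent_test_date(date(2025, 11, 21)))
```

If the retention window shrinks, lowering `RETENTION_WEEKS` makes an overly old offset fail at setup time instead of surfacing as a confusing S3 error.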
Option 2: Archive Representative Test Data
- Archive small datasets (100k demo size) at specific dates
- Store in repo or external service (Zenodo, institutional storage)
- Mock S3 to return archived data for specific test dates
- Provides stable, reproducible historical testing
- Requires infrastructure and maintenance
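A minimal sketch of the lookup side of Option 2, assuming archived snapshots are stored as files keyed by date. The directory layout, manifest, and function name are hypothetical; a test fixture could call something like this in place of the S3 request.

```python
from datetime import date
from pathlib import Path

# Hypothetical archive location and manifest: test dates mapped to snapshot files.
ARCHIVE_ROOT = Path("tests/data/archive")
ARCHIVED_SNAPSHOTS = {
    date(2024, 10, 30): {
        "sequences": ARCHIVE_ROOT / "2024-10-30" / "sequences.fasta.zst",
        "metadata": ARCHIVE_ROOT / "2024-10-30" / "metadata.tsv.zst",
    },
}


def resolve_archived(as_of: date, kind: str) -> Path:
    """Return the archived file for a known test date, or raise if none was archived."""
    try:
        return ARCHIVED_SNAPSHOTS[as_of][kind]
    except KeyError:
        raise LookupError(
            f"No archived {kind!r} snapshot for {as_of.isoformat()}"
        ) from None


print(resolve_archived(date(2024, 10, 30), "metadata"))
```

Raising for unknown dates, rather than silently falling back to live data, keeps it explicit which test dates are backed by an archive.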
Option 3: Integration Test Against Current Hub Archives
- Test dates known to exist in Hub archives (2024-10-09 onwards)
- Only works if we can also archive corresponding sequence data
- Most faithful to actual use case but requires data archiving
Option 4: Hybrid Approach
- Recent dates: Test real S3 data (within retention window)
- Historical dates: Test with archived representative datasets
- Document which scenarios are covered and limitations
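The hybrid routing could be expressed as a single decision function, sketched here under the assumption of a 7-week retention window and a known set of archived dates. All names are hypothetical.

```python
from datetime import date, timedelta
from enum import Enum


class Source(Enum):
    LIVE_S3 = "live_s3"          # date is inside the assumed retention window
    ARCHIVE = "archive"          # date has an archived representative dataset
    UNSUPPORTED = "unsupported"  # neither applies; tests should expect an error


def choose_source(as_of: date, today: date, archived_dates: set[date],
                  retention: timedelta = timedelta(weeks=7)) -> Source:
    """Decide which data source a test for `as_of` should exercise."""
    if today - retention <= as_of <= today:
        return Source.LIVE_S3
    if as_of in archived_dates:
        return Source.ARCHIVE
    return Source.UNSUPPORTED


today = date(2025, 11, 21)
archived = {date(2024, 10, 30)}
for d in (date(2025, 10, 31), date(2024, 10, 30), date(2024, 12, 1)):
    print(d, choose_source(d, today, archived).value)
```

Enumerating the three outcomes also gives a natural place to document which scenarios are covered and which are known gaps.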
Recommendation
We recommend Option 4 (Hybrid Approach):
- Immediate (in PR #181): Add a test using a date ~3 weeks ago (within S3 retention)
- Near-term: Archive 100k demo datasets at key dates (monthly snapshots)
- Long-term: Build test infrastructure to use archived data for historical tests
Related Files
- tests/integration/test_cladetime_integration.py:59-93 - Removed test
- src/cladetime/cladetime.py - CladeTime initialization with date validation
- src/cladetime/util/config.py - Configuration with min date constants
- S3_AVAILABILITY_FINDINGS.md - Investigation documenting S3 retention policy
Additional Context
This is part of addressing Nextstrain's infrastructure changes that fundamentally altered historical data availability. The Hub fallback mechanism (PR #181) is a partial solution that enables specific use cases (variant-nowcast-hub workflows) but does not fully restore historical analysis capabilities.
Important Note: Given uncertainty around Nextstrain's data pipeline longevity and retention policies, any solution should be designed with flexibility to adapt to future infrastructure changes. We cannot assume the current 7-week retention window is permanent, nor can we assume it won't be further reduced.
References
- PR #181: Add fallback to variant-nowcast-hub archives for historical metadata
- S3_AVAILABILITY_FINDINGS.md: Comprehensive S3 retention investigation
- Nick Reich's review (2025-11-21): PR #181 (review)