Document and restore test coverage for historical clade assignments #185

@trobacker

Background

As part of PR #181, we removed the test_cladetime_assign_clades_historical test from tests/integration/test_cladetime_integration.py (lines 59-93) due to Nextstrain's deletion of historical S3 data.

What the Test Was Testing

The removed test verified that CladeTime could:

  1. Initialize with a historical date (freeze_time("2024-10-30"))
  2. Retrieve historical reference tree metadata for clade assignments
  3. Assign clades using a historical reference tree (different from current tree)
  4. Verify metadata accuracy against known values from hub archives:
    • nextclade_dataset_version: "2024-10-17--16-48-48Z"
    • nextclade_version_num: "3.9.1"
    • assignment_as_of: "2024-10-30 00:00"

Test Code Reference

@pytest.mark.skipif(not docker_enabled, reason="Docker is not installed")
def test_cladetime_assign_clades_historical(tmp_path, demo_mode, patch_s3_for_tests):
    """
    Test clade assignment with historical date using real hub metadata.
    
    This test verifies that CladeTime can correctly retrieve and use historical
    metadata from variant-nowcast-hub archives. We use 2024-10-30 because we
    know the exact metadata that should be returned from the hub archive.
    """
    assignment_file = tmp_path / "assignments_historical.tsv"
    
    with freeze_time("2024-10-30"):
        ct = CladeTime()
        
        metadata_filtered = sequence.filter_metadata(
            ct.sequence_metadata,
            collection_min_date="2024-10-01"
        )
        
        # Assign clades using historical reference tree
        assigned_clades = ct.assign_clades(metadata_filtered, output_file=assignment_file)
        
        # Verify metadata reflects 2024-10-30 state
        assert assigned_clades.meta.get("sequence_as_of") == datetime(2024, 10, 30, tzinfo=timezone.utc)
        assert assigned_clades.meta.get("tree_as_of") == datetime(2024, 10, 30, tzinfo=timezone.utc)
        
        # Hard-code expected values from 2024-10-30 hub archive
        assert assigned_clades.meta.get("nextclade_dataset_version") == "2024-10-17--16-48-48Z"
        assert assigned_clades.meta.get("nextclade_version_num") == "3.9.1"
        assert assigned_clades.meta.get("assignment_as_of") == "2024-10-30 00:00"

Why It No Longer Works

Nextstrain S3 Data Deletion (October 2025)

Nextstrain implemented a ~7-week retention policy for ALL versioned S3 objects:

  • Sequence data (sequences.fasta.zst): Only available back to 2025-09-29
  • Genome metadata (metadata.tsv.zst): Only available back to 2025-09-29
  • Reference tree metadata (metadata_version.json): Only available from S3 back to 2025-09-29 (the Hub fallback extends this back to 2024-10-09)

The Problem

CladeTime requires three data sources to initialize:

  1. ✅ Reference tree metadata (available via Hub fallback to 2024-10-09)
  2. ❌ Sequence data (NOT available before 2025-09-29)
  3. ❌ Genome metadata (NOT available before 2025-09-29)

Without sequence data from S3, CladeTime cannot initialize for dates before 2025-09-29, even if reference tree metadata is available via Hub fallback.
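
To make the gap concrete, here is a minimal sketch that reports which sources are missing for a given date. The cutoff dates are copied from the availability list above; the dictionary keys and the `missing_sources` function are hypothetical names for illustration and are not part of CladeTime.

```python
from datetime import date

# Earliest available date per source, as reported in this issue.
# Key names are hypothetical, not CladeTime identifiers.
EARLIEST_AVAILABLE = {
    "reference_tree_metadata": date(2024, 10, 9),  # via Hub fallback
    "sequence_data": date(2025, 9, 29),            # S3 only
    "genome_metadata": date(2025, 9, 29),          # S3 only
}


def missing_sources(as_of: date) -> list[str]:
    """Return the sources that are not available for the given date."""
    return [
        name
        for name, earliest in EARLIEST_AVAILABLE.items()
        if as_of < earliest
    ]


# The date used by the removed test: tree metadata exists, the rest does not.
print(missing_sources(date(2024, 10, 30)))
# → ['sequence_data', 'genome_metadata']
```

Any non-empty result means CladeTime cannot initialize for that date, which is why the Hub fallback alone does not rescue the removed test.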

Investigation Details

See S3_AVAILABILITY_FINDINGS.md for comprehensive investigation results showing uniform 7-week retention across all file types.

Current Workaround

The test was temporarily kept working with the patch_s3_for_tests fixture, which mocks S3 responses. Because the mocked responses do not reflect what S3 actually serves, the test was removed per review feedback, both to simplify the test suite and to avoid testing behavior that no longer works against real data.

What We Need

Short-term Solution (Implemented in PR #181)

  • ✅ Add validation that raises clear errors for dates outside data availability
  • ✅ Add negative tests verifying errors are thrown appropriately
  • ✅ Document the limitation clearly in README and docstrings
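
The validation and negative-test pattern described in the list above can be sketched as follows. This is an illustration only, not the code merged in PR #181: `MIN_SEQUENCE_DATE`, `DataUnavailableError`, and `validate_sequence_as_of` are hypothetical names, and the cutoff value is the one reported in this issue, which will move as the retention window rolls forward.

```python
from datetime import date

# Assumed constant. The real min-date constants live in
# src/cladetime/util/config.py, but their names and values are not shown here.
MIN_SEQUENCE_DATE = date(2025, 9, 29)


class DataUnavailableError(ValueError):
    """Hypothetical error type for dates outside data availability."""


def validate_sequence_as_of(as_of: date, min_date: date = MIN_SEQUENCE_DATE) -> date:
    """Return as_of unchanged, or raise a clear error if it is too old."""
    if as_of < min_date:
        raise DataUnavailableError(
            f"Sequence data is not available for {as_of.isoformat()}; "
            f"the earliest available date is {min_date.isoformat()}."
        )
    return as_of


def test_rejects_date_before_retention_window():
    """Negative test: the date from the removed test must now be rejected."""
    try:
        validate_sequence_as_of(date(2024, 10, 30))
    except DataUnavailableError as err:
        # The message should tell the user what the earliest usable date is.
        assert "2025-09-29" in str(err)
    else:
        raise AssertionError("expected DataUnavailableError")
```

In a pytest suite the try/except would normally be written with `pytest.raises`; the plain form is used here so the sketch runs without extra dependencies.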

Long-term Solution (This Issue)

We need to restore test coverage for historical clade assignments. Possible approaches:

Option 1: Test with Recent Historical Dates

  • Use dates within S3 retention window (e.g., 3-4 weeks ago)
  • Tests real functionality but limited historical range
  • Fragile: breaks if Nextstrain reduces retention below test age
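
One way to reduce the fragility noted in Option 1 is to compute the test date relative to "today" and fail loudly if the chosen age no longer fits inside the retention window. The sketch below assumes the ~7-week window reported in this issue; `recent_historical_date` and both constants are hypothetical.

```python
from datetime import date, timedelta

RETENTION_WEEKS = 7  # approximate window reported in this issue
TEST_AGE_WEEKS = 3   # assumed test age; Option 1 suggests 3-4 weeks


def recent_historical_date(
    today: date,
    age_weeks: int = TEST_AGE_WEEKS,
    retention_weeks: int = RETENTION_WEEKS,
) -> date:
    """Return a date age_weeks before today, guarded by the retention window."""
    if age_weeks >= retention_weeks:
        raise ValueError(
            f"Test age of {age_weeks} weeks is not inside the "
            f"{retention_weeks}-week retention window."
        )
    return today - timedelta(weeks=age_weeks)


print(recent_historical_date(date(2025, 11, 17)))
# → 2025-10-27
```

The ISO string of the returned date could then be used with `freeze_time`, as the removed test did with a fixed date. If Nextstrain shortens the window, only `RETENTION_WEEKS` needs updating, and the guard flags a test age that has become too old.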

Option 2: Archive Representative Test Data

  • Archive small datasets (100k demo size) at specific dates
  • Store in repo or external service (Zenodo, institutional storage)
  • Mock S3 to return archived data for specific test dates
  • Provides stable, reproducible historical testing
  • Requires infrastructure and maintenance
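
A rough sketch of how archived data might be looked up for specific test dates is shown below. Everything here is an assumption: no archive exists yet, the directory layout and `ARCHIVE_ROOT` are invented for illustration, and the single entry in `ARCHIVED_DATES` is a placeholder borrowed from the removed test. Only the three file names come from this issue.

```python
from datetime import date
from pathlib import Path

# Hypothetical layout: one directory per archived snapshot date.
ARCHIVE_ROOT = Path("tests/data/archive")

# Placeholder: no snapshot has actually been archived for this date.
ARCHIVED_DATES = {date(2024, 10, 30)}


def archived_paths(as_of: date, root: Path = ARCHIVE_ROOT) -> dict[str, Path]:
    """Map a test date to the archived files a mocked S3 layer would serve."""
    if as_of not in ARCHIVED_DATES:
        raise KeyError(f"No archived snapshot for {as_of.isoformat()}")
    snapshot = root / as_of.isoformat()
    return {
        "sequences": snapshot / "sequences.fasta.zst",
        "metadata": snapshot / "metadata.tsv.zst",
        "tree_metadata": snapshot / "metadata_version.json",
    }
```

A fixture that mocks S3 could call a function like this to decide which local file to return, and raising `KeyError` for unarchived dates keeps tests from silently falling back to data that does not match the requested date.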

Option 3: Integration Test Against Current Hub Archives

  • Test dates known to exist in Hub archives (2024-10-09 onwards)
  • Only works if we can also archive corresponding sequence data
  • Most faithful to actual use case but requires data archiving

Option 4: Hybrid Approach

  • Recent dates: Test real S3 data (within retention window)
  • Historical dates: Test with archived representative datasets
  • Document which scenarios are covered and limitations
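
The routing implied by the hybrid approach can be sketched as a small function that also makes coverage gaps explicit. The function name, the return labels, and the 7-week default are assumptions for illustration; the real window is approximate and may change.

```python
from datetime import date, timedelta
from typing import Literal

Source = Literal["s3", "archive", "uncovered"]


def pick_data_source(
    as_of: date,
    today: date,
    archived_dates: frozenset[date] = frozenset(),
    retention: timedelta = timedelta(weeks=7),  # approximate, per this issue
) -> Source:
    """Decide where a test for as_of would get its data."""
    if as_of > today:
        raise ValueError("as_of cannot be in the future")
    # Strictly inside the window, to stay on the conservative side.
    if today - as_of < retention:
        return "s3"
    if as_of in archived_dates:
        return "archive"
    return "uncovered"


today = date(2025, 11, 17)
archived = frozenset({date(2024, 10, 30)})  # placeholder snapshot date
print(pick_data_source(date(2025, 10, 27), today, archived))  # → s3
print(pick_data_source(date(2024, 10, 30), today, archived))  # → archive
print(pick_data_source(date(2025, 1, 1), today, archived))    # → uncovered
```

Returning "uncovered" rather than raising gives a simple way to enumerate and document the scenarios that neither real S3 data nor archived datasets can test.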

Recommendation

We recommend Option 4 (Hybrid Approach):

  1. Immediate (in PR #181, "Add fallback to variant-nowcast-hub archives for historical metadata"): Add test using date ~3 weeks ago (within S3 retention)
  2. Near-term: Archive 100k demo datasets at key dates (monthly snapshots)
  3. Long-term: Build test infrastructure to use archived data for historical tests

Related Files

  • tests/integration/test_cladetime_integration.py:59-93 - Removed test
  • src/cladetime/cladetime.py - CladeTime initialization with date validation
  • src/cladetime/util/config.py - Configuration with min date constants
  • S3_AVAILABILITY_FINDINGS.md - Investigation documenting S3 retention policy

Additional Context

This is part of addressing Nextstrain's infrastructure changes that fundamentally altered historical data availability. The Hub fallback mechanism (PR #181) is a partial solution that enables specific use cases (variant-nowcast-hub workflows) but does not fully restore historical analysis capabilities.

Important Note: Given uncertainty around Nextstrain's data pipeline longevity and retention policies, any solution should be designed with flexibility to adapt to future infrastructure changes. We cannot assume the current 7-week retention window is permanent, nor can we assume it won't be further reduced.
