Conversation

@alancleary
Member

This PR updates VCF ingestion to support resuming failed ingestions. This is done by using a new "ready" status in the VCF manifest array.

Previous ingestion process:

  1. Given a set of VCF URIs, add URIs that aren't already in the manifest
    • Samples that are ready to be ingested have status "ok" or "missing index"
  2. Ingest URIs from the manifest array with status "ok" or "missing index" that aren't already in the dataset

Notes:

  • This process prevents URIs from ever being ingested if they don't have the "ok" or "missing index" status when added to the manifest
  • This process prevents resuming ingestions that got far enough to add samples to the dataset but then failed

New ingestion process:

  1. Given a set of VCF URIs, add URIs that aren't already in the manifest
    • Samples that are ready to be ingested now have status "ready" or "missing index"
  2. Ingest URIs from the manifest array with status "ready" or "missing index"
    • If the sample for a URI is already in the dataset, it will only be ingested if the resume ingestion parameter is True
  3. Change the sample's status to "ok" after successful ingestion

Notes:

  • This workflow allows resuming partial/failed ingestions
  • This workflow is backwards compatible with existing manifests
    • Previously ingested entries with status "ok" are assumed to be ingested
    • Previously ingested entries with status "missing index" can be "resumed" and their status will be changed to "ok"
    • Previously failed ingestions with status "missing index" that got far enough to have an entry in the dataset can now be resumed

Edge cases:

  • Samples already in the manifest with status "ok" whose ingestion failed after the sample was added to the dataset will need to be manually removed from the manifest
    • There is no simple way to detect these samples programmatically so manual intervention is required
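As a rough sketch, the new process could be modeled like this (all names are invented for illustration; the real implementation calls TileDB-VCF ingestion where the comment indicates):

```python
from enum import Enum

class Status(Enum):
    """Hypothetical manifest statuses; names are illustrative."""
    READY = "ready"
    MISSING_INDEX = "missing index"
    OK = "ok"

def ingest(manifest: dict[str, Status], dataset_samples: set[str], resume: bool = False) -> list[str]:
    """Ingest samples whose status is 'ready' or 'missing index'.

    A sample already present in the dataset is only re-ingested when
    resume=True; successful ingestion flips its status to 'ok'.
    """
    ingested = []
    for sample, status in manifest.items():
        if status not in (Status.READY, Status.MISSING_INDEX):
            continue  # "ok" entries are assumed to be fully ingested
        if sample in dataset_samples and not resume:
            continue  # partial ingestion; needs resume=True to retry
        # ... call the actual TileDB-VCF ingestion here ...
        manifest[sample] = Status.OK
        ingested.append(sample)
    return ingested
```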

This replaces hard-coded strings used when reading/writing the manifest array.
Ingest now distinguishes between samples that are ready to load and samples that have been loaded using a new "ready" status. The status of a sample is changed from "ready" / "missing index" to "ok" upon successful ingestion, allowing failed ingestions to be resumed.
@alancleary added the bug and enhancement labels on Jun 14, 2025
Collaborator

@spencerseale left a comment


Looks good! Thanks @alancleary!

I think the edge case won't be too big of a concern unless someone didn't realize their dataset had failed ingestions. Thankfully, I don't think many of our customers are in that situation!

@leipzig
Member

leipzig commented Jun 16, 2025

Can we get some clarification on the behavior when people try to ingest the same sample split into multiple chromosomes - both during the same ingest attempt and in staggered attempts?

No expectations here but this is one case where even an unfair policy is better than an unclear or inconsistent one.

@alancleary
Member Author

Can we get some clarification on the behavior when people try to ingest the same sample split into multiple chromosomes - both during the same ingest attempt and in staggered attempts?

No expectations here but this is one case where even an unfair policy is better than an unclear or inconsistent one.

This PR preserves the existing behavior, i.e. a sample name cannot be ingested from more than one VCF. Implementation details aside, this is basically because once a sample is in the dataset you can't tell which VCF it came from.

I'll be sure to update the documentation to convey this, as we discussed.

@jp-dark

jp-dark commented Jun 17, 2025

Do we need to worry about duplicate data if there is a failure between successfully writing a sample to the dataframe and updating the status in the manifest to "ok"?

manifest_df = A.query(
    cond="status == 'ok' or status == 'missing index'"
).df[:]
manifest_df = A.df[:]


Looks like manifest_df is only used for writing the number of samples in the manifest to the logger. Is the extra work worth the extra log info?

Member Author


I haven't benchmarked this but I'm guessing not. The same pattern is used when filtering the URIs to ingest, which could be done with a Query to avoid loading the entire manifest into memory. Also, I think in practice manifests will be no larger than hundreds of thousands of entries (someone please correct me if this is wrong).

That said, we could revise this to only fetch these data if the logger is set to a level that will report it, or just omit it altogether.
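One way to sketch the "only fetch when the logger will report it" option (logger name and helper function are hypothetical):

```python
import logging

logger = logging.getLogger("tiledb.vcf.ingest")  # hypothetical logger name

def maybe_count_samples(read_manifest):
    """Skip the expensive manifest read unless INFO messages will be emitted."""
    if not logger.isEnabledFor(logging.INFO):
        return None
    manifest = read_manifest()  # e.g. the full A.df[:] read
    logger.info("%d samples in the manifest.", len(manifest))
    return len(manifest)
```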

Member


Also, I think in practice manifests will be no larger than hundreds of thousands of entries (someone please correct me if this is wrong).

You're generally correct. We've got a couple of use cases in the 1-10M sample range, but 100k-1M is typically the upper bound for most use cases. Usually when more samples are involved, multiple datasets start coming into play. It's not a TileDB limitation; more often geographic boundaries and/or other compliance regulations tend to push users to split the datasets up, so you never end up with hundreds of millions of samples in a single TileDB-VCF dataset.

@alancleary
Member Author

alancleary commented Jun 17, 2025

Do we need to worry about duplicate data if there is a failure between successfully writing a sample to the dataframe and updating the status in the manifest to "ok"?

No. If resume=False then the data won't be loaded again because it's already in the dataset. If resume=True then the underlying call to TileDB-VCF ingest will be run with resume=True as well, resulting in no data duplication. I've tested both scenarios.

Collaborator

@sgillies left a comment


@alancleary I made one request to structure the logging and made a minor suggestion about the state of ingestion that may or may not be useful to you.

ingest_df = ingest_df[~ingest_df.sample_name.isin(failed_samples)]
result = ingest_df.vcf_uri.to_list()

logger.info("%d samples in the dataset.", len(dataset_samples))
Collaborator

@sgillies commented Jun 17, 2025


@alancleary would you be willing to make the log messages more structured? I don't mean using structlog necessarily, but changing the messages from being prose to being event + data. For example: logger.info("Ingestion iteration completed: successful_samples=%r, failed_samples=%r", ...). Everything relevant to the event on a single line.

Messages like these are easy to scan for and can be more easily parsed and turned into data.

I know that's not the prevailing style in our Python code, but I'm trying to nudge new work in the structured direction.

Member Author


Thanks for the suggestion @sgillies. That makes sense to me. I'll make the change.


Member Author


Thanks @sgillies. Looking at the logging code now, reworking all the messages in ingestion would require touching a bit of code unrelated to this PR so I think I'll do it as its own PR or as part of a manifest rework. Are there guidelines anywhere for this that I can reference in an issue?

Collaborator


@alancleary I'm only asking about the new logging you've added. We can revisit the existing code later, I agree.

We don't have any guidelines about logging, but I'll add something to a contributing doc.

Member Author


Fair enough. So you're imagining something like this (note a few variable names are different to add context):

message = (
    "Filtering sample URIs: dataset_samples=%d, ingested_samples=%d, "
    "partial_samples=%d, manifest_samples=%d, ready_samples=%d, "
    "queued_samples=%d"
)
logger.info(
    message,
    len(dataset_samples),
    len(ingested_samples),
    len(partial_samples),
    len(manifest_samples),
    len(ready_samples),
    len(queued_samples),
)

Collaborator


@alancleary Yes, like that. In looking around for good examples on the web, I came across this one from Stripe: https://brandur.org/canonical-log-lines#what-are-they. "Literally one big log line".

I think you could leverage f-strings to simplify the above.

logger.info(f"Filtering sample URIs: {len(dataset_samples)=}, {len(ingested_samples)=}, {len(partial_samples)=}, {len(manifest_samples)=}, {len(ready_samples)=}, {len(queued_samples)=}")

With the = syntax, you get, for example, "Filtering sample URIs: len(dataset_samples)=100000, ...".

Member Author


"Literally one big log line"

I like it! Also, I haven't encountered = syntax with f-strings before. Thanks for sharing!
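For anyone else new to it, a tiny standalone demonstration of the = specifier (Python 3.8+; variable names are illustrative):

```python
dataset_samples = ["s1", "s2", "s3"]

# The "=" specifier renders both the expression text and its value.
message = f"Filtering sample URIs: {len(dataset_samples)=}"
print(message)  # Filtering sample URIs: len(dataset_samples)=3
```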

status = "" if status == "ok" else status + ","
status += "missing index"
status = "" if status == Status.READY else status + ","
status += Status.MISSING_INDEX
Collaborator


Seeing this, I wonder if status shouldn't be a stack (list) instead of a string of comma-separated values.
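A sketch of what the list-based alternative might look like (function name and shapes are hypothetical, mirroring the string manipulation in the diff above):

```python
def add_status(statuses, new):
    """List-based equivalent of the comma-separated-string update:
    a lone 'ready' placeholder is replaced; other statuses accumulate."""
    if statuses == ["ready"]:
        return [new]
    return statuses + [new]

# Serializing for the manifest would then be a simple join:
# ",".join(add_status(["ok"], "missing index"))  ->  "ok,missing index"
```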

Member Author


Yeah, I don't think string is quite right here. I would like to revisit the whole manifest at some point but this PR is just a bug fix!

This includes renaming some variables to add context to the messages.
@alancleary
Member Author

@sgillies I added a commit that updates the log messages, as we discussed. I ended up updating most of the log messages in the file because nearly all of them were in the code paths this commit affects.

f"{num_partitions=}, "
f"{num_consolidates=}, "
f"{ingest_resources=}, "
f"{CONSOLIDATE_RESOURCES=}"
Collaborator

@sgillies commented Jun 20, 2025


Thank you @alancleary ! Merge when you're ready 👍

Member Author


Great! I'll merge after verifying that this has no unintended effects on 1-click ingestion.

This ensures that a member's value is saved to the ingestion manifest, rather than the member's name.
@alancleary
Member Author

Pushed another commit that adds a __str__() special method to the Status enum. Without it, a status that is never concatenated with a string will be saved to the manifest as the Status member's name, rather than its value!
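A minimal illustration of the issue and fix (member names are guesses based on the statuses discussed in this PR):

```python
from enum import Enum

class Status(Enum):
    READY = "ready"
    MISSING_INDEX = "missing index"
    OK = "ok"

    def __str__(self) -> str:
        # Without this override, str(Status.OK) yields "Status.OK" (the
        # member's qualified name), which would be written to the manifest
        # instead of the intended value "ok".
        return self.value
```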

@spencerseale
Collaborator

@alancleary @sgillies are we good to merge this?

@alancleary
Member Author

@alancleary @sgillies are we good to merge this?

We still need to test 1-click against these changes, which requires involvement from someone on the cloud team. I'll try and push it along this week.

@sgillies
Collaborator

There are some conflicts to be resolved, too. After which, I'm fine with a merge. I've already approved.
