Conversation

@alancleary
Member

This PR updates VCF ingestion to support resuming failed ingestions. This is done by using a new "ready" status in the VCF manifest array.

Previous ingestion process:

  1. Given a set of VCF URIs, add URIs that aren't already in the manifest
    • Samples that are ready to be ingested have status "ok" or "missing index"
  2. Ingest URIs from the manifest array with status "ok" or "missing index" that aren't already in the dataset

Notes:

  • This process prevents URIs from ever being ingested if they don't have the "ok" or "missing index" status when added to the manifest
  • This process prevents resuming ingestions that got far enough to add samples to the dataset but then failed

New ingestion process:

  1. Given a set of VCF URIs, add URIs that aren't already in the manifest
    • Samples that are ready to be ingested now have status "ready" or "missing index"
  2. Ingest URIs from the manifest array with status "ready" or "missing index"
    • If the sample for a URI is already in the dataset, it will only be ingested if the resume ingestion parameter is True
  3. Change the sample's status to "ok" after successful ingestion

Notes:

  • This workflow allows resuming partial/failed ingestions
  • This workflow is backwards compatible with existing manifests
    • Previously ingested entries with status "ok" are assumed to be ingested
    • Previously ingested entries with status "missing index" can be "resumed" and their status will be changed to "ok"
    • Previously failed ingestions with status "missing index" that got far enough to have an entry in the dataset can now be resumed

Edge cases:

  • Samples already in the manifest with status "ok" whose ingestion failed after the sample was added to the dataset will need to be manually removed from the manifest
    • There is no simple way to detect these samples programmatically so manual intervention is required
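As a rough sketch, the new process could be modeled like this (all names are invented for illustration; the real implementation calls TileDB-VCF ingestion where the comment indicates):

```python
from enum import Enum

class Status(Enum):
    """Hypothetical manifest statuses; names are illustrative."""
    READY = "ready"
    MISSING_INDEX = "missing index"
    OK = "ok"

def ingest(manifest: dict[str, Status], dataset_samples: set[str], resume: bool = False) -> list[str]:
    """Ingest samples whose status is 'ready' or 'missing index'.

    A sample already present in the dataset is only re-ingested when
    resume=True; successful ingestion flips its status to 'ok'.
    """
    ingested = []
    for sample, status in manifest.items():
        if status not in (Status.READY, Status.MISSING_INDEX):
            continue  # "ok" entries are assumed to be fully ingested
        if sample in dataset_samples and not resume:
            continue  # partial ingestion; needs resume=True to retry
        # ... call the actual TileDB-VCF ingestion here ...
        manifest[sample] = Status.OK
        ingested.append(sample)
    return ingested
```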

This replaces hard-coded strings used when reading/writing the manifest array.
Ingest now distinguishes between samples that are ready to load and samples that have been loaded using a new "ready" status. The status of a sample is changed from "ready" / "missing index" to "ok" upon successful ingestion, allowing failed ingestions to be resumed.
@alancleary added the bug and enhancement labels on Jun 14, 2025
Collaborator

@spencerseale left a comment


Looks good! Thanks @alancleary!

I think the edge case won't be too big of a concern unless someone didn't realize their dataset had failed ingestions. Thankfully, I don't think many of our customers are in that situation!

@leipzig
Member

leipzig commented Jun 16, 2025

Can we get some clarification on the behavior when people try to ingest the same sample split into multiple chromosomes - both during the same ingest attempt and in staggered attempts?

No expectations here but this is one case where even an unfair policy is better than an unclear or inconsistent one.

@alancleary
Member Author

Can we get some clarification on the behavior when people try to ingest the same sample split into multiple chromosomes - both during the same ingest attempt and in staggered attempts?

No expectations here but this is one case where even an unfair policy is better than an unclear or inconsistent one.

This PR preserves the existing behavior, i.e. a sample name cannot be ingested from more than one VCF. Implementation details aside, this is basically because once a sample is in the dataset you can't tell which VCF it came from.

I'll be sure to update the documentation to convey this, as we discussed.

@jp-dark

jp-dark commented Jun 17, 2025

Do we need to worry about duplicate data if there is a failure between successfully writing a sample to the dataframe and updating the status in the manifest to "ok"?

manifest_df = A.query(
    cond="status == 'ok' or status == 'missing index'"
).df[:]
manifest_df = A.df[:]


Looks like manifest_df is only used for writing the number of samples in the manifest to the logger. Is the extra work worth the extra log info?

Member Author


I haven't benchmarked this but I'm guessing not. The same pattern is used when filtering the URIs to ingest, which could be done with a Query to avoid loading the entire manifest into memory. Also, I think in practice manifests will be no larger than hundreds of thousands of entries (someone please correct me if this is wrong).

That said, we could revise this to only fetch these data if the logger is set to a level that will report it, or just omit it altogether.
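One way to sketch the "only fetch when the logger will report it" option (logger name and helper function are hypothetical):

```python
import logging

logger = logging.getLogger("tiledb.vcf.ingest")  # hypothetical logger name

def maybe_count_samples(read_manifest):
    """Skip the expensive manifest read unless INFO messages will be emitted."""
    if not logger.isEnabledFor(logging.INFO):
        return None
    manifest = read_manifest()  # e.g. the full A.df[:] read
    logger.info("%d samples in the manifest.", len(manifest))
    return len(manifest)
```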

Member


Also, I think in practice manifests will be no larger than hundreds of thousands of entries (someone please correct me if this is wrong).

You're generally correct. We've got a couple of use cases in the 1-10M sample range, but 100k-1M is typically the upper bound for most use cases. Usually when more samples are involved, multiple datasets start coming into play. It's not a TileDB limitation; more often geographic boundaries and/or other compliance regulations tend to push users to split the datasets up, so you never end up with hundreds of millions of samples in a single TileDB-VCF dataset.

@alancleary
Member Author

alancleary commented Jun 17, 2025

Do we need to worry about duplicate data if there is a failure between successfully writing a sample to the dataframe and updating the status in the manifest to "ok"?

No. If resume=False then the data won't be loaded again because it's already in the dataset. If resume=True then the underlying call to TileDB-VCF ingest will be run with resume=True as well, resulting in no data duplication. I've tested both scenarios.

Collaborator

@sgillies left a comment


@alancleary I made one request to structure the logging and made a minor suggestion about the state of ingestion that may or may not be useful to you.

ingest_df = ingest_df[~ingest_df.sample_name.isin(failed_samples)]
result = ingest_df.vcf_uri.to_list()

logger.info("%d samples in the dataset.", len(dataset_samples))
Collaborator

@sgillies commented Jun 17, 2025


@alancleary would you be willing to make the log messages more structured? I don't mean using structlog necessarily, but changing the messages from being prose to being event + data. For example: logger.info("Ingestion iteration completed: successful_samples=%r, failed_samples=%r", ...). Everything relevant to the event on a single line.

Messages like these are easy to scan for and can be more easily parsed and turned into data.

I know that's not the prevailing style in our Python code, but I'm trying to nudge new work in the structured direction.

Member Author


Thanks for the suggestion @sgillies. That makes sense to me. I'll make the change.


Member Author


Thanks @sgillies. Looking at the logging code now, reworking all the messages in ingestion would require touching a bit of code unrelated to this PR so I think I'll do it as its own PR or as part of a manifest rework. Are there guidelines anywhere for this that I can reference in an issue?

Collaborator


@alancleary I'm only asking about the new logging you've added. We can revisit the existing code later, I agree.

We don't have any guidelines about logging, but I'll add something to a contributing doc.

Member Author


Fair enough. So you're imagining something like this (note a few variable names are different to add context):

message = (
    "Filtering sample URIs: dataset_samples=%d, ingested_samples=%d, "
    "partial_samples=%d, manifest_samples=%d, ready_samples=%d, "
    "queued_samples=%d"
)
logger.info(
    message,
    len(dataset_samples),
    len(ingested_samples),
    len(partial_samples),
    len(manifest_samples),
    len(ready_samples),
    len(queued_samples),
)

Collaborator


@alancleary Yes, like that. In looking around for good examples on the web, I came across this one from Stripe: https://brandur.org/canonical-log-lines#what-are-they. "Literally one big log line".

I think you could leverage f-strings to simplify the above.

logger.info(f"Filtering sample URIs: {len(dataset_samples)=}, {len(ingested_samples)=}, {len(partial_samples)=}, {len(manifest_samples)=}, {len(ready_samples)=}, {len(queued_samples)=}")

With the = syntax, you get, for example, "Filtering sample URIs: len(dataset_samples)=100000, ...".

Member Author


"Literally one big log line"

I like it! Also, I haven't encountered = syntax with f-strings before. Thanks for sharing!
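For anyone else new to it, a tiny standalone demonstration of the = specifier (Python 3.8+; variable names are illustrative):

```python
dataset_samples = ["s1", "s2", "s3"]

# The "=" specifier renders both the expression text and its value.
message = f"Filtering sample URIs: {len(dataset_samples)=}"
print(message)  # Filtering sample URIs: len(dataset_samples)=3
```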

status = "" if status == "ok" else status + ","
status += "missing index"
status = "" if status == Status.READY else status + ","
status += Status.MISSING_INDEX
Collaborator


Seeing this, I wonder if status shouldn't be a stack (list) instead of a string of comma-separated values.
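A sketch of what the list-based alternative might look like (function name and shapes are hypothetical, mirroring the string manipulation in the diff above):

```python
def add_status(statuses, new):
    """List-based equivalent of the comma-separated-string update:
    a lone 'ready' placeholder is replaced; other statuses accumulate."""
    if statuses == ["ready"]:
        return [new]
    return statuses + [new]

# Serializing for the manifest would then be a simple join:
# ",".join(add_status(["ok"], "missing index"))  ->  "ok,missing index"
```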

Member Author


Yeah, I don't think string is quite right here. I would like to revisit the whole manifest at some point but this PR is just a bug fix!

This includes renaming some variables to add context to the messages.
@alancleary
Member Author

@sgillies I added a commit that updates the log messages, as we discussed. I ended up updating most of the log messages in the file because nearly all of them were in the code paths this commit affects.

f"{num_partitions=}, "
f"{num_consolidates=}, "
f"{ingest_resources=}, "
f"{CONSOLIDATE_RESOURCES=}"
Collaborator

@sgillies commented Jun 20, 2025


Thank you @alancleary ! Merge when you're ready 👍

Member Author


Great! I'll merge after verifying that this has no unintended effects on 1-click ingestion.

This ensures that a member's value is saved to the ingestion manifest, rather than the member's name.
@alancleary
Member Author

Pushed another commit that adds a __str__() special method to the Status enum. Without it, a status that is never concatenated with a string will be saved to the manifest as the Status member's name, rather than its value!
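A minimal illustration of the issue and fix (member names are guesses based on the statuses discussed in this PR):

```python
from enum import Enum

class Status(Enum):
    READY = "ready"
    MISSING_INDEX = "missing index"
    OK = "ok"

    def __str__(self) -> str:
        # Without this override, str(Status.OK) yields "Status.OK" (the
        # member's qualified name), which would be written to the manifest
        # instead of the intended value "ok".
        return self.value
```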

@spencerseale
Collaborator

@alancleary @sgillies are we good to merge this?

@alancleary
Member Author

@alancleary @sgillies are we good to merge this?

We still need to test 1-click against these changes, which requires involvement from someone on the cloud team. I'll try and push it along this week.

@sgillies
Collaborator

There are some conflicts to be resolved, too. After which, I'm fine with a merge. I've already approved.
