Add support for VCF ingestion failure #723
Conversation
This replaces the hard-coded strings used when reading/writing the manifest array.
Ingest now distinguishes between samples that are ready to load and samples that have already been loaded, using a new "ready" status. A sample's status is changed from "ready" / "missing index" to "ok" upon successful ingestion, allowing failed ingestions to be resumed.
spencerseale left a comment
Looks good! Thanks @alancleary!
I think the edge case won't be too big of a concern unless someone doesn't realize their dataset failed ingestion. Thankfully, I don't think there's much of that out there among our customers!
Can we get some clarification on the behavior when people try to ingest the same sample split into multiple chromosomes, both during the same ingest attempt and in staggered attempts? No expectations here, but this is one case where even an unfair policy is better than an unclear or inconsistent one.
This PR preserves the existing behavior, i.e. a sample name cannot be ingested from more than one VCF. Implementation details aside, this is basically because once a sample is in the dataset you can't tell which VCF it came from. I'll be sure to update the documentation to convey this, as we discussed.
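To make that rule concrete, here is a hypothetical sketch (the function and variable names are mine, not from the PR) of skipping any VCF URI whose sample name already exists in the dataset:

```python
# Hypothetical sketch of the "one VCF per sample name" rule described above:
# once a sample name exists in the dataset, any URI carrying that name is
# skipped on later ingest attempts, regardless of which VCF it came from.
def filter_new_uris(manifest: dict[str, str], dataset_samples: set[str]) -> list[str]:
    """manifest maps sample_name -> vcf_uri (illustrative structure)."""
    return [uri for name, uri in manifest.items() if name not in dataset_samples]

print(filter_new_uris({"a": "a.vcf.gz", "b": "b.vcf.gz"}, {"a"}))  # ['b.vcf.gz']
```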
Do we need to worry about duplicate data if there is a failure between successfully writing a sample to the dataframe and updating the status in the manifest to "ok"?
```diff
- manifest_df = A.query(
-     cond="status == 'ok' or status == 'missing index'"
- ).df[:]
+ manifest_df = A.df[:]
```
Looks like `manifest_df` is only used for writing the number of samples in the manifest to the logger. Is the extra work worth the extra log info?
I haven't benchmarked this but I'm guessing not. The same pattern is used when filtering the URIs to ingest, which could be done with a Query to avoid loading the entire manifest into memory. Also, I think in practice manifests will be no larger than hundreds of thousands of entries (someone please correct me if this is wrong).
That said, we could revise this to only fetch these data if the logger is set to a level that will report it, or just omit it altogether.
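The "only fetch if the logger will report it" option could be sketched like this (the logger name and counting function are illustrative stand-ins, not code from the PR):

```python
import logging

logger = logging.getLogger("vcf.ingest")
logger.setLevel(logging.WARNING)  # INFO records will not be emitted


def count_manifest_samples() -> int:
    # stand-in for the potentially expensive manifest read discussed above
    return 123_456


# Guard the extra work so it only runs when the record would be emitted:
if logger.isEnabledFor(logging.INFO):
    logger.info("Manifest read: manifest_samples=%d", count_manifest_samples())
```

At the WARNING level above, `isEnabledFor(logging.INFO)` returns `False`, so the manifest read is skipped entirely.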
Also, I think in practice manifests will be no larger than hundreds of thousands of entries (someone please correct me if this is wrong).
You're generally correct. We've got a couple of use cases in the 1-10M sample range, but 100k-1M is typically the upper bound for a lot of use cases. Usually when there are more samples involved, multiple datasets start coming into play. It's not a TileDB limitation; more often, geographic boundaries and/or other compliance regulations tend to push users to split the datasets up, so you never end up with hundreds of millions of samples in a single TileDB-VCF dataset.
No.
sgillies left a comment
@alancleary I made one request to structure the logging and made a minor suggestion about the state of ingestion that may or may not be useful to you.
src/tiledb/cloud/vcf/ingestion.py
Outdated
```python
ingest_df = ingest_df[~ingest_df.sample_name.isin(failed_samples)]
result = ingest_df.vcf_uri.to_list()
# ...
logger.info("%d samples in the dataset.", len(dataset_samples))
```
@alancleary would you be willing to make the log messages more structured? I don't mean using structlog necessarily, but changing the messages from prose to event + data. For example: `logger.info("Ingestion iteration completed: successful_samples=%r, failed_samples=%r", ...)`. Everything relevant to the event goes on a single line.
Messages like these are easy to scan for and can be more easily parsed and turned into data.
I know that's not the prevailing style in our Python code, but I'm trying to nudge new work in the structured direction.
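A minimal, self-contained sketch of that style follows; the logger name, handler setup, and sample values are mine for demonstration, not code from this PR:

```python
import io
import logging

# Capture log output in a string so the emitted line is visible below.
logger = logging.getLogger("vcf.ingest.demo")
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

successful_samples = ["s1", "s2"]
failed_samples = ["s3"]

# One event, all relevant data as key=value pairs on a single line:
logger.info(
    "Ingestion iteration completed: successful_samples=%r, failed_samples=%r",
    successful_samples,
    failed_samples,
)
print(stream.getvalue().strip())
# Ingestion iteration completed: successful_samples=['s1', 's2'], failed_samples=['s3']
```

Lines in this shape are easy to grep for the event name and can be parsed back into data with a simple key=value split.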
Thanks for the suggestion @sgillies. That makes sense to me. I'll make the change.
Here are a couple of examples:
Thanks @sgillies. Looking at the logging code now, reworking all the messages in ingestion would require touching a fair amount of code unrelated to this PR, so I think I'll do it as its own PR or as part of a manifest rework. Are there guidelines anywhere for this that I can reference in an issue?
@alancleary I'm only asking about the new logging you've added. We can revisit the existing code later, I agree.
We don't have any guidelines about logging, but I'll add something to a contributing doc.
Fair enough. So you're imagining something like this (note a few variable names are different to add context):

```python
message = (
    "Filtering sample URIs: dataset_samples=%d, ingested_samples=%d, "
    "partial_samples=%d, manifest_samples=%d, ready_samples=%d, "
    "queued_samples=%d"
)
logger.info(
    message,
    len(dataset_samples),
    len(ingested_samples),
    len(partial_samples),
    len(manifest_samples),
    len(ready_samples),
    len(queued_samples),
)
```
@alancleary Yes, like that. In looking around for good examples on the web, I came across this one from Stripe: https://brandur.org/canonical-log-lines#what-are-they. "Literally one big log line".
I think you could leverage f-strings to simplify the above.

```python
logger.info(
    f"Filtering sample URIs: {len(dataset_samples)=}, {len(ingested_samples)=}, "
    f"{len(partial_samples)=}, {len(manifest_samples)=}, {len(ready_samples)=}, "
    f"{len(queued_samples)=}"
)
```

With the `=` syntax, you get, for example, `"Filtering sample URIs: len(dataset_samples)=100000, ..."`.
"Literally one big log line"
I like it! Also, I hadn't encountered the `=` syntax with f-strings before. Thanks for sharing!
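For reference, a quick demonstration of the `=` specifier (available since Python 3.8), which echoes the expression text alongside its value:

```python
dataset_samples = ["s1", "s2", "s3"]
print(f"{len(dataset_samples)=}")  # prints: len(dataset_samples)=3
```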
```diff
- status = "" if status == "ok" else status + ","
- status += "missing index"
+ status = "" if status == Status.READY else status + ","
+ status += Status.MISSING_INDEX
```
Seeing this, I wonder if `status` shouldn't be a stack (list) instead of a string of comma-separated values.
Yeah, I don't think a string is quite right here. I would like to revisit the whole manifest at some point, but this PR is just a bug fix!
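A rough sketch of what the list-based alternative might look like next to the current string approach; the `READY`/`MISSING_INDEX` constants stand in for the `Status` enum, and nothing here is code from the PR:

```python
READY = "ready"
MISSING_INDEX = "missing index"


def add_status_str(status: str, new: str) -> str:
    # current approach: comma-separated string surgery, as in the diff above
    status = "" if status == READY else status + ","
    return status + new


def add_status_list(statuses: list[str], new: str) -> list[str]:
    # stack-like alternative: replace the initial "ready" marker, else push
    return [new] if statuses == [READY] else statuses + [new]


print(add_status_str(READY, MISSING_INDEX))       # missing index
print(add_status_list([READY], MISSING_INDEX))    # ['missing index']
```

The list form avoids the split/join round-trips when statuses need to be inspected later.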
This includes renaming some variables to add context to the messages.
@sgillies I added a commit that updates the log messages, as we discussed. I ended up updating most of the log messages in the file because nearly all of them were in the code paths this commit affects.
src/tiledb/cloud/vcf/ingestion.py
Outdated
```python
f"{num_partitions=}, "
f"{num_consolidates=}, "
f"{ingest_resources=}, "
f"{CONSOLIDATE_RESOURCES=}"
```
Thank you @alancleary! Merge when you're ready 👍
Great! I'll merge after verifying that this has no unintended effects on 1-click ingestion.
This ensures that a member's value is saved to the ingestion manifest, rather than the member's name.
Pushed another commit that adds a |
@alancleary @sgillies are we good to merge this?
We still need to test 1-click against these changes, which requires involvement from someone on the cloud team. I'll try to push it along this week.
There are some conflicts to be resolved, too. After that, I'm fine with a merge. I've already approved.
This PR updates VCF ingestion to support resuming failed ingestions. This is done by using a new "ready" status in the VCF manifest array.
Previous ingestion process:
- Samples are added to the manifest with status "ok" or "missing index"
- Samples with status "ok" or "missing index" that aren't already in the dataset are ingested

Notes:
- Samples are given an "ok" or "missing index" status when added to the manifest

New ingestion process:
- Samples are added to the manifest with status "ready" or "missing index"
- Samples with status "ready" or "missing index" are ingested when the `resume` ingestion parameter is `True`
- A sample's status is changed to "ok" after successful ingestion

Notes:
- Samples with status "ok" are assumed to be ingested
- Samples with status "missing index" can be "resumed" and their status will be changed to "ok"
- Samples with status "missing index" that got far enough to have an entry in the dataset can now be resumed

Edge cases:
- Samples with status "ok" whose ingestion failed after the sample was added to the dataset will need to be manually removed from the manifest
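The new flow can be sketched roughly as follows; the dict-based manifest and all names are illustrative, not the actual manifest array schema:

```python
# Status values mirror the ones described above.
READY, MISSING_INDEX, OK = "ready", "missing index", "ok"


def ingest(manifest: dict[str, str]) -> list[str]:
    """Ingest every sample not yet marked "ok"; return the samples loaded."""
    loaded = []
    for sample, status in manifest.items():
        if status in (READY, MISSING_INDEX):
            # ... load the sample's VCF into the dataset here ...
            manifest[sample] = OK  # mark success so a rerun skips this sample
            loaded.append(sample)
    return loaded


# A failed run leaves non-"ok" statuses behind, so rerunning resumes cleanly:
m = {"s1": READY, "s2": OK, "s3": MISSING_INDEX}
print(ingest(m))  # ['s1', 's3']
print(m)          # {'s1': 'ok', 's2': 'ok', 's3': 'ok'}
```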