
@alancleary
Member

Previously, the header of every sample returned for a single position was loaded into memory. However, many headers are identical except for their sample line, so essentially the same header was loaded multiple times per position.

To alleviate the issue, this PR computes the MD5 checksum for each header (minus the sample line) and only stores the unique ones. Samples are then associated with the correct header using an associative array.
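The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `HeaderStore` and `strip_sample_line` are hypothetical names, the sample line is assumed to be the line starting with `#CHROM`, and `std::hash` stands in for MD5 so the example needs no external library.

```cpp
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>

// Drop the line that names the samples (assumed here to be the "#CHROM" line)
// so that headers differing only in that line compare equal.
std::string strip_sample_line(const std::string& header) {
  std::istringstream in(header);
  std::string line, out;
  while (std::getline(in, line)) {
    if (line.rfind("#CHROM", 0) == 0) continue;  // skip the sample line
    out += line;
    out += '\n';
  }
  return out;
}

// Hypothetical container: unique header bodies keyed by checksum, plus a
// sample -> checksum map that ties each sample to its shared header.
struct HeaderStore {
  using Checksum = std::size_t;  // stand-in for an MD5 digest
  std::unordered_map<Checksum, std::string> unique_headers;
  std::unordered_map<std::string, Checksum> sample_to_header;

  void add(const std::string& sample, const std::string& header) {
    std::string body = strip_sample_line(header);
    Checksum sum = std::hash<std::string>{}(body);  // NOT MD5; illustration only
    // emplace leaves the map unchanged if this checksum is already stored.
    unique_headers.emplace(sum, std::move(body));
    sample_to_header[sample] = sum;
  }

  const std::string& header_for(const std::string& sample) const {
    return unique_headers.at(sample_to_header.at(sample));
  }
};
```

With this layout, memory grows with the number of distinct headers rather than the number of samples, while each sample still resolves to its header through `sample_to_header`.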

Notes about the implementation:

  • HTSlib includes low-level support for computing MD5 checksums, but the only high-level functions provided for computing checksums on user-space data are in the context of the HTSlib CRAM API. Thus this PR includes a minimal external library that provides the requisite high-level API. Rolling the HTSlib low-level API into an adequate high-level API is left as future work.
  • MD5 collision detection is implemented but there is no mitigation; a run-time exception will be thrown that provides details about the collision.
  • On the VCF-42 reproducer query, run-time is slightly faster than before and memory usage is reduced 24x. Notably, disabling collision detection cuts the run-time nearly in half, but there is currently no way for users to disable it.
  • Allowing users to disable collision detection and/or choose the checksum algorithm may be features worth implementing in the future.
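To make the collision-detection note concrete, here is a hedged sketch of detection without mitigation. `CheckedHeaderStore` is a hypothetical name, and the checksum function is injected so a deliberately weak one can force a collision; the PR itself uses MD5 and its exception type and message may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical store that detects, but does not mitigate, checksum collisions.
class CheckedHeaderStore {
 public:
  using ChecksumFn = std::function<std::uint64_t(const std::string&)>;

  explicit CheckedHeaderStore(ChecksumFn fn) : checksum_(std::move(fn)) {}

  // Stores the header if its checksum is new and returns that checksum.
  std::uint64_t insert(const std::string& header) {
    const std::uint64_t sum = checksum_(header);
    auto result = headers_.emplace(sum, header);
    const bool inserted = result.second;
    // Same checksum but different content means a collision. In this sketch
    // the check is a full string comparison on every repeated checksum.
    if (!inserted && result.first->second != header) {
      throw std::runtime_error("checksum collision: " + std::to_string(sum) +
                               " maps to two different headers");
    }
    return sum;
  }

  std::size_t size() const { return headers_.size(); }

 private:
  ChecksumFn checksum_;
  std::unordered_map<std::uint64_t, std::string> headers_;
};
```

Passing the checksum as a parameter also hints at how a user-selectable algorithm could be wired in later, though that is only one possible design.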

This function joins a vector of strings using a given delimiter character while optionally omitting empty strings.
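A minimal sketch of such a join helper, assuming a signature that may not match the one added in this commit (the name `join` and the `skip_empty` parameter are illustrative):

```cpp
#include <string>
#include <vector>

// Hypothetical signature; the helper in the PR may differ.
std::string join(const std::vector<std::string>& parts, char delim,
                 bool skip_empty = false) {
  std::string out;
  bool first = true;
  for (const auto& part : parts) {
    if (skip_empty && part.empty()) continue;  // optionally omit empty strings
    if (!first) out += delim;                  // delimiter only between items
    out += part;
    first = false;
  }
  return out;
}
```

Tracking `first` separately, rather than testing whether `out` is empty, keeps the delimiter placement correct when a leading empty string is kept.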
…sample headers

SampleHeaders is a subclass of the TileDBVCFDataset class since it is intended to be instantiated and loaded exclusively by TileDBVCFDataset. Internally, SampleHeaders uses MD5 checksums to load only unique headers and maps to associate samples with those headers.
…r released

This is to prevent memory leaks while making the lifetime of these managed pointers more obvious.
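One common way to get this kind of explicit lifetime is a `std::unique_ptr` with a custom deleter. The sketch below only illustrates the idea and is not taken from this commit: `RawHeader`, `raw_header_create`, and `raw_header_destroy` are hypothetical stand-ins for a C-style handle and its allocation functions.

```cpp
#include <memory>
#include <utility>

// Hypothetical C-style handle standing in for a library-allocated header.
struct RawHeader {
  int id;
};

static int live_headers = 0;  // outstanding allocations, tracked for the demo

RawHeader* raw_header_create(int id) {
  ++live_headers;
  return new RawHeader{id};
}

void raw_header_destroy(RawHeader* h) {
  --live_headers;
  delete h;
}

// The deleter ties destruction to the owning pointer's lifetime.
struct RawHeaderDeleter {
  void operator()(RawHeader* h) const { raw_header_destroy(h); }
};

using HeaderPtr = std::unique_ptr<RawHeader, RawHeaderDeleter>;

HeaderPtr make_header(int id) { return HeaderPtr(raw_header_create(id)); }
```

Ownership is then visible in the type: a `HeaderPtr` can be moved but not copied, and the handle is destroyed exactly once when its owner goes out of scope.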
This allows the names of samples stored in a SampleHeaders instance to be efficiently iterated via a view of the keys of an internal map.
This allows the names of only samples that have headers loaded to be iterated efficiently, both in terms of run-time and memory consumption.
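As an illustration of iterating keys through a view rather than a copied container, here is a small sketch. `KeyView` and `SampleNames` are hypothetical, and the map's value type is a placeholder; in C++20, `std::views::keys` provides the same kind of view out of the box.

```cpp
#include <map>
#include <string>

// Minimal read-only view over the keys of a map. It holds only a reference
// to the map, so no key is copied and no extra container is allocated.
template <typename Map>
class KeyView {
 public:
  class iterator {
   public:
    explicit iterator(typename Map::const_iterator it) : it_(it) {}
    const typename Map::key_type& operator*() const { return it_->first; }
    iterator& operator++() {
      ++it_;
      return *this;
    }
    bool operator!=(const iterator& other) const { return it_ != other.it_; }

   private:
    typename Map::const_iterator it_;
  };

  explicit KeyView(const Map& map) : map_(map) {}
  iterator begin() const { return iterator(map_.begin()); }
  iterator end() const { return iterator(map_.end()); }

 private:
  const Map& map_;
};

// Hypothetical stand-in for a class that maps sample names to header ids.
class SampleNames {
 public:
  using Map = std::map<std::string, int>;

  void add(const std::string& sample, int header_id) {
    headers_[sample] = header_id;
  }

  // The view must not outlive this object.
  KeyView<Map> names() const { return KeyView<Map>(headers_); }

 private:
  Map headers_;
};
```

A caller can then write `for (const std::string& name : samples.names())` and visit only the samples present in the map.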
@alancleary alancleary added the enhancement New feature or request label Jan 13, 2026
Specifically, the std::string(const char* s, size_t n) constructor on macOS seems to ignore the n parameter and overflows. This was mitigated by adding a terminating character to s and using the std::string(const char* s) constructor instead.
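The workaround can be sketched as below. This is an assumption-laden illustration: `string_from_buffer` is a hypothetical helper, and it copies the bytes into a scratch buffer before terminating them, whereas the commit describes adding the terminator to `s` itself.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Copies the first n bytes of s, appends '\0', and builds the string with the
// single-argument constructor, which stops at the first terminator.
std::string string_from_buffer(const char* s, std::size_t n) {
  std::vector<char> buf(s, s + n);
  buf.push_back('\0');
  return std::string(buf.data());
}
```

One trade-off of this route is that a buffer containing an embedded null byte is truncated at that byte, which the length-taking constructor would not do.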
@alancleary alancleary merged commit 2e04999 into main Jan 15, 2026
14 of 15 checks passed
@alancleary alancleary deleted the alancleary/VCF-42 branch January 15, 2026 15:41

3 participants