Fast positional k‑mer grep over RocksDict / RocksDB
kgrep scans and retrieves fixed‑length k‑mers from a RocksDict/RocksDB database. It supports position‑specific filters, exact lookups for large k‑mer lists, and high‑throughput streaming designed to scale linearly to tens of millions of queries.
- Mask syntax with wildcards and character sets (see Mask Syntax).
- Rule syntax with 1‑based positional constraints; supports OR logic across rule groups.
- Tight prefix‑bounded scans using RocksDB lower/upper iterate bounds.
- Automatic alphabet inference (default DNA ACGT), configurable via `--alphabet`.
- Optional output to file via `--output`.
- Chunk retrieval (exact lookups for N kmers) with:
  - Streaming, O(1) memory per line (no preloading of large files).
  - Optional reverse‑complement lookup via `--rc` (ACGT ASCII input).
  - Progress reporting to stderr.
  - Value decoding (`--values`) and count filtering (`--min-count`).
  - Buffered output for high throughput.
  - Optional output de‑duplication to avoid double prints when different inputs map to the same DB key (e.g., via RC).
pip install "git+https://github.com/vogulcan/kgrep@master"

Note: `--mask` overrides `--rule` and is applied from the beginning of the k‑mer.

kgrep --db /path/dbfolder --mask "A[CT]?AA" --min-count 10

Note: `--rule` positions are 1‑based.

kgrep --db /path/dbfolder --rule "1:A;2:CT" --min-count 1 --output output.txt
kgrep --db /path/dbfolder
kgrep --db /path/dbfolder --rule "3:A;5:CG;10:T" --limit 1000
# (7:A AND 8:A) OR (17:T AND 18:T)
kgrep --db /path/dbfolder --rule "7:A;8:A|17:T;18:T"
# Same as above with multiple --rule flags:
kgrep --db /path/dbfolder --rule "7:A;8:A" --rule "17:T;18:T"

Retrieve values for an explicit list of kmers (all must have the same length):
# Inline ASCII kmers (comma-separated or multiple flags)
kgrep --db /path/dbfolder --chunks AAAAC --chunks TTTTG,CCCCA
# From a file (one kmer per line; lines starting with '#' are ignored)
kgrep --db /path/dbfolder --chunks-file kmers.txt
# Reverse-complement fallback (ACGT ASCII inputs)
kgrep --db /path/dbfolder --chunks-file kmers.txt --rc
# From STDIN
cat kmers.txt | kgrep --db /path/dbfolder --chunks-file -
# Hex input (for non-ASCII keys); 0x prefix optional
kgrep --db /path/dbfolder --chunks-hex --chunks 0x4141414143 --chunks 7474745447

- If `--k` is not provided, k is inferred from the first provided k‑mer; if none are provided, it falls back to the length of the first key in the DB (original behavior).
- Values are decoded per `--values` (`auto`, `ascii`, `u32le`, `u32be`, `u64le`, `u64be`, `none`), and filtered with `--min-count` when numeric.
- Output for each found key is either:
  - `<key>\t<count>` when a numeric value is decoded, or
  - `<key>` if `--values none` is used or the value cannot be decoded.
- Non‑ASCII keys are printed in hex.
- Missing keys are skipped silently (no output line).
- `--limit` still applies to the number of printed lines.
- Optional progress reporting to stderr:

kgrep --db /path/dbfolder --chunks-file kmers.txt --progress --progress-step 100

When `--chunks-file` is a normal file, a percentage is shown; for stdin, a running count is shown.
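
The output rules above can be illustrated with a small Python sketch. It is not kgrep's actual code: the helper names `decode_count`, `format_key`, and `format_hit` are hypothetical, the `auto` mode is left out because its sampling logic is not described here, and printing the bare key for an undecodable value is an assumption drawn from the list above.

```python
import struct

# Hypothetical mapping from the fixed --values modes to struct formats.
_FORMATS = {"u32le": "<I", "u32be": ">I", "u64le": "<Q", "u64be": ">Q"}


def decode_count(raw: bytes, mode: str):
    """Return an integer count, or None when no count can be decoded."""
    if mode == "none":
        return None
    if mode == "ascii":
        try:
            return int(raw.decode("ascii"))
        except ValueError:  # also covers UnicodeDecodeError
            return None
    if mode not in _FORMATS:
        raise ValueError(f"unsupported mode in this sketch: {mode}")
    fmt = _FORMATS[mode]
    if len(raw) != struct.calcsize(fmt):
        return None
    return struct.unpack(fmt, raw)[0]


def format_key(key: bytes) -> str:
    """Render printable ASCII keys as text and everything else as hex."""
    if key.isascii() and key.decode("ascii").isprintable():
        return key.decode("ascii")
    return key.hex()


def format_hit(key: bytes, raw: bytes, mode: str, min_count: int = 0):
    """Return the output line for one found key, or None if it is filtered out."""
    count = decode_count(raw, mode)
    if count is None:
        return format_key(key)  # no numeric value: key only
    if count < min_count:
        return None  # numeric value below the threshold
    return f"{format_key(key)}\t{count}"
```

For example, `format_hit(b"AAAAC", struct.pack(">I", 12), "u32be")` yields the key and count separated by a tab, while the same call with `min_count=20` yields no line.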
These flags complement existing ones to keep throughput linear on huge input files.
- `--read-only`
  Open DB in read‑only mode so multiple jobs can run in parallel without DB locks.
- `--values <mode>`
  Prefer a fixed mode (e.g., `u32be`) to skip the sampling phase of `--values auto`.
- `--rc`
  If an exact kmer isn’t found, try its reverse‑complement (ACGT ASCII only). If found, prints the DB key that exists and its value. Not found reporting counts a k‑mer as missing only if both the k‑mer and its RC are absent.
- `--progress`, `--progress-step N`
  Show progress on stderr. Larger `N` (e.g., `1000`) = fewer TTY writes.
- `--readahead 0`
  Recommended for point lookups (default for chunk retrieval in the latest version).
- `--pin`
  Enable RocksDB pinning to reduce copies on reads.
- `--buffer-size N` (default: `10000`)
  Buffer N output lines before flushing, reducing per‑line I/O overhead.
- `--no-output-dedup`
  By default, kgrep de‑dups outputs by DB key using a bounded LRU to avoid double prints when different inputs map to the same DB key (e.g., via `--rc`). Use this flag to disable de‑dup and print every hit.
- `--output-lru N` (default: `200000`)
  Size of the bounded LRU used for output de‑duplication (ignored when `--no-output-dedup` is set).
- `--assume-unique`
  Explicitly indicate inputs are unique. The streaming mode already assumes uniqueness and does not allocate large input de‑dup structures; this flag simply documents the assumption and keeps behavior explicit.
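
To make the interaction between `--rc` and output de‑duplication concrete, here is a minimal Python sketch of the idea. It is an illustration under assumptions, not kgrep's implementation: `reverse_complement`, `SeenLRU`, and `lookup_with_rc` are hypothetical names, and a plain `dict` stands in for the database.

```python
from collections import OrderedDict

_COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")


def reverse_complement(kmer: bytes) -> bytes:
    """Complement each base, then reverse the sequence (ACGT ASCII only)."""
    return kmer.translate(_COMPLEMENT)[::-1]


class SeenLRU:
    """Bounded record of recently emitted keys; the oldest entries are evicted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._keys = OrderedDict()

    def first_time(self, key: bytes) -> bool:
        if key in self._keys:
            self._keys.move_to_end(key)  # refresh recency
            return False
        self._keys[key] = None
        if len(self._keys) > self.capacity:
            self._keys.popitem(last=False)  # drop the least recently seen key
        return True


def lookup_with_rc(db, kmers, seen):
    """Yield (db_key, value) once per DB key, trying the RC only on a miss."""
    for kmer in kmers:
        for candidate in (kmer, reverse_complement(kmer)):
            value = db.get(candidate)
            if value is not None:
                if seen.first_time(candidate):
                    yield candidate, value
                break  # found: do not try the other orientation
```

With `db = {b"AAAAC": b"\x00\x00\x00\x05"}`, the inputs `AAAAC` and `GTTTT` both resolve to the stored key `AAAAC`, so only one line is produced. Because the LRU is bounded, a duplicate that arrives after its key has been evicted would be printed again; that is the trade‑off for constant memory.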
Fast single-process, minimal overhead
uv run kgrep --db /path/dbfolder \
--read-only --rc --chunks-file kmers.txt \
--values u32be --readahead 0 --pin \
--progress --progress-step 1000 \
--buffer-size 50000 \
--output hits.txt

Parallel sharding (8 jobs), safe with --read-only
split -n l:8 --additional-suffix=.txt kmers.txt kmers.part.
parallel -j8 --linebuffer \
'uv run kgrep --db /path/dbfolder \
--read-only --rc --chunks-file {} \
--values u32be --readahead 0 --pin \
--progress-step 1000 \
--output {}.out' ::: kmers.part.*.txt
cat kmers.part.*.txt.out | LC_ALL=C sort -u > hits.txt

Tip: If you only need presence/absence (not counts), use `--values none` for maximum speed.
| Symbol | Meaning |
|---|---|
| `?` | Any character from the alphabet (wildcard) |
| `[AC]` | Allowed set at this position (only listed characters allowed) |
| `G` | Exact match for the specified character |
Positions in masks are 1‑based.
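
Both masks and rules can be read as per‑position sets of allowed characters. The Python sketch below shows one way to model that; the function names `compile_mask`, `compile_rule`, and `matches` are hypothetical and the parsing is deliberately minimal (no validation), so treat it as an illustration of the semantics rather than kgrep's parser.

```python
def compile_mask(mask: str, alphabet: str = "ACGT"):
    """Turn a mask such as 'A[CT]?AA' into {1-based position: allowed chars}."""
    allowed, pos, i = {}, 1, 0
    while i < len(mask):
        ch = mask[i]
        if ch == "?":  # wildcard: any alphabet character
            chars = set(alphabet)
            i += 1
        elif ch == "[":  # character set, e.g. [CT]
            end = mask.index("]", i)
            chars = set(mask[i + 1:end])
            i = end + 1
        else:  # literal character
            chars = {ch}
            i += 1
        allowed[pos] = chars
        pos += 1
    return allowed


def compile_rule(rule: str):
    """Turn '7:A;8:A|17:T;18:T' into a list of groups of positional constraints."""
    groups = []
    for group in rule.split("|"):  # groups are ORed
        constraints = {}
        for term in group.split(";"):  # terms within a group are ANDed
            pos, chars = term.split(":")
            constraints[int(pos)] = set(chars)
        groups.append(constraints)
    return groups


def matches(kmer: str, groups) -> bool:
    """True if the k-mer satisfies every constraint of at least one group."""
    return any(
        all(pos <= len(kmer) and kmer[pos - 1] in chars
            for pos, chars in group.items())
        for group in groups
    )
```

A mask is then a single group, `matches("ACGAA", [compile_mask("A[CT]?AA")])`, while a rule with `|` produces several groups that are tried in turn.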
- The scanner chooses an optimal prefix length and bounds each scan range tightly with `(lo, hi)` to minimize iteration cost.
- Chunk retrieval uses robust point lookups (`cf.get` when available; iterator seek otherwise), buffered output, and no negative caching for RC to keep memory usage small and speed consistent.
- Works best when the DB is built with a fixed prefix extractor and prefix Bloom filters; `prefix_same_as_start` is enabled at read time.
- For pure point lookups, prefer `--readahead 0`, `--pin`, and fixed `--values` mode.
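
As a rough illustration of prefix‑bounded scanning, the Python sketch below derives a half‑open `(lo, hi)` byte range for each concrete prefix implied by the leading positions of a mask. The names `prefix_bounds` and `scan_ranges` are hypothetical, and `prefix_len` is passed in by hand because the way kgrep picks its prefix length is not described here.

```python
from itertools import product


def prefix_bounds(prefix: bytes):
    """Return (lo, hi) such that lo <= key < hi exactly when key starts with prefix."""
    lo = prefix
    trimmed = prefix.rstrip(b"\xff")  # trailing 0xff bytes cannot be incremented
    if not trimmed:
        return lo, None  # no finite upper bound exists
    hi = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return lo, hi


def scan_ranges(allowed_by_position, prefix_len: int):
    """Expand the first prefix_len positions into one (lo, hi) range per prefix.

    allowed_by_position is a list of character sets, one per k-mer position.
    """
    choices = [sorted(allowed_by_position[i]) for i in range(prefix_len)]
    for combo in product(*choices):
        yield prefix_bounds("".join(combo).encode("ascii"))
```

For the mask `A[CT]?AA` with `prefix_len=2`, this produces two ranges, `(b"AC", b"AD")` and `(b"AT", b"AU")`. Ranges of this shape are what lower and upper iterate bounds would be set to; positions after the prefix still need to be checked per key.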
- If keys are not raw bytes or ASCII k‑mers, use `--no-raw` or provide inputs with `--chunks-hex`.
- DB should contain fixed‑length k‑mers as keys.
- Reverse‑complement lookup (`--rc`) is only supported for ASCII A/C/G/T inputs (not for hex mode).
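
For readers preparing inputs programmatically, the following Python sketch shows how hex‑encoded and comma‑separated chunk values could be normalized into byte keys. The helpers `parse_hex_kmer` and `parse_chunks` are hypothetical, and whether kgrep accepts comma‑separated values in hex mode is an assumption made here for illustration.

```python
def parse_hex_kmer(text: str) -> bytes:
    """Decode a hex-encoded key, accepting an optional 0x prefix."""
    text = text.strip()
    if text[:2].lower() == "0x":
        text = text[2:]
    return bytes.fromhex(text)


def parse_chunks(args, hex_mode: bool = False):
    """Flatten repeated, comma-separated values and check that lengths agree."""
    kmers = []
    for arg in args:
        for item in arg.split(","):
            item = item.strip()
            if not item:
                continue  # ignore empty fragments such as a trailing comma
            kmers.append(parse_hex_kmer(item) if hex_mode else item.encode("ascii"))
    if len({len(k) for k in kmers}) > 1:
        raise ValueError("all k-mers must have the same length")
    return kmers
```

Here `parse_hex_kmer("0x4141414143")` gives the bytes of `AAAAC`, and `parse_chunks(["AAAAC", "TTTTG,CCCCA"])` gives three 5‑byte keys.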
Run `kgrep --help` to see all options.
MIT License — see repository for details.