Terminology Harmonization: Intelligent Retrieval with Alignment and Weighting reranking via Automated Transformers
Only support Drug domain concepts in this beta.
Use CUDA (--device cuda) or CPU (--device cpu) for correctness.
MPS on Apple Silicon can produce unstable embeddings/scores with the current beta models. We will release a fix in the full release.
- Python
uv
- Request and download the standard concepts in csv format from https://athena.ohdsi.org/
- Convert the csv files into a DuckDB database using sidataplus/athena2duckdb
- Request access on Hugging Face: https://huggingface.co/na399/THIRAWAT-reranker-beta (click "Access request" / accept terms)
- Install Hugging Face CLI following https://huggingface.co/docs/huggingface_hub/en/guides/cli
- Login via CLI so downloads work from code:
hf auth login
# 1. Install dependencies into a local virtual environment (creates .venv/)
uv sync
# 2. (Optional) Activate the environment for interactive shells
source .venv/bin/activate
# 3. Or just run commands directly via uv
uv run python -m thirawat_mapper_beta.index.build --helpuv sync reads the project metadata and installs the required packages (PyTorch, LanceDB, transformers, etc.) against Python 3.11.x. Subsequent uv run ... invocations will reuse the same environment. Replace paths in the examples below to match your workspace. All text used for indexing and inference is normalized (lower-cased, whitespace collapsed) for stable matching.
uv run python -m thirawat_mapper_beta.index.build \
--duckdb data/derived/concepts.duckdb \
--profiles-table concept_profiles \
--concepts-table concept \
--domain-id Drug \
--concept-class-id "Clinical Drug,Quant Clinical Drug,Clinical Drug Comp,Clinical Drug Form,Ingredient" \
--exclude-concept-class-id "Clinical Drug Box,Branded Drug Box,Branded Pack Box,Clinical Pack Box,Marketed Product,Quant Branded Box,Quant Clinical Box" \
--extra-column "concept_name,domain_id,vocabulary_id,concept_class_id" \
--out-db data/lancedb/db \
--table concepts_drug \
--batch-size 256 \
--device cudaKey options:
--duckdb- DuckDB file produced bysidataplus/athena2duckdb.--profiles-table- Table containingconcept_idandprofile_textcolumns.--concepts-table- OMOP concept table (defaults toconcept). The builder always joins to this table and keeps only standard, valid concepts (standard_concept = 'S' AND invalid_reason IS NULL).--domain-id,--concept-class-id- Optional filters; accept comma-separated lists or repeated flags.--exclude-concept-class-id- Exclude specific classes (comma-separated or repeat flag). Default empty; recommended exclusions: Clinical Drug Box, Branded Drug Box, Branded Pack Box, Clinical Pack Box, Marketed Product, Quant Branded Box, Quant Clinical Box.--extra-column- Carry additional columns from the profiles table into LanceDB (repeat flag).--out-db/--table- Target LanceDB directory and table name.
The command will:
- Load profiles (and apply filters if provided).
- Normalize
profile_textand embed with SapBERT CLS vectors (viatransformers). - Write a LanceDB table where
vectoris aFixedSizeList<float32>[768]column. - Emit a
<table>_manifest.jsonmanifest describing the build (model id, filters, counts).
export TOKENIZERS_PARALLELISM=false
uv run python -m thirawat_mapper_beta.infer.bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/usagi.csv \
--out runs/mapping \
--candidate-topk 200 \
--n-limit 20 \
--device cudaInput formats: CSV, TSV, Parquet, or Excel. By default the CLI expects the following columns (override via flags):
sourceName(required)sourceCode(optional)conceptId(optional ground truth)mappingStatus(used for Usagi detection). When the input already follows the Usagi CSV schema (seedata/eval/tmt_to_rxnorm.csv), the CLI validates a sample of rows through a Pydantic schema and surfaces a clear error if the structure is invalid. Otherwise, it synthesizes a minimal Usagi row per record so downstream exports stay consistent.
Selected flags:
--source-name-column,--source-code-column- Override input headers.--label-column- Column containing gold concept IDs (optional, defaultconceptId).--status-column,--approved-value- Configure Usagi approval detection.--batch-size- Query embedding batch size (increase for better GPU throughput).--n-limit- Limit to the first N rows (smoke runs).--where- Optional LanceDB filter, e.g.,vocabulary_id = 'RxNorm' AND concept_class_id != 'Ingredient'(when those columns exist in the index).--device-auto|cuda|mps|cpu(defaultautowith safe fallback and fast matmul).--post-weight- Weight for simple post-score blend (default0.3).
Pipeline steps per row:
- Build query text (
sourceNamewithsourceCodeappended in parentheses when present). - Embed with SapBERT.
- Vector search (cosine) against the LanceDB table to gather
--candidate-topkentries. - Rerank with the THIRAWAT reranker. Beta is vector-only; no FTS/BM25/hybrid.
- Optionally apply the strength+Jaccard post-scorer per query (disabled by default via
--post-weight 0.0).
Outputs (written to --out):
results.csv- Classic relabel layout (wide, block-per-query). Columns: leadingrank1..K, then for each query three adjacent columns[match_rank_or_unmatched, source_concept_name, source_concept_code]with K rows beneath. Non-Usagi inputs preserve the original row order; Usagi inputs continue to sort matched rows first so reviewers can focus on confirmed gold IDs.results_with_input.csv- Original input row with candidate columns appended.results_usagi.csv- Always emitted. Each processed row is coerced into the Usagi schema (using the sample indata/eval/tmt_to_rxnorm.csvas ground truth). The top candidate populatesconceptId,conceptName,domainId, andmatchScorewhen available; otherwise those fields remain blank. Every row is markedmappingStatus=UNCHECKED,statusSetBy=THIRAWAT-mapper,mappingType=MAPS_TOso reviewers can import the file directly into Usagi even when the source sheet was not originally in that format.metrics.json- When ground-truth IDs are available (either viaconceptIdor Usagi rows withmappingStatus == APPROVED) the file reports Hit@{1,2,5,10,20,50,100}, MRR@100, coverage, and counts.
Bulk inference can optionally send the top reranked candidates to an LLM for tie-breaking or abstention logic. Enable this flow with --rag-provider and supply provider-specific flags. The CLI saves every prompt/response pair to rag_prompts.md under the chosen --out directory so you can audit exactly what was sent.
General RAG knobs:
--rag-provider {ollama,llamacpp,openrouter,cloudflare}
--rag-model MODEL_ID # default openai/gpt-oss-20b
--rag-candidate-limit 50 # number of reranked candidates passed to the LLM
--rag-profile-char-limit 512 # truncate long profile_text snippets
--rag-include-retrieval-score/--no-rag-include-retrieval-score
--rag-include-final-score/--no-rag-include-final-score
--rag-extra-context-column COLUMN # optional extra context column from the input sheet
--rag-stop-sequence TEXT (repeatable)
--rag-use-normalized-query/--no-rag-use-normalized-queryTip: RAG is isolated to
infer.bulk. The interactive REPL intentionally remains retrieval-only in this beta.
uv run python -m thirawat_mapper_beta.infer.bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/ollama_rag \
--n-limit 100 \
--rag-provider ollama \
--ollama-base-url http://localhost:11434 \
--ollama-model "gpt-oss:20b"Ollama-specific flags:
--ollama-base-url URL # default http://localhost:11434
--ollama-model MODEL_TAG # defaults to --rag-model value
--ollama-timeout 120 # seconds
--ollama-keep-alive "5m" # optional keep-alive hint sent to serverUse --rag-provider llamacpp only when a llama.cpp llama-server process is already running (default http://127.0.0.1:8080). Launch the server separately with your desired context and batching flags (for example: llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -fa on). Point the CLI at that HTTP endpoint, not at GGUF files directly:
uv run python -m thirawat_mapper_beta.infer.bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/llamacpp_rag \
--rag-provider llamacpp \
--llamacpp-base-url http://127.0.0.1:8080 \
--rag-model ggml-org/gpt-oss-20b-GGUFllama.cpp flags:
--llamacpp-base-url URL # default http://127.0.0.1:8080
--llamacpp-timeout 120 # HTTP timeout in seconds
--llamacpp-chat-format FORMAT # e.g., qwen, llama
--llamacpp-system-prompt TEXT # optional instruction prefix
--llamacpp-n-ctx 8192 # forwarded via query parameters when supported
--llamacpp-model-path /path/model.gguf # fallback to llama-cpp-python bindings when no base URL is setIf you omit --llamacpp-base-url, the CLI falls back to the python bindings and expects --llamacpp-model-path to point to a local GGUF file (plus any --llamacpp-n-* overrides). In that mode, the rag-model flag is ignored and the file name controls which model loads.
For all providers, the CLI logs each prompt/response pair and the parsed candidate ordering to rag_prompts.md in the --out directory for downstream review.
export OPENROUTER_API_KEY=<YOUR_KEY>
uv run python -m thirawat_mapper_beta.infer.bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/openrouter_rag \
--rag-provider openrouter \
--rag-model openrouter/polaris-alphaSet OPENROUTER_API_KEY in your environment; the CLI will refuse to call OpenRouter without it.
```bash
export CLOUDFLARE_ACCOUNT_ID=<ACCOUNT_ID>
export CLOUDFLARE_API_TOKEN=<API_TOKEN>
uv run python -m thirawat_mapper_beta.infer.bulk \
--db data/lancedb/db \
--table concepts_drug \
--input data/input/usagi.csv \
--out runs/cf_rag \
--n-limit 100 \
--rag-provider cloudflare \
--rag-model openai/gpt-oss-20bCloudflare-specific flags:
--cloudflare-base-url https://api.cloudflare.com/client/v4
--cloudflare-use-responses-api / --no-cloudflare-use-responses-api
--gpt-reasoning-effort {low,medium,high}
--cf-reasoning-summary {auto,concise,detailed}Set CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN in your environment before invoking the Cloudflare provider; the CLI reads only from those variables.
- Models under
@cf/openai/*(for example@cf/openai/gpt-oss-120b) use the Workers AI Responses API, so leave--cloudflare-use-responses-apienabled to send the prompt as aninputpayload. - Meta's
@cf/meta/llama-4-*family is served via the/ai/run/<model>endpoint-pass--no-cloudflare-use-responses-apiwhen targeting those models so the CLI emits themessagespayload the endpoint expects.
uv run python -m thirawat_mapper_beta.infer.query \
--db data/lancedb/db \
--table concepts_drug \
--device cpuType a query and press Enter to see the post-scored top results:
query> amoxicillin clavulanate 875 mg
concept_id | score | s_sim | name
--------------------------------------------------------------------------------
123456 | 0.841 | 0.990 | Amoxicillin / Clavulanate 875 MG Oral Tablet
...
Commands:
- Type
:q,:quit, or:exitto leave. - Use
--candidate-topkto change the candidate pool and--show-topkto limit display rows.
- Vector-only retrieval + reranking (no FTS/BM25/hybrid in beta).
- Text is normalized (lowercase + collapsed whitespace) for indexing and inference.
- The reranker model
na399/THIRAWAT-reranker-betais a gated model on Hugging Face. You must request access on the model page (web) and login via the CLI before running. - LanceDB tables must expose a float32 fixed-size vector column (named
vectorwhen built with this CLI). - Index build keeps only standard, valid OMOP concepts (
standard_concept='S' AND invalid_reason IS NULL).