Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
96 changes: 96 additions & 0 deletions TestingREADME.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
# HRA-DO Processor – Testing and Debugging Notes

## 1. Environment Setup and Debugging

If you encounter a **`riot class not found`** error even when using Java 11, ensure that Apache Jena is properly configured in your environment.

Run the following commands in your terminal:

```bash
export JAVA_HOME=$(/usr/libexec/java_home -v 11)
export PATH="$JAVA_HOME/bin:$PATH"
export JENA_HOME="$(pwd)/.venv/opt/apache-jena"
export PATH="$JENA_HOME/bin:$PATH"
hash -r
which riot
```

✅ **Expected output:**
```
.venv/opt/apache-jena/bin/riot
```

If `which riot` points to another location, the `riot` CLI tool might not be using the correct Jena installation. Ensure `.venv/opt/apache-jena/bin` appears first in your `PATH`.

---
## 3. Run Testing Script
The main automation script is located at:
```
hra-do-processor/testing/testing-digital-objects.sh
```
It must be **executed from the root of the repository**, not from inside the `testing` folder.

## 🧭 Paths to Update Before Running

Open the script and review these key variables near the top:

| Variable | Current Default | What to Change |
|-----------|------------------|----------------|
| `source` | `/home/hra-do-processor/.venv/bin/activate` | Update if your virtual environment path is different |
| `DO_PATH` | `/home/hra-do-processor/digital-objects/asct-b/blood-pelvis/v1.4` | Change to the specific Digital Object directory you want to test |
| `COMPARE_SCRIPT` | `compare_and_mismatched_reason.py` | Update if your comparison script is located in another folder. No need to change, this by default is in testing/ folder|

If your project is in a different location, simply replace `/home/hra-do-processor` with your actual path.

---
## 🚀 How to Run

From the **project root**, run:
```bash
bash testing/testing-digital-objects.sh
```

## ⚙️ What the Script Does

This script automates the full **HRA Digital Object (DO)** processing workflow using the `do-processor` CLI and a comparison script.

It performs the following steps:

| Step | Command | Description |
|------|----------|-------------|
| 1️⃣ | `do-processor normalize` | Normalizes the raw CSV data into standardized YAML form |
| 2️⃣ | `do-processor enrich` | Adds ontology links and metadata enrichment |
| 3️⃣ | `do-processor build` | Builds deployable RDF/JSON artifacts |
| 4️⃣ | `do-processor reconstruct` | Reconstructs CSV from normalized data |
| 5️⃣ | `compare_and_mismatched_reason.py` | Compares **raw → normalized → reconstructed** outputs and summarizes mismatches |

Outputs are stored under the Digital Object’s own folders:
```
normalized/
enriched/
reconstructed/
columns_only_in_raw
columns_only_in_recon
value_mismatches_explained.csv
```

---

## 2. Digital Object Testing Summary

| **Digital Object** | **Columns Dropped (only in raw)** | **Key Observations** | **Suggested Changes** |
|--------------------|-----------------------------------|----------------------|------------------------|
| **Allen Brain** | `all_gene_biomarkers` (others empty/dropped) | - Most diffs due to normalization transforms (2,759 records)<br>- `all_gene_biomarkers` dropped during normalize (comma-separated values)<br>- Missing RDFS labels for some LOC IDs | - Add `all_gene_biomarkers` field to schema (array of strings)<br>- Update normalize step to keep and parse this column<br>- Implement ontology label fallback for LOC IDs |
| **Blood – Pelvis** | `all_gene_biomarkers`, `ftu/1`, `ftu/1/id`, `ftu/1/label`, `ref/2`, `ref/2/id`, `ref/2/notes` | - Most diffs from normalization transforms (174)<br>- Gene/protein labels standardized (e.g., “CD19 molecule” → “CD19”)<br>- Metadata order mismatches<br>- Example filtered rows: `bgene/10/label` row 28–29 (“tryptophanyl-tRNA synthetase 1” → “WARS1”) filtered at normalize | - Keep normalization label standardization<br>- Add filter exception handling for `bgene/*/label` when raw not found in `normalized.yaml` |
| **Kidney** | `bprotein/4`, `bprotein/4/id`, `bprotein/4/label`, `bprotein/4/notes`, `ct/1/abbr`, `ct/2/abbr`, `ftu/2/id/notes` | - Diffs from normalization (415) + mapping/format (313)<br>- `bprotein/4*` dropped due to invalid ID format<br>- `ct/*/abbr` dropped across DOs (missing in schema)<br>- `ftu/2/id/notes` empty | 1️⃣ Generate or repair valid IDs for `bprotein/4` items before normalize<br>2️⃣ Add `ct/*/abbr` to schema<br>3️⃣ Allowlist raw-only fields if needed<br>4️⃣ Review 313 mapping/format diffs for consistent URI/CURIE formatting |
| **Large Intestine** | `bprotein/6/id` (HGNC:1678 rows) | - `HGNC:1678` dropped because normalizer found no RDFS label<br>- 7,345 mismatch rows: 4,115 filtered, 2,970 transformed during normalize<br>- Diffs mainly due to normalization (not reconstruction bugs) | - Add fallback for missing RDFS label to retain ID<br>- Review normalization logic for label lookup |
| **Heart** | `combined_gene_markers` | - Combined gene marker values split/dropped by normalizer<br>- Array cells like `GENE1;GENE2` treated as single biomarker or dropped if lookup fails | - Add splitting logic in `setData()` to handle multi-marker cells<br>- Ensure `GENE1;GENE2` → two biomarker entries |
| **Pancreas** | — | - Entries previously appeared shifted due to normalization reordering<br>- Comparator now uses stable key to fix alignment | - Continue using stable key comparator to prevent array misalignment |

---

## 3. Summary

- **Normalization step** is the main source of differences across digital objects.
- **Schema gaps** (like missing `abbr` fields or combined marker arrays) must be addressed for consistency.
- **Ontology label lookups** (e.g., LOC IDs, HGNC symbols) should implement **fallbacks** to prevent data loss during normalization.
14 changes: 13 additions & 1 deletion src/normalization/asct-b-utils/api.functions.js
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,18 @@ function setData(column, columnNumber, row, value, warnings) {
if (objectArray.length === 0 && arrayName) {
row[arrayName] = objectArray;
}
objectArray.push(createObject(value, originalArrayName));
// Split combined biomarker tokens (if this is a biomarker array) into separate objects.
const biomarkerArrays = new Set(['BG','BGENE','BP','BPROTEIN','BL','BLIPID','BM','BMETABOLITES','BF','BPROTEOFORM']);
let tokens = [value];
if (value && typeof value === 'string' && biomarkerArrays.has(originalArrayName.toUpperCase())) {
const escapeForCharClass = (s) => s.replace(/[-\\\]^]/g, '\\$&');
const delimChars = escapeForCharClass(DELIMETER) + ',|';
const separators = new RegExp('[' + delimChars + ']+' );
tokens = value.split(separators).map((s) => s.trim()).filter(Boolean);
}
for (const token of tokens) {
objectArray.push(createObject(token, originalArrayName));
}
} else if (column.length === 3 && arrayName) {
let arrayIndex = parseInt(column[1], 10) - 1;
const fieldName = objectFieldMap[column[2]]; // || (column[2]?.toLowerCase() ?? '').trim();
Expand All @@ -100,6 +111,7 @@ function setData(column, columnNumber, row, value, warnings) {
}
}


const invalidCharacterRegex = /_/gi;
const isLinkRegex = /^http/gi;
const codepointUppercaseA = 65;
Expand Down
226 changes: 226 additions & 0 deletions testing/compare_and_mismatched_reason.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,226 @@
# Robust RAW vs RECONSTRUCTED mismatch explainer for HRA DOs.
# - Uses line-based CSV parsing to avoid pandas delimiter issues.
# - Assumes header is on line --header (1-based, default 11).
# - Produces:
# columns_only_in_raw.csv
# columns_only_in_reconstructed.csv
# value_mismatches.csv
# value_mismatches_explained.csv (adds "reason" + "evidence")
#
# Example:
# python3 compare_and_mismatched_reason.py \
# --raw "/.../asct-b/kidney/v1.6/raw/asct-b-vh-kidney.csv" \
# --recon "/.../asct-b/kidney/v1.6/reconstructed/reconstructed.csv" \
# --normalized "/.../asct-b/kidney/v1.6/normalized/normalized.yaml" \
# --warnings "/.../asct-b/kidney/v1.6/normalized/warnings.yaml" \
# --header 11

from __future__ import annotations
from pathlib import Path
import argparse
import csv
import sys
import pandas as pd

# -------- Config toggles --------
NORMALIZE_TEXT = True # collapse whitespace in cell values
# --------------------------------

def canon_text(x):
if isinstance(x, str):
x = x.strip()
if NORMALIZE_TEXT:
x = " ".join(x.split())
return x

def read_all_lines(path: Path, encoding: str) -> list[str]:
with path.open("r", encoding=encoding, errors="replace", newline="") as f:
return f.read().splitlines()

def choose_delimiter(header_line: str) -> str:
# Pick the delimiter that yields the most fields for the header row
candidates = [",", "\t", ";", "|"]
best = ","
best_n = -1
for d in candidates:
row = next(csv.reader([header_line], delimiter=d, quotechar='"', escapechar="\\"))
if len(row) > best_n:
best_n = len(row)
best = d
return best

def parse_csv_lines(lines: list[str], delimiter: str) -> list[list[str]]:
reader = csv.reader(lines, delimiter=delimiter, quotechar='"', escapechar="\\")
return [row for row in reader]

def pad_to_width(rows: list[list[str]]) -> list[list[str]]:
width = max((len(r) for r in rows), default=0)
return [r + [""] * (width - len(r)) for r in rows]

def norm_col_name(s: str, idx: int) -> str:
name = "_".join(s.strip().split()).lower()
return name if name else f"col_{idx+1}"

def build_df_from_file(path: Path, header_row_1based: int, encoding: str):
lines = read_all_lines(path, encoding)
if len(lines) < header_row_1based:
raise ValueError(f"{path} has only {len(lines)} lines; cannot use line {header_row_1based} as header.")

header_line = lines[header_row_1based - 1]
delim = choose_delimiter(header_line)

rows = parse_csv_lines(lines, delim)
rows = pad_to_width(rows)

header_idx0 = header_row_1based - 1
header_row = rows[header_idx0]

# Build unique, normalized column names
cols, used = [], set()
for i, raw_name in enumerate(header_row):
name = norm_col_name(raw_name, i)
base, k = name, 2
while name in used:
name = f"{base}_{k}"; k += 1
used.add(name)
cols.append(name)

data_rows = rows[header_idx0 + 1:]
df = pd.DataFrame(data_rows, columns=cols)
df = df.map(canon_text)

info = {
"path": str(path),
"n_lines": len(lines),
"delimiter": repr(delim),
"n_rows": df.shape[0],
"n_cols": df.shape[1],
"n_header_cols": len(cols),
}
return df, info

def load_text(path: Path) -> str:
try:
return path.read_text(encoding="utf-8", errors="replace")
except Exception:
return ""

def contains_text(hay: str, needle: str) -> bool:
if not needle:
return False
# normalize both for fair contains check
H = canon_text(hay)
N = canon_text(needle)
return N in H if (H is not None and N is not None) else False

def explain_reason(raw_val: str, recon_val: str, normalized_text: str, warnings_text: str) -> tuple[str, str]:
"""
Heuristic reason assignment:
- Dropped during normalize (warning): warnings.yaml mentions raw_val
- Filtered at normalize: raw_val absent in normalized.yaml
- Transformed during normalize: recon_val present, raw_val absent in normalized.yaml
- Reconstruction mapping/formatting difference: raw_val present in normalized.yaml but differs in recon
- Indeterminate: fallback
"""
if contains_text(warnings_text, raw_val):
return "Dropped during normalize (warning)", "warnings.yaml contains RAW value"

in_norm_raw = contains_text(normalized_text, raw_val)
in_norm_recon = contains_text(normalized_text, recon_val)

if not in_norm_raw:
if in_norm_recon and raw_val != recon_val:
return "Transformed during normalize", "normalized.yaml contains RECON value, not RAW"
return "Filtered at normalize", "RAW value not found in normalized.yaml"

# RAW present in normalized; RECON differs
if raw_val != recon_val:
if in_norm_recon:
return "Reconstruction mapping/formatting difference", "Both RAW and RECON appear in normalized.yaml"
return "Reconstruction mapping/formatting difference", "RAW present in normalized.yaml but RECON differs"

return "Indeterminate", ""

def parse_args():
ap = argparse.ArgumentParser(description="Explain RAW vs RECON mismatches using normalized + warnings.")
ap.add_argument("--raw", required=False, default=None, help="Path to RAW CSV")
ap.add_argument("--recon", required=False, default=None, help="Path to reconstructed CSV")
ap.add_argument("--normalized", required=False, default=None, help="Path to normalized.yaml")
ap.add_argument("--warnings", required=False, default=None, help="Path to warnings.yaml")
ap.add_argument("--header", type=int, default=11, help="Header line number (1-based). Default: 11")
ap.add_argument("--encoding", default="utf-8", help="Text encoding. Default: utf-8")
return ap.parse_args()

def main():
args = parse_args()

# If paths not provided, try kidney v1.6 defaults (easy to override with flags)
raw_p = Path(args.raw) if args.raw else Path("/Users/aishwarya/CNS-Code/hra-do-processor/digital-objects/asct-b/lung/v1.5/raw/asct-b-vh-lung.csv")
recon_p = Path(args.recon) if args.recon else Path("/Users/aishwarya/CNS-Code/hra-do-processor/digital-objects/asct-b/lung/v1.5/reconstructed/reconstructed.csv")
normalized_p = Path(args.normalized) if args.normalized else Path("/Users/aishwarya/CNS-Code/hra-do-processor/digital-objects/asct-b/lung/v1.5/normalized/normalized.yaml")
warnings_p = Path(args.warnings) if args.warnings else Path("/Users/aishwarya/CNS-Code/hra-do-processor/digital-objects/asct-b/lung/v1.5/normalized/warnings.yaml")

# Build dataframes using the robust reader
raw_df, raw_info = build_df_from_file(raw_p, args.header, args.encoding)
recon_df, recon_info = build_df_from_file(recon_p, args.header, args.encoding)

print("RAW info :", raw_info)
print("RECON info:", recon_info)

# Column presence
raw_cols = set(raw_df.columns)
recon_cols = set(recon_df.columns)
only_in_raw = sorted(raw_cols - recon_cols)
only_in_recon = sorted(recon_cols - raw_cols)
in_both = sorted(raw_cols & recon_cols)

out_dir = recon_p.parent
out_dir.mkdir(parents=True, exist_ok=True)

pd.DataFrame({"column_only_in_raw": only_in_raw}).to_csv(out_dir / "columns_only_in_raw.csv", index=False)
pd.DataFrame({"column_only_in_reconstructed": only_in_recon}).to_csv(out_dir / "columns_only_in_reconstructed.csv", index=False)

# Compare values in shared columns (row-wise up to min rows)
n = min(len(raw_df), len(recon_df))
raw_c = raw_df.iloc[:n].reset_index(drop=True)
recon_c = recon_df.iloc[:n].reset_index(drop=True)

# Save raw mismatches (no reasons) for reference
mismatches_plain = []
for col in in_both:
diffs = raw_c[col] != recon_c[col]
if diffs.any():
for i in diffs[diffs].index.tolist():
mismatches_plain.append({
"column": col,
"row_number_in_file": args.header + 1 + i,
"raw_value": raw_c.at[i, col],
"reconstructed_value": recon_c.at[i, col],
})
pd.DataFrame(mismatches_plain).to_csv(out_dir / "value_mismatches.csv", index=False)

# Load normalized & warnings text to derive reasons
normalized_text = load_text(normalized_p)
warnings_text = load_text(warnings_p)

explained = []
for row in mismatches_plain:
reason, evidence = explain_reason(row["raw_value"], row["reconstructed_value"], normalized_text, warnings_text)
row2 = dict(row)
row2["reason"] = reason
row2["evidence"] = evidence
explained.append(row2)

pd.DataFrame(explained).to_csv(out_dir / "value_mismatches_explained.csv", index=False)

print("Saved:")
print(f" - {out_dir/'columns_only_in_raw.csv'}")
print(f" - {out_dir/'columns_only_in_reconstructed.csv'}")
print(f" - {out_dir/'value_mismatches.csv'}")
print(f" - {out_dir/'value_mismatches_explained.csv'}")

if __name__ == "__main__":
try:
main()
except BrokenPipeError:
sys.exit(0)
Loading