Hi — thanks for releasing KARMA!
I'm reproducing the experiments described in the paper and I'm looking for the implementation of the following evaluation metrics mentioned in the manuscript:
- Coverage gain (ΔCov)
- Connectivity gain (ΔCon)
- Conflict rate (RCR)
- LLM-based correctness (RLC, using a hold-out verifier model)
- QA consistency (CQA)
What I've checked so far:
- relationship_extraction sets confidence/clarity/relevance on KnowledgeTriple (sources for MCon/MCla/MRel).
- evaluator implements integration score = 0.5*confidence + 0.25*clarity + 0.25*relevance and threshold filtering.
- conflict_resolution implements contradiction detection and a simple keep-higher-confidence resolution, but resolve_conflicts currently returns placeholder values (final_triples, 0, 0, 0.0) — I couldn't find conflict counts or a computed conflict rate anywhere.
- KnowledgeGraph.get_statistics() returns entity_count, triple_count, unique_relations, avg_confidence, but there's no incremental ΔCov/ΔCon computation (no before/after KG snapshot comparison or network connectivity metrics).
- I couldn't find an implementation of a hold-out LLM verifier (RLC) or a KG-based QA consistency (CQA) evaluation in the repo.
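For concreteness, here is the kind of before/after snapshot comparison I expected to find for ΔCov/ΔCon. The definitions below (distinct-entity gain for coverage, average node degree for connectivity) are my reading of the paper, not code from this repo — please correct me if the intended definitions differ:

```python
# Hypothetical sketch of ΔCov/ΔCon via before/after KG snapshots.
# Triples are modeled as (head, relation, tail) tuples; the coverage and
# connectivity definitions are my assumptions, not the repo's.

from collections import defaultdict

def snapshot(triples):
    """Summarize a KG as (entity set, triple count, node-degree map)."""
    entities, degree = set(), defaultdict(int)
    for head, _rel, tail in triples:
        entities.update((head, tail))
        degree[head] += 1
        degree[tail] += 1
    return entities, len(triples), degree

def delta_cov(before, after):
    """ΔCov: relative gain in distinct entities covered by the KG."""
    ents_before, _, _ = snapshot(before)
    ents_after, _, _ = snapshot(after)
    return (len(ents_after) - len(ents_before)) / max(len(ents_before), 1)

def delta_con(before, after):
    """ΔCon: gain in average node degree (one possible connectivity proxy)."""
    def avg_degree(triples):
        ents, _, degree = snapshot(triples)
        return sum(degree.values()) / max(len(ents), 1)
    return avg_degree(after) - avg_degree(before)
```

If the intended ΔCon is graph-theoretic (e.g., largest-component size), that would just swap out avg_degree — the snapshot-before/snapshot-after shape stays the same.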
Files I inspected:
- karma/agents/conflict_resolution/agent.py
- karma/agents/relationship_extraction/agent.py
- karma/agents/evaluator/agent.py
- karma/core/data_structures.py
- karma/core/pipeline.py
- main.py
- examples/basic_usage.py
Could you please clarify:
- Are ΔCov / ΔCon / RCR / RLC / CQA implemented somewhere in this repository? If so, could you point to the exact files/functions or provide an example of how to run them?
- If not included, is there a separate evaluation scripts repository or planned location for these metrics? Alternatively, could you advise on where in the pipeline these metrics are expected to be computed (e.g., KG snapshots before/after integration for ΔCov/ΔCon; conflict counts in ConflictResolutionAgent for RCR; a VerifierAgent or Evaluator extension for RLC; a QA evaluation script for CQA)?
- I'm happy to contribute patches (e.g., recording conflict stats in resolve_conflicts and exposing them via agent.get_metrics or IntermediateOutput.metrics). Are contributions welcome for these additions, and is there a preferred output schema?
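To make the proposal concrete, here is a rough sketch of what I have in mind for resolve_conflicts. All names here (ConflictStats, the function signature, the stand-in Triple) are suggestions only, not existing API — Triple stands in for the repo's KnowledgeTriple, and `contradicts(a, b)` is assumed to wrap the existing contradiction check:

```python
# Hypothetical patch sketch: keep-higher-confidence resolution that records
# conflict counts, so RCR can be computed instead of the current placeholders.

from dataclasses import dataclass

@dataclass
class Triple:
    """Minimal stand-in for KnowledgeTriple (illustration only)."""
    head: str
    rel: str
    tail: str
    confidence: float

@dataclass
class ConflictStats:
    """Proposed container for RCR inputs, exposable via agent.get_metrics()."""
    conflicts_detected: int = 0
    conflicts_resolved: int = 0
    total_triples: int = 0

    @property
    def conflict_rate(self) -> float:
        # RCR: fraction of candidate triples that clashed with the existing KG
        return self.conflicts_detected / self.total_triples if self.total_triples else 0.0

def resolve_conflicts(candidates, existing, contradicts):
    """Keep-higher-confidence resolution, recording stats as it goes."""
    stats = ConflictStats(total_triples=len(candidates))
    final = []
    for cand in candidates:
        clash = next((e for e in existing if contradicts(cand, e)), None)
        if clash is None:
            final.append(cand)
            continue
        stats.conflicts_detected += 1
        if cand.confidence > clash.confidence:
            final.append(cand)  # new triple wins; caller drops the old one
            stats.conflicts_resolved += 1
    return final, stats
```

Happy to adapt this to whatever schema you'd prefer (e.g., a plain dict on IntermediateOutput.metrics instead of a dataclass).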
Thanks in advance — I can provide my local search logs if helpful.