This repository provides a CLI-first Text2Box inference, evaluation, and debug-visualization workflow for BOP-style datasets.
The current pipeline does the following:
- Loads metadata from parquet files.
- Reads RGB frames from shard tar files.
- Calls a VLM provider (`openai`, `gemini`, or `ollama`).
- Parses model JSON outputs.
- Writes predictions incrementally (JSONL manifest + parquet checkpoints).
- Solves 3D pose in `baseline-2d3d` mode.
- Optionally writes per-image debug JSON + PNG with `--debug`.
- Computes protocol metrics via the evaluation module.
Metric definitions and interpretation guide: see `metrics.md`.
Below is an example per-image debug visualization generated with --debug.
```bash
pip install -r requirements.txt
```
Create a `.env` file as needed.
Commonly used:
- `OPENAI_API_KEY` (required for `--provider openai`)
- `OPENAI_MODEL` (default: `gpt-4.1-mini`)
- `OPENAI_BASE_URL` (optional OpenAI-compatible endpoint)
- `GEMINI_API_KEY` (required for `--provider gemini`)
- `GEMINI_MODEL` (default: `gemini-robotics-er-1.6-preview`)
- `OLLAMA_BASE_URL` (default: `http://localhost:11434/v1`)
- `OLLAMA_MODEL` (default: `gemma4:latest`)
- `TEMPERATURE` (default: `0.0`)
- `MAX_OUTPUT_TOKENS` (default: `1200`)
Also supported in config:
- `REQUEST_TIMEOUT_S` (default: `60`)
- `MAX_RETRIES` (default: `3`)
- `RETRY_MIN_S` (default: `1`)
- `RETRY_MAX_S` (default: `8`)
- `NVIDIA_BASE_URL` (currently not used by built-in providers)
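For reference, a minimal `.env` for the `ollama` provider might look like the sketch below. All values are simply the documented defaults listed above; adjust them to your setup:

```
# Example .env for the ollama provider (all values are the documented defaults)
OLLAMA_BASE_URL=http://localhost:11434/v1
OLLAMA_MODEL=gemma4:latest
TEMPERATURE=0.0
MAX_OUTPUT_TOKENS=1200
REQUEST_TIMEOUT_S=60
MAX_RETRIES=3
```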
Use `--data-root` to point to a prepared dataset folder, for example:
- `Datasets/ycbv`
- `Datasets/TLess`
- `Datasets/handal`
Expected files:
- `queries_<split>.parquet`
- `gts_<split>.parquet`
- `images_info_<split>.parquet`
- `objects_info.parquet`
- `images_<split>/shard-*.tar`
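To sanity-check a prepared dataset before running inference, you can peek at the metadata with pandas. This is an illustrative one-off script, not part of the pipeline; it assumes `pandas` and a parquet engine such as `pyarrow` are installed:

```python
# Quick look at a prepared BOP-style dataset layout (illustrative only).
from pathlib import Path

import pandas as pd

root = Path("Datasets/ycbv")
split = "test"

queries = pd.read_parquet(root / f"queries_{split}.parquet")
objects = pd.read_parquet(root / "objects_info.parquet")
print(f"{len(queries)} queries, {len(objects)} objects")
print(queries.head())

# RGB frames are stored in shard tar files under images_<split>/.
print(sorted((root / f"images_{split}").glob("shard-*.tar")))
```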
Main entrypoint:
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider ollama \
  --debug
```
Equivalent wrapper script:
```bash
PYTHONPATH=src .venv/bin/python run_inference.py \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider ollama \
  --debug
```
OpenAI provider example:
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider openai \
  --debug
```
Gemini provider example:
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider gemini \
  --debug
```
Quick sanity-run options:
- `--limit N`: limit by number of queries.
- `--limit-images N`: limit by number of unique images.
Example:
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/handal \
  --split test \
  --mode baseline-2d3d \
  --provider ollama \
  --limit 100 \
  --debug
```
The pipeline uses one unified prompt template. The current prompt contract asks for the fields below; an example response follows the list.
- `bbox_2d_norm_1000`
- `box_3d = [x_center_mm, y_center_mm, z_center_mm, x_size_mm, y_size_mm, z_size_mm, roll_deg, pitch_deg, yaw_deg]`
- optional `confidence`
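For illustration, a response that satisfies this contract could look like the following. All values are made up, and the 4-element `[x1, y1, x2, y2]` layout for `bbox_2d_norm_1000` (coordinates on a 0–1000 scale) is an assumption here, not taken from the prompt template itself:

```json
{
  "bbox_2d_norm_1000": [412, 318, 655, 590],
  "box_3d": [12.5, -48.0, 910.0, 102.0, 64.0, 180.0, 0.0, 15.0, -90.0],
  "confidence": 0.83
}
```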
The parser still accepts legacy fields for backward compatibility, including 3D-corner fields when they appear in older model outputs.
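A simplified sketch of this kind of tolerant parsing is shown below. It is not the repository's actual parser, and the legacy alias `bbox_2d` is hypothetical; only the contract fields above come from the docs:

```python
# Tolerant parsing of a model reply (simplified, illustrative only).
import json
from typing import Any, Optional


def parse_prediction(raw: str) -> Optional[dict[str, Any]]:
    """Parse a model reply, falling back to a (hypothetical) legacy field name."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # let the caller decide how to handle malformed output

    # Prefer the current contract field; fall back to a legacy alias.
    bbox = data.get("bbox_2d_norm_1000") or data.get("bbox_2d")
    box_3d = data.get("box_3d")
    if bbox is None and box_3d is None:
        return None
    return {
        "bbox_2d_norm_1000": bbox,
        "box_3d": box_3d,
        "confidence": data.get("confidence"),  # optional per the contract
    }
```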
By default, outputs are written under:
- `outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_<provider>_<split>_manifest.jsonl`
- `outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_<provider>_<split>.parquet`
- `outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_<provider>_<split>_manifest.summary.json`
- `outputs/<dataset>/<model>/<timestamp__config>/debug/<image_id>.json` (with `--debug`)
- `outputs/<dataset>/<model>/<timestamp__config>/debug/<image_id>_report.pdf` (with `--debug`)
Notes:
- The manifest is appended to query-by-query.
- The parquet file is checkpointed per image and finalized at the end of the run.
- Dataset and model folder names are slugified to lowercase.
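Because the manifest is plain JSONL appended during the run, partial results can be inspected before inference finishes. A minimal sketch (the run path below is hypothetical, and the record schema is whatever the pipeline writes):

```python
# Inspect an in-progress predictions manifest (illustrative only).
import json
from pathlib import Path

# Hypothetical run directory; substitute your own
# outputs/<dataset>/<model>/<timestamp__config>.
manifest = Path(
    "outputs/ycbv/gemma4-latest/20250101-000000__baseline/predictions/"
    "preds_ollama_test_manifest.jsonl"
)

records = [json.loads(line) for line in manifest.read_text().splitlines() if line.strip()]
print(f"{len(records)} predictions so far")
if records:
    print(sorted(records[0].keys()))  # field names are defined by the pipeline
```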
Run evaluation from predictions + GT:
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer.evaluation \
  --data-root Datasets/ycbv \
  --split test
```
Default behavior:
- Auto-discovers a recent `*_manifest.jsonl` under `outputs/`.
- Writes to `outputs/metrics/final_metrics.json` when `--output-json` is not provided.
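The discovery step behaves roughly like "pick the newest matching manifest". The snippet below approximates that behavior for orientation only; it is not the actual implementation:

```python
# Approximate the "recent manifest" auto-discovery (illustrative only).
from pathlib import Path

candidates = Path("outputs").rglob("*_manifest.jsonl")
latest = max(candidates, key=lambda p: p.stat().st_mtime, default=None)
print(latest or "no manifest found under outputs/")
```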
Specific manifest example:
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer.evaluation \
  --manifest-jsonl outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_ollama_test_manifest.jsonl \
  --data-root Datasets/ycbv \
  --split test
```
You can regenerate reports after inference in two ways.
Replay from existing debug JSON:
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer.visualization \
  --debug-json-dir outputs/<dataset>/<model>/<timestamp__config>/debug \
  --run-dir outputs/<dataset>/<model>/<timestamp__config> \
  --data-root Datasets/ycbv \
  --split test \
  --model-name auto
```
Manifest-enriched rendering (recomputes metrics and writes fresh reports):
```bash
PYTHONPATH=src .venv/bin/python -m text2box_infer.visualization \
  --manifest-jsonl outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_ollama_test_manifest.jsonl \
  --run-dir outputs/<dataset>/<model>/<timestamp__config> \
  --data-root Datasets/ycbv \
  --split test \
  --model-name auto
```
An equivalent CLI route is also available via `-m text2box_infer` with `--mode visualize`.
- Run inference first, or pass `--manifest-jsonl` explicitly.
- Check that Ollama is running and `OLLAMA_BASE_URL` is correct.
- Use `--limit` and/or `--limit-images` for quick validation before full runs.
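A quick way to confirm the Ollama endpoint is reachable before a full run, using only the standard library (this assumes the OpenAI-compatible `/models` route that Ollama exposes under its `/v1` prefix):

```python
# Probe the configured Ollama endpoint (illustrative only).
import os
import urllib.request

base = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434/v1")
try:
    with urllib.request.urlopen(f"{base}/models", timeout=5) as resp:
        print(f"OK: HTTP {resp.status} from {base}/models")
except OSError as exc:  # URLError / HTTPError both subclass OSError
    print(f"Endpoint not reachable: {exc}")
```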
