Skip to content

VarunBurde/text2box_baseline

Repository files navigation

text2box inference baseline

This repository provides a CLI-first Text2Box inference, evaluation, and debug-visualization workflow for BOP-style datasets.

Overview

The current pipeline does the following:

  • Loads metadata from parquet files.
  • Reads RGB frames from shard tar files.
  • Calls a VLM provider (openai or ollama).
  • Parses model JSON outputs.
  • Writes predictions incrementally (JSONL manifest + parquet checkpoints).
  • Solves 3D pose in baseline-2d3d mode.
  • Optionally writes per-image debug JSON + PNG with --debug.
  • Computes protocol metrics via the evaluation module.

Metric definitions and interpretation guide: see metrics.md.

Example Debug Report

Below is an example per-image debug visualization generated with --debug.

Handal debug report example

Installation

pip install -r requirements.txt

Environment Configuration

Create a .env file as needed.

Commonly used:

  • OPENAI_API_KEY (required for --provider openai)
  • OPENAI_MODEL (default: gpt-4.1-mini)
  • OPENAI_BASE_URL (optional OpenAI-compatible endpoint)
  • GEMINI_API_KEY (required for --provider gemini)
  • GEMINI_MODEL (default: gemini-robotics-er-1.6-preview)
  • OLLAMA_BASE_URL (default: http://localhost:11434/v1)
  • OLLAMA_MODEL (default: gemma4:latest)
  • TEMPERATURE (default: 0.0)
  • MAX_OUTPUT_TOKENS (default: 1200)

Also supported in config:

  • REQUEST_TIMEOUT_S (default: 60)
  • MAX_RETRIES (default: 3)
  • RETRY_MIN_S (default: 1)
  • RETRY_MAX_S (default: 8)
  • NVIDIA_BASE_URL (currently not used by built-in providers)

Dataset Input

Use --data-root to point to a prepared dataset folder, for example:

  • Datasets/ycbv
  • Datasets/TLess
  • Datasets/handal

Expected files:

  • queries_<split>.parquet
  • gts_<split>.parquet
  • images_info_<split>.parquet
  • objects_info.parquet
  • images_<split>/shard-*.tar

1) Run Inference

Main entrypoint:

PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider ollama \
  --debug

Equivalent wrapper script:

PYTHONPATH=src .venv/bin/python run_inference.py \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider ollama \
  --debug

OpenAI provider example:

PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider openai \
  --debug

Gemini provider example:

PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/ycbv \
  --split test \
  --mode baseline-2d3d \
  --provider gemini \
  --debug

Quick sanity-run options:

  • --limit N: limit by number of queries.
  • --limit-images N: limit by number of unique images.

Example:

PYTHONPATH=src .venv/bin/python -m text2box_infer \
  --data-root Datasets/handal \
  --split test \
  --mode baseline-2d3d \
  --provider ollama \
  --limit 100 \
  --debug

Prompt Contract (Current)

The pipeline uses one unified prompt template. Current prompt contract asks for:

  • bbox_2d_norm_1000
  • box_3d = [x_center_mm, y_center_mm, z_center_mm, x_size_mm, y_size_mm, z_size_mm, roll_deg, pitch_deg, yaw_deg]
  • optional confidence

The parser still accepts legacy fields (for backward compatibility), including 3D corners fields when provided by older model outputs.

Inference Outputs

By default outputs are written under:

  • outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_<provider>_<split>_manifest.jsonl
  • outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_<provider>_<split>.parquet
  • outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_<provider>_<split>_manifest.summary.json
  • outputs/<dataset>/<model>/<timestamp__config>/debug/<image_id>.json (with --debug)
  • outputs/<dataset>/<model>/<timestamp__config>/debug/<image_id>_report.pdf (with --debug)

Notes:

  • Manifest is appended query-by-query.
  • Parquet is checkpointed per image and finalized at end.
  • Dataset/model folder names are slugified in lowercase.

2) Evaluate Metrics

Run evaluation from predictions + GT:

PYTHONPATH=src .venv/bin/python -m text2box_infer.evaluation \
  --data-root Datasets/ycbv \
  --split test

Default behavior:

  • Auto-discovers a recent *_manifest.jsonl under outputs/.
  • Writes to outputs/metrics/final_metrics.json when --output-json is not provided.

Specific manifest example:

PYTHONPATH=src .venv/bin/python -m text2box_infer.evaluation \
  --manifest-jsonl outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_ollama_test_manifest.jsonl \
  --data-root Datasets/ycbv \
  --split test

3) Post-hoc Visualization

You can regenerate reports after inference in two ways.

Replay from existing debug JSON:

PYTHONPATH=src .venv/bin/python -m text2box_infer.visualization \
  --debug-json-dir outputs/<dataset>/<model>/<timestamp__config>/debug \
  --run-dir outputs/<dataset>/<model>/<timestamp__config> \
  --data-root Datasets/ycbv \
  --split test \
  --model-name auto

Manifest-enriched rendering (recomputes metrics + writes fresh reports):

PYTHONPATH=src .venv/bin/python -m text2box_infer.visualization \
  --manifest-jsonl outputs/<dataset>/<model>/<timestamp__config>/predictions/preds_ollama_test_manifest.jsonl \
  --run-dir outputs/<dataset>/<model>/<timestamp__config> \
  --data-root Datasets/ycbv \
  --split test \
  --model-name auto

Equivalent CLI route using --mode visualize is also available via -m text2box_infer.

Troubleshooting

"No manifest found"

Run inference first, or pass --manifest-jsonl explicitly.

Ollama connection errors

Check that Ollama is running and OLLAMA_BASE_URL is correct.

Full split is slow

Use --limit and/or --limit-images for quick validation before full runs.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors