feat: Add Image Input Support and Proxy Configuration #1
simplaj wants to merge 7 commits into llmsresearch:main from
Conversation
|
Thanks @simplaj for the PR. Did you test it? I think I made model selection configurable for command-line executions. Just want to confirm this before I accept and merge the PR. |
Thanks for the review! I accidentally committed my local config changes. I have just pushed a commit to revert |
|
Thanks for the update. This looks useful overall. I have two small requests before the merge:
With those tweaks, I am happy to approve. |
|
Thanks for the feedback! I've pushed fixes:
1. VLM default model alignment
Reverted
- def __init__(self, api_key: Optional[str] = None, model: str = "gemini-3-pro-preview"):
+ def __init__(self, api_key: Optional[str] = None, model: str = "gemini-2.0-flash"):
2. CLI image path validation
Added upfront validation for
    # Validate image paths if provided
    if image:
        for img_path in image:
            if not Path(img_path).exists():
                console.print(f"[red]Error: Image file not found: {img_path}[/red]")
                raise typer.Exit(1)
3. Documentation update
Updated README to clarify how to pass multiple images:
Let me know if there's anything else! |
|
Thank you for the update, and apologies for the follow-up. I have one concern. The SDK documentation mentions that Could you update the proxy support to avoid using The documentation also recommends proxying via Reference |
Sorry for the slow response! Catching up on some deadlines at the moment. I'll get back to this PR a bit later. Thanks for your great work again! |
|
Hey @simplaj, no problem. Take care! Looking forward to scaling this solution. Thanks for accepting the invitation. I will link this to an issue. We need to add more providers and perhaps add an eval to compare quality across models. |
Thanks for understanding! Scaling and adding evals sounds like a great plan. Maybe we can dive into the details after I'm past the deadline. |
Summary
This PR introduces support for user-provided input images (e.g., sketches) to guide the diagram generation process, adds configuration support for local proxies (essential for regions without direct API access), and fixes a critical Unicode crash on Windows.
Key Changes
CLI: Added --image / -img option to paperbanana generate.
Pipeline: Modified PlannerAgent to accept input images and include them in the multimodal prompt context. The Planner now considers both the text methodology and the user-provided sketch/chart when designing the diagram.
Example Usage:
paperbanana generate --input method.txt --caption "Overview" --image sketch.png
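As a rough illustration of how text and images could be combined for the Planner, the sketch below assembles a provider-agnostic list of prompt parts. The helper name `build_planner_parts` and the dictionary layout are hypothetical, chosen only for this example; they are not the actual PlannerAgent code or any SDK's request format.

```python
import mimetypes
from pathlib import Path
from typing import Iterable, Optional


def build_planner_parts(
    methodology: str,
    image_paths: Optional[Iterable[str]] = None,
) -> list[dict]:
    """Hypothetical helper: combine methodology text with optional sketch images."""
    # The text methodology always comes first in the prompt.
    parts: list[dict] = [{"type": "text", "text": methodology}]
    for raw in image_paths or []:
        path = Path(raw)
        # Fail early with a clear message if the file is missing.
        if not path.is_file():
            raise FileNotFoundError(f"Image file not found: {path}")
        # Guess the MIME type from the file extension (e.g. .png -> image/png).
        mime, _ = mimetypes.guess_type(path.name)
        if mime is None or not mime.startswith("image/"):
            raise ValueError(f"Not a recognized image type: {path}")
        parts.append({"type": "image", "mime_type": mime, "data": path.read_bytes()})
    return parts
```

Keeping the parts in a neutral structure like this would leave the conversion to a specific provider's message format to each provider class.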
Configuration: Added support for GEMINI_BASE_URL environment variable.
Implementation: Updated GeminiVLM and GoogleImagenGen providers to respect the custom base URL to support local proxies (e.g., Antigravity Tools, http://127.0.0.1:8045).
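A minimal sketch of reading the override, assuming the providers only need an optional, validated base URL. `resolve_gemini_base_url` is a hypothetical helper name; how the value is then handed to the SDK client is deliberately left out, since that depends on the SDK's own options.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlparse


def resolve_gemini_base_url(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Hypothetical helper: return the custom base URL, or None to use the default."""
    env = os.environ if env is None else env
    raw = env.get("GEMINI_BASE_URL", "").strip()
    if not raw:
        # Variable unset or empty: fall back to the provider's default endpoint.
        return None
    parsed = urlparse(raw)
    # Reject values such as "127.0.0.1:8045" that lack an http(s) scheme.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"GEMINI_BASE_URL is not a valid http(s) URL: {raw!r}")
    # Normalize away a trailing slash so later path joins stay predictable.
    return raw.rstrip("/")
```

With `GEMINI_BASE_URL=http://127.0.0.1:8045` set, the helper returns that URL unchanged; with the variable unset it returns `None`.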
Windows Compat: Fixed a UnicodeDecodeError in ReferenceStore by explicitly forcing encoding="utf-8" when opening JSON files. This prevents crashes on Windows systems where the default encoding might be GBK.
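The encoding fix can be illustrated with a small sketch. `load_json_utf8` is a hypothetical helper name, not the actual ReferenceStore method; the point is only that passing `encoding="utf-8"` to `open` stops Python from falling back to the platform's locale encoding.

```python
import json
from pathlib import Path
from typing import Any


def load_json_utf8(path: Path) -> Any:
    """Hypothetical helper: read a JSON file as UTF-8 regardless of system locale."""
    # Without encoding=..., open() uses the locale's preferred encoding,
    # which may be GBK (or another code page) on some Windows systems.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```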