kaushall13/OCR-using-LLM

🖼️ Image Analysis Agent using LLaMA Vision

A terminal-based agent that lets you query an image in natural language. It uses the LLaMA-3.2 Vision Preview model to understand and answer questions about the content of an uploaded image.


💡 Key Features

  • 📷 Accepts any local image file
  • 🧠 Uses LLaMA-3.2 90B Vision model via Groq for multimodal reasoning
  • 🗣️ Interactive CLI interface with feedback loop
  • 🔁 Follow-up questions and clarification supported

🛠️ Tech Stack

  • Python 3.x
  • Agno Framework
  • Groq API (LLaMA-3 Vision)
  • Base64 encoding for image handling
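The base64 handling mentioned above can be sketched as follows. This is a minimal, illustrative helper (the function name is not from the repo): it reads a local image and wraps it in a data URI, the format OpenAI-compatible vision endpoints accept for inline images.

```python
import base64
import mimetypes

def image_to_data_uri(path: str) -> str:
    """Read a local image file and wrap it in a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "image/jpeg"  # fall back if the extension is unknown
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```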

⚙️ Setup & Installation

1. Clone the repo

```bash
git clone https://github.com/yourusername/image-llama-agent.git
cd image-llama-agent
```

2. Create and activate a virtual environment

```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

4. Create a `.env` file with your API key

```bash
# .env.example
GROQ_API_KEY=your_groq_key_here
```
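The `.env` file from step 4 has to be loaded at runtime. A minimal stdlib-only sketch is below; the actual project may use a package such as python-dotenv instead, and the function name here is illustrative.

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: put KEY=value lines into os.environ,
    skipping blanks and comments, without overriding existing values."""
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```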

🚀 Usage

```bash
python image_analysis_llama_agent.py
```
  • Provide the path to your image
  • Ask any visual question about the image
  • Interact with the LLM and refine queries if needed
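Under the hood, each question and the encoded image are sent to the model together. As a rough sketch of what such a request looks like against Groq's OpenAI-compatible chat-completions endpoint (the model id, endpoint URL, and function name here are assumptions, not taken from the repo, which routes calls through the Agno framework):

```python
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint
MODEL_ID = "llama-3.2-90b-vision-preview"  # assumed model id

def build_vision_request(question: str, image_data_uri: str) -> dict:
    """Build a chat-completions payload pairing a text question
    with a base64 data-URI image in a single user message."""
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_data_uri}},
                ],
            }
        ],
    }
```

The feedback loop then amounts to reading a new question from the terminal, rebuilding this payload with the same image, and POSTing it again.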

🧪 Example Queries

  • “What is shown in this image?”
  • “Is this image from a city or a rural area?”
  • “What animals or objects can you detect?”

📝 License

MIT License.


🙌 Credits

  • Groq for LLaMA Vision
  • Agno Agent Framework
