JINI is a voice-enabled AI desktop assistant that integrates real-time system automation with conversational AI.
It supports both command-based voice control and text-based AI interaction through a structured, safety-first architecture.
Unlike typical chatbot projects, JINI separates system-level command execution from AI-driven conversation, ensuring reliable automation without unintended actions.
The assistant maintains persistent conversational memory across sessions and supports multimodal interaction via a web-based interface.
- 🎤 Voice-controlled desktop commands – Open apps, launch URLs, send messages, make calls
- 💬 Text-based conversational AI – Handles queries, knowledge retrieval, and casual conversation
- 🗣️ Text-to-speech responses – Unified voice output for both AI and system responses
- 🧠 AI-powered responses – Gemini API primary, HugChat fallback for resilience
- 🖥️ System automation – Launch apps, play media, WhatsApp communication
- 📁 Local SQLite database – Stores contacts and system commands safely
- 🌐 Web-based frontend – Supports text input (index.html) and voice input (mic.html)
- 🔐 Environment-based API key handling – No sensitive data committed
JINI maintains unified conversational memory across both text and voice interactions:
- Shared storage: All messages, from both index.html and mic.html, are persisted in the same memory store
- Chronological history: Messages are timestamped and displayed in order
- Reload continuity: Conversation persists across page reloads and app restarts
- Explicit clearing: Users can clear memory manually
- Safety-first separation: Voice commands trigger handlers, but text inputs never execute system-level actions
This ensures JINI behaves as one consistent assistant, not two separate interfaces.
- Voice Commands (Command Mode):
- Classified first using intent.py
- Routed to appropriate handler:
- SYSTEM: system_handler.py → Opens apps, launches URLs
- COMMUNICATION: communication_handler.py → WhatsApp messages, calls, video calls
- MEDIA: media_handler.py → YouTube/media playback
- Text Inputs (Chatbot Mode):
- Handled only by feature.py
- Uses Gemini API → HugChat fallback
- No system command execution to ensure security
- Design Decision: Voice commands and AI queries are intentionally separated to prevent accidental system actions and maintain reliability.
JINI uses two independent input pipelines, intentionally designed to handle different interaction types.
- Text Input Pipeline (Chatbot Mode) Handles text-based interaction exclusively through AI. This pipeline does not trigger system commands, ensuring safe responses. Purpose:
- General conversation
- Knowledge-based queries
- AI assistance Flow: Text Input → feature.py (Gemini AI + HugChat fallback) → Voice & Text Output Text input does not trigger system commands to avoid ambiguity and unintended execution.
- Voice Input Pipeline (Command Mode) Voice input is command-oriented and therefore classified before execution. Purpose:
- Open applications
- Play media (YouTube, songs)
- Call or message stored contacts
- Perform system-level actions Flow:
Microphone Input
↓
Speech-to-Text (input/speech.py)
↓
Intent Classification (intent.py)
├─ SYSTEM → system_handler.py → Open apps / URLs
├─ COMMUNICATION → communication_handler.py → WhatsApp / Calls / Messages
├─ MEDIA → media_handler.py → YouTube / Media Playback
└─ AI → feature.py → Conversational AI
↓
Text-to-Speech Output (helper.py)
This separation ensures accuracy, security, and reliability in system control.
JINI/
│
├── backend/
│ ├── command.py # Command router (routes input to correct handler)
│ ├── feature.py # AI chatbot only (Gemini + HugChat fallback)
│ ├── helper.py # Utility functions (speak, extract words, etc.)
│ ├── conf.py # Configuration (API keys, assistant name)
│ ├── db.py # Database setup and access
│ ├── intent.py # Input classification: SYSTEM, COMMUNICATION, MEDIA, AI
│ ├── handlers/
│ │ ├── system_handler.py # System commands (open apps, launch URLs)
│ │ ├── communication_handler.py # WhatsApp, calls, messages
│ │ └── media_handler.py # YouTube/media playback
│ └── input/
│ └── speech.py # Speech-to-text logic
│
├── frontend/
│ ├── html/
│ │ ├── index.html # Chatbot interface (text input)
│ │ └── mic.html # Voice interaction UI
│ ├── css/
│ │ ├── main.css # Chatbot UI styles
│ │ └── mic.css # Mic UI styles
│ ├── js/
│ │ ├── app.js # Chatbot logic + memory handling
│ │ └── mic.js # Microphone input + audio visualization
│ └── assets/
│ └── favicon.png # JINI application icon
│
├── main.py # Application entry point (launches Eel UI)
├── jini.db # SQLite database (generated locally)
├── requirements.txt # Python dependencies
├── .env # Environment variables (not committed)
├── .gitignore
└── README.md
The project uses SQLite for lightweight local storage. Tables sys_command – Stores application names and executable paths/URLs contacts – Stores contact names and phone numbers Sensitive data is excluded from version control.
- Backend
- Python
- SQLite
- SpeechRecognition
- Pyttsx3
- PyAutoGUI
- Frontend
- HTML
- CSS
- JavaScript
- Eel (Python–Web bridge)
- APIs & Tools
- Google Gemini AI
- Picovoice (Wake word detection)
- YouTube automation
- Text-to-Speech engine (SAPI5)
-
Clone Repository
git clone https://github.com/tanuluthra4/JINI.git
cd JINI -
Create Virtual Environment
python -m venv envjini
envjini\Scripts\activate -
Install Dependencies
pip install -r requirements.txt -
Configure Environment Variables
Create a .env file:
GEMINI_API_KEY=your_api_key ASSISTANT_NAME=JINI -
Run Application
python main.py
- Separate input pipelines: Voice commands trigger handlers, text input goes only to AI
- Command classification: All voice commands are classified first to avoid unintended actions
- AI fallback: Text inputs handled only by feature.py ensures system safety
- Persistent memory: Unified storage across text and voice modes for consistency
- Modular backend: Handlers for system, communication, and media make code maintainable
- Local database: Ensures offline capability for contacts and system commands
These choices were made to ensure robustness, clarity, and system safety.
- Platform dependent (Windows-based)
- Requires internet for AI responses
- Limited to predefined system commands
- Voice recognition accuracy depends on environment noise
- Cross-platform support
- User authentication
- Dynamic command learning
- Mobile integration
- Enhanced NLP-based command classification
- Cloud-based database synchronization
This project is developed as part of the B.Tech Capstone Project under the Department of Computer Engineering, J.C. Bose University of Science & Technology, YMCA, Faridabad.
Tanu Luthra
B.Tech Computer Engineering
J.C. Bose University of Science & Technology, YMCA
A short screen recording demonstrating memory persistence, voice commands, and reload recovery is included in the repository.