Talk to your code. Real‑time voice chat with secure RAG over your repositories.
- Install
pip install -r requirements.txt
- Auth
- Developer API (images/text):
export GOOGLE_API_KEY="YOUR_API_KEY"
export USE_DEVELOPER_API=true
- Cloud (TTS + GCS upload):
export GOOGLE_APPLICATION_CREDENTIALS="/absolute/path/to/service-account.json"
- Minimal JSON (project.json)
{
"title": "Your/Repo",
"description": "One‑line project summary.",
"user_journey": "Short user flow text.",
"repository": "https://github.com/you/repo",
"tech_stack": "List or paragraph of tech names"
}
- Run end‑to‑end
python run.py all \
--slides_json project.json \
--auto_narration \
--narration_model gemini-2.0-flash-001 \
--tts_backend cloud \
--out_dir media \
--gcs_uri gs://YOUR_BUCKET/slides_with_audio.mp4
Output: media/slide_*.png, media/narration_*.wav, media/slides_with_audio.mp4
- Run the API
python main.py
# or: uvicorn visper.api.app:app --reload --port 8000
Docs: http://localhost:8000/docs
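A quick way to confirm the service is up is to fetch the OpenAPI schema that FastAPI exposes by default. This is a generic FastAPI check, not a visper-specific endpoint; any application routes beyond /docs and /openapi.json are not assumed here.

```python
# Smoke test: confirm the API is serving (FastAPI exposes /openapi.json by default).
import requests

resp = requests.get("http://localhost:8000/openapi.json", timeout=5)
resp.raise_for_status()
schema = resp.json()
print("API title:", schema["info"]["title"])
print("Routes:", sorted(schema["paths"].keys()))
```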
- Images
- Default Imagen: Vertex (imagen-4.0-generate-001), or the Developer API auto‑resolves an imagen‑3.x model.
- Gemini image models supported if set explicitly (see the sketch after this section):
export IMAGE_MODEL="gemini-2.5-flash-image"
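The two image paths differ in how the bytes come back. The sketch below uses the google-genai SDK to illustrate the distinction; it is not the actual code in visper/pipeline/images.py, and the default model and branching logic are assumptions.

```python
# Illustrative sketch only (the real logic lives in visper/pipeline/images.py):
# Imagen models go through generate_images, while Gemini image models return
# inline image parts from generate_content.
import os
from google import genai

client = genai.Client()  # picks up GOOGLE_API_KEY or Vertex credentials
model = os.environ.get("IMAGE_MODEL", "imagen-4.0-generate-001")
prompt = "Minimal, high-contrast title slide for a developer tool"

if model.startswith("imagen"):
    result = client.models.generate_images(model=model, prompt=prompt)
    image_bytes = result.generated_images[0].image.image_bytes
else:
    result = client.models.generate_content(model=model, contents=prompt)
    parts = result.candidates[0].content.parts
    image_bytes = next(p.inline_data.data for p in parts if p.inline_data)

with open("slide_1.png", "wb") as f:
    f.write(image_bytes)
```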
- Narration
- Text model (for auto narration): --narration_model (default gemini-2.0-flash-001).
- TTS backend: --tts_backend cloud (Cloud TTS); ensure the API is enabled on your project.
Slides are generated one at a time (separate model calls), derived from the fields: title, description, user_journey, tech_stack, repository. Missing fields fall back to generic, concise phrasing. Auto‑narration produces 1–2 natural explanatory sentences per slide.
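A minimal sketch of what one per-slide narration call might look like with the google-genai SDK. The prompt wording and helper function here are illustrative assumptions, not the actual run.py implementation.

```python
# Illustrative per-slide auto-narration sketch (not the actual run.py code).
from google import genai

client = genai.Client()

def narrate(field_name: str, field_value: str,
            model: str = "gemini-2.0-flash-001") -> str:
    """Ask the text model for 1-2 natural, spoken-style sentences for one slide."""
    prompt = (
        f"Write 1-2 natural, spoken-style sentences explaining the "
        f"'{field_name}' of this project for a slide narration:\n{field_value}"
    )
    response = client.models.generate_content(model=model, contents=prompt)
    return response.text.strip()

print(narrate("description", "One-line project summary."))
```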
- Logo overlay on final video:
--logo /abs/path/logo.png --logo_scale 0.12 --logo_margin 20
- With per‑slide narration, each slide's duration equals its audio duration. With a single audio track, use --seconds for fixed durations.
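To make the per-slide timing behaviour concrete, here is a sketch using the moviepy 1.x API. The real stitching lives in visper/pipeline/compose.py (compose_per_slide); this is only an illustration of "each slide lasts as long as its narration".

```python
# Illustrative per-slide stitching with moviepy 1.x (the real logic is in
# visper/pipeline/compose.py: compose_per_slide). Each slide is shown for
# exactly the duration of its matching narration clip.
from moviepy.editor import AudioFileClip, ImageClip, concatenate_videoclips

pairs = [("media/slide_1.png", "media/narration_1.wav"),
         ("media/slide_2.png", "media/narration_2.wav")]

clips = []
for image_path, audio_path in pairs:
    audio = AudioFileClip(audio_path)
    clip = ImageClip(image_path).set_duration(audio.duration).set_audio(audio)
    clips.append(clip)

concatenate_videoclips(clips).write_videofile("media/slides_with_audio.mp4", fps=24)
```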
- visper/: Python package
  - api/app.py: FastAPI app (from visper.api.app import app)
  - pipeline/: generation pipeline wrappers
    - images.py: init_client, generate_images, generate_images_for_slides
    - tts.py: generate_tts
    - compose.py: compose, compose_per_slide
  - clients/: external service clients
    - github_client.py: GitHub API client
    - vectara_client.py: Vectara client
  - services/: higher-level helpers
    - gemini_enhancer.py: RAG answer enhancer
  - utils/github.py: parse_github_url
- run.py: Orchestrator CLI (uses visper.pipeline.*)
- main.py: API entrypoint (re-exports visper.api.app:app)
- agent_router.py: Receives repo analysis, writes analysis.json, can auto-run the pipeline
- agent_visual.py / agent_audio.py: Optional split agents for slides/audio
- media/: Outputs (slide PNGs, narration WAVs, final MP4)
- JSON-driven slides from 5 fields: title, description, user_journey, tech_stack, repository
- Flexible image models: Imagen (Vertex/Developer) or Gemini (via IMAGE_MODEL)
- Auto narration: 1–2 natural sentences per slide (Gemini text) → TTS via Cloud TTS
- Per‑slide sync: each slide waits until its own audio ends
- Optional GCS upload and logo overlay
- RAG (optional): Vectara can provide retrieved context to enrich slide prompts
- Agents (optional): Fetch.ai uAgents enable remote repo analysis and pipeline triggering
Visper exists to make engineering knowledge accessible to everyone, especially blind and low‑vision developers.
- Audio‑first: Every visual is paired with clear, synchronized narration.
- Minimal visuals: High‑contrast, low‑clutter slides for screen magnifiers.
- Hands‑free: Voice in/voice out to explore large codebases quickly.
- Grounded: Answers cite sources and link back to files.
Watch the 2‑minute demo on YouTube
- Input: 5 JSON fields (title, description, user_journey, tech_stack, repository).
- Prompting: Builds structured, minimal slide prompts per field.
- Image generation: Imagen (default) or Gemini image model renders slides.
- Auto‑narration: Gemini text model writes 1–2 natural sentences per slide.
- TTS: Cloud TTS produces per‑slide WAVs (or a single track if desired).
- Video: Slides + audio are stitched; each slide duration matches its audio.
- Optional: Upload final MP4 to GCS; overlay logo.
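The final upload step is handled by the --gcs_uri flag, but for reference it amounts to a standard google-cloud-storage call. Bucket and object names below are placeholders; this is a sketch, not the pipeline's actual upload code.

```python
# Illustrative GCS upload of the final MP4 (the --gcs_uri flag does this for you).
# Bucket and object names are placeholders.
from google.cloud import storage  # uses GOOGLE_APPLICATION_CREDENTIALS

client = storage.Client()
bucket = client.bucket("YOUR_BUCKET")
blob = bucket.blob("slides_with_audio.mp4")
blob.upload_from_filename("media/slides_with_audio.mp4")
print("Uploaded to", f"gs://{bucket.name}/{blob.name}")
```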
- gemini-2.0-flash-001
  - Purpose: Auto‑narration text generation (1–2 sentences per slide) when using --auto_narration or --slides_json.
  - Where: run.py (--narration_model flag; defaults to this model).
- gemini-2.5-flash-image
  - Purpose: Image generation via generate_content with inline image parts when you explicitly set IMAGE_MODEL=gemini-2.5-flash-image.
  - Where: generate_slides_with_tts.py auto‑detects Gemini image models and switches to the Gemini content path.
- gemini-2.5-flash-preview-tts
  - Purpose: TTS directly via Gemini when --tts_backend gemini is used (may require allowlisting); the default pipeline uses Google Cloud Text‑to‑Speech instead.
  - Where: generate_tts.py (backend selectable via --tts_backend).
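Since the default backend is Google Cloud Text-to-Speech, a standalone call looks roughly like the sketch below. The voice selection is an assumption; visper's generate_tts wrapper may configure this differently.

```python
# Minimal Google Cloud Text-to-Speech call (what --tts_backend cloud relies on).
# Voice selection here is an assumption; visper's generate_tts may differ.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="This slide shows the user journey."),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16  # PCM/WAV output
    ),
)
with open("media/narration_1.wav", "wb") as f:
    f.write(response.audio_content)
```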
- Gemini Live API
  - Purpose: Low‑latency, streaming voice interactions for real‑time conversations (barge‑in, partial results) in Voice RAG Chat.
  - Where: Live session via the google‑genai Live API (WebSocket/stream); integrates with the FastAPI service for voice chat mode.
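For orientation only, a rough text-in/text-out Live session with google-genai is sketched below. The model ID and method names are assumptions that track recent SDK versions and may differ from the project's actual voice-chat integration, which streams audio through the FastAPI service.

```python
# Rough sketch of a Live session with google-genai (assumed model ID and
# method names; the actual voice chat integration differs).
import asyncio
from google import genai

async def main() -> None:
    client = genai.Client()
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Summarize run.py for me."}]},
            turn_complete=True,
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

asyncio.run(main())
```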
Google Gemini Hackathon — TED AI
Links:
Abhay Lal
Yash Vishe
Kshitij Akash Dumbre
Guruprasad Parasnis


