
Image source: Gemini
Querying databases typically requires technical knowledge of SQL syntax and schema structures. This project bridges that gap by creating an AI-powered assistant that converts plain English questions into executable SQL queries. Built with Python, the google-genai client, and SQLite, the system allows non-technical users to "chat" with their data, retrieving real-time results from a predefined dataset (e.g., The Simpsons database) without writing a single line of code.
The real power of this assistant is that it makes data accessible to everyone. Usually, if a writer or producer wants to know something from the database, they have to wait for a technical expert to write the code for them. This tool removes that barrier, allowing anyone to ask a question in plain English and get an answer instantly. It essentially turns a complex pile of data into a conversational partner.
- Self-Service Data: Non-technical users can explore datasets directly, removing the "middleman" (the Data Analyst).
- Workflow Evolution: The developer's focus shifts from writing repetitive syntax to Schema Engineering and Prompt Optimization.
- Contextual Accuracy: Unlike generic AI, this system is grounded in a local database_schema.txt, preventing hallucinations of table or column names.
- Privacy & Security: The actual data never leaves the local environment; only the query logic is generated by the LLM.
- Integrated Execution: It isn't just a text generator; it is a complete pipeline that writes, cleans, and runs the code in one step.
- Safety: This approach is inherently more secure than public AI tools because it keeps your actual records local; only the database structure (metadata) and the user's question are sent to the cloud.
Note: the system remains vulnerable to destructive commands (a prompt-level form of "SQL injection") if a user asks it to delete or modify tables. To mitigate this in a professional setting, the assistant should be restricted to a read-only database connection so it can query data without ever being able to change it.
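SQLite makes this mitigation a one-line change via its URI syntax; a minimal sketch (stock sqlite3, with the database file name from this project):

import sqlite3

# Open the database in read-only mode: any INSERT, UPDATE, DELETE or DROP
# the model generates will now raise an error instead of modifying data.
conn = sqlite3.connect("file:simpsons.db?mode=ro", uri=True)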
- Bridges the Gap: Connects human curiosity to complex relational storage.
- Scalability: Handles large, multi-table JOINs that would be tedious to write manually.
- Auditability: Maintains a permanent record of every question and its corresponding SQL result.
- Clone the project:
git clone https://github.com/butkutez/Natural-Language-to-SQL-Queries.git
cd Natural-Language-to-SQL-Queries
- Create virtual environment (Windows)
py -3.13 -m venv .venv
.\.venv\Scripts\Activate.ps1
- Install dependencies
pip install -r requirements.txt
- Initialize the Database
- Download Data using Terminal - Fetch raw files from Kaggle (Make sure to adapt the code with your generated API key and Kaggle username).
python setup.py
- Build SQLite DB (Processes CSVs into simpsons.db; see the sketch after this list)
python build_db.py
- Verify Schema (Optional: Check table structures)
python exploring_schema.py
- Run the SQL Assistant - Once the database is initialized, launch the interactive query tool:
python main.py
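For context, the heart of build_db.py is most likely a pandas-to-SQLite load. A minimal sketch, with a hypothetical CSV and table name for illustration (the real script processes every downloaded Kaggle file this way):

import sqlite3
import pandas as pd

# Hypothetical file and table names; build_db.py repeats
# this pattern for each CSV fetched from Kaggle.
df = pd.read_csv("simpsons_characters.csv")
conn = sqlite3.connect("simpsons.db")
df.to_sql("characters", conn, if_exists="replace", index=False)
conn.close()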
NATURAL-LANGUAGE-TO-SQL-QUERIES
├── setup.py
├── build_db.py
├── exploring_schema.py
├── main.py
│
├── database_schema.txt
├── assignment_results.txt
│
├── requirements.txt
├── .gitignore
└── README.md
┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌──────────────┐
│ User Query │ ──► │ LLM Engine │ ──► │ SQLite Exec │ ──► │ Final Result │
│ (English) │ │ (Text-to-SQL) │ │ (Python) │ │ (Table/Data) │
└──────────────┘ └────────────────┘ └──────────────┘ └──────────────┘
"Who is Lisa's Prompt + Schema Execute Generated Visualized Result
teacher?" Inference SQL Query
The application follows a modular "Pipeline" approach, where a natural language input is transformed into structured data through four distinct phases.
The system begins by loading environment variables via python-dotenv. A critical design choice was the use of a static schema reference (database_schema.txt).
Why this is important: Instead of querying the database for its structure every time (which is slow and resource-heavy), the script reads a pre-audited schema file. This provides the LLM with the exact table names and column types it needs to write accurate queries.
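A minimal sketch of what that startup step might look like (get_schema_from_file reappears in the snippet further below; its body here is an assumption):

from dotenv import load_dotenv

# Load environment variables (e.g., the Gemini API key) from a local .env file
load_dotenv()

def get_schema_from_file(path="database_schema.txt"):
    """Read the pre-audited schema description used to ground every prompt."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()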
Using the google-genai client, the user's question is wrapped in a high-context prompt.
- Prompt Engineering: The prompt combines the content of the schema file with the user's request, explicitly instructing the model to "Return only SQL".
- Post-Processing: Since LLMs often return Markdown formatting (like ```sql blocks), the code includes a cleaning step to strip these characters, ensuring the string is pure SQL ready for execution.
The generated SQL is passed to a robust execution function using sqlite3.
- Safe Connection Management: The function uses a try...except...finally block. This ensures that the database connection is closed safely even if the AI generates a faulty query, preventing database locks and memory leaks.
- Error Handling: If the SQL is invalid, the system catches the exception and returns a readable error message rather than crashing the program.
The final phase focuses on transparency and accountability.
- User Feedback: Results are iterated over and displayed clearly in the terminal.
- Audit Trail: Every interaction is appended to assignment_results.txt. This creates a permanent log of the question asked, the SQL generated by Gemini, and the final data retrieved, which is essential for debugging and evaluating the model's performance over time.
This snippet shows how the assistant gains "context" by reading the schema and instructing the Gemini model:
from google import genai

# The client picks up the API key loaded from .env
client = genai.Client()

def generate_sql(user_question):
    """Translate natural language into SQL by injecting the schema into the prompt."""
    schema = get_schema_from_file()
    # Engineer the prompt to enforce a 'SQL-only' output format
    prompt = f"Schema: {schema}\nRequest: {user_question}\nReturn only SQL."
    response = client.models.generate_content(model='gemini-2.5-flash', contents=prompt)
    # Sanitize the response by stripping backticks and Markdown formatting
    return response.text.strip().replace("```sql", "").replace("```", "")

This demonstrates the ability to manage database connections safely and handle potential AI-generated errors.
import sqlite3

def run_query(sql):
    """Execute the generated SQL against the SQLite database and fetch raw results."""
    conn = sqlite3.connect('simpsons.db')
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        rows = cursor.fetchall()
        return rows
    except Exception as e:
        return f"SQL Error: {e}"
    finally:
        # Ensure the database connection is closed safely
        conn.close()

This implementation facilitates AI performance monitoring by archiving the complete "Question → SQL → Result" chain:
# Append the interaction to a log file for history tracking
with open("assignment_results.txt", "a", encoding="utf-8") as f:
    f.write(f"Q: {question}\nSQL: {sql}\nResult: {results}\n\n")

Example output:
Q: Who is the wife of Homer?
SQL:
SELECT name
FROM characters
WHERE name = 'Marge Simpson';
Result: [('Marge Simpson',)]
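For completeness, the interactive loop in main.py plausibly ties these helpers together along these lines (a sketch, not the verbatim script):

def main():
    print("Ask a question about the Simpsons database (type 'quit' to exit).")
    while True:
        question = input("\nYour question: ").strip()
        if question.lower() in ("quit", "exit"):
            break
        sql = generate_sql(question)      # Phase 2: text-to-SQL
        results = run_query(sql)          # Phase 3: execution
        print(results)                    # Phase 4: user feedback
        # Append the interaction to the audit log
        with open("assignment_results.txt", "a", encoding="utf-8") as f:
            f.write(f"Q: {question}\nSQL: {sql}\nResult: {results}\n\n")

if __name__ == "__main__":
    main()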
By combining the Google Gemini 2.5 Flash model with a local SQLite database, I transformed a complex relational dataset into an accessible conversational interface. Users can now retrieve specific insights from the Simpsons universe (or any database of their choice) without knowing any SQL syntax.
- Natural Language Processing: The assistant accurately interprets intent and maps it to the specific schema of the Simpsons dataset.
- Automated Audit Trail: Every interaction is captured in assignment_results.txt, creating a comprehensive record that includes the user's question, the AI-generated SQL, and the final data output for validation.
- Query Refinement Loop: Implement a "self-healing" mechanism where the assistant tries to fix its own SQL if the initial execution returns a sqlite3 error (see the sketch after this list).
- Interactive UI: Migrate from terminal-based input to a Streamlit or Gradio web interface for a more user-friendly experience.
- Advanced Visualization: Integrate matplotlib or seaborn to automatically generate charts (e.g., "Most frequent characters by season") when the query returns numerical data.
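One way the refinement loop could work, assuming run_query keeps returning its "SQL Error: ..." string on failure (generate_sql_with_retry and the repair prompt are hypothetical):

def generate_sql_with_retry(user_question, max_attempts=2):
    """Generate SQL; if execution fails, feed the error back to the model for repair."""
    sql = generate_sql(user_question)
    results = run_query(sql)
    attempts = 0
    while isinstance(results, str) and results.startswith("SQL Error:") and attempts < max_attempts:
        # Show the model its own faulty SQL plus the sqlite3 error message
        prompt = (
            f"Schema: {get_schema_from_file()}\n"
            f"This SQL failed:\n{sql}\n"
            f"Error: {results}\n"
            "Return only the corrected SQL."
        )
        response = client.models.generate_content(model='gemini-2.5-flash', contents=prompt)
        sql = response.text.strip().replace("```sql", "").replace("```", "")
        results = run_query(sql)
        attempts += 1
    return sql, results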
This solo project was completed over 2 days.
This project was completed as part of the AI & Data Science Bootcamp at BeCode.org.
Connect with me on LinkedIn.