Automated AI Pipeline for Scientific Discovery & Science Communication
GeneGist is an open-source tool designed for bioinformaticians and science communicators. It automates the tracking of scientific literature by fetching real-time data from NCBI PubMed, summarizing complex papers into engaging blog posts using Google Gemini AI, and archiving them in a cloud database.
- Smart Mining: Fetches the latest research for dynamic keywords (e.g., CRISPR, mRNA, Aging) through the NCBI Entrez API.
- AI-Powered Summarization: Uses a custom "Expert Science Journalist" persona to convert abstract scientific texts into high-quality, readable blog posts.
- Dual Language Support: Generates content in both English (EN) and Turkish (TR) simultaneously.
- Cloud Architecture: Stores all processed metadata and blog content in Supabase (PostgreSQL), ready for web integration.
- Smart Deduplication: Automatically checks the database history to prevent processing the same article twice.
- Hybrid Operation Modes:
  - Auto Mode: Scans pre-defined topics daily (ideal for Cron jobs).
  - Manual Mode: CLI support for specific, ad-hoc research queries.
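
As an illustration of the Smart Deduplication feature, here is a minimal sketch of how freshly fetched article IDs might be filtered against IDs already in the database. The function name `filter_new_articles` and the sample PMIDs are hypothetical, and an in-memory set stands in for the Supabase lookup; the project's actual implementation may differ.

```python
def filter_new_articles(fetched_ids, known_ids):
    """Return fetched IDs that are not stored yet, keeping their original order."""
    seen = set(known_ids)
    fresh = []
    for article_id in fetched_ids:
        if article_id not in seen:
            fresh.append(article_id)
            seen.add(article_id)  # also drops repeats inside the same batch
    return fresh


# Hypothetical PMIDs; in practice known IDs would come from the articles table.
already_stored = {"38000001", "38000002"}
batch = ["38000002", "38000003", "38000003", "38000004"]
print(filter_new_articles(batch, already_stored))  # ['38000003', '38000004']
```

Checking IDs before calling the LLM keeps repeat runs cheap, since only unseen papers are summarized.
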
- Core Logic: Python (Modular Architecture)
- Data Source: Biopython (NCBI Entrez)
- LLM Engine: Google Gemini 1.5 Flash via `google-generativeai`
- Database: Supabase (PostgreSQL) via `supabase-py`
- Environment: Dotenv for secure key management
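
To make the data-source layer more concrete, the sketch below shows one way a PubMed search could be issued through Biopython's `Bio.Entrez` module. The helper names (`build_search_params`, `fetch_pmids`) and the default values are assumptions for illustration, not the project's actual code; `reldate`, `datetype`, `retmax`, and `sort` are standard ESearch parameters.

```python
def build_search_params(keyword, days=7, count=5):
    """Build ESearch parameters for PubMed papers published in the last `days` days."""
    return {
        "db": "pubmed",
        "term": keyword,
        "reldate": days,     # look back this many days from today
        "datetype": "pdat",  # apply the window to the publication date
        "retmax": count,     # cap the number of returned IDs
        "sort": "pub_date",  # newest publications first
    }


def fetch_pmids(keyword, email, api_key=None, days=7, count=5):
    """Query PubMed and return a list of PMID strings (needs network access)."""
    from Bio import Entrez  # lazy import so the helper above works without Biopython

    Entrez.email = email
    if api_key:
        Entrez.api_key = api_key
    handle = Entrez.esearch(**build_search_params(keyword, days, count))
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return list(record["IdList"])


print(build_search_params("CRISPR", days=30, count=5))
```

Keeping the parameter construction separate from the network call makes the query logic easy to inspect and test offline.
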
- Clone the repository:

      git clone https://github.com/yourusername/genegist.git
      cd genegist

- Set up Virtual Environment & Install Dependencies:

      python3 -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`
      pip install -r requirements.txt

- Configuration: Create a `.env` file in the root directory with your API keys:

      NCBI_API_KEY=your_ncbi_key
      NCBI_EMAIL=your_email@example.com
      GOOGLE_API_KEY=your_gemini_key
      SUPABASE_URL=your_supabase_url
      SUPABASE_KEY=your_supabase_anon_key

- Database Setup: Run the following SQL in your Supabase SQL Editor to create the necessary table:

      create table articles (
        article_id text primary key,
        title text,
        topic text,
        url text,
        published_at date,
        content_tr text,
        content_en text,
        created_at timestamp with time zone default timezone('utc'::text, now())
      );

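
For orientation, the following sketch shows how a processed paper might be mapped onto the `articles` columns and written with `supabase-py`. The helper names (`build_article_row`, `save_article`) and the sample values are hypothetical. It assumes `SUPABASE_URL` and `SUPABASE_KEY` are already present in the environment, for example after calling `load_dotenv()` from `python-dotenv`.

```python
import os
from datetime import date


def build_article_row(article_id, title, topic, url, published_at, content_tr, content_en):
    """Map one processed paper onto the columns of the articles table."""
    return {
        "article_id": str(article_id),
        "title": title,
        "topic": topic,
        "url": url,
        "published_at": published_at.isoformat(),  # date -> 'YYYY-MM-DD'
        "content_tr": content_tr,
        "content_en": content_en,
        # created_at is omitted so the column default fills it in
    }


def save_article(row):
    """Upsert a row into Supabase (needs network access and valid credentials)."""
    from supabase import create_client  # lazy import: only needed when saving

    client = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
    return client.table("articles").upsert(row).execute()


# Placeholder values for illustration only.
row = build_article_row(
    "12345678",
    "Example title",
    "CRISPR",
    "https://pubmed.ncbi.nlm.nih.gov/12345678/",
    date(2024, 1, 15),
    "TR summary text",
    "EN summary text",
)
print(row["published_at"])  # 2024-01-15
```

Using an upsert keyed on `article_id` means a repeated write updates the existing row rather than failing on the primary key.
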
GeneGist is designed to be flexible. You can run it automatically based on your config file or manually via CLI.
1. Auto Mode (Default)
Scans keywords defined in config.json.

    python main.py

2. Manual Mode (CLI)
Search for a specific topic instantly.

    # Search for "Neuralink" papers from the last 30 days
    python main.py --manual --keyword "Neuralink" --days 30 --count 5

For detailed documentation, please refer to MANUAL.md.
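
The manual-mode flags shown above could be parsed along the following lines. This is a sketch using the standard-library `argparse` module; the default values for `--days` and `--count`, the help texts, and the rule that `--manual` requires `--keyword` are assumptions rather than documented behavior of `main.py`.

```python
import argparse


def build_parser():
    """Create a parser for the flags used in the manual-mode example."""
    parser = argparse.ArgumentParser(prog="main.py", description="GeneGist pipeline runner")
    parser.add_argument("--manual", action="store_true", help="run a single ad-hoc query")
    parser.add_argument("--keyword", help="search term for manual mode")
    parser.add_argument("--days", type=int, default=7, help="look-back window in days (assumed default)")
    parser.add_argument("--count", type=int, default=5, help="maximum number of papers (assumed default)")
    return parser


def parse_cli(argv=None):
    """Parse arguments; without --manual the caller would fall back to auto mode."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.manual and not args.keyword:
        parser.error("--manual requires --keyword")
    return args


args = parse_cli(["--manual", "--keyword", "Neuralink", "--days", "30", "--count", "5"])
print(args.manual, args.keyword, args.days, args.count)  # True Neuralink 30 5
```
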
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.