Abstracts Explorer

A package to download conference data and search it with LLM-based semantic search, including document retrieval and question answering.

Features

  • 📥 Download conference data from various sources (NeurIPS, ICLR, ICML, ML4PS)
  • 💾 Store data in a SQL database (SQLite or PostgreSQL) with efficient indexing
  • 🔍 Search papers by keywords, track, and other attributes
  • 🤖 Generate text embeddings for semantic search
  • 🔎 Find similar papers using AI-powered semantic similarity
  • 💬 Interactive RAG chat to ask questions about papers
  • 🎨 NEW: Cluster and visualize paper embeddings with interactive plots
  • 🌐 Web interface for browsing and searching papers
  • 🔌 NEW: MCP server for LLM-based cluster analysis
  • 🗄️ NEW: Multi-database backend support (SQLite and PostgreSQL)
  • ⚙️ Environment-based configuration with .env file support

Installation

Quick Start with Docker/Podman 🐳

The easiest way to get started with a complete stack (PostgreSQL + ChromaDB):

First create a .env file with your blablador token:

LLM_BACKEND_AUTH_TOKEN=your_blablador_token_here

Then download docker-compose.yml:

curl -o docker-compose.yml https://raw.githubusercontent.com/thawn/abstracts-explorer/refs/heads/main/docker-compose.yml

Then start the services with:

# Using Podman (recommended)
podman-compose up -d

# Or using Docker
docker-compose up -d

# Access at http://localhost:5000

The Docker Compose setup includes:

  • Web UI on port 5000 (exposed)
  • PostgreSQL for paper metadata (internal only)
  • ChromaDB for semantic search (internal only)
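
Once the containers are running, you can quickly check that the web UI answers on the exposed port. A minimal sketch using only the Python standard library (the URL is the one listed above):

import urllib.request

# The web UI is the only service exposed by the compose stack (port 5000).
with urllib.request.urlopen("http://localhost:5000", timeout=5) as response:
    print(f"Web UI responded with HTTP {response.status}")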

📖 Complete Docker/Podman Guide

Note: The container images use pre-built static vendor files. Node.js is only needed for local development if you want to rebuild CSS/JS libraries.

Traditional Installation

Requirements: Python 3.11+, uv package manager, Node.js 14+ (for web UI development)

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/thawn/abstracts-explorer.git
cd abstracts-explorer

# Install dependencies
uv sync --all-extras

# Install Node.js dependencies for web UI
npm install
npm run install:vendor

📖 Full Installation Guide

Configuration

Create a .env file to customize settings:

cp .env.example .env
# Edit .env with your preferred settings
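
The values in .env are picked up as environment variables at runtime. A minimal sketch of how you can inspect them yourself, assuming python-dotenv-style loading (the package's own configuration module may load .env differently):

import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Load key=value pairs from .env into the process environment.
load_dotenv()

# PAPER_DB and LLM_BACKEND_AUTH_TOKEN are the settings used elsewhere in this README.
print("PAPER_DB =", os.getenv("PAPER_DB", "data/abstracts.db"))
print("auth token set:", bool(os.getenv("LLM_BACKEND_AUTH_TOKEN")))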

📖 Configuration Guide - Complete list of settings and options

Database Backend

Abstracts Explorer supports both SQLite and PostgreSQL backends:

# Option 1: SQLite (default, no additional setup required)
PAPER_DB=data/abstracts.db

# Option 2: PostgreSQL (requires PostgreSQL server)
PAPER_DB=postgresql://user:password@localhost/abstracts

PostgreSQL Setup:

# Install PostgreSQL support
uv sync --extra postgres

# Create database
createdb abstracts

# Configure in .env
PAPER_DB=postgresql://user:password@localhost/abstracts
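
To verify that the PostgreSQL backend is picked up, you can lean on the environment-based configuration described above. This is a sketch only; it assumes DatabaseManager reads PAPER_DB from the environment, as the no-argument usage later in this README suggests:

import os

# Assumption: DatabaseManager honours the PAPER_DB environment variable.
os.environ["PAPER_DB"] = "postgresql://user:password@localhost/abstracts"

from abstracts_explorer import DatabaseManager

with DatabaseManager() as db:
    db.create_tables()  # fails early if the connection string is wrong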

📖 See Configuration Guide for more database options

Quick Start

Download Conference Data

# Download NeurIPS 2025 papers
abstracts-explorer download --year 2025

Generate Embeddings for Semantic Search

# Requires LM Studio running with embedding model loaded
abstracts-explorer create-embeddings

Cluster and Visualize Embeddings

# Cluster embeddings using K-Means (PCA reduction)
abstracts-explorer cluster-embeddings --n-clusters 8 --output clusters.json

# Cluster using t-SNE and DBSCAN
abstracts-explorer cluster-embeddings \
  --reduction-method tsne \
  --clustering-method dbscan \
  --eps 0.5 \
  --min-samples 5 \
  --output clusters.json

# Cluster using Agglomerative with distance threshold
abstracts-explorer cluster-embeddings \
  --clustering-method agglomerative \
  --distance-threshold 5.0 \
  --output clusters.json

# Cluster using Spectral clustering
abstracts-explorer cluster-embeddings \
  --clustering-method spectral \
  --n-clusters 10 \
  --output clusters.json

# The web UI includes an interactive cluster visualization tab!
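
The --output file is plain JSON, so you can also inspect it outside the web UI. A short sketch that summarizes cluster sizes, assuming the file mirrors the structure returned by perform_clustering in the Python API examples below (a list of points, each carrying a cluster label):

import json
from collections import Counter

with open("clusters.json") as f:
    clusters = json.load(f)

# Count how many papers landed in each cluster (with DBSCAN, label -1 conventionally marks noise).
sizes = Counter(point["cluster"] for point in clusters["points"])
for label, size in sorted(sizes.items()):
    print(f"Cluster {label}: {size} papers")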

Start MCP Server for Cluster Analysis

# Start MCP server for LLM-based cluster analysis
abstracts-explorer mcp-server

# The MCP server provides tools to analyze clustered papers:
# - Get most frequently mentioned topics
# - Analyze topic evolution over years
# - Find recent developments in topics
# - Generate cluster visualizations

NEW: MCP clustering tools are now integrated directly into the RAG chat! The LLM automatically uses them when appropriate to answer questions about topics, trends, and developments, so there is no need to run a separate MCP server for RAG chat usage.

Start Web Interface

abstracts-explorer web-ui
# Open http://127.0.0.1:5000 in your browser

📖 Usage Guide - Detailed examples and workflows
📖 CLI Reference - Complete command-line documentation
📖 API Reference - Python API documentation

Web Interface

The web UI provides an intuitive interface for browsing and searching papers:

  • πŸ” Search: Keyword and AI-powered semantic search
  • πŸ’¬ Chat: Interactive RAG chat with query rewriting
  • ⭐ Ratings: Save and organize interesting papers
  • πŸ“Š Filters: Filter by track, decision, event type, and more
  • 🎨 Clusters: Interactive visualization of paper embeddings (NEW!)

Web UI Screenshot: The web interface provides an intuitive way to search and explore conference papers.

Python API Examples

Download and Search Papers

from abstracts_explorer.plugins import get_plugin
from abstracts_explorer import DatabaseManager

# Download papers
neurips_plugin = get_plugin('neurips')
papers_data = neurips_plugin.download(year=2025)

# Load into database and search
with DatabaseManager() as db:
    db.create_tables()
    db.add_papers(papers_data)
    
    # Search papers
    papers = db.search_papers(keyword="deep learning", limit=5)
    for paper in papers:
        print(f"{paper['title']} by {paper['authors']}")

Semantic Search with Embeddings

from abstracts_explorer import EmbeddingsManager

with EmbeddingsManager() as em:
    em.create_collection()
    em.embed_from_database()
    
    # Find similar papers
    results = em.search_similar(
        "transformers for natural language processing",
        n_results=5
    )
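
The exact shape of results depends on the vector store backend; if it follows ChromaDB's query format (parallel lists of documents, metadatas, and distances), you could print the matches like this (a sketch with assumed field names):

# Assumption: results follows ChromaDB's query result layout.
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"{meta.get('title', 'untitled')} (distance: {dist:.3f})")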

Cluster and Visualize Embeddings

from abstracts_explorer.clustering import perform_clustering

# Perform complete clustering pipeline
results = perform_clustering(
    reduction_method="tsne",      # or "pca", "umap"
    n_components=2,
    clustering_method="kmeans",    # or "dbscan", "agglomerative", "spectral", "fuzzy_cmeans"
    n_clusters=8,
    output_path="clusters.json"
)

# Access clustering results
print(f"Found {results['statistics']['n_clusters']} clusters")
for point in results['points']:
    print(f"Paper: {point['title']} -> Cluster {point['cluster']}")

📖 Complete Usage Guide - More examples and workflows

Documentation

📚 Full Documentation - Complete documentation built with Sphinx

Quick Links

Development

# Install with development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Run linters
ruff check src/ tests/
mypy src/ --ignore-missing-imports

📖 Contributing Guide - Complete development documentation

Contributing

Contributions are welcome! Please read our Contributing Guide for details on:

  • Development setup
  • Running tests and linters
  • Code style and conventions
  • Submitting pull requests

License

Apache License 2.0 - see LICENSE file for details.

Support

For issues, questions, or contributions, please open an issue or pull request on the GitHub repository.
