benyam7/llm-knowledge-extractor

Project Overview: LLM Knowledge Extractor

This is an LLM-powered text analysis API built with FastAPI that extracts structured information from unstructured text. Here's what it does:

Core Functionality

The system takes unstructured text and automatically extracts:

  • Title - A concise title for the content
  • Summary - A brief summary of the text
  • Topics - The main topics discussed (the extractor aims for three)
  • Sentiment - Positive, neutral, or negative
  • Keywords - Most frequent nouns extracted locally
  • Confidence Score - Quality metric based on structural integrity
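
The keyword step runs locally rather than through the LLM. A minimal sketch of a frequency-based extractor, assuming a simple stopword filter stands in for real noun detection (the repository may use proper POS tagging; the function name and stopword list here are illustrative):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "of",
             "to", "in", "for", "with", "on", "it", "this", "that"}

def extract_keywords(text: str, top_n: int = 3) -> list[str]:
    """Return the top_n most frequent non-stopword tokens as keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```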

Design Choices

I designed this system using a modular, service-oriented architecture to ensure a clean separation of concerns between the API layer, business logic, and data access. I chose FastAPI for its high performance, automatic interactive documentation, and excellent data validation with Pydantic. For LLM interaction, I used the instructor library to reliably parse the LLM's output into Pydantic models, which makes the extraction process robust and eliminates manual parsing errors. Finally, SQLAlchemy provides a flexible Object-Relational Mapper (ORM) that abstracts away raw database queries; it let me start with SQLite while keeping an easy migration path to a more powerful database like PostgreSQL in the future.
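
The instructor pattern described above boils down to declaring a typed schema and validating the model's JSON output against it instead of hand-parsing strings. A stdlib-only sketch of that idea (the actual code uses Pydantic models with instructor; `AnalysisResult` and its field set here are illustrative):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class AnalysisResult:
    title: str
    summary: str
    topics: list
    sentiment: str

def parse_llm_output(raw: str) -> AnalysisResult:
    """Validate raw LLM JSON against the schema, failing loudly on drift."""
    data = json.loads(raw)
    expected = {f.name for f in fields(AnalysisResult)}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"LLM output missing fields: {missing}")
    return AnalysisResult(**{k: data[k] for k in expected})
```

Pydantic (via instructor) additionally checks field types and can retry the LLM call on validation failure, which is what makes it preferable to manual parsing like this.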

Trade-offs

I made several deliberate trade-offs:

  1. Synchronous API Calls: The /analyze and /analyze-batch endpoints are synchronous. For a production system, I would move the slow LLM API calls to a background task queue (like Celery) to prevent blocking and API timeouts.

  2. Simplified Search: The search functionality uses a basic SQL LIKE query. This is inefficient for large datasets and lacks linguistic intelligence; it should be replaced with a dedicated full-text search engine (like Elasticsearch) or a vector database for semantic similarity search.

  3. SQLite Database: I used SQLite because it requires zero setup. For any multi-user or high-write application, I would migrate to PostgreSQL to handle concurrency and scale effectively.

  4. Further Improvements: what_i_would_improve.md discusses additional improvements in more depth.
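
The LIKE-based search in trade-off 2 amounts to a substring match over the stored topic and keyword columns. A self-contained sqlite3 sketch of why that is both simple and limited (the table and column names are illustrative, not the repository's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE analyses (id INTEGER PRIMARY KEY, topics TEXT, keywords TEXT)")
conn.executemany(
    "INSERT INTO analyses (topics, keywords) VALUES (?, ?)",
    [("machine learning, apis", "model, inference"),
     ("databases", "sqlite, postgres")],
)

def search(q: str) -> list:
    # Substring match only: "learn" matches "learning", but "learnt" finds nothing,
    # and a leading-% LIKE cannot use an index, so every query is a full table scan.
    pattern = f"%{q}%"
    return conn.execute(
        "SELECT id FROM analyses WHERE topics LIKE ? OR keywords LIKE ?",
        (pattern, pattern),
    ).fetchall()
```

A full-text or vector-based engine would handle stemming, ranking, and semantic similarity that this query cannot.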

Setup and Run Instructions

You can run this application either locally using a Python virtual environment or via Docker.

1. Running with Docker (Recommended)

Prerequisites:

  • Docker installed and running.

Instructions:

  1. Create the environment file: Copy the example file and add your OpenAI API key.

    cp .env.example .env
    # Now, edit the .env file and add your key
    # OPENAI_API_KEY="sk-..."
    # OPENAI_MODEL="your_open_ai_compatible_model"
    # OPENAI_BASE_URL="your_open_ai_base_url"
  2. Build the Docker image:

    docker build -t llm-extractor .
  3. Run the Docker container: This command will start the application and forward port 8000.

    docker run --rm -p 8000:8000 --env-file .env llm-extractor

The API is now running and accessible at http://localhost:8000.

2. Running Locally (for Development)

Prerequisites:

  • Python 3.9+
  • A Python virtual environment tool (venv)

Instructions:

  1. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  
  2. Install dependencies:

    pip install -r requirements.txt
  3. Create the environment file: Copy the example and add your OpenAI API key.

    cp .env.example .env
    # Edit the .env file with your key
  4. Run the application: The --reload flag will automatically restart the server when you make code changes.

    uvicorn app.main:app --reload

The API is now running and accessible at http://127.0.0.1:8000.

How to Use the API

Once the application is running, the easiest way to interact with the API is through the auto-generated documentation.

  1. Open your browser and navigate to http://127.0.0.1:8000/docs.

  2. You will see the Swagger UI, which provides an interactive interface for all available endpoints.

Key Endpoints:

  • POST /analyze: Accepts a JSON object with a single text field. It processes the text, stores the result, and returns the full analysis object.
  • POST /analyze-batch: Accepts a JSON object with a texts field (a list of strings) and returns a list of analysis objects.
  • GET /search?q={query}: Searches for stored analyses where the query string matches one of the extracted topics or keywords.
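
For scripted use outside the Swagger UI, the /analyze call is an ordinary JSON POST. A stdlib sketch of constructing the request (the payload shape follows the endpoint description above; uncomment the call with the server running):

```python
import json
import urllib.request

payload = json.dumps({"text": "FastAPI is a modern Python web framework."}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8000/analyze",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the server running:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))  # full analysis object
```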

Running Tests

To ensure the application is working as expected, you can run the integration test suite.

  1. Make sure you have installed the development dependencies:

    pip install -r requirements.txt
  2. Run pytest from the root directory:

    PYTHONPATH=. pytest tests/test_api.py -v
