benyam7/llm-knowledge-extractor

Project Overview: LLM Knowledge Extractor

This is an LLM-powered text analysis API built with FastAPI that extracts structured information from unstructured text. Here's what it does:

Core Functionality

The system takes unstructured text and automatically extracts:

  • Title - A concise title for the content
  • Summary - A brief summary of the text
  • Topics - The main topics discussed (the extractor aims for three)
  • Sentiment - Positive, neutral, or negative
  • Keywords - Most frequent nouns extracted locally
  • Confidence Score - Quality metric based on structural integrity
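
The keyword step runs locally rather than through the LLM. A minimal sketch of a frequency-based extractor, assuming a simple stopword filter stands in for real noun detection (the repository may use proper POS tagging; the function name and stopword list here are illustrative):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "of",
             "to", "in", "for", "with", "on", "it", "this", "that"}

def extract_keywords(text: str, top_n: int = 3) -> list[str]:
    """Return the top_n most frequent non-stopword tokens as keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]
```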

Design Choices

I designed this system using a modular, service-oriented architecture to ensure a clean separation of concerns between the API layer, business logic, and data access. I chose FastAPI for its high performance, automatic interactive documentation, and excellent data validation with Pydantic. For LLM interaction, I used the instructor library to reliably parse the LLM's output into Pydantic models, which makes the extraction process robust and eliminates manual parsing errors. Finally, SQLAlchemy provides a flexible Object-Relational Mapper (ORM) that abstracts away raw database queries; it let me start with SQLite while keeping an easy migration path to a more powerful database like PostgreSQL in the future.
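
The instructor pattern described above boils down to declaring a typed schema and validating the model's JSON output against it instead of hand-parsing strings. A stdlib-only sketch of that idea (the actual code uses Pydantic models with instructor; `AnalysisResult` and its field set here are illustrative):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class AnalysisResult:
    title: str
    summary: str
    topics: list
    sentiment: str

def parse_llm_output(raw: str) -> AnalysisResult:
    """Validate raw LLM JSON against the schema, failing loudly on drift."""
    data = json.loads(raw)
    expected = {f.name for f in fields(AnalysisResult)}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"LLM output missing fields: {missing}")
    return AnalysisResult(**{k: data[k] for k in expected})
```

Pydantic (via instructor) additionally checks field types and can retry the LLM call on validation failure, which is what makes it preferable to manual parsing like this.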

Trade-offs

I made several deliberate trade-offs:

  1. Synchronous API Calls: The /analyze and /analyze-batch endpoints are synchronous. For a production system, I would move the slow LLM API calls to a background task queue (like Celery) to prevent blocking and API timeouts.

  2. Simplified Search: The search functionality uses a basic SQL LIKE query. This is inefficient for large datasets and lacks linguistic intelligence; it should be replaced with a dedicated full-text search engine (like Elasticsearch) or a vector database for semantic similarity search.

  3. SQLite Database: I used SQLite because it requires zero setup. For any multi-user or high-write application, I would migrate to PostgreSQL to handle concurrency and scale effectively.

  4. Further Improvements: what_i_would_improve.md discusses additional improvements in more depth.
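
The LIKE-based search in trade-off 2 amounts to a substring match over the stored topic and keyword columns. A self-contained sqlite3 sketch of why that is both simple and limited (the table and column names are illustrative, not the repository's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE analyses (id INTEGER PRIMARY KEY, topics TEXT, keywords TEXT)")
conn.executemany(
    "INSERT INTO analyses (topics, keywords) VALUES (?, ?)",
    [("machine learning, apis", "model, inference"),
     ("databases", "sqlite, postgres")],
)

def search(q: str) -> list:
    # Substring match only: "learn" matches "learning", but "learnt" finds nothing,
    # and a leading-% LIKE cannot use an index, so every query is a full table scan.
    pattern = f"%{q}%"
    return conn.execute(
        "SELECT id FROM analyses WHERE topics LIKE ? OR keywords LIKE ?",
        (pattern, pattern),
    ).fetchall()
```

A full-text or vector-based engine would handle stemming, ranking, and semantic similarity that this query cannot.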

Setup and Run Instructions

You can run this application either locally using a Python virtual environment or via Docker.

1. Running with Docker (Recommended)

Prerequisites:

  • Docker installed and running.

Instructions:

  1. Create the environment file: Copy the example file and add your OpenAI API key.

    cp .env.example .env
    # Now, edit the .env file and add your key
    # OPENAI_API_KEY="sk-..."
    # OPENAI_MODEL="your_open_ai_compatible_model"
    # OPENAI_BASE_URL="your_open_ai_base_url"
  2. Build the Docker image:

    docker build -t llm-extractor .
  3. Run the Docker container: This command will start the application and forward port 8000.

    docker run --rm -p 8000:8000 --env-file .env llm-extractor

The API is now running and accessible at http://localhost:8000.

2. Running Locally (for Development)

Prerequisites:

  • Python 3.9+
  • A Python virtual environment tool (venv)

Instructions:

  1. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  
  2. Install dependencies:

    pip install -r requirements.txt
  3. Create the environment file: Copy the example and add your OpenAI API key.

    cp .env.example .env
    # Edit the .env file with your key
  4. Run the application: The --reload flag will automatically restart the server when you make code changes.

    uvicorn app.main:app --reload

The API is now running and accessible at http://127.0.0.1:8000.

How to Use the API

Once the application is running, the easiest way to interact with the API is through the auto-generated documentation.

  1. Open your browser and navigate to http://127.0.0.1:8000/docs.

  2. You will see the Swagger UI, which provides an interactive interface for all available endpoints.

Key Endpoints:

  • POST /analyze: Accepts a JSON object with a single text field. It processes the text, stores the result, and returns the full analysis object.
  • POST /analyze-batch: Accepts a JSON object with a texts field (a list of strings) and returns a list of analysis objects.
  • GET /search?q={query}: Searches for stored analyses where the query string matches one of the extracted topics or keywords.
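
For scripted use outside the Swagger UI, the /analyze call is an ordinary JSON POST. A stdlib sketch of constructing the request (the payload shape follows the endpoint description above; uncomment the call with the server running):

```python
import json
import urllib.request

payload = json.dumps({"text": "FastAPI is a modern Python web framework."}).encode("utf-8")
req = urllib.request.Request(
    "http://127.0.0.1:8000/analyze",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# With the server running:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))  # full analysis object
```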

Running Tests

To ensure the application is working as expected, you can run the integration test suite.

  1. Make sure you have installed the development dependencies:

    pip install -r requirements.txt
  2. Run pytest from the root directory:

    PYTHONPATH=. pytest tests/test_api.py -v
