
Telsho/Extrai


Extrai


Python CI/CD codecov Python 3.12 MIT License

Documentation

📖 Description

extrai extracts data from text documents using LLMs, formatting the output into a given SQLModel and registering it in a database.

The library uses a Consensus Mechanism to improve accuracy. It sends the same request multiple times, to the same or different providers, and keeps only the values on which enough responses agree, as set by a configurable threshold.
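As a rough illustration of the idea (not extrai's actual implementation), the sketch below votes field by field across several extraction results and keeps a value only when its share of the responses exceeds a threshold. The function name `field_consensus`, its `threshold` parameter, and the sample data are all assumptions made for this example.

```python
from collections import Counter
from typing import Any


def field_consensus(
    revisions: list[dict[str, Any]], threshold: float = 0.5
) -> dict[str, Any]:
    """Keep each field's most common value if its share of revisions exceeds threshold.

    Hypothetical helper for illustration only; extrai's own consensus logic may differ.
    """
    if not revisions:
        return {}
    result: dict[str, Any] = {}
    keys = {key for revision in revisions for key in revision}
    for key in sorted(keys):
        # Count votes per value; repr() makes unhashable values (lists, dicts) countable.
        votes = Counter(repr(r[key]) for r in revisions if key in r)
        winner, count = votes.most_common(1)[0]
        if count / len(revisions) > threshold:
            # Recover the original (un-repr'd) value of the winning vote.
            result[key] = next(
                r[key] for r in revisions if key in r and repr(r[key]) == winner
            )
    return result


# Three hypothetical LLM responses for the same document.
revisions = [
    {"name": "SuperWidget", "price": 99.99},
    {"name": "SuperWidget", "price": 99.99},
    {"name": "Super Widget", "price": 99.0},
]
consensus = field_consensus(revisions, threshold=0.5)
print(consensus)
```

With a stricter threshold such as `0.9`, neither field would reach agreement here and the result would be empty, which is the trade-off a threshold controls: higher values favor precision over completeness.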

extrai also has other features, like generating SQLModels from a prompt and documents, and generating few-shot examples. For complex, nested data, the library offers Hierarchical Extraction, breaking down the extraction into manageable, hierarchical steps. It also includes built-in analytics to monitor performance and output quality.

✨ Key Features

📚 Documentation

For a complete guide, please see the full documentation. Here are the key sections:

βš™οΈ Workflow Overview

The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see Architecture Overview):

graph TD
    %% Define styles for different stages for better colors
    classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e
    classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3
    classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f
    classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d
    classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87

    subgraph "Inputs (Static Mode)"
        A["📄<br/>Documents"]
        B["🏛️<br/>SQLAlchemy Models"]
        L1["🤖<br/>LLM"]
    end

    subgraph "Inputs (Dynamic Mode)"
        C["📋<br/>Task Description<br/>(User Prompt)"]
        D["📚<br/>Example Documents"]
        L2["🤖<br/>LLM"]
    end

    subgraph "Model Generation<br/>(Optional)"
        MG("🔧<br/>Generate SQLModels<br/>via LLM")
    end

    subgraph "Data Extraction"
        EG("📝<br/>Example Generation<br/>(Optional)")
        P("✍️<br/>Prompt Generation")

        subgraph "LLM Extraction Revisions"
            direction LR
            E1("🤖<br/>Revision 1")
            H1("💧<br/>SQLAlchemy Hydration 1")
            E2("🤖<br/>Revision 2")
            H2("💧<br/>SQLAlchemy Hydration 2")
            E3("🤖<br/>...")
            H3("💧<br/>...")
        end

        F("🤝<br/>JSON Consensus")
        H("💧<br/>SQLAlchemy Hydration")
    end

    subgraph Outputs
        SM["🏛️<br/>Generated SQLModels<br/>(Optional)"]
        O["✅<br/>Hydrated Objects"]
        DB("💾<br/>Database Persistence<br/>(Optional)")
    end

    %% Connections for Static Mode
    L1 --> P
    A --> P
    B --> EG
    EG --> P
    P --> E1
    P --> E2
    P --> E3
    E1 --> H1
    E2 --> H2
    E3 --> H3
    H1 --> F
    H2 --> F
    H3 --> F
    F --> H
    H --> O
    H --> DB

    %% Connections for Dynamic Mode
    L2 --> MG
    C --> MG
    D --> MG
    MG --> EG
    EG --> P

    MG --> SM

    %% Apply styles
    class A,B,C,D,L1,L2 inputStyle;
    class P,E1,E2,E3,H,EG processStyle;
    class F consensusStyle;
    class O,DB,SM outputStyle;
    class MG modelGenStyle;
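To make the flow in the diagram concrete, here is a simplified sketch of the extraction path: one prompt fans out to several revisions, the outputs are reduced to a single agreed result, and that result is hydrated into an object. It is only an approximation under stated assumptions: `fake_llm` is a hypothetical stand-in for a real provider, a plain dataclass replaces the SQLModel, the consensus step is a simple majority over whole JSON outputs, and the per-revision hydration shown in the diagram is skipped.

```python
import asyncio
import json
from collections import Counter
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float


async def fake_llm(prompt: str, revision: int) -> str:
    """Hypothetical stand-in for an LLM call; returns a JSON string."""
    await asyncio.sleep(0)
    # The last of three revisions disagrees on purpose to exercise the consensus step.
    price = 89.99 if revision == 2 else 99.99
    return json.dumps({"name": "SuperWidget", "price": price})


async def extract(document: str, revisions: int = 3) -> Product:
    # Prompt generation.
    prompt = f"Extract a product as JSON from: {document}"
    # Revisions: send the same prompt several times, concurrently.
    raw = await asyncio.gather(*(fake_llm(prompt, i) for i in range(revisions)))
    # Consensus: pick the most common output after normalizing key order.
    canonical = [json.dumps(json.loads(r), sort_keys=True) for r in raw]
    winner, _ = Counter(canonical).most_common(1)[0]
    # Hydration: turn the agreed JSON into a typed object.
    return Product(**json.loads(winner))


product = asyncio.run(extract("The new SuperWidget costs $99.99."))
print(product)
```

Running the revisions concurrently keeps total latency close to that of a single request, which matters because consensus multiplies the number of LLM calls.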

▶️ Getting Started

📦 Installation

Install the library from PyPI:

pip install extrai-workflow

✨ Usage Example

For a more detailed guide, please see the Getting Started Tutorial.

Here is a minimal example:

import asyncio
from typing import Optional
from sqlmodel import Field, Session, SQLModel, create_engine, select
from extrai.core import WorkflowOrchestrator
from extrai.llm_providers.huggingface_client import HuggingFaceClient

# 1. Define your data model
class Product(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    price: float

# 2. Set up the orchestrator
llm_client = HuggingFaceClient(api_key="YOUR_HF_API_KEY")
engine = create_engine("sqlite:///:memory:")
# Create the product table before writing to it (a no-op if it already exists)
SQLModel.metadata.create_all(engine)
orchestrator = WorkflowOrchestrator(
    llm_client=llm_client,
    db_engine=engine,
    root_model=Product,
)

# 3. Run the extraction and verify
text = "The new SuperWidget costs $99.99."
with Session(engine) as session:
    asyncio.run(orchestrator.synthesize_and_save([text], db_session=session))
    # session.exec(select(...)) is the SQLModel-idiomatic form of session.query(...)
    product = session.exec(select(Product)).first()
    print(product)
    # Expected: name='SuperWidget' price=99.99 id=1

🚀 More Examples

For more in-depth examples, see the /examples directory in the repository.

🙌 Contributing

We welcome contributions! Please see the Contributing Guide for details on how to set up your development environment, run tests, and submit a pull request.

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.