Skip to content

margato/aws-bedrock-document-entity-extractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

Document Entity Extractor (AWS Bedrock PoC)

Table of Contents

Overview

This project is a Streamlit web application for extracting structured entities from uploaded documents (PDF, PNG, JPG, etc.) using AWS Bedrock Agents.

Architecture Architecture

Users can define custom extraction fields and view the results. This enables automation of manual processes, reducing back-office costs.

UI UI

AWS Bedrock

This proof of concept was tested on a AWS Bedrock agent using Nova Lite model.

Moreover, this repository is not intended to demonstrate how to implement an agent in AWS Bedrock. However, if you want to reproduce it, the prompt to create one in your account is provided below.

Agent Prompt:

You are a document extraction agent.

You will receive:
- A document in PDF, PNG, or other file extension.
- A list of extraction objects, each with:
    - `key`: the entity key to extract.
    - `description`: a clear description of what the entity represents.

Your task:
- For each object, extract the corresponding value from the text.
- Return a JSON with each `key` mapped to the extracted value.
- If the value is not found, return `null` for that key.
- Do not include any explanations or extra text, only the JSON output.

Example Input:
{
  "fields": [
    {
      "key": "name",
      "description": "Candidate's name"
    },
    {
      "key": "candidate_resume_summary",
      "description": "Make a summary about the candidate resume"
    },
    {
      "key": "current_job_role",
      "description": "Candidate's current job role"
    }
  ]
}

Expected Output:
{
    "name": "John Doe",
    "candidate_resume_summary": "Experienced Java Developer. Builds scalable enterprise solutions.",
    "current_job_role": "Java Developer"
}

If you cannot find an entity based on the description, return `null` for that key.

Features

  • Upload documents for entity extraction
  • Results displayed in a user-friendly format
  • Uses AWS Bedrock via boto3

Prerequisites

  • AWS credentials and Bedrock agent implementation

  • The following .env file in app/.env (example):

    AWS_REGION=your-region
    AWS_ACCESS_KEY_ID=your-access-key-id
    AWS_SECRET_ACCESS_KEY=your-secret-access-key
    AWS_BEDROCK_AGENT_ID=your-bedrock-agent-id
    AWS_BEDROCK_AGENT_ALIAS_ID=your-bedrock-agent-alias-id
    

Installation

  1. Clone this repository.

  2. Create and activate a Python virtual environment:

    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies:

    pip install -r app/requirements.txt
  4. Create your .env file in app/.env as shown above.

Running Locally

Start the Streamlit app:

streamlit run app/main.py

Open the provided local URL in your browser to use the application.

About

Document Entity Extraction using AWS Bedrock

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages