OpenIngest

Python FastAPI License: MIT

OpenIngest builds in-depth AI/RAG knowledge datasets from ordinary, messy documentation.

It converts DOCX/PDF files into structured records that include source context, hierarchy, generated Q/A, embeddings, and traceability fields—so the output is retrieval-ready, not just raw text.

Table of Contents

  • Why OpenIngest
  • Workflow
  • Extreme Configurability
  • Custom summarizer fields
  • Output Architecture (Oracle is an Example)
  • Create a Custom Writer in 5 Minutes
  • Quick Start

Why OpenIngest

Real documentation is usually not RAG-ready:

  • inconsistent structure
  • long procedural blocks
  • critical meaning hidden in screenshots
  • mixed tables, notes, steps, and troubleshooting

OpenIngest transforms that into structured knowledge with:

  • KONTEKST ("context": the grounded chunk text)
  • PITANJE / ODGOVOR ("question" / "answer": a retrieval-friendly intent plus a concise answer)
  • breadcrumbs + page ranges
  • embeddings + model provenance
  • stable IDs (DOC_ID, SECTION_ID, CHUNK_ID)
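Putting the fields above together, a single output record might look roughly like this. The nesting and the field names beyond those listed above are illustrative assumptions, not OpenIngest's exact schema:

```python
# Sketch of one retrieval-ready output record, based on the fields listed
# above. Exact structure is an assumption; the writer mapping defines the
# authoritative destination schema.
record = {
    "DOC_ID": "doc-0001",
    "SECTION_ID": "doc-0001/sec-03",
    "CHUNK_ID": "doc-0001/sec-03/chunk-02",
    "breadcrumbs": ["Installation", "Prerequisites"],
    "page_range": [4, 5],
    "KONTEKST": "Grounded chunk text extracted from the source section ...",
    "PITANJE": "How do I install the prerequisites?",
    "ODGOVOR": "Run the setup script before installing the application.",
    "embedding_model": "text-embedding-3-small",  # model provenance
    "embedding": [0.01, -0.02, 0.03],  # truncated for illustration
}
```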

Workflow

  1. Extract blocks from DOCX/PDF
  2. Extract images and optionally caption with vision
  3. Merge image descriptions inline
  4. Compute heading breadcrumbs
  5. Detect parent sections
  6. Split into child chunks (token/overlap aware)
  7. Generate questions/summaries/keywords
  8. Embed chunks
  9. Write records via selected writer

Extreme Configurability

OpenIngest is intentionally configurable at every stage.

  • Config sources: defaults, JSON/YAML config file, environment variables
  • Chunking: mode (procedure/structure/window/semantic), token targets, overlap, heading/keyword heuristics, OCR language
  • Enrichment: language strategy, image captions, summaries, synthetic questions, prompts
  • Custom summarizer fields: declare chunk-level dynamic fields (enum, freelist, freeparagraph) and have them generated during chunk enrichment
  • Embedding: model, dimensions, batch size, input template
  • Pipeline control: per-stage enablement (extract, chunk, enrich_text, embed, write, etc.)
  • Writer mapping: destination field mapping is configurable via writer.mapping

This means the same core pipeline can be adapted to very different documentation styles, data contracts, and storage backends.

Custom summarizer fields

You can declare custom fields in config and they will be injected into chunk summarization prompts dynamically.

Each field declaration supports:

  • name: output field name (letters, numbers, underscore; starts with letter/underscore)
  • type: enum | freelist | freeparagraph
  • description: instruction for what the model should extract/generate
  • required: optional, default false
  • options: required for enum, forbidden for other types
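The declaration rules above can be expressed as a small validator. This is a sketch of the rules exactly as stated, not OpenIngest's actual validation code:

```python
import re

VALID_TYPES = {"enum", "freelist", "freeparagraph"}
# Letters, numbers, underscore; must start with a letter or underscore.
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def validate_field(decl: dict) -> None:
    """Raise ValueError if a custom-field declaration breaks the rules."""
    if not NAME_RE.match(decl.get("name", "")):
        raise ValueError("name must be letters/numbers/underscore, starting with a letter or underscore")
    if decl.get("type") not in VALID_TYPES:
        raise ValueError("type must be enum, freelist, or freeparagraph")
    if decl["type"] == "enum" and not decl.get("options"):
        raise ValueError("enum fields require non-empty options")
    if decl["type"] != "enum" and "options" in decl:
        raise ValueError("options are forbidden for non-enum fields")


validate_field({"name": "impact_level", "type": "enum", "options": ["low", "high"]})
```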

Example:

enrichment:
  custom_fields:
    - name: app_id
      type: enum
      description:
        - Classify the application identifier used by this chunk.
        - Use the identifier that best matches the source application or business context.
        - Return exactly one of the configured enum values.
      required: true
      options: ["1", "2", "3", "10"]

    - name: impact_level
      type: enum
      description:
        - Classify the operational impact severity of this chunk.
        - Prefer the highest severity that is directly supported by the chunk content.
      required: true
      options: [low, medium, high]

    - name: affected_modules
      type: freelist
      description:
        - List all modules/systems explicitly mentioned as impacted.
        - Use concise names only.

    - name: operator_note
      type: freeparagraph
      description:
        - Write a short operator-facing paragraph with the key caution.
        - Keep it direct and actionable.
Generated values are stored in text.custom_fields in each output record, so writer mappings can target paths such as:

  • text.custom_fields.impact_level
  • text.custom_fields.affected_modules
  • text.custom_fields.operator_note
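For example, a writer mapping could route those paths into destination columns. The column names and the column-to-path direction below are illustrative assumptions; check your writer's documentation for the exact mapping shape:

```yaml
writer:
  kind: oracle23ai
  mapping:
    IMPACT_LEVEL: text.custom_fields.impact_level
    AFFECTED_MODULES: text.custom_fields.affected_modules
    OPERATOR_NOTE: text.custom_fields.operator_note
```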

Output Architecture (Oracle is an Example)

OpenIngest output is writer-driven.

  • Current built-ins: jsonl, oracle23ai
  • The Oracle writer in this repo reflects a specific schema used by one application at my current firm
  • It is provided as a reference implementation, not a platform limitation

You can target alternative database architectures by adding another writer implementation (same Writer interface) and selecting it through writer.kind.

In other words: Oracle is one adapter example; the architecture is extensible by design.

Create a Custom Writer in 5 Minutes

  1. Create a new file in OpenIngest/writers/, for example mydb_writer.py.
  2. Implement the Writer interface from OpenIngest.writers.base.
  3. Convert each ChunkRecord into your DB/API payload.
  4. Return a WriteResult with counts and destination.
  5. Register/select your writer via writer.kind in config.

Minimal example:

from OpenIngest.writers.base import Writer, WriteResult
from OpenIngest.models import ChunkRecord


class MyDbWriter(Writer):
    def write(self, records: list[ChunkRecord]) -> WriteResult:
        # Map records to your storage schema, then insert them.
        inserted = len(records)
        return WriteResult(written=inserted, destination="mydb")

Config example:

writer:
  kind: mydb

Tip: start by copying OpenIngest/writers/oracle23ai.py or OpenIngest/writers/jsonl.py and replacing only the mapping + write logic.

Quick Start

Backend

pip install -e .
uvicorn OpenIngest.serve:app --reload

UI

cd ui
npm install
npm run dev

CLI

openingest-ingest /path/to/file.pdf --metadata "{\"app_id\":\"10\"}"

Accepted sources: .pdf, .docx.

Common environment variables

OPENAI_API_KEY=...
ORACLE_USER=...
ORACLE_PASSWORD=...
ORACLE_DSN=host/service_name

OPENINGEST_VISION_MODEL=gpt-4.1-mini
OPENINGEST_SUMMARIZE_MODEL=gpt-4.1-mini
OPENINGEST_EMBEDDING_MODEL=text-embedding-3-small
OPENINGEST_VISION_MAX_WORKERS=4
OPENINGEST_OPENAI_MAX_RETRIES=4
OPENINGEST_OPENAI_BACKOFF=1.0
ORACLE_TABLE=RAG_CHUNKS
ORACLE_BATCH_SIZE=50

About

An intelligent, minimalist solution for a PDF/DOCX workflow of text enrichment -> chunking with metadata extraction -> vector database loading. Built on OpenAI models and the Oracle 23ai database, but modular enough to target other backends.
