OpenIngest

Python FastAPI License: MIT

OpenIngest builds in-depth AI/RAG knowledge datasets from ordinary, messy documentation.

It converts DOCX/PDF files into structured records that include source context, hierarchy, generated Q/A, embeddings, and traceability fields—so the output is retrieval-ready, not just raw text.

Table of Contents

  • Why OpenIngest
  • Workflow
  • Extreme Configurability
  • Custom summarizer fields
  • Output Architecture (Oracle is an Example)
  • Create a Custom Writer in 5 Minutes
  • Quick Start

Why OpenIngest

Real documentation is usually not RAG-ready:

  • inconsistent structure
  • long procedural blocks
  • critical meaning hidden in screenshots
  • mixed tables, notes, steps, and troubleshooting

OpenIngest transforms that into structured knowledge with:

  • KONTEKST ("context": the grounded chunk text)
  • PITANJE / ODGOVOR ("question" / "answer": a retrieval-friendly intent plus a concise answer)
  • breadcrumbs + page ranges
  • embeddings + model provenance
  • stable IDs (DOC_ID, SECTION_ID, CHUNK_ID)
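Putting the fields above together, a single output record might look roughly like this. The nesting and the field names beyond those listed above are illustrative assumptions, not OpenIngest's exact schema:

```python
# Sketch of one retrieval-ready output record, based on the fields listed
# above. Exact structure is an assumption; the writer mapping defines the
# authoritative destination schema.
record = {
    "DOC_ID": "doc-0001",
    "SECTION_ID": "doc-0001/sec-03",
    "CHUNK_ID": "doc-0001/sec-03/chunk-02",
    "breadcrumbs": ["Installation", "Prerequisites"],
    "page_range": [4, 5],
    "KONTEKST": "Grounded chunk text extracted from the source section ...",
    "PITANJE": "How do I install the prerequisites?",
    "ODGOVOR": "Run the setup script before installing the application.",
    "embedding_model": "text-embedding-3-small",  # model provenance
    "embedding": [0.01, -0.02, 0.03],  # truncated for illustration
}
```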

Workflow

  1. Extract blocks from DOCX/PDF
  2. Extract images and optionally caption with vision
  3. Merge image descriptions inline
  4. Compute heading breadcrumbs
  5. Detect parent sections
  6. Split into child chunks (token/overlap aware)
  7. Generate questions/summaries/keywords
  8. Embed chunks
  9. Write records via selected writer

Extreme Configurability

OpenIngest is intentionally configurable at every stage.

  • Config sources: defaults, JSON/YAML config file, environment variables
  • Chunking: mode (procedure/structure/window/semantic), token targets, overlap, heading/keyword heuristics, OCR language
  • Enrichment: language strategy, image captions, summaries, synthetic questions, prompts
  • Custom summarizer fields: declare chunk-level dynamic fields (enum, freelist, freeparagraph) and have them generated during chunk enrichment
  • Embedding: model, dimensions, batch size, input template
  • Pipeline control: per-stage enablement (extract, chunk, enrich_text, embed, write, etc.)
  • Writer mapping: destination field mapping is configurable via writer.mapping

This means the same core pipeline can be adapted to very different documentation styles, data contracts, and storage backends.

Custom summarizer fields

You can declare custom fields in config and they will be injected into chunk summarization prompts dynamically.

Each field declaration supports:

  • name: output field name (letters, numbers, underscore; starts with letter/underscore)
  • type: enum | freelist | freeparagraph
  • description: instruction for what the model should extract/generate
  • required: optional, default false
  • options: required for enum, forbidden for other types
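The declaration rules above can be expressed as a small validator. This is a sketch of the rules exactly as stated, not OpenIngest's actual validation code:

```python
import re

VALID_TYPES = {"enum", "freelist", "freeparagraph"}
# Letters, numbers, underscore; must start with a letter or underscore.
NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def validate_field(decl: dict) -> None:
    """Raise ValueError if a custom-field declaration breaks the rules."""
    if not NAME_RE.match(decl.get("name", "")):
        raise ValueError("name must be letters/numbers/underscore, starting with a letter or underscore")
    if decl.get("type") not in VALID_TYPES:
        raise ValueError("type must be enum, freelist, or freeparagraph")
    if decl["type"] == "enum" and not decl.get("options"):
        raise ValueError("enum fields require non-empty options")
    if decl["type"] != "enum" and "options" in decl:
        raise ValueError("options are forbidden for non-enum fields")


validate_field({"name": "impact_level", "type": "enum", "options": ["low", "high"]})
```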

Example:

enrichment:
  custom_fields:
    - name: app_id
      type: enum
      description:
        - Classify the application identifier used by this chunk.
        - Use the identifier that best matches the source application or business context.
        - Return exactly one of the configured enum values.
      required: true
      options: ["1", "2", "3", "10"]

    - name: impact_level
      type: enum
      description:
        - Classify the operational impact severity of this chunk.
        - Prefer the highest severity that is directly supported by the chunk content.
      required: true
      options: [low, medium, high]

    - name: affected_modules
      type: freelist
      description:
        - List all modules/systems explicitly mentioned as impacted.
        - Use concise names only.

    - name: operator_note
      type: freeparagraph
      description:
        - Write a short operator-facing paragraph with the key caution.
        - Keep it direct and actionable.
Generated values are stored in text.custom_fields in each output record, so writer mappings can target paths such as:

  • text.custom_fields.impact_level
  • text.custom_fields.affected_modules
  • text.custom_fields.operator_note
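For example, a writer mapping could route those paths into destination columns. The column names and the column-to-path direction below are illustrative assumptions; check your writer's documentation for the exact mapping shape:

```yaml
writer:
  kind: oracle23ai
  mapping:
    IMPACT_LEVEL: text.custom_fields.impact_level
    AFFECTED_MODULES: text.custom_fields.affected_modules
    OPERATOR_NOTE: text.custom_fields.operator_note
```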

Output Architecture (Oracle is an Example)

OpenIngest output is writer-driven.

  • Current built-ins: jsonl, oracle23ai
  • The Oracle writer in this repo reflects a specific schema used by one application at my current firm
  • It is provided as a reference implementation, not a platform limitation

You can target alternative database architectures by adding another writer implementation (same Writer interface) and selecting it through writer.kind.

In other words: Oracle is one adapter example; the architecture is extensible by design.

Create a Custom Writer in 5 Minutes

  1. Create a new file in OpenIngest/writers/, for example mydb_writer.py.
  2. Implement the Writer interface from OpenIngest.writers.base.
  3. Convert each ChunkRecord into your DB/API payload.
  4. Return a WriteResult with counts and destination.
  5. Register/select your writer via writer.kind in config.

Minimal example:

from OpenIngest.writers.base import Writer, WriteResult
from OpenIngest.models import ChunkRecord


class MyDbWriter(Writer):
    def write(self, records: list[ChunkRecord]) -> WriteResult:
        # Map records to your storage schema, then insert them.
        inserted = len(records)
        return WriteResult(written=inserted, destination="mydb")

Config example:

writer:
  kind: mydb

Tip: start by copying OpenIngest/writers/oracle23ai.py or OpenIngest/writers/jsonl.py and replacing only the mapping + write logic.

Quick Start

Backend

pip install -e .
uvicorn OpenIngest.serve:app --reload

UI

cd ui
npm install
npm run dev

CLI

openingest-ingest /path/to/file.pdf --metadata "{\"app_id\":\"10\"}"

Accepted sources: .pdf, .docx.

Common environment variables

OPENAI_API_KEY=...
ORACLE_USER=...
ORACLE_PASSWORD=...
ORACLE_DSN=host/service_name

OPENINGEST_VISION_MODEL=gpt-4.1-mini
OPENINGEST_SUMMARIZE_MODEL=gpt-4.1-mini
OPENINGEST_EMBEDDING_MODEL=text-embedding-3-small
OPENINGEST_VISION_MAX_WORKERS=4
OPENINGEST_OPENAI_MAX_RETRIES=4
OPENINGEST_OPENAI_BACKOFF=1.0
ORACLE_TABLE=RAG_CHUNKS
ORACLE_BATCH_SIZE=50

About

An intelligent, minimalist solution for a PDF/DOCX workflow of text enrichment -> chunking with metadata extraction -> vector database loading. Built on OpenAI models and the Oracle 23ai database, but modular enough to target other backends.
