OpenIngest builds in-depth AI/RAG knowledge datasets from everyday, messy documentation.
It converts DOCX/PDF files into structured records that include source context, hierarchy, generated Q/A, embeddings, and traceability fields—so the output is retrieval-ready, not just raw text.
- Why OpenIngest
- Workflow
- Extreme Configurability
- Output Architecture (Oracle is an Example)
- Create a Custom Writer in 5 Minutes
- Quick Start
## Why OpenIngest

Real documentation is usually not RAG-ready:
- inconsistent structure
- long procedural blocks
- critical meaning hidden in screenshots
- mixed tables, notes, steps, and troubleshooting
OpenIngest transforms that into structured knowledge with:
- `KONTEKST` (grounded chunk text)
- `PITANJE`/`ODGOVOR` (retrieval-friendly intent + concise answer)
- breadcrumbs + page ranges
- embeddings + model provenance
- stable IDs (`DOC_ID`, `SECTION_ID`, `CHUNK_ID`)
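Put together, a record emitted by the `jsonl` writer might look roughly like the sketch below. The field names are taken from this README, but the exact nesting and value shapes are illustrative assumptions, not a schema guarantee (embedding vector truncated):

```json
{
  "DOC_ID": "doc-7f3a",
  "SECTION_ID": "doc-7f3a/sec-04",
  "CHUNK_ID": "doc-7f3a/sec-04/chunk-02",
  "breadcrumbs": ["User Guide", "Installation", "Troubleshooting"],
  "pages": [12, 13],
  "text": {
    "KONTEKST": "If the installer fails with error 0x80070005, run it again with elevated permissions...",
    "PITANJE": "How do I fix installer error 0x80070005?",
    "ODGOVOR": "Run the installer as administrator and retry the failed step.",
    "custom_fields": {"impact_level": "medium"}
  },
  "embedding": {"model": "text-embedding-3-small", "vector": [0.012, -0.034]}
}
```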
## Workflow

1. Extract blocks from DOCX/PDF
2. Extract images and optionally caption them with a vision model
3. Merge image descriptions inline
4. Compute heading breadcrumbs
5. Detect parent sections
6. Split into child chunks (token/overlap aware)
7. Generate questions, summaries, and keywords
8. Embed chunks
9. Write records via the selected writer
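The token/overlap-aware chunk splitting above can be sketched as a sliding token window. This is a deliberate simplification (whitespace tokenization, no heading/keyword heuristics), not OpenIngest's actual splitter:

```python
def split_into_chunks(text: str, target_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Split text into windows of ~target_tokens tokens, each sharing
    `overlap` tokens with the previous window."""
    tokens = text.split()  # toy tokenizer; real pipelines count model tokens
    if len(tokens) <= target_tokens:
        return [text]
    step = max(1, target_tokens - overlap)  # each window starts `step` tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + target_tokens]))
        if start + target_tokens >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a window boundary retrievable from both neighboring chunks.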
## Extreme Configurability

OpenIngest is intentionally configurable at every stage.
- Config sources: defaults, JSON/YAML config file, environment variables
- Chunking: mode (`procedure`/`structure`/`window`/`semantic`), token targets, overlap, heading/keyword heuristics, OCR language
- Enrichment: language strategy, image captions, summaries, synthetic questions, prompts
- Custom summarizer fields: declare chunk-level dynamic fields (`enum`, `freelist`, `freeparagraph`) and have them generated during chunk enrichment
- Embedding: model, dimensions, batch size, input template
- Pipeline control: per-stage enablement (`extract`, `chunk`, `enrich_text`, `embed`, `write`, etc.)
- Writer mapping: destination field mapping is configurable via `writer.mapping`
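An illustrative config sketch covering a few of these knobs. The `chunking.mode` values, stage names, and `writer.kind`/`writer.mapping` keys come from this README; the nested key names (`target_tokens`, `overlap_tokens`, `stages`) are assumptions, not a documented schema:

```yaml
chunking:
  mode: procedure        # one of: procedure / structure / window / semantic
  target_tokens: 400
  overlap_tokens: 60
pipeline:
  stages:                # per-stage enablement
    extract: true
    chunk: true
    enrich_text: true
    embed: true
    write: true
writer:
  kind: jsonl
  mapping:               # destination field -> record path (direction assumed)
    CONTENT: text.KONTEKST
    QUESTION: text.PITANJE
```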
This means the same core pipeline can be adapted to very different documentation styles, data contracts, and storage backends.
### Custom Summarizer Fields

You can declare custom fields in config, and they will be injected into chunk summarization prompts dynamically.
Each field declaration supports:
- `name`: output field name (letters, numbers, underscores; must start with a letter or underscore)
- `type`: `enum` | `freelist` | `freeparagraph`
- `description`: instruction for what the model should extract/generate
- `required`: optional, defaults to `false`
- `options`: required for `enum`, forbidden for other types
Example:
```yaml
enrichment:
  custom_fields:
    - name: app_id
      type: enum
      description:
        - Classify the application identifier used by this chunk.
        - Use the identifier that best matches the source application or business context.
        - Return exactly one of the configured enum values.
      required: true
      options: ["1", "2", "3", "10"]
    - name: impact_level
      type: enum
      description:
        - Classify the operational impact severity of this chunk.
        - Prefer the highest severity that is directly supported by the chunk content.
      required: true
      options: [low, medium, high]
    - name: affected_modules
      type: freelist
      description:
        - List all modules/systems explicitly mentioned as impacted.
        - Use concise names only.
    - name: operator_note
      type: freeparagraph
      description:
        - Write a short operator-facing paragraph with the key caution.
        - Keep it direct and actionable.
```

Generated values are stored in `text.custom_fields` in each output record, so writer mappings can target paths such as:
- `text.custom_fields.impact_level`
- `text.custom_fields.affected_modules`
- `text.custom_fields.operator_note`
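For example, a writer mapping could route these paths to destination columns. The uppercase column names below are hypothetical, and the mapping direction (destination key to record path) is an assumption about how `writer.mapping` is read:

```yaml
writer:
  mapping:
    IMPACT_LEVEL: text.custom_fields.impact_level
    AFFECTED_MODULES: text.custom_fields.affected_modules
    OPERATOR_NOTE: text.custom_fields.operator_note
```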
## Output Architecture (Oracle is an Example)

OpenIngest output is writer-driven.

- Current built-ins: `jsonl`, `oracle23ai`
- The Oracle writer in this repo reflects a specific schema used by one application at my current firm
- It is provided as a reference implementation, not a platform limitation
You can target alternative database architectures by adding another writer implementation (same `Writer` interface) and selecting it through `writer.kind`.
In other words: Oracle is one adapter example; the architecture is extensible by design.
## Create a Custom Writer in 5 Minutes

1. Create a new file in `OpenIngest/writers/`, for example `mydb_writer.py`.
2. Implement the `Writer` interface from `OpenIngest.writers.base`.
3. Convert each `ChunkRecord` into your DB/API payload.
4. Return a `WriteResult` with counts and destination.
5. Register/select your writer via `writer.kind` in config.
Minimal example:

```python
from OpenIngest.writers.base import Writer, WriteResult
from OpenIngest.models import ChunkRecord


class MyDbWriter(Writer):
    def write(self, records: list[ChunkRecord]) -> WriteResult:
        # map records -> your storage schema, then insert
        inserted = len(records)
        return WriteResult(written=inserted, destination="mydb")
```

Config example:

```yaml
writer:
  kind: mydb
```

Tip: start by copying `OpenIngest/writers/oracle23ai.py` or `OpenIngest/writers/jsonl.py` and replacing only the mapping + write logic.
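If you want to prototype a writer outside the package, the interface shape can be approximated with a small abstract base class. This is a standalone sketch with trimmed-down stand-ins for `ChunkRecord` and `WriteResult`, not the actual `OpenIngest.writers.base` code:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    # Stand-in for OpenIngest's ChunkRecord model (real model has many more fields).
    chunk_id: str
    text: dict = field(default_factory=dict)


@dataclass
class WriteResult:
    written: int
    destination: str


class Writer(ABC):
    @abstractmethod
    def write(self, records: list[ChunkRecord]) -> WriteResult: ...


class InMemoryWriter(Writer):
    """Toy writer that collects records in a list instead of a database."""

    def __init__(self) -> None:
        self.store: list[ChunkRecord] = []

    def write(self, records: list[ChunkRecord]) -> WriteResult:
        self.store.extend(records)
        return WriteResult(written=len(records), destination="memory")
```

An in-memory writer like this is also handy in tests: it lets you assert on exactly which records the pipeline tried to persist.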
## Quick Start

```shell
pip install -e .
```

```shell
uvicorn OpenIngest.serve:app --reload
```

```shell
cd ui
npm install
npm run dev
```

```shell
openingest-ingest /path/to/file.pdf --metadata "{\"app_id\":\"10\"}"
```

Accepted sources: `.pdf`, `.docx`.
Environment variables:

```shell
OPENAI_API_KEY=...
ORACLE_USER=...
ORACLE_PASSWORD=...
ORACLE_DSN=host/service_name
OPENINGEST_VISION_MODEL=gpt-4.1-mini
OPENINGEST_SUMMARIZE_MODEL=gpt-4.1-mini
OPENINGEST_EMBEDDING_MODEL=text-embedding-3-small
OPENINGEST_VISION_MAX_WORKERS=4
OPENINGEST_OPENAI_MAX_RETRIES=4
OPENINGEST_OPENAI_BACKOFF=1.0
ORACLE_TABLE=RAG_CHUNKS
ORACLE_BATCH_SIZE=50
```
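As a sketch of how such settings are typically consumed (illustrative only, not OpenIngest's actual configuration loader):

```python
import os


def env(name: str, default: str, cast=str):
    """Read an environment variable and apply `cast`; fall back to `default` when unset."""
    return cast(os.environ.get(name, default))


# Falls back to the default when the variable is not set in the environment.
max_retries = env("OPENINGEST_OPENAI_MAX_RETRIES", "4", int)
backoff = env("OPENINGEST_OPENAI_BACKOFF", "1.0", float)
```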