Process Semantic Layer - Proof of Concept

A lightweight semantic layer demonstrating how a concept graph guides enterprise document retrieval. This PoC combines business process concepts, document tagging, embeddings, and vector similarity search in a working Python pipeline.

Purpose

The core question: Can a simple concept graph—representing business processes—meaningfully improve how AI systems retrieve enterprise knowledge?

This project tests that hypothesis with a working system, showing that structured business knowledge (concepts like "employee onboarding" or "payroll processing") can guide retrieval better than embeddings or keywords alone.

The system:

  1. Maps queries to business concepts using name and synonym matching
  2. Filters documents based on concept tags (semantic pre-filtering)
  3. Ranks filtered documents using embedding similarity (semantic relevance)
  4. Returns structured results with matched concepts and relevance scores

This architecture reflects real-world applied AI patterns where domain knowledge guides retrieval.


Context: Why This Matters Now

As a business process management consultant, I've watched organizations struggle to connect their AI investments with their actual operational knowledge. We have decades of work in business semantics—standards like SBVR (Semantics of Business Vocabulary and Rules)—that defined how to capture business meaning formally.

Then AI came along, and we threw it all away for "let the LLM figure it out."

This experiment asks: What if we didn't?

What if we took the structured business vocabulary work seriously—but made it lightweight enough for the AI age? Not heavyweight ontologies. Not complex rule engines. Just: here are our business concepts, here are the synonyms, here's how they relate.

This PoC tests whether that middle ground—structured enough to guide, simple enough to maintain—actually works.

Full disclosure: I'm not an AI researcher. I'm not a knowledge graph expert. I'm a process consultant learning in public. If you spot issues or see better approaches, I want to hear them. This is about asking good questions, not claiming perfect answers.


Architecture

Query Text
    ↓
┌─────────────────────┐
│  Concept Matching   │ ← data/concepts.yaml (business knowledge)
└─────────────────────┘
    ↓
┌─────────────────────┐
│ Document Filtering  │ ← Tagged documents (concept-labeled)
└─────────────────────┘
    ↓
┌─────────────────────┐
│ Embedding Ranking   │ ← Sentence-transformers embeddings
└─────────────────────┘
    ↓
Top K Results (doc_id, title, concepts, score, snippet)

Components

  • Concept Graph (data/concepts.yaml): 12 business process concepts with names, synonyms, and relationships
  • Documents (data/documents/): 5 markdown process documents with frontmatter metadata
  • Concept Matching (src/concept_graph.py): Case-insensitive substring matching against concept names/synonyms (sketched after this list)
  • Document Tagging (src/document_loader.py): Automatic concept assignment during document loading
  • Embedding Engine (src/embedding_engine.py): Sentence-transformers for vectorization and cosine similarity
  • Retrieval Pipeline (src/retrieval_pipeline.py): Orchestrates concept filtering → embedding ranking
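
A minimal sketch of that matching step, assuming a hypothetical Concept model and match_concepts helper (the real signatures live in src/concept_graph.py and may differ):

from dataclasses import dataclass, field

@dataclass
class Concept:
    id: str
    name: str
    synonyms: list[str] = field(default_factory=list)

def match_concepts(query: str, concepts: list[Concept]) -> list[str]:
    """Return ids of concepts whose name or any synonym appears in the query."""
    query_lower = query.lower()
    matched = []
    for concept in concepts:
        terms = [concept.name] + concept.synonyms
        if any(term.lower() in query_lower for term in terms):
            matched.append(concept.id)
    return matched

# "new hire" is a synonym of employee_onboarding, so this query matches it
onboarding = Concept("employee_onboarding", "Employee Onboarding", ["new hire", "orientation"])
print(match_concepts("how do new hires get benefits?", [onboarding]))  # ['employee_onboarding']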

Quick Start

1. Install Dependencies

pip install -r requirements.txt

This installs:

  • pydantic - Data models
  • pyyaml - Concept graph parsing
  • sentence-transformers - Embedding generation
  • numpy + scikit-learn - Vector operations

2. Run the Demo

Option A: Automated Demo (Recommended)

python run_demo.py

Runs 4 test queries and displays:

  • Concept matching visualization
  • Document filtering steps
  • Ranking with confidence scores
  • Performance summary dashboard

Option B: Interactive Mode

python run_demo.py --interactive

Enter your own queries and see results in real-time.

What happens on startup:

  1. Loads 12 concepts from YAML
  2. Loads 5 markdown documents
  3. Tags each document with matched concepts
  4. Generates embeddings using all-MiniLM-L6-v2 model
  5. Ready to process queries

Test Results

Query                               Top Result            Confidence   Matched Concepts
"how do new hires get benefits?"    Employee Onboarding   45.5%        employee_onboarding, benefits_enrollment
"processing payroll and tax"        Payroll Guidelines    73.8%        payroll_processing, tax_filing
"vendor contracts and approval"     Vendor Management     69.2%        vendor_management, contract_review
"annual audit requirements"         Compliance Audit      57.0%        compliance_audit, financial_reporting

All 4 test queries returned the correct document as the top result.

The system understands the difference between:

  • Documents that mention a keyword
  • Documents that relate to a business concept

Why This Matters

For Strategy

Most RAG systems are black boxes. You ask a question and get documents back, but you can't explain why those documents were chosen.

This architecture combines:

  • Explicit business logic (concept graph in plain YAML)
  • AI pattern matching (embeddings for semantic similarity)
  • Transparent reasoning (every result shows matched concepts)

Result: AI that aligns with how your organization actually thinks and can explain its decisions.

For Compliance

Query: "show me payroll tax documents"

Without semantic layer:
Result: [mixed documents, no explanation]

With semantic layer:
Result: "Payroll Guidelines" (73.8% confidence)
Matched: payroll_processing, tax_filing

For regulated industries (finance, healthcare, legal), this explainability isn't optional—it's essential.

For Knowledge Management

The concept graph is just YAML:

- id: employee_onboarding
  name: Employee Onboarding
  synonyms:
    - new hire
    - joining process
    - orientation
  related_to:
    - benefits_enrollment
    - timekeeping

Business analysts can edit this. No ontology engineers required. Documents auto-tag themselves based on these concepts during ingestion.
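
For example, a new process document needs only ordinary frontmatter; concept tags are assigned automatically when the loader matches the text against the graph. The fields below are illustrative, not the exact schema used in data/documents/:

---
title: Employee Onboarding Checklist
department: HR Operations
---

New hires complete orientation during their first week, including
benefits enrollment and timekeeping setup.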


Project Structure

process-semantic-layer/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── run_demo.py                    # Demo script (automated + interactive modes)
├── data/
│   ├── concepts.yaml             # Business concept graph (12 concepts)
│   └── documents/                # Process markdown documents (5 docs)
│       ├── doc1_onboarding.md
│       ├── doc2_payroll.md
│       ├── doc3_procurement.md
│       ├── doc4_compliance.md
│       └── doc5_reporting.md
└── src/
    ├── __init__.py
    ├── models.py                 # Pydantic data models
    ├── concept_graph.py          # Concept loading and matching
    ├── document_loader.py        # Document parsing and tagging
    ├── embedding_engine.py       # Embedding generation and ranking
    └── retrieval_pipeline.py     # Orchestration logic

Design Decisions

Why a Concept Graph?

Real enterprise systems deal with domain-specific terminology. A concept graph captures business vocabulary (e.g., "onboarding" = "new hire" = "orientation") that pure embedding models might miss. This PoC shows how explicit semantic structures complement ML-based retrieval.

Why Tag Documents in Advance?

Pre-tagging documents with concepts enables semantic filtering: we can narrow the search space before computing similarity. This is faster and more explainable than embedding-only approaches.
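
A minimal sketch of the filtering idea, assuming each document carries a list of concept tags (the dictionary layout is illustrative): only documents whose tags overlap the query's matched concepts go on to the embedding stage.

def filter_by_concepts(documents, matched_concept_ids):
    """Keep only documents tagged with at least one matched concept."""
    matched = set(matched_concept_ids)
    return [doc for doc in documents if matched & set(doc["concepts"])]

docs = [
    {"doc_id": "doc2_payroll", "concepts": ["payroll_processing", "tax_filing"]},
    {"doc_id": "doc3_procurement", "concepts": ["vendor_management"]},
]
print(filter_by_concepts(docs, ["payroll_processing"]))  # only doc2_payroll survives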

Why Sentence-Transformers?

The all-MiniLM-L6-v2 model provides:

  • Fast embedding generation (384-dim vectors)
  • Good semantic understanding for short documents
  • No external API dependencies
  • Suitable for local development and PoCs
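
As a self-contained sketch of the ranking step (the project's own orchestration lives in src/embedding_engine.py and may differ), encoding and cosine similarity look roughly like this:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

docs = [
    "Payroll is processed on the 25th of each month; tax withholdings are filed quarterly.",
    "New hires complete benefits enrollment during their first week.",
]
doc_embeddings = model.encode(docs)              # shape: (2, 384)
query_embedding = model.encode(["processing payroll and tax"])

scores = cosine_similarity(query_embedding, doc_embeddings)[0]
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
for text, score in ranked:
    print(f"{score:.3f}  {text[:60]}")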

Why In-Memory?

This PoC prioritizes clarity over scale. In-memory storage keeps the code simple. A production system would use vector databases (Pinecone, Weaviate, Qdrant) and persistent storage.


What This Proves (And What It Doesn't)

✅ Proven

  1. Lightweight concept graphs work - 12 concepts + synonyms is enough to see value
  2. Hybrid retrieval beats pure embedding search for domain-specific queries
  3. Explainability scales - Each result traces back to matched concepts
  4. Implementation is practical - Plain YAML, standard Python libraries, no specialized infrastructure

❌ Not Proven (Yet)

  1. Scale - Tested on 5 documents, not 5,000
  2. Maintenance - How do you keep concepts aligned as business evolves?
  3. Multi-domain - Does this work across different business units?
  4. Production hardening - No auth, caching, or monitoring built in

This is a directional experiment, not a production system.


Questions This Raises

For executives:

  • What are the 10 most critical process concepts in your business?
  • Can your AI systems reason about them explicitly?
  • Do you need to explain your AI's decisions for compliance?

For AI strategists:

  • When does semantic layer complexity outweigh benefits?
  • How do you measure concept graph ROI?
  • Should concept definitions be centralized or federated?

For knowledge managers:

  • Who owns concept definitions—IT or business SMEs?
  • Can concepts emerge automatically from document patterns?
  • What's the governance model for concept evolution?

What Could Come Next

This PoC opens several directions for exploration:

Immediate Extensions

  • Scale testing: Test on 100+ documents across multiple domains
  • Multi-hop reasoning: Traverse concept relationships to expand queries
  • Automatic concept discovery: Use LLMs to suggest new concepts from document patterns
  • Integration: Connect to existing enterprise search or knowledge management systems

Bigger Research Questions

  • Can LLMs help maintain concept graphs automatically as business processes change?
  • How do you align concept graphs across merged organizations?
  • What does "process interoperability" look like with semantic layers?
  • Can semantic layers satisfy AI governance requirements (EU AI Act, etc.)?

Production Considerations

  • Vector database integration for scale
  • Concept versioning and change management
  • Multi-tenant concept graphs (different teams, different concepts)
  • Active learning feedback loops (users correct matches → concepts improve)

Extending This PoC

Add more concepts: Expand data/concepts.yaml with additional business terms and relationships

Add more documents: Place markdown files in data/documents/ with YAML frontmatter

Improve matching: Implement fuzzy matching or stemming in concept_graph.py
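
For instance, fuzzy matching could be approximated with the standard library's difflib so that near-misses like "on-boarding" still hit the "onboarding" vocabulary entry (a sketch, not part of concept_graph.py today):

import difflib

vocabulary = ["onboarding", "new hire", "orientation", "payroll processing"]

def fuzzy_match(term: str, cutoff: float = 0.8) -> list[str]:
    """Return vocabulary entries that are close spellings of the term."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)

print(fuzzy_match("on-boarding"))  # ['onboarding']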

Better snippets: Extract sentences around matched keywords instead of the first N characters
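
One lightweight way to do this (illustrative only) is to split on sentence boundaries and return the sentence containing the match plus its neighbours:

import re

def snippet_around(text: str, keyword: str, window: int = 1) -> str:
    """Return the sentence containing the keyword plus `window` neighbouring sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for i, sentence in enumerate(sentences):
        if keyword.lower() in sentence.lower():
            start = max(0, i - window)
            return " ".join(sentences[start:i + window + 1])
    return text[:200]  # fall back to the first N characters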

Add concept relationships: Use related_to field to expand queries with related concepts
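
A sketch of one-hop expansion using the related_to field from concepts.yaml (the dictionary layout here is illustrative):

related_to = {
    "employee_onboarding": ["benefits_enrollment", "timekeeping"],
    "payroll_processing": ["tax_filing"],
}

def expand_concepts(matched_ids: list[str]) -> set[str]:
    """Add directly related concepts to the matched set (one hop)."""
    expanded = set(matched_ids)
    for concept_id in matched_ids:
        expanded.update(related_to.get(concept_id, []))
    return expanded

print(expand_concepts(["employee_onboarding"]))
# {'employee_onboarding', 'benefits_enrollment', 'timekeeping'}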

Persistent storage: Integrate vector database (Qdrant, Weaviate) for larger document collections

Logging and metrics: Track query patterns, concept hit rates, and retrieval performance


Limitations

This is a proof of concept with intentional constraints:

  • No persistence - Documents and embeddings are loaded into memory on each run
  • Simple matching - Concept matching is a basic substring match (no NLP)
  • No query expansion - Related concepts are loaded but not used yet
  • Single-language - Only handles English text
  • No caching - Every query recomputes similarities
  • Limited error handling - Basic exception handling only
  • Small scale - Tested on 5 documents, 12 concepts

These limitations keep the implementation simple for learning and experimentation.


A Note on Approach

This experiment comes from a business process management perspective, not a pure AI/ML perspective. The concept graph structure draws inspiration from business semantics standards (particularly SBVR) adapted for modern retrieval systems.

What this means:

  • The concepts represent business vocabulary, not arbitrary tags
  • The relationships (related_to) mirror how business processes actually connect
  • The emphasis on explainability reflects enterprise governance needs

What this doesn't mean:

  • This isn't claiming to be the "right" way to do semantic layers
  • There are likely better embedding models, matching algorithms, or architectures
  • Experts in knowledge graphs or NLP may spot obvious improvements

The point: Can process consultants and business analysts—people who understand organizational semantics—contribute meaningfully to AI architecture? Or does this all need to be ML engineering?

I'm testing the former. Feedback welcome.


The Bigger Picture

This experiment sits at the intersection of three trends:

  1. Knowledge Graphs → Semantic Layers
    Moving from heavyweight ontologies to lightweight, task-specific concept models

  2. RAG → Hybrid Retrieval
    Combining symbolic reasoning (concepts) with neural retrieval (embeddings)

  3. Black Box AI → Explainable AI
    Building systems that can articulate their reasoning process

The question isn't "Should we use embeddings or concept graphs?"

The question is: "How do we combine structured business knowledge with statistical learning to build AI that organizations can trust?"

This PoC is one answer. What's yours?


Further Reading

  • QUICKSTART.md - Quick reference for developers

Status: Proof of concept
Purpose: Strategic experiment and architectural learning
Built by: Nick Blackbourn
Use: Fork it, break it, extend it with your concepts

Asking questions, not claiming answers.
