A lightweight semantic search system built over the 20 Newsgroups corpus — featuring fuzzy clustering, a custom-built semantic cache, and a FastAPI service with a minimal search UI.
Built for the Trademarkia AI/ML Engineer task.
```
User Query
     │
     ▼
┌─────────────────────────┐
│     Semantic Cache      │  Cluster-aware lookup — O(N/K) not O(N)
│  (pure Python + numpy)  │  Cosine similarity · threshold τ = 0.90
└──────────┬──────────────┘
           │ miss
           ▼
┌─────────────────────────┐
│        ChromaDB         │  HNSW index · cosine distance
│      Vector Store       │  16,958 documents · 384-dim embeddings
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│      Fuzzy C-Means      │  k=15 clusters · m=2.0 fuzziness
│    Cluster Structure    │  Membership matrix (16958 × 15)
└─────────────────────────┘
```
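For L2-normalised embeddings, cosine similarity reduces to a dot product, so the ranking the vector store performs can be illustrated by a brute-force sketch (what the HNSW index approximates; function name is illustrative):

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5):
    """Brute-force cosine top-k over L2-normalised vectors.

    query_emb: (d,) unit vector; doc_embs: (n, d) unit rows.
    Returns the indices and similarities of the k best matches.
    """
    sims = doc_embs @ query_emb          # cosine similarity == dot product here
    idx = np.argsort(-sims)[:k]          # descending sort, keep top k
    return idx, sims[idx]
```

ChromaDB's HNSW index returns approximately the same ranking in sub-linear time, which is why exact brute force is only shown here for intuition.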
- Loads 20 Newsgroups via sklearn (`subset='all'`, ~18k documents)
- Cleans corpus: strips headers, quoted replies, footers, and short documents
- Embeds with `all-MiniLM-L6-v2` (384-dim, CPU-friendly, L2-normalised)
- Persists to ChromaDB with a cosine-distance index
- Saves `.npy` arrays for downstream clustering
Key decisions:
- Headers removed because they leak category labels into embeddings
- `all-MiniLM-L6-v2` over `all-mpnet-base-v2`: 3× faster, sufficient quality, and lower dimensionality suits FCM better
- ChromaDB over FAISS: metadata filtering is needed for the cluster-aware cache lookup
- PCA reduces 384d → 50d before clustering (curse of dimensionality)
- Fuzzy C-Means implemented from scratch in numpy (no skfuzzy — incompatible with Python 3.12)
- Uses cosine distance (correct for L2-normalised vectors on a hypersphere)
- Produces a membership matrix `U` of shape `(n_docs, 15)` — soft assignments
- Generates an FPC elbow plot, membership distribution, convergence curve, and category heatmap
Key decisions:
- `k=15`, not 20: several newsgroups overlap semantically (comp.sys.ibm + comp.sys.mac, rec.sport.*)
- `m=2.0` fuzziness: the canonical value — m=1 degenerates to KMeans, m→∞ gives uniform membership
- Cosine distance over Euclidean: unit vectors on a hypersphere have compressed Euclidean ranges
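One iteration of the from-scratch algorithm can be sketched in numpy (assuming L2-normalised inputs; function names are illustrative, not the repo's exact `src/cluster.py`):

```python
import numpy as np

def fcm_step(X: np.ndarray, C: np.ndarray, m: float = 2.0, eps: float = 1e-12):
    """One Fuzzy C-Means update with cosine distance.

    X: (n, d) L2-normalised documents; C: (k, d) current centroids.
    Returns the membership matrix U (n, k) and updated centroids.
    """
    # cosine distance reduces to 1 - dot product for unit vectors
    D = 1.0 - X @ C.T + eps                                   # (n, k)
    # membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
    ratio = (D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1.0))
    U = 1.0 / ratio.sum(axis=2)                               # rows sum to 1
    # centroid update: u^m-weighted mean, re-projected onto the unit sphere
    W = U ** m
    C_new = (W.T @ X) / W.sum(axis=0)[:, None]
    C_new /= np.linalg.norm(C_new, axis=1, keepdims=True) + eps
    return U, C_new

def fpc(U: np.ndarray) -> float:
    """Fuzzy partition coefficient: 1/k (fully uniform) up to 1.0 (crisp)."""
    return float((U ** 2).sum() / U.shape[0])
```

Iterating `fcm_step` until the centroid shift falls below a tolerance yields the convergence curve, and sweeping k while plotting `fpc` gives the elbow plot mentioned above.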
- Built from scratch — pure Python dicts + numpy, no Redis, no caching libraries
- Cluster-bucketed: cache entries are partitioned by dominant cluster
- Lookup scans only the matching cluster bucket → O(N/K) not O(N)
- Exact-match fast path via SHA-256 hash
- Cosine similarity threshold τ (default 0.90) determines cache hit
Threshold behaviour:
| τ | Behaviour |
|---|---|
| 0.99 | Near-exact duplicates only |
| 0.95 | Very close paraphrases |
| 0.90 | Clear paraphrases — default |
| 0.85 | Related queries — false positives begin |
| 0.80 | Loosely related — wrong results risk |
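The cluster-bucketed lookup described above can be sketched in pure Python + numpy (class and method names are illustrative, not the repo's exact `src/cache.py`):

```python
import hashlib
import numpy as np

class SemanticCache:
    """Sketch of a cluster-bucketed semantic cache for unit-norm embeddings."""

    def __init__(self, threshold: float = 0.90):
        self.threshold = threshold
        self.exact = {}     # sha256(query) -> cached result (exact-match fast path)
        self.buckets = {}   # cluster_id -> list of (embedding, query, result)

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str, emb: np.ndarray, cluster_id: int):
        if (k := self._key(query)) in self.exact:   # exact-match fast path
            return self.exact[k]
        # scan only the matching cluster bucket: O(N/K) instead of O(N)
        for cached_emb, _cached_query, result in self.buckets.get(cluster_id, []):
            if float(emb @ cached_emb) >= self.threshold:  # cosine sim, unit vectors
                return result
        return None

    def put(self, query: str, emb: np.ndarray, cluster_id: int, result):
        self.exact[self._key(query)] = result
        self.buckets.setdefault(cluster_id, []).append((emb, query, result))
```

The bucket key is the query's dominant cluster from the membership matrix, so a paraphrase only hits if it lands in the same cluster — a deliberate trade-off that keeps lookups cheap.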
| Method | Path | Description |
|---|---|---|
| POST | `/query` | Semantic search with cache |
| GET | `/cache/stats` | Cache state |
| DELETE | `/cache` | Flush cache |
| GET | `/health` | Health check |
| GET | `/` | Search UI |
Response (cache miss):

```json
{
  "query": "What are the health risks of smoking?",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": {
    "documents": [
      { "rank": 1, "text": "...", "category": "sci.med", "similarity": 0.62 }
    ],
    "total_found": 5
  },
  "dominant_cluster": 13
}
```

Response (cache hit):

```json
{
  "query": "How dangerous are cigarettes?",
  "cache_hit": true,
  "matched_query": "What are the health risks of smoking?",
  "similarity_score": 0.91,
  "result": { ... },
  "dominant_cluster": 13
}
```

```bash
# 1. Clone
git clone https://github.com/Dhanishta-codes/semantic-search-engine.git
cd semantic-search-engine

# 2. Create and activate virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run Part 1 — embed corpus and build vector store (~10-15 min first time)
python src/ingest.py

# 5. Run Part 2 — fuzzy clustering (~5 min)
python src/cluster.py

# 6. Start the API
uvicorn api.main:app --host 0.0.0.0 --port 8000
```

Open http://localhost:8000 for the search UI. Open http://localhost:8000/docs for the Swagger API docs.
Run the data pipeline locally first (steps 4–5 above), then:
```bash
docker-compose up --build
```

The container mounts `chroma_db/`, `embeddings/`, and `clustering/` as volumes. The service starts on http://localhost:8000.
| Variable | Default | Description |
|---|---|---|
| `SIM_THRESHOLD` | `0.90` | Cache similarity threshold τ |
| `N_CLUSTERS` | `15` | Must match the clustering run |
| `N_RESULTS` | `5` | Documents returned per query |
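A minimal sketch of how the service might read these settings at startup (the variable names come from the table above; the helper function itself is hypothetical):

```python
import os

def load_settings() -> dict:
    """Read cache/search settings from the environment, with README defaults."""
    return {
        "sim_threshold": float(os.getenv("SIM_THRESHOLD", "0.90")),
        "n_clusters": int(os.getenv("N_CLUSTERS", "15")),
        "n_results": int(os.getenv("N_RESULTS", "5")),
    }
```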
```
semantic-search-engine/
├── src/
│   ├── ingest.py        # Part 1: corpus cleaning, embedding, ChromaDB
│   ├── cluster.py       # Part 2: PCA + Fuzzy C-Means + analysis
│   └── cache.py         # Part 3: SemanticCache class
├── api/
│   └── main.py          # Part 4: FastAPI service
├── index.html           # Search UI (served at /)
├── embeddings/          # Generated: .npy arrays (gitignored)
├── clustering/          # Generated: membership matrix + plots (gitignored)
├── chroma_db/           # Generated: vector store (gitignored)
├── requirements.txt
├── Dockerfile
└── docker-compose.yml
```
| Component | Choice | Reason |
|---|---|---|
| Embeddings | `all-MiniLM-L6-v2` | Fast CPU inference, 384-dim, no API key |
| Vector store | ChromaDB | Local persistence, metadata filtering, cosine distance |
| Clustering | Fuzzy C-Means (numpy) | Soft assignments, Python 3.12 compatible |
| Cache | Pure Python + numpy | As required — no Redis or caching middleware |
| API | FastAPI + uvicorn | As required |
| Docker | `python:3.11-slim` | All deps compatible, pre-baked model weights |