semantic-search-engine

A lightweight semantic search system built over the 20 Newsgroups corpus — featuring fuzzy clustering, a custom-built semantic cache, and a FastAPI service with a minimal search UI.

Built for the Trademarkia AI/ML Engineer task.


System Architecture

User Query
    │
    ▼
┌─────────────────────────┐
│   Semantic Cache        │  Cluster-aware lookup — O(N/K) not O(N)
│   (pure Python + numpy) │  Cosine similarity · threshold τ=0.90
└──────────┬──────────────┘
           │ miss
           ▼
┌─────────────────────────┐
│   ChromaDB              │  HNSW index · cosine distance
│   Vector Store          │  16,958 documents · 384-dim embeddings
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│   Fuzzy C-Means         │  k=15 clusters · m=2.0 fuzziness
│   Cluster Structure     │  Membership matrix (16958 × 15)
└─────────────────────────┘

Parts

Part 1 — Embedding & Vector Store (src/ingest.py)

  • Loads 20 Newsgroups via sklearn (subset='all', ~18k documents)
  • Cleans corpus: strips headers, quoted replies, footers, short documents
  • Embeds with all-MiniLM-L6-v2 (384-dim, CPU-friendly, L2-normalised)
  • Persists to ChromaDB with cosine distance index
  • Saves .npy arrays for downstream clustering

Key decisions:

  • Headers removed because they leak category labels into embeddings
  • all-MiniLM-L6-v2 over all-mpnet-base-v2: 3× faster, sufficient quality, lower dimensionality is better for FCM
  • ChromaDB over FAISS: metadata filtering needed for cluster-aware cache lookup
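
The cleaning step described above can be sketched with the standard library alone. This is a rough illustration, not the code in src/ingest.py: the helper name `clean_post`, the regular expressions, and the `min_chars` cut-off are all assumptions made for the example. (scikit-learn's `fetch_20newsgroups` also accepts `remove=('headers', 'footers', 'quotes')`, which covers similar ground.)

```python
import re
from typing import Optional

# Hypothetical rules; the real filters in src/ingest.py may differ.
QUOTE_LINE = re.compile(r"^\s*[>|]")               # quoted-reply lines
ATTRIBUTION = re.compile(r"(writes|wrote):\s*$")   # "X wrote:" lead-ins

def clean_post(raw: str, min_chars: int = 50) -> Optional[str]:
    """Return the cleaned body, or None if too little text remains."""
    # Usenet headers end at the first blank line.
    _, sep, body = raw.partition("\n\n")
    if not sep:
        body = raw  # no header block found; treat everything as body
    # Drop the signature after the conventional "-- " delimiter line.
    body = re.split(r"\n-- ?\n", body, maxsplit=1)[0]
    kept = [
        line for line in body.splitlines()
        if not QUOTE_LINE.match(line) and not ATTRIBUTION.search(line)
    ]
    # Collapse all runs of whitespace into single spaces.
    text = " ".join(" ".join(kept).split())
    return text if len(text) >= min_chars else None
```

Documents for which the helper returns None would simply be skipped before embedding.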

Part 2 — Fuzzy Clustering (src/cluster.py)

  • PCA reduces 384d → 50d before clustering (curse of dimensionality)
  • Fuzzy C-Means implemented from scratch in numpy (no skfuzzy — incompatible with Python 3.12)
  • Uses cosine distance (correct for L2-normalised vectors on a hypersphere)
  • Produces membership matrix U of shape (n_docs, 15) — soft assignments
  • Generates FPC elbow plot, membership distribution, convergence curve, and category heatmap

Key decisions:

  • k=15 not 20: several newsgroups overlap semantically (comp.sys.ibm.pc.hardware + comp.sys.mac.hardware, rec.sport.*)
  • m=2.0 fuzziness: canonical value; as m→1 FCM approaches hard k-means, as m→∞ memberships become uniform (1/k)
  • Cosine distance over Euclidean: Euclidean distances between unit vectors are compressed into [0, 2], which leaves little contrast between near and far documents
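
For readers unfamiliar with Fuzzy C-Means, one iteration can be sketched as follows. The repository implements this in numpy; the version below uses plain lists so it runs without dependencies, and it is an illustration under stated assumptions rather than the code in src/cluster.py. In particular, plugging cosine distance directly into the standard membership formula and reusing the weighted-mean centre update from the Euclidean formulation are choices made for this example.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def update_memberships(points, centers, m=2.0, eps=1e-10):
    """Membership step: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    power = 2.0 / (m - 1.0)
    U = []
    for p in points:
        # eps guards against a zero distance when a point sits on a centre.
        d = [max(cosine_distance(p, c), eps) for c in centers]
        U.append([1.0 / sum((d_j / d_k) ** power for d_k in d) for d_j in d])
    return U

def update_centers(points, U, m=2.0):
    """Centre step: mean of the points weighted by u_ij^m (assumed update)."""
    dim = len(points[0])
    centers = []
    for j in range(len(U[0])):
        w = [row[j] ** m for row in U]
        total = sum(w)
        centers.append(
            [sum(wi * p[t] for wi, p in zip(w, points)) / total for t in range(dim)]
        )
    return centers

def fpc(U):
    """Fuzzy partition coefficient: mean squared membership (1.0 = crisp)."""
    return sum(u * u for row in U for u in row) / len(U)
```

Alternating `update_memberships` and `update_centers` until the memberships stop changing gives the matrix U; each row sums to 1, and the largest entry in a row is that document's dominant cluster.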

Part 3 — Semantic Cache (src/cache.py)

  • Built from scratch — pure Python dicts + numpy, no Redis, no caching libraries
  • Cluster-bucketed: cache entries are partitioned by dominant cluster
  • Lookup scans only the matching cluster bucket → roughly O(N/K) comparisons instead of O(N), assuming entries are spread evenly across the K clusters
  • Exact-match fast path via SHA-256 hash
  • Cosine similarity threshold τ (default 0.90) determines cache hit

Threshold behaviour:

| τ    | Behaviour                              |
|------|----------------------------------------|
| 0.99 | Near-exact duplicates only             |
| 0.95 | Very close paraphrases                 |
| 0.90 | Clear paraphrases (default)            |
| 0.85 | Related queries; false positives begin |
| 0.80 | Loosely related; risk of wrong results |
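
The lookup logic described in this part can be sketched roughly as below. This is not the SemanticCache class from src/cache.py: the class name `SemanticCacheSketch`, the method names, the lower-casing of queries before hashing, and the use of plain lists instead of numpy arrays are all assumptions made to keep the example self-contained.

```python
import hashlib
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCacheSketch:
    """Illustrative cluster-bucketed cache (hypothetical API)."""

    def __init__(self, threshold=0.90):
        self.threshold = threshold
        self.exact = {}                    # sha256(query) -> (query, result)
        self.buckets = defaultdict(list)   # cluster id -> [(embedding, query, result)]

    @staticmethod
    def _key(query):
        # Assumed normalisation: trim and lower-case before hashing.
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def get(self, query, embedding, cluster):
        """Return (result, matched_query, score) on a hit, else None."""
        hit = self.exact.get(self._key(query))
        if hit is not None:                # exact-match fast path
            return hit[1], hit[0], 1.0
        best = None
        # Only the bucket of the query's dominant cluster is scanned.
        for emb, cached_query, result in self.buckets.get(cluster, []):
            score = cosine(embedding, emb)
            if score >= self.threshold and (best is None or score > best[2]):
                best = (result, cached_query, score)
        return best

    def put(self, query, embedding, cluster, result):
        self.exact[self._key(query)] = (query, result)
        self.buckets[cluster].append((embedding, query, result))
```

One consequence of bucketing is visible in the sketch: a paraphrase whose dominant cluster differs from that of the cached query is never compared against it, so it misses regardless of τ.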

Part 4 — FastAPI Service (api/main.py)

Endpoints

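| Method | Path         | Description                |
|--------|--------------|----------------------------|
| POST   | /query       | Semantic search with cache |
| GET    | /cache/stats | Cache state                |
| DELETE | /cache       | Flush cache                |
| GET    | /health      | Health check               |
| GET    | /            | Search UI                  |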

Response (cache miss):

{
  "query": "What are the health risks of smoking?",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": {
    "documents": [
      { "rank": 1, "text": "...", "category": "sci.med", "similarity": 0.62 }
    ],
    "total_found": 5
  },
  "dominant_cluster": 13
}

Response (cache hit):

{
  "query": "How dangerous are cigarettes?",
  "cache_hit": true,
  "matched_query": "What are the health risks of smoking?",
  "similarity_score": 0.91,
  "result": { ... },
  "dominant_cluster": 13
}
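
A client for the /query endpoint might look like the sketch below. The response fields are taken from the examples above, but the request body shape `{"query": "..."}` is an assumption, since only responses are documented here; check http://localhost:8000/docs for the actual schema. The helper names are hypothetical.

```python
import json
import urllib.request

def build_query_request(text, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /query request."""
    # Assumption: the service accepts a JSON body of the form {"query": "..."}.
    body = json.dumps({"query": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def summarise(response):
    """Turn a decoded /query response into a one-line summary."""
    if response["cache_hit"]:
        return (
            f"hit ({response['similarity_score']:.2f}) "
            f"via {response['matched_query']!r}"
        )
    top = response["result"]["documents"][0]
    return f"miss, top result from {top['category']} (similarity {top['similarity']:.2f})"
```

With the service running, the request can be sent with `urllib.request.urlopen(req)` and the body decoded with `json.load` before passing it to `summarise`.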

Quick Start

Local (recommended)

# 1. Clone
git clone https://github.com/Dhanishta-codes/semantic-search-engine.git
cd semantic-search-engine

# 2. Create and activate virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run Part 1 — embed corpus and build vector store (~10-15 min first time)
python src/ingest.py

# 5. Run Part 2 — fuzzy clustering (~5 min)
python src/cluster.py

# 6. Start the API
uvicorn api.main:app --host 0.0.0.0 --port 8000

Open http://localhost:8000 for the search UI. Open http://localhost:8000/docs for the Swagger API docs.


Docker

Run the data pipeline locally first (steps 4–5 above), then:

docker-compose up --build

The container mounts chroma_db/, embeddings/, and clustering/ as volumes. Service starts on http://localhost:8000.


Environment Variables

| Variable      | Default | Description                  |
|---------------|---------|------------------------------|
| SIM_THRESHOLD | 0.90    | Cache similarity threshold τ |
| N_CLUSTERS    | 15      | Must match clustering run    |
| N_RESULTS     | 5       | Documents returned per query |
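
One way these variables could be read at startup is sketched below. The variable names and defaults come from the table above; the `Settings` dataclass, the `load_settings` helper, and the range check are hypothetical and may not match how api/main.py actually loads its configuration.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    sim_threshold: float = 0.90
    n_clusters: int = 15
    n_results: int = 5

def load_settings(env=None):
    """Read settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    settings = Settings(
        sim_threshold=float(env.get("SIM_THRESHOLD", "0.90")),
        n_clusters=int(env.get("N_CLUSTERS", "15")),
        n_results=int(env.get("N_RESULTS", "5")),
    )
    # Assumed sanity check: cosine thresholds outside (0, 1] make no sense here.
    if not 0.0 < settings.sim_threshold <= 1.0:
        raise ValueError("SIM_THRESHOLD must be in (0, 1]")
    return settings
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test.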

Project Structure

semantic-search-engine/
├── src/
│   ├── ingest.py          # Part 1: corpus cleaning, embedding, ChromaDB
│   ├── cluster.py         # Part 2: PCA + Fuzzy C-Means + analysis
│   └── cache.py           # Part 3: SemanticCache class
├── api/
│   └── main.py            # Part 4: FastAPI service
├── index.html             # Search UI (served at /)
├── embeddings/            # Generated: .npy arrays (gitignored)
├── clustering/            # Generated: membership matrix + plots (gitignored)
├── chroma_db/             # Generated: vector store (gitignored)
├── requirements.txt
├── Dockerfile
└── docker-compose.yml

Stack

| Component    | Choice                | Reason                                                |
|--------------|-----------------------|-------------------------------------------------------|
| Embeddings   | all-MiniLM-L6-v2      | Fast CPU inference, 384-dim, no API key               |
| Vector store | ChromaDB              | Local persistent, metadata filtering, cosine distance |
| Clustering   | Fuzzy C-Means (numpy) | Soft assignments, Python 3.12 compatible              |
| Cache        | Pure Python + numpy   | As required: no Redis or caching middleware           |
| API          | FastAPI + uvicorn     | As required                                           |
| Docker       | python:3.11-slim      | All deps compatible, pre-baked model weights          |
