A lightweight semantic search system built over the 20 Newsgroups corpus — featuring fuzzy clustering, a custom-built semantic cache, and a FastAPI service with a minimal search UI.
Built for the Trademarkia AI/ML Engineer task.
```
User Query
     │
     ▼
┌─────────────────────────┐
│     Semantic Cache      │  Cluster-aware lookup — O(N/K) not O(N)
│  (pure Python + numpy)  │  Cosine similarity · threshold τ = 0.90
└──────────┬──────────────┘
           │ miss
           ▼
┌─────────────────────────┐
│        ChromaDB         │  HNSW index · cosine distance
│      Vector Store       │  16,958 documents · 384-dim embeddings
└──────────┬──────────────┘
           │
           ▼
┌─────────────────────────┐
│      Fuzzy C-Means      │  k=15 clusters · m=2.0 fuzziness
│    Cluster Structure    │  Membership matrix (16958 × 15)
└─────────────────────────┘
```
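For L2-normalised embeddings, cosine similarity reduces to a dot product, so the ranking the vector store performs can be illustrated by a brute-force sketch (what the HNSW index approximates; function name is illustrative):

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5):
    """Brute-force cosine top-k over L2-normalised vectors.

    query_emb: (d,) unit vector; doc_embs: (n, d) unit rows.
    Returns the indices and similarities of the k best matches.
    """
    sims = doc_embs @ query_emb          # cosine similarity == dot product here
    idx = np.argsort(-sims)[:k]          # descending sort, keep top k
    return idx, sims[idx]
```

ChromaDB's HNSW index returns approximately the same ranking in sub-linear time, which is why exact brute force is only shown here for intuition.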
- Loads 20 Newsgroups via sklearn (`subset='all'`, ~18k documents)
- Cleans corpus: strips headers, quoted replies, footers, and short documents
- Embeds with `all-MiniLM-L6-v2` (384-dim, CPU-friendly, L2-normalised)
- Persists to ChromaDB with a cosine-distance index
- Saves `.npy` arrays for downstream clustering
Key decisions:
- Headers removed because they leak category labels into embeddings
- `all-MiniLM-L6-v2` over `all-mpnet-base-v2`: 3× faster, sufficient quality, and lower dimensionality suits FCM better
- ChromaDB over FAISS: metadata filtering is needed for the cluster-aware cache lookup
- PCA reduces 384d → 50d before clustering (curse of dimensionality)
- Fuzzy C-Means implemented from scratch in numpy (no skfuzzy — incompatible with Python 3.12)
- Uses cosine distance (correct for L2-normalised vectors on a hypersphere)
- Produces a membership matrix `U` of shape `(n_docs, 15)` — soft assignments
- Generates an FPC elbow plot, membership distribution, convergence curve, and category heatmap
Key decisions:
- `k=15`, not 20: several newsgroups overlap semantically (comp.sys.ibm + comp.sys.mac, rec.sport.*)
- `m=2.0` fuzziness: the canonical value — m=1 degenerates to KMeans, m→∞ gives uniform membership
- Cosine distance over Euclidean: unit vectors on a hypersphere have compressed Euclidean ranges
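One iteration of the from-scratch algorithm can be sketched in numpy (assuming L2-normalised inputs; function names are illustrative, not the repo's exact `src/cluster.py`):

```python
import numpy as np

def fcm_step(X: np.ndarray, C: np.ndarray, m: float = 2.0, eps: float = 1e-12):
    """One Fuzzy C-Means update with cosine distance.

    X: (n, d) L2-normalised documents; C: (k, d) current centroids.
    Returns the membership matrix U (n, k) and updated centroids.
    """
    # cosine distance reduces to 1 - dot product for unit vectors
    D = 1.0 - X @ C.T + eps                                   # (n, k)
    # membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
    ratio = (D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1.0))
    U = 1.0 / ratio.sum(axis=2)                               # rows sum to 1
    # centroid update: u^m-weighted mean, re-projected onto the unit sphere
    W = U ** m
    C_new = (W.T @ X) / W.sum(axis=0)[:, None]
    C_new /= np.linalg.norm(C_new, axis=1, keepdims=True) + eps
    return U, C_new

def fpc(U: np.ndarray) -> float:
    """Fuzzy partition coefficient: 1/k (fully uniform) up to 1.0 (crisp)."""
    return float((U ** 2).sum() / U.shape[0])
```

Iterating `fcm_step` until the centroid shift falls below a tolerance yields the convergence curve, and sweeping k while plotting `fpc` gives the elbow plot mentioned above.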
- Built from scratch — pure Python dicts + numpy, no Redis, no caching libraries
- Cluster-bucketed: cache entries are partitioned by dominant cluster
- Lookup scans only the matching cluster bucket → O(N/K) not O(N)
- Exact-match fast path via SHA-256 hash
- Cosine similarity threshold τ (default 0.90) determines cache hit
Threshold behaviour:
| τ | Behaviour |
|---|---|
| 0.99 | Near-exact duplicates only |
| 0.95 | Very close paraphrases |
| 0.90 | Clear paraphrases — default |
| 0.85 | Related queries — false positives begin |
| 0.80 | Loosely related — wrong results risk |
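The cluster-bucketed lookup described above can be sketched in pure Python + numpy (class and method names are illustrative, not the repo's exact `src/cache.py`):

```python
import hashlib
import numpy as np

class SemanticCache:
    """Sketch of a cluster-bucketed semantic cache for unit-norm embeddings."""

    def __init__(self, threshold: float = 0.90):
        self.threshold = threshold
        self.exact = {}     # sha256(query) -> cached result (exact-match fast path)
        self.buckets = {}   # cluster_id -> list of (embedding, query, result)

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str, emb: np.ndarray, cluster_id: int):
        if (k := self._key(query)) in self.exact:   # exact-match fast path
            return self.exact[k]
        # scan only the matching cluster bucket: O(N/K) instead of O(N)
        for cached_emb, _cached_query, result in self.buckets.get(cluster_id, []):
            if float(emb @ cached_emb) >= self.threshold:  # cosine sim, unit vectors
                return result
        return None

    def put(self, query: str, emb: np.ndarray, cluster_id: int, result):
        self.exact[self._key(query)] = result
        self.buckets.setdefault(cluster_id, []).append((emb, query, result))
```

The bucket key is the query's dominant cluster from the membership matrix, so a paraphrase only hits if it lands in the same cluster — a deliberate trade-off that keeps lookups cheap.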
| Method | Path | Description |
|---|---|---|
| POST | `/query` | Semantic search with cache |
| GET | `/cache/stats` | Cache state |
| DELETE | `/cache` | Flush cache |
| GET | `/health` | Health check |
| GET | `/` | Search UI |
Response (cache miss):

```json
{
  "query": "What are the health risks of smoking?",
  "cache_hit": false,
  "matched_query": null,
  "similarity_score": null,
  "result": {
    "documents": [
      { "rank": 1, "text": "...", "category": "sci.med", "similarity": 0.62 }
    ],
    "total_found": 5
  },
  "dominant_cluster": 13
}
```

Response (cache hit):

```json
{
  "query": "How dangerous are cigarettes?",
  "cache_hit": true,
  "matched_query": "What are the health risks of smoking?",
  "similarity_score": 0.91,
  "result": { ... },
  "dominant_cluster": 13
}
```

```bash
# 1. Clone
git clone https://github.com/Dhanishta-codes/semantic-search-engine.git
cd semantic-search-engine

# 2. Create and activate virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run Part 1 — embed corpus and build vector store (~10-15 min first time)
python src/ingest.py

# 5. Run Part 2 — fuzzy clustering (~5 min)
python src/cluster.py

# 6. Start the API
uvicorn api.main:app --host 0.0.0.0 --port 8000
```

Open http://localhost:8000 for the search UI. Open http://localhost:8000/docs for the Swagger API docs.
Run the data pipeline locally first (steps 4–5 above), then:
```bash
docker-compose up --build
```

The container mounts `chroma_db/`, `embeddings/`, and `clustering/` as volumes. The service starts on http://localhost:8000.
| Variable | Default | Description |
|---|---|---|
| `SIM_THRESHOLD` | `0.90` | Cache similarity threshold τ |
| `N_CLUSTERS` | `15` | Must match the clustering run |
| `N_RESULTS` | `5` | Documents returned per query |
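A minimal sketch of how the service might read these settings at startup (the variable names come from the table above; the helper function itself is hypothetical):

```python
import os

def load_settings() -> dict:
    """Read cache/search settings from the environment, with README defaults."""
    return {
        "sim_threshold": float(os.getenv("SIM_THRESHOLD", "0.90")),
        "n_clusters": int(os.getenv("N_CLUSTERS", "15")),
        "n_results": int(os.getenv("N_RESULTS", "5")),
    }
```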
```
semantic-search-engine/
├── src/
│   ├── ingest.py        # Part 1: corpus cleaning, embedding, ChromaDB
│   ├── cluster.py       # Part 2: PCA + Fuzzy C-Means + analysis
│   └── cache.py         # Part 3: SemanticCache class
├── api/
│   └── main.py          # Part 4: FastAPI service
├── index.html           # Search UI (served at /)
├── embeddings/          # Generated: .npy arrays (gitignored)
├── clustering/          # Generated: membership matrix + plots (gitignored)
├── chroma_db/           # Generated: vector store (gitignored)
├── requirements.txt
├── Dockerfile
└── docker-compose.yml
```
| Component | Choice | Reason |
|---|---|---|
| Embeddings | `all-MiniLM-L6-v2` | Fast CPU inference, 384-dim, no API key |
| Vector store | ChromaDB | Local persistence, metadata filtering, cosine distance |
| Clustering | Fuzzy C-Means (numpy) | Soft assignments, Python 3.12 compatible |
| Cache | Pure Python + numpy | As required — no Redis or caching middleware |
| API | FastAPI + uvicorn | As required |
| Docker | `python:3.11-slim` | All deps compatible, pre-baked model weights |