This project demonstrates a simple semantic search system built using transformer embeddings and inspired by vector database principles used in Endee.
It converts text documents into vector embeddings and retrieves the most relevant result (Top-1) based on semantic similarity.
Traditional keyword search matches exact words, whereas semantic search compares meaning, so a query can retrieve a document that shares no keywords with it.
This project:
- Converts documents into embeddings using a transformer model
- Stores embeddings as vectors
- Computes cosine similarity
- Returns the most relevant document (Top-1)
- Python
- sentence-transformers
- NumPy
- Transformer Model: all-MiniLM-L6-v2
- Similarity Metric: Cosine similarity
- Transformer-based embeddings using all-MiniLM-L6-v2
- Cosine similarity ranking
- Top-1 semantic retrieval
- Lightweight and efficient implementation
- Modular structure for future scaling (RAG-ready)
- Text documents are stored in data/sample_docs/
- Documents are converted into vector embeddings using SentenceTransformers
- A user query is converted into an embedding
- Cosine similarity is computed between the query and all documents
- Documents are ranked by similarity score and the most similar one (Top-1) is displayed
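
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of search.py: the helper names `load_documents`, `embed`, `rank`, and `search` are hypothetical, and the sketch assumes the documents are `.txt` files under data/sample_docs/.

```python
from pathlib import Path

import numpy as np


def load_documents(folder="data/sample_docs"):
    """Read every .txt file in the folder (assumed layout) into a list of strings."""
    paths = sorted(Path(folder).glob("*.txt"))
    return [p.read_text(encoding="utf-8").strip() for p in paths]


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts into dense vectors with sentence-transformers."""
    # Imported lazily so the ranking helper below can be used without the model.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, convert_to_numpy=True)


def rank(query_vec, doc_vecs):
    """Return (indices, scores) sorted by descending cosine similarity."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    order = np.argsort(-scores)  # highest score first
    return order, scores[order]


def search(query, folder="data/sample_docs"):
    """Hypothetical end-to-end helper: returns (score, text) of the best match."""
    docs = load_documents(folder)
    doc_vecs = embed(docs)
    query_vec = embed([query])[0]
    order, scores = rank(query_vec, doc_vecs)
    return float(scores[0]), docs[order[0]]
```

Because `rank` returns the full ordering, the caller can take only the first entry or slice off the first few, depending on how many results should be shown.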
all-MiniLM-L6-v2
- 384-dimensional sentence embeddings
- Optimized for semantic similarity tasks
- Fast and lightweight transformer model
endee-semantic-search/
│
├── search.py
└── README.md
git clone https://github.com/ad8tea/endee-semantic-search.git
cd endee-semantic-search
python -m venv venv
venv\Scripts\activate # Windows
pip install sentence-transformers numpy
python search.py
health
Top Matching Document:
Score: 0.5298
Text: Staying healthy requires regular exercise and proper nutrition.
- Documents are encoded into dense vectors using a transformer model.
- The query is converted into a vector.
- Cosine similarity is computed between the query vector and document vectors.
- The document with the highest similarity score (Top-1) is returned.
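
To make the scoring step concrete, here is a small worked example. The 3-dimensional vectors are made-up stand-ins for the 384-dimensional embeddings, and the `cosine_similarity` helper is an illustrative assumption rather than code taken from search.py.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-ins for embeddings (real ones have 384 dimensions).
query = [1.0, 1.0, 0.0]
documents = {
    "doc_a": [2.0, 2.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 0.0, 0.0],  # partially aligned with the query
    "doc_c": [0.0, 0.0, 3.0],  # orthogonal to the query
}

scores = {name: cosine_similarity(query, vec) for name, vec in documents.items()}
best = max(scores, key=scores.get)  # Top-1: the highest-scoring document
print(best, round(scores[best], 4))
```

Cosine similarity depends only on direction, so `doc_a` scores highest even though its vector is longer than the query's, while the orthogonal `doc_c` scores zero.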
This project demonstrates:
- Understanding of semantic search
- Use of transformer embeddings
- Vector similarity computation
- Retrieval system design
- Application of vector database concepts
- Integrate a vector database (e.g., Endee, FAISS)
- Add persistent embedding storage
- Implement REST API with FastAPI
- Extend to Retrieval-Augmented Generation (RAG)
- Add evaluation metrics for retrieval performance
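
As one possible starting point for persistent embedding storage, the sketch below saves texts and vectors to a single NumPy `.npz` file and loads them back, so documents would not need to be re-encoded on every run. The function names `save_index` and `load_index` and the file name are hypothetical, and the tiny vectors are placeholders for real embeddings.

```python
import os
import tempfile

import numpy as np


def save_index(path, texts, vectors):
    """Store document texts and their embedding vectors in one .npz archive."""
    np.savez(path, texts=np.array(texts, dtype=str), vectors=np.asarray(vectors))


def load_index(path):
    """Load texts (as a list of str) and vectors (as an array) from the archive."""
    with np.load(path) as data:
        return data["texts"].tolist(), data["vectors"]


# Round-trip demo in a temporary directory with placeholder vectors.
index_path = os.path.join(tempfile.mkdtemp(), "index.npz")
save_index(index_path, ["first doc", "second doc"], [[1.0, 2.0], [3.0, 4.0]])
texts, vectors = load_index(index_path)
print(texts, vectors.shape)
```

A flat file like this suits a small corpus; a vector database such as those listed above would take over once the collection outgrows memory or needs approximate nearest-neighbor search.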
Aditi Thakur
GitHub: https://www.github.com/ad8tea