This is a single-node toy demonstration of a distributed search engine.
Components:
- Web Server
  - Serves a simple HTML page with a search input text box
  - On submit, the query is logged to an analytics log, and the top 10 search results, ranked by cosine similarity over BERT embeddings, are returned using an HNSW index for efficiency
- Analytics cron job
  - Reads the analytics log and constructs a Trie with caching to serve autocomplete suggestions
- Web Crawler cron job
  - Builds an Inverted Index from scraped web pages, starting with Hacker News as the seed URL
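To make the ranking step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is a brute-force scan in pure Python, not the HNSW index this project uses; an HNSW index approximates the same ranking without comparing the query against every document. The names `cosine_similarity`, `top_k`, the example URLs, and the 3-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    k: int = 10,
) -> list[tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = ((url, cosine_similarity(query_vec, vec)) for url, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy "embeddings"; real BERT vectors have hundreds of dimensions.
docs = {
    "https://example.com/a": [1.0, 0.0, 0.0],
    "https://example.com/b": [0.7, 0.7, 0.0],
    "https://example.com/c": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], docs, k=2)
for url, score in results:
    print(f"{score:.3f}  {url}")
```

A linear scan like this costs one similarity computation per document per query, which is the cost an approximate nearest-neighbour index such as HNSW is meant to avoid.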
Prerequisites for running:
- make
- Docker
- A web browser
To run the application:
- make build
- docker-compose up
- Open a browser to localhost:3000
- Start submitting queries
- If you want to refresh the search index, run
make inverted_index
Note: the autosuggest trie is refreshed every 30 seconds.
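For readers curious how an autosuggest trie can be rebuilt from logged queries, the following is a rough sketch, not this project's implementation. It assumes the analytics log yields one query string per entry, and it interprets "caching" as storing the most frequent completions directly on each trie node so that a lookup is a single walk down the prefix. `TrieNode`, `build_trie`, `suggest`, and the `limit` parameter are hypothetical names.

```python
from collections import Counter


class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}
        self.top: list[str] = []  # cached suggestions for this prefix


def build_trie(queries: list[str], limit: int = 5) -> TrieNode:
    """Build a trie whose nodes cache the most frequent queries under each prefix."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    root = TrieNode()
    # Insert in descending frequency (ties broken alphabetically), so each
    # node's cache fills with its best completions first.
    for query, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        node = root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            if len(node.top) < limit:
                node.top.append(query)
    return root


def suggest(root: TrieNode, prefix: str) -> list[str]:
    """Return cached suggestions for a prefix, or an empty list if none match."""
    node = root
    for ch in prefix.lower():
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.top


# Hypothetical log contents: one submitted query per entry.
log = ["python", "Python", "pytest", "rust", "python", "  "]
trie = build_trie(log)
print(suggest(trie, "py"))
print(suggest(trie, "r"))
```

A periodic job could call `build_trie` on the current log contents and swap the result in for the old trie. In this sketch an empty prefix returns no suggestions, because the root node carries no cache.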
Prerequisites for developing:
- Python 3.13 / Pip
Create a virtual environment
python -m venv .venv
source .venv/bin/activate
Install requirements
make install
Run tests
make test
- Move the web crawler cron job to docker-compose: unfortunately, Selenium web_driver is currently not supported in the Docker environment, so you'll have to refresh the index yourself
