mweiden/search-engine

Search Engine

This is a single-node toy demonstration of a distributed search engine system.

Components:

  • Web Server
    • serves a simple HTML page with a search input text box
    • on submit, logs the query to an analytics log and returns the top 10 search results, ranked by cosine similarity over BERT embeddings; an HNSW index keeps the lookup efficient
  • Analytics cron job
    • reads the analytics log and constructs a Trie with caching to serve autocomplete suggestions
  • Web Crawler cron job
    • builds an inverted index from scraped web pages, starting with Hacker News as the seed URL
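
As a rough illustration of the ranking step the web server performs, the sketch below scores documents by cosine similarity and keeps the best matches. It is not the repository's code: it uses a brute-force scan instead of an HNSW index, tiny hand-written vectors instead of BERT embeddings, and the names `cosine_similarity`, `top_k`, and `docs` are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=10):
    """Return up to k (doc_id, score) pairs, best score first.

    This scans every document; an approximate index such as HNSW
    avoids the full scan at the cost of exactness.
    """
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Hypothetical 2-dimensional "embeddings" standing in for BERT vectors.
docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
results = top_k([1.0, 0.1], docs, k=2)
```

Here `results` lists document `a` first and `c` second, because the query vector points almost entirely along the first axis.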

Running

Prerequisites for running:

  • make
  • Docker
  • A web browser

To run the application

  1. make build
  2. docker-compose up
  3. Open a browser to localhost:3000
  4. Start submitting queries
  5. If you want to refresh the search index, run make inverted_index
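
For readers unfamiliar with the data structure that `make inverted_index` rebuilds, here is a minimal sketch of an inverted index: a mapping from each term to the set of pages that contain it. The tokenizer, the function names, and the example pages are assumptions for illustration only; the crawler's real output format is not described here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical scraped pages keyed by URL.
pages = {
    "https://example.com/a": "Rust and Python tooling",
    "https://example.com/b": "Python search engines",
}
index = build_inverted_index(pages)
both = lookup(index, "python")
only_b = lookup(index, "python search")
```

A single-term query returns that term's posting set directly; a multi-term query intersects the posting sets, so only pages containing all terms remain.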

Note: the autosuggest trie is refreshed every 30 seconds.
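
The autosuggest trie can be pictured with a sketch like the following, which builds a trie from logged queries and returns the most frequent completions of a prefix. It is an assumption-laden illustration rather than the project's implementation: the class and function names are invented, the analytics log is modeled as a plain list of strings, and the caching mentioned above is left out for brevity.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # how many logged queries end at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query):
        """Record one occurrence of a query."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def suggest(self, prefix, limit=5):
        """Return up to `limit` completions of prefix, most frequent first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no logged query starts with this prefix
        completions = []
        stack = [(node, prefix)]
        while stack:  # depth-first walk of the subtree under the prefix
            current, text = stack.pop()
            if current.count > 0:
                completions.append((text, current.count))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Sort by descending frequency, then alphabetically for stable output.
        completions.sort(key=lambda item: (-item[1], item[0]))
        return [text for text, _ in completions[:limit]]


def build_trie(logged_queries):
    """Build a fresh trie, as a periodic refresh job might do."""
    trie = Trie()
    for query in logged_queries:
        query = query.strip().lower()
        if query:
            trie.insert(query)
    return trie


trie = build_trie(["python", "python", "pytorch", "rust"])
suggestions = trie.suggest("py")
```

Rebuilding the whole trie on each refresh keeps the job simple; an incremental update would only insert the log lines written since the previous run.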

Development

Prerequisites for developing:

  • Python 3.13 / Pip

Create a virtual environment

python -m venv .venv
source .venv/bin/activate

Install requirements

make install

Run tests

make test

TODO

  • Move the web crawler cron job to docker-compose: unfortunately, Selenium web_driver is currently not supported in the Docker environment, so you'll have to refresh the index yourself
