Distributed Search Engine

A scalable, fault-tolerant distributed search engine built with Python, a MapReduce-style architecture, and TCP/IP socket programming. It is designed for real-time web crawling, indexing, and retrieval.

Features

Parallel Crawling & Indexing: Efficient Unix-based crawling pipelines powered by Python's multiprocessing.
Distributed Ranking Algorithms: Implements scalable document ranking using TF-IDF and BM25 models.
Fault Tolerance: Node communication via TCP/IP sockets with recovery support.
Information Retrieval Optimized: Achieved a 45% improvement in search response and document relevance.
Unix/Linux Compatible: Built and tested on Linux systems for optimal performance.
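
As a rough illustration of the parallel crawling and indexing feature above, the sketch below parses already-fetched pages across worker processes with Python's multiprocessing. It is not taken from the repository: the names LinkParser, extract_links, and extract_links_parallel are hypothetical, and the standard-library html.parser stands in for BeautifulSoup so the example has no third-party dependencies.

```python
from html.parser import HTMLParser
from multiprocessing import Pool
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page):
    """Take one (url, html) pair and return (url, absolute outgoing links)."""
    url, html = page
    parser = LinkParser()
    parser.feed(html)
    return url, [urljoin(url, href) for href in parser.links]


def extract_links_parallel(pages, workers=2):
    """Parse many pages in worker processes; results keep the input order."""
    with Pool(processes=workers) as pool:
        return pool.map(extract_links, pages)
```

When run as a script, extract_links_parallel should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Fetching itself is network-bound and is left out of this sketch.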

Architecture Overview

Crawler: Fetches and stores HTML pages concurrently.
Indexer: Processes and tokenizes documents to build an inverted index.
Ranker: Uses the BM25 algorithm to rank results for each search query.
Socket Layer: Handles communication between distributed components over TCP/IP.
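
To make the Indexer and Ranker roles above more concrete, here is a minimal single-process sketch of an inverted index scored with BM25. It is an assumption about how such components could look, not the repository's code: the function names, the regex tokenizer, and the parameter defaults k1=1.5 and b=0.75 are illustrative choices.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index from {doc_id: text}.

    Returns (index, lengths), where index maps term -> {doc_id: term frequency}
    and lengths maps doc_id -> number of tokens in that document.
    """
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.5, b=0.75):
    """Return [(doc_id, score)] sorted by descending BM25 score."""
    n_docs = len(lengths)
    if n_docs == 0:
        return []
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        # Smoothed IDF variant that stays positive for very common terms.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In a distributed setup, each node could build an index over its own shard of documents and a coordinator could merge the per-shard result lists; that merging step is not shown here.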

Tech Stack

Python, BeautifulSoup, Requests
Multiprocessing, Socket Programming
UNIX/Linux Environments
Information Retrieval: TF-IDF, BM25
TCP/IP Networking
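
The socket programming and TCP/IP items above imply some message protocol between nodes, which the README does not describe. The sketch below shows one common approach, assumed here purely for illustration: JSON messages with a 4-byte length prefix. The names send_msg, recv_exact, and recv_msg are hypothetical.

```python
import json
import socket
import struct


def send_msg(sock, obj):
    """Send one JSON-serializable object, prefixed with its length in bytes."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, since a single recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # An empty read means the peer closed the connection.
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock):
    """Receive one length-prefixed JSON message and decode it."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))
```

Raising ConnectionError when a peer disappears gives the caller a single place to hook in recovery logic, such as reconnecting or reassigning the node's work. What the project's recovery support actually does is not stated in the source.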

Folders and files

crawler
indexer
ranker
utils
README.md
main.py
requirements.txt

AnanyaJindal1145/Distributed-Search-Engine