Finding near-duplicate and related documents in milliseconds using MinHash LSH and distributed computing
Status: 🚧 In Development - Phases 1-2 complete, Phase 3 (query service) next
A production-grade document similarity system that processes streaming documents, identifies near-duplicates, and surfaces related content in real-time.
Use Case: Plagiarism Detection for arXiv Papers
- Dataset: 100K arXiv paper abstracts
- Problem: Identify submitted papers similar to existing work
- Scale: Processing 100K+ documents
- Requirement: Sub-second query response for new submissions
Tech Stack:

- Core Algorithm: MinHash + Locality-Sensitive Hashing (LSH)
- Distributed Computing: Apache Spark
- Storage: Redis (in-memory index)
- API: FastAPI
- Deployment: Docker
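The core algorithm is small enough to sketch in plain Python. Below is a minimal, illustrative MinHash + banded-LSH implementation; the shingle size, 128 permutations, and 32 bands are assumptions for illustration, not the project's tuned values:

```python
import hashlib
from collections import defaultdict

def shingles(text: str, k: int = 3) -> set:
    """Break a document into overlapping word k-shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash(shingle_set: set, num_perm: int = 128) -> list:
    """One minimum per seeded hash function; the probability that two
    signatures agree at a position equals the sets' Jaccard similarity."""
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def lsh_index(signatures: dict, bands: int = 32) -> dict:
    """Split each signature into bands; documents that collide in any
    band bucket become candidate near-duplicates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    return buckets
```

With 128 values in 32 bands of 4 rows, two documents with Jaccard similarity s collide in at least one band with probability 1 - (1 - s^4)^32, so candidate recall rises steeply with similarity while most dissimilar pairs are never compared.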
Roadmap:

- Phase 1: System Architecture & Local Prototype
- Phase 2: Scale with Spark + Distributed Storage
- Phase 3: REST API + Query Service
- Phase 4: Production Deployment & Monitoring
Phase 1 Deliverables:

- Project setup
- Dataset acquisition (arXiv 100K abstracts)
- MinHash implementation
- Text preprocessing pipeline
- Baseline end-to-end demo
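The preprocessing step can be as simple as normalizing case, stripping inline math and punctuation, and dropping high-frequency words. A minimal sketch; the regexes and the stopword list are illustrative assumptions, not the project's actual pipeline:

```python
import re

# Illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "we"}

def preprocess(text: str) -> list:
    """Normalize an arXiv abstract into a token list for shingling."""
    text = re.sub(r"\$[^$]*\$", " ", text)            # drop inline LaTeX math
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # lowercase, strip punctuation
    return [w for w in text.split() if w not in STOPWORDS]
```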
Phase 1 Results (single-threaded baseline):

- Dataset: 10,000 arXiv abstracts
- Technology: Pure Python, single-threaded
- Indexing throughput: 4 docs/sec
- Query latency: 218ms
- Total indexing time: 41 minutes
Phase 2 Results (distributed):

- Dataset: 99,904 arXiv abstracts (10x scale)
- Technology: Apache Spark distributed computing
- Indexing throughput: 172 docs/sec (43x improvement 🚀)
- Query latency: TBD (Phase 3)
- Total pipeline time: 10 minutes
- Storage format: Parquet (compressed, columnar)
Key Achievement: Superlinear speedup through parallelization - 10x more data processed about 4x faster (41 minutes down to 10).
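The shape of the Spark stage is roughly: read the abstracts, map each one to its MinHash signature with a UDF, and persist the signatures as Parquet. A sketch under stated assumptions (input path, JSONL schema with `id` and `abstract` columns, and signature parameters are all illustrative); the Spark imports are kept inside the function so the pure-Python part stays testable without a cluster:

```python
import hashlib

NUM_PERM = 128

def minhash(text: str) -> list:
    """128-value MinHash over word 3-shingles, reduced into signed-64-bit
    range so the values fit Spark's LongType."""
    words = text.lower().split()
    sh = {" ".join(words[i:i + 3]) for i in range(max(len(words) - 2, 1))}
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big") % (2 ** 63)
            for s in sh)
        for seed in range(NUM_PERM)
    ]

def run_pipeline(input_path: str, output_path: str) -> None:
    """Distribute signature computation across the cluster, write Parquet."""
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, LongType

    spark = SparkSession.builder.appName("minhash-pipeline").getOrCreate()
    minhash_udf = F.udf(minhash, ArrayType(LongType()))
    (spark.read.json(input_path)   # assumed JSONL: one row per paper, id + abstract
          .select("id", minhash_udf("abstract").alias("signature"))
          .write.mode("overwrite").parquet(output_path))
```

Because the UDF is a pure function of one row, Spark can shard the corpus across executors with no shuffle until the write, which is what makes the indexing stage embarrassingly parallel.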
Spark Pipeline Scaling:

| Dataset Size | Pipeline Time | Status |
|---|---|---|
| 10K papers | 42 seconds | ✅ Complete |
| 100K papers | 10 minutes | ✅ Complete |
| 1M papers | ~97 minutes | 📊 Projected |
| 2.3M papers (full arXiv) | ~3.7 hours | 📊 Projected |
Benchmarked on: 10-core CPU, 16 GB RAM
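The projected rows follow from linear extrapolation of the measured 172 docs/sec Spark throughput (assuming throughput stays flat at larger scale, with no shuffle or memory pressure):

```python
throughput = 172  # docs/sec, measured on the 100K-document Spark run

for n_docs, label in [(1_000_000, "1M papers"),
                      (2_300_000, "2.3M papers (full arXiv)")]:
    seconds = n_docs / throughput
    print(f"{label}: ~{seconds / 60:.0f} min (~{seconds / 3600:.1f} h)")
```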
Key Findings:
- Query latency: 218ms (sub-second response ✅)
- Median similarity: 0.18 (healthy distribution, no pathological clustering)
- Threshold: a similarity cutoff of 0.3 retains roughly the top 15-20% most similar papers
- Indexing throughput: 4 docs/sec (identified bottleneck for Phase 2 optimization)
The system successfully identifies similar papers across diverse domains (physics, mathematics, computer science) with consistent performance.
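At query time, a new abstract's signature is probed against the banded index, and surviving candidates are ranked by estimated Jaccard similarity against the 0.3 threshold. A self-contained sketch of that lookup (the bucket-key layout of band number plus band values is one common choice, assumed here):

```python
def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of agreeing MinHash positions; an unbiased estimate of
    the Jaccard similarity of the underlying shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def query(new_sig: list, index: dict, signatures: dict,
          bands: int = 32, threshold: float = 0.3) -> list:
    """Probe every band bucket for candidates, then rank the survivors."""
    rows = len(new_sig) // bands
    candidates = set()
    for b in range(bands):
        key = (b, tuple(new_sig[b * rows:(b + 1) * rows]))
        candidates |= index.get(key, set())
    scored = [(doc, estimate_jaccard(new_sig, signatures[doc]))
              for doc in candidates]
    return sorted([(d, s) for d, s in scored if s >= threshold],
                  key=lambda pair: -pair[1])
```

Only documents sharing at least one band bucket are ever scored, which is why query latency stays flat as the corpus grows.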
Project Structure:

```
├── data/           # Dataset files (not tracked in git)
├── docs/           # Documentation
├── notebooks/      # Jupyter notebooks for exploration
├── src/            # Source code
│   ├── similarity/     # MinHash + LSH implementation
│   ├── preprocessing/  # Text processing
│   └── pipeline/       # End-to-end pipeline
├── tests/          # Unit tests
├── README.md
└── requirements.txt
```
Phase 1: ✅ Complete
Portfolio Context: This project demonstrates end-to-end ML system design, scalable algorithms, and production engineering thinking - translating academic ML concepts into deployable systems.
