KaushikML/Plaig-Dectector

Plagiarism & AI Content Detector

Hybrid Web Search + Local Embeddings System (Hackathon Edition)

Demo: https://www.youtube.com/watch?v=FHVAjsTcN1c

PROJECT OVERVIEW

This project is a full-stack plagiarism and AI-content detection system that analyzes uploaded documents against real web sources and generates detailed PDF reports.

The system is designed with reliability and graceful degradation in mind:

  • Core plagiarism detection runs fully offline using local embeddings
  • External LLMs (Google Gemini) are optional enhancements
  • If LLM quota is unavailable, the system still completes analysis

This makes the project robust for demos, hackathons, and real-world usage.


CORE IDEA

Plagiarism detection should not fail because an LLM quota is exhausted.

To achieve this, the system uses:

  • Web search and scraping for source discovery
  • Local semantic embeddings (E5) for similarity scoring
  • Optional LLM usage for query refinement and AI probability
  • Strict tunable thresholds to reduce irrelevant matches

PROJECT STRUCTURE

```
Plaig-Dectector/
│
├── backend/
│   ├── app.py              Flask entry point
│   ├── core_detector.py    Plagiarism & AI detection logic
│   ├── auth.py             Authentication routes
│   ├── database.py         MongoDB connection
│   ├── config.py           Configuration & env loading
│   ├── .env                Environment variables
│   ├── requirements.txt
│   └── .venv/              Python virtual environment
│
├── frontend/               React (Vite)
└── README.md
```


FEATURES

User Management

  • User registration and login
  • Secure session-based authentication
  • User-specific report history

Plagiarism Detection

  • Supports PDF, DOCX, TXT, and raw text input
  • Web-scale comparison using Serper (Google Search API)
  • Scrapes real web pages
  • Semantic similarity using local E5 embeddings
  • Fuzzy matching using RapidFuzz
  • Configurable precision/recall tuning
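The scraping step described above can be sketched as follows. This is an illustrative version, not the repo's actual code: the function name is hypothetical, and only the general BeautifulSoup pattern (drop non-visible tags, extract text, collapse whitespace) is assumed.

```python
import requests
from bs4 import BeautifulSoup

# Sketch of the page-scraping step (function name is illustrative):
# fetch a search result and keep only the visible text for comparison.
def extract_page_text(url: str) -> str:
    html = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-visible content before text extraction
    return " ".join(soup.get_text(" ").split())  # collapse whitespace runs
```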

AI Content Detection (Optional)

  • Uses Google Gemini when quota is available
  • Automatically falls back if unavailable
  • Never blocks plagiarism analysis
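The fallback behavior might look like the sketch below. The function names and prompt are hypothetical, not the repo's actual API; the point is the shape of the pattern: the Gemini call is wrapped so any quota, key, or parsing failure degrades to "no AI score" instead of aborting the run.

```python
# Illustrative sketch of the graceful-degradation pattern; names are hypothetical.
def ai_probability(text: str):
    """Return an AI-likelihood estimate in [0, 1], or None if Gemini is unavailable."""
    try:
        import google.generativeai as genai  # optional dependency
        model = genai.GenerativeModel("gemini-2.0-flash")
        reply = model.generate_content(
            "Rate from 0 to 100 how likely this text is AI-generated. "
            "Answer with the number only:\n" + text[:2000]
        )
        return float(reply.text.strip()) / 100.0
    except Exception:
        return None  # missing key, exhausted quota, or bad reply: degrade silently

def analyze(text: str) -> dict:
    # The offline result is always produced; the AI score is attached only if available.
    report = {"word_count": len(text.split())}
    score = ai_probability(text)
    if score is not None:
        report["ai_probability"] = score
    return report
```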

Reporting

  • Originality score
  • Overlap excerpts with source URLs
  • Fuzzy and semantic similarity scores
  • Search queries used
  • AI probability (if available)
  • Downloadable PDF reports

Dashboard

  • User statistics
  • Analysis history
  • Report downloads

ARCHITECTURE OVERVIEW

```
User Upload
    ↓
Text Extraction
    ↓
Query Generation
    ├─ Gemini (optional)
    └─ Keyword fallback
    ↓
Serper Web Search
    ↓
Web Scraping
    ↓
Local E5 Embeddings (Offline)
    ↓
Similarity Scoring
    ↓
PDF Report Generation
```
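The Serper search stage can be sketched as below. The endpoint and `X-API-KEY` header follow Serper's public API; the function name, parameters, and error handling are illustrative assumptions, not the repo's actual code.

```python
import requests

# Sketch of the "Serper Web Search" stage: Serper returns Google results as JSON.
def search_web(query: str, api_key: str, num: int = 10) -> list:
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        json={"q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()  # surface quota/auth errors to the caller
    return [hit["link"] for hit in resp.json().get("organic", [])]
```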


LOCAL EMBEDDINGS (WHY E5)

The system uses SentenceTransformers with the model: intfloat/e5-large-v2

Advantages:

  • Runs completely locally
  • No API keys or quotas
  • Strong semantic similarity performance
  • Ideal for hackathons and demos

This removes the biggest reliability risk in plagiarism systems.


TUNING KNOBS

Defined in core_detector.py:

```python
MAX_QUERIES   = 6     # search queries generated per document
MAX_PAGES     = 30    # web pages scraped per run
WINDOW_SENT   = 5     # sentences per comparison window
TH_FUZZ       = 72    # fuzzy-match threshold (0-100)
TH_COS        = 0.86  # cosine-similarity threshold
MAX_OVERL_URL = 3     # overlaps reported per URL
```

These control precision vs recall.

Recommended usage:

  • Resume: stricter thresholds, MAX_OVERL_URL = 1–2
  • Academic report: balanced defaults
  • Blog/article: slightly relaxed thresholds
  • Legal documents: very strict thresholds

TECH STACK

Backend

  • Python
  • Flask
  • MongoDB
  • SentenceTransformers (E5 embeddings)
  • RapidFuzz
  • BeautifulSoup
  • ReportLab
  • NLTK
  • Serper API
  • Google Gemini (optional)

Frontend

  • React
  • Vite
  • React Router
  • Custom CSS

ENVIRONMENT VARIABLES

Create backend/.env:

```
FLASK_SECRET_KEY=your_secret_key
MONGO_URI=mongodb://localhost:27017/plagiarism
SERPER_API_KEY=your_serper_key
DEBUG=True
```

Optional (LLM features):

```
GOOGLE_API_KEY=your_gemini_key
GEMINI_MODEL=gemini-2.0-flash
```

Gemini is optional. Plagiarism detection works without it.
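A sketch of how config.py might load these variables follows. The variable names match the README; the defaults and the `LLM_ENABLED` flag are illustrative assumptions, and python-dotenv is treated as optional so the sketch also works from plain shell exports.

```python
import os

# Sketch of env loading; defaults and LLM_ENABLED are illustrative.
try:
    from dotenv import load_dotenv  # python-dotenv, if installed
    load_dotenv()                   # reads backend/.env into os.environ
except ImportError:
    pass                            # fall back to whatever the shell exported

FLASK_SECRET_KEY = os.getenv("FLASK_SECRET_KEY", "dev-only-secret")
MONGO_URI = os.getenv("MONGO_URI", "mongodb://localhost:27017/plagiarism")
SERPER_API_KEY = os.getenv("SERPER_API_KEY")      # required for web search
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")      # optional: absent disables Gemini
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-2.0-flash")
DEBUG = os.getenv("DEBUG", "False").lower() == "true"

LLM_ENABLED = GOOGLE_API_KEY is not None  # checked before any Gemini call
```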


BACKEND SETUP (WITH .venv)

```shell
cd backend
python -m venv .venv
.\.venv\Scripts\activate   # Windows
pip install -r requirements.txt
python app.py
```

Backend runs at: http://127.0.0.1:5000


FRONTEND SETUP

```shell
cd frontend
npm install
npm run dev
```

Frontend runs at: http://127.0.0.1:5173


DESIGN PHILOSOPHY

  • LLMs are enhancements, not dependencies
  • Local ML for correctness
  • Graceful degradation
  • Transparent reporting
  • Configurable precision

This project is built with production thinking, not just demo logic.
