The search engine follows a modular architecture with the following components:
- Crawler - Discovers and fetches web pages
- Parser - Extracts content from HTML pages
- Indexer - Creates searchable indexes using Whoosh
- Query Engine - Processes search queries
- Ranking System - Ranks results based on relevance
- Scheduler - Manages automated crawling tasks
- Web Interface - Flask-based user interface
- Web Crawling: Respects robots.txt (a sketch of such a check follows this list), handles relative URLs, and manages crawl depth
- Content Parsing: Extracts titles, meta descriptions, headings, and main content
- Full-Text Search: Powered by Whoosh search library with stemming
- Custom Ranking: Multi-factor ranking algorithm considering title matches, content relevance, freshness, and more
- Scheduled Crawling: Automated crawl jobs with configurable schedules
- Web Interface: Clean, responsive web UI for searching and administration
- Admin Panel: Manage crawl jobs, optimize indexes, and configure ranking weights
- API Endpoints: RESTful API for integration with other applications
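The README doesn't spell out how the robots.txt handling works; as a rough illustration, Python's standard-library `urllib.robotparser` is enough for this kind of check (the user-agent string below is an assumption, not the project's actual one):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt before crawling it.
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# Skip any URL the site disallows for our (assumed) user agent.
if robots.can_fetch("SimpleSearchBot/1.0", "https://example.com/private/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```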
Requirements:
- Python 3.7+
- Flask
- Whoosh
- BeautifulSoup4
- Requests
- Schedule
- NLTK
Installation:

1. Clone or download the project:
   `cd d:\168.se`
2. Install dependencies:
   `uv pip install requests beautifulsoup4 flask whoosh nltk lxml urllib3 schedule`
3. Run the demo (optional):
   `python demo.py`
4. Start the web interface:
   `cd search_engine`, then `python app.py`
5. Open your browser:
   - Visit: http://localhost:5000
   - Admin panel: http://localhost:5000/admin
Project layout:

```
search_engine/
├── components/
│   ├── __init__.py
│   ├── crawler.py          # Web crawler
│   ├── parser.py           # HTML parser
│   ├── indexer.py          # Search indexer
│   ├── query_engine.py     # Query processor
│   ├── scheduler.py        # Crawl scheduler
│   └── ranking.py          # Ranking algorithm
├── templates/
│   ├── base.html           # Base template
│   ├── index.html          # Home page
│   ├── search_results.html # Search results
│   ├── admin.html          # Admin panel
│   └── error.html          # Error page
├── data/                   # Data directory (created automatically)
│   ├── index/              # Search index files
│   └── crawl_config.json   # Crawl job configuration
├── app.py                  # Flask web application
└── search_engine.py        # Main search engine class
```
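The schema inside `indexer.py` isn't shown in this README; as a hedged sketch of what a Whoosh index with stemming can look like (the field names and storage choices here are assumptions, not the project's actual schema):

```python
import os

from whoosh import index
from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import DATETIME, ID, TEXT, Schema

# Assumed fields mirroring what the parser extracts; the stemming
# analyzer lets a query for "running" match documents containing "run".
schema = Schema(
    url=ID(stored=True, unique=True),
    title=TEXT(stored=True, analyzer=StemmingAnalyzer()),
    content=TEXT(analyzer=StemmingAnalyzer()),
    crawled_at=DATETIME(stored=True),
)

os.makedirs("data/index", exist_ok=True)
ix = index.create_in("data/index", schema)

writer = ix.writer()
writer.add_document(url="https://example.com", title="Example Domain",
                    content="This domain is for use in examples.")
writer.commit()
```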
Basic usage:

```python
from search_engine import SearchEngine

# Initialize search engine
engine = SearchEngine()

# Crawl and index websites
engine.crawl_and_index([
    "https://example.com",
    "https://httpbin.org"
], max_pages=50, max_depth=2)

# Search
results = engine.search("example query", limit=10)
for result in results:
    print(f"{result['title']} - {result['url']}")
```
```python
# Add a scheduled job
job_id = engine.add_crawl_job(
    name="Daily News Crawl",
    seed_urls=["https://news.example.com"],
    schedule_type="daily",
    schedule_time="02:00",
    max_pages=100,
    max_depth=3
)

# Start the scheduler
engine.start_scheduler()
```

Web interface pages:
- Home Page: Search interface with suggestions
- Search Results: Ranked results with pagination
- Admin Panel:
  - Manual crawling
  - Scheduled job management
  - Index optimization
  - Ranking weight configuration
The ranking algorithm uses the following weights, all configurable via the admin panel (a sketch of how they combine follows the list):
- `title_match`: 3.0 - Matches in page title
- `content_match`: 1.0 - Matches in page content
- `meta_description_match`: 2.0 - Matches in meta description
- `heading_match`: 2.5 - Matches in headings
- `url_match`: 1.5 - Matches in URL
- `freshness`: 1.0 - How recently the page was crawled
- `content_length`: 0.5 - Optimal content length scoring
- `depth_penalty`: -0.2 - Penalty for deeper pages
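Exactly how these weights combine isn't documented here; a minimal sketch of a plain weighted sum over per-factor scores (the factor values passed in are made up for illustration):

```python
# Default weights from the list above.
WEIGHTS = {
    "title_match": 3.0, "content_match": 1.0, "meta_description_match": 2.0,
    "heading_match": 2.5, "url_match": 1.5, "freshness": 1.0,
    "content_length": 0.5, "depth_penalty": -0.2,
}

def combined_score(factor_scores):
    # Weighted sum over whichever factors were computed for this page;
    # depth_penalty's negative weight turns page depth into a deduction.
    return sum(WEIGHTS[name] * value for name, value in factor_scores.items())

# A strong title match on a page two links deep (illustrative numbers).
print(combined_score({"title_match": 1.0, "content_match": 0.3, "depth_penalty": 2.0}))
```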
Crawler configuration:
- `max_pages`: Maximum pages to crawl per job
- `max_depth`: Maximum depth to crawl from seed URLs
- `delay`: Delay between requests (default: 1 second)
- Respects robots.txt automatically
API endpoints (see the example request below):
- `GET /api/search?q=query&limit=10` - Search API
- `GET /api/suggestions?q=partial&limit=5` - Query suggestions
- `POST /admin/crawl` - Manual crawl trigger
- `POST /admin/add_job` - Add scheduled job
- `POST /admin/run_job/<id>` - Run job immediately
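A quick way to exercise the search endpoint from Python, using the `requests` dependency (the app must be running as described under Installation; the JSON response shape is not documented here, so treat it as an assumption):

```python
import requests

# Hit the running Flask app's search API on its default port.
resp = requests.get(
    "http://localhost:5000/api/search",
    params={"q": "example query", "limit": 10},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # response structure depends on app.py
```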
The search engine's ranking algorithm considers the following signals (a freshness-decay sketch follows the list):
- TF-IDF for content relevance
- Position bias (terms appearing early get higher scores)
- Title and heading emphasis
- Freshness decay (newer content ranked higher)
- Content length optimization
- URL relevance
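The exact formulas aren't given in this README; as one plausible reading of "freshness decay", an exponential half-life works (the seven-day half-life is an assumption):

```python
import time

def freshness(crawled_at, half_life_days=7.0):
    # Exponential decay: a just-crawled page scores 1.0, a page crawled
    # one half-life ago scores 0.5, two half-lives ago 0.25, and so on.
    age_days = max(0.0, (time.time() - crawled_at) / 86400)
    return 0.5 ** (age_days / half_life_days)

print(freshness(time.time()))               # ~1.0 (just crawled)
print(freshness(time.time() - 14 * 86400))  # ~0.25 (two weeks old)
```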
Scheduling features (a background-thread sketch follows this list):
- Daily, weekly, or hourly schedules
- Background processing without blocking the web interface
- Job status tracking and management
- Automatic index updates
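The scheduler component itself isn't shown here; a minimal sketch of non-blocking scheduling with the `schedule` package from the requirements list (the job body is a placeholder, not the project's real crawl code):

```python
import threading
import time

import schedule  # the `schedule` package listed under Requirements

def crawl_job():
    print("running scheduled crawl")  # placeholder for a real crawl + reindex

# Register a daily job at 02:00, mirroring the add_crawl_job example above.
schedule.every().day.at("02:00").do(crawl_job)

def run_pending_forever():
    while True:
        schedule.run_pending()
        time.sleep(30)

# A daemon thread keeps the Flask interface responsive while jobs run.
threading.Thread(target=run_pending_forever, daemon=True).start()
```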
Search features (a multi-field query sketch follows this list):
- Multi-field search across title, content, meta description, and headings
- Query suggestions based on indexed content
- Popular queries tracking
- Pagination for large result sets
- Score breakdown for debugging and optimization
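Multi-field search in Whoosh is typically done with `MultifieldParser`; a sketch against the assumed schema from the indexer example above (it runs once that sketch has created `data/index`, and the field names must match whatever `indexer.py` actually defines):

```python
from whoosh import index
from whoosh.qparser import MultifieldParser

ix = index.open_dir("data/index")

# Parse the query against several fields at once (assumed field names).
parser = MultifieldParser(["title", "content"], schema=ix.schema)
query = parser.parse("example query")

with ix.searcher() as searcher:
    # search_page gives built-in pagination: page 1, ten hits per page.
    page = searcher.search_page(query, 1, pagelen=10)
    for hit in page:
        print(hit["url"], round(hit.score, 3))
```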