The search engine follows a modular architecture with the following components:
- Crawler - Discovers and fetches web pages
- Parser - Extracts content from HTML pages
- Indexer - Creates searchable indexes using Whoosh
- Query Engine - Processes search queries
- Ranking System - Ranks results based on relevance
- Scheduler - Manages automated crawling tasks
- Web Interface - Flask-based user interface
- Web Crawling: Respects robots.txt (a sketch of such a check follows this list), handles relative URLs, and manages crawl depth
- Content Parsing: Extracts titles, meta descriptions, headings, and main content
- Full-Text Search: Powered by Whoosh search library with stemming
- Custom Ranking: Multi-factor ranking algorithm considering title matches, content relevance, freshness, and more
- Scheduled Crawling: Automated crawl jobs with configurable schedules
- Web Interface: Clean, responsive web UI for searching and administration
- Admin Panel: Manage crawl jobs, optimize indexes, and configure ranking weights
- API Endpoints: RESTful API for integration with other applications
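The README doesn't spell out how the robots.txt handling works; as a rough illustration, Python's standard-library `urllib.robotparser` is enough for this kind of check (the user-agent string below is an assumption, not the project's actual one):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt before crawling it.
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# Skip any URL the site disallows for our (assumed) user agent.
if robots.can_fetch("SimpleSearchBot/1.0", "https://example.com/private/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```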
Requirements:
- Python 3.7+
- Flask
- Whoosh
- BeautifulSoup4
- Requests
- Schedule
- NLTK
Installation:

1. Clone or download the project:
   `cd d:\168.se`
2. Install dependencies:
   `uv pip install requests beautifulsoup4 flask whoosh nltk lxml urllib3 schedule`
3. Run the demo (optional):
   `python demo.py`
4. Start the web interface:
   `cd search_engine`, then `python app.py`
5. Open your browser:
   - Visit: http://localhost:5000
   - Admin panel: http://localhost:5000/admin
Project layout:

```
search_engine/
├── components/
│   ├── __init__.py
│   ├── crawler.py          # Web crawler
│   ├── parser.py           # HTML parser
│   ├── indexer.py          # Search indexer
│   ├── query_engine.py     # Query processor
│   ├── scheduler.py        # Crawl scheduler
│   └── ranking.py          # Ranking algorithm
├── templates/
│   ├── base.html           # Base template
│   ├── index.html          # Home page
│   ├── search_results.html # Search results
│   ├── admin.html          # Admin panel
│   └── error.html          # Error page
├── data/                   # Data directory (created automatically)
│   ├── index/              # Search index files
│   └── crawl_config.json   # Crawl job configuration
├── app.py                  # Flask web application
└── search_engine.py        # Main search engine class
```
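The schema inside `indexer.py` isn't shown in this README; as a hedged sketch of what a Whoosh index with stemming can look like (the field names and storage choices here are assumptions, not the project's actual schema):

```python
import os

from whoosh import index
from whoosh.analysis import StemmingAnalyzer
from whoosh.fields import DATETIME, ID, TEXT, Schema

# Assumed fields mirroring what the parser extracts; the stemming
# analyzer lets a query for "running" match documents containing "run".
schema = Schema(
    url=ID(stored=True, unique=True),
    title=TEXT(stored=True, analyzer=StemmingAnalyzer()),
    content=TEXT(analyzer=StemmingAnalyzer()),
    crawled_at=DATETIME(stored=True),
)

os.makedirs("data/index", exist_ok=True)
ix = index.create_in("data/index", schema)

writer = ix.writer()
writer.add_document(url="https://example.com", title="Example Domain",
                    content="This domain is for use in examples.")
writer.commit()
```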
Basic usage:

```python
from search_engine import SearchEngine

# Initialize search engine
engine = SearchEngine()

# Crawl and index websites
engine.crawl_and_index([
    "https://example.com",
    "https://httpbin.org"
], max_pages=50, max_depth=2)

# Search
results = engine.search("example query", limit=10)
for result in results:
    print(f"{result['title']} - {result['url']}")
```
```python
# Add a scheduled job
job_id = engine.add_crawl_job(
    name="Daily News Crawl",
    seed_urls=["https://news.example.com"],
    schedule_type="daily",
    schedule_time="02:00",
    max_pages=100,
    max_depth=3
)

# Start the scheduler
engine.start_scheduler()
```

Web interface pages:
- Home Page: Search interface with suggestions
- Search Results: Ranked results with pagination
- Admin Panel:
  - Manual crawling
  - Scheduled job management
  - Index optimization
  - Ranking weight configuration
The ranking algorithm uses the following weights, all configurable via the admin panel (a sketch of how they combine follows the list):
- `title_match`: 3.0 - Matches in page title
- `content_match`: 1.0 - Matches in page content
- `meta_description_match`: 2.0 - Matches in meta description
- `heading_match`: 2.5 - Matches in headings
- `url_match`: 1.5 - Matches in URL
- `freshness`: 1.0 - How recently the page was crawled
- `content_length`: 0.5 - Optimal content length scoring
- `depth_penalty`: -0.2 - Penalty for deeper pages
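Exactly how these weights combine isn't documented here; a minimal sketch of a plain weighted sum over per-factor scores (the factor values passed in are made up for illustration):

```python
# Default weights from the list above.
WEIGHTS = {
    "title_match": 3.0, "content_match": 1.0, "meta_description_match": 2.0,
    "heading_match": 2.5, "url_match": 1.5, "freshness": 1.0,
    "content_length": 0.5, "depth_penalty": -0.2,
}

def combined_score(factor_scores):
    # Weighted sum over whichever factors were computed for this page;
    # depth_penalty's negative weight turns page depth into a deduction.
    return sum(WEIGHTS[name] * value for name, value in factor_scores.items())

# A strong title match on a page two links deep (illustrative numbers).
print(combined_score({"title_match": 1.0, "content_match": 0.3, "depth_penalty": 2.0}))
```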
Crawler configuration:
- `max_pages`: Maximum pages to crawl per job
- `max_depth`: Maximum depth to crawl from seed URLs
- `delay`: Delay between requests (default: 1 second)
- Respects robots.txt automatically
API endpoints (see the example request below):
- `GET /api/search?q=query&limit=10` - Search API
- `GET /api/suggestions?q=partial&limit=5` - Query suggestions
- `POST /admin/crawl` - Manual crawl trigger
- `POST /admin/add_job` - Add scheduled job
- `POST /admin/run_job/<id>` - Run job immediately
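A quick way to exercise the search endpoint from Python, using the `requests` dependency (the app must be running as described under Installation; the JSON response shape is not documented here, so treat it as an assumption):

```python
import requests

# Hit the running Flask app's search API on its default port.
resp = requests.get(
    "http://localhost:5000/api/search",
    params={"q": "example query", "limit": 10},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # response structure depends on app.py
```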
The search engine's ranking algorithm considers the following signals (a freshness-decay sketch follows the list):
- TF-IDF for content relevance
- Position bias (terms appearing early get higher scores)
- Title and heading emphasis
- Freshness decay (newer content ranked higher)
- Content length optimization
- URL relevance
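The exact formulas aren't given in this README; as one plausible reading of "freshness decay", an exponential half-life works (the seven-day half-life is an assumption):

```python
import time

def freshness(crawled_at, half_life_days=7.0):
    # Exponential decay: a just-crawled page scores 1.0, a page crawled
    # one half-life ago scores 0.5, two half-lives ago 0.25, and so on.
    age_days = max(0.0, (time.time() - crawled_at) / 86400)
    return 0.5 ** (age_days / half_life_days)

print(freshness(time.time()))               # ~1.0 (just crawled)
print(freshness(time.time() - 14 * 86400))  # ~0.25 (two weeks old)
```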
Scheduling features (a background-thread sketch follows this list):
- Daily, weekly, or hourly schedules
- Background processing without blocking the web interface
- Job status tracking and management
- Automatic index updates
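The scheduler component itself isn't shown here; a minimal sketch of non-blocking scheduling with the `schedule` package from the requirements list (the job body is a placeholder, not the project's real crawl code):

```python
import threading
import time

import schedule  # the `schedule` package listed under Requirements

def crawl_job():
    print("running scheduled crawl")  # placeholder for a real crawl + reindex

# Register a daily job at 02:00, mirroring the add_crawl_job example above.
schedule.every().day.at("02:00").do(crawl_job)

def run_pending_forever():
    while True:
        schedule.run_pending()
        time.sleep(30)

# A daemon thread keeps the Flask interface responsive while jobs run.
threading.Thread(target=run_pending_forever, daemon=True).start()
```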
Search features (a multi-field query sketch follows this list):
- Multi-field search across title, content, meta description, and headings
- Query suggestions based on indexed content
- Popular queries tracking
- Pagination for large result sets
- Score breakdown for debugging and optimization
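Multi-field search in Whoosh is typically done with `MultifieldParser`; a sketch against the assumed schema from the indexer example above (it runs once that sketch has created `data/index`, and the field names must match whatever `indexer.py` actually defines):

```python
from whoosh import index
from whoosh.qparser import MultifieldParser

ix = index.open_dir("data/index")

# Parse the query against several fields at once (assumed field names).
parser = MultifieldParser(["title", "content"], schema=ix.schema)
query = parser.parse("example query")

with ix.searcher() as searcher:
    # search_page gives built-in pagination: page 1, ten hits per page.
    page = searcher.search_page(query, 1, pagelen=10)
    for hit in page:
        print(hit["url"], round(hit.score, 3))
```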