JavDB Auto Spider

A comprehensive Python + Rust automation system for extracting torrent links from javdb.com and automatically adding them to qBittorrent. The system features a high-performance Rust core (via PyO3) for HTML parsing and proxy management, multi-threaded parallel processing, intelligent history tracking, git integration, automated pipeline execution, and duplicate download prevention.

It can serve as an ingestion pipeline in front of an automated scraping platform for JAV (e.g. MDC-NG).

Features

Core Spider Functionality

Modular spider package (scripts/spider/) with 14 specialized modules
Fetches data in real-time from javdb.com/?vft=2 to javdb.com/?page=5&vft=2
Filters entries with both "含中字磁鏈" ("contains Chinese-subtitled magnet links") and "今日新種" ("new torrents today") tags (supports multiple language variations)
Extracts magnet links based on specific categories with priority ordering
Saves results to timestamped CSV files in reports/DailyReport/ directory
Comprehensive logging with different levels (INFO, WARNING, DEBUG, ERROR)
Multi-page processing with progress tracking
Additional metadata extraction (actor, rating, comment count)

Rust Acceleration (Optional)

High-performance Rust core extension (javdb_rust_core) built with PyO3 + maturin
"Rust first, Python fallback" pattern — all features work without Rust installed
HTML parsing 5-10x faster than BeautifulSoup (index, detail, category pages)
Thread-safe proxy pool management with Arc<Mutex>
Accelerated history management, CSV operations, magnet extraction, URL helpers
Automatic detection: system uses Rust when available, falls back to pure Python
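
The "Rust first, Python fallback" pattern can be sketched roughly as follows. Only the module name `javdb_rust_core` comes from this README; the function name `extract_magnets` and the regular expression are assumptions made for illustration and may not match the project's actual API.

```python
import re

try:
    # Optional Rust extension named in this README.
    import javdb_rust_core as _rust
except ImportError:
    _rust = None  # Pure-Python fallback will be used.

# Assumed pattern for magnet links; the real parser may be stricter.
_MAGNET_RE = re.compile(r"magnet:\?xt=urn:btih:[0-9A-Za-z]+")


def extract_magnets(html: str) -> list[str]:
    """Return magnet links, preferring the Rust implementation when present."""
    # `extract_magnets` on the Rust module is a hypothetical name.
    rust_impl = getattr(_rust, "extract_magnets", None)
    if rust_impl is not None:
        return list(rust_impl(html))
    return _MAGNET_RE.findall(html)
```

Callers use `extract_magnets` without knowing which backend ran, which is what lets every feature keep working when the extension is not installed.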

Parallel Processing

Multi-threaded detail page processing with one worker thread per proxy
Activated automatically when using proxy pool mode with 2+ proxies
Task queue / result queue architecture for safe concurrent scraping
Independent MovieSleepManager per worker for rate limiting
Thread-safe login refresh with _login_lock
Force sequential mode with --sequential flag
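
A minimal sketch of the task queue / result queue layout with one worker thread per proxy. The names `fetch_detail`, `worker`, and `scrape_parallel` are hypothetical, and the fetch is a placeholder rather than a real HTTP request; rate limiting and login refresh are left out.

```python
import queue
import threading


def fetch_detail(url: str, proxy: str) -> str:
    # Placeholder for the real request routed through `proxy`.
    return f"{url} via {proxy}"


def worker(proxy: str, tasks: queue.Queue, results: queue.Queue) -> None:
    # Each worker is bound to exactly one proxy for its whole lifetime.
    while True:
        url = tasks.get()
        if url is None:  # Sentinel: no more work for this worker.
            break
        results.put((url, fetch_detail(url, proxy)))


def scrape_parallel(urls: list[str], proxies: list[str]) -> dict[str, str]:
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    for url in urls:
        tasks.put(url)
    for _ in proxies:  # One sentinel per worker, queued after all URLs.
        tasks.put(None)
    threads = [
        threading.Thread(target=worker, args=(proxy, tasks, results))
        for proxy in proxies
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():
        url, body = results.get()
        collected[url] = body
    return collected
```

Because both queues are thread-safe, workers never share mutable state directly; which proxy handles which URL is decided at run time by whichever worker is free.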

Torrent Classification System

字幕 (subtitle): Magnet links with "Subtitle" tag
hacked: Magnet links with priority order:
1. UC无码破解 (-UC.无码破解.torrent) - Highest priority
2. UC (-UC.torrent)
3. U无码破解 (-U.无码破解.torrent)
4. U (-U.torrent) - Lowest priority
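
The priority order above can be illustrated with a small ranking helper. Matching on the file-name suffix is an assumption based on the patterns listed here; `hacked_rank` and `pick_best_hacked` are hypothetical names, not the project's own functions.

```python
# Suffixes in priority order, highest first (taken from the list above).
HACKED_SUFFIXES = [
    "-UC.无码破解.torrent",
    "-UC.torrent",
    "-U.无码破解.torrent",
    "-U.torrent",
]


def hacked_rank(name: str):
    """Return 0 for the highest-priority suffix, or None if not hacked."""
    for rank, suffix in enumerate(HACKED_SUFFIXES):
        if name.endswith(suffix):
            return rank
    return None


def pick_best_hacked(names: list[str]):
    """Pick the hacked torrent name with the best (lowest) rank."""
    ranked = [(hacked_rank(name), name) for name in names]
    ranked = [(rank, name) for rank, name in ranked if rank is not None]
    return min(ranked)[1] if ranked else None
```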

Dual Mode Support

The spider operates in two modes:

Daily Mode (Default)

Uses base URL: https://javdb.com/?vft=2
Saves results to reports/DailyReport/ directory
Checks history by default to avoid re-downloading
Uses "JavDB" category in qBittorrent

Ad Hoc Mode (Custom URL)

Activated with --url parameter for custom URLs (actors, tags, etc.)
Saves results to reports/AdHoc/ directory
Now checks history by default to skip already downloaded entries
Use --ignore-history to re-download everything
Uses "Ad Hoc" category in qBittorrent
Example: python3 scripts/spider --url "https://javdb.com/actors/EvkJ"

qBittorrent Integration

Automatically reads current date's CSV file
Connects to qBittorrent via Web UI API
Adds torrents with proper categorization and settings
Comprehensive logging and progress tracking
Detailed summary reports

qBittorrent File Filter

Automatically filters small files from recently added torrents
Configurable minimum file size threshold (default: 50MB)
Sets priority to 0 (do not download) for files below threshold
Filters out NFO files, samples, screenshots, etc.
Supports dry-run mode for preview
Category-based filtering option
Scheduled via GitHub Actions (2 hours after daily ingestion)
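
The size-threshold rule can be sketched as a pure function. The input shape (dicts with `index` and `size` in bytes) and the helper name `files_below_threshold` are assumptions for illustration, as is treating 1 MB as 1024 * 1024 bytes.

```python
def files_below_threshold(files: list[dict], min_size_mb: int = 50) -> list[int]:
    """Return the indexes of files that should get priority 0 (do not download)."""
    threshold_bytes = min_size_mb * 1024 * 1024  # Assumed MB definition.
    return [f["index"] for f in files if f["size"] < threshold_bytes]
```

In dry-run mode a script would only print the returned indexes; otherwise it would set their priority to 0 in qBittorrent.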

Duplicate Download Prevention

Automatic Downloaded Detection: Automatically identifies which torrents have been downloaded by checking the history CSV file
Download Indicators: Adds [DOWNLOADED] prefix to downloaded torrents in daily report CSV files
Skip Duplicate Downloads: qBittorrent uploader automatically skips torrents with [DOWNLOADED] indicators
Multiple Torrent Type Support: Supports four types: hacked_subtitle, hacked_no_subtitle, subtitle, no_subtitle
Enhanced History Tracking: Tracks create_date (first discovery) and update_date (latest modification) for each movie
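
A rough sketch of how the [DOWNLOADED] indicator could be applied to a report row and honored by an uploader. The column names are the four torrent types listed above; the helper names, the space after the prefix, and the shape of `downloaded_types` are assumptions.

```python
DOWNLOADED_PREFIX = "[DOWNLOADED] "  # Trailing space is an assumption.
TORRENT_TYPES = ("hacked_subtitle", "hacked_no_subtitle", "subtitle", "no_subtitle")


def mark_downloaded(row: dict, downloaded_types: set) -> dict:
    """Prefix magnet links whose type is already recorded in history."""
    marked = dict(row)
    for torrent_type in TORRENT_TYPES:
        link = marked.get(torrent_type, "")
        if link and torrent_type in downloaded_types and not link.startswith(DOWNLOADED_PREFIX):
            marked[torrent_type] = DOWNLOADED_PREFIX + link
    return marked


def links_to_upload(row: dict) -> list[str]:
    """Return only the links that do not carry the downloaded indicator."""
    return [
        row[t]
        for t in TORRENT_TYPES
        if row.get(t) and not row[t].startswith("[DOWNLOADED]")
    ]
```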

Git Integration & Pipeline

Automated git commit and push functionality
Incremental commits throughout pipeline execution
Email notifications with results and logs
Complete workflow automation

JavDB Auto Login

Automatic session cookie refresh
Captcha handling (manual input or 2Captcha API)
Updates config.py automatically
Supports custom URL scraping with authentication
See JavDB Login Guide for setup

CloudFlare Bypass (Optional)

Integration with CloudflareBypassForScraping
Request Mirroring mode for transparent CF bypass
Automatic cookie caching and management
Works with both local and remote proxy setups
Automatically activated as a fallback when direct requests fail

Installation

Install Python dependencies:

pip install -r requirements.txt

(Optional) Install SOCKS5 proxy support if you want to use SOCKS5 proxies:

pip install "requests[socks]"

(Optional) Install Rust acceleration extension for 5-10x faster HTML parsing:

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build and install the extension
cd rust_core
pip install maturin
maturin develop --release
cd ..

Note: The Rust extension is optional. All features work without it — the system automatically falls back to pure Python implementations.

Configure the system by copying and editing the configuration file:

cp config.py.example config.py

(Optional) For CloudFlare bypass feature, install and run CloudflareBypassForScraping service:

# See CloudFlare Bypass section below for setup instructions

Docker Installation (Alternative)

You can also run the application using Docker containers, which simplifies dependency management and deployment.

Quick Start with Docker

Pull the image from GitHub Container Registry:

docker pull ghcr.io/YOUR_USERNAME/javdb-autospider:latest

Prepare configuration files:

cp config.py.example config.py
cp env.example .env
# Edit config.py with your settings

Run the container:

docker run -d \
  --name javdb-spider \
  --restart unless-stopped \
  -v $(pwd)/config.py:/app/config.py:ro \
  -v $(pwd)/logs:/app/logs \
  -v $(pwd)/Ad\ Hoc:/app/Ad\ Hoc \
  -v $(pwd)/Daily\ Report:/app/Daily\ Report \
  --env-file .env \
  ghcr.io/YOUR_USERNAME/javdb-autospider:latest

Using Docker Compose (Recommended)

Use the automated build script:

./docker/docker-build.sh

Or manually:

# Prepare configuration
cp config.py.example config.py
cp env.example .env

# Build and start
docker-compose -f docker/docker-compose.yml build
docker-compose -f docker/docker-compose.yml up -d

View logs:

docker-compose -f docker/docker-compose.yml logs -f

The Docker image uses multi-stage builds: a Rust builder stage compiles the javdb_rust_core extension, and the runtime stage only includes the compiled wheel.

Usage

Docker Usage

If you installed via Docker, you can manage the container with the following commands:

Basic Commands

# View container logs
docker logs -f javdb-spider

# View cron logs
docker exec javdb-spider tail -f /var/log/cron.log

# Run spider manually
docker exec javdb-spider python3 scripts/spider --use-proxy

# Run pipeline manually
docker exec javdb-spider python pipeline.py

# Execute commands inside container
docker exec -it javdb-spider bash

# Stop container
docker stop javdb-spider

# Start container
docker start javdb-spider

# Restart container
docker restart javdb-spider

With Docker Compose

# Start containers
docker-compose -f docker/docker-compose.yml up -d

# Stop containers
docker-compose -f docker/docker-compose.yml down

# View logs
docker-compose -f docker/docker-compose.yml logs -f

# Restart containers
docker-compose -f docker/docker-compose.yml restart

# Rebuild and restart
docker-compose -f docker/docker-compose.yml build --no-cache
docker-compose -f docker/docker-compose.yml up -d

Configure Cron Jobs

Edit the .env file to configure scheduled tasks:

# Spider runs daily at 3:00 AM
CRON_SPIDER=0 3 * * *
SPIDER_COMMAND=cd /app && /usr/local/bin/python scripts/spider --use-proxy >> /var/log/cron.log 2>&1

# Pipeline runs daily at 4:00 AM
CRON_PIPELINE=0 4 * * *
PIPELINE_COMMAND=cd /app && /usr/local/bin/python pipeline.py >> /var/log/cron.log 2>&1

After modifying .env, restart the container:

docker-compose -f docker/docker-compose.yml restart

Individual Scripts (Local Installation)

Run the spider to extract data:

python3 scripts/spider

# Or equivalently:
python -m scripts.spider

Run the qBittorrent uploader:

# Daily mode (default)
python qbtorrent_uploader.py

# Ad hoc mode (for custom URL scraping results)
python qbtorrent_uploader.py --mode adhoc

# Use proxy for qBittorrent API requests
python qbtorrent_uploader.py --use-proxy

Run the qBittorrent File Filter (filter out small files):

# Default: filter files smaller than 50MB from last 2 days
python scripts/qb_file_filter.py --min-size 50

# Custom threshold and days
python scripts/qb_file_filter.py --min-size 100 --days 3

# Dry run (preview without changes)
python scripts/qb_file_filter.py --min-size 50 --dry-run

# Filter specific category only
python scripts/qb_file_filter.py --min-size 50 --category JavDB

# With proxy
python scripts/qb_file_filter.py --min-size 50 --use-proxy

Run the PikPak bridge (transfer old torrents from qBittorrent to PikPak):

# Default: process torrents older than 3 days in batch mode
python pikpak_bridge.py

# Custom days threshold
python pikpak_bridge.py --days 7

# Dry run mode (test without actual transfers)
python pikpak_bridge.py --dry-run

# Individual mode (process torrents one by one instead of batch)
python pikpak_bridge.py --individual

# Use proxy for qBittorrent API requests
python pikpak_bridge.py --use-proxy

# Combine options
python pikpak_bridge.py --days 5 --dry-run --use-proxy

Command-Line Arguments

The JavDB Spider supports various command-line arguments for customization:

Basic Options

# Dry run mode (no CSV file written)
python3 scripts/spider --dry-run

# Specify custom output filename
python3 scripts/spider --output-file my_results.csv

# Custom page range
python3 scripts/spider --start-page 3 --end-page 10

# Parse all pages until empty page is found
python3 scripts/spider --all

Phase Control

# Run only Phase 1 (subtitle + today/yesterday tags)
python3 scripts/spider --phase 1

# Run only Phase 2 (today/yesterday tags with quality filter)
python3 scripts/spider --phase 2

# Run both phases (default)
python3 scripts/spider --phase all

History Control

# Ignore history file and scrape all pages (for both daily and ad hoc modes)
python3 scripts/spider --ignore-history

# Custom URL scraping (saves to reports/AdHoc/, checks history by default)
python3 scripts/spider --url "https://javdb.com/?vft=2"

# Custom URL scraping, ignoring history to re-download everything
python3 scripts/spider --url "https://javdb.com/actors/EvkJ" --ignore-history

# Ignore today/yesterday release date tags and process all matching entries
python3 scripts/spider --ignore-release-date

# Use proxy for all HTTP requests
python3 scripts/spider --use-proxy

Complete Examples

# Quick test run with limited pages
python3 scripts/spider --start-page 1 --end-page 3 --dry-run

# Full scrape ignoring history
python3 scripts/spider --all --ignore-history

# Custom URL with specific output file
python3 scripts/spider --url "https://javdb.com/?vft=2" --output-file custom_results.csv

# Phase 1 only with custom page range
python3 scripts/spider --phase 1 --start-page 5 --end-page 15

# Download all subtitle entries regardless of release date
python3 scripts/spider --ignore-release-date --phase 1

# Download all high-quality entries regardless of release date
python3 scripts/spider --ignore-release-date --phase 2 --start-page 1 --end-page 10

# Ad hoc mode: Download specific actor's movies (skips already downloaded)
python3 scripts/spider --url "https://javdb.com/actors/EvkJ" --ignore-release-date

# Ad hoc mode: Re-download everything from an actor (ignores history)
python3 scripts/spider --url "https://javdb.com/actors/EvkJ" --ignore-history --ignore-release-date

# Use proxy to access JavDB (useful for geo-restricted regions)
python3 scripts/spider --use-proxy --start-page 1 --end-page 5

# Combine multiple options: proxy + custom URL + ignore release date
python3 scripts/spider --url "https://javdb.com/actors/EvkJ" --use-proxy --ignore-release-date

Argument Reference

| Argument | Description | Default | Example |
| --- | --- | --- | --- |
| `--dry-run` | Print items without writing CSV | False | `--dry-run` |
| `--output-file` | Custom CSV filename | Auto-generated | `--output-file results.csv` |
| `--start-page` | Starting page number | 1 | `--start-page 5` |
| `--end-page` | Ending page number | 20 | `--end-page 10` |
| `--all` | Parse until empty page | False | `--all` |
| `--ignore-history` | Skip history checking (both daily & ad hoc) | False | `--ignore-history` |
| `--url` | Custom URL to scrape (enables ad hoc mode) | None | `--url "https://javdb.com/?vft=2"` |
| `--phase` | Phase to run (1/2/all) | all | `--phase 1` |
| `--ignore-release-date` | Ignore today/yesterday tags | False | `--ignore-release-date` |
| `--use-proxy` | Enable proxy from config.py | False | `--use-proxy` |
| `--sequential` | Force sequential processing (disable parallel) | False | `--sequential` |
| `--max-movies-phase1` | Limit phase 1 movies (for testing) | None | `--max-movies-phase1 10` |
| `--max-movies-phase2` | Limit phase 2 movies (for testing) | None | `--max-movies-phase2 5` |
| `--use-history` | Enable history filter in ad-hoc mode | False | `--use-history` |

Additional Tools

JavDB Auto Login (for custom URL scraping):

# Run when session cookie expires or before using --url parameter
python3 javdb_login.py

# The script will:
# 1. Login to JavDB with your credentials
# 2. Handle captcha (manual input or 2Captcha API)
# 3. Extract and update session cookie in config.py
# 4. Verify the cookie works

# See JavDB Auto Login section above for setup details

Check Proxy Ban Status:

# View ban records
cat "reports/proxy_bans.csv"

# Ban information is also included in pipeline email reports

Run Migration Scripts:

cd migration

# Clean up duplicate history entries
python3 cleanup_history_priorities.py

# Update history file format (if upgrading from older version)
python3 update_history_format.py

# Reclassify torrents (after classification rule changes)
python3 reclassify_c_hacked_torrents.py

Automated Pipeline

Run the complete workflow:

# Basic pipeline run
python pipeline_run_and_notify.py

# Pipeline with custom arguments (passed through to scripts/spider)
python pipeline_run_and_notify.py --start-page 1 --end-page 5

# Pipeline ignoring release date tags
python pipeline_run_and_notify.py --ignore-release-date --phase 1

# Pipeline with custom URL
python pipeline_run_and_notify.py --url "https://javdb.com/actors/EvkJ"

# Pipeline with proxy enabled
python pipeline_run_and_notify.py --use-proxy

# Pipeline with PikPak individual mode (process torrents one by one)
python pipeline_run_and_notify.py --pikpak-individual

The pipeline will:

Run the JavDB Spider to extract data (with provided arguments)
Commit spider results to GitHub immediately
Run the qBittorrent Uploader to add torrents
Commit uploader results to GitHub immediately
Run PikPak Bridge to handle old torrents (batch mode by default, individual mode with --pikpak-individual)
Perform final commit and push to GitHub
Analyze logs for critical errors
Send email notifications with appropriate status

Note: The pipeline accepts the same arguments as scripts/spider and passes them through automatically. Additional pipeline-specific arguments include --pikpak-individual for PikPak Bridge mode control.

Intelligent Error Detection

The pipeline analyzes its logs and distinguishes between two classes of error:

Critical Errors (email marked as FAILED):

Cannot access JavDB main site (all pages fail)
Cannot connect to qBittorrent
Cannot login to qBittorrent
All torrent additions failed
Network completely unreachable

Non-Critical Errors (email marked as SUCCESS):

Some specific JavDB pages failed (but main site accessible)
Some individual torrents failed to add (but qBittorrent accessible)
PikPak API issues (PikPak service problem, not infrastructure)
No new torrents found (expected behavior)

This ensures you only get FAILED emails when there's a real infrastructure problem that needs attention, not just when there's no new content or minor issues.
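
The critical / non-critical split can be sketched as a small decision function. The real pipeline derives its verdict from log analysis; the `RunStats` counters and the `email_status` helper below are hypothetical stand-ins used only to make the rules concrete.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    pages_attempted: int
    pages_failed: int
    qb_connected: bool
    qb_logged_in: bool
    torrents_attempted: int
    torrents_failed: int


def email_status(stats: RunStats) -> str:
    """Return 'FAILED' only for infrastructure-level problems."""
    all_pages_failed = (
        stats.pages_attempted > 0 and stats.pages_failed == stats.pages_attempted
    )
    all_torrents_failed = (
        stats.torrents_attempted > 0
        and stats.torrents_failed == stats.torrents_attempted
    )
    critical = (
        all_pages_failed
        or not stats.qb_connected
        or not stats.qb_logged_in
        or all_torrents_failed
    )
    return "FAILED" if critical else "SUCCESS"
```

A run with zero new torrents has `torrents_attempted == 0`, so it is never counted as "all additions failed".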

Configuration

Unified Configuration (`config.py`)

All configuration settings are now centralized in a single config.py file:

# =============================================================================
# GIT CONFIGURATION
# =============================================================================
GIT_USERNAME = 'your_github_username'
GIT_PASSWORD = 'your_github_token_or_password'
GIT_REPO_URL = 'https://github.com/your_username/your_repo_name.git'
GIT_BRANCH = 'main'

# =============================================================================
# QBITTORRENT CONFIGURATION
# =============================================================================
QB_HOST = 'your_qbittorrent_ip'
QB_PORT = 'your_qbittorrent_port'
QB_USERNAME = 'your_qbittorrent_username'
QB_PASSWORD = 'your_qbittorrent_password'
TORRENT_CATEGORY = 'JavDB'  # Category for daily mode torrents
TORRENT_CATEGORY_ADHOC = 'Ad Hoc'  # Category for adhoc mode torrents
TORRENT_SAVE_PATH = ''
AUTO_START = True
SKIP_CHECKING = False
REQUEST_TIMEOUT = 30
DELAY_BETWEEN_ADDITIONS = 1

# =============================================================================
# SMTP CONFIGURATION (for email notifications)
# =============================================================================
SMTP_SERVER = 'smtp.gmail.com'
SMTP_PORT = 587
SMTP_USER = 'your_email@gmail.com'
SMTP_PASSWORD = 'your_email_password_or_app_password'
EMAIL_FROM = 'your_email@gmail.com'
EMAIL_TO = 'your_email@gmail.com'

# =============================================================================
# PROXY CONFIGURATION
# =============================================================================

# Proxy mode: 'single' (use first proxy only) or 'pool' (automatic failover)
PROXY_MODE = 'single'

# Proxy pool - list of proxies (first one used in single mode, all used in pool mode)
PROXY_POOL = [
    {'name': 'Main-Proxy', 'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'},
    {'name': 'Backup-Proxy', 'http': 'http://127.0.0.1:7891', 'https': 'http://127.0.0.1:7891'},
]

# Proxy pool behavior (only for pool mode)
PROXY_POOL_COOLDOWN_SECONDS = 691200  # 8 days cooldown for banned proxies
PROXY_POOL_MAX_FAILURES = 3  # Max failures before cooldown

# Legacy proxy config (deprecated - use PROXY_POOL instead)
PROXY_HTTP = None
PROXY_HTTPS = None

# Modular proxy control - which modules use proxy
PROXY_MODULES = ['all']  # 'all' or list: 'spider', 'qbittorrent', 'pikpak'

# =============================================================================
# SPIDER CONFIGURATION
# =============================================================================
START_PAGE = 1
END_PAGE = 20
BASE_URL = 'https://javdb.com'

# Phase 2 filtering criteria
PHASE2_MIN_RATE = 4.0  # Minimum rating score for phase 2 entries
PHASE2_MIN_COMMENTS = 80  # Minimum comment count for phase 2 entries

# Release date filter
IGNORE_RELEASE_DATE_FILTER = False  # Set True to ignore today/yesterday tags

# Sleep time configuration (in seconds)
PAGE_SLEEP = 2  # Sleep between index pages
MOVIE_SLEEP_MIN = 5   # Minimum random sleep between movies
MOVIE_SLEEP_MAX = 15  # Maximum random sleep between movies

# =============================================================================
# JAVDB LOGIN CONFIGURATION (for automatic session cookie refresh)
# =============================================================================

# JavDB login credentials (optional - for custom URL scraping)
JAVDB_USERNAME = ''  # Your JavDB email or username
JAVDB_PASSWORD = ''  # Your JavDB password

# Session cookie (auto-updated by javdb_login.py)
JAVDB_SESSION_COOKIE = ''

# Optional: 2Captcha API key for automatic captcha solving
# Get from: https://2captcha.com/ (~$1 per 1000 captchas)
TWOCAPTCHA_API_KEY = ''  # Leave empty for manual captcha input

# =============================================================================
# CLOUDFLARE BYPASS CONFIGURATION (Optional)
# =============================================================================

# CloudFlare bypass service port (must match the service port)
# See: https://github.com/sarperavci/CloudflareBypassForScraping
CF_BYPASS_SERVICE_PORT = 8000

# =============================================================================
# LOGGING CONFIGURATION
# =============================================================================
LOG_LEVEL = 'INFO'
SPIDER_LOG_FILE = 'logs/spider.log'
UPLOADER_LOG_FILE = 'logs/qb_uploader.log'
PIPELINE_LOG_FILE = 'logs/pipeline.log'
EMAIL_NOTIFICATION_LOG_FILE = 'logs/email_notification.log'

# =============================================================================
# FILE PATHS
# =============================================================================
REPORTS_DIR = 'reports'
DAILY_REPORT_DIR = 'reports/DailyReport'
AD_HOC_DIR = 'reports/AdHoc'
PARSED_MOVIES_CSV = 'parsed_movies_history.csv'

# =============================================================================
# PIKPAK CONFIGURATION (for PikPak Bridge)
# =============================================================================

# PikPak login credentials
PIKPAK_EMAIL = 'your_pikpak_email@example.com'
PIKPAK_PASSWORD = 'your_pikpak_password'

# PikPak settings
PIKPAK_LOG_FILE = 'logs/pikpak_bridge.log'
PIKPAK_REQUEST_DELAY = 3  # Delay between requests (seconds) to avoid rate limiting

# =============================================================================
# qBittorrent File Filter Configuration
# =============================================================================

# Minimum file size threshold in MB
# Files smaller than this will be set to "do not download" priority
# This helps filter out small files like NFO, samples, screenshots, etc.
QB_FILE_FILTER_MIN_SIZE_MB = 50

# Log file for the file filter script
QB_FILE_FILTER_LOG_FILE = 'logs/qb_file_filter.log'

Setup Instructions:

Copy config.py.example to config.py
Update all the placeholder values with your actual credentials
The config.py file is automatically excluded from git commits for security

GitHub Authentication Setup:

Go to GitHub Settings → Developer settings → Personal access tokens
Generate a new token with repo permissions
Use this token as GIT_PASSWORD

qBittorrent Setup:

Enable Web UI in qBittorrent settings
Note the IP address, port, username, and password
Update the qBittorrent configuration section in config.py

Email Setup (Optional):

For Gmail, use an App Password instead of your regular password
Enable 2-factor authentication and generate an App Password
Update the SMTP configuration section in config.py

Output Structure

CSV File Columns

The spider generates CSV files with the following columns:

href: The video page URL
video-title: The video title
page: The page number where the entry was found
actor: The main actor/actress name
rate: The rating score
comment_number: Number of user comments/ratings
hacked_subtitle: Magnet link for hacked version with subtitles
hacked_no_subtitle: Magnet link for hacked version without subtitles
subtitle: Magnet link for subtitle version
no_subtitle: Magnet link for regular version (prefers 4K if available)
size_hacked_subtitle, size_hacked_no_subtitle, size_subtitle, size_no_subtitle: Corresponding sizes

File Locations

All report files are organized under the reports/ directory:

reports/
├── DailyReport/YYYY/MM/         # Daily report CSV files
│   └── Javdb_TodayTitle_YYYYMMDD.csv
├── AdHoc/YYYY/MM/               # Ad hoc report CSV files
│   └── Javdb_AdHoc_*.csv
├── Dedup/                       # Rclone dedup reports
├── parsed_movies_history.csv    # History tracking
├── pikpak_bridge_history.csv    # PikPak transfer history
└── proxy_bans.csv               # Proxy ban records

Daily Report CSV files: reports/DailyReport/YYYY/MM/Javdb_TodayTitle_YYYYMMDD.csv
Ad Hoc CSV files: reports/AdHoc/YYYY/MM/Javdb_AdHoc_*.csv
History file: reports/parsed_movies_history.csv
PikPak history: reports/pikpak_bridge_history.csv
Proxy ban records: reports/proxy_bans.csv
Log files: logs/ directory
- spider.log
- qb_uploader.log
- pipeline.log
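
As an illustration of the layout above, the path of a daily report for a given date could be built like this. The helper name `daily_report_path` is hypothetical; the directory and file name pattern are the ones listed in this section.

```python
from datetime import date
from pathlib import PurePosixPath


def daily_report_path(day: date, reports_dir: str = "reports") -> PurePosixPath:
    """Build reports/DailyReport/YYYY/MM/Javdb_TodayTitle_YYYYMMDD.csv."""
    return PurePosixPath(
        reports_dir,
        "DailyReport",
        f"{day:%Y}",
        f"{day:%m}",
        f"Javdb_TodayTitle_{day:%Y%m%d}.csv",
    )
```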

History System

The spider includes an intelligent history system that tracks which torrent types have been found for each movie:

Multiple Torrent Type Tracking

Tracks ALL available torrent types per movie (e.g., both hacked_subtitle and subtitle)
Prevents redundant processing when movies already have complete torrent collections
Only searches for torrent types that are missing based on preference rules

Processing Rules

Phase 1: Processes movies with missing torrent types based on preferences
Phase 2: Only processes movies that can be upgraded from no_subtitle to hacked_no_subtitle or meet quality criteria
New Movies: Always processed regardless of history

Phase 2 Quality Filtering

Phase 2 includes configurable quality filtering based on user ratings and comment counts:

Minimum Rating: Configurable via PHASE2_MIN_RATE (default: 4.0)
Minimum Comments: Configurable via PHASE2_MIN_COMMENTS (default: 80)
Purpose: Ensures only high-quality content is processed in phase 2
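
A minimal sketch of the quality check, using the two config names above. The helper `passes_phase2` is hypothetical. Requiring both conditions and using strict greater-than comparisons are assumptions, taken from the "rate > 4.0, comments > 80" wording used elsewhere in this README.

```python
PHASE2_MIN_RATE = 4.0
PHASE2_MIN_COMMENTS = 80


def passes_phase2(
    rate: float,
    comment_number: int,
    min_rate: float = PHASE2_MIN_RATE,
    min_comments: int = PHASE2_MIN_COMMENTS,
) -> bool:
    """Return True when an entry clears both phase 2 thresholds."""
    # Strict comparisons are an assumption; the project may use >=.
    return rate > min_rate and comment_number > min_comments
```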

Preference Rules

Hacked Category: Always prefer hacked_subtitle over hacked_no_subtitle
Subtitle Category: Always prefer subtitle over no_subtitle
Complete Collection Goal: Each movie should have both categories represented
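
The preference rules can be sketched as picking at most one link per category. The function name `select_preferred` and the dict-based input are assumptions; the four keys are the torrent types used in the CSV columns.

```python
# (preferred, fallback) pairs for the hacked and subtitle categories.
CATEGORY_PREFERENCES = (
    ("hacked_subtitle", "hacked_no_subtitle"),
    ("subtitle", "no_subtitle"),
)


def select_preferred(available: dict) -> dict:
    """Keep the preferred type per category, falling back when it is missing."""
    chosen = {}
    for preferred, fallback in CATEGORY_PREFERENCES:
        if available.get(preferred):
            chosen[preferred] = available[preferred]
        elif available.get(fallback):
            chosen[fallback] = available[fallback]
    return chosen
```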

Release Date Filtering

By default, the spider filters entries based on release date tags ("今日新種", "new torrents today", or "昨日新種", "new torrents yesterday"). You can override this behavior in two ways:

Command-Line Argument (Recommended)

# Ignore release date tags for a single run
python3 scripts/spider --ignore-release-date

# Or via pipeline
python pipeline_run_and_notify.py --ignore-release-date

Configuration File

Set IGNORE_RELEASE_DATE_FILTER = True in config.py to permanently ignore release date tags.

Behavior with --ignore-release-date or IGNORE_RELEASE_DATE_FILTER = True:

Phase 1: Downloads ALL entries with subtitle tags, regardless of release date
Phase 2: Downloads ALL entries meeting quality criteria (rate > 4.0, comments > 80), regardless of release date

This is useful when:

You want to backfill your collection with older content
You're scraping a custom URL (like an actor's page) where release date is not relevant
You want to download everything matching the quality criteria

Proxy Support

The system supports both single proxy and proxy pool modes for improved reliability:

Proxy Pool Mode (✨ NEW - Recommended)

Configure multiple proxies for automatic failover:

Automatic Switching: When one proxy fails, automatically switches to another
Passive Health Checking: Only marks proxies as failed on actual failures (no active probing)
Cooldown Mechanism: Failed proxies are temporarily disabled to allow recovery (8 days default)
Ban Detection: Automatically detects when proxies are banned by JavDB
Persistent Ban Records: Ban history stored in reports/proxy_bans.csv and persists across runs
Statistics Tracking: Detailed success rates and usage statistics for each proxy
Perfect for JavDB: Respects strict rate limiting while providing redundancy

See PROXY_POOL_GUIDE.md for detailed configuration and usage guide.

Quick Setup:

# In config.py
PROXY_MODE = 'pool'
PROXY_POOL = [
    {'name': 'Proxy-1', 'http': 'http://127.0.0.1:7890', 'https': 'http://127.0.0.1:7890'},
    {'name': 'Proxy-2', 'http': 'http://127.0.0.1:7891', 'https': 'http://127.0.0.1:7891'},
]
PROXY_POOL_COOLDOWN_SECONDS = 691200  # 8 days cooldown (JavDB bans for 7 days)
PROXY_POOL_MAX_FAILURES = 3  # Max failures before cooldown

Proxy Ban Management:

The system includes intelligent ban detection and management:

Automatic Detection: Detects when JavDB blocks a proxy IP
Persistent Records: Ban history stored in reports/proxy_bans.csv
8-Day Cooldown: Default cooldown is one day longer than JavDB's 7-day ban period
Exit Code 2: Spider exits with code 2 when proxies are banned (helps with automation)
Ban Summary: Detailed ban status included in pipeline email reports
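
For automation, a wrapper script might branch on the exit code like this. Only exit code 2 (proxies banned) is documented above; treating 0 as success and any other code as a generic error is an assumption, and `describe_exit` and `run_spider` are hypothetical helpers.

```python
import subprocess
import sys

PROXY_BANNED_EXIT_CODE = 2  # Documented above.


def describe_exit(code: int) -> str:
    """Map a spider exit code to a coarse status label."""
    if code == 0:  # Assumed to mean success.
        return "ok"
    if code == PROXY_BANNED_EXIT_CODE:
        return "proxies_banned"
    return "error"


def run_spider(extra_args: tuple = ()) -> str:
    """Run the spider with --use-proxy and classify the result."""
    cmd = [sys.executable, "scripts/spider", "--use-proxy", *extra_args]
    return describe_exit(subprocess.run(cmd).returncode)
```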

Checking Ban Status:

# Ban records are logged in:
cat "reports/proxy_bans.csv"

# Pipeline emails include ban summary with:
# - Proxy name and IP
# - Ban timestamp
# - Cooldown expiry time
# - Current status (BANNED/AVAILABLE)

Once the proxy pool is configured, run the spider with the --use-proxy flag:

python3 scripts/spider --use-proxy

Single Proxy Mode (Legacy)

The spider also supports traditional single proxy configuration for HTTP/HTTPS/SOCKS5 proxies. This is useful if:

JavDB is geo-restricted in your region
You need to route traffic through a specific network
You want to use a VPN or proxy service

Setup

1. Configure proxy in config.py:

# HTTP/HTTPS proxy
PROXY_HTTP = 'http://127.0.0.1:7890'
PROXY_HTTPS = 'http://127.0.0.1:7890'

# Or SOCKS5 proxy
PROXY_HTTP = 'socks5://127.0.0.1:1080'
PROXY_HTTPS = 'socks5://127.0.0.1:1080'

# With authentication
PROXY_HTTP = 'http://username:password@proxy.example.com:8080'
PROXY_HTTPS = 'http://username:password@proxy.example.com:8080'

# Control which modules use proxy (modular control)
PROXY_MODULES = ['all']  # Enable for all modules
# PROXY_MODULES = ['spider']  # Only spider module (includes login)
# PROXY_MODULES = ['spider', 'qbittorrent']  # Spider and qBittorrent
# PROXY_MODULES = []  # Disable for all modules

2. Enable proxy with command-line flag:

# Enable proxy for spider
python3 scripts/spider --use-proxy

# Enable proxy for qBittorrent uploader
python qbtorrent_uploader.py --use-proxy

# Enable proxy for PikPak bridge
python pikpak_bridge.py --use-proxy

# Combine with other options
python3 scripts/spider --use-proxy --url "https://javdb.com/actors/EvkJ"

# Via pipeline (enables proxy for all components)
python pipeline_run_and_notify.py --use-proxy

Note:

Proxy is disabled by default. You must use --use-proxy to enable it.
If --use-proxy is set but no proxy is configured in config.py, a warning will be logged.
You can control which parts of the spider use proxy via PROXY_MODULES configuration.

Modular Proxy Control

The PROXY_MODULES setting allows fine-grained control over which parts use proxy:

| Module | Description | Use Case |
| --- | --- | --- |
| `spider` | JavDB Spider | Use proxy to access all JavDB pages (index, detail, login/session refresh) |
| `qbittorrent` | qBittorrent Web UI API | Use proxy for qBittorrent API requests |
| `pikpak` | PikPak bridge qBittorrent API | Use proxy for PikPak bridge operations |
| `all` | All modules | Use proxy for everything (default) |

Examples:

# Use proxy for everything
PROXY_MODULES = ['all']

# Only use proxy for spider module (includes login)
PROXY_MODULES = ['spider']

# Use proxy for spider and qBittorrent
PROXY_MODULES = ['spider', 'qbittorrent']

# Only use proxy for qBittorrent and PikPak, not spider
PROXY_MODULES = ['qbittorrent', 'pikpak']

# Disable proxy for all modules (even if --use-proxy is set)
PROXY_MODULES = []

Common Scenarios:

Geo-restricted JavDB only: PROXY_MODULES = ['spider']
Local qBittorrent behind firewall: PROXY_MODULES = ['qbittorrent', 'pikpak']
Everything through proxy: PROXY_MODULES = ['all']
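
The interaction between --use-proxy and PROXY_MODULES described above can be summarized in a small helper. `should_use_proxy` is a hypothetical name; the logic follows the rules stated in this section.

```python
def should_use_proxy(module: str, use_proxy_flag: bool, proxy_modules: list[str]) -> bool:
    """Decide whether `module` routes its requests through the proxy."""
    if not use_proxy_flag:
        # Proxy is disabled by default; --use-proxy must be given.
        return False
    # An empty list disables the proxy everywhere, even with --use-proxy.
    return "all" in proxy_modules or module in proxy_modules
```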

Supported Proxy Types

HTTP: http://proxy.example.com:8080
HTTPS: https://proxy.example.com:8080
SOCKS5: socks5://proxy.example.com:1080 (requires requests[socks] package)

Installing SOCKS5 Support

If you want to use SOCKS5 proxy, install the additional dependency:

pip install "requests[socks]"

Troubleshooting Proxy Issues

Error: 500 Internal Server Error

Check if proxy server is running and accessible
Verify proxy credentials (username/password)

If password contains special characters, URL-encode them:

from urllib.parse import quote
password = "My@Pass!"
encoded = quote(password, safe='')
print(encoded)  # Output: My%40Pass%21

Test proxy manually:

curl -x http://username:password@proxy:port https://javdb.com

Error: Connection refused or timeout

Check if proxy server is running: telnet proxy_ip proxy_port
Verify firewall rules allow connection to proxy
Check if proxy requires authentication

Proxy works but downloads fail

Some proxies don't support magnet links or torrents
Try different proxy or use direct connection for qBittorrent/PikPak:
```
PROXY_MODULES = ['spider']
```

Password with special characters

Common special characters that need URL encoding:

@ → %40
: → %3A (only in password, not after @)
/ → %2F
? → %3F
# → %23
& → %26
= → %3D
+ → %2B
Space → %20
! → %21

Example: http://user:My@Pass!123@proxy:8080 becomes http://user:My%40Pass%21123@proxy:8080
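
Rather than encoding characters by hand, the whole proxy URL can be assembled with `urllib.parse.quote`. The helper name `build_proxy_url` is an assumption made for this illustration; only `quote` comes from the standard library.

```python
from urllib.parse import quote


def build_proxy_url(scheme, username, password, host, port):
    """Assemble a proxy URL with percent-encoded credentials (illustrative helper)."""
    # safe="" encodes every reserved character, including '/', ':' and '@'
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


print(build_proxy_url("http", "user", "My@Pass!123", "proxy", 8080))
# prints: http://user:My%40Pass%21123@proxy:8080
```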

CloudFlare Bypass Support

The system supports integration with CloudflareBypassForScraping for handling CloudFlare protection on JavDB.

What is CloudFlare Bypass?

CloudFlare Bypass is an optional feature that helps you access JavDB when CloudFlare protection is enabled. It uses the CloudflareBypassForScraping service which automatically:

Handles CloudFlare challenges
Manages cf_clearance cookies
Provides transparent request forwarding (Request Mirroring mode)

Setup

1. Install CloudflareBypassForScraping:

# Clone the repository
git clone https://github.com/sarperavci/CloudflareBypassForScraping.git
cd CloudflareBypassForScraping

# Install dependencies
pip install -r requirements.txt

# Configure (edit config.json if needed)
# Default port is 8000

2. Start the CF Bypass Service:

# Local setup (default)
python app.py

# Custom port (update CF_BYPASS_SERVICE_PORT in config.py to match)
python app.py --port 8000

3. Configure Spider:

Edit config.py to set the CF bypass service port:

# CloudFlare Bypass Configuration
CF_BYPASS_SERVICE_PORT = 8000  # Must match the service port

4. CF Bypass Behavior:

CF bypass is automatically activated as a fallback when direct requests fail during the proxy pool fallback mechanism. No command-line flag is needed.

How It Works

When CF bypass is activated during fallback:

Request Mirroring: Requests are forwarded through the CF bypass service
URL Rewriting: Original URL https://javdb.com/page → http://localhost:8000/page
Host Header: The original hostname is sent via x-hostname header
Cookie Management: CF bypass service handles cf_clearance cookies automatically
Transparent: Your spider code doesn't need any changes
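
The URL rewriting and `x-hostname` steps listed above can be sketched as follows. This is an assumption-laden illustration of request mirroring, not the spider's real code: `mirror_request` and its parameters are hypothetical names.

```python
from urllib.parse import urlsplit, urlunsplit


def mirror_request(url, service_host="localhost", service_port=8000):
    """Rewrite `url` so it targets the bypass service (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep path, query and fragment; swap scheme and host for the bypass service
    mirrored = urlunsplit(
        ("http", f"{service_host}:{service_port}", parts.path, parts.query, parts.fragment)
    )
    # The original hostname travels in the x-hostname header
    headers = {"x-hostname": parts.hostname}
    return mirrored, headers


print(mirror_request("https://javdb.com/page?vft=2"))
```

The returned URL and headers could then be handed to whatever HTTP client the caller already uses.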

Network Topology

Local Setup:

Spider → http://localhost:8000 → CloudFlare Bypass Service → https://javdb.com

With Proxy:

Spider → http://proxy_ip:8000 → CF Bypass on Proxy Server → https://javdb.com

When using proxy pool, the CF bypass service URL automatically adjusts to match the current proxy IP.

Configuration

# In config.py
CF_BYPASS_SERVICE_PORT = 8000  # CF bypass service port (default: 8000)

Service Location Logic:

No Proxy: Uses http://localhost:8000
With Proxy Pool: Uses http://{proxy_ip}:8000 (extracts IP from current proxy URL)

This allows you to run CF bypass service on the same server as your proxy for better performance.
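
A minimal sketch of the service location logic, assuming the proxy is given as a URL string. The function name `cf_bypass_base_url` is hypothetical, and IPv6 proxy hosts are not handled.

```python
from urllib.parse import urlsplit


def cf_bypass_base_url(proxy_url=None, port=8000):
    """Pick the bypass service address for the current proxy (illustrative only)."""
    if not proxy_url:
        # No proxy: the service is expected on the local machine
        return f"http://localhost:{port}"
    # With a proxy: reuse the proxy's host, but on the bypass service port
    host = urlsplit(proxy_url).hostname
    return f"http://{host}:{port}"


print(cf_bypass_base_url())
print(cf_bypass_base_url("http://user:pw@10.0.0.5:8080"))
```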

When to Use

Use CloudFlare Bypass when:

✅ JavDB shows CloudFlare challenge page
✅ You get "Access Denied" or "Checking your browser" errors
✅ Direct access works in browser but fails in script
✅ Proxy alone doesn't bypass CloudFlare protection

Troubleshooting

Error: "Connection refused to localhost:8000"

Make sure CF bypass service is running
Check if port 8000 is available: netstat -an | grep 8000
Update CF_BYPASS_SERVICE_PORT if using different port

Error: "No movie list found" with CF bypass

Check CF bypass service logs for errors
Verify x-hostname header is being sent correctly
Try restarting the CF bypass service

CF Bypass + Proxy Not Working

Ensure CF bypass service is running on the proxy server
Verify proxy IP extraction is correct (check logs)
Test CF bypass service directly: curl http://proxy_ip:8000/

Performance Notes

First Request: Slower (CF challenge solving)
Subsequent Requests: Fast (cookie cached)
Cookie TTL: Varies (usually hours to days)
Overhead: Minimal after first request

JavDB Auto Login

The system includes automatic login functionality to maintain session cookies for custom URL scraping.

Why Use Auto Login?

When scraping custom URLs (actors, tags, etc.) with the --url parameter, JavDB requires a valid session cookie. The cookie expires after some time, and once it does, requests fail with age verification or login prompts.

Auto login solves this by:

✅ Automatically logging into JavDB
✅ Handling age verification automatically
✅ Extracting and updating session cookie
✅ Supporting captcha (manual input or 2Captcha API)

Quick Start

1. Configure credentials in config.py:

# JavDB login credentials (for automatic session cookie refresh)
JAVDB_USERNAME = 'your_email@example.com'  # or username
JAVDB_PASSWORD = 'your_password'

# Optional: 2Captcha API key for automatic captcha solving
TWOCAPTCHA_API_KEY = ''  # Leave empty for manual captcha input

2. Run the login script:

python3 javdb_login.py

3. Enter captcha when prompted:

The script will:

Download and save captcha image to javdb_captcha.png
Automatically open the image (if possible)
Prompt you to enter the captcha code

4. Use the spider with custom URLs:

# Spider with custom URL
python3 scripts/spider --url "https://javdb.com/actors/RdEb4"

# Pipeline with custom URL
python3 pipeline_run_and_notify.py --url "https://javdb.com/actors/RdEb4"

Captcha Handling

Manual Input (Default):

Script downloads captcha image
Opens image automatically (platform-dependent)
You enter the code when prompted
Simple and free

2Captcha API (Optional):

Sign up at 2Captcha
Add API key to config.py: TWOCAPTCHA_API_KEY = 'your_key'
Script automatically solves captchas (~$1 per 1000 captchas)
Fully automated but costs money

Configuration Options

# In config.py

# Login credentials (required)
JAVDB_USERNAME = 'your_email@example.com'
JAVDB_PASSWORD = 'your_password'

# Session cookie (auto-updated by javdb_login.py)
JAVDB_SESSION_COOKIE = ''

# Optional: 2Captcha API key
TWOCAPTCHA_API_KEY = ''  # For automatic captcha solving

# Optional: Manual cookie extraction
# Get from browser DevTools → Application → Cookies → _jdb_session
# JAVDB_SESSION_COOKIE = 'your_session_cookie_here'

When to Re-run

Re-run python3 javdb_login.py when:

✅ Session cookie expires (usually after days/weeks)
✅ Spider shows "No movie list found" on valid URLs
✅ Age verification or login errors appear
✅ Before using --url parameter for first time

Automation (Optional)

Cron Job (Linux/Mac):

# Refresh cookie roughly weekly (runs at 00:00 on days 1, 8, 15, 22 and 29 of each month)
0 0 */7 * * cd ~/JAVDB_AutoSpider && python3 javdb_login.py >> logs/javdb_login.log 2>&1

Task Scheduler (Windows):

Set up scheduled task to run javdb_login.py weekly

Advanced: OCR-based Captcha Solving

The script includes an optional OCR-based captcha solver in utils/login/javdb_captcha_solver.py:

# Free methods (included)
solve_captcha(image_data, method='ocr')      # Local OCR (Tesseract)
solve_captcha(image_data, method='manual')   # Manual input

# Paid method (requires API key)
solve_captcha(image_data, method='2captcha') # 2Captcha API
solve_captcha(image_data, method='auto')     # Try OCR first, fallback to 2Captcha

Installing Tesseract OCR (Optional):

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# macOS
brew install tesseract

# Windows
# Download installer from: https://github.com/UB-Mannheim/tesseract/wiki

Troubleshooting

Login Failed - Incorrect Captcha:

Captcha is case-sensitive
Try again for a new captcha
Consider using 2Captcha API

Login Failed - Invalid Credentials:

Verify username/password in config.py
Test credentials in browser first
Check for typos

Session Cookie Not Working:

Verify cookie updated in config.py
Use same proxy/network for login and spider
Try logging in again

For detailed troubleshooting and manual cookie extraction, see JavDB Login Guide.

Downloaded Indicator Feature

The system includes an advanced duplicate download prevention feature that automatically marks downloaded torrents and skips them in future runs.

Feature Overview

Downloaded torrents are marked in the daily report CSV, and the qBittorrent uploader skips every marked entry so nothing is downloaded twice. The history file also records create and update timestamps for each movie.

Feature Characteristics

Automatic Detection of Downloaded Torrents: Automatically identifies which torrents have been downloaded by checking the history CSV file
Add Indicators: Adds [DOWNLOADED] prefix to downloaded torrents in daily report CSV files
Skip Duplicate Downloads: qBittorrent uploader automatically skips torrents with [DOWNLOADED] indicators
Support Multiple Torrent Types: Supports four types: hacked_subtitle, hacked_no_subtitle, subtitle, no_subtitle
Enhanced History Tracking: Tracks create_date (first discovery) and update_date (latest modification) for each movie

Enhanced History Format

The history CSV file now uses an enhanced format with individual columns for each torrent type:

Old Format:

href,phase,video_code,parsed_date,torrent_type

New Format:

href,phase,video_code,create_date,update_date,last_visited_datetime,hacked_subtitle,hacked_no_subtitle,subtitle,no_subtitle

create_date: When the movie was first discovered and logged
update_date: When the movie was last updated with new torrent types
last_visited_datetime: When the movie detail page was last visited
hacked_subtitle: Download date for hacked version with subtitles (empty if not downloaded)
hacked_no_subtitle: Download date for hacked version without subtitles (empty if not downloaded)
subtitle: Download date for subtitle version (empty if not downloaded)
no_subtitle: Download date for regular version (empty if not downloaded)
Backward compatibility is maintained for existing files
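
To illustrate how the per-type columns can be read, the sketch below parses a history file in the new format and collects the torrent types that already have a download date. The sample row is made-up data, and `downloaded_types_by_href` is a hypothetical helper, not a function from utils/history_manager.py.

```python
import csv
import io

TORRENT_TYPES = ("hacked_subtitle", "hacked_no_subtitle", "subtitle", "no_subtitle")

# Made-up sample in the new 10-column format
SAMPLE_HISTORY = """\
href,phase,video_code,create_date,update_date,last_visited_datetime,hacked_subtitle,hacked_no_subtitle,subtitle,no_subtitle
/v/mOJnXY,1,IPZZ-574,2025-07-09 20:00:57,2025-07-09 20:05:30,2025-07-09 20:05:30,2025-07-09 20:05:30,,2025-07-09 20:05:30,
"""


def downloaded_types_by_href(history_file):
    """Map each href to the set of torrent types that carry a download date."""
    result = {}
    for row in csv.DictReader(history_file):
        # A non-empty date in a torrent-type column means that type was downloaded
        result[row["href"]] = {t for t in TORRENT_TYPES if (row.get(t) or "").strip()}
    return result


history = downloaded_types_by_href(io.StringIO(SAMPLE_HISTORY))
print(sorted(history["/v/mOJnXY"]))
# prints: ['hacked_subtitle', 'subtitle']
```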

Workflow

Daily Report Generation: Spider generates daily report CSV file
History Check: Uploader checks history CSV file when starting
Add Indicators: Add [DOWNLOADED] prefix to downloaded torrents
Skip Processing: Skip torrents with indicators when reading CSV
Upload New Torrents: Only upload torrents that haven't been downloaded
Update History: When new torrent types are found, update_date is modified
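
The "Add Indicators" and "Skip Processing" steps can be sketched like this. Both function names are hypothetical and the rows are plain dicts, so this shows the idea rather than the uploader's actual code.

```python
INDICATOR = "[DOWNLOADED] "  # tag followed by one space, as in the report CSV


def add_indicators(report_row, downloaded_types):
    """Return a copy of the row with already-downloaded links prefixed."""
    marked = dict(report_row)
    for torrent_type in downloaded_types:
        link = marked.get(torrent_type, "")
        # Prefix only real links, and never prefix twice
        if link and not link.startswith(INDICATOR):
            marked[torrent_type] = INDICATOR + link
    return marked


def links_to_upload(report_row, torrent_types):
    """Collect the links that do not carry the downloaded indicator."""
    return [
        report_row[t]
        for t in torrent_types
        if report_row.get(t) and not report_row[t].startswith(INDICATOR)
    ]


row = {"href": "/v/mOJnXY", "video_code": "IPZZ-574",
       "hacked_subtitle": "magnet:?xt=...", "subtitle": "magnet:?xt=..."}
marked = add_indicators(row, {"hacked_subtitle"})
print(marked["hacked_subtitle"])
print(links_to_upload(marked, ("hacked_subtitle", "subtitle")))
```

Because `add_indicators` checks for an existing prefix, running it repeatedly on the same report leaves the file unchanged.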

Example Output

CSV Before Modification:

href,video_code,hacked_subtitle,subtitle
/v/mOJnXY,IPZZ-574,magnet:?xt=...,magnet:?xt=...

CSV After Modification:

href,video_code,hacked_subtitle,subtitle
/v/mOJnXY,IPZZ-574,[DOWNLOADED] magnet:?xt=...,[DOWNLOADED] magnet:?xt=...

History File Format (excerpt; the last_visited_datetime column is omitted here):

href,phase,video_code,create_date,update_date,hacked_subtitle,hacked_no_subtitle,subtitle,no_subtitle
/v/mOJnXY,1,IPZZ-574,2025-07-09 20:00:57,2025-07-09 20:05:30,2025-07-09 20:05:30,,2025-07-09 20:05:30,

Uploader Log:

2025-07-09 22:09:23,182 - INFO - Adding downloaded indicators to CSV file...
2025-07-09 22:09:23,183 - INFO - Added downloaded indicators to Daily Report/Javdb_TodayTitle_20250709.csv
2025-07-09 22:09:23,183 - INFO - Found 0 torrent links in Daily Report/Javdb_TodayTitle_20250709.csv
2025-07-09 22:09:23,183 - INFO - Skipped 20 already downloaded torrents

Important Notes

History File Dependency: Feature depends on reports/parsed_movies_history.csv file
Indicator Format: Downloaded indicator format is "[DOWNLOADED] " (the tag followed by one space, placed before the magnet link)
Backward Compatibility: If history file doesn't exist, feature will gracefully degrade without affecting normal use
Performance Optimization: History check uses efficient CSV reading, won't significantly impact performance
Timestamp Tracking: create_date remains constant while update_date changes with each modification
Torrent Type Merging: When updating existing records, new torrent types are merged with existing ones

Migration

The system automatically handles migration from the old format (parsed_date) to the new format (create_date, update_date). Existing files are automatically converted with backward compatibility.
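
As a rough illustration of such a conversion, the sketch below maps one old-format row to the new columns. Several points are assumptions rather than documented behaviour: that the old `torrent_type` field holds one or more comma-separated type names, that `parsed_date` is reused as the download date for those types, and that `last_visited_datetime` starts out empty. The function name `migrate_row` is hypothetical.

```python
TORRENT_TYPES = ("hacked_subtitle", "hacked_no_subtitle", "subtitle", "no_subtitle")


def migrate_row(old_row):
    """Convert one old-format history row (dict) to the new column layout."""
    parsed = old_row["parsed_date"]
    # Assumption: torrent_type is a comma-separated list of type names
    old_types = {t.strip() for t in old_row.get("torrent_type", "").split(",") if t.strip()}
    new_row = {
        "href": old_row["href"],
        "phase": old_row["phase"],
        "video_code": old_row["video_code"],
        "create_date": parsed,          # first discovery
        "update_date": parsed,          # no later modification is known yet
        "last_visited_datetime": "",    # assumption: unknown for old records
    }
    for torrent_type in TORRENT_TYPES:
        new_row[torrent_type] = parsed if torrent_type in old_types else ""
    return new_row


old = {"href": "/v/mOJnXY", "phase": "1", "video_code": "IPZZ-574",
       "parsed_date": "2025-07-09 20:00:57", "torrent_type": "hacked_subtitle,subtitle"}
print(migrate_row(old))
```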

Together, the indicators and the per-type history columns prevent duplicate downloads while keeping a timestamped record of every movie.

Migration Scripts

The migration/ directory contains utility scripts for maintaining and upgrading the system:

Available Scripts

cleanup_history_priorities.py

Removes duplicate entries from history file
Ensures data integrity
Safe to run multiple times

update_history_format.py

Migrates old history format to new format
Converts parsed_date to create_date/update_date
Automatic backward compatibility

rename_columns_add_last_visited.py

Renames date columns and adds last_visited_datetime field
Required when upgrading to support the new history format

migrate_reports_to_dated_dirs.py

Migrates flat report files into YYYY/MM/ dated subdirectories
Required when upgrading to the new reports directory structure

reclassify_c_hacked_torrents.py

Reclassifies torrents with specific naming patterns
Updates torrent type classification
Useful after classification rule changes
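
For the dated directory layout used by migrate_reports_to_dated_dirs.py, the target path can be derived from the date embedded in a report's filename. The sketch below assumes filenames contain a `YYYYMMDD` stamp (as in `Javdb_TodayTitle_20250709.csv`); the helper `dated_path` is hypothetical and only computes the path, it does not move any file.

```python
import re
from pathlib import PurePosixPath

_DATE_IN_NAME = re.compile(r"(\d{4})(\d{2})(\d{2})")


def dated_path(base_dir, filename):
    """Return base_dir/YYYY/MM/filename, or base_dir/filename if no date is found."""
    match = _DATE_IN_NAME.search(filename)
    if match is None:
        return PurePosixPath(base_dir) / filename
    year, month, _day = match.groups()
    return PurePosixPath(base_dir) / year / month / filename


print(dated_path("reports/DailyReport", "Javdb_TodayTitle_20250709.csv"))
# prints: reports/DailyReport/2025/07/Javdb_TodayTitle_20250709.csv
```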

When to Use

Run migration scripts when:

✅ Upgrading from older versions
✅ History file shows duplicate entries
✅ Format changes are introduced
✅ Data cleanup is needed

How to Run

cd migration
python3 cleanup_history_priorities.py
python3 update_history_format.py
python3 rename_columns_add_last_visited.py
python3 reclassify_c_hacked_torrents.py

Note: Always backup your reports/parsed_movies_history.csv before running migration scripts.

Logging

The system provides comprehensive logging:

INFO: General progress information with tracking
WARNING: Non-critical issues
DEBUG: Detailed debugging information
ERROR: Critical errors

Progress tracking includes:

[Page 1/5] - Page-level progress
[15/75] - Entry-level progress across all pages
[1/25] - Upload progress for qBittorrent

Troubleshooting

Common Issues

Spider Issues:

No entries found: Check if the website structure has changed
Connection errors: Verify internet connection and website accessibility
CSV not generated: Check if the reports/DailyReport directory exists

qBittorrent Issues:

Cannot connect: Check if qBittorrent is running and Web UI is enabled
Login failed: Verify username and password in configuration
CSV file not found: Run the spider first to generate the CSV file

Git Issues:

Authentication failed: Verify username and token/password
Repository not found: Check repository URL and access permissions
Branch issues: Ensure the branch exists in your repository

Downloaded Indicator Issues:

Indicators not added: Check if history file exists and has correct format
Uploader skipping too many torrents: Check if history file contains outdated records
Import errors: Ensure utils/history_manager.py file exists
History format issues: Ensure history file has correct column structure with backward compatibility

JavDB Login Issues:

Login failed: Check credentials in config.py
Captcha errors: Try again for new captcha, or use 2Captcha API
Cookie not working: Verify cookie updated in config.py, use same proxy for login and spider
See JavDB Login Guide for detailed troubleshooting

CloudFlare Bypass Issues:

Connection refused: Ensure CF bypass service is running
Port errors: Verify CF_BYPASS_SERVICE_PORT matches service port
No movie list found: Check CF bypass service logs
Proxy + CF not working: Ensure CF bypass service runs on proxy server

Proxy Ban Issues:

All proxies banned: Check reports/proxy_bans.csv for ban status
Spider exits with code 2: Indicates proxy ban detected, wait for cooldown or add new proxies
Cooldown not working: Default is 8 days, adjust PROXY_POOL_COOLDOWN_SECONDS if needed
Ban false positives: Check if JavDB is actually accessible from proxy IP

Debug Mode

To see detailed operations, temporarily raise the log verbosity in config.py:

# In config.py
LOG_LEVEL = 'DEBUG'  # Shows detailed debug information
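
How a script might turn that setting into a logging configuration is sketched below. The function `configure_logging` is a hypothetical name, and the format string is inferred from the sample uploader log lines rather than taken from the project's code.

```python
import logging

LOG_LEVEL = "DEBUG"  # would normally be imported from config.py


def configure_logging(level_name):
    """Apply a level name such as 'DEBUG'; unknown names fall back to INFO."""
    level = getattr(logging, level_name.upper(), logging.INFO)
    logging.basicConfig(
        level=level,
        format="%(asctime)s - %(levelname)s - %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return level


configure_logging(LOG_LEVEL)
logging.getLogger("spider").debug("debug output is now visible")
```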

Security Notes

Configuration file: config.py is automatically excluded from git commits (check .gitignore)
Never commit credentials: GitHub tokens, passwords, API keys should stay in config.py only
GitHub authentication: Use personal access tokens instead of passwords
JavDB credentials: Only stored locally in config.py, never transmitted except to JavDB
PikPak credentials: Stored in config.py, used only for PikPak API
2Captcha API key: Optional, only used if configured for automatic captcha solving
Proxy passwords: Use URL encoding for special characters in passwords
Session cookies: Auto-updated by login script, expire after some time
Sensitive logs: Pipeline automatically masks sensitive info in logs and emails

Environment variables (optional): Consider for production deployments

import os
JAVDB_USERNAME = os.getenv('JAVDB_USER', '')
JAVDB_PASSWORD = os.getenv('JAVDB_PASS', '')
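
The masking of sensitive values mentioned in the list above could look roughly like this for proxy URLs. This is an illustrative sketch with a hypothetical function name (`mask_proxy_url`), not the pipeline's actual masking logic.

```python
import re

# Matches "user:password@" directly after a scheme separator such as "http://"
_CREDENTIALS = re.compile(r"(?<=://)([^:/@\s]+):([^@\s]+)@")


def mask_proxy_url(text):
    """Replace the password part of any proxy URL found in `text` with ***."""
    return _CREDENTIALS.sub(lambda m: f"{m.group(1)}:***@", text)


print(mask_proxy_url("Using proxy http://user:My%40Pass%21123@proxy:8080"))
# prints: Using proxy http://user:***@proxy:8080
```

URLs without credentials are left untouched, so the function is safe to apply to every log line.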

Notes

Rate Limiting and Delays

The system includes delays between requests to be respectful to servers:
- Index pages: 2 seconds (configurable via PAGE_SLEEP)
- Movies: 5-15 seconds random (configurable via MOVIE_SLEEP_MIN / MOVIE_SLEEP_MAX)
- Volume-based adjustment: MovieSleepManager automatically increases sleep intervals when processing large batches
- qBittorrent additions: 1 second (configurable via DELAY_BETWEEN_ADDITIONS)
- PikPak requests: 3 seconds (configurable via PIKPAK_REQUEST_DELAY)
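
The random per-movie delay with a volume-based adjustment can be pictured as follows. The scaling rule (10% longer per 50 queued movies, capped at 2x) is purely an assumption for illustration; the real MovieSleepManager may use a different rule, and `pick_movie_sleep` is a hypothetical name.

```python
import random


def pick_movie_sleep(min_s=5.0, max_s=15.0, batch_size=0, rng=random):
    """Pick a random delay in seconds, stretched for large batches (assumed rule)."""
    # Assumption: +10% per 50 queued movies, never more than double
    factor = min(2.0, 1.0 + 0.1 * (batch_size // 50))
    return rng.uniform(min_s * factor, max_s * factor)


print(round(pick_movie_sleep(), 1))
print(round(pick_movie_sleep(batch_size=1000), 1))
```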

System Behavior

The system uses proper headers to mimic a real browser
CSV files are automatically saved to the reports/DailyReport/YYYY/MM/ or reports/AdHoc/YYYY/MM/ directory
The pipeline provides incremental commits for monitoring progress in real-time
History file tracks all downloaded movies with timestamps
Rust acceleration is automatically detected and used when available
Exit code 2 indicates proxy ban detection (useful for automation)
Logs automatically mask sensitive information (passwords, tokens, etc.)
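
For automation, a wrapper script can branch on the spider's exit code. The sketch below assumes only what is stated above (exit code 2 means a proxy ban was detected); the function names and labels are hypothetical.

```python
import subprocess
import sys

EXIT_PROXY_BANNED = 2  # documented above: proxy ban detected


def describe_exit(code):
    """Translate a spider exit code into a short label for scripts or alerts."""
    if code == 0:
        return "ok"
    if code == EXIT_PROXY_BANNED:
        return "proxy-banned"
    return "failed"


def run_spider(extra_args=()):
    """Run the spider with proxy enabled and classify the result."""
    cmd = [sys.executable, "scripts/spider", "--use-proxy", *extra_args]
    return describe_exit(subprocess.run(cmd).returncode)
```

A scheduler could call `run_spider()` and, on "proxy-banned", wait for the cooldown instead of retrying immediately.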

File Structure

scripts/spider/: Spider package (modular architecture)
- __main__.py: Package entry point (python3 scripts/spider)
- main.py: Main orchestration flow
- cli.py: Command-line argument parsing
- parallel.py: Multi-threaded detail processing (ProxyWorker)
- sequential.py: Sequential detail processing
- index_fetcher.py: Index page fetching
- fallback.py: Multi-level fallback (proxy/CF/login)
- session.py: Login and session management
- sleep_manager.py: Volume-based sleep management
- state.py: Global state management
- csv_builder.py: CSV row construction
- report.py: Summary report generation
rust_core/: Rust acceleration extension (PyO3 + maturin)
- src/scraper/: HTML parsing (index, detail, category pages)
- src/proxy/: Proxy pool, ban manager, masking
- src/requester/: HTTP request handler
- src/history/: History CSV management
- src/csv_writer.rs, src/magnet_extractor.rs, src/url_helper.rs
api/: FastAPI REST API layer
reports/: Contains all report files and history
- DailyReport/YYYY/MM/: Daily scraping results
- AdHoc/YYYY/MM/: Custom URL scraping results
- parsed_movies_history.csv: History tracking
- pikpak_bridge_history.csv: PikPak transfer history
- proxy_bans.csv: Proxy ban records
logs/: Contains all log files
- spider.log: Spider execution logs
- qb_uploader.log: Upload execution logs
- pipeline.log: Pipeline execution logs
- pikpak_bridge.log: PikPak bridge execution logs
- qb_file_filter.log: File filter execution logs
migration/: Contains migration scripts for history and report files
utils/: Utility modules (history, parser, proxy pool, etc.)
utils/login/: JavDB login related files and documentation
docker/: Docker configuration files

Quick Reference

Common Commands

# Basic daily scraping
python3 scripts/spider
python3 qbtorrent_uploader.py

# Full automated pipeline
python3 pipeline_run_and_notify.py

# Scrape with proxy (CF bypass activates automatically as fallback)
python3 scripts/spider --use-proxy
python3 pipeline_run_and_notify.py --use-proxy

# Custom URL scraping (requires login)
python3 javdb_login.py  # First time setup
python3 scripts/spider --url "https://javdb.com/actors/RdEb4"
python3 pipeline_run_and_notify.py --url "https://javdb.com/actors/RdEb4"

# Scrape ignoring release date
python3 scripts/spider --ignore-release-date --phase 1
python3 pipeline_run_and_notify.py --ignore-release-date

# Ad hoc mode
python3 scripts/spider --url "https://javdb.com/tags/xyz"
python3 qbtorrent_uploader.py --mode adhoc

# PikPak bridge
python3 pikpak_bridge.py  # Default: 3 days, batch mode
python3 pikpak_bridge.py --days 7 --individual  # Custom days, individual mode

# qBittorrent File Filter
python3 scripts/qb_file_filter.py --min-size 50  # Filter files < 50MB
python3 scripts/qb_file_filter.py --min-size 100 --days 3 --dry-run  # Preview mode

Configuration Files

Main config: config.py (copy from config.py.example)
History file: reports/parsed_movies_history.csv
Ban records: reports/proxy_bans.csv
Login docs: utils/login/JAVDB_LOGIN_README.md

Important Links

CloudFlare Bypass Service
2Captcha API (optional, for automatic captcha solving)
JavDB Login Guide
Rust Installation Guide (macOS)
API Usage Guide

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

This project is for educational and personal use only. Please respect the terms of service of the websites you scrape.
