refactor(backend): comprehensive backend optimization — GPU management, security, testing, and code quality (#9)
Merged
sylvanding merged 21 commits into main on Mar 18, 2026
Conversation
…swallowing
- Wrap feedparser.parse, fitz _extract_local, and ChromaDB sync calls with asyncio.to_thread to avoid blocking the event loop
- Add count cache to RAGService to reduce redundant ChromaDB count() calls within a single request
- Remove manual db.commit() from conversations CRUD and persist_node; rely on get_session() auto-commit to prevent double commits
- Replace bare except-pass in rag_service with debug logging
- Upgrade MCP mount failure log from warning to error with traceback

Made-with: Cursor
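The offloading pattern used throughout this commit can be sketched as follows; `parse_feed` is a hypothetical stand-in for a blocking library call such as `feedparser.parse`, not the project's actual code:

```python
import asyncio
import time

def parse_feed(url: str) -> dict:
    # Stand-in for a blocking call such as feedparser.parse: it sleeps
    # synchronously, which would stall the event loop if called directly
    # from a coroutine.
    time.sleep(0.1)
    return {"url": url, "entries": []}

async def fetch_feed(url: str) -> dict:
    # asyncio.to_thread runs the blocking function in the default thread
    # pool, so other coroutines keep making progress meanwhile.
    return await asyncio.to_thread(parse_feed, url)

async def main() -> list[dict]:
    # Two blocking parses now overlap in worker threads.
    return await asyncio.gather(fetch_feed("a"), fetch_feed("b"))

results = asyncio.run(main())
```

The same wrapper applies to any synchronous client call (here, the fitz and ChromaDB calls mentioned above).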
… retrieval
- Sync config.py defaults with actual Qwen3 models (Embedding-0.6B, Reranker-0.6B-seq-cls)
- Centralize all LLM/VLM prompts into app/prompts/ module (chat, completion, dedup, keyword, rag, rewrite, writing)
- Add reranker service with singleton loading, semaphore concurrency control, and graceful fallback
- Implement batch adjacent chunk fetching to eliminate N+1 ChromaDB queries
- Enable MMR diversity via vector_store_query_mode with configurable threshold
- Tune HNSW index parameters (ef_construction=200, M=32, ef_search=100)
- Expose rag_top_k and use_reranker in Chat API with input validation
- Extract generic get_or_404 helper using PEP 695 type parameters
- Add rate limit, auth middleware, and API endpoint hardening

Made-with: Cursor
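The "singleton loading plus semaphore concurrency control" shape of the reranker service might look like the following minimal sketch; the class name matches the commit, but the scoring placeholder and `max_concurrency` default are illustrative assumptions, not the real model call:

```python
import asyncio

class RerankerService:
    # Hypothetical sketch: a lazily created singleton whose scoring calls
    # are capped by a semaphore so at most N run concurrently.
    _instance = None

    def __init__(self, max_concurrency: int = 2):
        self._sem = asyncio.Semaphore(max_concurrency)

    @classmethod
    def get(cls) -> "RerankerService":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    async def rerank(self, query: str, docs: list[str], top_n: int = 3) -> list[str]:
        async with self._sem:  # bound concurrent model invocations
            scores = await asyncio.to_thread(self._score, query, docs)
        ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]

    def _score(self, query: str, docs: list[str]) -> list[float]:
        # Placeholder scorer by word overlap; the real service would run
        # the Qwen3 reranker model here, with graceful fallback on failure.
        return [float(len(set(query.split()) & set(d.split()))) for d in docs]

docs = ["gpu memory manager", "rss feed parser", "gpu cache cleanup"]
top = asyncio.run(RerankerService.get().rerank("gpu memory", docs, top_n=2))
```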
…cases
- Add 4 new test modules covering projects, papers, keywords, search, dedup, chat, RAG, writing, conversations, subscriptions, tasks, and settings APIs
- Support real_llm marker for Volcengine-dependent tests (2 tests)
- Verify SSE streaming events (start, text-delta, finish, [DONE])
- Test new reranker and RAG parameter exposure in Chat/RAG endpoints
- All 370 tests pass (2 skipped for real_llm when provider not configured)

Made-with: Cursor
…arch
- Document all 76 backend API endpoints with parameters and flags
- Add brainstorm docs for backend review and config/RAG/testing sessions
- Add implementation plans with acceptance criteria and research insights
- Include RAG retrieval optimization best practices research

Made-with: Cursor
…skipped)

Full end-to-end test suite against a live backend with Volcengine LLM:
- PDF upload and background processing (pdfplumber fallback)
- RAG index build, stats, and query with real LLM answers
- SSE streaming chat (basic + RAG-enhanced)
- Writing assistant (summarize, citations, review outline, gap analysis)
- Conversation persistence and settings APIs
- Auto-skips when server is unreachable

Made-with: Cursor
…ensive E2E tests
- Add ocr_parallel_limit config for controlling concurrent OCR tasks
- Refactor paper_processor.py from serial to parallel OCR with asyncio.gather, semaphore-based concurrency control, and round-robin GPU assignment
- Support CPU-only, single-GPU, and multi-GPU environments gracefully
- Add MinerU client unit tests (mocked HTTP) and E2E integration tests
- Add stress tests: 8-PDF concurrent upload, concurrent RAG queries, concurrent chat streams
- Add quality comparison tests: MinerU vs pdfplumber extraction metrics
- Add GPU utilization monitoring via nvidia-smi sampling during stress tests
- Enhance existing E2E tests with MinerU parsing verification
- Add MinerU deployment guide (docs/solutions/deployment/mineru-setup-guide.md)
- Add OCR_PARALLEL_LIMIT to .env.example

Test results: 394 unit/integration passed, 37 E2E passed (across 4 test suites)

Made-with: Cursor
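The parallel-OCR pattern named above (asyncio.gather + semaphore + round-robin GPU assignment) can be sketched like this; function and parameter names are illustrative, not the actual paper_processor.py API:

```python
import asyncio
from itertools import cycle

async def process_pdfs(paths: list[str], gpu_ids: list[str],
                       parallel_limit: int = 2) -> list[tuple[str, str]]:
    # Hypothetical sketch: a semaphore caps in-flight OCR jobs and
    # devices are handed out round-robin; an empty gpu_ids list
    # degrades gracefully to CPU-only operation.
    sem = asyncio.Semaphore(parallel_limit)
    devices = cycle(gpu_ids or ["cpu"])
    assignments = [(path, next(devices)) for path in paths]

    async def run_one(path: str, device: str) -> tuple[str, str]:
        async with sem:
            # The real code would invoke the OCR engine on `device`
            # via asyncio.to_thread here.
            await asyncio.sleep(0)
            return (path, device)

    # gather preserves input order, so results line up with `paths`.
    return await asyncio.gather(*(run_one(p, d) for p, d in assignments))

out = asyncio.run(process_pdfs(["a.pdf", "b.pdf", "c.pdf"], ["cuda:0", "cuda:1"]))
```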
- Add huggingface-hub as explicit dependency in pyproject.toml
(was missing, causing RAG index build to fail with ImportError)
- Add GET /papers/{paper_id}/chunks API endpoint with ChunkRead schema
(test_paper_chunks_have_sections was skipped because endpoint didn't exist)
- Implement smart GPU selection: _pick_best_gpu() chooses the device
with the most free memory instead of always using cuda:0
- Add CUDA OOM auto-retry in RAG index build endpoint: clears GPU cache,
reloads embedding model onto best available GPU, and retries
- Reduce embedding batch_size from 32 to 8 to lower peak GPU memory
- Reuse detect_gpu() in reranker_service for consistent GPU selection
- Add _cleanup_gpu_memory() (gc.collect + empty_cache) before model loads
- Add retry logic for flaky LLM responses in test_rag_query_with_real_llm
- Update test assertions for new cuda:N device string format
Results: 28/29 E2E tests pass (previously 27/29 with 2 skipped + failures)
Made-with: Cursor
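The core of a `_pick_best_gpu()`-style selection reduces to choosing the device with the most free memory. The sketch below takes the free-memory readings as a plain dict (in real code they would come from something like `torch.cuda.mem_get_info` per device); the function name and CPU fallback behavior are assumptions for illustration:

```python
def pick_best_gpu(free_mem_by_device: dict[str, int]) -> str:
    # Hypothetical sketch: given free-memory readings per device,
    # return the "cuda:N" device with the most free bytes, or "cpu"
    # when no GPU is visible.
    if not free_mem_by_device:
        return "cpu"
    return max(free_mem_by_device, key=free_mem_by_device.__getitem__)

# cuda:1 has 10 GiB free vs 2 GiB on cuda:0, so it wins.
device = pick_best_gpu({"cuda:0": 2 << 30, "cuda:1": 10 << 30})
```

The OOM auto-retry described above would call this after clearing the GPU cache, then reload the embedding model onto the returned device.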
Three presets (conservative/balanced/aggressive) control batch sizes, parallelism, and GPU pinning across embedding, reranker, and OCR services. Users can override any parameter individually via .env. Default mode is balanced for backward compatibility; .env set to conservative for current debugging phase with CUDA_VISIBLE_DEVICES=6,7. Made-with: Cursor
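The preset-with-override resolution could look like the following sketch; the `PERFORMANCE_MODE` key, the field names, and the numeric values are hypothetical placeholders, not the project's actual configuration schema:

```python
# Hypothetical sketch: each preset bundles batch sizes and parallelism;
# any individual field can still be overridden via the environment (.env).
PRESETS: dict[str, dict[str, int]] = {
    "conservative": {"embed_batch_size": 4, "ocr_parallel_limit": 1},
    "balanced": {"embed_batch_size": 8, "ocr_parallel_limit": 2},
    "aggressive": {"embed_batch_size": 32, "ocr_parallel_limit": 4},
}

def resolve_config(env: dict[str, str]) -> dict[str, int]:
    mode = env.get("PERFORMANCE_MODE", "balanced")  # balanced by default
    cfg = dict(PRESETS[mode])
    for key in cfg:
        # Individual .env overrides win over the chosen preset.
        override = env.get(key.upper())
        if override is not None:
            cfg[key] = int(override)
    return cfg

cfg = resolve_config({"PERFORMANCE_MODE": "conservative", "OCR_PARALLEL_LIMIT": "3"})
```

This mirrors the behavior described above: the preset sets the baseline, and a single overridden parameter does not disturb the rest.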
…ts across 5 phases

Phase 1 (P0 Critical):
- Fix OCR blocking event loop with asyncio.to_thread()
- Implement pipeline cancellation with shared state + asyncio.Task.cancel()
- Add SSRF prevention (url_validator.py) + DOI format validation
- Save asyncio.create_task references to prevent GC

Phase 2 (API Consistency):
- Unify error responses: HTTPException + ValidationError → ApiResponse format
- Strengthen Schema validation: Literal types, max_length, ge/le constraints
- Fix non-serializable ValueError in validation error handler

Phase 3 (API Completion):
- Persist pipeline state to Task table
- Add pipeline list endpoint + typed ResumeRequest
- Add batch delete papers endpoint
- Add composite indexes (paper/task project+status) + Alembic migration

Phase 4 (MCP & Middleware):
- Add 4 MCP tools: summarize_papers, generate_review_outline, analyze_gaps, manage_keywords
- Add MCP input validation (top_k, max_results bounds)
- Add per-endpoint rate limiting (chat 30/min, OCR 5/min, RAG 5/min, pipeline 10/min)
- Add subscription auto_import parameter
- Remove llm_client.py shim, unify LLM imports
- Expand Schema __init__.py exports

Phase 5 (WebSocket & Polish):
- Add WebSocket ConnectionManager with room-based broadcasts
- Add pipeline WebSocket endpoint for real-time status
- Add /health endpoint
- Improve CORS config (expose_headers, max_age)
- Restrict API key to header-only (no query params)
- Add project export/import endpoints
- Disable rate limiting in test environment
- Add 33 new tests (url_validator, middleware, batch delete, export/import, WS manager, schema validation)
- Fix existing tests for new error format and Literal constraints

409 tests passing, ruff clean.

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>

Made-with: Cursor
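The SSRF check in a `url_validator.py`-style module boils down to resolving the hostname and rejecting non-global addresses. The sketch below is a minimal assumed shape of `validate_url_safe` (the real service would run the blocking `getaddrinfo` via `asyncio.to_thread`, as a later commit notes):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_url_safe(url: str) -> bool:
    # Hypothetical sketch of the SSRF guard: only http/https URLs whose
    # host resolves exclusively to globally routable addresses pass.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        # is_global is False for private, loopback, link-local,
        # and reserved ranges.
        if not addr.is_global:
            return False
    return True

# Loopback target is rejected, as is a non-http scheme.
blocked = validate_url_safe("http://127.0.0.1/feed.xml")
```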
…g gaps fix
- Extract hardcoded constants to config.py (S2 API, rewrite timeout, title similarity threshold, app version)
- Unify citation_graph_service error handling to use HTTPException instead of returning 200 with error dict
- Narrow rewrite.py exception handling from broad Exception to specific types
- Use Path.is_relative_to() for safer path validation in pipelines
- Add LLMConfigResolver unit tests (12 tests covering from_env/from_merged)
- Add RerankerService unit tests (7 tests covering caching and fallback)
- Add MCP tool tests for all 7 previously untested tools (20 new tests)
- Add Pipeline real PDF integration tests with HITL flow
- Add Chat tool_mode tests for citation_lookup, review_outline, gap_analysis

Total: 498 tests passing (up from ~409)

Made-with: Cursor
…ocess control
- Add GPUModelManager with TTL-based auto-unloading (default 5min idle)
- Add MinerUProcessManager for auto start/stop of MinerU subprocess
- Refactor embedding_service and reranker_service to use GPUModelManager
- Add OCRService.close() and context manager for explicit GPU cleanup
- Add GPU monitoring API: GET /api/v1/gpu/status, POST /api/v1/gpu/unload
- Integrate managers into FastAPI lifespan (startup/shutdown)
- Add config fields: model_ttl_seconds, mineru_auto_manage, mineru_ttl_seconds
- Add 30 new tests (GPUModelManager, MinerUProcessManager, GPU API)

Models are loaded on-demand and released after idle timeout to minimize GPU memory usage when the system is not actively processing requests.

Made-with: Cursor
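The load-on-demand, unload-after-idle lifecycle can be sketched as below; the class name matches the commit, but the `loader`/`clock` injection and method names are assumptions made so the sketch stays testable without a GPU:

```python
import time

class GPUModelManager:
    # Hypothetical sketch of TTL-based unloading: the model is loaded
    # on first use and dropped once idle longer than ttl_seconds.
    def __init__(self, loader, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._model = None
        self._last_used = 0.0

    def get(self):
        self.maybe_unload()
        if self._model is None:
            self._model = self._loader()  # load on demand
        self._last_used = self._clock()
        return self._model

    def maybe_unload(self) -> None:
        # Invoked from get() and, in the real service, from a periodic
        # background task in the FastAPI lifespan.
        if self._model is not None and self._clock() - self._last_used > self._ttl:
            self._model = None  # real code would also empty the CUDA cache

loads: list[int] = []
fake_now = [0.0]
mgr = GPUModelManager(loader=lambda: loads.append(1) or "model",
                      ttl_seconds=300.0, clock=lambda: fake_now[0])
mgr.get()              # first use loads the model
fake_now[0] = 100.0
mgr.get()              # within the TTL: the loaded model is reused
fake_now[0] = 500.0
mgr.maybe_unload()     # 400s idle > 300s TTL: the model is released
reloaded = mgr.get()   # the next use triggers a fresh load
```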
The current conda version does not support the --no-banner argument, causing MinerU auto-start to silently fail and fall back to pdfplumber. Made-with: Cursor
…rity, resource leaks
- Fix ResolvedConflict missing new_paper field causing keep_new data loss
- Add merge action support in apply_resolution_node
- Extract pipeline cancellation to shared module, fix memory leak
- Wrap blocking socket.getaddrinfo/process.wait in asyncio.to_thread
- Fix fitz.open resource leak with context manager
- Add SSRF validation for subscription feed URLs
- Add project existence checks for rag/subscription/search endpoints

Made-with: Cursor
…, input validation

Phase 2: Data integrity + Pipeline persistence
- Add Paper (project_id, doi) unique constraint with Alembic migration
- Replace MemorySaver with AsyncSqliteSaver for pipeline checkpointing
- Add pipeline_checkpoint_db config field

Phase 3: Code quality refactoring
- Extract GPU memory cleanup to shared gpu_utils.py
- Unify OCR calls to use process_pdf_async (MinerU priority)
- Fix LLM config resolver temperature/max_tokens fallback
- Fix hardcoded /tmp path in OCR service
- Replace lambda with explicit helper functions in embedding_service
- Add engine.dispose() on application shutdown

Phase 4: Input validation + API consistency
- Add unified PaginationParams for all list endpoints
- Add Literal type constraints for dedup strategy and crawler priority
- Add SearchExecuteRequest Pydantic model for search API
- Add typed Pydantic models for project import data

Made-with: Cursor
- Add 6 unit tests for pdf_metadata service (normal/corrupted/no-doi/crossref)
- Extend paper API tests with chunks and 404 coverage
- Add shared fixtures to conftest.py for new tests

Made-with: Cursor
…imits, indexes
- Add summary to all API endpoints for OpenAPI documentation
- Unify SSE error format with format_sse_error helper
- Add rate limiting to writing stream endpoint
- Extract citation error messages to constants
- Add reranker top_n/batch_size documentation
- Add Keyword parent_id index with Alembic migration
- Update frontend subscription API for pagination compatibility

Made-with: Cursor
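A `format_sse_error`-style helper is small enough to sketch in full; this assumed shape emits the unified `event: error` frame with a JSON payload that the PR summary describes:

```python
import json

def format_sse_error(code: str, message: str) -> str:
    # Hypothetical sketch of the unified SSE error frame: an "error"
    # event whose data line carries a JSON payload, terminated by the
    # blank line that ends an SSE message.
    payload = json.dumps({"code": code, "message": message})
    return f"event: error\ndata: {payload}\n\n"

frame = format_sse_error("rate_limited", "Too many requests")
```

Every streaming endpoint yielding this one shape lets clients register a single `error` event listener instead of parsing ad-hoc error text.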
…and gpu_utils
- Disable paper DOI unique constraint in dedup test fixtures
- Update search tests to use JSON body instead of query params
- Fix pipeline tests for task cleanup timing
- Update gpu_model_manager tests to mock gpu_utils.gc
- Mock validate_url_safe in subscription tests for SSRF bypass

Made-with: Cursor
Two-layer safety net for GPU cleanup on all exit scenarios:

Layer 1 (in-process safety net):
- atexit handler for sync cleanup (GPU unload + MinerU kill)
- SIGHUP handler for terminal close
- Enhanced MinerU stop() kills external processes by port lookup
- PID file for watchdog coordination

Layer 2 (external watchdog script):
- Independent process monitors Omelette via PID file
- Cleans up GPU resources after any exit (including kill -9, OOM)
- Supports daemon mode for background operation

Covers: Ctrl+C, kill, kill -9, OOM/crash, terminal close

Made-with: Cursor
…atures Add GPU TTL, MinerU auto-management, watchdog, and Alembic migration instructions to both EN/ZH README files. Sync .env.example with new config options introduced in this branch. Made-with: Cursor
Summary
This PR delivers a comprehensive backend optimization across 20 commits and 136 changed files (+14,731 / -811 lines), covering the following major areas:
🔧 Core Improvements
- Wrap blocking calls (socket.getaddrinfo, subprocess.wait/read, fitz.open) with asyncio.to_thread() to prevent event loop blocking
- Remove redundant db.commit() calls in services that were already committed by callers
- Replace bare except: pass with proper logging and re-raising
- Centralize all LLM/VLM prompts into the app/prompts/ module

⚡ GPU Resource Management (New Feature)
- Models auto-unload after a configurable idle TTL (MODEL_TTL_SECONDS)
- conservative/balanced/aggressive presets for batch sizes and parallelism
- GPU monitoring via GET /gpu/status and POST /gpu/unload endpoints
- atexit + SIGHUP handlers ensure GPU resources are released on program exit
- scripts/gpu_watchdog.py daemon monitors process health and cleans up after crashes

🔒 Security Enhancements
- SSRF prevention: url_validator.py blocks requests to private/reserved IPs in RSS feeds and crawler
- Add project existence checks via Depends(get_project) to RAG, subscription, and search endpoints
- Add rate limiting via slowapi to the writing stream endpoint

🗄️ Data Integrity
- UniqueConstraint("project_id", "doi") prevents duplicate papers at the DB level
- Switch pipeline checkpointing from MemorySaver to AsyncSqliteSaver for checkpoint durability
- Add composite indexes on (project_id, status) and keyword.parent_id
- Extract pipeline cancellation to the shared pipelines/cancellation.py module, removing a reverse dependency

🐛 Bug Fixes
- Add missing new_paper field in the ResolvedConflict schema that caused the keep_new action to lose data
- Fix leaked fitz.open() file handle in the OCR service
- Fix LLM temperature/max_tokens not respecting user-defined settings
- Remove unsupported --no-banner flag from the conda run command

🧪 Testing (178 → 526 tests)
📝 Code Quality
- Unified pagination via PaginationParams/KeywordPaginationParams Pydantic models
- Unified SSE error format event: error\ndata: {"code": ..., "message": ...} across all streaming endpoints

📖 Documentation
- Document all 76 backend API endpoints in docs/api-endpoints.md
- Add design notes under docs/brainstorms/ and docs/plans/

Test Plan
- All tests pass (pytest tests/ -v)
- ruff check and ruff format clean
- Alembic migration applied (alembic upgrade head)
- Verified MinerU auto-start via conda run (without --no-banner)