refactor(backend): comprehensive backend optimization — GPU management, security, testing, and code quality (#9)
Merged
sylvanding merged 21 commits into main on Mar 18, 2026
Conversation
…swallowing
- Wrap feedparser.parse, fitz _extract_local, and ChromaDB sync calls with asyncio.to_thread to avoid blocking the event loop
- Add count cache to RAGService to reduce redundant ChromaDB count() calls within a single request
- Remove manual db.commit() from conversations CRUD and persist_node; rely on get_session() auto-commit to prevent double commits
- Replace bare except-pass in rag_service with debug logging
- Upgrade MCP mount failure log from warning to error with traceback

Made-with: Cursor
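The offloading pattern used throughout this commit can be sketched as follows; `parse_feed` is a hypothetical stand-in for a blocking library call such as `feedparser.parse`, not the project's actual code:

```python
import asyncio
import time

def parse_feed(url: str) -> dict:
    # Stand-in for a blocking call such as feedparser.parse: it sleeps
    # synchronously, which would stall the event loop if called directly
    # from a coroutine.
    time.sleep(0.1)
    return {"url": url, "entries": []}

async def fetch_feed(url: str) -> dict:
    # asyncio.to_thread runs the blocking function in the default thread
    # pool, so other coroutines keep making progress meanwhile.
    return await asyncio.to_thread(parse_feed, url)

async def main() -> list[dict]:
    # Two blocking parses now overlap in worker threads.
    return await asyncio.gather(fetch_feed("a"), fetch_feed("b"))

results = asyncio.run(main())
```

The same wrapper applies to any synchronous client call (here, the fitz and ChromaDB calls mentioned above).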
… retrieval
- Sync config.py defaults with actual Qwen3 models (Embedding-0.6B, Reranker-0.6B-seq-cls)
- Centralize all LLM/VLM prompts into app/prompts/ module (chat, completion, dedup, keyword, rag, rewrite, writing)
- Add reranker service with singleton loading, semaphore concurrency control, and graceful fallback
- Implement batch adjacent chunk fetching to eliminate N+1 ChromaDB queries
- Enable MMR diversity via vector_store_query_mode with configurable threshold
- Tune HNSW index parameters (ef_construction=200, M=32, ef_search=100)
- Expose rag_top_k and use_reranker in Chat API with input validation
- Extract generic get_or_404 helper using PEP 695 type parameters
- Add rate limit, auth middleware, and API endpoint hardening

Made-with: Cursor
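The "singleton loading plus semaphore concurrency control" shape of the reranker service might look like the following minimal sketch; the class name matches the commit, but the scoring placeholder and `max_concurrency` default are illustrative assumptions, not the real model call:

```python
import asyncio

class RerankerService:
    # Hypothetical sketch: a lazily created singleton whose scoring calls
    # are capped by a semaphore so at most N run concurrently.
    _instance = None

    def __init__(self, max_concurrency: int = 2):
        self._sem = asyncio.Semaphore(max_concurrency)

    @classmethod
    def get(cls) -> "RerankerService":
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    async def rerank(self, query: str, docs: list[str], top_n: int = 3) -> list[str]:
        async with self._sem:  # bound concurrent model invocations
            scores = await asyncio.to_thread(self._score, query, docs)
        ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]

    def _score(self, query: str, docs: list[str]) -> list[float]:
        # Placeholder scorer by word overlap; the real service would run
        # the Qwen3 reranker model here, with graceful fallback on failure.
        return [float(len(set(query.split()) & set(d.split()))) for d in docs]

docs = ["gpu memory manager", "rss feed parser", "gpu cache cleanup"]
top = asyncio.run(RerankerService.get().rerank("gpu memory", docs, top_n=2))
```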
…cases
- Add 4 new test modules covering projects, papers, keywords, search, dedup, chat, RAG, writing, conversations, subscriptions, tasks, and settings APIs
- Support real_llm marker for Volcengine-dependent tests (2 tests)
- Verify SSE streaming events (start, text-delta, finish, [DONE])
- Test new reranker and RAG parameter exposure in Chat/RAG endpoints
- All 370 tests pass (2 skipped for real_llm when provider not configured)

Made-with: Cursor
…arch
- Document all 76 backend API endpoints with parameters and flags
- Add brainstorm docs for backend review and config/RAG/testing sessions
- Add implementation plans with acceptance criteria and research insights
- Include RAG retrieval optimization best practices research

Made-with: Cursor
…skipped)

Full end-to-end test suite against a live backend with Volcengine LLM:
- PDF upload and background processing (pdfplumber fallback)
- RAG index build, stats, and query with real LLM answers
- SSE streaming chat (basic + RAG-enhanced)
- Writing assistant (summarize, citations, review outline, gap analysis)
- Conversation persistence and settings APIs
- Auto-skips when server is unreachable

Made-with: Cursor
…ensive E2E tests
- Add ocr_parallel_limit config for controlling concurrent OCR tasks
- Refactor paper_processor.py from serial to parallel OCR with asyncio.gather, semaphore-based concurrency control, and round-robin GPU assignment
- Support CPU-only, single-GPU, and multi-GPU environments gracefully
- Add MinerU client unit tests (mocked HTTP) and E2E integration tests
- Add stress tests: 8-PDF concurrent upload, concurrent RAG queries, concurrent chat streams
- Add quality comparison tests: MinerU vs pdfplumber extraction metrics
- Add GPU utilization monitoring via nvidia-smi sampling during stress tests
- Enhance existing E2E tests with MinerU parsing verification
- Add MinerU deployment guide (docs/solutions/deployment/mineru-setup-guide.md)
- Add OCR_PARALLEL_LIMIT to .env.example

Test results: 394 unit/integration passed, 37 E2E passed (across 4 test suites)

Made-with: Cursor
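The parallel-OCR pattern named above (asyncio.gather + semaphore + round-robin GPU assignment) can be sketched like this; function and parameter names are illustrative, not the actual paper_processor.py API:

```python
import asyncio
from itertools import cycle

async def process_pdfs(paths: list[str], gpu_ids: list[str],
                       parallel_limit: int = 2) -> list[tuple[str, str]]:
    # Hypothetical sketch: a semaphore caps in-flight OCR jobs and
    # devices are handed out round-robin; an empty gpu_ids list
    # degrades gracefully to CPU-only operation.
    sem = asyncio.Semaphore(parallel_limit)
    devices = cycle(gpu_ids or ["cpu"])
    assignments = [(path, next(devices)) for path in paths]

    async def run_one(path: str, device: str) -> tuple[str, str]:
        async with sem:
            # The real code would invoke the OCR engine on `device`
            # via asyncio.to_thread here.
            await asyncio.sleep(0)
            return (path, device)

    # gather preserves input order, so results line up with `paths`.
    return await asyncio.gather(*(run_one(p, d) for p, d in assignments))

out = asyncio.run(process_pdfs(["a.pdf", "b.pdf", "c.pdf"], ["cuda:0", "cuda:1"]))
```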
- Add huggingface-hub as explicit dependency in pyproject.toml
(was missing, causing RAG index build to fail with ImportError)
- Add GET /papers/{paper_id}/chunks API endpoint with ChunkRead schema
(test_paper_chunks_have_sections was skipped because endpoint didn't exist)
- Implement smart GPU selection: _pick_best_gpu() chooses the device
with the most free memory instead of always using cuda:0
- Add CUDA OOM auto-retry in RAG index build endpoint: clears GPU cache,
reloads embedding model onto best available GPU, and retries
- Reduce embedding batch_size from 32 to 8 to lower peak GPU memory
- Reuse detect_gpu() in reranker_service for consistent GPU selection
- Add _cleanup_gpu_memory() (gc.collect + empty_cache) before model loads
- Add retry logic for flaky LLM responses in test_rag_query_with_real_llm
- Update test assertions for new cuda:N device string format
Results: 28/29 E2E tests pass (previously 27/29 with 2 skipped + failures)
Made-with: Cursor
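The core of a `_pick_best_gpu()`-style selection reduces to choosing the device with the most free memory. The sketch below takes the free-memory readings as a plain dict (in real code they would come from something like `torch.cuda.mem_get_info` per device); the function name and CPU fallback behavior are assumptions for illustration:

```python
def pick_best_gpu(free_mem_by_device: dict[str, int]) -> str:
    # Hypothetical sketch: given free-memory readings per device,
    # return the "cuda:N" device with the most free bytes, or "cpu"
    # when no GPU is visible.
    if not free_mem_by_device:
        return "cpu"
    return max(free_mem_by_device, key=free_mem_by_device.__getitem__)

# cuda:1 has 10 GiB free vs 2 GiB on cuda:0, so it wins.
device = pick_best_gpu({"cuda:0": 2 << 30, "cuda:1": 10 << 30})
```

The OOM auto-retry described above would call this after clearing the GPU cache, then reload the embedding model onto the returned device.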
Three presets (conservative/balanced/aggressive) control batch sizes, parallelism, and GPU pinning across embedding, reranker, and OCR services. Users can override any parameter individually via .env. Default mode is balanced for backward compatibility; .env set to conservative for current debugging phase with CUDA_VISIBLE_DEVICES=6,7. Made-with: Cursor
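The preset-with-override resolution could look like the following sketch; the `PERFORMANCE_MODE` key, the field names, and the numeric values are hypothetical placeholders, not the project's actual configuration schema:

```python
# Hypothetical sketch: each preset bundles batch sizes and parallelism;
# any individual field can still be overridden via the environment (.env).
PRESETS: dict[str, dict[str, int]] = {
    "conservative": {"embed_batch_size": 4, "ocr_parallel_limit": 1},
    "balanced": {"embed_batch_size": 8, "ocr_parallel_limit": 2},
    "aggressive": {"embed_batch_size": 32, "ocr_parallel_limit": 4},
}

def resolve_config(env: dict[str, str]) -> dict[str, int]:
    mode = env.get("PERFORMANCE_MODE", "balanced")  # balanced by default
    cfg = dict(PRESETS[mode])
    for key in cfg:
        # Individual .env overrides win over the chosen preset.
        override = env.get(key.upper())
        if override is not None:
            cfg[key] = int(override)
    return cfg

cfg = resolve_config({"PERFORMANCE_MODE": "conservative", "OCR_PARALLEL_LIMIT": "3"})
```

This mirrors the behavior described above: the preset sets the baseline, and a single overridden parameter does not disturb the rest.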
…ts across 5 phases

Phase 1 (P0 Critical):
- Fix OCR blocking event loop with asyncio.to_thread()
- Implement pipeline cancellation with shared state + asyncio.Task.cancel()
- Add SSRF prevention (url_validator.py) + DOI format validation
- Save asyncio.create_task references to prevent GC

Phase 2 (API Consistency):
- Unify error responses: HTTPException + ValidationError → ApiResponse format
- Strengthen Schema validation: Literal types, max_length, ge/le constraints
- Fix non-serializable ValueError in validation error handler

Phase 3 (API Completion):
- Persist pipeline state to Task table
- Add pipeline list endpoint + typed ResumeRequest
- Add batch delete papers endpoint
- Add composite indexes (paper/task project+status) + Alembic migration

Phase 4 (MCP & Middleware):
- Add 4 MCP tools: summarize_papers, generate_review_outline, analyze_gaps, manage_keywords
- Add MCP input validation (top_k, max_results bounds)
- Add per-endpoint rate limiting (chat 30/min, OCR 5/min, RAG 5/min, pipeline 10/min)
- Add subscription auto_import parameter
- Remove llm_client.py shim, unify LLM imports
- Expand Schema __init__.py exports

Phase 5 (WebSocket & Polish):
- Add WebSocket ConnectionManager with room-based broadcasts
- Add pipeline WebSocket endpoint for real-time status
- Add /health endpoint
- Improve CORS config (expose_headers, max_age)
- Restrict API key to header-only (no query params)
- Add project export/import endpoints
- Disable rate limiting in test environment
- Add 33 new tests (url_validator, middleware, batch delete, export/import, WS manager, schema validation)
- Fix existing tests for new error format and Literal constraints

409 tests passing, ruff clean.

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>

Made-with: Cursor
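The SSRF check in a `url_validator.py`-style module boils down to resolving the hostname and rejecting non-global addresses. The sketch below is a minimal assumed shape of `validate_url_safe` (the real service would run the blocking `getaddrinfo` via `asyncio.to_thread`, as a later commit notes):

```python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_url_safe(url: str) -> bool:
    # Hypothetical sketch of the SSRF guard: only http/https URLs whose
    # host resolves exclusively to globally routable addresses pass.
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        # is_global is False for private, loopback, link-local,
        # and reserved ranges.
        if not addr.is_global:
            return False
    return True

# Loopback target is rejected, as is a non-http scheme.
blocked = validate_url_safe("http://127.0.0.1/feed.xml")
```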
…g gaps fix
- Extract hardcoded constants to config.py (S2 API, rewrite timeout, title similarity threshold, app version)
- Unify citation_graph_service error handling to use HTTPException instead of returning 200 with error dict
- Narrow rewrite.py exception handling from broad Exception to specific types
- Use Path.is_relative_to() for safer path validation in pipelines
- Add LLMConfigResolver unit tests (12 tests covering from_env/from_merged)
- Add RerankerService unit tests (7 tests covering caching and fallback)
- Add MCP tool tests for all 7 previously untested tools (20 new tests)
- Add Pipeline real PDF integration tests with HITL flow
- Add Chat tool_mode tests for citation_lookup, review_outline, gap_analysis

Total: 498 tests passing (up from ~409)

Made-with: Cursor
…ocess control
- Add GPUModelManager with TTL-based auto-unloading (default 5min idle)
- Add MinerUProcessManager for auto start/stop of MinerU subprocess
- Refactor embedding_service and reranker_service to use GPUModelManager
- Add OCRService.close() and context manager for explicit GPU cleanup
- Add GPU monitoring API: GET /api/v1/gpu/status, POST /api/v1/gpu/unload
- Integrate managers into FastAPI lifespan (startup/shutdown)
- Add config fields: model_ttl_seconds, mineru_auto_manage, mineru_ttl_seconds
- Add 30 new tests (GPUModelManager, MinerUProcessManager, GPU API)

Models are loaded on-demand and released after idle timeout to minimize GPU memory usage when the system is not actively processing requests.

Made-with: Cursor
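The load-on-demand, unload-after-idle lifecycle can be sketched as below; the class name matches the commit, but the `loader`/`clock` injection and method names are assumptions made so the sketch stays testable without a GPU:

```python
import time

class GPUModelManager:
    # Hypothetical sketch of TTL-based unloading: the model is loaded
    # on first use and dropped once idle longer than ttl_seconds.
    def __init__(self, loader, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._model = None
        self._last_used = 0.0

    def get(self):
        self.maybe_unload()
        if self._model is None:
            self._model = self._loader()  # load on demand
        self._last_used = self._clock()
        return self._model

    def maybe_unload(self) -> None:
        # Invoked from get() and, in the real service, from a periodic
        # background task in the FastAPI lifespan.
        if self._model is not None and self._clock() - self._last_used > self._ttl:
            self._model = None  # real code would also empty the CUDA cache

loads: list[int] = []
fake_now = [0.0]
mgr = GPUModelManager(loader=lambda: loads.append(1) or "model",
                      ttl_seconds=300.0, clock=lambda: fake_now[0])
mgr.get()              # first use loads the model
fake_now[0] = 100.0
mgr.get()              # within the TTL: the loaded model is reused
fake_now[0] = 500.0
mgr.maybe_unload()     # 400s idle > 300s TTL: the model is released
reloaded = mgr.get()   # the next use triggers a fresh load
```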
The current conda version does not support the --no-banner argument, causing MinerU auto-start to silently fail and fall back to pdfplumber. Made-with: Cursor
…rity, resource leaks
- Fix ResolvedConflict missing new_paper field causing keep_new data loss
- Add merge action support in apply_resolution_node
- Extract pipeline cancellation to shared module, fix memory leak
- Wrap blocking socket.getaddrinfo/process.wait in asyncio.to_thread
- Fix fitz.open resource leak with context manager
- Add SSRF validation for subscription feed URLs
- Add project existence checks for rag/subscription/search endpoints

Made-with: Cursor
…, input validation

Phase 2: Data integrity + Pipeline persistence
- Add Paper (project_id, doi) unique constraint with Alembic migration
- Replace MemorySaver with AsyncSqliteSaver for pipeline checkpointing
- Add pipeline_checkpoint_db config field

Phase 3: Code quality refactoring
- Extract GPU memory cleanup to shared gpu_utils.py
- Unify OCR calls to use process_pdf_async (MinerU priority)
- Fix LLM config resolver temperature/max_tokens fallback
- Fix hardcoded /tmp path in OCR service
- Replace lambda with explicit helper functions in embedding_service
- Add engine.dispose() on application shutdown

Phase 4: Input validation + API consistency
- Add unified PaginationParams for all list endpoints
- Add Literal type constraints for dedup strategy and crawler priority
- Add SearchExecuteRequest Pydantic model for search API
- Add typed Pydantic models for project import data

Made-with: Cursor
- Add 6 unit tests for pdf_metadata service (normal/corrupted/no-doi/crossref)
- Extend paper API tests with chunks and 404 coverage
- Add shared fixtures to conftest.py for new tests

Made-with: Cursor
…imits, indexes
- Add summary to all API endpoints for OpenAPI documentation
- Unify SSE error format with format_sse_error helper
- Add rate limiting to writing stream endpoint
- Extract citation error messages to constants
- Add reranker top_n/batch_size documentation
- Add Keyword parent_id index with Alembic migration
- Update frontend subscription API for pagination compatibility

Made-with: Cursor
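A `format_sse_error`-style helper is small enough to sketch in full; this assumed shape emits the unified `event: error` frame with a JSON payload that the PR summary describes:

```python
import json

def format_sse_error(code: str, message: str) -> str:
    # Hypothetical sketch of the unified SSE error frame: an "error"
    # event whose data line carries a JSON payload, terminated by the
    # blank line that ends an SSE message.
    payload = json.dumps({"code": code, "message": message})
    return f"event: error\ndata: {payload}\n\n"

frame = format_sse_error("rate_limited", "Too many requests")
```

Every streaming endpoint yielding this one shape lets clients register a single `error` event listener instead of parsing ad-hoc error text.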
…and gpu_utils
- Disable paper DOI unique constraint in dedup test fixtures
- Update search tests to use JSON body instead of query params
- Fix pipeline tests for task cleanup timing
- Update gpu_model_manager tests to mock gpu_utils.gc
- Mock validate_url_safe in subscription tests for SSRF bypass

Made-with: Cursor
Two-layer safety net for GPU cleanup on all exit scenarios:

Layer 1 (in-process safety net):
- atexit handler for sync cleanup (GPU unload + MinerU kill)
- SIGHUP handler for terminal close
- Enhanced MinerU stop() kills external processes by port lookup
- PID file for watchdog coordination

Layer 2 (external watchdog script):
- Independent process monitors Omelette via PID file
- Cleans up GPU resources after any exit (including kill -9, OOM)
- Supports daemon mode for background operation

Covers: Ctrl+C, kill, kill -9, OOM/crash, terminal close

Made-with: Cursor
…atures Add GPU TTL, MinerU auto-management, watchdog, and Alembic migration instructions to both EN/ZH README files. Sync .env.example with new config options introduced in this branch. Made-with: Cursor
Summary
This PR delivers a comprehensive backend optimization across 20 commits and 136 changed files (+14,731 / -811 lines), covering the following major areas:
🔧 Core Improvements
- Wrap blocking calls (socket.getaddrinfo, subprocess.wait/read, fitz.open) with asyncio.to_thread() to prevent event loop blocking
- Remove redundant db.commit() calls in services that were already committed by callers
- Replace bare except: pass with proper logging and re-raising
- Centralize all LLM/VLM prompts into the app/prompts/ module

⚡ GPU Resource Management (New Feature)
- Models auto-unload after a configurable idle TTL (MODEL_TTL_SECONDS)
- conservative/balanced/aggressive presets for batch sizes and parallelism
- GPU monitoring via GET /gpu/status and POST /gpu/unload endpoints
- atexit + SIGHUP handlers ensure GPU resources are released on program exit
- scripts/gpu_watchdog.py daemon monitors process health and cleans up after crashes

🔒 Security Enhancements
- SSRF prevention: url_validator.py blocks requests to private/reserved IPs in RSS feeds and crawler
- Add project existence checks via Depends(get_project) to RAG, subscription, and search endpoints
- Add rate limiting via slowapi to the writing stream endpoint

🗄️ Data Integrity
- UniqueConstraint("project_id", "doi") prevents duplicate papers at the DB level
- Switch pipeline checkpointing from MemorySaver to AsyncSqliteSaver for checkpoint durability
- Add composite indexes on (project_id, status) and keyword.parent_id
- Extract pipeline cancellation to the shared pipelines/cancellation.py module, removing a reverse dependency

🐛 Bug Fixes
- Add missing new_paper field in the ResolvedConflict schema that caused the keep_new action to lose data
- Fix leaked fitz.open() file handle in the OCR service
- Fix LLM temperature/max_tokens not respecting user-defined settings
- Remove unsupported --no-banner flag from the conda run command

🧪 Testing (178 → 526 tests)
📝 Code Quality
- Unified pagination via PaginationParams/KeywordPaginationParams Pydantic models
- Unified SSE error format event: error\ndata: {"code": ..., "message": ...} across all streaming endpoints

📖 Documentation
- Document all 76 backend API endpoints in docs/api-endpoints.md
- Add design notes under docs/brainstorms/ and docs/plans/

Test Plan
- All tests pass (pytest tests/ -v)
- ruff check and ruff format clean
- Alembic migration applied (alembic upgrade head)
- Verified MinerU auto-start via conda run (without --no-banner)