Upgrade: Handled Test cases Like Large Doc chunking One doc test case…#3
Upgrade: Handled Test cases Like Large Doc chunking One doc test case…#3techieworld2 wants to merge 1 commit intomainfrom
Conversation
… and prompt injection security
There was a problem hiding this comment.
Pull request overview
This PR improves SpecGap’s robustness when analyzing large documents and handling LLM output variability, while adding additional defenses against prompt injection and enabling a single-document “self-audit” flow when only one file is provided.
Changes:
- Added safe JSON extraction/parsing utilities and updated engines to retry on parse failures.
- Introduced document sanitization + standardized document-context wrapping in prompts to reduce prompt injection risk.
- Added large-document chunking + map-reduce condensation, and smart comparison logic (single-doc audit vs cross-doc comparison).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| specgap/app/services/workflow.py | Wraps council document context with a standardized document delimiter. |
| specgap/app/services/tech_engine.py | Uses safe parsing + document-context wrapping for tech analysis prompts. |
| specgap/app/services/biz_engine.py | Uses safe parsing + document-context wrapping for legal/business analysis prompts. |
| specgap/app/services/cross_check.py | Uses safe parsing + wrapping; adds single-document audit and smart comparison routing. |
| specgap/app/services/parser.py | Sanitizes extracted text to reduce prompt injection and control characters. |
| specgap/app/services/sanitizer.py | New sanitizer + context wrapper utilities. |
| specgap/app/services/safe_parse.py | New robust JSON extraction/repair and consistent parsing helper. |
| specgap/app/services/chunker.py | New chunking + map-reduce condensation for very large documents. |
| specgap/app/services/init.py | Exposes new utilities and comparison functions at services package level. |
| specgap/app/main.py | Integrates condensation for council; switches deep/full-spectrum synthesis to smart comparison; adds single-doc support. |
| specgap/.env.example | Removes the example env file. |
Comments suppressed due to low confidence (1)
specgap/.env.example:1
- This PR deletes
.env.example, but the repository README still instructs users to runcp .env.example .envduring setup. Either keep a maintained.env.example(recommended for onboarding) or update the README and any tooling/docs that reference it so new installs don’t break.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @app.post("/api/v1/documents/classify", tags=["Documents"]) | ||
| async def classify_uploaded_document( | ||
| file: UploadFile = File(..., description="Document to classify") | ||
| ): | ||
| """ | ||
| Classify a document to determine recommended analysis agents. | ||
|
|
||
| Useful for understanding what type of document you're uploading | ||
| before running a full analysis. | ||
| """ | ||
|
|
||
| await file.seek(0) |
There was a problem hiding this comment.
These endpoint functions no longer have docstrings. In FastAPI, the function docstring is used as the operation description in the generated OpenAPI docs; removing it reduces the usefulness of /docs and /redoc. Consider restoring the docstring (or providing description= in the decorator) for this endpoint.
| """ | ||
| List saved audit records with optional filtering. | ||
| """ | ||
|
|
There was a problem hiding this comment.
The documentation string for this endpoint was removed, which means the generated OpenAPI docs will no longer include an operation description here. If the project relies on Swagger/ReDoc for usability, consider restoring the docstring (or setting summary/description on the route decorator).
| """ | |
| List stored audits with optional filtering and pagination. | |
| This endpoint returns a paginated list of audit records, which can be | |
| filtered by audit type and risk level for easier browsing of history. | |
| """ |
| matches = pattern.findall(cleaned) | ||
| if matches: | ||
| injection_count += len(matches) | ||
| cleaned = pattern.sub("[REDACTED-INSTRUCTION]", cleaned) |
There was a problem hiding this comment.
sanitize_document_text uses pattern.findall(cleaned) to count matches and then pattern.sub(...), which allocates potentially large match lists for big documents (you’re sanitizing up to 500k chars). Using pattern.subn(...) would do the replacement and return the substitution count without building a list, reducing memory/time overhead.
| matches = pattern.findall(cleaned) | |
| if matches: | |
| injection_count += len(matches) | |
| cleaned = pattern.sub("[REDACTED-INSTRUCTION]", cleaned) | |
| cleaned, num_subs = pattern.subn("[REDACTED-INSTRUCTION]", cleaned) | |
| if num_subs: | |
| injection_count += num_subs |
| "why_unrealistic": "Why this is infeasible" | ||
| }} | ||
| ], | ||
| "completeness_score": 0-100, |
There was a problem hiding this comment.
The JSON example in SINGLE_DOC_PROMPT contains non-JSON placeholders like "completeness_score": 0-100. Models often mirror the example, which can increase invalid-JSON outputs and parse retries. Use a valid JSON example value (e.g., an integer like 85) and describe the allowed range in prose instead.
| "completeness_score": 0-100, | |
| "completeness_score": 85, |
| result = safe_parse_llm_response( | ||
| response.text, | ||
| expected_keys=["critical_gaps", "ambiguity_score"] | ||
| ) | ||
|
|
There was a problem hiding this comment.
_clean_json_response is now unused after migrating to safe_parse_llm_response. Consider removing the unused helper (or updating it to call into safe_parse) to keep a single, clear JSON-parsing path.
| prompt = ( | ||
| f"You are a document analyst preparing content for {purpose}.\n" | ||
| f"This is section {idx + 1} of {len(chunks)} from a large document.\n\n" | ||
| "TASK: Extract and preserve ALL of the following from this section:\n" | ||
| "- Specific requirements, obligations, and commitments\n" | ||
| "- Financial terms, dates, deadlines, and SLAs\n" | ||
| "- Legal clauses, liability terms, and penalties\n" | ||
| "- Technical specifications and architecture decisions\n" | ||
| "- Any ambiguous or concerning language\n\n" | ||
| "Preserve EXACT QUOTES for important clauses. Be thorough — do not summarize.\n" | ||
| "Output a structured extraction, NOT a summary.\n\n" | ||
| f"--- SECTION {idx + 1}/{len(chunks)} ---\n{chunk}" | ||
| ) |
There was a problem hiding this comment.
The chunk content is injected directly into the summarization prompt (...\n{chunk}) without using wrap_as_document_context. Since this is untrusted document text, it can still contain prompt-injection strings that bypass the extraction task (even after earlier sanitization). Consider wrapping each chunk in the document-context delimiter (and/or sanitizing here) so the model is more likely to treat it as data.
| result = safe_parse_llm_response( | ||
| response.text, | ||
| expected_keys=["contradictions", "strategic_synthesis"] | ||
| ) | ||
|
|
There was a problem hiding this comment.
After switching to safe_parse_llm_response, _clean_json_response is no longer referenced in this module. To avoid dead code and confusion about which parser is canonical, consider removing _clean_json_response (or delegating to safe_parse from it) now that parsing is handled elsewhere.
| result = safe_parse_llm_response( | ||
| response.text, | ||
| expected_keys=["leverage_score", "trap_clauses"] | ||
| ) | ||
|
|
There was a problem hiding this comment.
After switching to safe_parse_llm_response, _clean_json_response is no longer referenced in this file. Consider deleting it to reduce dead code and avoid confusion about which parser to use.
| # Sanitize extracted text to prevent prompt injection (Test Case 5) | ||
| if not text.startswith("Error:"): | ||
| text = sanitize_document_text(text, max_length=500000) |
There was a problem hiding this comment.
The sanitizer max length is hard-coded to 500,000 chars. This is a policy/limit that likely belongs in configuration (e.g., a Settings field) so it can be tuned per environment and stays consistent with MAX_FILE_SIZE_MB / MAX_CONTEXT_CHARS. Consider moving this value to settings and referencing it here.
| # === MULTIPLE FILES: Real cross-document comparison === | ||
| logger.info(f"{len(file_texts)} files detected, running cross-document comparison") | ||
|
|
||
| filenames = list(file_texts.keys()) |
There was a problem hiding this comment.
Variable filenames is not used.
| filenames = list(file_texts.keys()) |
… and prompt injection security