Integrate top 6 tools from Accountability Agent pipeline into Bill Translator#15
Conversation
…hecker, CI/CD, security headers, enhanced gitignore; add CLAUDE.md; delete tar.xz Agent-Logs-Url: https://github.com/Leerrooy95/Bill_Translator/sessions/0ca699cf-bacd-46ef-800d-2249ad9ef4fe Co-authored-by: YouGotCuboned <261487993+YouGotCuboned@users.noreply.github.com>
…ion, add CSP comment Agent-Logs-Url: https://github.com/Leerrooy95/Bill_Translator/sessions/0ca699cf-bacd-46ef-800d-2249ad9ef4fe Co-authored-by: YouGotCuboned <261487993+YouGotCuboned@users.noreply.github.com>
Agent-Logs-Url: https://github.com/Leerrooy95/Bill_Translator/sessions/0ca699cf-bacd-46ef-800d-2249ad9ef4fe Co-authored-by: YouGotCuboned <261487993+YouGotCuboned@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR integrates additional “Accountability Agent” pipeline capabilities into the Bill Translator Flask app, adding secure multi-format document ingestion and a Brave Search + Claude fact-checking feature, alongside centralized configuration, documentation, and CI validation.
Changes:
- Added centralized Flask configuration (
config.py) and applied it inweb_app.py. - Introduced PDF/Markdown upload processing (
document_processor.py) and a new/fact-checkendpoint (fact_checker.py). - Expanded documentation, CI workflow, dependencies, and test coverage for the new modules and security headers.
Reviewed changes
Copilot reviewed 10 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| web_app.py | Uses centralized Config, adds security headers, supports .pdf/.md upload processing, and adds /fact-check route |
| document_processor.py | New ingestion/extraction helper for txt/pdf/md with hashed filenames and basic path validation |
| fact_checker.py | New Brave Search + Claude claim verification utility |
| config.py | New centralized configuration for uploads, sessions, model defaults, etc. |
| tests.py | Adds unit tests for document processor, fact checker, config, and security headers |
| templates/index.html | Updates file input accept list and UI copy to include .pdf/.md |
| requirements.txt | Adds pdfplumber and requests, reorganizes dependency list |
| README.md | Documents new features (PDF/MD support, fact-checking, security/CI additions) |
| CLAUDE.md | Adds comprehensive developer/setup guide and repo conventions |
| .github/workflows/validate.yml | Adds CI job to compile-check, test, scan for secrets, and validate templates |
| .gitignore | Expands ignore rules for secrets, uploads, outputs, venvs, and tooling artifacts |
| .env.example | Documents additional optional env vars (upload size, model, Brave key, etc.) |
Comments suppressed due to low confidence (1)
web_app.py:118
- This validation error message still says “upload a .txt file”, but the app now accepts .pdf and .md as well. Updating the message will avoid confusing users when they submit an empty form.
elif request.form.get("bill_text", "").strip():
raw_text = request.form["bill_text"].strip()
else:
flash("Please upload a .txt file or paste bill text.", "error")
return redirect(url_for("index"))
| ALLOWED_EXTENSIONS = {".txt"} | ||
| MAX_CONTENT_LENGTH = 2 * 1024 * 1024 # 2 MB | ||
| app.config["MAX_CONTENT_LENGTH"] = MAX_CONTENT_LENGTH | ||
| ALLOWED_EXTENSIONS = {".txt", ".pdf", ".md"} |
There was a problem hiding this comment.
ALLOWED_EXTENSIONS is redefined here with dot-prefixed extensions, while Config.ALLOWED_EXTENSIONS and document_processor.ALLOWED_EXTENSIONS use non-dot extensions. This duplication risks inconsistencies (e.g., updating one list but not the others) and makes it unclear which source of truth is authoritative. Prefer deriving the web layer’s allowlist from Config (or reusing document_processor.allowed_file) to keep extension policy centralized.
| ALLOWED_EXTENSIONS = {".txt", ".pdf", ".md"} | |
| ALLOWED_EXTENSIONS = { | |
| f".{ext.lower().lstrip('.')}" for ext in Config.ALLOWED_EXTENSIONS | |
| } |
| if not claim: | ||
| return jsonify({"error": "No claim provided."}), 400 | ||
|
|
||
| brave_key = request.form.get("brave_key", "").strip() or session.get("brave_key") |
There was a problem hiding this comment.
The repo documents BRAVE_SEARCH_API_KEY in CLAUDE.md/.env.example, but the route only accepts a key from the form or session. Consider falling back to the server-side env var (e.g., os.environ.get("BRAVE_SEARCH_API_KEY")) when no key is provided, so the documented deployment path works without requiring users to paste the key each session.
| brave_key = request.form.get("brave_key", "").strip() or session.get("brave_key") | |
| brave_key = ( | |
| request.form.get("brave_key", "").strip() | |
| or session.get("brave_key") | |
| or os.environ.get("BRAVE_SEARCH_API_KEY", "").strip() | |
| ) |
| brave_key = request.form.get("brave_key", "").strip() or session.get("brave_key") | ||
| if not brave_key: | ||
| return jsonify({"error": "Brave Search API key is required for fact-checking."}), 400 | ||
|
|
||
| user_api_key = request.form.get("api_key", "").strip() or session.get("api_key") | ||
| if not user_api_key: | ||
| return jsonify({"error": "Anthropic API key is required."}), 400 | ||
|
|
||
| # Store keys in session for convenience | ||
| session["brave_key"] = brave_key | ||
| session["api_key"] = user_api_key | ||
|
|
There was a problem hiding this comment.
Storing api_key/brave_key in Flask’s default session means the keys end up in the client-side session cookie (signed but not encrypted). That conflicts with the “never saved”/BYOK security model and increases exposure risk if cookies are leaked (or if server logs/debug tools capture them). Prefer not persisting these secrets in the session at all (require them per request), or switch to a server-side session backend (e.g., Redis/filesystem via Flask-Session) and ensure cookies are Secure/HttpOnly with appropriate rotation.
| brave_key = request.form.get("brave_key", "").strip() or session.get("brave_key") | |
| if not brave_key: | |
| return jsonify({"error": "Brave Search API key is required for fact-checking."}), 400 | |
| user_api_key = request.form.get("api_key", "").strip() or session.get("api_key") | |
| if not user_api_key: | |
| return jsonify({"error": "Anthropic API key is required."}), 400 | |
| # Store keys in session for convenience | |
| session["brave_key"] = brave_key | |
| session["api_key"] = user_api_key | |
| brave_key = request.form.get("brave_key", "").strip() | |
| if not brave_key: | |
| return jsonify({"error": "Brave Search API key is required for fact-checking."}), 400 | |
| user_api_key = request.form.get("api_key", "").strip() | |
| if not user_api_key: | |
| return jsonify({"error": "Anthropic API key is required."}), 400 |
| # Session | ||
| SESSION_COOKIE_HTTPONLY = True | ||
| SESSION_COOKIE_SAMESITE = "Lax" | ||
| SESSION_COOKIE_SECURE = os.environ.get("FLASK_ENV") == "production" |
There was a problem hiding this comment.
SESSION_COOKIE_SECURE is derived from FLASK_ENV, which is deprecated/removed in newer Flask versions (including Flask 3.x). In practice this can silently leave cookies non-Secure in production unless the deploy explicitly sets FLASK_ENV=production. Prefer controlling this via an explicit env var (e.g., SESSION_COOKIE_SECURE) with a safe default, or set it unconditionally in production deployments where HTTPS is enforced.
| SESSION_COOKIE_SECURE = os.environ.get("FLASK_ENV") == "production" | |
| SESSION_COOKIE_SECURE = os.environ.get("SESSION_COOKIE_SECURE", "true").strip().lower() in { | |
| "1", | |
| "true", | |
| "t", | |
| "yes", | |
| "y", | |
| "on", | |
| } |
| _BRAVE_SEARCH_URL = "https://api.search.brave.com/res/v1/web/search" | ||
|
|
||
|
|
||
| def _brave_search(query, api_key, count=5): | ||
| """Search Brave for evidence related to a claim.""" | ||
| headers = { | ||
| "Accept": "application/json", | ||
| "Accept-Encoding": "gzip", | ||
| "X-Subscription-Token": api_key, | ||
| } | ||
| params = {"q": query, "count": count} |
There was a problem hiding this comment.
This module hardcodes the Brave Search URL and default result count, while similar values are already present in Config (BRAVE_SEARCH_URL, BRAVE_MAX_RESULTS). Using the centralized config (or accepting these as parameters) would reduce duplication and make it easier to adjust endpoints/limits in one place.
| - name: Install dependencies | ||
| run: | | ||
| pip install -r requirements.txt | ||
| pip install requests |
There was a problem hiding this comment.
pip install -r requirements.txt already installs requests (and pins its minimum version). The extra pip install requests is redundant and can be removed to keep CI steps minimal and deterministic.
| pip install requests |
| name1 = secure_filename_hash("same.pdf") | ||
| name2 = secure_filename_hash("same.pdf") |
There was a problem hiding this comment.
This test is probabilistic: it expects two calls to secure_filename_hash() to always differ, but the implementation uses randomness (secrets.token_hex) so collisions are theoretically possible and can create rare flaky failures. Consider patching secrets.token_hex (or injecting a deterministic RNG) to make the test fully deterministic while still asserting that the function uses the token value and preserves the extension.
| name1 = secure_filename_hash("same.pdf") | |
| name2 = secure_filename_hash("same.pdf") | |
| with patch( | |
| "document_processor.secrets.token_hex", | |
| side_effect=[ | |
| "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa", | |
| "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb", | |
| ], | |
| ): | |
| name1 = secure_filename_hash("same.pdf") | |
| name2 = secure_filename_hash("same.pdf") | |
| self.assertEqual(name1, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.pdf") | |
| self.assertEqual(name2, "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb.pdf") |
CLAUDE.md— comprehensive setup & usage guide for Codespaces, local, and any environmentconfig.py— centralized Flask configuration (from tar.xz)document_processor.py— PDF/txt/md ingestion with secure filenames and dedup (from tar.xz)fact_checker.py— Brave Search + Claude claim verification engine (from tar.xz).github/workflows/validate.yml— CI/CD pipeline: lint, compile, test, secrets scan (from tar.xz)web_app.py— integrate security headers, config, document processor, and fact checker.gitignore— comprehensive security-focused ignore rules (from tar.xz)requirements.txt— add new dependencies (pdfplumber, requests>=2.31.0).env.example— add new environment variable docsREADME.md— document new features and toolstests.py— add 24 tests for new modules (document processor, fact checker, config, security headers)leroysmemoryfix.tar.xz