feat(security): implement strict upstream validation for URIs and pagination to mitigate SPARQL injection #37
Hi @wiresio, a major driver for this architectural update is laying the groundwork for Enhancing TD Directories with MCP-driven Capabilities. Since future LLM agents interacting via MCP can unpredictably hallucinate parameters or inject structural characters, it was critical to secure the API boundary. This upstream validation ensures that malformed tool calls are intercepted before reaching the SPARQL engine or Python generators, returning a structured JSON-LD `400 Bad Request` response. Additionally, this PR introduces a workaround for the RDFlib Blank Node (BNode) pagination data loss and fixes the Windows UTF-8 subprocess encoding issue. I would be incredibly grateful for your feedback whenever you have a chance to look at this. Please let me know if anything needs to be changed. Thanks again for maintaining such an amazing project! |
|
Thanks @kaiprodevops! Please allow me some time to have a close look. |
|
Hi @kaiprodevops, here is my Claude powered feedback:
IMHO it is great to prepare the API before making use of it in an MCP server!
So, maybe despite 1., only minor changes needed for this PR. |
|
Hi @wiresio , |
Signed-off-by: kaiprodev <warmtigerca@gmail.com>
|
Hi @wiresio, |
|
Thanks @kaiprodevops, looks really good now! |
Overview
As part of an ongoing architectural review to enhance TD Directories with MCP-driven capabilities, I conducted a security audit of the TDD API. While this audit was motivated by the need to build robust guardrails for future AI agents (which can unpredictably hallucinate malformed parameters), the vulnerabilities it uncovered are critical to the security of the existing REST API.
This PR implements a Zero-Trust upstream validation layer, effectively mitigating SPARQL Injection across multiple attack vectors, and resolves a sophisticated streaming architecture bypass.
Vulnerabilities Discovered
During Red Team penetration testing, I identified two critical injection vectors:
- … `> } ;`) to break the Abstract Syntax Tree (AST) and execute unauthorized administrative commands like `DROP GRAPH`.
- The `sort_order` parameter was vulnerable to injection. More critically, I discovered an architectural flaw: because the `GET /things` endpoint uses a Python streaming generator (`yield`), validation occurring inside the database layer was executed after Flask had sent the initial HTTP headers. This bypassed the global error handler entirely, causing the WSGI server to crash mid-stream and leak raw HTML `500 Internal Server Error` traces.
- The `get_paginated_tds()` function used `ThreadPoolExecutor` to fetch TDs concurrently but failed to wait for all tasks to complete before returning results.
- … `UnicodeDecodeError` when processing TDs with international characters. This resulted in silent failures for TDs containing UTF-8 characters.

The Fix: "Shift-Left" Validation + Concurrency Hardening
To address this without disrupting the core business logic, I implemented the following:
- `tdd/validators.py`: Created a dedicated validation module using strict RFC 3986-compliant regular expressions for URIs and explicit allowlists (`ASC`/`DESC`) for pagination.
- `tdd/__init__.py`: By validating parameters before the generator is instantiated, we completely eliminated the lazy evaluation bypass.
- `tdd/errors.py`: Implemented a dedicated `SecurityValidationError` class and wired the new validators to raise it, so that malicious inputs are caught and converted into structured JSON-LD `400 Bad Request` responses, keeping the API's error contract consistent.
- `tdd/td.py`: Added explicit task completion waiting using `concurrent.futures.as_completed()` so that all concurrent TD retrieval tasks finish before results are returned. This keeps the parallel execution performance while guaranteeing data integrity.
- `tdd/common.py`: Added an explicit `encoding='utf-8'` parameter to the subprocess call, ensuring consistent UTF-8 handling across all platforms, particularly Windows.

Red Team Test Results (Proof of Concept)
I tested the endpoints using a custom-crafted AST breakout payload (`CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <urn:test> { ?s ?p ?o } } ; DROP SILENT GRAPH <ALL> ; #> { ?s ?p ?o } }`):
- Payload: `urn:test%3E%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20%7D%20%3B%20DROP%20SILENT%20GRAPH%20%3CALL%3E%20%3B%20%23`
- Before: `500` error (or crashing the stream for pagination).
- After: `400 Bad Request`.

[Before]

[After]


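The lazy evaluation bypass can be reproduced without Flask. The sketch below uses hypothetical names (`stream_late`, `stream_early`, `check_sort_order`) and placeholder rows, and assumes nothing about the real `tdd` code. It shows why a check placed inside a generator body does not run when the generator is created, and why moving the check into an ordinary function that returns the generator restores fail-fast behaviour.

```python
from typing import Iterator


class SecurityValidationError(ValueError):
    """Hypothetical stand-in for the exception class added in tdd/errors.py."""


def check_sort_order(value: str) -> None:
    # Allowlist check; anything other than ASC/DESC is rejected.
    if value.upper() not in {"ASC", "DESC"}:
        raise SecurityValidationError(f"invalid sort order: {value!r}")


def stream_late(sort_order: str) -> Iterator[str]:
    # Generator function: none of this body runs until the first next().
    # In a streaming response that first next() happens after the route
    # has already returned, so the error surfaces mid-stream.
    check_sort_order(sort_order)
    yield from ("td-1", "td-2")


def stream_early(sort_order: str) -> Iterator[str]:
    # Ordinary function: the check runs at call time, before any generator
    # exists, so the caller can still turn the error into a 400 response.
    check_sort_order(sort_order)

    def _rows() -> Iterator[str]:
        yield from ("td-1", "td-2")

    return _rows()


# Creating the late-validating generator succeeds even with bad input.
late = stream_late("ASC; DROP")

try:
    stream_early("ASC; DROP")
except SecurityValidationError as exc:
    print("rejected before streaming:", exc)
```

Calling `next(late)` is what finally raises the error, which mirrors the pre-fix behaviour described above.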