Implement PDDA/HTML upload to AWS S3 (#334)#345
Implement PDDA/HTML upload to AWS S3 (#334)#345alexbottenberg wants to merge 22 commits intomasterfrom
Conversation
Created technical specification and implementation plan for issue #334. This plan covers: - New API endpoint for PDDA HTML/HTM file uploads to XHIBIT S3 - Database migration to add artefact type column - AWS S3 integration with @aws-sdk/client-s3 - File validation (extensions, size, path traversal protection) - New artefact type LCSU for PDDA HTML uploads - Comprehensive test strategy including S3 verification Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add service-to-service API endpoint for PDDA to upload HTML/HTM files
directly to AWS S3 (XHIBIT bucket).
Changes:
- Add new libs/pdda-html-upload module with S3 upload functionality
- Create POST /api/v1/pdda-html endpoint with OAuth authentication
- Add database migration for artefact.type column (LCSU artefact type)
- Implement comprehensive validation (file type, size, path traversal)
- Add 37 unit tests with >95% coverage
- Update publication queries to set default artefact type
Technical details:
- AWS SDK integration for S3 uploads with verification
- Correlation ID support for request tracing
- File size limit (10MB default, configurable)
- S3 key format: pdda-html/YYYY/MM/DD/{uuid}.html
- Secure logging (no sensitive data)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRegisters a new PDDA HTML upload API module and package, adds route registration, S3 client and upload service, validation, types and tests, introduces a Prisma migration adding non-nullable artefact.type (backfilled to 'LIST') and updates seeds/tests and TypeScript path aliases. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant API as API (Express)
participant Auth as OAuth
participant Validator as Validation
participant S3 as AWS S3
Note over API: libs/pdda-html-upload route handler
Client->>API: POST /api/v1/pdda-html (file, artefact_type, correlation_id)
API->>Auth: authenticateApi()
Auth-->>API: 200 OK
API->>Validator: validatePddaHtmlUpload(artefact_type, file)
alt validation fails
Validator-->>API: { valid: false, error }
API-->>Client: 400 { success:false, message, correlation_id }
else validation passes
Validator-->>API: { valid: true }
API->>S3: uploadHtmlToS3(fileBuffer, filename, correlation_id)
alt upload succeeds
S3-->>API: { success:true, s3Key, bucketName }
API-->>Client: 201 { success:true, s3_key, correlation_id }
else upload fails
S3-->>API: error
API-->>Client: 500 { success:false, message, correlation_id }
end
end
Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🎭 Playwright E2E Test Results238 tests 238 ✅ 22m 14s ⏱️ Results for commit 50100db. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (13)
libs/pdda-html-upload/src/index.ts (1)
1-1: Unnecessary comment in an otherwise empty file.Per coding guidelines, comments should explain why, not what. An empty file is self-explanatory; this comment adds no value. Consider removing it or leaving the file truly empty.
As per coding guidelines, "Do not add comments unless meaningful - explain why, not what."
libs/pdda-html-upload/src/s3/s3-client.ts (1)
3-19: Consider using the AWS SDK default credential provider chain.Hardcoding credentials from environment variables bypasses the SDK's built-in credential resolution (IAM roles, instance profiles, ECS task roles, etc.), which is the recommended approach for deployed environments.
new S3Client({ region })without explicitcredentialswill automatically resolve credentials from the environment, includingAWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEYif set, plus IAM roles when available.This would simplify the code and improve deployment flexibility.
♻️ Suggested simplification
export function createS3Client(): S3Client { const region = process.env.AWS_S3_XHIBIT_REGION; - const accessKeyId = process.env.AWS_ACCESS_KEY_ID; - const secretAccessKey = process.env.AWS_SECRET_ACCESS_KEY; - if (!region || !accessKeyId || !secretAccessKey) { + if (!region) { throw new Error("Missing required AWS S3 configuration"); } return new S3Client({ - region, - credentials: { - accessKeyId, - secretAccessKey - } + region }); }docs/tickets/334/tasks.md (1)
119-129: Clarify the mapping between Key Vault secrets and environment variables.Lines 120–125 document the env vars (
AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY), while lines 127–128 reference Key Vault secrets with different names (xhibit-s3-access-key,xhibit-s3-access-key-secret). Ensure the deployment configuration clearly maps each secret to its corresponding environment variable to avoid misconfiguration.docs/tickets/334/review.md (1)
1-422: AI-generated review document committed to the repository.This is a documentation artifact rather than code. No functional concerns, but consider whether committing AI-generated review documents as permanent repo artefacts is the intended practice — they can become stale quickly and may confuse future readers about which issues were actually addressed.
libs/pdda-html-upload/src/validation/file-validation.test.ts (1)
133-163: Consider adding a test for embedded path traversal.The validation uses
includes("../")which catches traversal anywhere in the filename (e.g.,foo/../evil.html), but there's no test for this case. A quick addition would strengthen coverage.libs/pdda-html-upload/src/types.ts (2)
1-17: Boolean properties don't followis/has/canprefix convention.Per coding guidelines, boolean fields should use prefixes like
isValid,isSuccess. These are API response types so the external contract may justify deviation, but worth noting for internal consistency.As per coding guidelines: "Booleans should use is/has/can prefixes (e.g.,
isActive,hasAccess,canEdit)."
12-17:PddaHtmlUploadResponseuses snake_case property names.
s3_keyandcorrelation_iduse snake_case, which suits an external API contract but deviates from the TypeScript camelCase guideline. If the API contract is fixed, this is fine — just ensure consistency across all response types in the codebase.As per coding guidelines: "TypeScript variables should use camelCase (e.g.,
userId,caseDetails,documentId)."libs/pdda-html-upload/src/routes/v1/pdda-html.ts (2)
8-13: Max file size parsed in two places — risk of divergence.The
PDDA_HTML_MAX_FILE_SIZEenv var is parsed identically here (multer limit) and infile-validation.tsline 35. If one is updated but not the other, multer could reject a file before validation runs, or vice versa. Consider extracting a shared constant or config helper.
15-16:x-correlation-idheader type could bestring[].In Express,
req.headers[key]can returnstring | string[] | undefined. The cast tostring | undefinedsilently ignores the array case. Low risk but worth a safe coercion.Proposed fix
- const correlationId = req.headers["x-correlation-id"] as string | undefined; + const correlationIdHeader = req.headers["x-correlation-id"]; + const correlationId = Array.isArray(correlationIdHeader) ? correlationIdHeader[0] : correlationIdHeader;libs/pdda-html-upload/src/validation/file-validation.ts (1)
43-48: Path traversal check is basic — may miss encoded variants.The check only looks for literal
../and..\strings. Encoded variants like..%2For..%5Ccould bypass this. The actual risk is low since the filename is only used for extension extraction and S3 metadata (the S3 key uses a UUID), but it's worth hardening if the original filename is ever used in a file-system context downstream.Proposed hardening
- if (file.originalname.includes("../") || file.originalname.includes("..\\")) { + const decodedName = decodeURIComponent(file.originalname); + if (decodedName.includes("../") || decodedName.includes("..\\") || decodedName.includes("..%")) { return { valid: false, error: "Invalid filename" }; }libs/pdda-html-upload/src/s3/s3-upload-service.ts (2)
8-8: New S3 client created per request.
createS3Client()is called on every upload, instantiating a newS3Clienteach time. The AWS SDK client is safe to reuse and handles connection pooling internally. Consider creating the client once at module level or caching it.
30-34: User-controlledoriginalFilenamestored in S3 metadata without sanitisation.
file.originalnameis placed directly into S3 object metadata. While S3 metadata doesn't execute, excessively long or specially crafted values could cause issues with downstream consumers reading the metadata. Consider truncating or sanitising.libs/pdda-html-upload/src/s3/s3-upload-service.test.ts (1)
85-91: Missing test for HeadObject failure after successful PutObject.The test at line 85 mocks
sendto reject on the first call (PutObject). Consider adding a test where PutObject succeeds but HeadObject fails (secondsendcall rejects), to verify the error path and document the orphaned-file behaviour.Suggested additional test
+ it("should throw when HeadObject verification fails after successful upload", async () => { + mockS3Send + .mockResolvedValueOnce({}) // PutObject succeeds + .mockRejectedValueOnce(new Error("HeadObject failed")); // HeadObject fails + + const fileBuffer = Buffer.from("<html></html>"); + + await expect(uploadHtmlToS3(fileBuffer, "test.html")).rejects.toThrow( + "The file could not be uploaded to storage. Try again." + ); + });
| **I WANT** to implement the PDDA/HTML functionality in CaTH | ||
| **SO THAT** the Crown hearing lists can be published in CaTH | ||
|
|
||
| **TECHNIAL SPECIFICATION** |
There was a problem hiding this comment.
Typo: "TECHNIAL" → "TECHNICAL".
| • AWS S3 SDK should be set up in API to communicate with S3 | ||
| • A new value LCSU is added to the ArtefactType which is to be used by PDDA when sending html to PIP APIM | ||
| • Functional test is added to the API to upload the html file | ||
| • Functional test is added to upload the html/htm file to file to S3 bucket and check the uploaded file exists in the S3 bucket |
There was a problem hiding this comment.
Duplicated phrase: "file to file to".
Should likely read "file to S3 bucket".
🧰 Tools
🪛 LanguageTool
[grammar] ~34-~34: This phrase is duplicated. You should probably use “file to” only once.
Context: ...al test is added to upload the html/htm file to file to S3 bucket and check the uploaded file e...
(PHRASE_REPETITION)
| ________________________________________ | ||
| **Accessibility** | ||
| • Ensure error messages returned by the upload flow are concise, consistent, and machine-readable for any consuming UI (where applicable). | ||
| • Ensure logs/audit events include correlation identifiers to support support/operations without exposing sensitive credentials or file contents. |
There was a problem hiding this comment.
Duplicated word: "support support/operations".
Should be "support operations" or "support/operations".
🧰 Tools
🪛 LanguageTool
[duplication] ~132-~132: Possible typo: you repeated a word.
Context: ...ents include correlation identifiers to support support/operations without exposing sensitive c...
(ENGLISH_WORD_REPEAT_RULE)
| } catch (error) { | ||
| console.error("PDDA HTML upload failed", { | ||
| error: error instanceof Error ? error.message : "Unknown error", | ||
| correlationId | ||
| }); | ||
|
|
||
| const response: PddaHtmlUploadResponse = { | ||
| success: false, | ||
| message: error instanceof Error ? error.message : "Internal server error", | ||
| correlation_id: correlationId | ||
| }; | ||
| return res.status(500).json(response); | ||
| } |
There was a problem hiding this comment.
Internal error details may leak to the client.
If createS3Client() throws (e.g. "Missing required AWS S3 configuration"), that message is returned verbatim in the 500 response at line 62. This exposes internal infrastructure details. Consider returning a generic message for unexpected errors and reserving the error.message passthrough only for known, user-safe error types.
Proposed fix
- message: error instanceof Error ? error.message : "Internal server error",
+ message: "Internal server error",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } catch (error) { | |
| console.error("PDDA HTML upload failed", { | |
| error: error instanceof Error ? error.message : "Unknown error", | |
| correlationId | |
| }); | |
| const response: PddaHtmlUploadResponse = { | |
| success: false, | |
| message: error instanceof Error ? error.message : "Internal server error", | |
| correlation_id: correlationId | |
| }; | |
| return res.status(500).json(response); | |
| } | |
| } catch (error) { | |
| console.error("PDDA HTML upload failed", { | |
| error: error instanceof Error ? error.message : "Unknown error", | |
| correlationId | |
| }); | |
| const response: PddaHtmlUploadResponse = { | |
| success: false, | |
| message: "Internal server error", | |
| correlation_id: correlationId | |
| }; | |
| return res.status(500).json(response); | |
| } |
| const headCommand = new HeadObjectCommand({ | ||
| Bucket: bucketName, | ||
| Key: s3Key | ||
| }); | ||
| await s3Client.send(headCommand); |
There was a problem hiding this comment.
HeadObject failure after successful PutObject leaves an orphaned file in S3.
If PutObjectCommand succeeds but HeadObjectCommand fails (transient network issue, eventual consistency), the catch block at line 50 throws a generic error to the caller. The file exists in S3 but the client receives a 500 failure, with no cleanup or retry logic. This could lead to orphaned objects and potential duplicate uploads on retry.
Consider either:
- Separating the HeadObject call into its own try/catch so PutObject success is preserved, or
- Removing the HeadObject verification (PutObject success is sufficient).
Update seed scripts to include the new required 'type' field for artefacts. Changes: - apps/postgres/prisma/seed.ts: Add type='LIST' to test artefacts - e2e-tests/utils/seed-location-data.ts: Add type='LIST' to E2E test artefacts Required for database schema changes in #334. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Resolved yarn.lock conflicts by regenerating with yarn install. Changes from master: - Updated E2E test workflows - Helm values updates - SSO config improvements - GitHub secrets documentation Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add required 'type' field to blob-explorer E2E test artefact creation. Required for schema changes in #334. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated the following packages to address security vulnerabilities: - multer: 1.4.5-lts.1 → 2.0.2 (CVSS 8.7 critical) Fixes arbitrary file exposure vulnerability - @aws-sdk/client-s3: 3.712.0 → 3.989.0 Includes update to @smithy/config-resolver (CVSS 3.7) Addresses incorrect resolution of endpoints during SDK credential retrieval - @types/multer: 1.4.12 → 2.0.0 Updated type definitions to match multer 2.0.2 - qs: 6.14.1 → 6.14.2 (CVSS 3.7) Fixes prototype pollution via array brackets All unit tests passing (39/41 tasks successful). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add type: "LIST" to artefact creation in delete-court test to match the new database schema requirement. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add type: "LIST" to artefact creation in flat-file-viewing test to match the new database schema requirement. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add @db.VarChar(50) to the type field in the Artefact model to match the migration SQL which creates VARCHAR(50). Without this annotation, Prisma maps String to text in PostgreSQL, which would cause schema drift detection in future migrations. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed error handling to return generic "Internal server error" message instead of exposing raw error.message to the client. This prevents leaking internal infrastructure details such as: - AWS configuration errors - S3 bucket names or credentials - Internal service error messages Error details are still logged server-side for debugging at lines 55-58. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed HeadObjectCommand verification after PutObjectCommand success. If PutObject succeeds but HeadObject fails (transient network issue, eventual consistency), the previous implementation would throw an error to the caller even though the file was successfully uploaded to S3. This created two problems: 1. Orphaned files in S3 that the client doesn't know about 2. Client retries would create duplicate uploads PutObject success is sufficient proof that the file was uploaded. The HeadObject verification added no real value and created this problematic failure scenario. Changes: - Removed HeadObjectCommand import and usage - Updated tests to expect 1 S3 call instead of 2 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Resolved conflicts: - package.json: Updated dependency versions (qs 6.15.0, tar 7.5.9) - yarn.lock: Regenerated after dependency resolution Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changes:
- Use original filename as S3 key instead of UUID-based path structure
- Enhance S3 upload error logging to include AWS response details (status code, headers, metadata)
- Add @aws-sdk/node-http-handler dependency for future SSL certificate handling
- Fix S3 region from eu-west-2 to eu-west-1 (correct bucket region)
- Remove custom endpoint configuration that was causing CloudFront 403 errors
The S3 key now follows the format: pdda-html/{original-filename}.html
instead of: pdda-html/YYYY/MM/DD/{uuid}.html
This allows PDDA to identify uploaded files by their original names and
provides better error diagnostics for troubleshooting S3 upload issues.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Resolved conflicts: - yarn.lock: Regenerated after merging latest dependencies from master - Multiple package.json files: Updated dependency versions Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated tests to reflect the new behavior where S3 keys use the original filename (pdda-html/filename.html) instead of the previous date-based UUID format (pdda-html/YYYY/MM/DD/uuid.html). Changes: - Update "should upload file to S3 successfully" to expect exact filename match - Rename and update "should generate S3 key with date-based path" test to verify original filename is preserved - Make file extension regex checks case-insensitive (match both .html and .HTML) All 7 tests now pass. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
| } | ||
| }); | ||
|
|
||
| const postHandler = async (req: Request, res: Response) => { |
There was a problem hiding this comment.
Instead of a separate endpoint, could this be hooked into the existing endpoint with a separate path if LCSU type to match CaTH
Consolidate HTML file upload functionality into the existing /api/v1/publication endpoint to support both JSON blob ingestion and multipart file uploads. Changes: - Add content-type detection to route requests appropriately - Apply multer middleware conditionally for multipart requests - Extract PDDA HTML validation and S3 upload as reusable functions - Remove separate /api/v1/pdda-html endpoint and route files - Add comprehensive tests for both upload types (11 tests passing) Benefits: - Single endpoint for all publication uploads - Consistent authentication across upload types - Improved code reusability and maintainability Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Resolved conflicts: - tsconfig.json: Added both pdda-html-upload and pdf-generation paths - yarn.lock: Regenerated after merge Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Organize imports in correct order (external, then relative) - Remove unused consoleInfoSpy variable from tests Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
| async function handleHtmlFileUpload(req: Request, res: Response) { | ||
| const correlationId = req.headers["x-correlation-id"] as string | undefined; | ||
|
|
||
| const { artefact_type } = req.body; |
There was a problem hiding this comment.
Can this be 'type' instead of 'artefact_type', so that it is is consistent with the DB column
…ml-s3-upload # Conflicts: # apps/postgres/prisma/schema.prisma # tsconfig.json
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>



Summary
Implements service-to-service API endpoint for PDDA to upload HTML/HTM files directly to AWS S3 (XHIBIT bucket).
Changes
New Module:
libs/pdda-html-uploadPOST /api/v1/pdda-htmlauthenticateApi()middlewarepdda-html/YYYY/MM/DD/{uuid}.htmlX-Correlation-IDheaderDatabase Changes
typecolumn toartefacttable (VARCHAR 50, NOT NULL)typecolumn for performanceTesting
Request/Response Format
Successful Upload (201)
{ "success": true, "message": "Upload accepted and stored", "s3_key": "pdda-html/2026/02/11/{uuid}.html", "correlation_id": "{uuid}" }Validation Error (400)
{ "success": false, "message": "The uploaded file must be an HTM or HTML file", "correlation_id": "{uuid}" }Deployment Requirements
Environment Variables Required
Deployment Steps
yarn db:migrateAcceptance Criteria
Code Quality
anytypes)Review
Full code review completed:
docs/tickets/334/review.mdOverall Assessment: ✅ APPROVED
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Database
Integration
Tests & Documentation