add pdf converter#42
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
This PR adds PDF to EPUB conversion functionality to the media library application. The implementation includes a conversion library and an API endpoint that allows users to convert existing PDF files to EPUB format.
Key Changes
- New
pdf-to-epub.tslibrary that converts PDF buffers to EPUB format using pdf-lib for PDF parsing and @lesjoursfr/html-to-epub for EPUB generation - New
/api/media/convert-pdf-to-epubAPI endpoint supporting both single media item conversion and batch conversion of all PDFs - Integration with existing media file management system to store generated EPUBs alongside original PDFs
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| src/lib/pdf-to-epub.ts | Core conversion library that extracts PDF metadata and structure, generates HTML content, and creates EPUB files. Currently creates placeholder content rather than extracting actual PDF text. |
| src/app/api/media/convert-pdf-to-epub/route.ts | API endpoint that handles conversion requests, fetches PDF files from blob storage, invokes the conversion library, and stores resulting EPUBs back to the media database. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const body = await req.json() | ||
| const { mediaId, all } = body |
There was a problem hiding this comment.
The API lacks input validation for the request body structure. When 'all' is provided, it should be validated as a boolean type. Currently, any truthy value for 'all' (including strings like 'false' or 'no') would trigger batch conversion. Add explicit type validation to ensure 'all' is a boolean value when present.
| // eslint-disable-next-line @typescript-eslint/no-explicit-any | ||
| async function extractPdfContent(pdfDoc: any, pageCount: number): Promise<string> { |
There was a problem hiding this comment.
Using 'any' type for pdfDoc parameter loses type safety and makes the code harder to maintain. While the comment acknowledges that pdf-lib doesn't have built-in text extraction, the type could be more specific. Consider importing and using the actual PDFDocument type from pdf-lib for better type safety and IDE autocomplete support.
| for (let i = 1; i <= pageCount; i++) { | ||
| htmlContent += `<div class="page">\n` | ||
| htmlContent += `<h2>Page ${i}</h2>\n` | ||
| htmlContent += `<p><em>[Content from page ${i} of the original PDF]</em></p>\n` | ||
| htmlContent += `</div>\n\n` | ||
| } |
There was a problem hiding this comment.
The conversion process creates placeholder content instead of extracting actual text from the PDF. This means users will get an EPUB with only page markers and no actual content. While the comments acknowledge this limitation, the API provides no indication to callers that the conversion is incomplete. Consider either implementing actual text extraction (e.g., using pdfjs-dist), or returning a warning/status field in the response indicating the conversion is limited to structure only.
| if (mediaId) { | ||
| // Convert a specific media item | ||
| console.log('[Convert PDF to EPUB] Converting single media:', mediaId) | ||
| const result = await convertMediaPdfToEpub(mediaId) | ||
| results.push(result) | ||
| } else if (all) { | ||
| // Convert all PDFs that don't have EPUBs yet | ||
| console.log('[Convert PDF to EPUB] Converting all PDFs without EPUBs') | ||
|
|
||
| // Find all media items that have PDF files but no EPUB files | ||
| const mediaWithPdfs = await prisma.media.findMany({ | ||
| where: { | ||
| deletedAt: null, | ||
| files: { | ||
| some: { | ||
| fileType: 'pdf' | ||
| } | ||
| } | ||
| }, | ||
| include: { | ||
| files: true | ||
| } | ||
| }) | ||
|
|
||
| console.log(`[Convert PDF to EPUB] Found ${mediaWithPdfs.length} media items with PDFs`) | ||
|
|
||
| // Filter to only those without EPUBs | ||
| const mediaNeedingConversion = mediaWithPdfs.filter(media => | ||
| !media.files.some(file => file.fileType === 'epub') | ||
| ) | ||
|
|
||
| console.log(`[Convert PDF to EPUB] ${mediaNeedingConversion.length} need EPUB conversion`) | ||
|
|
||
| // Convert each one | ||
| for (const media of mediaNeedingConversion) { | ||
| const result = await convertMediaPdfToEpub(media.id) | ||
| results.push(result) | ||
| } |
There was a problem hiding this comment.
This endpoint lacks authorization checks to verify that the user has permission to convert the specified media items. Unlike other media endpoints in this codebase (e.g., /api/media/[id]/route.ts), which check that existingMedia.uploadedBy === session.user.id, this endpoint allows any authenticated user to convert any media item by ID or all PDFs in the system. This could allow users to trigger expensive conversion operations on media they don't own, and potentially enumerate or access other users' media. Add ownership verification before processing conversions.
| } else if (all) { | ||
| // Convert all PDFs that don't have EPUBs yet | ||
| console.log('[Convert PDF to EPUB] Converting all PDFs without EPUBs') | ||
|
|
||
| // Find all media items that have PDF files but no EPUB files | ||
| const mediaWithPdfs = await prisma.media.findMany({ | ||
| where: { | ||
| deletedAt: null, | ||
| files: { | ||
| some: { | ||
| fileType: 'pdf' | ||
| } | ||
| } | ||
| }, | ||
| include: { | ||
| files: true | ||
| } | ||
| }) | ||
|
|
||
| console.log(`[Convert PDF to EPUB] Found ${mediaWithPdfs.length} media items with PDFs`) | ||
|
|
||
| // Filter to only those without EPUBs | ||
| const mediaNeedingConversion = mediaWithPdfs.filter(media => | ||
| !media.files.some(file => file.fileType === 'epub') | ||
| ) | ||
|
|
||
| console.log(`[Convert PDF to EPUB] ${mediaNeedingConversion.length} need EPUB conversion`) | ||
|
|
||
| // Convert each one | ||
| for (const media of mediaNeedingConversion) { | ||
| const result = await convertMediaPdfToEpub(media.id) | ||
| results.push(result) | ||
| } |
There was a problem hiding this comment.
The 'all' parameter allows any authenticated user to trigger batch conversion of all PDFs in the system without rate limiting or pagination. This could cause performance issues, excessive resource consumption, and high costs in serverless environments. The operation processes conversions sequentially in a loop (line 70-73) which could timeout for large datasets. Consider adding: (1) pagination or batch size limits, (2) rate limiting, (3) restricting to user's own media only, or (4) implementing this as a background job rather than a synchronous API call.
| @@ -0,0 +1,160 @@ | |||
| import { EPub } from '@lesjoursfr/html-to-epub' | |||
There was a problem hiding this comment.
The package '@lesjoursfr/html-to-epub' is imported but does not appear to be listed in package.json dependencies. This will cause a runtime error when the module is loaded. Please add this package to the dependencies in package.json.
| // Clean up temp file on error | ||
| await fs.unlink(tempFilePath).catch(() => { | ||
| // Ignore cleanup errors | ||
| }) |
There was a problem hiding this comment.
The function creates a temporary file and attempts to clean it up in both the success path (line 81) and error path (line 88). However, if the file creation never happens (e.g., epub.render() fails before writing), attempting to unlink a non-existent file could cause issues. While the catch blocks ignore errors, this could mask legitimate cleanup issues. Consider checking if the file exists before attempting to delete it, or tracking whether the file was successfully created.
| // Check if EPUB already exists | ||
| const epubFile = media.files.find(f => f.fileType === 'epub') | ||
| if (epubFile) { | ||
| return { | ||
| mediaId, | ||
| success: false, | ||
| error: 'EPUB file already exists for this media' | ||
| } | ||
| } |
There was a problem hiding this comment.
When checking if an EPUB already exists (line 133-139), the function returns an error. However, this might not be the desired behavior if a user wants to re-convert a PDF (e.g., after the conversion logic is improved to include actual text extraction). Consider making this check optional through a 'force' or 'overwrite' parameter, or returning success with a message indicating the EPUB already exists rather than treating it as an error condition.
| console.log(`[PDF to EPUB] Converting PDF with ${pageCount} pages`) | ||
|
|
||
| // Extract text from PDF pages | ||
| // Note: pdf-lib doesn't have built-in text extraction | ||
| // We'll create a simple HTML representation based on page structure | ||
| const htmlContent = await extractPdfContent(pdfDoc, pageCount) | ||
|
|
||
| // Prepare EPUB metadata | ||
| const title = options.title || 'Untitled' | ||
| const author = options.author || 'Unknown Author' | ||
| const description = options.description || '' | ||
| const language = options.language || 'en' | ||
|
|
||
| console.log(`[PDF to EPUB] Creating EPUB with title: ${title}`) | ||
|
|
||
| // Create EPUB options | ||
| const epubOptions = { | ||
| title, | ||
| author, | ||
| description, | ||
| lang: language, | ||
| content: [ | ||
| { | ||
| title: 'Content', | ||
| data: htmlContent, | ||
| }, | ||
| ], | ||
| verbose: false, | ||
| } | ||
|
|
||
| // Generate EPUB (pass output as second argument) | ||
| const epub = new EPub(epubOptions, tempFilePath) | ||
| await epub.render() | ||
|
|
||
| console.log('[PDF to EPUB] EPUB file created at:', tempFilePath) | ||
|
|
||
| // Read the EPUB file into a buffer | ||
| const epubBuffer = await fs.readFile(tempFilePath) | ||
|
|
||
| console.log('[PDF to EPUB] EPUB buffer created, size:', epubBuffer.length) |
There was a problem hiding this comment.
Console.log statements are used throughout for logging (lines 39, 52, 73, 78). In a production application, consider using a structured logging library that supports different log levels (debug, info, warn, error) and can be configured per environment. This would make it easier to control verbosity and integrate with monitoring tools.
| options: ConversionOptions = {} | ||
| ): Promise<Buffer> { | ||
| // Create a temporary file path for the EPUB | ||
| const tempDir = '/tmp' |
There was a problem hiding this comment.
The hardcoded '/tmp' directory path may not exist or be writable in all deployment environments, particularly in serverless platforms like Vercel where the filesystem is read-only except for the '/tmp' directory which has size and persistence limitations. Consider using the system's temporary directory (os.tmpdir()) for better cross-platform compatibility, or ensure this code is aware of the deployment environment's constraints.
|
@copilot How gets this mechanism triggered? |
|
@andrepetsch I've opened a new pull request, #43, to work on those changes. Once the pull request is ready, I'll request review from you. |
No description provided.