Summary
The document processing jobs (Documents::AnalyzePdfJob and Documents::OcrJob) have no test coverage. These jobs handle PDF text extraction and OCR — the foundation for all downstream AI analysis.
Jobs Needing Tests
Documents::AnalyzePdfJob (app/jobs/documents/analyze_pdf_job.rb)
- Extracts text from PDF documents using pdftotext
- Performs page-level analysis (creates Extraction rows)
- Classifies text_quality (good, poor, no_text)
- Triggers OcrJob for scanned/image-based documents
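To make the classification step concrete, here is a minimal sketch of page splitting and text_quality classification as a pure function. It is not the job's actual code: the helper name analyze_text and both thresholds are hypothetical, and it assumes pdftotext's usual behavior of ending each page with a form feed character.

```ruby
# Hypothetical thresholds; the real cut-offs in AnalyzePdfJob are not stated in this issue.
GOOD_MIN_AVG_CHARS = 200
POOR_MIN_AVG_CHARS = 1

# Splits raw pdftotext output into pages and derives the summary fields.
# Assumes each page ends with a form feed ("\f").
def analyze_text(raw_text)
  pages = raw_text.split("\f", -1)   # -1 keeps empty pages (e.g. scanned pages)
  pages.pop if pages.last == ""      # drop the empty piece after the final form feed
  pages = pages.map(&:strip)

  page_count = pages.size
  text_chars = pages.sum(&:length)
  avg = page_count.zero? ? 0 : text_chars / page_count

  quality =
    if avg >= GOOD_MIN_AVG_CHARS
      "good"
    elsif avg >= POOR_MIN_AVG_CHARS
      "poor"
    else
      "no_text"
    end

  { page_count: page_count, text_chars: text_chars,
    avg_chars_per_page: avg, text_quality: quality }
end
```

Keeping this logic free of I/O would let the classification be tested with plain strings, without PDF fixtures.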
Test scenarios:
- Text-based PDF → extracts text, sets text_quality: good, creates Extraction rows per page
- Image-based/scanned PDF → detects low text quality, enqueues OcrJob
- Already-processed document (idempotent re-run) → clears and rebuilds extractions
- Corrupt/unreadable PDF → handled gracefully (doesn't crash)
- Updates MeetingDocument fields: extracted_text, text_chars, avg_chars_per_page, page_count
Documents::OcrJob (app/jobs/documents/ocr_job.rb)
- Runs Tesseract OCR on image-based PDFs
- Updates MeetingDocument.ocr_status and extracted text
Test scenarios:
- Scanned PDF → OCR produces text, updates extracted_text and ocr_status
- PDF with mixed text/image pages → handles correctly
- Tesseract unavailable → graceful failure with appropriate status
- Idempotent re-run
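The "Tesseract unavailable" scenario could be covered roughly as follows. This is a sketch under assumptions: run_ocr is a hypothetical stand-in for whatever OcrJob does internally, and the status strings "completed" and "failed" are placeholders for the real ocr_status values. It relies on Open3.capture3 raising Errno::ENOENT when the executable is missing.

```ruby
require "open3"

# Hypothetical stand-in for the OCR step; the real Documents::OcrJob may differ.
# `binary` is a parameter so a test can point it at a missing executable.
def run_ocr(image_path, binary: "tesseract")
  # "stdout" as the output base makes tesseract print the recognized text.
  stdout, stderr, status = Open3.capture3(binary, image_path, "stdout")
  if status.success?
    { ocr_status: "completed", extracted_text: stdout, error: nil }
  else
    { ocr_status: "failed", extracted_text: nil, error: stderr.strip }
  end
rescue Errno::ENOENT
  # The executable is not installed or not on PATH.
  { ocr_status: "failed", extracted_text: nil, error: "#{binary} not found" }
end
```

A test can then call run_ocr with a binary name that cannot exist and assert on the returned status, with no need to uninstall Tesseract on the CI machine.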
Approach
- Create small test PDF fixtures (test/fixtures/files/): one text-based, one image-based
- Stub system calls to pdftotext and tesseract where appropriate
- Test the full flow: download → analyze → OCR → extraction rows
- Verify text_quality classification logic
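One possible way to stub the pdftotext call is to pass the command runner in as an argument, so tests can substitute a fake. The sketch below assumes the job shells out through Open3.capture3. The extract_text wrapper, FakeStatus, and the two fake runners are hypothetical names for illustration, not existing code.

```ruby
require "open3"

# Hypothetical wrapper around the pdftotext call; the real job may shell out differently.
# `runner` defaults to Open3.capture3 and can be replaced in tests.
def extract_text(pdf_path, runner: Open3.method(:capture3))
  # "-" as the output file makes pdftotext write to stdout.
  stdout, _stderr, status = runner.call("pdftotext", "-layout", pdf_path, "-")
  status.success? ? stdout : nil
rescue Errno::ENOENT
  nil # pdftotext is not installed
end

# Minimal stand-in for Process::Status, exposing only success?.
FakeStatus = Struct.new(:ok) do
  def success?
    ok
  end
end

# Fake runners: one simulates a two-page text PDF, one a failed extraction.
FAKE_TEXT_PDF = ->(*_args) { ["Agenda\fMinutes\f", "", FakeStatus.new(true)] }
FAKE_CORRUPT_PDF = ->(*_args) { ["", "simulated failure", FakeStatus.new(false)] }

extract_text("agenda.pdf", runner: FAKE_TEXT_PDF)    # text of both pages
extract_text("broken.pdf", runner: FAKE_CORRUPT_PDF) # nil
```

If the jobs call Open3 directly and cannot take a runner argument, Minitest's stub helper from minitest/mock is an alternative that replaces Open3.capture3 for the duration of a block.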
Dependencies
Documents::DownloadJob already has tests (test/jobs/documents/download_job_test.rb) — use as a pattern