Implement Extractous integration for text and metadata extraction from various file formats #142

Copilot · 2025-08-28T08:28:48Z

This PR adds an MVP integration of Extractous, a high-performance document extraction library written in Rust that uses Apache Tika to extract text and metadata from a wide range of file formats.

What's Changed

Backend Implementation

Added Extractous dependency (extractous>=0.2.2) to support document processing
Created new API endpoint /api/v1/extract for file content extraction
Added response schema FileExtractionResponse with structured extraction results
Implemented extraction logic using Extractous Python bindings with proper error handling
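As a rough illustration of the extraction logic with error handling, here is a minimal sketch rather than the code in this PR. `ExtractionError`, `extract_upload`, and the injectable `extract_fn` parameter are hypothetical names. The default extractor assumes that `Extractor().extract_file_to_string(path)` returns a `(text, metadata)` tuple, as documented for recent Extractous releases.

```python
import os
import tempfile
from typing import Callable, Dict, List, Tuple

# Signature of anything that turns a file path into (text, metadata).
ExtractFn = Callable[[str], Tuple[str, Dict[str, List[str]]]]


class ExtractionError(Exception):
    """Hypothetical error type for documents that cannot be processed."""


def extractous_extract(path: str) -> Tuple[str, Dict[str, List[str]]]:
    # Imported lazily so this module loads even where Extractous is not installed.
    from extractous import Extractor

    # Assumption: recent releases return a (text, metadata) tuple here.
    text, metadata = Extractor().extract_file_to_string(path)
    return text, dict(metadata)


def extract_upload(
    data: bytes,
    filename: str,
    extract_fn: ExtractFn = extractous_extract,
) -> Tuple[str, Dict[str, List[str]]]:
    """Write uploaded bytes to a temp file, extract, and always clean up."""
    if not data:
        raise ExtractionError("Uploaded file is empty")

    # Keep the original extension on the temp file name.
    suffix = os.path.splitext(filename)[1]
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        path = tmp.name

    try:
        return extract_fn(path)
    except Exception as exc:
        # Wrap any parser failure in one error type the API layer can map to a response.
        raise ExtractionError(f"Could not extract '{filename}': {exc}") from exc
    finally:
        os.unlink(path)
```

Passing the extractor as a parameter keeps the error-handling path testable without the native library; an endpoint would call `extract_upload(data, filename)` and translate `ExtractionError` into a client-facing error.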

The extraction endpoint accepts file uploads (up to 50MB) and returns:

{
  "extracted_text": "Full document text content...",
  "metadata": {
    "Content-Type": ["application/pdf"],
    "X-TIKA:Parsed-By": ["org.apache.tika.parser.DefaultParser"],
    "Content-Length": ["1234"]
  },
  "original_filename": "document.pdf",
  "content_type": "application/pdf",
  "file_size": 1234
}
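The response shape above can be sketched in Python as follows. This is an illustration under assumptions, not the schema from the PR: the real `FileExtractionResponse` is likely a Pydantic model, whereas this sketch uses a standard-library dataclass. The helpers `check_upload_size`, `first_value`, and `build_response` are hypothetical, and the 50MB limit is assumed to mean 50 * 1024 * 1024 bytes.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional

# Assumption: "50MB" is interpreted as binary megabytes.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


@dataclass
class FileExtractionResponse:
    extracted_text: str
    metadata: Dict[str, List[str]]  # Tika-style metadata: every value is a list
    original_filename: str
    content_type: str
    file_size: int


def check_upload_size(file_size: int, limit: int = MAX_UPLOAD_BYTES) -> None:
    """Reject uploads larger than the configured limit."""
    if file_size > limit:
        raise ValueError(f"File is {file_size} bytes; the limit is {limit} bytes")


def first_value(metadata: Dict[str, List[str]], key: str) -> Optional[str]:
    """Return the first value for a metadata key, or None if absent or empty."""
    values = metadata.get(key) or []
    return values[0] if values else None


def build_response(
    text: str,
    metadata: Dict[str, List[str]],
    filename: str,
    file_size: int,
) -> FileExtractionResponse:
    check_upload_size(file_size)
    # Fall back to a generic type when the parser reports no Content-Type.
    content_type = first_value(metadata, "Content-Type") or "application/octet-stream"
    return FileExtractionResponse(
        extracted_text=text,
        metadata=metadata,
        original_filename=filename,
        content_type=content_type,
        file_size=file_size,
    )
```

`asdict(build_response(...))` yields a plain dictionary with the same five keys as the JSON example.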

Frontend Implementation

Created FileExtraction component with complete UI for the extraction workflow
Integrated with existing ImportFileUpload component for seamless user experience
Added extraction view model with API integration and state management
Implemented responsive design matching existing Extralit UI patterns

Key Features

✨ Multi-format Support: PDF, DOC, DOCX, TXT, HTML, RTF, and many more via Apache Tika
⚡ High Performance: Rust-based core bypasses the Python GIL for efficient processing
🔍 Rich Metadata: Extracts comprehensive document metadata including content type, encoding, and parser information
🎯 Simple Integration: Clean API design following existing Extralit patterns
🛡️ Error Handling: Graceful error handling with user-friendly feedback

Demo Screenshots

Clean interface for file upload and extraction

Complete extraction results showing text content and technical metadata

Usage Example

The extraction functionality integrates seamlessly into the existing file upload workflow:

Upload files using the existing ImportFileUpload component
Select a file from the uploaded files dropdown
Click "Extract Content" to process the document
View results including extracted text and metadata
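Outside the UI, the same endpoint could in principle be called directly. The sketch below only builds the request with the standard library. The form field name `file`, the base URL, and the helper names are assumptions for illustration; only the `/api/v1/extract` path comes from this PR.

```python
import urllib.request
import uuid
from typing import Tuple


def encode_multipart(
    field_name: str, filename: str, data: bytes, content_type: str
) -> Tuple[bytes, str]:
    """Encode a single file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def build_extract_request(
    base_url: str,
    filename: str,
    data: bytes,
    content_type: str = "application/octet-stream",
) -> urllib.request.Request:
    """Build (but do not send) a POST request to the extraction endpoint."""
    # Assumption: the endpoint reads the upload from a form field named "file".
    body, header = encode_multipart("file", filename, data, content_type)
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/extract",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(request)` and decoding the body as JSON should give the `extracted_text` and `metadata` fields, provided a server is running at the chosen base URL and accepts unauthenticated uploads.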

Technical Details

Extractous Library: Uses the Extractous Python bindings (v0.3.0 during development; the declared dependency is extractous>=0.2.2)
Apache Tika Integration: Full Tika functionality via GraalVM native compilation
Minimal Changes: Surgical integration following existing codebase patterns
Type Safety: Full TypeScript support with proper type definitions
Authentication: Uses optional authentication for easy testing (can be made required)

Testing

✅ Extraction endpoint responds correctly with structured data
✅ Frontend UI integration works seamlessly
✅ Multiple file format support verified
✅ Error handling for invalid files
✅ End-to-end workflow functionality

This implementation provides a solid foundation for enhanced document processing capabilities in Extralit, enabling users to preview and validate document content before processing.


Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>

Initial plan

c90b184

Copilot AI assigned Copilot and dawn-tran Aug 28, 2025

Copilot started work on behalf of dawn-tran August 28, 2025 08:28

Copilot AI and others added 4 commits August 28, 2025 08:32

Initial assessment and planning for Extractous integration

ecdfd00

Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>

Add Extractous dependency and extraction endpoint to backend

67bf898

Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>

Add FileExtraction component to frontend with Extractous integration

5bbc9ef

Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>

Complete Extractous integration MVP with working demo

47c8f2b

Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>

Copilot AI requested a review from dawn-tran August 28, 2025 08:47

Copilot finished work on behalf of dawn-tran August 28, 2025 08:47
