Copilot AI commented Aug 28, 2025

This PR implements a minimal (MVP) integration of Extractous, a high-performance document extraction library built in Rust that uses Apache Tika to extract text and metadata from various file formats.

What's Changed

Backend Implementation

  • Added Extractous dependency (extractous>=0.2.2) to support document processing
  • Created new API endpoint /api/v1/extract for file content extraction
  • Added response schema FileExtractionResponse with structured extraction results
  • Implemented extraction logic using Extractous Python bindings with proper error handling
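
As a rough illustration of the extraction logic and its error handling, the sketch below wraps the Extractous call so that parser failures surface as a single exception type. This is not the code from this PR: `ExtractionError`, `extractous_extract`, and `extract_document` are hypothetical names, and it assumes `Extractor.extract_file_to_string` returns a `(text, metadata)` pair, as documented for recent Extractous releases.

```python
from typing import Callable

Metadata = dict[str, list[str]]
ExtractFn = Callable[[str], tuple[str, Metadata]]


class ExtractionError(Exception):
    """Raised when a document cannot be parsed (hypothetical name)."""


def extractous_extract(path: str) -> tuple[str, Metadata]:
    # Imported lazily so this module still loads where extractous is absent.
    from extractous import Extractor

    # Assumption: returns (text, metadata) as in recent Extractous releases.
    text, metadata = Extractor().extract_file_to_string(path)
    return text, dict(metadata)


def extract_document(
    path: str, extract_fn: ExtractFn = extractous_extract
) -> tuple[str, Metadata]:
    """Run the extractor and convert any failure into ExtractionError."""
    try:
        return extract_fn(path)
    except Exception as exc:
        raise ExtractionError(f"Could not extract {path!r}: {exc}") from exc
```

Passing the extractor in as `extract_fn` keeps the wrapper testable without the native library installed; an endpoint could catch `ExtractionError` and map it to a user-facing error response.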

The extraction endpoint accepts file uploads (up to 50 MB) and returns a JSON response of the following shape:

{
  "extracted_text": "Full document text content...",
  "metadata": {
    "Content-Type": ["application/pdf"],
    "X-TIKA:Parsed-By": ["org.apache.tika.parser.DefaultParser"],
    "Content-Length": ["1234"]
  },
  "original_filename": "document.pdf",
  "content_type": "application/pdf",
  "file_size": 1234
}
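
The response shape above could be modelled roughly as follows. This is a stdlib stand-in, assuming a plain dataclass in place of whatever schema class the backend actually uses; `build_response` and `MAX_UPLOAD_BYTES` are hypothetical names, and the limit assumes "50MB" means 50 MiB.

```python
from dataclasses import asdict, dataclass

# Assumption: the 50 MB upload limit is interpreted as 50 MiB.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


@dataclass
class FileExtractionResponse:
    extracted_text: str
    metadata: dict[str, list[str]]  # Tika metadata values are lists of strings
    original_filename: str
    content_type: str
    file_size: int


def build_response(
    filename: str,
    content_type: str,
    data: bytes,
    text: str,
    metadata: dict[str, list[str]],
) -> dict:
    """Reject oversized uploads, then assemble the JSON-serializable payload."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError(f"{filename!r} exceeds the {MAX_UPLOAD_BYTES}-byte limit")
    return asdict(
        FileExtractionResponse(
            extracted_text=text,
            metadata=metadata,
            original_filename=filename,
            content_type=content_type,
            file_size=len(data),
        )
    )
```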

Frontend Implementation

  • Created FileExtraction component with complete UI for the extraction workflow
  • Integrated with existing ImportFileUpload component for seamless user experience
  • Added extraction view model with API integration and state management
  • Implemented responsive design matching existing Extralit UI patterns

Key Features

  • Multi-format Support: PDF, DOC, DOCX, TXT, HTML, RTF, and many more via Apache Tika
  • High Performance: Rust-based core bypasses Python GIL for efficient processing
  • Rich Metadata: Extracts comprehensive document metadata including content type, encoding, and parser information
  • Simple Integration: Clean API design following existing Extralit patterns
  • Error Handling: Graceful error handling with user-friendly feedback

Demo Screenshots

[Screenshot: Extractous demo interface] Clean interface for file upload and extraction

[Screenshot: Extraction results] Complete extraction results showing text content and technical metadata

Usage Example

The extraction functionality integrates seamlessly into the existing file upload workflow:

  1. Upload files using the existing ImportFileUpload component
  2. Select a file from the uploaded files dropdown
  3. Click "Extract Content" to process the document
  4. View results including extracted text and metadata
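
Outside the UI, the same workflow could be driven by posting a file directly to /api/v1/extract. The sketch below only builds the multipart request using the standard library; the form field name `file` and the helper name `build_extract_request` are assumptions, not something confirmed by this PR.

```python
import urllib.request
import uuid


def build_extract_request(
    base_url: str,
    filename: str,
    data: bytes,
    content_type: str = "application/octet-stream",
    field_name: str = "file",  # assumption: the endpoint reads a "file" form field
) -> urllib.request.Request:
    """Build a multipart/form-data POST for the extraction endpoint."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/extract",
        data=head + data + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(req)` against a running server and decoding the body with `json.load` should yield the response fields listed earlier, assuming the server accepts unauthenticated calls as described under Technical Details.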

Technical Details

  • Extractous Library: Uses the Extractous Python bindings (v0.3.0, the latest release at the time of this PR; the declared requirement is extractous>=0.2.2)
  • Apache Tika Integration: Full Tika functionality via GraalVM native compilation
  • Minimal Changes: Surgical integration following existing codebase patterns
  • Type Safety: Full TypeScript support with proper type definitions
  • Authentication: Uses optional authentication for easy testing (can be made required)

Testing

  • ✅ Extraction endpoint responds correctly with structured data
  • ✅ Frontend UI integration works seamlessly
  • ✅ Multiple file format support verified
  • ✅ Error handling for invalid files
  • ✅ End-to-end workflow functionality

This implementation provides a solid foundation for enhanced document processing capabilities in Extralit, enabling users to preview and validate document content before processing.


Copilot AI and others added 4 commits August 28, 2025 08:32
Co-authored-by: dawn-tran <104935595+dawn-tran@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement Extractous (use context7 MCP) to support text and metadata extraction from various file formats, leveraging Apache Tika. You should read in extralit-server/CLAUDE.md for structure on how add the dependency and build endpoints in the backend..." to "Implement Extractous integration for text and metadata extraction from various file formats" Aug 28, 2025
Copilot AI requested a review from dawn-tran August 28, 2025 08:47