Hi @noah-schroeder
I've been exploring the AIDE project and am very impressed with its capabilities for human-in-the-loop data extraction. I believe we can further enhance its power and versatility by integrating Python for certain AI/ML tasks, particularly for PDF processing.
Motivation:
The Python ecosystem offers a rich set of libraries for advanced PDF analysis (e.g., pymupdf, pdfplumber) and a vast landscape of AI/ML tools. By tapping into these, we could:
- Improve the accuracy and granularity of data extracted from PDFs (e.g., better handling of complex layouts, tables, and potentially images/scanned documents via OCR).
- Lay the groundwork for incorporating a wider array of LLMs or other AI/ML techniques that have strong Python support.
- Potentially handle a broader range of PDF document types and quality levels (e.g., scanned or poorly formatted files).
Proposed Initial Integration:
My initial thought is to focus on enhancing the PDF text extraction that currently uses pdftools::pdf_text for non-Gemini LLM pathways (Ollama, Mistral, OpenRouter).
The idea is to:
- Introduce a Python script (e.g., python_pdf_processor.py) that uses a library like pymupdf to extract text from PDFs.
- Use the R reticulate package to call this Python function from within the existing Shiny app (inst/app/app.R).
- Design the Python function to return data in a structure compatible with what the R application currently expects from pdf_text (i.e., a list/vector of strings, where each string is the text from a page).
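To make the idea concrete, here is a minimal sketch of what python_pdf_processor.py might look like. The function name extract_pdf_text is hypothetical, and the sketch assumes PyMuPDF is installed in the Python environment that reticulate points to. It is only meant as a starting point for discussion, not a finished implementation.

```python
from __future__ import annotations

from pathlib import Path


def extract_pdf_text(pdf_path: str) -> list[str]:
    """Return the text of each page as one string, in page order.

    The shape (one string per page) is meant to mirror what
    pdftools::pdf_text returns on the R side. The function name is
    a placeholder chosen for this sketch.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path}")

    # Import lazily so the module can be loaded even when PyMuPDF
    # is missing; the error then surfaces only when extraction runs.
    try:
        import pymupdf  # module name in newer PyMuPDF releases
    except ImportError:
        import fitz as pymupdf  # older releases use the name "fitz"

    # Document objects are context managers and iterate over pages.
    with pymupdf.open(str(path)) as doc:
        return [page.get_text("text") for page in doc]
```

On the R side, something like reticulate::source_python("python_pdf_processor.py") followed by extract_pdf_text(path) should hand back a character vector with one element per page, since reticulate converts a Python list of strings that way. Whitespace and line-break handling will likely differ somewhat from pdf_text, so the downstream prompts would need a quick check.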
Benefits of this approach:
- Non-Disruptive: This approach aims to replace a specific component (the pdf_text call) while maintaining the existing R workflow and data structures as much as possible. The goal is for the rest of the application logic for Ollama, Mistral, and OpenRouter to continue functioning without major changes.
- Improved Extraction Quality: Python libraries can offer more robust PDF parsing, potentially leading to better quality text input for the LLMs.
- Extensibility: Once reticulate is in place, it becomes easier to integrate other Python-based enhancements in the future (e.g., more sophisticated data cleaning, specific LLM SDKs, OCR capabilities).
Collaboration and Discussion:
This is an initial proposal, and I'm very open to discussing the best way to approach this. I'm keen to contribute this enhancement in a way that aligns with the project's goals and maintainability.
Would you be open to exploring this Python integration? I'm happy to share more detailed thoughts on the implementation or prepare a draft pull request for a small, focused part of this change if that would be helpful.
Thank you for this great tool!
Best regards,
SHA888