Feature: Use document extractors for grep and centralize file text extraction #114

Open
pfurovYnP wants to merge 3 commits into open-webui:main from pfurovYnP:main

pfurovYnP commented Apr 21, 2026

(Codex)

#113

Motivation

  • Ensure the same document text extraction logic used by read_file is also applied when searching files with grep_search so searchable content from PDFs/Office files is included.
  • Reduce duplication by centralizing the extraction logic into helper functions.
  • Improve error handling for binary/unsupported files in the grep flow to avoid reading raw bytes as text.

Description

  • Introduced _extract_text_with_supported_document_extractors(file_path, mime) to encapsulate lookup and invocation of EXTRACTORS from open_terminal.utils.documents and return extracted text or None.
  • Added _read_file_as_text_representation_for_grep(file_path) which mirrors read_file behavior: attempt UTF-8 read first, fall back to the document extractors, and raise UnicodeDecodeError for unsupported binary files.
  • Refactored read_file to call the new extractor helper and simplified the previous inline extractor loop.
  • Updated grep_search to use _read_file_as_text_representation_for_grep so document-extracted text is searched line-by-line, preserving the existing match and truncation behavior.

Testing

  • Ran the automated test suite with pytest, and all tests completed successfully.
  • Exercised read_file and grep_search behavior manually against text files and PDF/Office files to verify extracted text is returned and searchable as expected.
  • Verified that binary files with allowed MIME prefixes are still returned raw, and that unsupported binaries are rejected during search.

pfurovYnP changed the title from "Use document extractors for grep and centralize file text extraction" to "Feature: Use document extractors for grep and centralize file text extraction" Apr 21, 2026