Feature: Use document extractors for grep and centralize file text extraction #114
Open
pfurovYnP wants to merge 3 commits into open-webui:main
(Codex)
#113
Motivation
- The document extraction used by `read_file` is also applied when searching files with `grep_search`, so searchable content from PDFs/Office files is included.
Description
- Added `_extract_text_with_supported_document_extractors(file_path, mime)` to encapsulate lookup and invocation of `EXTRACTORS` from `open_terminal.utils.documents` and return extracted text or `None`.
- Added `_read_file_as_text_representation_for_grep(file_path)`, which mirrors `read_file` behavior: attempt a UTF-8 read first, fall back to the document extractors, and raise `UnicodeDecodeError` for unsupported binary files.
- Updated `read_file` to call the new extractor helper and simplified the previous inline extractor loop.
- Updated `grep_search` to use `_read_file_as_text_representation_for_grep` so document-extracted text is searched line by line, preserving existing match and truncation behavior.
Testing
- Ran `pytest`, and all tests completed successfully.
- Checked `read_file` and `grep_search` behavior manually against text files and PDF/Office files to verify extracted text is returned and searchable as expected.
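The read-then-fallback flow listed under Description can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's actual code: `EXTRACTORS` is stubbed here as a MIME-type-to-callable mapping (the real registry's shape in `open_terminal.utils.documents` is not shown in this PR text), the function names are shortened, hypothetical stand-ins for the helpers named above, and MIME detection via `mimetypes` is an assumption.

```python
import mimetypes
import re
from pathlib import Path
from typing import Callable, Optional

# Hypothetical stand-in for EXTRACTORS from open_terminal.utils.documents.
# Assumption: a mapping from MIME type to a callable that returns text.
EXTRACTORS: dict[str, Callable[[Path], str]] = {}


def extract_text_with_extractors(file_path: Path, mime: Optional[str]) -> Optional[str]:
    """Return extracted text if an extractor is registered for `mime`, else None."""
    extractor = EXTRACTORS.get(mime) if mime else None
    return extractor(file_path) if extractor else None


def read_file_as_text_for_grep(file_path: Path) -> str:
    """Try a UTF-8 read first, then fall back to the document extractors."""
    try:
        return file_path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        # Assumption: the MIME type is guessed from the file name.
        mime, _ = mimetypes.guess_type(file_path.name)
        text = extract_text_with_extractors(file_path, mime)
        if text is None:
            # Unsupported binary file: re-raise the original decode error.
            raise
        return text


def grep_lines(file_path: Path, pattern: str) -> list[tuple[int, str]]:
    """Search the text representation line by line; return (line number, line) pairs."""
    regex = re.compile(pattern)
    text = read_file_as_text_for_grep(file_path)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]
```

With an extractor registered for a MIME type, `grep_lines` searches the extracted text of a matching binary file the same way it searches a plain text file. A file that is neither valid UTF-8 nor covered by an extractor still surfaces as `UnicodeDecodeError`, so the caller decides whether to skip it. Match limits and truncation are left out of this sketch.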