feat: Add New core integration tests #1194
Draft
edwinjosechittilappilly wants to merge 6 commits into main
Add integration test suite and supporting sample files/utilities.
- tests/data/create_samples.py: script to generate minimal PDF/DOCX/XLSX/PPTX samples (Python stdlib) and write to tests/data/samples/.
- tests/data/samples/*: pre-generated binary sample files used by tests.
- tests/integration/core/helpers.py: shared test helpers (boot_app, HTTPX ASGI client, wait_for_task_completion, wait_for_indexed, is_docling_available).
- tests/integration/core/test_document_lifecycle.py: tests for document endpoints (check-filename, delete-by-filename, upload_path) and full upload/delete lifecycles.
- tests/integration/core/test_file_format_ingestion.py: parametrized ingestion tests across text and binary formats; skips docling-dependent cases when docling-serve is unavailable.
- tests/integration/core/test_settings_and_tasks.py: tests for settings endpoints and task lifecycle (list, status, cancel, upload_path-created tasks).
Tests run against an in-process FastAPI app with live OpenSearch; create_samples.py can be re-run to regenerate sample files.
Add several committed text sample files (md, adoc, html, xhtml, tex, csv) under tests/data/samples and update binary samples. Refactor tests/data/create_samples.py to separate binary_formats and text_formats: generate/write binary content as bytes and write textual samples using write_text for consistency. Update tests/integration/core/test_file_format_ingestion.py to reference committed sample files (SAMPLES_DIR/...) for text/docling-served formats instead of embedding inline content, and keep binary formats as pre-generated samples. This centralizes sample data and makes ingestion tests consistently use filesystem fixtures.
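The binary/text split described above can be sketched roughly as follows. This is a minimal illustration, not the actual create_samples.py: the function name `write_samples`, its parameters, and the `sample.<ext>` naming are assumptions made for the example.

```python
from pathlib import Path


def write_samples(
    out_dir: Path,
    binary_formats: dict[str, bytes],
    text_formats: dict[str, str],
) -> list[Path]:
    """Write one sample file per extension (hypothetical helper).

    Binary payloads are written as raw bytes; textual payloads go through
    write_text with an explicit encoding so output is consistent.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []

    for ext, payload in binary_formats.items():
        path = out_dir / f"sample.{ext}"
        path.write_bytes(payload)
        written.append(path)

    for ext, text in text_formats.items():
        path = out_dir / f"sample.{ext}"
        path.write_text(text, encoding="utf-8")
        written.append(path)

    return written
```

Keeping the two mappings separate means the type of each payload (bytes or str) documents how it must be written, so a text sample cannot accidentally be written in binary mode or the reverse.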
Replace calls to clients.close() with clients.cleanup() in test teardown and helpers to use the updated clients API and ensure proper cleanup of global clients (avoids aiohttp warnings). Updates docstring in helpers to instruct callers to call clients.cleanup(). Affected files: tests/conftest.py, tests/integration/core/helpers.py, and multiple test modules under tests/integration/core/*.
Create config dir in CI startup and update integration tests to use the production Langflow ingestion path.
- Makefile: ensure config/ exists (mkdir -p + chmod 777) before bringing up infra in test-ci and test-ci-local.
- tests/integration/core/helpers.py: add is_langflow_available() helper to detect Langflow health.
- tests/integration/core/test_file_format_ingestion.py: switch tests to the Langflow ingestion flow (boot app with Langflow enabled), skip entire test when Langflow is not running, assert uploads return 202 with a task_id, poll tasks until completion (longer timeout), adjust search/skip logic for docling-dependent formats, and update docstrings and messages to reflect the Langflow path.
These changes make the integration tests exercise the real production upload pipeline and avoid false failures when Langflow or docling-serve are unavailable.
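An availability check of the kind this commit describes could look something like the sketch below. It is an assumption-laden illustration: the real is_langflow_available() helper may use a different HTTP client and endpoint, and the function name `is_service_available` and any health URL passed to it are hypothetical. Only the Python standard library is used.

```python
import urllib.error
import urllib.request


def is_service_available(health_url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to health_url answers with a 2xx status.

    Hypothetical stand-in for a helper such as is_langflow_available();
    the URL to probe is supplied by the caller and is not taken from the PR.
    """
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL:
        # treat all of these as "service not available".
        return False
```

A test can call such a probe up front and skip itself when it returns False, which is how an unavailable Langflow or docling-serve instance becomes a skip rather than a failure.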
Pre-build a fallback search body that omits `num_candidates` for OpenSearch versions that don't support that field and use it when a RequestError occurs instead of relying on fragile error-string matching. Update logging messages to be clearer and less dependent on specific error text, and ensure disk-space errors still raise OpenSearchDiskSpaceError on both initial and retry attempts. Remove verbose inclusion of the search body in error logs and tidy related log messages.
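The retry strategy described in this commit might be sketched as below. Everything here is illustrative rather than the project's code: `RequestError` and `OpenSearchDiskSpaceError` are local stand-in classes, `search` is any callable that takes a body, `is_disk_space_error` is a hypothetical predicate supplied by the caller, and the position of `num_candidates` inside the body is not assumed (the key is stripped wherever it occurs).

```python
from typing import Any, Callable


class RequestError(Exception):
    """Local stand-in for the client's bad-request error (assumption)."""


class OpenSearchDiskSpaceError(Exception):
    """Local stand-in for the project's disk-space error (assumption)."""


def without_key(value: Any, key: str) -> Any:
    """Return a deep copy of value with every occurrence of key removed."""
    if isinstance(value, dict):
        return {k: without_key(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [without_key(v, key) for v in value]
    return value


def search_with_fallback(
    search: Callable[[dict], Any],
    body: dict,
    is_disk_space_error: Callable[[Exception], bool] = lambda exc: False,
) -> Any:
    # Build the fallback before the first attempt, so the retry does not
    # depend on parsing the error message.
    fallback_body = without_key(body, "num_candidates")

    try:
        return search(body)
    except RequestError as exc:
        if is_disk_space_error(exc):
            raise OpenSearchDiskSpaceError(str(exc)) from exc

    # Any other RequestError: retry once without num_candidates.
    try:
        return search(fallback_body)
    except RequestError as exc:
        if is_disk_space_error(exc):
            raise OpenSearchDiskSpaceError(str(exc)) from exc
        raise
```

The retry is triggered by the exception type alone, and the disk-space check runs on both attempts, matching the behavior the commit message describes.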
Replace search-based verification with a direct OpenSearch index check (/documents/check-filename) in the file ingestion integration test. This avoids relying on search/KNN/embedding behavior, removes the special-case fallback for binary formats, tightens assertions and error messages, and updates docstrings and logging to reflect that the test asserts indexing rather than searchability.
This pull request adds comprehensive integration test sample data and utilities to support format ingestion testing. It introduces a script to generate minimal sample files for a variety of document formats (both binary and text), commits these sample files to the repository, and provides new shared helpers for integration tests. These changes enable robust, automated testing of document ingestion pipelines across multiple formats.
Test Sample Data Generation and Files:
- create_samples.py: script in tests/data/ that generates minimal, valid sample files for binary formats (PDF, DOCX, XLSX, PPTX) and writes standard sample files for text formats (Markdown, AsciiDoc, LaTeX, HTML, XHTML, CSV). This script uses only the Python standard library and ensures consistent, up-to-date test data.
- Sample files in tests/data/samples/: sample.adoc, sample.md, sample.tex, sample.html, sample.xhtml, sample.csv.

Integration Test Utilities:
- helpers.py in tests/integration/core/ with shared async helpers for integration tests, including:
  - boot_app: Boots a fresh in-process FastAPI app with configurable settings and index cleanup.
  - wait_for_task_completion: Polls for async task completion.
  - wait_for_indexed: Waits until a search query returns results.
  - is_docling_available: Checks if the docling-serve service is reachable (required for binary format tests).

These utilities improve test reliability and reduce boilerplate in integration tests.
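A polling helper in the spirit of wait_for_task_completion could be sketched as follows. This is not the helper from the PR: the name `poll_until_terminal`, the injected `get_status` coroutine, and the status strings are all assumptions chosen for illustration.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def poll_until_terminal(
    get_status: Callable[[], Awaitable[str]],
    timeout: float = 30.0,
    interval: float = 0.5,
    terminal: tuple[str, ...] = ("completed", "failed"),  # assumed names
) -> str:
    """Call get_status repeatedly until it returns a terminal status.

    Raises TimeoutError if no terminal status is seen within `timeout`
    seconds. The status is always checked at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = await get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still '{status}' after {timeout}s")
        await asyncio.sleep(interval)
```

Injecting `get_status` keeps the loop independent of the HTTP client, so the same function can poll a task endpoint in an integration test or a fake in a unit test.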