feat: Add New core integration tests #1194

Draft
edwinjosechittilappilly wants to merge 6 commits into main from new-core-integration-tests

@edwinjosechittilappilly
Collaborator

This pull request adds comprehensive integration test sample data and utilities to support format ingestion testing. It introduces a script to generate minimal sample files for a variety of document formats (both binary and text), commits these sample files to the repository, and provides new shared helpers for integration tests. These changes enable robust, automated testing of document ingestion pipelines across multiple formats.

Test Sample Data Generation and Files:

  • Added create_samples.py script to tests/data/ that generates minimal, valid sample files for binary formats (PDF, DOCX, XLSX, PPTX) and writes standard sample files for text formats (Markdown, AsciiDoc, LaTeX, HTML, XHTML, CSV). This script uses only the Python standard library and ensures consistent, up-to-date test data.
  • Committed generated sample files for the following formats in tests/data/samples/:
    • AsciiDoc (sample.adoc)
    • Markdown (sample.md)
    • LaTeX (sample.tex)
    • HTML (sample.html)
    • XHTML (sample.xhtml)
    • CSV (sample.csv)
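
As a rough illustration of the stdlib-only approach described above, the sketch below writes two text samples with `write_text` and builds a one-paragraph DOCX with `zipfile`. It is not the PR's actual `create_samples.py`: the names `TEXT_FORMATS`, `build_minimal_docx`, and `write_samples`, the sample contents, and the choice of OOXML parts are all assumptions made for the example.

```python
import zipfile
from pathlib import Path

# Hypothetical minimal OOXML parts for a one-paragraph DOCX (assumed, not from the PR).
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    "</Types>"
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    "</Relationships>"
)
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello from DOCX.</w:t></w:r></w:p></w:body>"
    "</w:document>"
)

# Text formats are plain strings keyed by file name (hypothetical contents).
TEXT_FORMATS = {
    "sample.md": "# Sample\n\nHello from Markdown.\n",
    "sample.csv": "id,name\n1,alpha\n2,beta\n",
}


def build_minimal_docx(path: Path) -> None:
    """Write a small DOCX: a ZIP archive containing three XML parts."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", DOCUMENT)


def write_samples(out_dir: Path) -> list[Path]:
    """Write every text sample, then the binary DOCX, and return the paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, text in TEXT_FORMATS.items():
        target = out_dir / name
        target.write_text(text, encoding="utf-8")
        written.append(target)
    docx_path = out_dir / "sample.docx"
    build_minimal_docx(docx_path)
    written.append(docx_path)
    return written
```

Keeping the text samples in a dictionary and the binary builders as functions means the script can be re-run at any time to regenerate the same files.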

Integration Test Utilities:

  • Added helpers.py in tests/integration/core/ with shared async helpers for integration tests, including:
    • boot_app: Boots a fresh in-process FastAPI app with configurable settings and index cleanup.
    • wait_for_task_completion: Polls for async task completion.
    • wait_for_indexed: Waits until a search query returns results.
    • is_docling_available: Checks if the docling-serve service is reachable (required for binary format tests).
      These utilities improve test reliability and reduce boilerplate in integration tests.
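
A polling helper such as `wait_for_task_completion` could look roughly like the sketch below. This is a minimal sketch under stated assumptions, not the code in `helpers.py`: the function name `wait_for_terminal_status`, the `fetch_status` callable, the `"status"` key, and the terminal state names are all hypothetical.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


async def wait_for_terminal_status(
    fetch_status: Callable[[], Awaitable[dict[str, Any]]],
    *,
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    terminal_states: tuple[str, ...] = ("completed", "failed"),  # assumed names
) -> dict[str, Any]:
    """Poll fetch_status() until it reports a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = await fetch_status()
        if status.get("status") in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task did not finish within {timeout_s} seconds")
        # Yield to the event loop between polls instead of busy-waiting.
        await asyncio.sleep(interval_s)
```

Passing the status fetcher in as a callable keeps the helper independent of any particular HTTP client, so a test can wrap whatever request its client makes against the task endpoint.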

Add integration test suite and supporting sample files/utilities.

- tests/data/create_samples.py: script to generate minimal PDF/DOCX/XLSX/PPTX samples (Python stdlib) and write to tests/data/samples/.
- tests/data/samples/*: pre-generated binary sample files used by tests.
- tests/integration/core/helpers.py: shared test helpers (boot_app, HTTPX ASGI client, wait_for_task_completion, wait_for_indexed, is_docling_available).
- tests/integration/core/test_document_lifecycle.py: tests for document endpoints (check-filename, delete-by-filename, upload_path) and full upload/delete lifecycles.
- tests/integration/core/test_file_format_ingestion.py: parametrized ingestion tests across text and binary formats; skips docling-dependent cases when docling-serve is unavailable.
- tests/integration/core/test_settings_and_tasks.py: tests for settings endpoints and task lifecycle (list, status, cancel, upload_path-created tasks).
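
The reachability check behind the docling skip logic might be sketched as follows. The function name `is_service_available`, the idea of probing a health URL, and the default timeout are assumptions for illustration; the actual `is_docling_available` helper may work differently.

```python
import http.client
import urllib.error
import urllib.request


def is_service_available(health_url: str, timeout_s: float = 2.0) -> bool:
    """Return True only if a GET to health_url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeouts, HTTP errors, and malformed URLs
        # all count as "not available".
        return False
```

A boolean like this can be evaluated once and passed to `pytest.mark.skipif`, so docling-dependent parameters are skipped rather than failed when the service is down.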

Tests run against an in-process FastAPI app with live OpenSearch; create_samples.py can be re-run to regenerate sample files.

Add several committed text sample files (md, adoc, html, xhtml, tex, csv) under tests/data/samples and update binary samples. Refactor tests/data/create_samples.py to separate binary_formats and text_formats: generate/write binary content as bytes and write textual samples using write_text for consistency. Update tests/integration/core/test_file_format_ingestion.py to reference committed sample files (SAMPLES_DIR/...) for text/docling-served formats instead of embedding inline content, and keep binary formats as pre-generated samples. This centralizes sample data and makes ingestion tests consistently use filesystem fixtures.
@github-actions github-actions bot added tests enhancement 🔵 New feature or request labels Mar 19, 2026
Replace calls to clients.close() with clients.cleanup() in test teardown and helpers to use the updated clients API and ensure proper cleanup of global clients (avoids aiohttp warnings). Update the docstring in helpers to instruct callers to call clients.cleanup(). Affected files: tests/conftest.py, tests/integration/core/helpers.py, and multiple test modules under tests/integration/core/*.
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Create config dir in CI startup and update integration tests to use the production Langflow ingestion path.

- Makefile: ensure config/ exists (mkdir -p + chmod 777) before bringing up infra in test-ci and test-ci-local.
- tests/integration/core/helpers.py: add is_langflow_available() helper to detect Langflow health.
- tests/integration/core/test_file_format_ingestion.py: switch tests to the Langflow ingestion flow (boot app with Langflow enabled), skip entire test when Langflow is not running, assert uploads return 202 with a task_id, poll tasks until completion (longer timeout), adjust search/skip logic for docling-dependent formats, and update docstrings and messages to reflect the Langflow path.

These changes make the integration tests exercise the real production upload pipeline and avoid false failures when Langflow or docling-serve are unavailable.
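
The upload assertion described above (a 202 response carrying a task_id) can be captured in a small helper. This is only a sketch: the name `extract_task_id` is hypothetical, and treating `task_id` as a non-empty string is an assumption about the response payload.

```python
from typing import Any, Mapping


def extract_task_id(status_code: int, payload: Mapping[str, Any]) -> str:
    """Check that an upload was accepted asynchronously and return its task id."""
    if status_code != 202:
        raise AssertionError(f"expected 202 Accepted, got {status_code}")
    task_id = payload.get("task_id")
    # Assumption: the task id is serialized as a non-empty string.
    if not isinstance(task_id, str) or not task_id:
        raise AssertionError("response is missing a non-empty task_id")
    return task_id
```

The returned id is what a test would then poll until the task reaches a terminal state.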
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Pre-build a fallback search body that omits `num_candidates` for OpenSearch versions that don't support that field and use it when a RequestError occurs instead of relying on fragile error-string matching. Update logging messages to be clearer and less dependent on specific error text, and ensure disk-space errors still raise OpenSearchDiskSpaceError on both initial and retry attempts. Remove verbose inclusion of the search body in error logs and tidy related log messages.
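
The pre-built fallback pattern can be sketched as below. Everything about the query shape is an assumption (the `embedding` field name, where `num_candidates` sits, and the `search` callable), and `RequestError` here is a local stand-in for the search client's own exception. Disk-space handling is left out of the sketch.

```python
import copy
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


class RequestError(Exception):
    """Stand-in for the search client's request-error exception."""


def build_search_bodies(
    vector: list[float], k: int, num_candidates: int
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Build the primary body and a fallback body without num_candidates."""
    primary = {
        "size": k,
        "query": {
            "knn": {
                # "embedding" is a hypothetical vector field name.
                "embedding": {"vector": vector, "k": k, "num_candidates": num_candidates}
            }
        },
    }
    fallback = copy.deepcopy(primary)
    fallback["query"]["knn"]["embedding"].pop("num_candidates")
    return primary, fallback


def search_with_fallback(
    search: Callable[[dict[str, Any]], dict[str, Any]],
    primary: dict[str, Any],
    fallback: dict[str, Any],
) -> dict[str, Any]:
    """Try the primary body; on a request error, retry once with the fallback."""
    try:
        return search(primary)
    except RequestError:
        # Retry on the exception type alone, without inspecting the error text.
        logger.warning("Search request rejected; retrying without num_candidates")
        return search(fallback)
```

Building both bodies up front means the retry path never mutates the original request and never depends on the wording of the server's error message.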
@github-actions github-actions bot added backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Replace search-based verification with a direct OpenSearch index check (/documents/check-filename) in the file ingestion integration test. This avoids relying on search/KNN/embedding behavior, removes the special-case fallback for binary formats, tightens assertions and error messages, and updates docstrings and logging to reflect that the test asserts indexing rather than searchability.
@github-actions github-actions bot added enhancement 🔵 New feature or request and removed enhancement 🔵 New feature or request labels Mar 19, 2026
Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) enhancement 🔵 New feature or request tests
