test: increase backend coverage and stabilize test suite#84
Open
AnthonyJi123 wants to merge 18 commits into main from
Conversation
Move @api_router.get("/health") before app.include_router(api_router)
to ensure the route is properly registered. Previously returned 404.
Also update test_main.py to test /api/health instead of /health.
Add entries for:
- backend/lighteval/ (cloned dependency)
- backend/tests/evidence/*.json (generated test evidence)
- samples_from_evals/ and traces_rows.csv (test data)
This was a standalone script (not a pytest test) that called OpenAI at import time, causing test collection failures.
Add conftest.py with:
- Sync and async test client fixtures
- Mock CurrentUser and authentication fixtures
- Sample test data for datasets, guidelines, evaluations
- TestEvidence class for capturing test run data
- Helper functions for response assertions

Also add evidence directory for storing test artifacts.
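A hypothetical sketch of two of the conftest.py helpers described above; the names (`mock_current_user`, `assert_ok_json`) are illustrative, not the project's actual identifiers.

```python
from types import SimpleNamespace

def mock_current_user():
    """Stand-in for an authenticated CurrentUser dependency override."""
    return SimpleNamespace(id=1, email="user@example.com", is_active=True)

def assert_ok_json(response, expected_keys):
    """Shared assertion helper: 200 status plus required JSON keys."""
    assert response.status_code == 200
    body = response.json()
    for key in expected_keys:
        assert key in body, f"missing key: {key}"
    return body

# Usage against a stand-in response object:
fake = SimpleNamespace(status_code=200, json=lambda: {"id": 1, "name": "x"})
body = assert_ok_json(fake, ["id", "name"])
```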
Add 28 unit tests covering schema validation:
- Auth schemas (login, token responses)
- Dataset schemas (create, response, list)
- Guideline schemas (create, response, list)
- Evaluation schemas (create, response, status)
- Leaderboard schemas (entry, response)
- Benchmark schemas (response, list, pagination)

Tests cover valid data, invalid data rejection, and edge cases.

Covers: FR-1.0, FR-2.0, FR-6.0
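A minimal sketch of what one such schema test checks, assuming the schemas are Pydantic models; the `DatasetCreate` shape here is illustrative only.

```python
from pydantic import BaseModel, ValidationError

class DatasetCreate(BaseModel):
    name: str            # required
    description: str = ""  # optional with default

# Valid data is accepted and defaults are applied.
ok = DatasetCreate(name="evals-v1")
assert ok.description == ""

# Invalid data (missing required field) is rejected.
try:
    DatasetCreate()
except ValidationError:
    rejected = True
else:
    rejected = False
assert rejected
```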
Add 28 integration tests covering:
- Benchmarks API (list, pagination, sorting, search, detail, tasks)
- Authentication enforcement (protected routes, invalid tokens)
- Datasets, guidelines, evaluations (with auth)
- Leaderboard API (parameter validation, data retrieval)
- Side-by-side comparison endpoints
- Providers and models API

Tests verify HTTP status codes, JSON structure, and auth behavior.

Covers: FR-1.0, FR-4.0, FR-5.0, FR-6.0
Add 6 E2E tests covering complete user workflows:
- Benchmark browsing flow (list → detail → tasks)
- Authenticated user workflow (datasets, guidelines, traces)
- Leaderboard data retrieval
- Error handling verification (404, 422, 401/403)
- Performance baseline tests (health, benchmark list)

Tests include evidence collection with timestamps and response data.

Covers: FR-1.0, FR-2.0, FR-4.0, FR-6.0, NFR-1.0, NFR-3.0
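The evidence-collection pattern can be sketched as below; the real TestEvidence class in conftest.py may record different fields.

```python
import json
from datetime import datetime, timezone

class EvidenceLog:
    """Capture timestamped records of each test step for later reporting."""

    def __init__(self, test_id):
        self.test_id = test_id
        self.records = []

    def capture(self, step, data):
        self.records.append({
            "step": step,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "data": data,
        })

    def to_json(self):
        return json.dumps({"test_id": self.test_id, "records": self.records})

ev = EvidenceLog("E2E-01")
ev.capture("list_benchmarks", {"status": 200, "count": 12})
assert len(ev.records) == 1
```

Serializing to JSON per test run matches the `backend/tests/evidence/*.json` artifacts ignored earlier in this PR.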
Add run_tests.sh with commands for:
- unit: Run unit tests only
- integration: Run integration tests only
- e2e: Run end-to-end tests only
- quick: Run unit + integration (fast)
- all: Run complete test suite with coverage
Comprehensive V&V report addressing M2 feedback:
- Clear Verification vs Validation structure
- Precise success criteria definitions
- Detailed test procedures with expected/actual results
- FR and NFR coverage with specific test references
- Acceptance testing plan with user scenarios
- Evidence collection and storage documentation

Addresses all critical M2 feedback points.
Maps all FRs and NFRs to specific test implementations:
- Test IDs, names, and file locations
- Pass/fail status for each requirement
- Coverage summary by requirement category
- Direct links between requirements and evidence
The pyproject.toml is in the backend/ directory, so uv sync and pytest commands need to run from there.
Run isort to fix import ordering across the codebase to pass CI checks.
Add known_third_party and known_first_party to isort config so lighteval imports are correctly sorted as third-party, not local.
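A hypothetical pyproject.toml fragment matching the fix described above; the exact first-party package name depends on the project layout (the coverage target `api` suggests it here).

```toml
[tool.isort]
known_third_party = ["lighteval"]
known_first_party = ["api"]
```

Without this, isort classifies the vendored `lighteval` clone as a local package and sorts its imports into the first-party group, failing CI.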
Add comprehensive unit tests across all backend modules to bring coverage from 41% (241 tests) to 95% (612 tests). All new tests use mocking to isolate from external dependencies (DB, S3, external APIs); no app code was changed.

New test files (29 files, ~370 new tests):
- Route handler tests via TestClient for all endpoints (auth, users, datasets, guidelines, evaluations, benchmarks, leaderboard, models/providers, evaluation comparison)
- Service layer tests (auth, datasets, guidelines, benchmarks, leaderboard, user, models/providers, evaluation, evaluation comparison)
- Repository tests (benchmark, dataset, guideline, leaderboard, evaluation)
- Evaluation pipeline tests (eval_pipeline, dataset_task, flexible_dataset_task, guideline_judge, metric_doc_generator)
- Eval worker tests covering all three worker functions
- EvaluationService background methods and get_trace_samples (parquet)
- Core utility tests (security, exceptions, S3, migrations)
- Schema validation tests (guideline schemas, evaluation models)

The 6 pre-existing test failures remain unchanged (outdated ModelConfig schema in integration tests, missing litellm dependency).

Made-with: Cursor
- Replace insecure tempfile.mktemp() with NamedTemporaryFile in test_flexible_dataset_task.py
- Add nosec B108 comments to mock patches of tempfile.mkdtemp (these mock the function, not call it)
- Add [tool.bandit] config to pyproject.toml to skip the B101 (assert) rule in test directories; assert is the standard pytest assertion mechanism

Made-with: Cursor
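The tempfile fix can be sketched as follows: `tempfile.mktemp()` only returns a path without creating the file, leaving a race window an attacker can exploit, whereas `NamedTemporaryFile` creates the file atomically with restrictive permissions.

```python
import os
import tempfile

# Insecure (Bandit B306): path is returned but the file is not created,
# so another process could create it first.
#   path = tempfile.mktemp(suffix=".parquet")

# Secure replacement: the file exists as soon as the call returns.
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
    path = tmp.name
    tmp.write(b"placeholder")

assert os.path.exists(path)
os.unlink(path)  # clean up after the test
```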
- Update ModelConfig usage to match current schema (api_source, model_id as int, api_name)
- Skip integration tests by default (opt-in via RUN_INTEGRATION_TESTS=1)
- Handle missing litellm[caching] gracefully in lighteval integration test
- Fix datasets endpoint auth expectation (public route, not protected)

Made-with: Cursor
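The opt-in gate can be implemented with a `pytest.mark.skipif` marker roughly like this (a sketch; the project's actual marker name may differ):

```python
import os

import pytest

# Integration tests only run when the environment variable is set to "1".
RUN_INTEGRATION = os.environ.get("RUN_INTEGRATION_TESTS") == "1"

integration = pytest.mark.skipif(
    not RUN_INTEGRATION,
    reason="integration tests are opt-in; set RUN_INTEGRATION_TESTS=1",
)

@integration
def test_lighteval_end_to_end():
    ...  # body runs only when opted in
```

Running `RUN_INTEGRATION_TESTS=1 pytest` then includes these tests; a plain `pytest` run reports them as skipped, keeping CI fast and deterministic.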
Not up to standards ⛔

🔴 Issues

| Category | Results |
|---|---|
| ErrorProne | 33 high |
| Security | 67 high |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 554 |
| Duplication | 15 |
- Take main for app code, migrations, scripts, and shared tests.
- Remove legacy eval_worker module usage; tests target api.evaluations.tasks.
- Replace evaluation service unit tests with Celery-dispatch focused coverage.
- Update route tests for EvaluationModelConfig union and leaderboard schema.
- Gate manual integration scripts with RUN_INTEGRATION_TESTS.
- Restore Bandit test exclusions in pyproject.toml.

Made-with: Cursor
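A Celery-dispatch focused test typically mocks the task object and asserts that the service hands work to `.delay()` without executing it. This is an illustrative sketch; `start_evaluation` is a hypothetical helper, not the project's actual service method.

```python
from unittest import mock

def start_evaluation(task, evaluation_id):
    """Service-layer helper that enqueues work via Celery's .delay()."""
    task.delay(evaluation_id)
    return {"evaluation_id": evaluation_id, "status": "queued"}

# The task is mocked, so no broker or worker is needed in unit tests.
fake_task = mock.Mock()
result = start_evaluation(fake_task, 42)

fake_task.delay.assert_called_once_with(42)
assert result["status"] == "queued"
```

This style tests the dispatch contract (correct arguments, correct response shape) while staying isolated from Redis and worker processes.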
- Exclude vendored lighteval/experiments from Black (matches isort skip)
- CI: set REDIS_URL, CELERY_BROKER_URL, AWS_REGION, S3_BUCKET_NAME for Settings
- Enable asyncpg SSL only for non-localhost databases
- Add test_all_benchmarks.py to pytest collect_ignore (import-time S3 monkeypatch)
- Update unit tests for string IDs, schemas, leaderboard, repos
- Apply Black formatting on api/tests

Made-with: Cursor
Summary
This PR raises backend test coverage to ~95% and fixes remaining failing tests without changing application business logic.
What changed
- Updated ModelConfig usage to match the current schema (api_source, numeric model_id, api_name).
- Skipped integration tests by default; opt in via RUN_INTEGRATION_TESTS=1 so CI and default pytest runs stay fast and deterministic.
- Fixed the auth expectation for GET /api/datasets (a public list endpoint).

Verification
Run from backend/:

```shell
source .venv/bin/activate
python -m pytest tests/ -q --cov=api --cov-report=term
```

Expected: all tests pass; skipped tests are integration-only unless opted in.
Notes
Remaining uncovered lines (~5%) are largely server entrypoints, live infrastructure, fire-and-forget background tasks, and rare provider/data-shape branches, all documented for verification reporting.
Made with Cursor