
Upload a bulk search #582

@hoyla

Description


Feasibility Assessment: Bulk CSV Search Endpoint

The core building blocks already exist — this is mostly plumbing work.

  1. ES _msearch is already used in ElasticsearchPages.scala and Pages2.scala — these execute multiple search queries in a single ES round-trip via multi().

  2. Query/filter construction is already modular. SearchParameters is a clean case class, and the filter-building logic in SearchContext operates on a single SearchParameters — it can be called in a loop without modification.

  3. Permission checking can be done once per request (the user's visibility doesn't change between queries), then reused across all search terms.

  4. Results are already JSON-serializable via the existing SearchResults model.
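As a sketch of point 2, the shared UI filters can be expanded into per-term parameters by copying one parameter set in a loop, with the permission check resolved once up front. The `SearchParameters` fields below are illustrative stand-ins, not the real case class:

```scala
// Hypothetical, simplified stand-in for the real SearchParameters case class;
// field names here are illustrative only.
final case class SearchParameters(q: String, mimeTypes: List[String], pageSize: Int)

object BulkPlan {
  // Expand one set of shared UI filters into per-term parameters.
  // Visibility/permission checks happen once per request, outside this loop,
  // and the results are reused for every term.
  def perTermParameters(
      terms: List[String],
      shared: SearchParameters
  ): List[SearchParameters] =
    terms.map(t => shared.copy(q = t))
}
```

This keeps the existing single-query filter-building code untouched: it just gets called once per element of the returned list.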

What would need to be built

| Layer | Work |
| --- | --- |
| Route | New `POST /api/search/bulk` (POST because the payload can be large) |
| Controller | Parse the CSV/JSON body into a `List[String]` of search terms, apply the shared UI filters to each, and call `verifyParameters` per query |
| Index service | New `queryBatch(params: List[(SearchParameters, SearchContext)])` method using the existing `multi()` msearch pattern |
| Response | An array of `SearchResults`, each tagged with its originating search term for CSV correlation |
| Frontend | CSV upload UI plus a new `SearchApi` method; the existing filters (collections, workspaces, MIME types, dates) can be reused as-is since they're just query params |
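The controller's CSV-parsing step might look like the sketch below. It assumes one term per row in the first column, and deliberately skips quoted-field handling (a real implementation would use a CSV library); `parseTerms` and its term cap are hypothetical names, not existing code:

```scala
object BulkCsvParser {
  // Parse an uploaded CSV body into distinct, non-empty search terms.
  // Takes the first column of each row; blank rows and duplicates are dropped.
  // Returns Left with a user-facing error for empty or oversized uploads.
  def parseTerms(body: String, maxTerms: Int = 1000): Either[String, List[String]] = {
    val terms = body.linesIterator
      .map(_.split(',').headOption.getOrElse("").trim)
      .filter(_.nonEmpty)
      .toList
      .distinct
    if (terms.isEmpty) Left("No search terms found in upload")
    else if (terms.length > maxTerms) Left(s"Too many terms (max $maxTerms)")
    else Right(terms)
  }
}
```

Validating and deduplicating up front keeps the msearch batch as small as possible before any ES work happens.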

Key design decisions

  • Aggregations — probably omit per-query aggs in bulk mode (expensive and not useful per-term).
  • Pagination — limit bulk results to page 1 with a small pageSize (e.g. 10–20 hits per term). Deep pagination across thousands of queries would be very expensive.
  • Batch size cap — ES _msearch limits fan-out via its max_concurrent_searches setting. For thousands of terms, we'd want to chunk into batches (e.g. 100 at a time) and aggregate the results.
  • Response format — for very large batches, consider streaming the response or returning a downloadable CSV/JSON rather than a single massive JSON payload.
  • Rate limiting — this endpoint may need some guard (max terms per request, request timeout).

The hardest part isn't the search itself — it's deciding on the UX for presenting results from potentially thousands of queries (columnar export? hit counts only? full highlights?).
