## Feasibility Assessment: Bulk CSV Search Endpoint
The core building blocks already exist; this is mostly plumbing work.

- ES `_msearch` is already used in `ElasticsearchPages.scala` and `Pages2.scala`; these execute multiple search queries in a single ES round-trip via `multi()`.
- Query/filter construction is already modular. `SearchParameters` is a clean case class, and the filter-building logic in `SearchContext` operates on a single `SearchParameters`, so it can be called in a loop without modification.
- Permission checking can be done once per request (the user's visibility doesn't change between queries), then reused across all search terms.
- Results are already JSON-serializable via the existing `SearchResults` model.
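To illustrate how the existing per-query pieces compose, here is a minimal sketch. The `SearchParameters` and filter-builder below are hypothetical, pared-down stand-ins (the real classes have many more fields); the point is only the shape of the loop: shared filters are fixed once, and only the term varies per query.

```scala
// Hypothetical, simplified stand-ins for the real SearchParameters /
// SearchContext classes, just to show the per-term loop shape.
case class SearchParameters(query: String, collections: List[String])

// Stands in for the SearchContext filter-building logic, which operates on a
// single SearchParameters; in bulk mode it is simply invoked once per term.
def buildFilters(p: SearchParameters): Map[String, Any] =
  Map("query" -> p.query, "collections" -> p.collections)

// Shared UI filters are resolved once per request; only the term varies.
def bulkParameters(terms: List[String], collections: List[String]): List[SearchParameters] =
  terms.map(t => SearchParameters(t, collections))
```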
## What would need to be built
| Layer | Work |
|---|---|
| Route | New `POST /api/search/bulk` (POST because the payload can be large) |
| Controller | Parse the CSV/JSON body into a `List[String]` of search terms, apply the shared UI filters to each, call `verifyParameters` per query |
| Index service | New `queryBatch(params: List[(SearchParameters, SearchContext)])` method using the existing `multi()` msearch pattern |
| Response | Array of `SearchResults`, each tagged with the originating search term for CSV correlation |
| Frontend | CSV upload UI plus a new `SearchApi` method; the existing filters (collections, workspaces, MIME types, dates) can be reused as-is since they're just query params |
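The controller-side term extraction can be sketched with plain stdlib string handling. This is an assumption about the input format (one term per line and/or comma-separated); a real CSV upload with quoting would need a proper CSV parser.

```scala
// Sketch of the controller-side parsing step: split the uploaded body into
// search terms, treating newlines and commas as separators, trimming
// whitespace, and dropping blanks and duplicates.
def parseTerms(body: String): List[String] =
  body
    .split("[\r\n,]+")
    .iterator
    .map(_.trim)
    .filter(_.nonEmpty)
    .distinct
    .toList
```

Deduplicating up front keeps the `_msearch` payload minimal and makes per-term result tagging unambiguous.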
## Key design decisions
- **Aggregations**: probably omit per-query aggs in bulk mode (expensive and not useful per term).
- **Pagination**: limit bulk results to page 1 with a small `pageSize` (e.g. 10–20 hits per term). Deep pagination across thousands of queries would be very expensive.
- **Batch size cap**: ES `_msearch` has a default `max_concurrent_searches` setting. For thousands of terms, we'd want to chunk into batches (e.g. 100 at a time) and aggregate the results.
- **Response format**: for very large batches, consider streaming the response or returning a downloadable CSV/JSON rather than a single massive JSON payload.
- **Rate limiting**: this endpoint may need a guard (max terms per request, request timeout).
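The chunking decision above reduces to a small helper. `runMsearch` here is a hypothetical stand-in for the real `multi()`-based index-service call; the sketch only shows the batching and aggregation shape.

```scala
// Rather than one huge _msearch, chunk the terms and issue one _msearch per
// chunk, then concatenate the per-term results in order. `runMsearch` stands
// in for the real multi()-based call, which returns one result per term.
def searchInBatches[R](terms: List[String], batchSize: Int)(
    runMsearch: List[String] => List[R]): List[R] =
  terms.grouped(batchSize).flatMap(runMsearch).toList
```

Because `grouped` preserves order, each result's position still corresponds to its originating term, which is what the per-term tagging in the response relies on.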
The hardest part isn't the search itself; it's deciding on the UX for presenting results from potentially thousands of queries (columnar export? hit counts only? full highlights?).