Search optimization and indexing based on datetime #405
Conversation
@jonhealy1 The MR is already finished and ready for code review.
@GrzegorzPustulka There's a couple of conflicts now. They don't look too bad. I have been travelling but am going to try to review this in the next few days.
@jamesfisher-geo @StijnCaerts @rhysrevans3 Hi. Added you guys as reviewers if you have time to have a look :)
Looks okay to me but I have a couple of questions.
logger.error(f"Invalid interval format: {datetime}, error: {e}") | ||
datetime_search = None |
Should this error be returned to the user rather than continuing the search without a datetime filter?
done
except (ValueError, TypeError) as e:
    # Handle invalid interval formats if return_date fails
    logger.error(
        f"Invalid interval format: {search_request.datetime}, error: {e}"
    )
    datetime_search = None
As above.
done
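Presumably the fix looks something like the following (a hedged sketch, reusing return_date and the message from the surrounding diff; the exact exception type used in the PR may differ):

from fastapi import HTTPException

try:
    datetime_search = return_date(search_request.datetime)
except (ValueError, TypeError) as e:
    # Surface the bad input as a 400 instead of silently dropping the filter.
    raise HTTPException(
        status_code=400,
        detail=f"Invalid interval format: {search_request.datetime}, error: {e}",
    )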
def create_index_name(collection_id: str, start_date: str) -> str:
    """Create index name from collection ID and start date.

    Args:
        collection_id (str): Collection identifier.
        start_date (str): Start date for the index.

    Returns:
        str: Formatted index name.
    """
    cleaned = collection_id.translate(_ES_INDEX_NAME_UNSUPPORTED_CHARS_TABLE)
    return f"{ITEMS_INDEX_PREFIX}{cleaned.lower()}_{start_date}"
Is this the equivalent of index_by_collection_id for the simple method? If it is, should it not also include the hex of the collection_id and -000001?
What's the benefit of having the start datetime in the index name? Could you just have it in the alias with the end datetime? You could just use a count to prevent index name clashes.
You would then only need to create a new index when you exceed the max size, and not for earlier items. If the item's start datetime is earlier or the end datetime is later than the current alias, then update the alias.
You're right, I changed it as you say. The only difference is that the indexes have UUID4 suffixes; part1/part2 naming could be misleading because part3 might be younger than part2, etc.
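A minimal sketch of that UUID-based naming (the translate table here is a stand-in for the PR's _ES_INDEX_NAME_UNSUPPORTED_CHARS_TABLE, not the exact code):

import uuid

ITEMS_INDEX_PREFIX = "items_"

# Stand-in for the unsupported-character translation table used in the PR.
_UNSUPPORTED_CHARS = str.maketrans("", "", '\\/*?"<>| ,#:')


def create_index_name(collection_id: str) -> str:
    """Build a unique index name; a UUID4 suffix avoids name clashes without
    implying any ordering between indexes (the date range lives in the alias)."""
    cleaned = collection_id.translate(_UNSUPPORTED_CHARS).lower()
    return f"{ITEMS_INDEX_PREFIX}{cleaned}_{uuid.uuid4()}"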
def __init__(self, cache_ttl_seconds: int = 3600):
    """Initialize the cache manager.

    Args:
        cache_ttl_seconds (int): Time-to-live for cache entries in seconds.
    """
    self._cache: Optional[Dict[str, List[str]]] = None
    self._timestamp: float = 0
    self._ttl = cache_ttl_seconds
Would it be better to just update the cache as aliases are set/updated rather than polling ES every hour?
I did it this way because some people might use external tools to upload products to the database; hence the idea to poll rather than update on write.
Overall looks great. I've got some comments around error handling and some cache handling as well.
This PR will add a lot of future maintenance burden in its current form. How about we implement only in async and not include the sync code? That would cut down on repetitive code in this PR.
@jonhealy1 @GrzegorzPustulka what are your thoughts on this?
@@ -342,6 +348,7 @@ async def item_collection(
    sort=None,
    token=token,
    collection_ids=[collection_id],
    datetime_search=datetime_search,
Is this needed? We apply the datetime_search to the search variable on line 331. If this is optional, could we omit it?
This is needed in this function so that you can find which index this product is in.
@@ -560,6 +574,7 @@ async def post_search(
    token=search_request.token,
    sort=sort,
    collection_ids=search_request.collections,
    datetime_search=datetime_search,
Same here -- is this needed? We apply the datetime_search to the search variable on line 513. If this is optional, could we omit it?
as above
class ElasticsearchAdapter(SearchEngineAdapter):
    """Elasticsearch-specific adapter implementation."""

    async def create_simple_index(self, client: Any, collection_id: str) -> str:
The index mappings and settings are missing from ElasticsearchAdapter().create_simple_index(). Could you include the mappings here like is done in OpenSearchAdapter()._create_index_body()? The patterns for creating an index should be the same between ElasticsearchAdapter() and OpenSearchAdapter() IMO. How about creating a _create_index_body() method in ElasticsearchAdapter()?
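Something along these lines, perhaps (a sketch only; the mapping/setting constant names are illustrative, not necessarily those in the repo):

def _create_index_body(self, collection_id: str) -> dict:
    """Shared index body so both engines create indexes identically."""
    return {
        "aliases": {index_alias_by_collection_id(collection_id): {}},
        "mappings": ES_ITEMS_MAPPINGS,
        "settings": ES_ITEMS_SETTINGS,
    }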
Now Elasticsearch and OpenSearch are identical, so there is only one class for both of them.
    Returns:
        SearchEngineType: Detected engine type.
    """
    return (
How about using isinstance() here rather than matching the string?

return (
    OpenSearchAdapter()
    if isinstance(client, (OpenSearch, AsyncOpenSearch))
    else ElasticsearchAdapter()
)
This code snippet no longer exists
"""Factory for creating search engine adapters.""" | ||
|
||
@staticmethod | ||
def create_adapter(engine_type: SearchEngineType) -> SearchEngineAdapter: |
Is this function necessary? See comment below
it is no longer needed
    )
    return product_datetime


async def handle_new_collection(
Logging statements in handle_new_collection() and handle_new_collection_sync() would be useful.
I definitely think we need to do a better job at logging on this project.
done
_instance = None

def __new__(cls, client):
I'm a bit confused with this implementation. Maybe I am missing something. Could this be replaced with the normal method of instance creation using __init__()?

def __init__(self, client: Any):
    self.cache_manager = IndexCacheManager()
    self.alias_loader = AsyncIndexAliasLoader(client, self.cache_manager)
I used the singleton design pattern here. This is so that every time we create a new object we get the same first instance; it has to be the same instance because the cache state is stored there.
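A minimal sketch of that singleton (the class name is illustrative; IndexCacheManager and AsyncIndexAliasLoader are the classes discussed in this thread):

from typing import Any


class DatetimeIndexManager:
    """Process-wide singleton so every caller shares one cache state."""

    _instance = None

    def __new__(cls, client: Any):
        # Return the first instance on every construction so the cached
        # alias state is shared rather than rebuilt.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.cache_manager = IndexCacheManager()
            cls._instance.alias_loader = AsyncIndexAliasLoader(
                client, cls._instance.cache_manager
            )
        return cls._instance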
class IndexCacheManager:
    """Manages caching of index aliases with expiration."""

    def __init__(self, cache_ttl_seconds: int = 3600):
I believe some concurrency management is needed here because multiple threads may be attempting to access the cache resource at the same time. From what I have found, threading.Lock() should work: https://docs.python.org/3/library/threading.html#lock-objects
The following (untested) should place a lock on the cache when accessing it and release it when finished:
import threading


class IndexCacheManager:
    def __init__(self, cache_ttl_seconds: int = 3600):
        self._cache: Optional[Dict[str, List[str]]] = None
        self._timestamp: float = 0
        self._ttl = cache_ttl_seconds
        self._lock = threading.Lock()

    def get_cache(self) -> Optional[Dict[str, List[str]]]:
        """Get the current cache if not expired.

        Returns:
            Optional[Dict[str, List[str]]]: Cache data if valid, None if expired.
        """
        with self._lock:
            if self.is_expired:
                return None
            return self._cache
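The snippet above references an is_expired property that isn't shown; a minimal sketch of what it could look like:

import time


class IndexCacheManager:
    # __init__ as in the snippet above

    @property
    def is_expired(self) -> bool:
        """True once the TTL has elapsed since the cache was last refreshed."""
        return (time.time() - self._timestamp) > self._ttl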
done
""" | ||
if self.is_expired: | ||
return None | ||
return self._cache |
Returning the _cache object here could be problematic because it is a pointer to the actual cache. How about returning a copy?

return {k: v.copy() for k, v in self._cache.items()}
done
return (
    SyncDatetimeBasedIndexSelector(sync_client)
    if use_datetime_filtering
    else UnfilteredIndexSelector()
But the UnfilteredIndexSelector() is async.
Yes, but it doesn't matter; this is a class that implements the previous interface, so it can be asynchronous.
I'll improve all the comments in the coming days, remove the sync versions, and fix the bugs my friend found testing this MR.
@@ -998,6 +1005,9 @@ async def _search_and_get_ids(
async def test_search_datetime_with_null_datetime(
    app_client, txn_client, load_test_data
):
    if not os.getenv("ENABLE_DATETIME_INDEX_FILTERING"):
        pytest.skip()
Is this right? This test should definitely run in default mode.
Yes, because datetime is passed there as null, it will return a 400 error; in this indexing method it's not possible to index without a datetime.
@GrzegorzPustulka Can we set ENABLE_DATETIME_INDEX_FILTERING for the associated tests and then turn it off for the default tests?
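One way to express that split (a hedged sketch; the env var comes from this PR, the skipif marker is standard pytest):

import os

import pytest

# Run only when the datetime-index mode is enabled for the test session;
# the default test run leaves ENABLE_DATETIME_INDEX_FILTERING unset.
requires_datetime_indexing = pytest.mark.skipif(
    not os.getenv("ENABLE_DATETIME_INDEX_FILTERING"),
    reason="requires ENABLE_DATETIME_INDEX_FILTERING",
)


@requires_datetime_indexing
@pytest.mark.asyncio
async def test_search_datetime_with_null_datetime(app_client, txn_client, load_test_data):
    ...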
README.md (outdated)
| Variable | Description | Default | Required |
|---|---|---|---|
| `DATABASE_REFRESH` | Controls whether database operations refresh the index immediately after changes. If set to `true`, changes will be immediately searchable. If set to `false`, changes may not be immediately visible but can improve performance for bulk operations. If set to `wait_for`, changes will wait for the next refresh cycle to become visible. | `false` | Optional |
| `ENABLE_TRANSACTIONS_EXTENSIONS` | Enables or disables the Transactions and Bulk Transactions API extensions. If set to `false`, the POST `/collections` route and related transaction endpoints (including bulk transaction operations) will be unavailable in the API. This is useful for deployments where mutating the catalog via the API should be prevented. | `true` | Optional |
| `ENABLE_DATETIME_INDEX_FILTERING` | Enable datetime-based index selection using collection IDs. Requires indexes in format: STAC_ITEMS_INDEX_PREFIX_collection-id_start_year-start_month-start_day-end_year-end_month-end_day, e.g. items_sentinel-2-l2a_2025-06-06-2025-09-22. | `false` | Optional |
| `DATETIME_INDEX_MAX_SIZE_GB` | Maximum size limit in GB for datetime-based indexes. When an index exceeds this size, a new time-partitioned index will be created. Note: This value should account for ~25% overhead due to OS/ES caching of data structures and metadata. Only applies when `ENABLE_DATETIME_INDEX_FILTERING` is enabled. | `25` | Optional |
These are important additions and maybe should have their own section in the readme for a better explanation.
README and changelog: I'll update them when the code gets accepted; there might still be too many changes.
await self.client.delete_by_query(
    index=index_alias_by_collection_id(collection_id),
    id=mk_item_id(item_id, collection_id),
    body={"query": {"term": {"_id": mk_item_id(item_id, collection_id)}}},
Delete by query will not raise an ESNotFoundError.
I think it should because this test passes.
@pytest.mark.asyncio
async def test_delete_missing_item(app_client, load_test_data):
    """Test deletion of an item which does not exist (transactions extension)"""
    test_item = load_test_data("test_item.json")
    resp = await app_client.delete(
        f"/collections/{test_item['collection']}/items/hijosh"
    )
    assert resp.status_code == 404
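For what it's worth, delete_by_query reports how many documents it removed in its response body, so a missing item can also be surfaced explicitly. A hedged sketch using the helpers from this diff (the exception raised here is illustrative; the PR may map the condition to a 404 differently):

resp = await self.client.delete_by_query(
    index=index_alias_by_collection_id(collection_id),
    body={"query": {"term": {"_id": mk_item_id(item_id, collection_id)}}},
    refresh=refresh,
)
if resp.get("deleted", 0) == 0:
    # No document matched: surface the 404 explicitly.
    raise LookupError(f"Item {item_id} in collection {collection_id} not found")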
await self.client.delete_by_query(
    index=index_alias_by_collection_id(collection_id),
    id=mk_item_id(item_id, collection_id),
    body={"query": {"term": {"_id": mk_item_id(item_id, collection_id)}}},
    refresh=refresh,
)
Delete by query will not raise a NotFoundError.
as above
result["gte"] = ( | ||
parts[0] if parts[0] != ".." else datetime_type.min.isoformat() + "Z" | ||
) | ||
result["lte"] = ( | ||
parts[1] | ||
if len(parts) > 1 and parts[1] != ".." | ||
else datetime_type.max.isoformat() + "Z" | ||
) |
These explicit min/max values are a bit ugly. Are these really needed?
It's needed, unfortunately.
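For context, the fallbacks only kick in for the open-ended interval forms, e.g. "2025-06-01T00:00:00Z/.." (illustrative values):

from datetime import datetime

# An ".." bound falls back to the extreme representable datetimes:
print(datetime.min.isoformat() + "Z")  # 0001-01-01T00:00:00Z
print(datetime.max.isoformat() + "Z")  # 9999-12-31T23:59:59.999999Z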
@@ -120,11 +176,11 @@ async def delete_item_index_shared(settings: Any, collection_id: str) -> None:
    client = settings.create_client

    name = index_alias_by_collection_id(collection_id)
-   resolved = await client.indices.resolve_index(name=name)
+   resolved = await client.indices.resolve_index(name=name, ignore=[404])
What does the ignore parameter do? I cannot find it in the ES/OS docs.
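(Best guess, treating this as an assumption rather than documented behavior: in the 7.x-style elasticsearch-py/opensearch-py clients, ignore is a transport-level option that suppresses raising for the listed HTTP status codes and returns the error body instead.)

# Without ignore=[404], a missing index raises NotFoundError;
# with it, the call returns the error body and execution continues.
resolved = await client.indices.resolve_index(name=name, ignore=[404])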
if "aliases" in resolved and resolved["aliases"]: | ||
[alias] = resolved["aliases"] | ||
await client.indices.delete_alias(index=alias["indices"], name=alias["name"]) | ||
await client.indices.delete(index=alias["indices"]) | ||
else: | ||
await client.indices.delete(index=name) | ||
await client.indices.delete(index=name, ignore=[404]) |
Same as above.
Looks great @GrzegorzPustulka, not sure if you are finished with this yet. I left a couple comments, but they aren't blocking. I'm ready to approve if your PR is ready. Maybe we give the others a chance to take a look before merging.
Everything I had to do is finished; I just added the corrected README and changelog.
Looks good to me. Great contribution @GrzegorzPustulka
@@ -58,3 +69,53 @@ def return_date(
    result["lte"] = end.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"

    return result


def extract_date(date_str: str) -> date:
Could you use the existing function datetime_to_str() instead?

stac-fastapi-elasticsearch-opensearch/stac_fastapi/core/stac_fastapi/core/datetime_utils.py, line 38 in 59d43f9:

def datetime_to_str(dt: datetime, timespec: str = "auto") -> str:
as below
date_string = match.group(0)

try:
    extracted_date = datetime_type.strptime(date_string, "%Y-%m-%d").date()
Could you use this function here?
from stac_fastapi.types.rfc3339 import rfc3339_str_to_datetime
Maybe that does not matter, though
It doesn't make sense here because this function adds a time zone.
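To illustrate the distinction (a quick sketch; rfc3339_str_to_datetime returns a timezone-aware datetime, while the index-name parsing only needs a naive calendar date):

from datetime import datetime

extracted = datetime.strptime("2025-01-15", "%Y-%m-%d").date()
print(extracted)  # 2025-01-15, no timezone involved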
@GrzegorzPustulka Hi. I have been travelling for a while but will be home next week when I can have a better look at everything.
Related Issue(s):
Index Management System with Time-based Partitioning
Description
This PR introduces a new index management system that enables automatic index partitioning based on dates and index size control with automatic splitting.
How it works
System Architecture
The system consists of several main components:
1. Search Engine Adapters
- SearchEngineAdapter - base class
- ElasticsearchAdapter and OpenSearchAdapter - implementations for specific engines

2. Index Selection Strategies
- AsyncDatetimeBasedIndexSelector / SyncDatetimeBasedIndexSelector - date-based index filtering
- UnfilteredIndexSelector - returns all indexes (fallback)

3. Data Insertion Strategies
Datetime Strategy - Operation Details

Index Format:
- items_{collection}_{start-date} for the current (open-ended) index; items_{collection}_{start-date}-{end-date} once it is closed (see the scenarios below)

Item Insertion Process:
- Determines the target index from the item's datetime (properties.datetime)
- Checks the size limit (DATETIME_INDEX_MAX_SIZE_GB) - splits the index when exceeded

Early Date Handling:
If an item has a date earlier than the oldest index, a new index covering that earlier period is created (Scenario 3 below).

Index Splitting:
When an index exceeds the size limit, it is closed with an end date and a new index is opened (Scenario 2 below).
Cache and Performance

IndexCacheManager: caches the alias-to-index mapping with TTL-based expiration (default 3600 seconds).
AsyncIndexAliasLoader / SyncIndexAliasLoader: load aliases from the search backend and refresh the cache.

Configuration

New Environment Variables: ENABLE_DATETIME_INDEX_FILTERING (default: false) and DATETIME_INDEX_MAX_SIZE_GB (default: 25); see the README table above.
Usage Examples

Scenario 1: Adding items to a new collection
- Item with date 2025-01-15 → creates index items_collection_2025-01-15

Scenario 2: Size limit exceeded
- Index items_collection_2025-01-01 reaches 25GB
- Item with date 2025-03-15 → system splits the index: items_collection_2025-01-01-2025-03-15, then creates items_collection_2025-03-16

Scenario 3: Item with early date
- Current index: items_collection_2025-02-01
- Item with date 2024-12-15 → creates: items_collection_2024-12-15-2025-01-31
Search

The system automatically filters indexes during search: a query with a date range searches only the indexes containing items from that period, instead of all collection indexes.
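A minimal sketch of that selection logic (illustrative names; assumes each index's start/end dates are known from its alias, as described above):

from datetime import date
from typing import Dict, List, Tuple


def select_indexes(
    ranges: Dict[str, Tuple[date, date]], gte: date, lte: date
) -> List[str]:
    """Keep only the indexes whose [start, end] range overlaps the query range."""
    return [
        name
        for name, (start, end) in ranges.items()
        if start <= lte and end >= gte  # the two ranges overlap
    ]


# Example: a query for 2025-02-01..2025-02-28 hits only the first index.
ranges = {
    "items_collection_2025-01-01-2025-03-15": (date(2025, 1, 1), date(2025, 3, 15)),
    "items_collection_2025-03-16": (date(2025, 3, 16), date.max),
}
print(select_indexes(ranges, date(2025, 2, 1), date(2025, 2, 28)))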
Factories
- IndexSelectorFactory: create_async_selector() / create_sync_selector()
- IndexInsertionFactory
- SearchEngineAdapterFactory
Backward Compatibility
- ENABLE_DATETIME_INDEX_FILTERING=false → works as before

All operations have sync and async versions for different usage contexts in the application.
PR Checklist:
- Code is formatted and linted (run pre-commit run --all-files)
- Tests pass (run make test)