Conversation

@liseli (Contributor) commented Apr 24, 2025

This PR introduces a new logic to retrieve documents from Solr. (Ticket: DEV_1659)

  • Documents are retrieved in batches of 200 IDs at a time. Because the list of ht_ids can be very long, we split it into chunks and issue one query per batch to avoid Solr's "URI too long" error.
  • The chunk size is 200 because larger queries make Solr fail with status code 414. The chunk size was determined by testing the Solr query with different values (e.g., 100-500); batches of 200 ht_ids worked reliably.
  • We use the Solr terms query parser to retrieve documents by ID. This is the most efficient alternative for exact-match queries.
  • retriever_services has been restructured to make the logic easier to follow.
  • Tests were updated.
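The batching approach described above can be sketched as follows. This is a minimal illustration; the names (CHUNK_SIZE, make_batches, build_terms_query) and the ID field name are assumptions, not the actual identifiers used in ht_indexer.

```python
CHUNK_SIZE = 200  # larger batches risk Solr returning 414 URI Too Long

def make_batches(ids, chunk_size=CHUNK_SIZE):
    """Split a long list of ht_ids into fixed-size chunks."""
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]

def build_terms_query(ids, field="id"):
    """Build a Solr terms query parser clause for an exact-match ID lookup."""
    return f"{{!terms f={field}}}{','.join(ids)}"

# Example: 450 IDs become three batches of 200, 200, and 50.
batches = list(make_batches([f"doc_{n}" for n in range(450)]))
query = build_terms_query(["mdp.39015012345678", "uc1.b000000001"])
# -> "{!terms f=id}mdp.39015012345678,uc1.b000000001"
```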

How to test this PR?

  • Clone the repository and check out the branch

git clone git@github.com:hathitrust/ht_indexer.git
git checkout DEV-1659_SolrQuery_lenght
  • Create the image

docker build -t document_generator .

  • Run ht_indexer_tracker
docker compose up ht_indexer_tracker -d
docker compose exec ht_indexer_tracker python -m pytest ht_indexer_monitoring 
  • Run document_retriever_service container and test it
docker compose up document_retriever -d
docker compose exec document_retriever python -m pytest document_retriever_service catalog_metadata ht_utils 
  • Run document_generator_service container and test it
docker compose up document_generator -d
docker compose exec document_generator python -m pytest document_generator ht_document ht_queue_service ht_utils

@liseli liseli force-pushed the DEV-1659_SolrQuery_lenght branch from 66e3aee to c4c61a1 Compare April 24, 2025 11:32
@liseli liseli requested review from Ronster2018 and aelkiss April 24, 2025 11:40
@Ronster2018 (Contributor) left a comment


This all looks fine to me. After running the tests, I did get a warning; however, nothing failed, so this should be fine.

ht_queue_service/queue_multiple_consumer_test.py:10
  /app/ht_queue_service/queue_multiple_consumer_test.py:10: PytestCollectionWarning: cannot collect test class 'TestHTMultipleConsumerServiceConcrete' because it has a __init__ constructor (from: ht_queue_service/queue_multiple_consumer_test.py)
    class TestHTMultipleConsumerServiceConcrete(QueueMultipleConsumer):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
35 passed, 1 warning in 64.68s (0:01:04)

@aelkiss (Member) left a comment


The changes make sense to me. I do have a couple questions inline.

@pytest.fixture
def solr_api_url():
    return "http://solr-sdr-catalog:9033/solr/catalog/"

def solr_catalog_url():
Member

In testing things that rely on Solr elsewhere, I've usually relied on a SOLR_URL environment variable that can be set in docker-compose.yml, rather than putting the URL in the tests. That said, when I have tests that mock a response from Solr, I do set the URL in the tests themselves (because we're essentially making a fake Solr service in the tests).
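The suggestion above could look something like the sketch below. The variable name SOLR_URL matches the comment, but the helper name and the fallback URL handling are assumptions; in the real test suite this would typically be wrapped in a @pytest.fixture.

```python
import os

def resolve_solr_url():
    """Prefer SOLR_URL from the environment (set in docker-compose.yml);
    fall back to the in-cluster default when the variable is unset."""
    return os.environ.get("SOLR_URL", "http://solr-sdr-catalog:9033/solr/catalog/")
```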

"""

# Use terms query parser for faster lookup for large sets of IDs, e.g., document_retriever_service
# The terms query parser in Solr is a highly efficient way to search for multiple exact values
@aelkiss (Member) commented Apr 29, 2025

Noting that this might be worth trying in holdings for batch retrieval as well: https://hathitrust.atlassian.net/browse/ETT-238

@liseli Do you have a reference to the Solr documentation for the terms query parser?

On the Solr main page, you will find more details about the terms query parser. It is recommended for filtering facet results, but for certain use cases involving strings it provides a quicker option because it performs an exact match without altering the string. I would be happy to explore whether this approach could improve holdings batch retrieval. I'll add the Solr main page to the README of this repository.

Member

Great. That page mentions:

"[the terms query parser] may be more efficient in some cases than using the Standard Query Parser to generate a boolean query since the default implementation method avoids scoring."

I think if we use fq, that also avoids scoring, so I'm curious whether we'd see any difference. That said, the terms query parser might make the queries slightly easier to generate, so I think it makes sense to use it even if it and fq offer the same performance advantage.

Contributor Author

I am curious to see if combining the terms query parser with filters results in better performance than just using the terms query parser. I will run some experiments using these queries to check the performance.


chuck_solr_params = copy.deepcopy(self.solr_retriever_query_params)

chuck_solr_params['q'] = solr_query
Member

Should we use fq instead of q here since we don't need to rank the results?

Contributor Author

Yes, using fq instead of q makes a lot of sense, since we are just applying a filter and do not care about document ranking. I'll update the code.

:return: response from Solr
"""

chuck_solr_params = copy.deepcopy(self.solr_retriever_query_params)
Member

chuck -> chunk?

    n_cores = num_threads
else:
    n_cores = multiprocessing.cpu_count()

if PARALLELIZE:
Member

When running in Kubernetes we will probably want to make the batch size directly configurable rather than relying on the number of CPU cores (which will probably tell us the number of CPU cores for the whole worker node rather than the resources actually available to our pod)
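One way to make this configurable, per the comment above, is an environment override that takes precedence over cpu_count(). The variable name HT_INDEXER_NUM_WORKERS is hypothetical; in Kubernetes it would come from the pod's ConfigMap, since cpu_count() reports the whole worker node's cores rather than the pod's CPU limit.

```python
import multiprocessing
import os

def resolve_worker_count(num_threads=None):
    """Explicit argument > environment override > host CPU count."""
    if num_threads:
        return num_threads
    env_value = os.environ.get("HT_INDEXER_NUM_WORKERS")  # hypothetical name
    if env_value:
        return int(env_value)
    return multiprocessing.cpu_count()
```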

Contributor Author

I will create a task to add the ht_indexer configuration file and set up the config map for all configurable variables in Kubernetes. That is something I could do with K'Ron.

@liseli (Contributor Author) commented Apr 29, 2025

This all looks fine to me. After running the tests, I did get a warning; however, nothing failed, so this should be fine.

ht_queue_service/queue_multiple_consumer_test.py:10
  /app/ht_queue_service/queue_multiple_consumer_test.py:10: PytestCollectionWarning: cannot collect test class 'TestHTMultipleConsumerServiceConcrete' because it has a __init__ constructor (from: ht_queue_service/queue_multiple_consumer_test.py)
    class TestHTMultipleConsumerServiceConcrete(QueueMultipleConsumer):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
35 passed, 1 warning in 64.68s (0:01:04)

I'll work on this warning as part of a different task that I have created

…ch without using rows parameters; the class retriever_services was restructured to reduce complexity; unit tests were updated
@liseli liseli force-pushed the DEV-1659_SolrQuery_lenght branch from 2705c40 to 3c54788 Compare April 29, 2025 20:08
@liseli (Contributor Author) commented Apr 29, 2025

I have included the recommendations from my peers in this pull request.

@liseli liseli merged commit 98c6d1f into main Apr 30, 2025
1 check passed
@liseli liseli deleted the DEV-1659_SolrQuery_lenght branch April 30, 2025 13:16
4 participants