Conversation

@liseli (Contributor) commented Apr 24, 2025

This PR introduces a new logic to retrieve documents from Solr. (Ticket: DEV_1659)

  • Documents are retrieved in batches of 200 IDs at a time. Because the list of ht_ids can be very long, we split it into chunks and issue one query per batch to avoid Solr's "URI too long" error.
  • The chunk size is 200 because larger queries make Solr fail with status code 414. The chunk size was determined by testing the Solr query with different values (e.g., 100-500); batches of 200 ht_ids worked reliably.
  • We use the Solr terms query parser to retrieve documents by ID. This is the most efficient alternative for exact-match queries.
  • retriever_services has been restructured to make the logic easier to follow.
  • Tests were updated.
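The batching approach described above can be sketched as follows. This is a minimal illustration; the names (CHUNK_SIZE, make_batches, build_terms_query) and the ID field name are assumptions, not the actual identifiers used in ht_indexer.

```python
CHUNK_SIZE = 200  # larger batches risk Solr returning 414 URI Too Long

def make_batches(ids, chunk_size=CHUNK_SIZE):
    """Split a long list of ht_ids into fixed-size chunks."""
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]

def build_terms_query(ids, field="id"):
    """Build a Solr terms query parser clause for an exact-match ID lookup."""
    return f"{{!terms f={field}}}{','.join(ids)}"

# Example: 450 IDs become three batches of 200, 200, and 50.
batches = list(make_batches([f"doc_{n}" for n in range(450)]))
query = build_terms_query(["mdp.39015012345678", "uc1.b000000001"])
# -> "{!terms f=id}mdp.39015012345678,uc1.b000000001"
```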

How to test this PR?

  • Clone the repository and check out the branch

git clone git@github.com:hathitrust/ht_indexer.git
git checkout DEV-1659_SolrQuery_lenght
  • Create the image

docker build -t document_generator .

  • Run ht_indexer_tracker
docker compose up ht_indexer_tracker -d
docker compose exec ht_indexer_tracker python -m pytest ht_indexer_monitoring 
  • Run document_retriever_service container and test it
docker compose up document_retriever -d
docker compose exec document_retriever python -m pytest document_retriever_service catalog_metadata ht_utils 
  • Run document_generator_service container and test it
docker compose up document_generator -d
docker compose exec document_generator python -m pytest document_generator ht_document ht_queue_service ht_utils

@liseli liseli force-pushed the DEV-1659_SolrQuery_lenght branch from 66e3aee to c4c61a1 Compare April 24, 2025 11:32
@liseli liseli requested review from Ronster2018 and aelkiss April 24, 2025 11:40
@Ronster2018 (Contributor) left a comment


This all looks fine to me. After running the tests, I did get a warning; however, nothing failed, so this should be fine.

ht_queue_service/queue_multiple_consumer_test.py:10
  /app/ht_queue_service/queue_multiple_consumer_test.py:10: PytestCollectionWarning: cannot collect test class 'TestHTMultipleConsumerServiceConcrete' because it has a __init__ constructor (from: ht_queue_service/queue_multiple_consumer_test.py)
    class TestHTMultipleConsumerServiceConcrete(QueueMultipleConsumer):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
35 passed, 1 warning in 64.68s (0:01:04)

@aelkiss (Member) left a comment


The changes make sense to me. I do have a couple questions inline.

@pytest.fixture
def solr_api_url():
    return "http://solr-sdr-catalog:9033/solr/catalog/"

def solr_catalog_url():
Member

In testing things that rely on Solr elsewhere, I've usually relied on a SOLR_URL environment variable that can be set in docker-compose.yml, rather than putting the URL in the tests. That said, when I have tests that mock a response from Solr, I do set the URL in the tests themselves (because we're essentially making a fake Solr service in the tests).
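The suggestion above could look something like the sketch below. The variable name SOLR_URL matches the comment, but the helper name and the fallback URL handling are assumptions; in the real test suite this would typically be wrapped in a @pytest.fixture.

```python
import os

def resolve_solr_url():
    """Prefer SOLR_URL from the environment (set in docker-compose.yml);
    fall back to the in-cluster default when the variable is unset."""
    return os.environ.get("SOLR_URL", "http://solr-sdr-catalog:9033/solr/catalog/")
```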

"""

# Use terms query parser for faster lookup for large sets of IDs, e.g., document_retriever_service
# The terms query parser in Solr is a highly efficient way to search for multiple exact values
@aelkiss (Member) commented Apr 29, 2025

Noting that this might be worth trying in holdings for batch retrieval as well: https://hathitrust.atlassian.net/browse/ETT-238

@liseli Do you have a reference to the Solr documentation for the terms query parser?

On the Solr main page, you will find more details about the terms query parser. It is recommended for filtering facet results, but for certain use cases involving strings it provides a quicker option because it performs an exact match without altering the string. I would be happy to explore whether this approach could improve holdings batch retrieval. I'll add the Solr main page to the README of this repository.

Member

Great. That page mentions:

"[the terms query parser] may be more efficient in some cases than using the Standard Query Parser to generate a boolean query since the default implementation method avoids scoring."

I think if we use fq, that also avoids scoring, so I'm curious whether we'd see any difference. That said, the terms query parser might make the queries slightly easier to generate, so I think it makes sense to use it even if it and fq offer the same performance advantage.

Contributor Author

I am curious to see if combining the terms query parser with filters results in better performance than just using the terms query parser. I will run some experiments using these queries to check the performance.


chuck_solr_params = copy.deepcopy(self.solr_retriever_query_params)

chuck_solr_params['q'] = solr_query
Member

Should we use fq instead of q here since we don't need to rank the results?

Contributor Author

Yes, using fq instead of q makes a lot of sense, since we are just applying a filter and do not care about document ranking. I'll update the code.

:return: response from Solr
"""

chuck_solr_params = copy.deepcopy(self.solr_retriever_query_params)
Member

chuck -> chunk?

    n_cores = num_threads
else:
    n_cores = multiprocessing.cpu_count()

if PARALLELIZE:
Member

When running in Kubernetes we will probably want to make the batch size directly configurable rather than relying on the number of CPU cores (which will probably tell us the number of CPU cores for the whole worker node rather than the resources actually available to our pod)
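One way to make this configurable, per the comment above, is an environment override that takes precedence over cpu_count(). The variable name HT_INDEXER_NUM_WORKERS is hypothetical; in Kubernetes it would come from the pod's ConfigMap, since cpu_count() reports the whole worker node's cores rather than the pod's CPU limit.

```python
import multiprocessing
import os

def resolve_worker_count(num_threads=None):
    """Explicit argument > environment override > host CPU count."""
    if num_threads:
        return num_threads
    env_value = os.environ.get("HT_INDEXER_NUM_WORKERS")  # hypothetical name
    if env_value:
        return int(env_value)
    return multiprocessing.cpu_count()
```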

Contributor Author

I will create a task to add the ht_indexer configuration file and set up the config map for all configurable variables in Kubernetes. That is something I could do with K'Ron.

@liseli (Contributor Author) commented Apr 29, 2025

This all looks fine to me. After running the tests, I did get a warning; however, nothing failed, so this should be fine.

ht_queue_service/queue_multiple_consumer_test.py:10
  /app/ht_queue_service/queue_multiple_consumer_test.py:10: PytestCollectionWarning: cannot collect test class 'TestHTMultipleConsumerServiceConcrete' because it has a __init__ constructor (from: ht_queue_service/queue_multiple_consumer_test.py)
    class TestHTMultipleConsumerServiceConcrete(QueueMultipleConsumer):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
35 passed, 1 warning in 64.68s (0:01:04)

I'll work on this warning as part of a different task that I have created

…ch without using rows parameters; the class retriever_services was restructured to reduce complexity; unit tests were updated
@liseli liseli force-pushed the DEV-1659_SolrQuery_lenght branch from 2705c40 to 3c54788 Compare April 29, 2025 20:08
@liseli (Contributor Author) commented Apr 29, 2025

I have included the recommendations from my peers in this pull request.

@liseli liseli merged commit 98c6d1f into main Apr 30, 2025
1 check passed
@liseli liseli deleted the DEV-1659_SolrQuery_lenght branch April 30, 2025 13:16
4 participants