Dev 1659 solr query lenght #48
Conversation
Force-pushed from 66e3aee to c4c61a1
Ronster2018 left a comment
This all looks fine to me. After running the tests, I did get a warning; however, nothing failed, so this could be fine.
ht_queue_service/queue_multiple_consumer_test.py:10
/app/ht_queue_service/queue_multiple_consumer_test.py:10: PytestCollectionWarning: cannot collect test class 'TestHTMultipleConsumerServiceConcrete' because it has a __init__ constructor (from: ht_queue_service/queue_multiple_consumer_test.py)
class TestHTMultipleConsumerServiceConcrete(QueueMultipleConsumer):
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
35 passed, 1 warning in 64.68s (0:01:04)
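For context on the warning above: pytest collects classes whose names start with "Test" but skips any such class that defines __init__, which is exactly what the log reports. A minimal sketch of that situation and one common way to suppress it; the class bodies below are stand-ins, not the real ht_queue_service code.

class QueueMultipleConsumer:
    # Stand-in for the real base class from ht_queue_service.
    def __init__(self, queue_name):
        self.queue_name = queue_name


class TestHTMultipleConsumerServiceConcrete(QueueMultipleConsumer):
    # Telling pytest explicitly not to collect this helper class avoids the
    # PytestCollectionWarning without renaming the class.
    __test__ = False

    def __init__(self, queue_name):
        super().__init__(queue_name)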
aelkiss left a comment
The changes make sense to me. I do have a couple of questions inline.
@pytest.fixture
def solr_api_url():
    return "http://solr-sdr-catalog:9033/solr/catalog/"
def solr_catalog_url():
In testing things reliant on Solr elsewhere, I've usually relied on a SOLR_URL environment variable that can be set in docker-compose.yml, rather than putting the URL in the tests. That said, when I have tests that mock a response from Solr, I do set the URL in the tests themselves (because we're essentially making a fake Solr service in the tests).
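A minimal sketch of that suggestion, assuming a SOLR_URL variable set in docker-compose.yml and keeping the currently hardcoded URL only as a fallback; the fixture name mirrors the one in the diff.

import os

import pytest


@pytest.fixture
def solr_api_url():
    # Prefer the SOLR_URL environment variable (set in docker-compose.yml);
    # fall back to the URL currently hardcoded in the tests.
    return os.environ.get("SOLR_URL", "http://solr-sdr-catalog:9033/solr/catalog/")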
| """ | ||
|
|
||
| # Use terms query parser for faster lookup for large sets of IDs, e.g., document_retriever_service | ||
| # The terms query parser in Solr is a highly efficient way to search for multiple exact values |
Noting that this might be worth trying in holdings for batch retrieval as well: https://hathitrust.atlassian.net/browse/ETT-238
@liseli Do you have a reference to the Solr documentation for the terms query parser?
You will find more details about the terms query parser on the Solr main page. It is recommended for filtering on facet results, but for certain use cases involving strings it provides a quicker option because it performs an exact match without altering the string. I would be happy to explore whether this approach could enhance holdings batch retrieval. I'll add the Solr main page to the README of this repository.
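For reference, a terms lookup uses Solr's local-params syntax, naming a single field followed by a comma-separated list of exact values; the field name and IDs below are placeholders, not values from this repository.

q={!terms f=id}id1,id2,id3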
Great. That page mentions:
"[the terms query parser] may be more efficient in some cases than using the Standard Query Parser to generate a boolean query since the default implementation method avoids scoring."
I think using fq also avoids scoring, so I'm curious whether we'd see any difference. However, the terms query parser might make the queries slightly easier to generate, so I think it makes sense to use it even if it and fq offer the same performance advantages.
I am curious to see if combining the terms query parser with filters results in better performance than just using the terms query parser. I will run some experiments using these queries to check the performance.
chuck_solr_params = copy.deepcopy(self.solr_retriever_query_params)
chuck_solr_params['q'] = solr_query
Should we use fq instead of q here since we don't need to rank the results?
Yes, using fq instead of q makes a lot of sense, because we are just applying a filter and we do not care about document ranking. I'll update the code.
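A minimal sketch of that change, based on the chunked-retrieval code in the diff above; base_params stands in for self.solr_retriever_query_params, and the "id" field name is an assumption.

import copy


def build_chunk_params(base_params: dict, ids: list[str]) -> dict:
    # Copy the retriever's base query parameters for this chunk of IDs.
    params = copy.deepcopy(base_params)
    # Use a match-all main query and put the ID lookup in fq:
    # filter queries are not scored and can be cached by Solr.
    params["q"] = "*:*"
    params["fq"] = "{!terms f=id}" + ",".join(ids)
    return params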
:return: response from Solr
"""
chuck_solr_params = copy.deepcopy(self.solr_retriever_query_params)
chuck -> chunk?
    n_cores = num_threads
else:
    n_cores = multiprocessing.cpu_count()
if PARALLELIZE:
When running in Kubernetes, we will probably want to make the batch size directly configurable rather than relying on the number of CPU cores (which will probably report the cores of the whole worker node rather than the resources actually available to our pod).
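A minimal sketch of that suggestion; the environment variable name is hypothetical and could be supplied by a Kubernetes ConfigMap, with cpu_count() kept only as a local-development fallback.

import multiprocessing
import os

# cpu_count() inside a pod reports the cores of the whole worker node, not the pod's
# CPU allocation, so prefer an explicit setting whenever one is provided.
n_cores = int(os.environ.get("DOCUMENT_RETRIEVER_BATCH_SIZE", multiprocessing.cpu_count()))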
I will create a task to add the ht_indexer configuration file and set up the config map for all configurable variables in Kubernetes. That is something I could do with K'Ron.
I'll work on this warning as part of a different task that I have created.
…ch without using rows parameters; the class retriever_services was restructured to reduce complexity; unit tests were updated
Force-pushed from 2705c40 to 3c54788
I have incorporated the recommendations from my peers in this pull request.
This PR introduces new logic for retrieving documents from Solr. (Ticket: DEV_1659)
How to test this PR?
docker build -t document_generator .