Added support for segmentation by index shards #10
base: 1.x
Segmentation by shards is applied on top of already existing segmentation instructions if present, or if not present, operates as a specialized segmentation.
First of all, thanks for the contribution.
The ability to target specific shards can be beneficial in the following scenarios:
I have not yet run any performance tests on this feature, but I plan to do so in the coming week. I will mostly test runs with shard segmentation enabled rather than without it, unless I have reason to suspect that it hurts performance, which I will be monitoring and comparing against the known baselines I have.
So I will wait for your performance tests, as your reasoning makes sense.
No problem. I plan to make additional contributions, so it might be easier if some kind of performance testing suite were available for the tool. How did you initially test its performance?
I do not have performance tests for the tool. Any test depends heavily on the index size, storage type, cluster size, and index mapping. For me it would be enough if you told me how much faster the reindex was on your index with the new feature than without it.
Hi @romanmar1, do you have any results from using this change in your production environments?
I ran the tool with those settings on our production environment a couple of times, on an index containing roughly 600 million small documents (6-8 not-analyzed string fields), and was able to reindex them in roughly 30-40 minutes. Compared to our baseline ingestion rates, that was a considerable improvement, but it is more like comparing apples and oranges, because too many different factors are at play to compare properly (our baseline indexing process was Hadoop-based, this tool is not, etc.). To test this properly, I should run the tool without the shard segmentation settings and compare it to a run with them.
Thanks @romanmar1. Could you run the tool without the shard segmentation settings and compare it to a run with them, so I can decide whether this PR is a reasonable performance improvement?
Segmentation by shards is applied on top of already existing segmentation instructions if present, or if not present, operates as a specialized segmentation.
e.g.: -s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2,3,4 -segmentationByShards
This will create 5 querying processors, one for each listed shard.
-s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2
This will create a single querying processor that queries only shards 0, 1, and 2.
-s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2,3,4 -segmentationByShards -segmentationField rate.newCoolness -segmentationThresholds 0.0,0.5,0.59,0.6,0.7,0.9,1.0
This will create 30 querying processors (num threshold segments * num shards): the 7 thresholds bound 6 segments, and each of the 5 shards will be queried by 6 querying processors, one for each bounded segment.
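The processor counts in the examples above can be checked with a small model. This is only a sketch of the arithmetic, not the tool's actual code: the function name `plan_processors` and its parameters are hypothetical, and it assumes that N thresholds bound N-1 consecutive segments, without saying which end of each range is inclusive.

```python
from itertools import product


def plan_processors(shards, thresholds=None, segmentation_by_shards=False):
    """Return one (shard_group, bounds) pair per querying processor.

    Hypothetical model of the behaviour described in this PR,
    not the tool's implementation.
    """
    shards = list(shards)
    # With segmentation by shards, every shard gets its own group;
    # otherwise a single group covers all listed shards.
    if segmentation_by_shards:
        shard_groups = [(s,) for s in shards]
    else:
        shard_groups = [tuple(shards)]

    # N thresholds bound N-1 consecutive segments. Without at least two
    # thresholds there is one unbounded segment, represented as None.
    if thresholds and len(thresholds) >= 2:
        segments = list(zip(thresholds, thresholds[1:]))
    else:
        segments = [None]

    # One querying processor per (shard group, segment) combination.
    return list(product(shard_groups, segments))


if __name__ == "__main__":
    thresholds = [0.0, 0.5, 0.59, 0.6, 0.7, 0.9, 1.0]
    print(len(plan_processors(range(5), segmentation_by_shards=True)))
    print(len(plan_processors([0, 1, 2])))
    print(len(plan_processors(range(5), thresholds, segmentation_by_shards=True)))
```

Under these assumptions the three calls correspond to the three command lines: per-shard processors, a single processor restricted to the listed shards, and the product of shards and threshold segments.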