@romanmar1

Segmentation by shards is applied on top of already existing segmentation instructions if present, or if not present, operates as a specialized segmentation.

For example: -s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2,3,4 -segmentationByShards
This will create 5 query processors, one for each listed shard.

-s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2
This will create a single query processor that will query only shards 0, 1, 2.

-s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2,3,4 -segmentationByShards -segmentationField rate.newCoolness -segmentationThresholds 0.0,0.5,0.59,0.6,0.7,0.9,1.0
This will create 30 query processors (number of threshold segments * number of shards): the 7 thresholds bound 6 segments, and each of the 5 shards will be queried by 6 query processors, one per bounded segment.
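
The processor counts in the examples above can be reproduced with a small sketch. It is illustrative Python, not the tool's actual code; the function name `plan_processors` and its parameters are hypothetical and only mirror the `-shards`, `-segmentationThresholds` and `-segmentationByShards` options.

```python
from itertools import product


def plan_processors(shards, thresholds=None, segment_by_shards=False):
    """Return one (shard_group, value_range) pair per query processor.

    Hypothetical helper: it models the counting described above,
    not the tool's real implementation.
    """
    # n consecutive thresholds bound n - 1 segments, e.g. (0.0, 0.5), (0.5, 0.59), ...
    if thresholds:
        ranges = list(zip(thresholds, thresholds[1:]))
    else:
        ranges = [None]  # no field segmentation: a single unbounded range

    # With segmentation by shards every shard is its own group;
    # otherwise all listed shards are queried together by one processor.
    if segment_by_shards:
        groups = [(shard,) for shard in shards]
    else:
        groups = [tuple(shards)]

    return list(product(groups, ranges))


thresholds = [0.0, 0.5, 0.59, 0.6, 0.7, 0.9, 1.0]
plan = plan_processors([0, 1, 2, 3, 4], thresholds, segment_by_shards=True)
print(len(plan))  # 5 shards * 6 segments
```

Calling `plan_processors([0, 1, 2])` without the shard segmentation flag yields a single entry covering shards 0, 1 and 2, which matches the second example.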

@awislowski
Contributor

First of all, thanks for the contribution.
Could you give me a reason for this improvement?
How does it help? Did you do performance tests with this feature and without? How big was your index?

@romanmar1
Author

The ability to target specific shards can be beneficial in the following scenarios:

  1. Index routing - When the index documents have specific routing parameters, yet they are either unknown, not present in the documents themselves, or too many to count. In such scenarios, you can target specific shards, knowing that all documents in those shards are related in some way.
  2. Partial reindexing - Sometimes it's necessary to reindex a portion of an existing index for faster drill-down scenarios and/or testing purposes. In some cases, it might be difficult to find a suitable double or string range that meets your needs, but a shard, which is usually a rather balanced component of an index, can be exactly what you need.
  3. Full reindexing - It can sometimes be easier to utilize the full capacity of the cluster when reindexing whole indices by segmenting by shards, because shards tend to be well-balanced components of the index, eliminating the need to find a suitable collection of bounded segments.
  4. Faster reindexing - It's possible to avoid the network transfer of the scan/scroll portion if a reindexing tool runs locally on every node and targets only that node's local shards, which should result in faster operations.
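
As a rough illustration of what targeting specific shards looks like at the query level: Elasticsearch's search API accepts a `preference` value of the form `_shards:0,1,2` that restricts a search to the listed shards. The sketch below only builds that string. The helper name `shard_preference` is hypothetical, and whether this pull request relies on the same mechanism internally is an assumption.

```python
def shard_preference(shards):
    """Build a search preference value such as '_shards:0,1,2'.

    Hypothetical helper for illustration; it does not talk to a cluster.
    """
    if not shards:
        raise ValueError("at least one shard number is required")
    # Deduplicate and sort so the same shard set always gives the same string.
    unique_shards = sorted(set(shards))
    return "_shards:" + ",".join(str(shard) for shard in unique_shards)


print(shard_preference([2, 0, 1]))
```

The resulting value would be passed as the `preference` parameter of a search or scan/scroll request, one request per shard group.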

I have not yet run any performance tests on this feature, but I plan to do so in the following week (mostly testing with the shard segmentation feature, not without it, unless I have a reason to suspect it hurts performance, which I will be monitoring and comparing to known baselines that I have).
My indices mostly consist of 500-1500 million small to medium documents.

@awislowski
Contributor

So I will wait for your performance tests as your reasoning makes sense.
In addition, I will ask you to revert the pushed files where only formatting has changed, and to remove unused imports.

@romanmar1
Author

No problem. I plan to add additional contributions, so it might be easier if there was some kind of performance testing suite available for the tool. How did you initially test performance for this?

@awislowski
Contributor

I do not have performance tests for the tool. All tests are very dependent on the index size, type of storage, cluster size, and index mapping. For me it would be enough if you told me how much faster the reindex was on your index with the new feature than without it.
For example, my initial reindex with segmentation on my production index took 17 min, and without segmentation it took 45 min.

@awislowski
Contributor

Hi @romanmar1,

Do you have any results of using this change on your production environments?

@romanmar1
Author

I ran the tool using those settings on our production environment a couple of times on an index containing roughly 600 million documents of small size (6-8 not analyzed string fields) and was able to reindex them in roughly 30 - 40 minutes.

Compared to our baseline ingestion rates, that was a considerable improvement, but it's more like comparing apples and oranges, because there are too many different factors at play to compare properly (our baseline indexing process was Hadoop-based, this tool is not, etc.).

To properly test this, I should run the tool without the shard segmentation settings and compare it to a run with shard segmentation settings.

@awislowski
Contributor

Thanks @romanmar1,

Could you run the tool without the shard segmentation settings and compare it to a run with them, so I can decide whether this PR is a reasonable performance improvement?
