Added support for segmentation by index shards #10

romanmar1 · 2015-12-15T07:29:54Z

Segmentation by shards is applied on top of already existing segmentation instructions if present, or if not present, operates as a specialized segmentation.

e.g: -s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2,3,4 -segmentationByShards
This will create 5 querying processors for each mentioned shard

-s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2
This will create a single query processor that will query only shards 0, 1, 2.

-s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2

-s http://localhost:9300/index1/type1 -t http://localhost:9300/index2/type2 -sc clusterName -tc clusterName -shards 0,1,2,3,4 -segmentationByShards -segmentationField rate.newCoolness -segmentationThresholds 0.0,0.5,0.59,0.6,0.7,0.9,1.0
This will create 30 querying processors (num threshold segments * num shards). Every shards will be queries by 6 querying processor for each bounded segment

Segmentation by shards is applied on top of already existing segmentation instructions if present, or if not present, operates as a specialized segmentation.

awislowski · 2015-12-16T07:30:17Z

First of all thanks for the contribution.
Could you give me a reason for this improvement.
How does it help? Did you do performance tests with this feature and without? How big was your index?

romanmar1 · 2015-12-16T18:40:01Z

The ability to target specific shards can be beneficial in the following scenarios:

index routing - When the index documents have specific routing parameters, yet they are either unknown, not present in the documents themselves, or too many to count. In such scenarios, you can target specific shards, knowing that all documents in those shards are related in some way.
partial reinexing - sometimes it's necessary to reindex a portion of an existing index for faster drill down scenarios and/or testing purposes. In some cases, it might be difficult to find a suitable double or string range that meets your needs, but a shard, which is usually a rather balanced component of an index can be exactly what you need.
full reindexing - It can be easier sometimes to utilize the full capacity of the cluster when reindexing whole indices by segmenting by shards, because shards tend to be well balanced components of the index, eliminating the need to try and find a suitable collection of bounded segments.
Faster reindexing - It's possible to avoid the network transfer of the scan scroll portion, if all reindexing tools are located locally on every node and targeting only their local shards for reindexing, which should result in faster operations.

I did not yet run any performance tests on this feature, but I plan to do so in the following week (mostly test shard segmentation feature, not without, unless i'll have a reason to suspect it hurts performance, which i will be monitoring and comparing to known baselines that I have)
My Indices, mostly, consists of 500-1500 million small to medium documents.

awislowski · 2015-12-18T17:03:12Z

So I will wait for your performance tests as your reasoning makes sense.
In addition to this I will ask you to reformat pushed files, where only formatting lines have changed, as well as remove spare imports.

romanmar1 · 2015-12-19T07:27:21Z

No problem. I plan to add additional contributions, so it might be easier if there was some kind of performance testing suite available for the tool. How did you initially test performance for this?

awislowski · 2015-12-22T14:09:03Z

I do not have performance tests for the tool. All tests are very dependant on the index size, type of storage, cluster size, index mapping. For me it would be enough if you tell how faster was reindex using new feature on your index, than without this feature.
For example my initial reindex with segmentation on my production index took 17 min and without segmentation was 45 min.

awislowski · 2016-02-15T17:28:17Z

Hi @romanmar1,

Do you have any results of using this change on your production environments?

romanmar1 · 2016-02-15T19:01:14Z

I ran the tool using those settings on our production environment a couple of times on an index containing roughly 600 million documents of small size (6-8 not analyzed string fields) and was able to reindex them in roughly 30 - 40 minutes.

Compared to our baselines ingestion rates, that was a considerable improvement, BUT, its more like comparing apples and oranges, because there are too many different factors at play here to compare properly (Our baseline indexing process was hadoop based, this tool is not, etc').

To properly test this, I should run the tool without the shard segmentation settings and compare it to a run with shard segmentation settings.

awislowski · 2016-02-20T15:57:48Z

Thanks @romanmar1,

If you could run the tool without the shard segmentation settings and compare it to a run with shard segmentation settings so I can decide if this PR is a reasonable performance improvement.

Added support for segmentation by index shards.

d9bedb4

Segmentation by shards is applied on top of already existing segmentation instructions if present, or if not present, operates as a specialized segmentation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Added support for segmentation by index shards #10

Added support for segmentation by index shards #10

Uh oh!

romanmar1 commented Dec 15, 2015

Uh oh!

awislowski commented Dec 16, 2015

Uh oh!

romanmar1 commented Dec 16, 2015

Uh oh!

awislowski commented Dec 18, 2015

Uh oh!

romanmar1 commented Dec 19, 2015

Uh oh!

awislowski commented Dec 22, 2015

Uh oh!

awislowski commented Feb 15, 2016

Uh oh!

romanmar1 commented Feb 15, 2016

Uh oh!

awislowski commented Feb 20, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Added support for segmentation by index shards #10

Are you sure you want to change the base?

Added support for segmentation by index shards #10

Uh oh!

Conversation

romanmar1 commented Dec 15, 2015

Uh oh!

awislowski commented Dec 16, 2015

Uh oh!

romanmar1 commented Dec 16, 2015

Uh oh!

awislowski commented Dec 18, 2015

Uh oh!

romanmar1 commented Dec 19, 2015

Uh oh!

awislowski commented Dec 22, 2015

Uh oh!

awislowski commented Feb 15, 2016

Uh oh!

romanmar1 commented Feb 15, 2016

Uh oh!

awislowski commented Feb 20, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants