fix delta&rho calculation for random server selection mode #23
Conversation
Signed-off-by: Taewoong Kim <taewoong.kim@sk.com>
Thank you for identifying an important issue with either the algorithm or this implementation. I want to make sure I fully understand the choices you made and their effects.

You allow delta (and rho) to be zero. When that happens, the tags for the request on the servers don't advance. Why is this not a problem?

In your first graph you show server 2's rho values continuously diverging from those of the others. Initially server 3 also diverges, but then it seems to recover. What do you attribute these different outcomes to?

Certainly the algorithm as described in section 3.2.1 of the original paper (Gulati et al. 2010) is confusing. It says that rho and delta should be computed based on responses from the other servers. It also says that in the single-server case, rho and delta would always be one. Yet that does not seem to adequately account for in-flight operations: if multiple requests go out before any responses come in, it would seem that the calculated rho and delta values would be zero.
I'm sorry for the late reply. I will interleave comments on your questions below.
Allowing rho and delta to be zero is problematic. The reservation and limit tags have strong semantics, and this change to rho & delta would break those semantics. Let me explain in more detail.

The tags given to a request have clear meanings. A reservation tag represents the time by which a request should be handled to remain within its reservation. Likewise, a limit tag represents the earliest time a request may be handled so as not to exceed the limit. These tags should advance with every request. In the case of one server receiving all requests, they should advance by the inverse of the reservation or limit values. In the case of multiple servers, they should advance by a multiple of that inverse, based on the level of service the other servers have provided. When rho and delta become zero, they prevent this type of advancement and the tags lose their semantics. This is such a drastic change that it needs a very compelling argument, and the fact that it seems to work better under certain circumstances does not meet that bar.
```cpp
} else {
  Counter delta =
      delta_counter - it->second.delta_prev_req;
#ifdef USE_SEPARATED_RHO_CAL
```
Can you explain the motivation for this change? I didn't understand from your write-up.
This removes the subtraction of my_rho/my_delta, which induces a vicious cycle. In random mode, many requests can sometimes go to one specific server purely by chance. The vicious cycle is as follows:

1. Many requests go to one specific server.
2. Many acks are returned from that server.
3. The my_rho/my_delta values for that server increase.
4. The rho/delta values sent to that server are decreased by the subtraction of the large my_rho/my_delta.
5. The server receives a small rho/delta.
6. The server services more of the client's requests because of the client's small rho/delta values.
7. Go to step 2 again.

As a result, there will be a bias in rho/delta values between clients.
I understand the my_rho/my_delta change. I do not understand the USE_SEPARATED_RHO_CAL change. What benefit did it provide?
Did you see this question?
As in the example in this article, a client with a relatively small proportional weight tends to get more shares than it is supposed to. Consider the scenario below, which can happen in random mode. (Assume Client 0's proportional IOPS setting is lower than its reservation IOPS setting, and Client 1's proportional IOPS setting is much higher than Client 0's. So Client 0 should get all of its share in the reservation phase.)

1. Irregular I/O dispatch occurs on Client 1 for a short period; for example, Client 1 rarely sends I/O requests to one server for a while.
2. Client 1's queue on that server becomes empty.
3. The server serves Client 0's requests in the proportional phase, because Client 1's queue is empty but Client 0's is not.
4. The server sends an ack of the proportional phase to Client 0.
5. For Client 0, the delta value is increased but the rho value is not.
6. The reservation is not affected, because the rho value did not change: other servers will service the rest of the reserved IOPS in the reservation phase, since rho was not increased.
7. As a result, Client 0 gets extra IOPS through proportional scheduling, while Client 1 cannot get its proportional share, because resources are used for Client 0's extra IOPS instead of Client 1's proportional share.

However, if we make the rho value equal to the delta value, rho will be increased even though the request is processed in the proportional phase. I/O requests serviced in the proportional phase will then also be counted as reserved IOPS. The increased rho value is a kind of feedback saying that the client already got some service, so it doesn't need to be serviced as much next time. The increased rho value prevents serving extra IOPS.
This PR consists of two commits. I think it would be better to separate them into two PRs: the two commits do not have a strong relationship and are orthogonal, so handling them as two separate problems seems better.
If the tag value does not advance forever, or for a long time, it is a big problem, as you mentioned. However, this only happens occasionally, when no reply arrived between consecutive request dispatches to the same server. I understand your concern, but the important point is that we need to change the rho/delta counting method for a better result.
So we agree that your change does now allow rho or delta to become zero, because dmclock_server.h now allows the zero values. What is your objection to leaving the "1 +" in the code? Or to ask it the other way, what benefit did you get by removing the "1 +"?
The "+1" makes rho/delta wrong, especially when there is a small number of servers. rho/delta represent the ratio of the reserved IOPS that each server is responsible for. In the ideal case, where all requests are distributed to all servers evenly, rho/delta must equal the total number of dmclock servers. If we leave the "+1" term, the average becomes "1 + the number of dmclock servers". That causes each server to serve reserved IOPS/(1 + N servers), so a client will get only N/(1+N) × its reserved IOPS.
I'm going to look at your argument more carefully. In the meantime you might want to look at https://github.com/ceph/dmclock/commits/wip-delta-rho-plugin, in which I parameterize the class that tracks rho and delta values, to allow for alternate implementations/algorithms.
Yes, I'll check the commit~
I believe I understand the point you're making about the "1 +". Can you revisit my working branch (https://github.com/ceph/dmclock/commits/wip-delta-rho-plugin), take a look at the BorrowingTracker, and let me know what you think? Would you find that acceptable? Can you either a) test it, or b) outline your tests to me so that I can test it? Then we can compare the results. Thank you!
I reviewed the branch and I think it is acceptable to me. In that branch, the total rho & delta values are kept and the "1+" term is removed at the same time. Below is my test result for BorrowingTracker.
Thank you for testing it. I'll be curious to learn what your analysis finds.

Were you able to figure out why you weren't getting the results you were expecting? I'm very curious!
I have been busy for a while. The problem occurs only with random selection.
@TaewoongKim Thank you for looking into that. Would you be okay with the borrowing tracker becoming the default tracker?
I think that's no problem; the borrowing tracker is more advanced than the original tracker.
I recently tested dmc_sim with the random server selection option enabled, because that is more similar to a real Ceph environment. Unfortunately, the result was not stable. There were sometimes biases in server selection, and those produced wrong results. Once a bias occurred, it was never recovered from; it persisted until the end of the simulation. I think the current code needs more resiliency.

The main reason for the lack of resiliency is the calculation of the rho/delta values. In the current code, my_rho/my_delta are subtracted from the rho/delta values. If a large rho is calculated once and sent to a server, that server delays I/O requests for that client. As a result, the other servers will compute small rho values, because that server sends replies less frequently. But the server that received the small rho doesn't notice, because replies from it are cancelled out by the subtraction of my_rho (and vice versa for the case of a small rho value). The conclusion is that once dmclock gets a biased rho/delta value, the bias is kept forever.

I attached a test configuration and results below. I then changed the rho/delta calculation code progressively and tested again for a better result.
Test configuration
Test result of master branch
As you can see, client 0 got more share than its reservation parameter (250 IOPS), because there was a bias in server selection with the random server select option. One of the servers holds a large rho value for client 0 at some point (server 2 in the case above). Server 2 then processes client 1's requests more frequently than it is supposed to, because client 0 doesn't use its resources due to the large rho. After a while, client 1's queue on server 2 becomes empty, and client 0's requests are served in the proportional phase because there are no other clients' requests. Even though the other servers process the whole reserved share, server 2 processes an additional share for client 0 in the proportional phase. As a result, client 0 gets more shares than it is supposed to, and the correct sharing is broken.
Below is a graph of the accumulated rho value of client 0 on each server. You can see it diverge as time goes on.

To reinforce the resiliency, I first fixed the rho/delta calculation in the first commit (0cdb1b7): I removed my_rho/my_delta. The result is below.
The rho value divergence disappeared and the result became better than with the original code. But it was not enough: client 0 got almost double the IOPS of its parameter. I dug further and found another problem.
The other problem was that the rho calculation ignored replies from the proportional phase. With random server selection, some requests can be handled in the proportional phase on some server, because a short-term bias can leave another client's queue empty. In that case, those requests are handled in the proportional phase and the client gets more share than it is supposed to. Therefore, replies from the proportional phase must not be ignored; they must be compensated for by reducing the reservation service for the client that got the proportional share. If we count proportional-phase replies in the rho calculation, rho ends up equal to delta. So I set the rho value to the delta value and tested (e2a6420). The result is below.
I think it's quite an acceptable number. I tested other configurations with random server selection mode too, and the results were better with this code. I think the rho variable should be removed and replaced with the delta variable, although in this PR I just set rho to delta for convenience. I need your opinion before completing the removal of the rho variable.
And bspark's PR (#6) was effective for this configuration too. I think reducing the rho value further decreases the proportional phase, which helps keep requests that exceed the reservation parameter from being handled additionally. I will leave a comment about that on the PR page at some point.