
Conversation

@TaewoongKim
Collaborator

I recently tested dmc_sim with the random server selection option enabled, because that is closer to a real Ceph environment. Unfortunately, the results are not stable. Server selection was sometimes biased, and that produced incorrect results. Once a bias occurred, it was never corrected and persisted until the end of the simulation. I think the current code needs more resiliency. The main reason for the lack of resiliency is the calculation of the rho/delta values. In the current code, the rho/delta values have my_rho/my_delta subtracted from them. If a large rho is calculated once and sent to a server, that server delays I/O requests for that client. As a result, the other servers then receive small rho values, because that server sends replies less frequently. But the server that received the small rho does not compensate for this, because replies from that server are excluded by the my_rho subtraction (and vice versa for a small rho value). The conclusion is that once dmclock gets a biased rho/delta value, it keeps it forever. I attached a test configuration and results below. I changed the rho/delta calculation code in stages and retested for better results.
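
To make the feedback loop concrete, here is a much simplified sketch of the per-server parameter calculation as I understand it (the names are illustrative only, not the exact dmclock code):

#include <cstdint>
#include <utility>

// delta_counter / rho_counter are client-wide counts of completed responses;
// my_delta / my_rho count the responses that came from one particular server
// since the last request was sent to it.
struct PerServer {
  uint64_t delta_prev_req = 0;  // client-wide delta_counter at the previous request
  uint64_t rho_prev_req = 0;    // client-wide rho_counter at the previous request
  uint64_t my_delta = 0;        // responses from this server since then
  uint64_t my_rho = 0;          // reservation-phase responses from this server since then
};

// Parameters attached to the next request sent to this server.
std::pair<uint64_t, uint64_t>
calc_params(uint64_t delta_counter, uint64_t rho_counter, PerServer& s) {
  // The my_delta / my_rho subtraction is the issue: responses produced by this
  // server itself are excluded, so a server that already favors this client
  // keeps receiving small rho/delta values and keeps favoring it.
  uint64_t delta = delta_counter - s.delta_prev_req - s.my_delta;
  uint64_t rho   = rho_counter   - s.rho_prev_req   - s.my_rho;
  s.delta_prev_req = delta_counter;
  s.rho_prev_req   = rho_counter;
  s.my_delta = 0;
  s.my_rho   = 0;
  return {delta, rho};
}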

Test configuration

[global]
server_groups = 1
client_groups = 2
server_random_selection = true
server_soft_limit = false

[client.0]
client_count = 1
client_wait = 0
client_total_ops = 10000
client_server_select_range = 4
client_iops_goal = 1000
client_outstanding_ops = 32
client_reservation = 250.0
client_limit = 5000.0
client_weight = 10.0

[client.1]
client_count = 1
client_wait = 0
client_total_ops = 35000
client_server_select_range = 4
client_iops_goal = 3000
client_outstanding_ops = 32
client_reservation = 0.0
client_limit = 10000.0
client_weight = 100.0

[server.0]
server_count = 4
server_iops = 400
server_threads = 1

Test result of master branch

==== Client Data ====
     client:       0       1  total
        t_0:  460.50 1082.00 1542.50
        t_1:  631.00  896.50 1527.50
        t_2:  882.50  600.00 1482.50
        t_3:  613.50  899.50 1513.00
        t_4:  596.00  923.00 1519.00
        t_5:  651.50  870.00 1521.50
        t_6:  666.00  827.50 1493.50
        t_7:  496.50 1021.50 1518.00
        t_8:    2.50 1485.00 1487.50
        t_9:    0.00 1489.00 1489.00
       t_10:    0.00 1487.00 1487.00
       t_11:    0.00 1479.00 1479.00
       t_12:    0.00 1504.50 1504.50
       t_13:    0.00 1459.00 1459.00
       t_14:    0.00 1476.50 1476.50
       t_15:    0.00    0.00    0.00
    res_ops:    3967       0    3967
   prop_ops:    6033   35000   41033
total time to track responses: 33906243 nanoseconds;
    count: 45000;
    average: 753.47 nanoseconds per request/response
total time to get request parameters: 142394147 nanoseconds;
    count: 45000;
    average: 3164.31 nanoseconds per request/response

client timing for QOS algorithm: 3917.79 nanoseconds per request/response

==== Server Data ====
     server:       0       1       2       3   total
    res_ops:    1463    1488      **52**     964    3967
   prop_ops:    9888    9799   **10994**   10352   41033

As you can see, client 0 got a larger share than its reservation parameter (250 IOPS), because server selection was biased under the random server selection option. At some point one of the servers ends up with a large rho value for client 0 (server 2 in the case above). Server 2 then processes client 1's requests more often than it is supposed to, because client 0 does not use its share there due to the large rho. After a while, client 1's queue on server 2 becomes empty and client 0's requests are served in the proportional phase, since no other client's requests are queued. Even though the other servers already provide the entire reserved share, server 2 gives client 0 an additional share through the proportional phase. As a result, client 0 gets more than its intended share and the correct sharing is broken.

Below is a graph of the cumulative rho value of client 0 for each server. You can see it diverge as time goes on.
[image: cumulative rho of client 0 per server]

To reinforce resiliency, I fixed the rho/delta calculation in the first commit (0cdb1b7): I removed the my_rho/my_delta subtraction. The result is below.

==== Client Data ====
     client:       0       1  total
        t_0:  497.50 1013.50 1511.00
        t_1:  456.00 1031.50 1487.50
        t_2:  469.50 1015.50 1485.00
        t_3:  457.00 1052.50 1509.50
        t_4:  466.00 1025.00 1491.00
        t_5:  465.00 1014.50 1479.50
        t_6:  519.00 1022.50 1541.50
        t_7:  487.00 1035.00 1522.00
        t_8:  402.50 1079.50 1482.00
        t_9:  427.00 1111.50 1538.50
       t_10:  353.50 1159.50 1513.00
       t_11:    0.00 1508.50 1508.50
       t_12:    0.00 1502.50 1502.50
       t_13:    0.00 1504.00 1504.00
       t_14:    0.00 1424.50 1424.50
       t_15:    0.00    0.00    0.00
    res_ops:    6510       0    6510
   prop_ops:    3490   35000   38490
total time to track responses: 35112316 nanoseconds;
    count: 45000;
    average: 780.27 nanoseconds per request/response
total time to get request parameters: 142748181 nanoseconds;
    count: 45000;
    average: 3172.18 nanoseconds per request/response

client timing for QOS algorithm: 3952.46 nanoseconds per request/response

==== Server Data ====
     server:       0       1       2       3   total
    res_ops:    1675    1487    1702    1646    6510
   prop_ops:    9625    9690    9517    9658   38490

[image: cumulative rho of client 0 per server, after the first fix]

The rho-value divergence disappeared and the result is better than with the original code. But it is not enough: client 0 got almost double the IOPS of its parameter. I dug further and found another problem.

The other problem is that the rho calculation ignores replies from the proportional phase. With random server selection, some requests can be handled in the proportional phase on some server, because a short-term bias can leave another client's queue empty there. In that case those requests are handled in the proportional phase, and the client gets more share than it is supposed to. Therefore, replies from the proportional phase must not be ignored; counting them compensates by reducing the reservation service for a client that already received a proportional share. If we count proportional-phase replies in the rho calculation, rho ends up equal to delta. So I simply set rho to delta and tested again (e2a6420).
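
As a rough illustration (the names are illustrative, not the actual tracker code), the response-side counting effectively changes like this:

#include <cstdint>

struct ResponseCounters {
  uint64_t delta_counter = 0;  // every completed response
  uint64_t rho_counter = 0;    // responses counted toward the reservation

  void track_response(bool served_by_reservation) {
    ++delta_counter;
    // before: only reservation-phase replies advanced rho
    //   if (served_by_reservation) ++rho_counter;
    // this commit: every reply advances rho, i.e. rho follows delta
    (void)served_by_reservation;  // no longer examined in this variant
    ++rho_counter;
  }
};

The result is below.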

==== Client Data ====
     client:       0       1  total
        t_0:  327.00 1216.50 1543.50
        t_1:  313.00 1204.00 1517.00
        t_2:  318.00 1190.50 1508.50
        t_3:  314.50 1190.50 1505.00
        t_4:  322.00 1183.00 1505.00
        t_5:  299.00 1172.00 1471.00
        t_6:  326.50 1148.00 1474.50
        t_7:  301.50 1187.00 1488.50
        t_8:  343.00 1181.50 1524.50
        t_9:  296.00 1222.00 1518.00
       t_10:  303.00 1174.50 1477.50
       t_11:  323.00 1211.00 1534.00
       t_12:  288.50 1248.00 1536.50
       t_13:  230.50 1299.50 1530.00
       t_14:  595.50  672.00 1267.50
       t_15:   99.00    0.00   99.00
       t_16:    0.00    0.00    0.00
    res_ops:    4345       0    4345
   prop_ops:    5655   35000   40655

I think these are quite acceptable numbers. I also tested other configurations with random server selection mode, and the results were better with this code. I think the rho variable should be removed and replaced with the delta variable, although in this PR I just set rho to delta for convenience. I need your opinion before completing the removal of the rho variable.

bspark's PR (#6) was effective for this configuration too. I think further reducing the rho value decreases the proportional phase, which helps keep requests from being handled in excess of the reservation parameter. I will leave a comment about that on the PR page sometime.

Signed-off-by: Taewoong Kim <taewoong.kim@sk.com>
Signed-off-by: Taewoong Kim <taewoong.kim@sk.com>
@ivancich
Member

ivancich commented Mar 14, 2017

Thank you for identifying an important issue with either the algorithm or this implementation. I want to make sure I fully understand the choices you made and their effects.

You allow delta (and rho) to be zero. When that happens, the tags for the request on the servers don't advance. Why is this not a problem?

In your first graph you show server 2's rho values continuously diverging from those of the others. Initially server 3 also diverges, but then it seems to recover. What do you attribute these different outcomes to?

Certainly the algorithm as described in section 3.2.1 of the original paper (Gulati et al. 2010) is confusing. It says that rho and delta should be computed based on responses from the other servers. And it also says that in the single-server case, rho and delta would always be one. Yet that does not seem to adequately account for in-flight operations. If multiple requests go out before any responses come in, it would seem that the calculated rho and delta values would be zero.

@TaewoongKim
Collaborator Author

TaewoongKim commented Mar 27, 2017

I'm sorry for the late reply. I will interleave my comments below.

You allow delta (and rho) to be zero. When that happens, the tags for the request on the servers don't advance. Why is this not a problem?
--> I think zero just means the highest priority. A client whose tags were computed with zero will get more I/O service than other clients. As a result, the client whose delta/rho was zero will receive more replies, and its rho/delta will increase soon. As long as there is no coding bug (like division by zero), there should be no problem, but I think the modifications need enough testing to verify this.

In your first graph you show server 2's rho values continuously diverging from those of the others. Initially server 3 also diverges, but then it seems to recover. What do you attribute these different outcomes to?
--> I should correct what I said previously: "if dmclock got biased rho/delta value once it will be kept forever" -> "if dmclock gets a biased rho/delta value once, it will take a long time to recover". Recovery requires another server-selection bias that compensates for the earlier rho divergence, and that takes time and some probability.

Certainly the algorithm as described in section 3.2.1 of the original paper (Gulati et al. 2010) is confusing. It says, that rho and delta should be computed based on responses from the other servers.
--> The paper says, "Here δi denotes number of IO requests from VM vi that have completed service at all the servers". In my understanding, the count does not need to come only from the other servers.

And it also says in the single server case, rho and delta would always be one.
--> I think it means dmclock needs special handling only in the single-server case. Adding some code that forces rho/delta to one is enough to handle the single-server use case as an exception.

Yet that does not seem to adequately account for in-flight operations. If multiple requests go out before any responses come in, it would seem that the calculated rho and delta values would be zero.
--> You are right, but that happens only at the very beginning. I don't think it's a big problem.

@ivancich
Member

ivancich commented Aug 7, 2017

Allowing rho and delta to be zero is problematic. The tags for reservations and limits have strong semantics and this change to rho & delta would break those semantics. Let me explain in more detail.

The tags given to a request have clear meanings. A reservation tag represents the time by which a request should be handled to stay within the reservation. Likewise, a limit tag represents the earliest time a request should be handled so as not to exceed the limit. These tags should advance with every request. In the case of one server receiving all requests, they should advance by the inverse of the reservation or limit values. In the case of multiple servers, they should advance by a multiple of that inverse, based on the level of service the other servers have provided.
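
For reference, the per-request tag assignments from the paper (section 3.2.1) are roughly (up to notation):

  R_i^r = max(R_i^{r-1} + rho_i / r_i, t)      -- reservation tag
  L_i^r = max(L_i^{r-1} + delta_i / l_i, t)    -- limit tag
  P_i^r = max(P_i^{r-1} + delta_i / w_i, t)    -- weight (proportional) tag

where r_i, l_i, and w_i are client i's reservation, limit, and weight, and t is the arrival time of request r.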

When rho and delta become zero, they prevent this type of advancement and the tags lose their semantics. This is such a drastic change that it needs a very compelling argument, and the fact that it seems to work better under certain circumstances does not meet that bar.

} else {
  Counter delta =
    delta_counter - it->second.delta_prev_req;
#ifdef USE_SEPARATED_RHO_CAL
Member

Can you explain the motivation for this change? I didn't understand from your write-up.

Collaborator Author

This removes the subtraction of my_rho/my_delta, which induces a vicious cycle.
In random mode, many requests can sometimes go to one specific server simply by chance.
The vicious cycle is as follows.

  1. many requests go to one specific server
  2. many acks come back from that server
  3. the my_rho/my_delta values for that server increase
  4. the rho/delta values sent to that server are decreased by subtracting the large my_rho/my_delta
  5. the server receives small rho/delta values
  6. the server serves more of the client's requests because of the client's small rho/delta values
  7. go to step 2 again

As a result, the rho/delta values become biased between clients.

Member

I understand the my_rho/my_delta change. I do not understand the USE_SEPARATED_RHO_CAL change. What benefit did it provide?

Member

Did you see this question?

Collaborator Author

As in the example in this write-up, a client with a relatively small proportional weight tends to get a larger share than it is supposed to. Consider the scenario below, which can happen in random mode.
(Assume client 0's proportional IOPS setting is lower than its reservation IOPS setting, and client 1's proportional IOPS setting is much higher than client 0's, so client 0 should receive all of its share in the reservation phase.)

  1. Irregular I/O dispatch may occur on client 1 for a short period of time.
    For example, client 1 rarely sends I/O requests to one server for a while.
  2. Client 1's queue on that server becomes empty.
  3. The server serves client 0's requests in the proportional phase because client 1's queue is empty but client 0's queue is not.
  4. The server sends proportional-phase acks to client 0.
  5. For client 0, only the delta value increases; the rho value does not.
  6. The reservation is not affected because the rho value is unchanged; the other servers still serve the full reserved IOPS in the reservation phase because rho has not increased.
  7. As a result, client 0 gets extra IOPS through proportional scheduling, while client 1 cannot get its full proportional share because resources are spent on client 0's extra IOPS instead of client 1's proportional share.

However, if we set rho equal to delta, rho increases even when a request is processed in the proportional phase. I/O requests served in the proportional phase are then also counted as reserved IOPS. The increased rho value is a kind of feedback saying the client has already received some service and does not need as much next time, which prevents serving extra IOPS.

This PR consists of two commits. I think it would be better to split them into two PRs; since the two commits are not strongly related and are orthogonal, handling them as two separate problems would be easier.

@TaewoongKim
Collaborator Author

If a tag value never advances, or does not advance for a long time, that is a big problem as you mentioned. However, this only happens occasionally, when no reply arrived between two request dispatches to the same server.
Actually, I realized that zero rho & delta values have no effect in the current code, because zero values are ignored during tag calculation. (https://github.com/TaewoongKim/dmclock/blob/d1a73a86ce6063c4d1af967a41f8a92e0608cc0c/src/dmclock_server.h#L200-L202)
I also tested without that zero check, and the result was similar.

I understand your concern, but the important point is that we need to change the rho/delta counting method to get a better result.
What do you think about adding some exception handling for zero rho & delta?
If the calculated rho & delta is zero, it would be set to one. That would prevent the possibility of starvation or monopoly caused by tags not advancing, and would also prevent rho/delta bias for some clients.
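A minimal sketch of what I have in mind (hypothetical, not part of this PR yet):

#include <cstdint>

// Never send a zero rho/delta: if no reply arrived between two requests to
// the same server, report 1 so the server's tags still advance.
uint64_t nonzero_increment(uint64_t counter_now, uint64_t counter_prev) {
  uint64_t diff = counter_now - counter_prev;
  return diff == 0 ? 1 : diff;
}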

@ivancich
Member

So we agree that your change did not actually allow rho or delta to become zero, because dmclock_server.h does not let zero values take effect. What is your objection to leaving the "1 +" in the code? Or to ask it the other way, what benefit did you get by removing the "1 +"?

@TaewoongKim
Collaborator Author

TaewoongKim commented Aug 16, 2017

"+1" makes rho/delta be wrong especially if there are a small number of servers. rho/delta mean the ratio how many portions of reserved IOPS are responsible for each server. In the ideal case where all requests are distributed to all servers evenly, the rho/delta must be same with the total number of dmclock servers. If we leave the "+1" term, the average will be "1+ the number of dmclock servers". It causes that each server serves "reserved IOPS/(1+ N servers)" and one specific client will get N/(1+N) x reserved IOPS.

@ivancich
Member

I'm going to look at your argument more carefully. In the meantime you might want to look at https://github.com/ceph/dmclock/commits/wip-delta-rho-plugin, in which I parameterize the class that tracks rho and delta values, to allow for alternate implementations/algorithms.

@TaewoongKim
Collaborator Author

Yes, I'll check the commit~

@ivancich
Member

I believe I understand the point you're making about the "1 +". Can you revisit my working branch (https://github.com/ceph/dmclock/commits/wip-delta-rho-plugin) and take a look at the BorrowingTracker and let me know what you think? Would you find that acceptable? Can you either a) test it, or b) outline to me your tests so that I can test it? Then we can compare the results.

Thank you!

@TaewoongKim
Collaborator Author

I reviewed the branch, and I think it is acceptable to me. In the branch, the total rho & delta values are kept and the "1 +" term is removed at the same time. Below is my test result for the BorrowingTracker.
The result improved a little over the OriginalTracker's result, and was similar to the result of the first commit of this PR (be763c4). We need to improve the tracker further for a more acceptable result because, in the test below, client 0's IOPS is much higher than its configuration. Client 0 should get about 250 IOPS because of its low weight. I will analyze the results and look for the cause.

[global]                               
server_groups = 1                      
client_groups = 2                      
server_random_selection = 1            
server_soft_limit = 0                  
                                       
[client.0]                             
client_count = 1                       
client_wait = 0                        
client_total_ops = 10000               
client_server_select_range = 4         
client_iops_goal = 1000                
client_outstanding_ops = 32            
client_reservation = 250.0             
client_limit = 5000.0                  
client_weight = 10.0                   
                                       
[client.1]                             
client_count = 1                       
client_wait = 0                        
client_total_ops = 35000               
client_server_select_range = 4         
client_iops_goal = 3000                
client_outstanding_ops = 32            
client_reservation = 0.0               
client_limit = 10000.0                 
client_weight = 100.0                  
                                       
[server.0]                             
server_count = 4                       
server_iops = 400                      
server_threads = 1                     
                                       
simulation started                     
simulation completed in 29946 millisecs
==== Client Data ====                  
     client:       0       1  total    
        t_0:  436.00 1102.00 1538.00   
        t_1:  502.00 1017.50 1519.50   
        t_2:  459.50 1043.00 1502.50   
        t_3:  488.00 1042.50 1530.50   
        t_4:  469.00 1041.00 1510.00   
        t_5:  465.50 1047.50 1513.00   
        t_6:  549.50  963.50 1513.00   
        t_7:  427.50 1096.50 1524.00   
        t_8:  540.00  974.50 1514.50   
        t_9:  466.00 1032.50 1498.50   
       t_10:  197.00 1306.00 1503.00   
       t_11:    0.00 1520.50 1520.50   
       t_12:    0.00 1452.00 1452.00   
       t_13:    0.00 1447.00 1447.00   
       t_14:    0.00 1414.00 1414.00   
       t_15:    0.00    0.00    0.00   
    res_ops:    6735       0    6735   
   prop_ops:    3265   35000   38265   
total time to track responses: 35606754 nanoseconds;                     
    count: 45000;                                                        
    average: 791.26 nanoseconds per request/response                     
total time to get request parameters: 33954667 nanoseconds;              
    count: 45000;                                                        
    average: 754.55 nanoseconds per request/response                     
                                                                         
client timing for QOS algorithm: 1545.81 nanoseconds per request/response
                                                                         
==== Server Data ====                                                    
     server:       0       1       2       3   total                     
    res_ops:    1891    2007    1403    1434    6735                     
   prop_ops:    9449    9331    9816    9669   38265                     
                                                                         
 k-way heap: 2                                                           
                                                                         
total time to add requests: 226901024 nanoseconds;                       
    count: 45000;                                                        
    average: 5042.24 nanoseconds per request/response                    
total time to note requests complete: 105370772 nanoseconds;             
    count: 45000;                                                        
    average: 2341.57 nanoseconds per request/response                    
                                                                         
server timing for QOS algorithm: 7383.82 nanoseconds per request/response 

@ivancich
Member

ivancich commented Sep 1, 2017

Thank you for testing it. And I'll be curious to learn what your analysis finds.

@ivancich
Member

@TaewoongKim

We need to improve the tracker further for a more acceptable result because, in the test below, client 0's IOPS is much higher than its configuration. Client 0 should get about 250 IOPS because of its low weight. I will analyze the results and look for the cause.

Were you able to figure out why you weren't getting the results you were expecting? I'm very curious!

@TaewoongKim
Collaborator Author

I have been busy for a while.
Let's test without the random server selection option to investigate the problem.
Then you can see the result is almost perfect!
Below is the result without the random selection option.

[global]
server_groups = 1
client_groups = 2
server_random_selection = 0
server_soft_limit = 0

## OMIT...

simulation started
simulation completed in 30315 millisecs
==== Client Data ====
     client:       0       1  total
        t_0:  259.50 1292.50 1552.00
        t_1:  250.00 1304.50 1554.50
        t_2:  250.00 1301.00 1551.00
        t_3:  250.00 1303.00 1553.00
        t_4:  250.00 1305.00 1555.00
        t_5:  250.00 1304.00 1554.00
        t_6:  250.00 1305.50 1555.50
        t_7:  250.00 1305.00 1555.00
        t_8:  251.50 1301.00 1552.50
        t_9:  249.50 1303.00 1552.50
       t_10:  249.50 1305.00 1554.50
       t_11:  250.00 1302.50 1552.50
       t_12:  250.50 1304.00 1554.50
       t_13:  656.00  564.00 1220.00
       t_14:  935.50    0.00  935.50
       t_15:  148.00    0.00  148.00
       t_16:    0.00    0.00    0.00
    res_ops:    8024       0    8024
   prop_ops:    1976   35000   36976

The problem occurs only with random selection.
With random selection, client 1's queues on the servers are sometimes empty because requests are not distributed evenly. Client 0's requests are then scheduled by proportional scheduling whenever client 1's queues have no requests.
In the ideal case client 0 would not get any proportional share, but here it did.
This is why client 0 gets more than 250 IOPS: it received 250 IOPS through reservation scheduling plus extra IOPS through proportional scheduling. As a result, client 0 got a larger share than it was supposed to be served (250 IOPS).

@ivancich
Member

@TaewoongKim Thank you for looking into that. Would you be okay with the borrowing tracker becoming the default tracker?

@TaewoongKim
Collaborator Author

I think that's no problem; the borrowing tracker is more advanced than the original tracker.
