What exactly is shrdsize? #59

Open
plysdyret opened this issue Apr 30, 2025 · 10 comments
Labels
bug Something isn't working

Comments

plysdyret commented Apr 30, 2025

Hi

Wondering if there is a problem with the shardsize calculation:

[root@lazy ceph-balancer]# ./placementoptimizer.py show 
cluster acting state

class       size   avail    used   %used   osds
ssd       48.90T  33.66T  15.25T  31.18%     37
hdd        4.10P   1.63P   2.46P  60.13%    407
nvmebulk 195.62T 149.31T  46.30T  23.67%     13
nvme     209.59T  59.27T 150.32T  71.72%    120

poolid name               type     size min pg_num  stored    used    avail shrdsize crush
     4 rbd                repl        3   2   1024 110.16T 286.69T  192.19T   110.2G 4:replicated_hdd default~hdd*1.000
     5 libvirt            repl        3   2    256   3.25T   6.69T    3.36T    13.0G 3:replicated_ssd default~ssd*1.000
     6 rbd_internal       repl        3   2   2048 102.78T 241.79T  299.49T    51.4G 4:replicated_hdd default~hdd*1.000
     8 .mgr               repl        2   1      1   4.91G   2.05G   53.74G     4.9G 3:replicated_ssd default~ssd*1.000
    10 rbd_ec             repl        3   2     32   8.01M   3.35M    1.95T   256.3K 3:replicated_ssd default~ssd*1.000
    11 rbd_ec_data        ec4+2       6   5  16384   1.04P   1.42P  648.12T    16.6G 0:rbd_ec_data default~hdd*1.000
    23 rbd.nvme           repl        2   1   2048  95.38T 148.82T     0.0B    47.7G 5:replicated_nvme default~nvme*1.000
    25 .nfs               repl        3   2     32  20.87K 277.26K    2.77T   667.8B 3:replicated_ssd default~ssd*1.000
    31 cephfs.cephfs.meta repl        3   2    128  15.37G  45.96G    3.36T   123.0M 3:replicated_ssd default~ssd*1.000
    32 cephfs.cephfs.data repl        3   2    512  449.0B  48.00K    3.84T     0.9B 3:replicated_ssd default~ssd*1.000
    34 cephfs.nvme.data   repl        2   1     32 976.56G 122.07G     0.0B    30.5G 5:replicated_nvme default~nvme*1.000
    35 cephfs.ssd.data    repl        3   2     32 748.18G   1.68T    1.68T    23.4G 3:replicated_ssd default~ssd*1.000
    37 cephfs.hdd.data    ec4+5       9   5   2048 206.45T 426.09T  436.36T    25.8G 7:cephfs.hdd.data default~hdd*1.000
    39 rbd.ssd            repl        3   2     64   1.63T   4.24T    3.36T    26.1G 3:replicated_ssd default~ssd*1.000
    43 rbd.ssd.ec         repl        3   2     32   2.00K  17.94K    1.82T    63.9B 3:replicated_ssd default~ssd*1.000
    44 rbd.ssd.ec.data    ec4+5       9   5     32   1.03T   1.95T    1.12T     8.2G 6:rbd.ssd.ec.data default~ssd*1.000
    47 rbd.nvmebulk.ec    repl        3   2     32   3.01M   6.10M    3.32T    96.3K 10:replicated_nvmebulk default~nvmebulk*1.000
    48 rbd.nvmebulk.data  ec4+5       9   5    512  22.86T  46.09T    1.53T    11.4G 11:rbd.nvmebulk.data default~nvmebulk*1.000
sum                                          25249   1.57P   2.56P
default~hdd                                          1.45P   2.35P    4.10P    57.451%
default~ssd                                          6.66T  14.62T   48.90T    29.887%
default~nvme                                        96.34T 148.94T  209.59T    71.063%
default~nvmebulk                                    22.86T  46.09T  195.62T    23.560%

If I look at ceph pg dump, the size of a random PG in pool 11 is 70869098496 bytes and the size of a random PG in pool 4 is 107738628096 bytes. The latter sort of fits shrdsize = the size of a PG in the pool, but the former does not.

What exactly is shrdsize? Something different for EC vs replicated?

Best regards,

Torkil

plysdyret (Author) commented:

I read your explanation on the front page and it doesn't make sense to me. Shouldn't it simply be this, given that data chunks and parity chunks are the same size?

EC shard size = pg_size / (k+m)
Replicated shard size = pg_size / replication_factor

Am I missing something?
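
Expressed as a quick sketch (the function and the example PG sizes below are illustrative, not from the balancer code), the proposal would be:

```python
# Sketch of the proposed calculation; the example PG sizes are round numbers, not cluster data.
def proposed_shard_size(pg_size_bytes: float, *, replicas: int = 0, k: int = 0, m: int = 0) -> float:
    """Divide one PG's size by the number of shards in its acting set."""
    if replicas:
        return pg_size_bytes / replicas       # replicated pool
    return pg_size_bytes / (k + m)            # erasure-coded pool

print(proposed_shard_size(110e9, replicas=3))  # pool-4-like PG: ~36.7e9 bytes
print(proposed_shard_size(70e9, k=4, m=2))     # pool-11-like PG (EC 4+2): ~11.7e9 bytes
```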

leofah (Contributor) commented Apr 30, 2025

Hello Torkil,

Here's a quick comment; I guess JJ can give more insights later.

I think pool compression is not taken into account in the balancer.

Without compression, the used size could be calculated from the stored size: for EC pools used = stored * (k + m) / k, and for replicated pools used = stored * replication_factor.

But your dump shows stored = 1.04P and used = 1.42P for pool 11. According to that calculation the used size should be 1.04P * 6/4 = 1.56P > 1.42P. Similarly for pool 4: with a replication factor of 3 the used size would be 110.16T * 3 = 330.48T > 286.69T.

As far as I see, the balancer calculates the shardsize based on the stored value. This would be stored / pg_num for replicated pools and stored / pg_num / k for erasure-coded pools.
So for pool 4 that is 110.16T / 1024 = 110.2G, and for pool 11 it is 1.04P / 16384 / 4 = 16.6G.
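
As a rough sketch (the unit helpers and function name are just for illustration, not from the balancer code), that stored-based calculation reproduces the shrdsize column above:

```python
# Illustrative reproduction of the stored-based shardsize numbers.
TIB = 1024 ** 4
PIB = 1024 ** 5
GIB = 1024 ** 3

def shardsize_from_stored(stored_bytes, pg_num, data_shards=1):
    # data_shards: 1 for replicated pools, k for erasure-coded pools
    return stored_bytes / pg_num / data_shards

print(shardsize_from_stored(110.16 * TIB, 1024) / GIB)      # pool 4 (repl 3): ~110.2 GiB
print(shardsize_from_stored(1.04 * PIB, 16384, 4) / GIB)    # pool 11 (EC 4+2): ~16.6 GiB
```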

I don't know for sure why the used size is different from the calculated used size; I think it's due to compression. Maybe you can show the output of ceph df detail, which includes compression information.

Leo

plysdyret (Author) commented Apr 30, 2025

[root@lazy ~]# ceph df detail
--- RAW STORAGE ---
CLASS        SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd       4.1 PiB  1.6 PiB  2.5 PiB   2.5 PiB      60.16
nvme      210 TiB   59 TiB  151 TiB   151 TiB      72.02
nvmebulk  196 TiB  149 TiB   46 TiB    46 TiB      23.68
ssd        49 TiB   34 TiB   15 TiB    15 TiB      31.23
TOTAL     4.5 PiB  1.9 PiB  2.7 PiB   2.7 PiB      58.86
 
--- POOLS ---
POOL                ID    PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
rbd                  4   1024  110 TiB  110 TiB  9.6 KiB   29.23M  287 TiB  287 TiB   29 KiB  25.54    279 TiB            N/A          N/A    N/A      40 TiB       82 TiB
libvirt              5    256  3.2 TiB  3.2 TiB   56 KiB  864.70k  6.7 TiB  6.7 TiB  168 KiB  27.99    5.7 TiB            N/A          N/A    N/A     1.5 TiB      4.5 TiB
rbd_internal         6   2048  103 TiB  103 TiB  5.0 KiB   32.55M  242 TiB  242 TiB   15 KiB  22.44    279 TiB            N/A          N/A    N/A      66 TiB      132 TiB
.mgr                 8      1  4.9 GiB  4.9 GiB      0 B    1.26k  2.0 GiB  2.0 GiB      0 B   0.01    8.6 TiB            N/A          N/A    N/A     2.0 GiB      9.8 GiB
rbd_ec              10     32  8.0 MiB  8.0 MiB  1.0 KiB       27  3.4 MiB  3.4 MiB  3.1 KiB      0    5.7 TiB            N/A          N/A    N/A     1.5 MiB       23 MiB
rbd_ec_data         11  16384  1.0 PiB  1.0 PiB  2.6 KiB  279.72M  1.4 PiB  1.4 PiB  4.0 KiB  63.54    557 TiB            N/A          N/A    N/A     138 TiB      276 TiB
rbd.nvme            23   2048   95 TiB   95 TiB  3.3 KiB   25.16M  149 TiB  149 TiB  6.6 KiB  77.30     22 TiB            N/A          N/A    N/A      30 TiB       71 TiB
.nfs                25     32   21 KiB   12 KiB  8.4 KiB       68  277 KiB  252 KiB   25 KiB      0    5.7 TiB            N/A          N/A    N/A         0 B          0 B
cephfs.cephfs.meta  31    128   15 GiB  247 MiB   15 GiB    3.05M   46 GiB  641 MiB   45 GiB   0.26    5.7 TiB            N/A          N/A    N/A      54 MiB      153 MiB
cephfs.cephfs.data  32    512    449 B    449 B      0 B  130.59M   48 KiB   48 KiB      0 B      0    5.7 TiB            N/A          N/A    N/A         0 B          0 B
cephfs.nvme.data    34     32  977 GiB  977 GiB      0 B     250k  122 GiB  122 GiB      0 B   0.27     22 TiB            N/A          N/A    N/A     122 GiB      1.9 TiB
cephfs.ssd.data     35     32  749 GiB  749 GiB      0 B    1.01M  1.7 TiB  1.7 TiB      0 B   8.91    5.7 TiB            N/A          N/A    N/A     327 GiB      855 GiB
cephfs.hdd.data     37   2048  206 TiB  206 TiB    570 B  174.87M  426 TiB  426 TiB  1.3 KiB  33.75    372 TiB            N/A          N/A    N/A      38 TiB       77 TiB
rbd.ssd             39     64  1.6 TiB  1.6 TiB  1.7 KiB  431.84k  4.2 TiB  4.2 TiB  5.1 KiB  19.78    5.7 TiB            N/A          N/A    N/A     519 GiB      1.2 TiB
rbd.ssd.ec          43     32  2.0 KiB     18 B  2.0 KiB        5   18 KiB   12 KiB  5.9 KiB      0    5.7 TiB            N/A          N/A    N/A         0 B          0 B
rbd.ssd.ec.data     44     32  1.0 TiB  1.0 TiB      0 B  269.92k  2.0 TiB  2.0 TiB      0 B  10.18    7.7 TiB            N/A          N/A    N/A     388 GiB      762 GiB
rbd.nvmebulk.ec     47     32  3.0 MiB  3.0 MiB  5.0 KiB        6  6.1 MiB  6.1 MiB   15 KiB      0    9.2 TiB            N/A          N/A    N/A     528 KiB      4.0 MiB
rbd.nvmebulk.data   48    512   23 TiB   23 TiB      0 B    6.00M   46 TiB   46 TiB      0 B  62.66     12 TiB            N/A          N/A    N/A     4.1 TiB      9.4 TiB

plysdyret (Author) commented:

> As far as I see, the balancer calculates the shardsize based on the stored value. This would be stored / pg_num for replicated pools

That yields pg_size and not shard_size? To get shard_size you need to additionally divide by the replication factor?

> and stored / pg_num / k for erasure-coded pools. So for pool 4 that is 110.16T / 1024 = 110.2G, and for pool 11 it is 1.04P / 16384 / 4 = 16.6G.

Why divide by just k and not k + m? Data chunks and parity chunks all take up the same space?

Here's the UP acting set for a random pool 11 PG from ceph pg dump:

[503,561,528,444,223,268] (6 shards)

From pool 4:

[125,555,438] (3 shards)

Unless a shard is something else =)

leofah (Contributor) commented Apr 30, 2025

Yes pool 11 has 6 shards per pg, and pool 4 has 3 shards per pg.

The difference here is whether we use the stored value or the used value.
If we use used, we have shard_size = pg_size / replication_factor or shard_size = pg_size / (k + m), i.e. the divisors you suggest.

When using the stored value (without compression), we only divide by the number of pg_shards which actually hold the original data. For replicated pools only one of the pg_shards holds the original data; for EC pools k pg_shards hold the original data. The other shards (replication_factor - 1 or m) store the redundancy.

The stored size is the original data size, whereas used takes the redundancy overhead into account.
Without compression we have stored = used / replication_factor or stored = used / (k + m) * k, the inverse of the previous formulas:

> Without compression, the used size could be calculated from the stored size: for EC pools used = stored * (k + m) / k, and for replicated pools used = stored * replication_factor.

Substituting this into the other formulas:

> stored / pg_num and stored / pg_num / k

we get used / pg_num / replication_factor or used / pg_num / (k + m)

The problem is that the balancer does not account for compression, so the current calculation of the shard size based on the stored value does not give the correct shard size. Instead, the shard size should be calculated from the used value.
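
As a rough sketch of that substitution (the helper names and the compression ratio are made up for illustration): without compression the stored-based and used-based formulas agree; with compression, only the used-based one gives the real on-disk shard size.

```python
# Sketch of the substitution above; helper names and the compression ratio are illustrative.
def used_without_compression(stored, *, replicas=None, k=None, m=None):
    # used = stored * replication_factor, or stored * (k + m) / k for EC pools
    return stored * replicas if replicas else stored * (k + m) / k

def shardsize_from_stored(stored, pg_num, data_shards):
    return stored / pg_num / data_shards           # data_shards: 1 (repl) or k (EC)

def shardsize_from_used(used, pg_num, all_shards):
    return used / pg_num / all_shards              # all_shards: replicas or k + m

stored, pg_num = 1.0e15, 16384                     # illustrative EC 4+2 pool
used = used_without_compression(stored, k=4, m=2)

print(shardsize_from_stored(stored, pg_num, 4))    # identical results ...
print(shardsize_from_used(used, pg_num, 6))        # ... when there is no compression

compressed_used = 0.92 * used                      # pretend compression saved 8%
print(shardsize_from_used(compressed_used, pg_num, 6))  # smaller: the real on-disk shard size
```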

TheJJ (Owner) commented May 1, 2025

can you try again with the latest code? it should now display the real shard sizes on disk, not those of the data you stored before compression.

TheJJ added the bug ("Something isn't working") label on May 1, 2025
TheJJ (Owner) commented May 1, 2025

and just for the record, shardsize is the size of one pg's replication chunk, usually stored on just one OSD. in other words: since a pg is on multiple OSDs, a shard is that PG's allocation on one OSD.

when balancing, the shardsize is the minimum thing to possibly move around between OSDs.

plysdyret (Author) commented:

[root@slutty ceph-balancer]# ./placementoptimizer.py show 
cluster acting state

class       size   avail    used   %used   osds
ssd       48.90T  33.41T  15.50T  31.69%     37
hdd        4.10P   1.62P   2.47P  60.39%    407
nvmebulk 195.62T 149.23T  46.38T  23.71%     13
nvme     209.59T  55.84T 153.75T  73.36%    120

poolid name               type     size min pg_num  stored    used    avail shrdsize crush
     4 rbd                repl        3   2   1024 110.17T 286.65T  247.30T    95.5G 4:replicated_hdd default~hdd*1.000
     5 libvirt            repl        3   2    256   3.26T   6.71T    3.32T     9.0G 3:replicated_ssd default~ssd*1.000
     6 rbd_internal       repl        3   2   2048 102.89T 242.05T  296.76T    40.3G 4:replicated_hdd default~hdd*1.000
     8 .mgr               repl        2   1      1   4.91G   2.05G   53.20G     1.0G 3:replicated_ssd default~ssd*1.000
    10 rbd_ec             repl        3   2     32   8.01M   3.37M    1.94T    35.9K 3:replicated_ssd default~ssd*1.000
    11 rbd_ec_data        ec4+2       6   5  16384   1.04P   1.43P  633.08T    15.2G 0:rbd_ec_data default~hdd*1.000
    23 rbd.nvme           repl        2   1   2048  95.42T 151.59T     0.0B    37.9G 5:replicated_nvme default~nvme*1.000
    25 .nfs               repl        3   2     32  18.17K 269.15K    2.75T     2.8K 3:replicated_ssd default~ssd*1.000
    31 cephfs.cephfs.meta repl        3   2    128  15.37G  45.94G    3.32T   122.5M 3:replicated_ssd default~ssd*1.000
    32 cephfs.cephfs.data repl        3   2    512  449.0B  48.00K    3.80T    32.0B 3:replicated_ssd default~ssd*1.000
    34 cephfs.nvme.data   repl        2   1     32 976.56G 122.07G     0.0B     1.9G 5:replicated_nvme default~nvme*1.000
    35 cephfs.ssd.data    repl        3   2     32 760.65G   1.71T    1.66T    18.2G 3:replicated_ssd default~ssd*1.000
    37 cephfs.hdd.data    ec4+5       9   5   2048 206.78T 426.85T  448.35T    23.7G 7:cephfs.hdd.data default~hdd*1.000
    39 rbd.ssd            repl        3   2     64   1.63T   4.25T    3.32T    22.6G 3:replicated_ssd default~ssd*1.000
    43 rbd.ssd.ec         repl        3   2     32   2.50K  19.45K    1.80T   207.4B 3:replicated_ssd default~ssd*1.000
    44 rbd.ssd.ec.data    ec4+5       9   5     32   1.03T   1.95T    1.11T     6.9G 6:rbd.ssd.ec.data default~ssd*1.000
    47 rbd.nvmebulk.ec    repl        3   2     32   3.01M   6.10M    3.32T    65.0K 10:replicated_nvmebulk default~nvmebulk*1.000
    48 rbd.nvmebulk.data  ec4+5       9   5    512  22.86T  46.09T    1.53T    10.2G 11:rbd.nvmebulk.data default~nvmebulk*1.000
sum                                          25249   1.58P   2.57P
default~hdd                                          1.45P   2.36P    4.10P    57.629%
default~ssd                                          6.69T  14.67T   48.90T    30.000%
default~nvme                                        96.38T 151.71T  209.59T    72.383%
default~nvmebulk                                    22.86T  46.09T  195.62T    23.560%

plysdyret (Author) commented May 2, 2025

> when balancing, the shardsize is the minimum thing to possibly move around between OSDs.

Thanks. I still don't get it, though; unfortunately I'm not very bright.

Here's an example pool 4 PG from ceph pg dump:

PG 4.21c
BYTES 109332307968
OSDs [125,555,438]

If the PG is 110GB in size, the shard size should be 110GB / 3 = 36.6GB?

Example pool 11 PG:

PG 11.409
BYTES 70422941696
OSDs [275,107,511,106,581,48]

In this case, EC 4+2, the shard size should be 70GB / 6 = 11.66GB?

Simply looking at how big a PG actually is sidesteps all considerations about compression and whatnot, and gives you the actual size of the minimum thing to move?

TheJJ (Owner) commented May 2, 2025

these bytes are the stored amount of a pg i think, so you'd have to multiply it by three in order to get the used amount without compression (which then, divided by 3, would give the shardsize).
but since we have compression, we have to take the used amount (i.e. less than stored * 3 due to compression) and divide that by 3, and we get 286G/3=95G. mind that in the overview list this is also divided by pg num first, since data in a pool is equally distributed over its placement groups.

without compression you could do (stored data * blowup due to redundancy) / (number of pgs * pool size). but with compression you have to ask ceph what it actually allocated (including the blowup), so we have (used data)/(number of pgs * pool size)
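
As a rough numeric check of that formula against the updated show output above (the helper function and the rounded byte values are only illustrative):

```python
# Numeric check of used / (number of pgs * pool size) against the table above; rounded inputs.
TIB = 1024 ** 4
PIB = 1024 ** 5
GIB = 1024 ** 3

def shardsize(used_bytes, pg_num, pool_size):
    # pool_size: replica count for replicated pools, k + m for EC pools
    return used_bytes / (pg_num * pool_size)

print(shardsize(286.65 * TIB, 1024, 3) / GIB)   # pool 4 (repl 3): ~95.5 GiB
print(shardsize(1.43 * PIB, 16384, 6) / GIB)    # pool 11 (EC 4+2): ~15.3 GiB (table: 15.2G)
```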
