-
**Describe the bug**

Whether it's the publisher or the consumer, the rate drops to zero at fixed intervals. Is this because the RabbitMQ server pauses for garbage collection? In case observing a single queue is not enough to draw a conclusion about GC pauses, note that the same pattern appears globally: the entire RabbitMQ instance seems to pause approximately every 30 seconds. Why is this happening? Is RabbitMQ under excessive load? How can I analyze and troubleshoot whether these pauses are caused by high load or by something else?

**Reproduction steps**

When using RabbitMQ 3.12.14 with a sufficiently large number of queues and consumers, the issue is 100% reproducible.

**Expected behavior**

I believe both publishing and consuming should be smooth. Why does the consumption curve show the metrics dropping to zero at fixed intervals? I want to understand the cause, and if this is an abnormal phenomenon, I want to know how to prevent it.

**Additional context**

I deployed RabbitMQ on Debian 10 with Docker. The machine is an Alibaba Cloud ECS instance of type ecs.hfc6.6xlarge with 24 vCPUs, 48 GiB of memory, and top-tier cloud SSD storage. Two disks are present: a 40 GB system disk rated at 2280 IOPS and a 200 GB data disk rated at 11800 IOPS. I mention this to show that the machine I'm using is sufficiently powerful. No anomalies were observed in the monitoring panel of the Alibaba Cloud ECS instance.

Information about my Linux operating system:

```
root@szbq-rabbitmq-52:/opt/rabbitmq-server# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```

My Docker version information:

```
root@szbq-rabbitmq-52:/opt/rabbitmq-server# docker version
Client: Docker Engine - Community
Version: 20.10.20
API version: 1.41
Go version: go1.18.7
Git commit: 9fdeb9c
Built: Tue Oct 18 18:20:36 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.20
API version: 1.41 (minimum version 1.12)
Go version: go1.18.7
Git commit: 03df974
Built: Tue Oct 18 18:18:26 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.8
GitCommit: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
```

My Docker Compose configuration:

```yaml
services:
  rabbitmq3-management:
    restart: always
    container_name: rabbitmq3-management
    image: rabbitmq:3.12.14-management
    hostname: rabbitmq3-management-standalone
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "1"
    environment:
      - RABBITMQ_DEFAULT_USER=xxxx
      - RABBITMQ_DEFAULT_PASS=xxxx
    volumes:
      - "/xxxdata/rabbitmq:/var/lib/rabbitmq"
      - "./rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf"
    ports:
      - "5672:5672"
      - "15672:15672"
      - "15692:15692"
```

My CPU usage is as follows:

```
root@szbq-rabbitmq-52:/opt/rabbitmq-server# sar 1 10
Linux 4.19.0-17-amd64 (szbq-rabbitmq-52) 09/23/2025 _x86_64_ (24 CPU)
06:02:46 PM CPU %user %nice %system %iowait %steal %idle
06:02:47 PM all 49.41 0.00 11.42 2.63 0.00 36.54
06:02:48 PM all 57.86 0.00 14.21 3.26 0.00 24.67
06:02:49 PM all 50.62 0.00 14.37 3.81 0.00 31.21
06:02:50 PM all 46.65 0.00 13.36 3.35 0.00 36.64
06:02:51 PM all 50.82 0.00 12.78 3.21 0.00 33.19
06:02:52 PM all 48.58 0.00 14.05 3.55 0.00 33.82
06:02:53 PM all 51.76 0.00 13.25 3.09 0.00 31.91
06:02:54 PM all 51.82 0.00 15.52 4.16 0.00 28.50
06:02:55 PM all 46.00 0.00 15.28 4.50 0.00 34.22
06:02:56 PM all 51.12 0.00 14.02 3.56 0.00 31.30
Average: all 50.47 0.00 13.83 3.51 0.00 32.19
```

Disk utilization is as follows. Due to the large number of read and write operations, disk read/write activity remains quite intensive.

```
root@szbq-rabbitmq-52:~# iostat -x -d vdb 1
Linux 4.19.0-17-amd64 (szbq-rabbitmq-52) 09/23/2025 _x86_64_ (24 CPU)
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 31.14 12269.90 1938.31 134936.44 0.00 16144.08 0.00 56.82 1.19 0.27 3.25 62.25 11.00 0.07 88.24
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 13075.00 0.00 122144.00 0.00 14471.00 0.00 52.53 0.00 0.19 2.25 0.00 9.34 0.06 73.60
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 11924.00 0.00 123160.00 0.00 16937.00 0.00 58.68 0.00 0.19 2.42 0.00 10.33 0.07 81.60
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 11042.00 0.00 118188.00 0.00 13752.00 0.00 55.47 0.00 0.24 2.58 0.00 10.70 0.07 80.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 11594.00 0.00 115864.00 0.00 15493.00 0.00 57.20 0.00 0.19 2.05 0.00 9.99 0.07 80.40
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 12204.00 0.00 163384.00 0.00 18747.00 0.00 60.57 0.00 0.31 3.60 0.00 13.39 0.07 80.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 13239.00 0.00 135356.00 0.00 19292.00 0.00 59.30 0.00 0.20 2.26 0.00 10.22 0.06 78.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 11757.00 0.00 122884.00 0.00 17103.00 0.00 59.26 0.00 0.20 2.23 0.00 10.45 0.07 84.00
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 13287.00 0.00 129608.00 0.00 16788.00 0.00 55.82 0.00 0.19 2.38 0.00 9.75 0.06 77.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 12638.00 0.00 127536.00 0.00 17965.00 0.00 58.70 0.00 0.19 2.53 0.00 10.09 0.07 85.20
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 13874.00 0.00 169020.00 0.00 20173.00 0.00 59.25 0.00 0.29 4.07 0.00 12.18 0.06 81.60
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 0.00 13910.00 0.00 150768.00 0.00 18611.00 0.00 57.23 0.00 0.22 2.98 0.00 10.84 0.06 83.60
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
vdb 1.00 18049.00 4.00 175260.00 0.00 23895.00 0.00 56.97 0.00 0.19 3.12 4.00 9.71 0.05 91.20
```

Although this doesn't affect my use of RabbitMQ, these jagged edges have piqued my curiosity. If I want to troubleshoot this issue, where should I start? I'd appreciate some guidance. Thank you. Is there any more detailed information I can provide?

It wasn't intentional on my part to stay on an outdated version. Around June 2025 I upgraded RabbitMQ from 3.12.14 to 3.13.7 and hit the "Restarting crashed queue" issue, which directly caused a production incident. When I then attempted to upgrade from 3.13.7 to 4.0.2, I still encountered the "Restarting crashed queue" issue (not an in-place upgrade; it happened on a newly created queue). This has made me hesitant to upgrade beyond 3.12.14 for now. Of course, the "Restarting crashed queue" issue is a separate topic; if you're interested, I can start a new discussion specifically about it.

@kjnilsson I use classic queues exclusively.
-
3.12.14 is out of community support; you need to upgrade to 4.2. That said, the graph may just be a side effect of how the metrics are calculated. Look at the throughput rate your consumers actually observe and see whether it matches what you see in the management UI. Upgrade to 4.2 and see if it still occurs.
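To make that comparison concrete, here is a minimal sketch of a consumer-side rate logger, assuming a Python consumer built on the `pika` client (the thread does not say which client library is actually in use; host, credentials, prefetch value and queue name are placeholders). If the rate printed by the client stays smooth while the management UI graph dips to zero, the dips are likely an artifact of how the UI samples rates; if the client-side rate stalls as well, deliveries really are pausing.

```python
# Hypothetical consumer-side rate logger; connection details and the queue
# name are placeholders, not values taken from this discussion.
import threading
import time

import pika

received = 0
lock = threading.Lock()


def on_message(channel, method, properties, body):
    global received
    with lock:
        received += 1
    channel.basic_ack(delivery_tag=method.delivery_tag)


def report_rate():
    # Once per second, print how many deliveries this consumer actually saw,
    # independently of what the management UI graph shows.
    global received
    while True:
        time.sleep(1)
        with lock:
            count, received = received, 0
        print(f"{time.strftime('%H:%M:%S')} consumed {count} msg/s")


params = pika.ConnectionParameters(
    host="localhost",
    port=5672,
    credentials=pika.PlainCredentials("guest", "guest"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.basic_qos(prefetch_count=200)  # arbitrary example value

threading.Thread(target=report_rate, daemon=True).start()
channel.basic_consume(queue="my-queue", on_message_callback=on_message)
channel.start_consuming()
```

The same idea applied on the publisher side (counting confirmed publishes per second) tells you whether publishing itself stalls or only the graph does.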
-
@ponponon - what do you expect the RabbitMQ maintainers to do with what little information you provide, exactly? Do you expect them to rush to set up an environment, try to GUESS how you're using RabbitMQ, and report back to you, all for free? You're not even using a supported version of RabbitMQ.

If you want free support for your issue, I suggest you provide enough information to reproduce what you report. First, reproduce your issue in your own environment using the latest versions of RabbitMQ and Erlang. If you still see the same behavior, provide a git repository with the complete source code for producers and consumers that mimic your workload and reproduce what you observe.
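For reference, a reproduction harness for this kind of report can be quite small. The sketch below is a hypothetical starting point in Python with `pika` (the queue count, payload size, credentials and host are invented placeholders, not the reporter's actual workload): it declares a handful of durable classic queues and runs one publisher thread and one consumer thread per queue.

```python
# Hypothetical repro skeleton: N durable classic queues, one publisher thread
# and one consumer thread per queue. All names and sizes are placeholders.
import threading

import pika

HOST = "localhost"
CREDS = pika.PlainCredentials("guest", "guest")
QUEUES = [f"repro-queue-{i}" for i in range(10)]
BODY = b"x" * 1024  # 1 KiB payload; adjust to match the real workload


def publish(queue_name):
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=HOST, credentials=CREDS))
    ch = conn.channel()
    ch.queue_declare(queue=queue_name, durable=True)  # classic queue by default
    props = pika.BasicProperties(delivery_mode=2)     # persistent messages
    while True:
        ch.basic_publish(exchange="", routing_key=queue_name, body=BODY, properties=props)


def consume(queue_name):
    conn = pika.BlockingConnection(pika.ConnectionParameters(host=HOST, credentials=CREDS))
    ch = conn.channel()
    ch.queue_declare(queue=queue_name, durable=True)
    ch.basic_qos(prefetch_count=100)
    ch.basic_consume(
        queue=queue_name,
        on_message_callback=lambda c, m, p, b: c.basic_ack(delivery_tag=m.delivery_tag),
    )
    ch.start_consuming()


for q in QUEUES:
    threading.Thread(target=publish, args=(q,), daemon=True).start()
    threading.Thread(target=consume, args=(q,), daemon=True).start()

threading.Event().wait()  # keep the main thread alive
```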
-
@ponponon do you expect us to guess what your consumers do or do not do (such as not acknowledging deliveries in a timely manner, or not using a suitable prefetch value)? I'm afraid our small team cannot afford guessing; it is a very time-consuming approach to troubleshooting distributed infrastructure.
The Erlang runtime does not suffer from "stop the world" GC pauses because there is no global GC: every Erlang process (a connection, a channel or session, a queue or stream replica) has an independent heap, and their garbage collections do not affect other processes. Yes, there is a shared reference-counted heap for larger binaries, but its GC is not "stop the world" for the entire system.

As any heavy PerfTest user can confirm, when a stop-the-world Java GC happens in a consumer or producer process, you can usually tell by a drop in the publishing or delivery/delivery-acknowledgement metrics, even though RabbitMQ itself was not paused for GC.

One scenario where RabbitMQ is guaranteed to stop deliveries is when a consumer has been delivered as many messages as its channel's prefetch allows, which by definition means that RabbitMQ must not deliver any more until some outstanding deliveries are acknowledged.
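To illustrate that prefetch scenario: once a consumer has as many unacknowledged deliveries as its channel's prefetch allows, the broker stops delivering to it until acknowledgements arrive, which looks like a pause on a rate graph. A minimal sketch with the Python `pika` client (the client choice, connection details and the prefetch value of 100 are assumptions for illustration, not something stated in this thread):

```python
# Sketch of a consumer that keeps deliveries flowing by using a bounded
# prefetch and acknowledging promptly. Connection details are placeholders.
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(
        host="localhost",
        credentials=pika.PlainCredentials("guest", "guest"),
    )
)
channel = connection.channel()

# At most 100 deliveries may be outstanding (unacknowledged) on this channel.
# If the callback below never acknowledged, deliveries would stop entirely
# once that limit is reached.
channel.basic_qos(prefetch_count=100)


def handle(ch, method, properties, body):
    # ... do the actual work here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack promptly


channel.basic_consume(queue="my-queue", on_message_callback=handle)
channel.start_consuming()
```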
-
By using monitoring data, ideally with the full set of Grafana dashboards (it can be inter-node connection congestion if the messages are large), and by asking the node how it spends its CPU/scheduler time. If a node has 1 CPU core, then a surge of activity in any part of the system (e.g. on a particular connection) can inevitably take CPU scheduler time away from queues or channels (which serialize the deliveries to be sent).

With an installation this old (it has reached EOL, without any exceptions), I also cannot rule out that the periodic background GC settings that were relevant for some workloads years ago are enabled. They force a minor GC run for every single process in the system.
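One way to collect that kind of evidence without a full Grafana setup is to sample the management HTTP API over time (port 15672 is exposed in the compose file above; the Prometheus endpoint on 15692 plus the stock Grafana dashboards is the more complete option), and to ask the node how its schedulers spend their time with `rabbitmq-diagnostics runtime_thread_stats`. A rough sketch of the sampling side in Python, with host, credentials and the polling interval as placeholders:

```python
# Poll the RabbitMQ management API and log aggregate publish/deliver rates,
# so dips can be correlated with other monitoring data. Placeholder values.
import time

import requests

URL = "http://localhost:15672/api/queues"
AUTH = ("guest", "guest")

while True:
    queues = requests.get(URL, auth=AUTH, timeout=10).json()
    publish_rate = sum(
        q.get("message_stats", {}).get("publish_details", {}).get("rate", 0.0)
        for q in queues
    )
    deliver_rate = sum(
        q.get("message_stats", {}).get("deliver_get_details", {}).get("rate", 0.0)
        for q in queues
    )
    stamp = time.strftime("%H:%M:%S")
    print(f"{stamp} publish={publish_rate:.0f}/s deliver={deliver_rate:.0f}/s")
    time.sleep(5)
```

Logging these samples alongside the management UI graph makes it easier to tell whether the dips are real or an artifact of how the graph is rendered.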