[Deepin-Kernel-SIG] [linux 6.12-y] [Openeuler] [Maillist] [PATCH v3] mm: convert mm's rss stats to use atomic mode #1412
base: linux-6.12.y
Conversation
Reviewer's Guide

This PR changes mm's rss accounting to use a lightweight atomic mode by default, switching to percpu counters only when a process becomes multithreaded. This reduces fork/exec overhead while preserving accurate accounting for multi-threaded workloads.

Sequence diagram for switching rss_stat from atomic mode to percpu mode on thread creation

```mermaid
sequenceDiagram
actor Task
participant KernelClone
participant OldMM
participant PerCpuCounter
Task->>KernelClone: clone(clone_flags)
KernelClone->>KernelClone: copy_mm(clone_flags, tsk)
KernelClone->>OldMM: obtain oldmm
alt clone_flags has CLONE_THREAD
KernelClone->>OldMM: mm_counter_switch_to_pcpu()
OldMM->>PerCpuCounter: percpu_counter_switch_to_pcpu_many(rss_stat, NR_MM_COUNTERS)
alt rss_stat not initialized
PerCpuCounter->>PerCpuCounter: __percpu_counter_init_many(..., switch_mode=true)
PerCpuCounter-->>OldMM: allocate percpu memory, keep counts in count_atomic
else rss_stat already percpu
PerCpuCounter-->>OldMM: return 0 (no-op)
end
OldMM-->>KernelClone: success or -ENOMEM
alt init failed
KernelClone-->>Task: clone fails with -ENOMEM
else init succeeded
KernelClone->>KernelClone: proceed with CLONE_THREAD handling
KernelClone-->>Task: new thread created sharing mm
end
else no CLONE_THREAD
KernelClone->>KernelClone: keep rss_stat in atomic mode
KernelClone-->>Task: process or thread without shared mm
end
```

Sequence diagram for rss_stat updates using atomic mode or percpu mode

```mermaid
sequenceDiagram
actor Task
participant MM
participant PerCpuCounter
Note over Task,MM: Update path (e.g. page fault)
Task->>MM: add_mm_counter(member, value)
MM->>MM: fbc = &rss_stat[member]
alt percpu_counter_initialized(fbc)
MM->>PerCpuCounter: percpu_counter_add(fbc, value)
else atomic mode
MM->>PerCpuCounter: percpu_counter_atomic_add(fbc, value)
end
MM->>MM: mm_trace_rss_stat(mm, member)
Note over Task,MM: Read path (e.g. stats or tracing)
Task->>MM: mm_counter_sum(member)
MM->>MM: fbc = &rss_stat[member]
alt percpu_counter_initialized(fbc)
MM->>PerCpuCounter: percpu_counter_sum(fbc)
PerCpuCounter-->>MM: aggregated percpu value
else atomic mode
MM->>PerCpuCounter: percpu_counter_atomic_read(fbc)
PerCpuCounter-->>MM: atomic count
end
MM-->>Task: rss value
```

Class diagram for percpu_counter atomic/percpu modes and mm rss_stat helpers

```mermaid
classDiagram
class percpu_counter {
+raw_spinlock_t lock
+s64 count
+atomic64_t count_atomic
+void __percpu* counters
+bool percpu_counter_initialized()
+s64 percpu_counter_atomic_read()
+void percpu_counter_atomic_add(s64 amount)
+int percpu_counter_switch_to_pcpu_many(u32 nr_counters)
}
class mm_struct {
+percpu_counter rss_stat[NR_MM_COUNTERS]
+unsigned long get_mm_counter(int member)
+unsigned long get_mm_counter_sum(int member)
+void add_mm_counter(int member, long value)
+void inc_mm_counter(int member)
+void dec_mm_counter(int member)
+s64 mm_counter_sum(int member)
+s64 mm_counter_sum_positive(int member)
+int mm_counter_switch_to_pcpu()
+void mm_counter_destroy()
}
class fork_copy_mm_path {
+int copy_mm(unsigned long clone_flags, struct task_struct* tsk)
}
percpu_counter <.. mm_struct : used_by
mm_struct "NR_MM_COUNTERS" o-- percpu_counter : rss_stat
mm_struct <.. fork_copy_mm_path : operates_on
```
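To make the class diagram concrete, here is a minimal C sketch of the union-based counter layout it describes. Field names follow the diagram; the layout is simplified and assumed from this PR (upstream also carries a CONFIG_HOTPLUG_CPU list member), not copied verbatim from the header.

```c
/* Sketch of the dual-mode counter (assumed layout, per this PR). */
struct percpu_counter {
	raw_spinlock_t lock;
	union {
		s64 count;		/* used in percpu mode */
		atomic64_t count_atomic;	/* used in atomic mode */
	};
	void __percpu *counters;	/* NULL => atomic mode */
};

/* Mode test: percpu storage allocated means percpu mode. */
static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
{
	return fbc->counters != NULL;
}
```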
Hey - I've found 3 issues, and left some high level feedback:
- The new percpu_counter_switch_to_pcpu_many() uses a `bool` for `ret` but returns it as `int` while intending to propagate `-ENOMEM`; this will collapse all non-zero failures to 1 and break callers relying on specific error codes, so `ret` should be an `int` and handled consistently with the function's return contract.
- The kerneldoc comment on percpu_counter_switch_to_pcpu_many() still talks about switching to/from "atomic mode" in a confusing way (and has typos like `perpcu`), which makes it hard to understand the intended behavior; updating the wording to clearly describe the atomic→percpu transition would improve maintainability.
- mm_counter_destroy() always calls percpu_counter_destroy_many() even when the mm's rss_stat counters were never switched to percpu mode; please double-check whether destroy_many() is safe when `counters == NULL` for all entries, and if not, gate the destroy on percpu_counter_initialized().
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new percpu_counter_switch_to_pcpu_many() uses a `bool` for `ret` but returns it as `int` while intending to propagate `-ENOMEM`; this will collapse all non‑zero failures to 1 and break callers relying on specific error codes, so `ret` should be an `int` and handled consistently with the function’s return contract.
- The kerneldoc comment on percpu_counter_switch_to_pcpu_many() still talks about switching to/from “atomic mode” in a confusing way (and has typos like `perpcu`), which makes it hard to understand the intended behavior; updating the wording to clearly describe the atomic→percpu transition would improve maintainability.
- mm_counter_destroy() always calls percpu_counter_destroy_many() even when the mm’s rss_stat counters were never switched to percpu mode; please double‑check whether destroy_many() is safe when `counters == NULL` for all entries, and if not, gate the destroy on percpu_counter_initialized().
## Individual Comments
### Comment 1
<location> `lib/percpu_counter.c:329` </location>
<code_context>
+{
+ static struct lock_class_key __key;
+ unsigned long flags;
+ bool ret = 0;
+
+ if (percpu_counter_initialized(fbc))
</code_context>
<issue_to_address>
**issue (bug_risk):** The use of `bool ret` loses the original error code from `__percpu_counter_init_many` and misrepresents `-ENOMEM` as `1`.
`__percpu_counter_init_many` returns an `int` status (0 or negative errno), but here its result is stored in a `bool` and then returned from an `int`-returning function. This converts `-ENOMEM` to `1`, breaking callers that rely on negative error codes (e.g. `copy_mm()` expects `-ENOMEM`). Keep `ret` as an `int` and return it directly:
```c
int ret = 0;
...
if (likely(!percpu_counter_initialized(fbc)))
ret = __percpu_counter_init_many(...);
...
return ret;
```
</issue_to_address>
### Comment 2
<location> `lib/percpu_counter.c:211` </location>
<code_context>
INIT_LIST_HEAD(&fbc[i].list);
#endif
- fbc[i].count = amount;
+ if (likely(!switch_mode))
+ fbc[i].count = amount;
fbc[i].counters = (void __percpu *)counters + i * counter_size;
</code_context>
<issue_to_address>
**issue (bug_risk):** Switching from atomic to percpu mode does not preserve the existing atomic count into the percpu counter state.
When `switch_mode` is true, this path skips initializing `fbc[i].count`, while `amount` is 0 and the percpu memory is zeroed via `__GFP_ZERO`. The pre-existing value in `count_atomic` is therefore lost and subsequent `percpu_counter_*` reads incorrectly return 0.
This path should migrate the current atomic value into the percpu state—either by reading `percpu_counter_atomic_read(&fbc[i])` and assigning it to `fbc[i].count` after allocation (keeping per-cpu shards at 0), or by passing that value as `amount` and using the normal initialization instead of skipping it when `switch_mode` is set.
</issue_to_address>
### Comment 3
<location> `lib/percpu_counter.c:316-323` </location>
<code_context>
EXPORT_SYMBOL(__percpu_counter_compare);
+/*
+ * percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from
+ * atomic mode to percpu mode.
+ *
+ * Return: 0 if percpu_counter is already in atomic mode or successfully
+ * switched to atomic mode; -ENOMEM if perpcu memory allocation fails,
+ * perpcu_counter is still in atomic mode.
+ */
</code_context>
<issue_to_address>
**suggestion:** The function comment mixes up atomic vs percpu modes and contains typos, which can mislead callers.
The return description is inverted: this helper converts from atomic to percpu and returns 0 when already in percpu mode or successfully switched to percpu, not atomic. Also, "perpcu"/"perpcu_counter" are misspelled. Please correct the mode names and typos to accurately reflect the behavior and avoid confusing callers.
```suggestion
/*
* percpu_counter_switch_to_pcpu_many: Converts struct percpu_counters from
* atomic mode to percpu mode.
*
* Return: 0 if percpu_counter is already in percpu mode or successfully
* switched to percpu mode; -ENOMEM if percpu memory allocation fails, in
* which case percpu_counter remains in atomic mode.
*/
```
</issue_to_address>
Force-pushed: 534ef54 to 6297058
Pull request overview
This pull request implements a performance optimization for memory management RSS (Resident Set Size) statistics by introducing a dual-mode approach for percpu counters. The change aims to reduce the 2-4% performance regression in fork/exec/shell operations introduced by earlier commits that converted RSS stats to always use percpu counters.
Key changes:
- Introduce atomic mode for percpu counters as an alternative to percpu-backed mode, selected at runtime
- Delay allocation of percpu memory for RSS stats until a process creates its second thread
- Add helper functions to transparently read, update, and switch between atomic and percpu modes (see the sketch after this list)
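To illustrate the last point, below is a minimal sketch of the dual-mode update helper, following the names in the sequence diagrams above; this is an assumed shape, not the exact patch hunk.

```c
/* Sketch of the dual-mode update path (names per the sequence
 * diagrams; assumed shape, not the exact hunk). */
static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	struct percpu_counter *fbc = &mm->rss_stat[member];

	if (percpu_counter_initialized(fbc))
		percpu_counter_add(fbc, value);		/* percpu mode */
	else
		percpu_counter_atomic_add(fbc, value);	/* atomic mode */

	mm_trace_rss_stat(mm, member);
}
```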
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 16 comments.
Summary per file:
| File | Description |
|---|---|
| lib/percpu_counter.c | Adds switch_mode parameter to initialization and implements percpu_counter_switch_to_pcpu_many for runtime mode switching |
| include/linux/percpu_counter.h | Defines union for atomic/percpu count storage and declares helper functions for atomic operations and mode switching |
| include/linux/mm.h | Updates mm counter accessors to check mode and use appropriate read/write operations; adds wrapper functions for switching and destruction |
| kernel/fork.c | Removes eager percpu allocation from mm_init, adds lazy switching to percpu mode when creating threads, updates cleanup to use new wrapper |
| include/trace/events/kmem.h | Updates RSS stat tracing to use new mm_counter_sum_positive helper |
Comments suppressed due to low confidence (1)
include/linux/mm.h:2657
- The `get_mm_counter_sum` function was not updated to handle the new dual-mode counters. It always calls `percpu_counter_sum_positive` on the counter without checking whether it is in atomic or percpu mode. For atomic-mode counters, this will attempt to access uninitialized percpu memory. This function should be updated to check the mode, similar to the new `mm_counter_sum_positive` helper.
```c
static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}
```
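A mode-aware version along the lines this comment suggests might look like the following sketch, mirroring the new mm_counter_sum_positive helper; this is an assumed fix, not code from the patch.

```c
/* Sketch: make get_mm_counter_sum dual-mode aware (assumed fix). */
static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
	struct percpu_counter *fbc = &mm->rss_stat[member];

	if (percpu_counter_initialized(fbc))
		return percpu_counter_sum_positive(fbc);

	/* Atomic mode: clamp to zero like the *_positive variants. */
	return max_t(s64, 0, percpu_counter_atomic_read(fbc));
}
```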
```diff
 #endif
-	fbc[i].count = amount;
+	if (likely(!switch_mode))
+		fbc[i].count = amount;
```
Copilot AI · Jan 7, 2026
When switching from atomic mode to percpu mode, the existing value in count_atomic needs to be migrated to the percpu counter's count field. Currently, when switch_mode is true, fbc[i].count is not initialized at all, which means the accumulated atomic count value is lost. The function should read the current atomic64_read(&fbc[i].count_atomic) value and assign it to fbc[i].count before switching modes to preserve the accumulated RSS statistics.
```suggestion
		fbc[i].count = amount;
	else
		fbc[i].count = atomic64_read(&fbc[i].count_atomic);
```
count and count_atomic are members of a union, so they overlap in memory; this operation is unnecessary.
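The point is that the atomic-mode value is already visible through count once the mode flips, because both members alias the same storage. A small standalone C sketch of that aliasing (illustrative only; the kernel atomic64_t is replaced with C11 atomics here):

```c
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for the kernel union: both members share the same bytes,
 * so no copy is needed when switching from atomic to percpu mode. */
union counter_storage {
	long long count;			/* percpu-mode view */
	_Atomic long long count_atomic;		/* atomic-mode view */
};

int main(void)
{
	union counter_storage c = { .count = 0 };

	atomic_fetch_add(&c.count_atomic, 42);	/* atomic-mode update */
	printf("%lld\n", c.count);		/* prints 42: same bytes */
	return 0;
}
```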
```c
	if (percpu_counter_initialized(fbc))
		return 0;

	preempt_disable();
	local_irq_save(flags);
	if (likely(!percpu_counter_initialized(fbc)))
		ret = __percpu_counter_init_many(fbc, 0,
				GFP_ATOMIC|__GFP_NOWARN|__GFP_ZERO,
				nr_counters, &__key, true);
```
Copilot AI · Jan 7, 2026
There is a potential race condition here. The check at line 331 and line 336 both use percpu_counter_initialized(fbc) without proper synchronization between them. If two threads call this function simultaneously, both could see the counter as uninitialized and attempt to initialize it, leading to a memory leak or inconsistent state. Consider using atomic compare-and-swap or ensuring this function is only called under appropriate locks.
Consider the call site: percpu_counter_initialized only matters on first initialization, which happens during fork. The process is still single-threaded when it reaches this initialization; only after it completes can the process become multi-threaded, so the two checks cannot race.
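In other words, the switch is reached only from the CLONE_THREAD path of copy_mm(), where the caller is still the sole user of the mm. A sketch of that call site, following the first sequence diagram (assumed shape, not the exact patch hunk):

```c
/* Sketch of the relevant copy_mm() logic (assumed shape). */
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
	struct mm_struct *mm, *oldmm = current->mm;

	tsk->mm = NULL;
	if (!oldmm)
		return 0;	/* kernel thread: nothing to copy */

	if (clone_flags & CLONE_VM) {
		/* Creating the second thread: flip rss_stat to percpu
		 * mode while the caller is still the only mm user, so
		 * the "initialized" checks cannot race. */
		if ((clone_flags & CLONE_THREAD) &&
		    mm_counter_switch_to_pcpu(oldmm))
			return -ENOMEM;
		mmget(oldmm);
		mm = oldmm;
	} else {
		mm = dup_mm(tsk, current->mm);
		if (!mm)
			return -ENOMEM;
	}

	tsk->mm = mm;
	tsk->active_mm = mm;
	return 0;
}
```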
include/linux/mm.h (outdated)

```c
	if (percpu_counter_initialized(fbc))
		return percpu_counter_read_positive(fbc);

	return percpu_counter_atomic_read(fbc);
```
Copilot AI · Jan 7, 2026
When the counter is in atomic mode, percpu_counter_atomic_read returns an s64 which can be negative. However, this function returns unsigned long, and there's no check to ensure non-negative values like percpu_counter_read_positive does. This could lead to incorrect behavior if the atomic counter has a negative value. Consider using max(0, percpu_counter_atomic_read(fbc)) or a similar approach to ensure non-negative results.
```suggestion
	s64 val;

	if (percpu_counter_initialized(fbc))
		return percpu_counter_read_positive(fbc);

	val = percpu_counter_atomic_read(fbc);
	if (val < 0)
		val = 0;
	return (unsigned long)val;
```
This was introduced in 34e5523 ("mm: avoid false sharing of mm_counter"); I will handle it in the next commit.
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9IA1I
CVE: NA

Reference: https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/

--------------------------------

Depending on whether counters is NULL, we can support two modes: atomic mode and percpu mode. We implement both modes by grouping the s64 count and atomic64_t count_atomic in a union. At the same time, we create the interface for adding and reading in atomic mode and for switching from atomic mode to percpu mode.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
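A minimal sketch of the atomic-mode interface this commit describes, assuming the union layout shown earlier; the names come from the patch description, the bodies are assumed.

```c
/* Sketch of the atomic-mode accessors (names from the patch
 * description; bodies assumed). */
static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
{
	return atomic64_read(&fbc->count_atomic);
}

static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
					     s64 amount)
{
	atomic64_add(amount, &fbc->count_atomic);
}
```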
maillist inclusion
category: performance
bugzilla: https://gitee.com/openeuler/kernel/issues/I9IA1I
CVE: NA

Reference: https://lore.kernel.org/all/20240418142008.2775308-1-zhangpeng362@huawei.com/

--------------------------------

Since commit f1a7941 ("mm: convert mm's rss stats into percpu_counter"), the rss_stats have been converted into percpu_counter, which converts the error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a performance regression on fork/exec/shell. Even after commit 14ef95b ("kernel/fork: group allocation/free of per-cpu counters for mm struct"), the performance of fork/exec/shell is still poor compared to previous kernel versions.

To mitigate the performance regression, we delay the allocation of percpu memory for rss_stats. Therefore, we convert mm's rss stats to use percpu_counter's atomic mode. For single-threaded processes, rss_stat stays in atomic mode, which reduces the memory consumption and performance regression caused by using percpu. For multi-threaded processes, rss_stat is switched to percpu mode to reduce the error margin. We convert rss_stats from atomic mode to percpu mode only when the second thread is created.

After an lmbench test, we get a 2% ~ 4% performance improvement for lmbench fork_proc/exec_proc/shell_proc and a 6.7% performance improvement for lmbench page_fault (before batch mode [1]). The test results are as follows:

             base      base+revert      base+this patch
fork_proc    416.3ms   400.0ms (3.9%)   398.6ms (4.2%)
exec_proc    2095.9ms  2061.1ms (1.7%)  2047.7ms (2.3%)
shell_proc   3028.2ms  2954.7ms (2.4%)  2961.2ms (2.2%)
page_fault   0.3603ms  0.3358ms (6.8%)  0.3361ms (6.7%)

[1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Force-pushed: 6297058 to 6d79ca4
deepin pr auto review: Let me review this diff in detail:

Overall, this is a good improvement: introducing the dual-mode mechanism optimizes performance across different scenarios. The implementation takes safety into account, is well commented, and is clearly structured. The main things to watch are the completeness of the error handling and the treatment of edge cases.
/approve

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Avenger-285714
PR sync from: Peng Zhang zhangpeng362@huawei.com
https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/message/5H6WRELSEVUK7Z2YFHIMWEVCUHD33QFS/
From: ZhangPeng zhangpeng362@huawei.com
Since commit f1a7941 ("mm: convert mm's rss stats into
percpu_counter"), the rss_stats have converted into percpu_counter,
which convert the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell. Even after commit 14ef95b
("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
the performance of fork/exec/shell is still poor compared to previous
kernel versions.
To mitigate performance regression, we delay the allocation of percpu
memory for rss_stats. Therefore, we convert mm's rss stats to use
percpu_counter's atomic mode. For single-threaded processes, rss_stat is
in atomic mode, which reduces the memory consumption and performance
regression caused by using percpu. For multi-threaded processes,
rss_stat is switched to percpu mode to reduce the error margin. We
convert rss_stats from atomic mode to percpu mode only when the
second thread is created.
After lmbench test, we can get 2% ~ 4% performance improvement
for lmbench fork_proc/exec_proc/shell_proc and 6.7% performance
improvement for lmbench page_fault (before batch mode [1]).
The test results are as follows:
base base+revert base+this patch
fork_proc 416.3ms 400.0ms (3.9%) 398.6ms (4.2%)
exec_proc 2095.9ms 2061.1ms (1.7%) 2047.7ms (2.3%)
shell_proc 3028.2ms 2954.7ms (2.4%) 2961.2ms (2.2%)
page_fault 0.3603ms 0.3358ms (6.8%) 0.3361ms (6.7%)
[1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
ChangeLog:
v2->v3:
remove patch 3.
v1->v2:
Split patch 2 into two patches.
ZhangPeng (2):
--
2.25.1
Summary by Sourcery
Convert mm RSS accounting to use atomics by default and switch to percpu counters only for multithreaded processes to reduce fork/exec overhead while preserving accuracy where needed.