Uprobe regs ci #9612
Closed
Conversation
It is already checked whether the VMA size is the same as nr_pages * PAGE_SIZE, so later checking both: aux_size == vma_size && aux_size == nr_pages * PAGE_SIZE is redundant. Remove the vma_size check as nr_pages is what is actually used in the allocation function. That prepares for splitting out the buffer allocation into separate functions, so that only nr_pages needs to be handed in. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.424519320@infradead.org
To prepare for splitting the buffer allocation out into separate functions for the ring buffer and the AUX buffer, split out mlock limit handling into a helper function, which can be called from both. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.541975109@infradead.org
Similarly to the mlock limit calculation the VM accounting is required for both the ringbuffer and the AUX buffer allocations. To prepare for splitting them out into separate functions, move the accounting into a helper function. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.660347811@infradead.org
  if (cond) {
          A;
  } else {
          B;
  }
  C;

into

  if (cond) {
          A;
          C;
  } else {
          B;
          C;
  }

Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.781244099@infradead.org
  if (cond) {
          A;
  } else {
          B;
  }

  if (cond) {
          C;
  } else {
          D;
  }

into:

  if (cond) {
          A;
          C;
  } else {
          B;
          D;
  }

Notably the conditions are not identical in form, but are equivalent. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104018.900078502@infradead.org
  if (cond) {
          A;
  } else {
          B;
  }
  C;

into

  if (cond) {
          A;
          C;
  } else {
          B;
          C;
  }

Notably C has a success branch and both A and B have two places for success. For A (rb case), duplicate the success case because later patches will result in them no longer being identical. For B (aux case), share using goto (cleaned up later). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.016252852@infradead.org
unlock and aux_unlock are now identical, remove the aux_unlock one. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.131293512@infradead.org
After duplicating the common code into the rb/aux branches, it is possible to use a simple guard() for the aux_mutex, making the aux branch self-contained. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.246250452@infradead.org
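A minimal sketch of the guard() pattern referred to here, using the scoped-lock helper from <linux/cleanup.h>; the function and the allocation step are illustrative, not the actual perf_mmap() aux branch:

  #include <linux/cleanup.h>
  #include <linux/mutex.h>

  static int aux_branch_sketch(struct mutex *aux_mutex)
  {
          guard(mutex)(aux_mutex);        /* unlocked automatically on any return */

          /* ... aux buffer setup would go here ... */

          return 0;
  }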
Mostly re-indent noise needed to get rid of that label. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.362581570@infradead.org
Move the AUX buffer allocation branch into its own function. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.494205648@infradead.org
Ensure @rb usage doesn't extend out of the branch block. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.605285302@infradead.org
Move the RB buffer allocation branch into its own function. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.722214699@infradead.org
Mostly just re-indent noise. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104019.838047976@infradead.org
Needed because refcount_inc() doesn't allow the 0->1 transition. Specifically, this is the case where we've created the RB, this means there was no RB, and as such there could not have been an mmap. Additionally we hold mmap_mutex to serialize everything. This must be the first. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250812104019.956479989@infradead.org
The recently fixed reference count leaks could have been detected by using refcount_t and refcount_t would have mitigated the potential overflow at least. Now that the code is properly structured, convert the mmap() related mmap_count variants over to refcount_t. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/r/20250812104020.071507932@infradead.org
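A minimal sketch of the conversion pattern, assuming a simplified buffer struct; the field and function names are illustrative, not the exact kernel/events code:

  #include <linux/refcount.h>

  struct rb_sketch {
          refcount_t mmap_count;
  };

  static void first_mmap(struct rb_sketch *rb)
  {
          /* 0 -> 1 must use refcount_set(); refcount_inc() warns and
           * saturates on a zero count.  Safe because the buffer was just
           * created under mmap_mutex, so no other mmap can exist yet. */
          refcount_set(&rb->mmap_count, 1);
  }

  static void further_mmap(struct rb_sketch *rb)
  {
          refcount_inc(&rb->mmap_count);  /* overflow-checked N -> N+1 */
  }

  static bool last_munmap(struct rb_sketch *rb)
  {
          return refcount_dec_and_test(&rb->mmap_count);
  }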
Currently unapply_uprobe takes mmap_read_lock, but it might call remove_breakpoint, which eventually changes user pages. The current code writes either the breakpoint or the original instruction, so it can get away with the read lock as explained in [1]. But with the upcoming change that writes multiple instructions on the probed address, we need to ensure that any update to the mm's pages is exclusive. [1] https://lore.kernel.org/all/20240710140045.GA1084@redhat.com/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-2-jolsa@kernel.org
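A minimal sketch of the locking change, with the uprobe and VMA details elided; the function name is illustrative:

  #include <linux/mm_types.h>
  #include <linux/mmap_lock.h>

  static void unapply_sketch(struct mm_struct *mm)
  {
          /* was: mmap_read_lock(mm) / mmap_read_unlock(mm) */
          mmap_write_lock(mm);    /* exclusive: the upcoming multi-byte
                                   * instruction updates modify user pages */
          /* ... remove_breakpoint()-style work ... */
          mmap_write_unlock(mm);
  }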
We are about to add uprobe trampoline, so cleaning up the namespace. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-3-jolsa@kernel.org
Making copy_from_page global and adding uprobe prefix. Adding the uprobe prefix to copy_to_page as well for symmetry. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-4-jolsa@kernel.org
Adding a uprobe_write function that does what uprobe_write_opcode did so far, but allows passing a verify callback function that checks the memory location before writing the opcode. It will be used in following changes to implement specific checking logic for instruction updates. uprobe_write_opcode now calls uprobe_write with verify_opcode as the verify callback. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-5-jolsa@kernel.org
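A hedged sketch of the callback split described here; the typedef, parameter list and return convention are assumptions for illustration, not the kernel's actual uprobe_write() signature:

  #include <linux/mm_types.h>

  typedef int (*uprobe_verify_sketch_t)(struct page *page, unsigned long vaddr,
                                        const void *insn, int nbytes);

  static int uprobe_write_sketch(struct page *page, unsigned long vaddr,
                                 const void *insn, int nbytes,
                                 uprobe_verify_sketch_t verify)
  {
          int ret = verify(page, vaddr, insn, nbytes);

          if (ret <= 0)
                  return ret;     /* < 0: error, 0: nothing to write */

          /* ... perform the actual write of 'insn' at 'vaddr' ... */
          return 0;
  }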
Adding nbytes argument to uprobe_write and related functions as preparation for writing whole instructions in following changes. Also renaming opcode arguments to insn, which seems to fit better. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-6-jolsa@kernel.org
uprobe_write has a special path to restore the original page when we write the original instruction back. This happens when uprobe_write detects that we want to write anything else but the breakpoint instruction. Move the detection away and pass it to uprobe_write as an argument, so it's possible to write different instructions (other than just breakpoint and rest). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-7-jolsa@kernel.org
Making update_ref_ctr call in uprobe_write conditional based on do_ref_ctr argument. This way we can use uprobe_write for instruction update without doing ref_ctr_offset update. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20250720112133.244369-8-jolsa@kernel.org
Adding support to add a special mapping for the user space trampoline, with the following functions:

  uprobe_trampoline_get - find or add uprobe_trampoline
  uprobe_trampoline_put - remove or destroy uprobe_trampoline

The user space trampoline is exported as an arch specific user space special mapping through tramp_mapping, which is initialized in following changes with the new uprobe syscall. The uprobe trampoline needs to be callable/reachable from the probed address, so while searching for an available address we use the is_reachable_by_call function to decide if the uprobe trampoline is callable from the probe address. All uprobe_trampoline objects are stored in the uprobes_state object and are cleaned up when the process mm_struct goes down. Adding new arch hooks for that, because this change is x86_64 specific. Locking is provided by callers in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-9-jolsa@kernel.org
Adding a new uprobe syscall that calls uprobe handlers for a given 'breakpoint' address. The idea is that the 'breakpoint' address calls the user space trampoline, which executes the uprobe syscall. The syscall handler reads the return address of the initial call to retrieve the original 'breakpoint' address. With this address we find the related uprobe object and call its consumers. Adding the arch_uprobe_trampoline_mapping function that provides the uprobe trampoline mapping. This mapping is backed with one global page initialized at __init time and shared by all the mapping instances. We do not allow the uprobe syscall to execute if the caller is not from the uprobe trampoline mapping. The uprobe syscall ensures the consumer (bpf program) sees register values in the state before the trampoline was called. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-10-jolsa@kernel.org
Putting together all the previously added pieces to support optimized uprobes on top of the 5-byte nop instruction.

The current uprobe execution goes through the following:
- installs breakpoint instruction over original instruction
- exception handler hit and calls related uprobe consumers
- and either simulates original instruction or does out of line single step execution of it
- returns to user space

The optimized uprobe path does the following:
- checks the original instruction is 5-byte nop (plus other checks)
- adds (or uses existing) user space trampoline with uprobe syscall
- overwrites original instruction (5-byte nop) with call to user space trampoline
- the user space trampoline executes uprobe syscall that calls related uprobe consumers
- trampoline returns back to next instruction

This approach won't speed up all uprobes as it's limited to using nop5 as the original instruction, but we plan to use nop5 as the USDT probe instruction (which currently uses a single byte nop) and speed up the USDT probes.

The arch_uprobe_optimize triggers the uprobe optimization and is called after the first uprobe hit. I originally had it called on uprobe installation but then it clashed with the elf loader, because the user space trampoline was added in a place where the loader might need to put elf segments, so I decided to do it after the first uprobe hit when loading is done.

The uprobe is un-optimized in the arch specific set_orig_insn call.

The instruction overwrite is x86 arch specific and needs to go through 3 updates (on top of the nop5 instruction):
- write int3 into 1st byte
- write last 4 bytes of the call instruction
- update the call instruction opcode

And cleanup goes through similar reverse stages:
- overwrite call opcode with breakpoint (int3)
- write last 4 bytes of the nop5 instruction
- write the nop5 first instruction byte

We do not unmap and release the uprobe trampoline when it's no longer needed, because there's no easy way to make sure none of the threads is still inside the trampoline. But we do not waste memory, because there's just a single page for all the uprobe trampoline mappings. We do waste a frame on the page mapping for every 4GB by keeping the uprobe trampoline page mapped, but that seems ok.

We take the benefit from the fact that set_swbp and set_orig_insn are called under mmap_write_lock(mm), so we can use the current instruction as the state the uprobe is in - nop5/breakpoint/call trampoline - and decide the needed action (optimize/un-optimize) based on that.

Attaching the speed up from the benchs/run_bench_uprobes.sh script:

current:
        usermode-count : 152.604 ± 0.044M/s
        syscall-count : 13.359 ± 0.042M/s
    --> uprobe-nop : 3.229 ± 0.002M/s
        uprobe-push : 3.086 ± 0.004M/s
        uprobe-ret : 1.114 ± 0.004M/s
        uprobe-nop5 : 1.121 ± 0.005M/s
        uretprobe-nop : 2.145 ± 0.002M/s
        uretprobe-push : 2.070 ± 0.001M/s
        uretprobe-ret : 0.931 ± 0.001M/s
        uretprobe-nop5 : 0.957 ± 0.001M/s

after the change:
        usermode-count : 152.448 ± 0.244M/s
        syscall-count : 14.321 ± 0.059M/s
        uprobe-nop : 3.148 ± 0.007M/s
        uprobe-push : 2.976 ± 0.004M/s
        uprobe-ret : 1.068 ± 0.003M/s
    --> uprobe-nop5 : 7.038 ± 0.007M/s
        uretprobe-nop : 2.109 ± 0.004M/s
        uretprobe-push : 2.035 ± 0.001M/s
        uretprobe-ret : 0.908 ± 0.001M/s
        uretprobe-nop5 : 3.377 ± 0.009M/s

I see a bit more speed up on Intel (above) compared to AMD. The big nop5 speed up is partly due to emulating nop5 and partly due to the optimization.

The key speed up we do this for is the USDT switch from nop to nop5:

        uprobe-nop : 3.148 ± 0.007M/s
        uprobe-nop5 : 7.038 ± 0.007M/s

Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250720112133.244369-11-jolsa@kernel.org
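A user-space sketch of the three-stage overwrite listed in the message above, run against a local buffer standing in for the probed text page; the displacement value and the write helper are stand-ins, only the byte sequencing follows the description:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  static uint8_t text[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };      /* nop5 */

  static void write_text(int off, const void *buf, int len)
  {
          memcpy(&text[off], buf, len);   /* stand-in for the uprobe write path */
  }

  int main(void)
  {
          int32_t rel32 = 0x1000;         /* assumed displacement to the trampoline */
          uint8_t call[5] = { 0xe8 };     /* call rel32 opcode */
          uint8_t int3 = 0xcc;

          memcpy(&call[1], &rel32, sizeof(rel32));

          write_text(0, &int3, 1);        /* 1) park the site behind int3 */
          write_text(1, &call[1], 4);     /* 2) fill in the call displacement */
          write_text(0, &call[0], 1);     /* 3) flip int3 to the call opcode */

          for (int i = 0; i < 5; i++)
                  printf("%02x ", text[i]);
          printf("\n");
          return 0;
  }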
Like uprobe_syscall_args; keep things consistent. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250821123656.705837806@infradead.org
Make is_optimized() return a tri-state and avoid return through argument. This simplifies things a little. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250821123656.823296198@infradead.org
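A small sketch of the tri-state return convention; the helper and its check are hypothetical, only the contract (< 0 error, 0 not optimized, 1 optimized) follows the description:

  #include <linux/errno.h>

  static int is_optimized_sketch(const unsigned char *insn, int len)
  {
          if (len < 5)
                  return -EINVAL;         /* cannot decode: error          */
          if (insn[0] == 0xe8)
                  return 1;               /* already a call: optimized     */
          return 0;                       /* anything else: not optimized  */
  }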
Instead of only accepting the x86_64 nop5 chosen by the kernel, accept any x86_64 NOP or NOPL instruction that is 5 bytes. Notably, the x86_64 nop5 pattern is valid in 32bit apps and could get compiler generated when built for i686 (which introduced NOPL). Since the trampoline is x86_64 only, make sure to limit this to x86_64 code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250821123656.935559566@infradead.org
The uprobe syscall stores and strips the trampoline stack frame from the user context, to make it appear similar to an exception at the original instruction. It then restores the trampoline stack when it can exit using sysexit. Make sure to match the regular stack manipulation with shadow stack operations, such that the regular and shadow stack don't get out of sync and cause trouble. This enables using the optimization when shadow stack is in use. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250821123657.055790090@infradead.org
The asm syntax in uretprobe_trampoline and uprobe_trampoline differs in the use of operand size suffixes. Make them consistent and remove all size suffixes. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250821123657.163417243@infradead.org
After commit d971342d38bf ("perf/x86/intel: Decouple BTS initialization from PEBS initialization") was introduced, x86_pmu.bts is initialized in bts_init(), which is hooked by arch_initcall(), whereas init_hw_perf_events() is hooked by early_initcall(). Once the core PMU is initialized, the nmi watchdog initialization is called immediately, before bts_init() is called. This leads to the BTS buffer not really being initialized, since bts_init() has not been called and x86_pmu.bts is still false at that time. Worse, the BTS buffer would never be initialized then, unless all core PMU events are freed and reserve_ds_buffers() is called again. Thus, aligning with init_hw_perf_events(), use early_initcall() to hook bts_init() to ensure x86_pmu.bts is initialized before the nmi watchdog initialization. Fixes: d971342 ("perf/x86/intel: Decouple BTS initialization from PEBS initialization") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250820023032.17128-2-dapeng1.mi@linux.intel.com
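The ordering fix, sketched with the bts_init() body elided; early_initcall() matches the level used by init_hw_perf_events(), so it runs before the NMI watchdog setup:

  #include <linux/init.h>

  static int __init bts_init(void)
  {
          /* ... detect BTS support, set x86_pmu.bts, etc. ... */
          return 0;
  }

  /* was: arch_initcall(bts_init); */
  early_initcall(bts_init);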
When running perf_fuzzer on PTL, sometimes the below "unchecked MSR access error" is seen when accessing IA32_PMC_x_CFG_B MSRs.

[ 55.611268] unchecked MSR access error: WRMSR to 0x1986 (tried to write 0x0000000200000001) at rIP: 0xffffffffac564b28 (native_write_msr+0x8/0x30)
[ 55.611280] Call Trace:
[ 55.611282] <TASK>
[ 55.611284] ? intel_pmu_config_acr+0x87/0x160
[ 55.611289] intel_pmu_enable_acr+0x6d/0x80
[ 55.611291] intel_pmu_enable_event+0xce/0x460
[ 55.611293] x86_pmu_start+0x78/0xb0
[ 55.611297] x86_pmu_enable+0x218/0x3a0
[ 55.611300] ? x86_pmu_enable+0x121/0x3a0
[ 55.611302] perf_pmu_enable+0x40/0x50
[ 55.611307] ctx_resched+0x19d/0x220
[ 55.611309] __perf_install_in_context+0x284/0x2f0
[ 55.611311] ? __pfx_remote_function+0x10/0x10
[ 55.611314] remote_function+0x52/0x70
[ 55.611317] ? __pfx_remote_function+0x10/0x10
[ 55.611319] generic_exec_single+0x84/0x150
[ 55.611323] smp_call_function_single+0xc5/0x1a0
[ 55.611326] ? __pfx_remote_function+0x10/0x10
[ 55.611329] perf_install_in_context+0xd1/0x1e0
[ 55.611331] ? __pfx___perf_install_in_context+0x10/0x10
[ 55.611333] __do_sys_perf_event_open+0xa76/0x1040
[ 55.611336] __x64_sys_perf_event_open+0x26/0x30
[ 55.611337] x64_sys_call+0x1d8e/0x20c0
[ 55.611339] do_syscall_64+0x4f/0x120
[ 55.611343] entry_SYSCALL_64_after_hwframe+0x76/0x7e

On PTL, GP counters 0 and 1 don't support the auto counter reload feature, so writing 1 to bit 0 of the CFG_B MSR, which asks to enable auto counter reload on GP counter 0, triggers a #GP.

The root cause of this issue is that the check of the auto counter reload (ACR) counter mask from user space in the intel_pmu_acr_late_setup() helper is incorrect. It allows an invalid ACR counter mask from user space to be set into hw.config1 and then written into the CFG_B MSRs, triggering the MSR access warning.

e.g., a user may create a perf event with an ACR counter mask (config2=0xcb), and there is only 1 event created, so "cpuc->n_events" is 1. The correct check condition should be "i + idx >= cpuc->n_events" instead of "i + idx > cpuc->n_events" (it looks like a typo). Otherwise, the counter mask would be traversed twice and an invalid "cpuc->assign[1]" bit (bit 0) is set into hw.config1, causing the MSR access error.

Besides, also check whether the events corresponding to the ACR counter mask are ACR events. If not, filter those bits out of the counter mask. If an event is not an ACR event, it could be scheduled to an HW counter which doesn't support ACR, and it's invalid to add its counter index to the ACR counter mask.

Furthermore, remove the WARN_ON_ONCE(), since it's easily triggered (a user can set any invalid ACR counter mask) and the warning message could mislead users.

Fixes: ec980e4 ("perf/x86/intel: Support auto counter reload") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250820023032.17128-3-dapeng1.mi@linux.intel.com
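A tiny sketch of the boundary condition being fixed; this is not the actual intel_pmu_acr_late_setup() loop, just the index check in isolation:

  #include <linux/types.h>

  static bool acr_idx_in_range(int i, int idx, int n_events)
  {
          /* with n_events == 1 only assign[0] is valid, so indexes equal
           * to n_events must be rejected as well: use >=, not >         */
          return i + idx < n_events;
  }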
IA32_PERF_CAPABILITIES.PEBS_TIMING_INFO[bit 17] is introduced to indicate whether timed PEBS is supported. Timed PEBS adds a new "retired latency" field in the basic info group to show the timing info. Please find detailed information about timed PEBS in section 8.4.1 "Timed Processor Event Based Sampling" of "Intel Architecture Instruction Set Extensions and Future Features". This patch adds the PERF_CAP_PEBS_TIMING_INFO flag, and the KVM module leverages this flag to expose the timed PEBS feature to guests. Moreover, opportunistically refine the indents and make the macros share consistent indents. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> Link: https://lore.kernel.org/r/20250820023032.17128-5-dapeng1.mi@linux.intel.com
The macro GLOBAL_CTRL_EN_PERF_METRICS is defined as 48 instead of BIT_ULL(48), which is inconsistent with other similar macros. This makes the macro quite easy to use wrongly, since users think it's a bit mask just like the other similar macros. Thus, change GLOBAL_CTRL_EN_PERF_METRICS to BIT_ULL(48) and eliminate this potential misuse. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> Link: https://lore.kernel.org/r/20250820023032.17128-6-dapeng1.mi@linux.intel.com
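The macro change itself, as described above (the exact header it lives in is not shown here):

  #include <linux/bits.h>

  /* was: #define GLOBAL_CTRL_EN_PERF_METRICS 48 */
  #define GLOBAL_CTRL_EN_PERF_METRICS     BIT_ULL(48)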
ICL_FIXED_0_ADAPTIVE is missing from INTEL_FIXED_BITS_MASK; add it. With the help of this new INTEL_FIXED_BITS_MASK, intel_pmu_enable_fixed() can be optimized: the old fixed counter control bits can be unconditionally cleared with INTEL_FIXED_BITS_MASK and the new control bits then set based on the new configuration. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> Link: https://lore.kernel.org/r/20250820023032.17128-7-dapeng1.mi@linux.intel.com
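A minimal sketch of the clear-then-set idea; the variables are placeholders rather than the exact intel_pmu_enable_fixed() code:

  #include <linux/types.h>

  static u64 update_fixed_ctrl_sketch(u64 ctrl, u64 bits_mask, u64 new_bits)
  {
          ctrl &= ~bits_mask;     /* unconditionally clear the old control bits,
                                   * including the ADAPTIVE bit now in the mask */
          ctrl |= new_bits;       /* then set the bits for the new configuration */
          return ctrl;
  }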
With the introduction of Perfmon v6, PMU counters can be discontiguous; for example, for fixed counters on CWF only fixed counters 0-3 and 5-7 are supported, and there is no fixed counter 4. To accommodate this change, archPerfmonExt CPUID (0x23) leaves are introduced to enumerate the true view of the counter bitmaps. The current perf code already supports the archPerfmonExt CPUID and uses the counter bitmaps to enumerate the counters the HW really supports, but x86_pmu_show_pmu_cap() still only dumps the absolute counter number instead of the true-view bitmap; it's outdated and may mislead readers. So dump the counters' true-view bitmap in x86_pmu_show_pmu_cap() and opportunistically change the dump sequence and wording. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20250820023032.17128-8-dapeng1.mi@linux.intel.com
The 'init_nr' argument has double duty: it's used to initialize both the number of contexts and the number of stack entries. That's confusing and the callers always pass zero anyway. Hard code the zero. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <Namhyung@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250820180428.259565081@kernel.org
Since get_perf_callchain() doesn't support cross-task unwinding for user space stacks, have it return NULL if both the crosstask and user arguments are set. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250820180428.426423415@kernel.org
To determine if a task is a kernel thread or not, it is more reliable to use (current->flags & (PF_KTHREAD|PF_USER_WORKER)) than to rely on current->mm being NULL. That is because some kernel tasks (io_uring helpers) may have an mm. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250820180428.592367294@kernel.org
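A minimal sketch of the check; the helper name is illustrative:

  #include <linux/sched.h>

  static bool is_kernel_task_sketch(struct task_struct *p)
  {
          /* more reliable than p->mm == NULL: io_uring helper threads
           * are kernel tasks but still carry an mm */
          return p->flags & (PF_KTHREAD | PF_USER_WORKER);
  }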
Simplify the get_perf_callchain() user logic a bit. task_pt_regs() should never be NULL. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250820180428.760066227@kernel.org
If the task is not a user thread, there's no user stack to unwind. Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250820180428.930791978@kernel.org
Adding a unique flag to the uprobe consumer to ensure it's the only consumer attached to the uprobe. This is helpful for use cases where a consumer wants to change user space registers, which might confuse other consumers, so it's desirable for it to be the only consumer on the specific uprobe. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
If a uprobe consumer changes the instruction pointer, we still execute (single step or emulate) the original instruction and increment the ip register by the size of the instruction. In case the instruction is emulated, the new ip register value is incremented by the instruction's size and the process is likely to crash with an illegal instruction. In case the instruction is single stepped, the ip register change is lost and the process continues with the original ip register value. If the user decided to take execution elsewhere, it makes little sense to execute the original instruction, so let's skip it. Allow this only for a uprobe with a unique consumer attached. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
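A hedged sketch of the decision this describes; the names are illustrative, not the actual uprobe handler code:

  #include <linux/types.h>

  static bool skip_orig_insn_sketch(unsigned long ip_before,
                                    unsigned long ip_after_consumers,
                                    bool unique_consumer)
  {
          /* if the single (unique) consumer moved the instruction pointer,
           * executing or emulating the original instruction makes no sense */
          return unique_consumer && ip_after_consumers != ip_before;
  }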
Adding support to attach a unique probe through the perf uprobe pmu. Adding a new 'unique' format attribute that allows passing the request to create a unique uprobe to the uprobe consumer. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Adding support to attach a unique uprobe through the uprobe multi link interface. Adding a new BPF_F_UPROBE_MULTI_UNIQUE flag that denotes unique uprobe creation. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Currently a uprobe (BPF_PROG_TYPE_KPROBE) program can't write to the context registers data. While this makes sense for kprobe attachments, for uprobe attachments it might make sense to be able to change user space registers to alter application execution. Since uprobe and kprobe programs share the same type (BPF_PROG_TYPE_KPROBE), we can't deny write access to the context during program load; we need to check it during program attachment, when we know whether it's going to be a kprobe or uprobe. Store the program's write attempt to the context and check it during attachment. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
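A hedged sketch of the attach-time policy; the struct field and helper are assumptions made for illustration, not the actual bpf code:

  #include <linux/errno.h>
  #include <linux/types.h>

  struct bpf_prog_sketch {
          bool writes_ctx_regs;   /* recorded when the verifier sees a ctx write */
  };

  static int attach_check_sketch(struct bpf_prog_sketch *prog, bool uprobe_attach)
  {
          /* kprobe and uprobe share BPF_PROG_TYPE_KPROBE, so a register
           * write can only be rejected here, at attach time */
          if (prog->writes_ctx_regs && !uprobe_attach)
                  return -EINVAL;
          return 0;
  }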
Adding support to attach a unique uprobe multi by adding the 'unique' bool flag to struct bpf_uprobe_multi_opts. Also adding "uprobe.unique[.s]" auto load sections to create a unique uprobe_multi uprobe. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Adding a test to check that we can change common register values through a uprobe program. It's an x86_64 specific test. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Adding a test to check that we can change the application execution through an instruction pointer change in a uprobe program. It's an x86_64 specific test. Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
test