aboutsummaryrefslogtreecommitdiff
path: root/tools/perf
AgeCommit message (Collapse)AuthorFilesLines
2024-01-24perf evlist: Fix evlist__new_default() for > 1 core PMUJames Clark1-1/+8
The 'Session topology' test currently fails with this message when evlist__new_default() opens more than one event: 32: Session topology : --- start --- templ file: /tmp/perf-test-vv5YzZ Using CPUID 0x00000000410fd070 Opening: unknown-hardware:HG ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xb00000000 disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 4 Opening: unknown-hardware:HG ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 5 non matching sample_type FAILED tests/topology.c:73 can't get session ---- end ---- Session topology: FAILED! This is because when re-opening the file and parsing the header, Perf expects that any file that has more than one event has the sample ID flag set. Perf record already sets the flag in a similar way when there is more than one event, so add the same logic to evlist__new_default(). evlist__new_default() is only currently used in tests, so I don't expect this change to have any other side effects. The other tests that use it don't save and re-open the file so don't hit this issue. The session topology test has been failing on Arm big.LITTLE platforms since commit 251aa040244a ("perf parse-events: Wildcard most "numeric" events") when evlist__new_default() started opening multiple events for 'cycles'. Fixes: 251aa040244a ("perf parse-events: Wildcard most "numeric" events") Closes: https://lore.kernel.org/lkml/CAP-5=fWVQ-7ijjK3-w1q+k2WYVNHbAcejb-xY0ptbjRw476VKA@mail.gmail.com/ Tested-by: Ian Rogers <irogers@google.com> Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: James Clark <james.clark@arm.com> Cc: Changbin Du <changbin.du@huawei.com> Cc: Yang Jihong <yangjihong1@huawei.com> Link: https://lore.kernel.org/r/20240124094358.489372-1-james.clark@arm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24perf mem: Clean up perf_pmus__num_mem_pmus()Kan Liang7-19/+17
The number of mem PMUs can be calculated by searching the perf_pmus__scan_mem(). Remove the ARCH specific perf_pmus__num_mem_pmus() Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: ravi.bangoria@amd.com Cc: james.clark@arm.com Cc: will@kernel.org Cc: mike.leach@linaro.org Cc: renyu.zj@linux.alibaba.com Cc: yuhaixin.yhx@linux.alibaba.com Cc: tmricht@linux.ibm.com Cc: atrajeev@linux.vnet.ibm.com Cc: linux-arm-kernel@lists.infradead.org Cc: john.g.garry@oracle.com Link: https://lore.kernel.org/r/20240123185036.3461837-8-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24perf mem: Clean up perf_mem_events__record_args()Kan Liang4-53/+17
The current code iterates all memory PMUs. It doesn't matter if the system has only one memory PMU or multiple PMUs. The check of perf_pmus__num_mem_pmus() is not required anymore. The rec_tmp is not used in c2c and mem. Removing them as well. Suggested-by: Leo Yan <leo.yan@linaro.org> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: ravi.bangoria@amd.com Cc: james.clark@arm.com Cc: will@kernel.org Cc: mike.leach@linaro.org Cc: renyu.zj@linux.alibaba.com Cc: yuhaixin.yhx@linux.alibaba.com Cc: tmricht@linux.ibm.com Cc: atrajeev@linux.vnet.ibm.com Cc: linux-arm-kernel@lists.infradead.org Cc: john.g.garry@oracle.com Link: https://lore.kernel.org/r/20240123185036.3461837-7-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24perf mem: Clean up is_mem_loads_aux_event()Kan Liang2-21/+16
The aux_event can be retrieved from the perf_pmu now. Implement a generic support. Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: james.clark@arm.com Cc: will@kernel.org Cc: mike.leach@linaro.org Cc: renyu.zj@linux.alibaba.com Cc: yuhaixin.yhx@linux.alibaba.com Cc: tmricht@linux.ibm.com Cc: atrajeev@linux.vnet.ibm.com Cc: linux-arm-kernel@lists.infradead.org Cc: john.g.garry@oracle.com Link: https://lore.kernel.org/r/20240123185036.3461837-6-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24perf mem: Clean up perf_mem_event__supported()Kan Liang5-29/+31
For some ARCHs, e.g., ARM and AMD, to get the availability of the mem-events, perf checks the existence of a specific PMU. For the other ARCHs, e.g., Intel and Power, perf has to check the existence of some specific events. The current perf only iterates the mem-events-supported PMUs. It's not required to check the existence of a specific PMU anymore. Rename sysfs_name to event_name, which stores the specific mem-events. Perf only needs to check those events for the availability of the mem-events. Rename perf_mem_event__supported to perf_pmu__mem_events_supported. Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: james.clark@arm.com Cc: will@kernel.org Cc: mike.leach@linaro.org Cc: renyu.zj@linux.alibaba.com Cc: yuhaixin.yhx@linux.alibaba.com Cc: tmricht@linux.ibm.com Cc: atrajeev@linux.vnet.ibm.com Cc: linux-arm-kernel@lists.infradead.org Cc: john.g.garry@oracle.com Link: https://lore.kernel.org/r/20240123185036.3461837-5-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24perf mem: Clean up perf_mem_events__name()Kan Liang10-107/+97
Introduce a generic perf_mem_events__name(). Remove the ARCH-specific one. The mem_load events may have a different format. Add ldlat and aux_event in the struct perf_mem_event to indicate the format and the extra aux event. Add perf_mem_events_intel_aux[] to support the extra mem_load_aux event. Rename perf_mem_events__name to perf_pmu__mem_events_name. Reviewed-by: Ian Rogers <irogers@google.com> Tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Tested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: james.clark@arm.com Cc: will@kernel.org Cc: mike.leach@linaro.org Cc: renyu.zj@linux.alibaba.com Cc: yuhaixin.yhx@linux.alibaba.com Cc: tmricht@linux.ibm.com Cc: atrajeev@linux.vnet.ibm.com Cc: linux-arm-kernel@lists.infradead.org Cc: john.g.garry@oracle.com Link: https://lore.kernel.org/r/20240123185036.3461837-4-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24perf mem: Clean up perf_mem_events__ptr()Kan Liang6-92/+104
The mem_events can be retrieved from the struct perf_pmu now. An ARCH specific perf_mem_events__ptr() is not required anymore. Remove all of them. The Intel hybrid has multiple mem-events-supported PMUs. But they share the same mem_events. Other ARCHs only support one mem-events-supported PMU. In the configuration, it's good enough to only configure the mem_events for one PMU. Add perf_mem_events_find_pmu() which returns the first mem-events-supported PMU. In the perf_mem_events__init(), the perf_pmus__scan() is not required anymore. It avoids checking the sysfs for every PMU on the system. Make the perf_mem_events__record_args() more generic. Remove the perf_mem_events__print_unsupport_hybrid(). Since pmu is added as a new parameter, rename perf_mem_events__ptr() to perf_pmu__mem_events_ptr(). Several other functions also do a similar rename. Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Tested-by: Kajol jain <kjain@linux.ibm.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: james.clark@arm.com Cc: will@kernel.org Cc: leo.yan@linaro.org Cc: mike.leach@linaro.org Cc: renyu.zj@linux.alibaba.com Cc: yuhaixin.yhx@linux.alibaba.com Cc: tmricht@linux.ibm.com Cc: atrajeev@linux.vnet.ibm.com Cc: linux-arm-kernel@lists.infradead.org Cc: john.g.garry@oracle.com Link: https://lore.kernel.org/r/20240123185036.3461837-3-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-24perf mem: Add mem_events into the supported perf_pmuKan Liang10-7/+44
With the mem_events, perf doesn't need to read sysfs for each PMU to find the mem-events-supported PMU. The patch also makes it possible to clean up the related __weak functions later. The patch is only to add the mem_events into the perf_pmu for all ARCHs. It will be used in the later cleanup patches. Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Tested-by: Leo Yan <leo.yan@linaro.org> Tested-by: Kajol Jain <kjain@linux.ibm.com> Suggested-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Cc: will@kernel.org Cc: mike.leach@linaro.org Cc: renyu.zj@linux.alibaba.com Cc: yuhaixin.yhx@linux.alibaba.com Cc: tmricht@linux.ibm.com Cc: atrajeev@linux.vnet.ibm.com Cc: linux-arm-kernel@lists.infradead.org Cc: john.g.garry@oracle.com Link: https://lore.kernel.org/r/20240123185036.3461837-2-kan.liang@linux.intel.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf sched: Commit to evsel__taskstate() to parse task state infoZe Gao1-36/+10
Now that we have evsel__taskstate() which no longer relies on the hardcoded task state string and has good backward compatibility, we have a good reason to use it. Note TASK_STATE_TO_CHAR_STR and task bitmasks are useless now so we remove them for good. And now we pass the state info back and forth in a symbolic char which explains itself well instead. Signed-off-by: Ze Gao <zegao@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20240123022425.1611483-1-zegao@tencent.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf util: Add evsel__taskstate() to parse the task state info insteadZe Gao2-1/+36
Now that we have the __prinf_flags() parsing routines, we add a new helper evsel__taskstate() to extract the task state info from the recorded data. Signed-off-by: Ze Gao <zegao@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20240122070859.1394479-5-zegao@tencent.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf util: Add helpers to parse task state string from libtraceeventZe Gao1-0/+112
Perf uses a hard coded string "RSDTtXZPI" to index the sched_switch prev_state field raw bitmask value. This works well except for when the kernel changes this string, in which case this will break again. Instead we add a new way to parse task state string from tracepoint print format already recorded by perf, which eliminates the further dependencies with this hardcode and unmaintainable macro, and this is exactly what libtraceevent[1] does for now. So we borrow the print flags parsing logic from libtraceevent[1]. And in get_states(), we walk the print arguments until the __print_flags() for the target state field is found, and use that to build the states string for future parsing. [1]: https://lore.kernel.org/linux-trace-devel/20231224140732.7d41698d@rorschach.local.home/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Ze Gao <zegao@tencent.com> Link: https://lore.kernel.org/r/20240122070859.1394479-4-zegao@tencent.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf sched: Sync state char array with the kernelZe Gao1-12/+1
Update state char array to match the latest kernel definitions and remove unused state mapping macros. Note this is the preparing patch for get rid of the way to parse process state from raw bitmask value. Instead we are going to parse it from the recorded tracepoint print format, and this change marks why we're doing it. Signed-off-by: Ze Gao <zegao@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20240122070859.1394479-3-zegao@tencent.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf data: Minor code style alignment cleanupYang Jihong3-10/+11
Minor code style alignment cleanup for perf_data__switch() and perf_data__write(). No functional change. Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240119040304.3708522-4-yangjihong1@huawei.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf record: Check conflict between '--timestamp-filename' option and pipe ↵Yang Jihong2-2/+5
mode before recording In pipe mode, no need to switch perf data output, therefore, '--timestamp-filename' option should not take effect. Check the conflict before recording and output WARNING. In this case, the check pipe mode in perf_data__switch() can be removed. Before: # perf record --timestamp-filename -o- perf test -w noploop | perf report -i- --percent-limit=1 # To display the perf.data header info, please use --header/--header-only options. # [ perf record: Woken up 1 times to write data ] [ perf record: Dump -.2024011812110182 ] # # Total Lost Samples: 0 # # Samples: 4K of event 'cycles:P' # Event count (approx.): 2176784359 # # Overhead Command Shared Object Symbol # ........ ....... .................... ...................................... # 97.83% perf perf [.] noploop # # (Tip: Print event counts in CSV format with: perf stat -x,) # After: # perf record --timestamp-filename -o- perf test -w noploop | perf report -i- --percent-limit=1 WARNING: --timestamp-filename option is not available in pipe mode. # To display the perf.data header info, please use --header/--header-only options. # [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.000 MB - ] # # Total Lost Samples: 0 # # Samples: 4K of event 'cycles:P' # Event count (approx.): 2185575421 # # Overhead Command Shared Object Symbol # ........ ....... ..................... ............................................. # 97.75% perf perf [.] noploop # # (Tip: Profiling branch (mis)predictions with: perf record -b / perf report) # Fixes: ecfd7a9c044e ("perf record: Add '--timestamp-filename' option to append timestamp to output file name") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240119040304.3708522-3-yangjihong1@huawei.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf record: Fix possible incorrect free in record__switch_output()Yang Jihong1-1/+1
perf_data__switch() may not assign a legal value to 'new_filename'. In this case, 'new_filename' uses the on-stack value, which may cause a incorrect free and unexpected result. Fixes: 03724b2e9c45 ("perf record: Allow to limit number of reported perf.data files") Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240119040304.3708522-2-yangjihong1@huawei.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf dwarf-aux: Check allowed DWARF OpsNamhyung Kim1-4/+40
The DWARF location expression can be fairly complex and it'd be hard to match it with the condition correctly. So let's be conservative and only allow simple expressions. For now it just checks the first operation in the list. The following operations looks ok: * DW_OP_stack_value * DW_OP_deref_size * DW_OP_deref * DW_OP_piece To refuse complex (and unsupported) location expressions, add check_allowed_ops() to compare the rest of the list. It seems earlier result contained those unsupported expressions. For example, I found some local struct variable is placed like below. <2><43d1517>: Abbrev Number: 62 (DW_TAG_variable) <43d1518> DW_AT_location : 15 byte block: 91 50 93 8 91 78 93 4 93 84 8 91 68 93 4 (DW_OP_fbreg: -48; DW_OP_piece: 8; DW_OP_fbreg: -8; DW_OP_piece: 4; DW_OP_piece: 1028; DW_OP_fbreg: -24; DW_OP_piece: 4) Another example is something like this. 0057c8be ffffffffffffffff ffffffff812109f0 (base address) 0057c8ce ffffffff812112b5 ffffffff812112c8 (DW_OP_breg3 (rbx): 0; DW_OP_constu: 18446744073709551612; DW_OP_and; DW_OP_stack_value) It should refuse them. After the change, the stat shows: Annotate data type stats: total 294, ok 158 (53.7%), bad 136 (46.3%) ----------------------------------------------------------- 30 : no_sym 32 : no_mem_ops 53 : no_var 14 : no_typeinfo 7 : bad_offset Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/20240117062657.985479-10-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf annotate-data: Support stack variablesNamhyung Kim3-24/+93
Local variables are allocated in the stack and the location list should look like base register(s) and an offset. Extend the die_find_variable_by_reg() to handle the following expressions * DW_OP_breg{0..31} * DW_OP_bregx * DW_OP_fbreg Ususally DWARF subprogram entries have frame base information and use it to locate stack variable like below: <2><43d1575>: Abbrev Number: 62 (DW_TAG_variable) <43d1576> DW_AT_location : 2 byte block: 91 7c (DW_OP_fbreg: -4) <--- here <43d1579> DW_AT_name : (indirect string, offset: 0x2c00c9): i <43d157d> DW_AT_decl_file : 1 <43d157e> DW_AT_decl_line : 78 <43d157f> DW_AT_type : <0x43d19d7> I found some differences on saving the frame base between gcc and clang. The gcc uses the CFA to get the base so it needs to check the current frame's CFI info. In this case, stack offset needs to be adjusted from the start of the CFA. <1><1bb8d>: Abbrev Number: 102 (DW_TAG_subprogram) <1bb8e> DW_AT_name : (indirect string, offset: 0x74d41): kernel_init <1bb92> DW_AT_decl_file : 2 <1bb92> DW_AT_decl_line : 1440 <1bb94> DW_AT_decl_column : 18 <1bb95> DW_AT_prototyped : 1 <1bb95> DW_AT_type : <0xcc> <1bb99> DW_AT_low_pc : 0xffffffff81bab9e0 <1bba1> DW_AT_high_pc : 0x1b2 <1bba9> DW_AT_frame_base : 1 byte block: 9c (DW_OP_call_frame_cfa) <------ here <1bbab> DW_AT_call_all_calls: 1 <1bbab> DW_AT_sibling : <0x1bf5a> While clang sets it to a register directly and it can check the register and offset in the instruction directly. <1><43d1542>: Abbrev Number: 60 (DW_TAG_subprogram) <43d1543> DW_AT_low_pc : 0xffffffff816a7c60 <43d154b> DW_AT_high_pc : 0x98 <43d154f> DW_AT_frame_base : 1 byte block: 56 (DW_OP_reg6 (rbp)) <---------- here <43d1551> DW_AT_GNU_all_call_sites: 1 <43d1551> DW_AT_name : (indirect string, offset: 0x3bce91): foo <43d1555> DW_AT_decl_file : 1 <43d1556> DW_AT_decl_line : 75 <43d1557> DW_AT_prototyped : 1 <43d1557> DW_AT_type : <0x43c7332> <43d155b> DW_AT_external : 1 Also it needs to update the offset after finding the type like global variables since the offset was from the frame base. Factor out match_var_offset() to check global and local variables in the same way. The type stats are improved too: Annotate data type stats: total 294, ok 160 (54.4%), bad 134 (45.6%) ----------------------------------------------------------- 30 : no_sym 32 : no_mem_ops 51 : no_var 14 : no_typeinfo 7 : bad_offset Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-9-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf dwarf-aux: Add die_get_cfa()Namhyung Kim2-1/+82
The die_get_cfa() is to get frame base register and offset at the given instruction address (pc). This info will be used to locate stack variables which have location expression using DW_OP_fbreg. Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-8-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf annotate-data: Support global variablesNamhyung Kim4-13/+92
Global variables are accessed using PC-relative address so it needs to be handled separately. The PC-rel addressing is detected by using DWARF_REG_PC. On x86, %rip register would be used. The address can be calculated using the ip and offset in the instruction. But it should start from the next instruction so add calculate_pcrel_addr() to do it properly. But global variables defined in a different file would only have a declaration which doesn't include a location list. So it first tries to get the type info using the address, and then looks up the variable declarations using name. The name of global variables should be get from the symbol table. The declaration would have the type info. So extend find_var_type() to take both address and name for global variables. The stat is now looks like: Annotate data type stats: total 294, ok 153 (52.0%), bad 141 (48.0%) ----------------------------------------------------------- 30 : no_sym 32 : no_mem_ops 61 : no_var 10 : no_typeinfo 8 : bad_offset Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-7-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf annotate-data: Handle PC-relative addressingNamhyung Kim1-18/+38
Extend find_data_type_die() to find data type from PC-relative address using die_find_variable_by_addr(). Users need to pass the address for the (global) variable. The offset for the variable should be updated after finding the type because the offset in the instruction is just to calcuate the address for the variable. So it changed to pass a pointer to offset and renamed it to 'poffset'. First it searches variables in the CU DIE as it's likely that the global variables are defined in the file level. And then it iterates the scope DIEs to find a local (static) variable. Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-6-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf annotate-data: Add stack operation pseudo typeNamhyung Kim2-0/+27
A typical function prologue and epilogue include multiple stack operations to save and restore the current value of registers. On x86, it looks like below: push r15 push r14 push r13 push r12 ... pop r12 pop r13 pop r14 pop r15 ret As these all touches the stack memory region, chances are high that they appear in a memory profile data. But these are not used for any real purpose yet so it'd return no types. One of my profile type shows that non neglible portion of data came from the stack operations. It also seems GCC generates more stack operations than clang. Annotate Instruction stats total 264, ok 169 (64.0%), bad 95 (36.0%) Name : Good Bad ----------------------------------------------------------- movq : 49 27 movl : 24 9 popq : 0 19 <-- here cmpl : 17 2 addq : 14 1 cmpq : 12 2 cmpxchgl : 3 7 Instead of dealing them as unknown, let's create a seperate pseudo type to represent those stack operations separately. Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-5-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf annotate-data: Handle array style accessesNamhyung Kim4-19/+61
On x86, instructions for array access often looks like below. mov 0x1234(%rax,%rbx,8), %rcx Usually the first register holds the type information and the second one has the index. And the current code only looks up a variable for the first register. But it's possible to be in the other way around so it needs to check the second register if the first one failed. The stat changed like this. Annotate data type stats: total 294, ok 148 (50.3%), bad 146 (49.7%) ----------------------------------------------------------- 30 : no_sym 32 : no_mem_ops 66 : no_var 10 : no_typeinfo 8 : bad_offset Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-4-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf annotate-data: Handle macro fusion on x86Namhyung Kim1-1/+16
When a sample was come from a conditional branch without a memory operand, it could be due to a macro fusion with a previous instruction. So it needs to check the memory operand in the previous one. This improves the stat like below: Annotate data type stats: total 294, ok 147 (50.0%), bad 147 (50.0%) ----------------------------------------------------------- 30 : no_sym 32 : no_mem_ops 71 : no_var 6 : no_typeinfo 8 : bad_offset Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-3-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf annotate-data: Parse 'lock' prefix from llvm-objdumpNamhyung Kim1-1/+13
For the performance reason, I prefer llvm-objdump over GNU's. But I found that llvm-objdump puts x86 lock prefix in a separate line like below. ffffffff81000695: f0 lock ffffffff81000696: ff 83 54 0b 00 00 incl 2900(%rbx) This should be parsed properly, but I just changed to find the insn with next offset for now. This improves the statistics as it can process more instructions. Annotate data type stats: total 294, ok 144 (49.0%), bad 150 (51.0%) ----------------------------------------------------------- 30 : no_sym 35 : no_mem_ops 71 : no_var 6 : no_typeinfo 8 : bad_offset Reviewed-by: Ian Rogers <irogers@google.com> Cc: Stephane Eranian <eranian@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20240117062657.985479-2-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf build: Check whether pkg-config is installed when libtraceevent is linkedYang Jihong1-0/+6
If pkg-config is not installed when libtraceevent is linked, the build fails. The error information is as follows: $ make <SNIP> In file included from /home/yjh/projects_linux/perf-tool-next/linux/tools/perf/util/evsel.c:43: /home/yjh/projects_linux/perf-tool-next/linux/tools/perf/util/trace-event.h:149:62: error: operator '&&' has no right operand 149 | #if defined(LIBTRACEEVENT_VERSION) && LIBTRACEEVENT_VERSION >= MAKE_LIBTRACEEVENT_VERSION(1, 5, 0) | ^~ error: command '/usr/bin/gcc' failed with exit code 1 cp: cannot stat 'python_ext_build/lib/perf*.so': No such file or directory make[2]: *** [Makefile.perf:668: python/perf.cpython-310-x86_64-linux-gnu.so] Error 1 make[2]: *** Waiting for unfinished jobs.... Because pkg-config is not installed, fail to get libtraceevent version in Makefile.config file. As a result, LIBTRACEEVENT_VERSION is empty. However, the preceding error information is not user-friendly. Identify errors in advance by checking that pkg-config is installed at compile time. The build results of various scenarios are as follows: 1. build successful when libtraceevent is not linked and pkg-config is not installed $ pkg-config --version -bash: /usr/bin/pkg-config: No such file or directory $ make clean >/dev/null $ make NO_LIBTRACEEVENT=1 >/dev/null Makefile.config:1133: No alternatives command found, you need to set JDIR= to point to the root of your Java directory PERF_VERSION = 6.7.rc6.gd988c9f511af $ echo $? 0 2. dummy pkg-config is missing when libtraceevent is linked $ pkg-config --version -bash: /usr/bin/pkg-config: No such file or directory $ make clean >/dev/null $ make >/dev/null Makefile.config:221: *** Error: pkg-config needed by libtraceevent is missing on this system, please install it. Stop. make[1]: *** [Makefile.perf:251: sub-make] Error 2 make: *** [Makefile:70: all] Error 2 $ echo $? 2 3. build successful when libtraceevent is linked and pkg-config is installed $ pkg-config --version 0.29.2 $ make clean >/dev/null $ make >/dev/null Makefile.config:1133: No alternatives command found, you need to set JDIR= to point to the root of your Java directory PERF_VERSION = 6.7.rc6.gd988c9f511af $ echo $? 0 Signed-off-by: Yang Jihong <yangjihong1@huawei.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240112034019.3558584-1-yangjihong1@huawei.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-22perf test: raise limit to 20 percent for perf_stat_--bpf-counters_testThomas Richter1-6/+6
This test case often fails on s390 (about 2 out of 10) because the 10% percent limit on the difference between --bpf-counters event counting and s390 hardware counting is more than 10% in all failure cases. Raise the limit to 20% on s390 and the test case succeeds. Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: gor@linux.ibm.com Cc: hca@linux.ibm.com Cc: sumanthk@linux.ibm.com Cc: svens@linux.ibm.com Link: https://lore.kernel.org/r/20240108084009.3959211-1-tmricht@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2024-01-19Merge tag 'perf-tools-for-v6.8-1-2024-01-09' of ↵Linus Torvalds227-2027/+7652
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools Pull perf tools updates from Arnaldo Carvalho de Melo: "Add Namhyung Kim as tools/perf/ co-maintainer, we're taking turns processing patches, switching roles from perf-tools to perf-tools-next at each Linux release. Data profiling: - Associate samples that identify loads and stores with data structures. This uses events available on Intel, AMD and others and DWARF info: # To get memory access samples in kernel for 1 second (on Intel) $ perf mem record -a -K --ldlat=4 -- sleep 1 # Similar for the AMD (but it requires 6.3+ kernel for BPF filters) $ perf mem record -a --filter 'mem_op == load || mem_op == store, ip > 0x8000000000000000' -- sleep 1 Then, amongst several modes of post processing, one can do things like: $ perf report -s type,typeoff --hierarchy --group --stdio ... # # Samples: 10K of events 'cpu/mem-loads,ldlat=4/P, cpu/mem-stores/P, dummy:u' # Event count (approx.): 602758064 # # Overhead Data Type / Data Type Offset # ........................... ............................ # 26.09% 3.28% 0.00% long unsigned int 26.09% 3.28% 0.00% long unsigned int +0 (no field) 18.48% 0.73% 0.00% struct page 10.83% 0.02% 0.00% struct page +8 (lru.next) 3.90% 0.28% 0.00% struct page +0 (flags) 3.45% 0.06% 0.00% struct page +24 (mapping) 0.25% 0.28% 0.00% struct page +48 (_mapcount.counter) 0.02% 0.06% 0.00% struct page +32 (index) 0.02% 0.00% 0.00% struct page +52 (_refcount.counter) 0.02% 0.01% 0.00% struct page +56 (memcg_data) 0.00% 0.01% 0.00% struct page +16 (lru.prev) 15.37% 17.54% 0.00% (stack operation) 15.37% 17.54% 0.00% (stack operation) +0 (no field) 11.71% 50.27% 0.00% (unknown) 11.71% 50.27% 0.00% (unknown) +0 (no field) $ perf annotate --data-type ... Annotate type: 'struct cfs_rq' in [kernel.kallsyms] (13 samples): ============================================================================ samples offset size field 13 0 640 struct cfs_rq { 2 0 16 struct load_weight load { 2 0 8 unsigned long weight; 0 8 4 u32 inv_weight; }; 0 16 8 unsigned long runnable_weight; 0 24 4 unsigned int nr_running; 1 28 4 unsigned int h_nr_running; ... $ perf annotate --data-type=page --group Annotate type: 'struct page' in [kernel.kallsyms] (480 samples): event[0] = cpu/mem-loads,ldlat=4/P event[1] = cpu/mem-stores/P event[2] = dummy:u =================================================================================== samples offset size field 447 33 0 0 64 struct page { 108 8 0 0 8 long unsigned int flags; 319 13 0 8 40 union { 319 13 0 8 40 struct { 236 2 0 8 16 union { 236 2 0 8 16 struct list_head lru { 236 1 0 8 8 struct list_head* next; 0 1 0 16 8 struct list_head* prev; }; 236 2 0 8 16 struct { 236 1 0 8 8 void* __filler; 0 1 0 16 4 unsigned int mlock_count; }; 236 2 0 8 16 struct list_head buddy_list { 236 1 0 8 8 struct list_head* next; 0 1 0 16 8 struct list_head* prev; }; 236 2 0 8 16 struct list_head pcp_list { 236 1 0 8 8 struct list_head* next; 0 1 0 16 8 struct list_head* prev; }; }; 82 4 0 24 8 struct address_space* mapping; 1 7 0 32 8 union { 1 7 0 32 8 long unsigned int index; 1 7 0 32 8 long unsigned int share; }; 0 0 0 40 8 long unsigned int private; }; This uses the existing annotate code, calling objdump to do the disassembly, with improvements to avoid having this take too long, but longer term a switch to a disassembler library, possibly reusing code in the kernel will be pursued. This is the initial implementation, please use it and report impressions and bugs. Make sure the kernel-debuginfo packages match the running kernel. The 'perf report' phase for non short perf.data files may take a while. There is a great article about it on LWN: https://lwn.net/Articles/955709/ - "Data-type profiling for perf" One last test I did while writing this text, on a AMD Ryzen 5950X, using a distro kernel, while doing a simple 'find /' on an otherwise idle system resulted in: # uname -r 6.6.9-100.fc38.x86_64 # perf -vv | grep BPF_ bpf: [ on ] # HAVE_LIBBPF_SUPPORT bpf_skeletons: [ on ] # HAVE_BPF_SKEL # rpm -qa | grep kernel-debuginfo kernel-debuginfo-common-x86_64-6.6.9-100.fc38.x86_64 kernel-debuginfo-6.6.9-100.fc38.x86_64 # # perf mem record -a --filter 'mem_op == load || mem_op == store, ip > 0x8000000000000000' ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 2.199 MB perf.data (2913 samples) ] # # ls -la perf.data -rw-------. 1 root root 2346486 Jan 9 18:36 perf.data # perf evlist ibs_op// dummy:u # perf evlist -v ibs_op//: type: 11, size: 136, config: 0, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|ADDR|CPU|PERIOD|IDENTIFIER|DATA_SRC|WEIGHT, read_format: ID, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1 dummy:u: type: 1 (PERF_TYPE_SOFTWARE), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|ADDR|CPU|IDENTIFIER|DATA_SRC|WEIGHT, read_format: ID, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, mmap_data: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1 # # perf report -s type,typeoff --hierarchy --group --stdio # Total Lost Samples: 0 # # Samples: 2K of events 'ibs_op//, dummy:u' # Event count (approx.): 1904553038 # # Overhead Data Type / Data Type Offset # ................... ............................ # 73.70% 0.00% (unknown) 73.70% 0.00% (unknown) +0 (no field) 3.01% 0.00% long unsigned int 3.00% 0.00% long unsigned int +0 (no field) 0.01% 0.00% long unsigned int +2 (no field) 2.73% 0.00% struct task_struct 1.71% 0.00% struct task_struct +52 (on_cpu) 0.38% 0.00% struct task_struct +2104 (rcu_read_unlock_special.b.blocked) 0.23% 0.00% struct task_struct +2100 (rcu_read_lock_nesting) 0.14% 0.00% struct task_struct +2384 () 0.06% 0.00% struct task_struct +3096 (signal) 0.05% 0.00% struct task_struct +3616 (cgroups) 0.05% 0.00% struct task_struct +2344 (active_mm) 0.02% 0.00% struct task_struct +46 (flags) 0.02% 0.00% struct task_struct +2096 (migration_disabled) 0.01% 0.00% struct task_struct +24 (__state) 0.01% 0.00% struct task_struct +3956 (mm_cid_active) 0.01% 0.00% struct task_struct +1048 (cpus_ptr) 0.01% 0.00% struct task_struct +184 (se.group_node.next) 0.01% 0.00% struct task_struct +20 (thread_info.cpu) 0.00% 0.00% struct task_struct +104 (on_rq) 0.00% 0.00% struct task_struct +2456 (pid) 1.36% 0.00% struct module 0.59% 0.00% struct module +952 (kallsyms) 0.42% 0.00% struct module +0 (state) 0.23% 0.00% struct module +8 (list.next) 0.12% 0.00% struct module +216 (syms) 0.95% 0.00% struct inode 0.41% 0.00% struct inode +40 (i_sb) 0.22% 0.00% struct inode +0 (i_mode) 0.06% 0.00% struct inode +76 (i_rdev) 0.06% 0.00% struct inode +56 (i_security) <SNIP> perf top/report: - Don't ignore job control, allowing control+Z + bg to work. - Add s390 raw data interpretation for PAI (Processor Activity Instrumentation) counters. perf archive: - Add new option '--all' to pack perf.data with DSOs. - Add new option '--unpack' to expand tarballs. Initialization speedups: - Lazily initialize zstd streams to save memory when not using it. - Lazily allocate/size mmap event copy. - Lazy load kernel symbols in 'perf record'. - Be lazier in allocating lost samples buffer in 'perf record'. - Don't synthesize BPF events when disabled via the command line (perf record --no-bpf-event). Assorted improvements: - Show note on AMD systems that the :p, :pp, :ppp and :P are all the same, as IBS (Instruction Based Sampling) is used and it is inherentely precise, not having levels of precision like in Intel systems. - When 'cycles' isn't available, fall back to the "task-clock" event when not system wide, not to 'cpu-clock'. - Add --debug-file option to redirect debug output, e.g.: $ perf --debug-file /tmp/perf.log record -v true - Shrink 'struct map' to under one cacheline by avoiding function pointers for selecting if addresses are identity or DSO relative, and using just a byte for some boolean struct members. - Resolve the arch specific strerrno just once to use in perf_env__arch_strerrno(). - Reduce memory for recording PERF_RECORD_LOST_SAMPLES event. Assorted fixes: - Fix the default 'perf top' usage on Intel hybrid systems, now it starts with a browser showing the number of samples for Efficiency (cpu_atom/cycles/P) and Performance (cpu_core/cycles/P). This behaviour is similar on ARM64, with its respective set of big.LITTLE processors. - Fix segfault on build_mem_topology() error path. - Fix 'perf mem' error on hybrid related to availability of mem event in a PMU. - Fix missing reference count gets (map, maps) in the db-export code. - Avoid recursively taking env->bpf_progs.lock in the 'perf_env' code. - Use the newly introduced maps__for_each_map() to add missing locking around iteration of 'struct map' entries. - Parse NOTE segments until the build id is found, don't stop on the first one, ELF files may have several such NOTE segments. - Remove 'egrep' usage, its deprecated, use 'grep -E' instead. - Warn first about missing libelf, not libbpf, that depends on libelf. - Use alternative to 'find ... -printf' as this isn't supported in busybox. - Address python 3.6 DeprecationWarning for string scapes. - Fix memory leak in uniq() in libsubcmd. - Fix man page formatting for 'perf lock' - Fix some spelling mistakes. perf tests: - Fail shell tests that needs some symbol in perf itself if it is stripped. These tests check if a symbol is resolved, if some hot function is indeed detected by profiling, etc. - The 'perf test sigtrap' test is currently failing on PREEMPT_RT, skip it if sleeping spinlocks are detected (using BTF) and point to the mailing list discussion about it. This test is also being skipped on several architectures (powerpc, s390x, arm and aarch64) due to other pending issues with intruction breakpoints. - Adjust test case perf record offcpu profiling tests for s390. - Fix 'Setup struct perf_event_attr' fails on s390 on z/VM guest, addressing issues caused by the fallback from cycles to task-clock done in this release. - Fix mask for VG register in the user-regs test. - Use shellcheck on 'perf test' shell scripts automatically to make sure changes don't introduce things it flags as problematic. - Add option to change objdump binary and allow it to be set via 'perf config'. - Add basic 'perf script', 'perf list --json" and 'perf diff' tests. - Basic branch counter support. - Make DSO tests a suite rather than individual. - Remove atomics from test_loop to avoid test failures. - Fix call chain match on powerpc for the record+probe_libc_inet_pton test. - Improve Intel hybrid tests. Vendor event files (JSON): powerpc: - Update datasource event name to fix duplicate events on IBM's Power10. - Add PVN for HX-C2000 CPU with Power8 Architecture. Intel: - Alderlake/rocketlake metric fixes. - Update emeraldrapids events to v1.02. - Update icelakex events to v1.23. - Update sapphirerapids events to v1.17. - Add skx, clx, icx and spr upi bandwidth metric. AMD: - Add Zen 4 memory controller events. RISC-V: - Add StarFive Dubhe-80 and Dubhe-90 JSON files. https://www.starfivetech.com/en/site/cpu-u - Add T-HEAD C9xx JSON file. https://github.com/riscv-software-src/opensbi/blob/master/docs/platform/thead-c9xx.md ARM64: - Remove UTF-8 characters from cmn.json, that were causing build failure in some distros. - Add core PMU events and metrics for Ampere One X. - Rename Ampere One's BPU_FLUSH_MEM_FAULT to GPC_FLUSH_MEM_FAULT libperf: - Rename several perf_cpu_map constructor names to clarify what they really do. - Ditto for some other methods, coping with some issues in their semantics, like perf_cpu_map__empty() -> perf_cpu_map__has_any_cpu_or_is_empty(). - Document perf_cpu_map__nr()'s behavior perf stat: - Exit if parse groups fails. - Combine the -A/--no-aggr and --no-merge options. - Fix help message for --metric-no-threshold option. Hardware tracing: ARM64 CoreSight: - Bump minimum OpenCSD version to ensure a bugfix is present. - Add 'T' itrace option for timestamp trace - Set start vm addr of exectable file to 0 and don't ignore first sample on the arm-cs-trace-disasm.py 'perf script'" * tag 'perf-tools-for-v6.8-1-2024-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (179 commits) MAINTAINERS: Add Namhyung as tools/perf/ co-maintainer perf test: test case 'Setup struct perf_event_attr' fails on s390 on z/vm perf db-export: Fix missing reference count get in call_path_from_sample() perf tests: Add perf script test libsubcmd: Fix memory leak in uniq() perf TUI: Don't ignore job control perf vendor events intel: Update sapphirerapids events to v1.17 perf vendor events intel: Update icelakex events to v1.23 perf vendor events intel: Update emeraldrapids events to v1.02 perf vendor events intel: Alderlake/rocketlake metric fixes perf x86 test: Add hybrid test for conflicting legacy/sysfs event perf x86 test: Update hybrid expectations perf vendor events amd: Add Zen 4 memory controller events perf stat: Fix hard coded LL miss units perf record: Reduce memory for recording PERF_RECORD_LOST_SAMPLES event perf env: Avoid recursively taking env->bpf_progs.lock perf annotate: Add --insn-stat option for debugging perf annotate: Add --type-stat option for debugging perf annotate: Support event group display perf annotate: Add --data-type option ...
2024-01-09Merge tag 'lsm-pr-20240105' of ↵Linus Torvalds4-0/+20
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull security module updates from Paul Moore: - Add three new syscalls: lsm_list_modules(), lsm_get_self_attr(), and lsm_set_self_attr(). The first syscall simply lists the LSMs enabled, while the second and third get and set the current process' LSM attributes. Yes, these syscalls may provide similar functionality to what can be found under /proc or /sys, but they were designed to support multiple, simultaneaous (stacked) LSMs from the start as opposed to the current /proc based solutions which were created at a time when only one LSM was allowed to be active at a given time. We have spent considerable time discussing ways to extend the existing /proc interfaces to support multiple, simultaneaous LSMs and even our best ideas have been far too ugly to support as a kernel API; after +20 years in the kernel, I felt the LSM layer had established itself enough to justify a handful of syscalls. Support amongst the individual LSM developers has been nearly unanimous, with a single objection coming from Tetsuo (TOMOYO) as he is worried that the LSM_ID_XXX token concept will make it more difficult for out-of-tree LSMs to survive. Several members of the LSM community have demonstrated the ability for out-of-tree LSMs to continue to exist by picking high/unused LSM_ID values as well as pointing out that many kernel APIs rely on integer identifiers, e.g. syscalls (!), but unfortunately Tetsuo's objections remain. My personal opinion is that while I have no interest in penalizing out-of-tree LSMs, I'm not going to penalize in-tree development to support out-of-tree development, and I view this as a necessary step forward to support the push for expanded LSM stacking and reduce our reliance on /proc and /sys which has occassionally been problematic for some container users. Finally, we have included the linux-api folks on (all?) recent revisions of the patchset and addressed all of their concerns. - Add a new security_file_ioctl_compat() LSM hook to handle the 32-bit ioctls on 64-bit systems problem. This patch includes support for all of the existing LSMs which provide ioctl hooks, although it turns out only SELinux actually cares about the individual ioctls. It is worth noting that while Casey (Smack) and Tetsuo (TOMOYO) did not give explicit ACKs to this patch, they did both indicate they are okay with the changes. - Fix a potential memory leak in the CALIPSO code when IPv6 is disabled at boot. While it's good that we are fixing this, I doubt this is something users are seeing in the wild as you need to both disable IPv6 and then attempt to configure IPv6 labeled networking via NetLabel/CALIPSO; that just doesn't make much sense. Normally this would go through netdev, but Jakub asked me to take this patch and of all the trees I maintain, the LSM tree seemed like the best fit. - Update the LSM MAINTAINERS entry with additional information about our process docs, patchwork, bug reporting, etc. I also noticed that the Lockdown LSM is missing a dedicated MAINTAINERS entry so I've added that to the pull request. I've been working with one of the major Lockdown authors/contributors to see if they are willing to step up and assume a Lockdown maintainer role; hopefully that will happen soon, but in the meantime I'll continue to look after it. - Add a handful of mailmap entries for Serge Hallyn and myself. * tag 'lsm-pr-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (27 commits) lsm: new security_file_ioctl_compat() hook lsm: Add a __counted_by() annotation to lsm_ctx.ctx calipso: fix memory leak in netlbl_calipso_add_pass() selftests: remove the LSM_ID_IMA check in lsm/lsm_list_modules_test MAINTAINERS: add an entry for the lockdown LSM MAINTAINERS: update the LSM entry mailmap: add entries for Serge Hallyn's dead accounts mailmap: update/replace my old email addresses lsm: mark the lsm_id variables are marked as static lsm: convert security_setselfattr() to use memdup_user() lsm: align based on pointer length in lsm_fill_user_ctx() lsm: consolidate buffer size handling into lsm_fill_user_ctx() lsm: correct error codes in security_getselfattr() lsm: cleanup the size counters in security_getselfattr() lsm: don't yet account for IMA in LSM_CONFIG_COUNT calculation lsm: drop LSM_ID_IMA LSM: selftests for Linux Security Module syscalls SELinux: Add selfattr hooks AppArmor: Add selfattr hooks Smack: implement setselfattr and getselfattr hooks ...
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDERKirill A. Shutemov1-1/+1
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-04perf test: test case 'Setup struct perf_event_attr' fails on s390 on z/vmThomas Richter1-1/+1
perf test 17 'Setup struct perf_event_attr' fails on s390 z/VM guest, using linux-next kernel. Root cause is the fall-back from hardware counter cycles perf_event_attr: type 0 (PERF_TYPE_HARDWARE) size 136 config 0 (PERF_COUNT_HW_CPU_CYCLES) { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|ADDR|PERIOD|DATA_SRC read_format ID|LOST which returns -ENOENT on s390 z/VM guest. This causes the code to fall back to software counter task-clock, as can be seen in the debug output: ------------------------------------------------------------ perf_event_attr: type 1 (PERF_TYPE_SOFTWARE) size 136 config 0x1 (PERF_COUNT_SW_TASK_CLOCK) <-here { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|ADDR|PERIOD|DATA_SRC read_format ID|LOST This succeeds on s390 z/VM guest. This successful installation of the counter task-clock is not listed in the expected results and the test case fails. This is caused by commit eb2eac0c7b618033 ("perf evsel: Fallback to "task-clock" when not system wide") which introduced fall back from event 'cycles' to event 'task-clock'. To fix this on s390 allow event number 0 (cycles) and event number 1 (task-clock) as expected result. Output before: # ./perf test -Fv 17 17: Setup struct perf_event_attr : --- start --- running './tests/attr/test-stat-group1' unsupp './tests/attr/test-stat-group1' running './tests/attr/test-record-graph-default' test limitation '!aarch64' excluded architecture list ['aarch64'] expected config=0, got 1 FAILED './tests/attr/test-record-graph-default' - match failure ---- end ---- Setup struct perf_event_attr: FAILED! # Output after: # ./perf test -F 17 17: Setup struct perf_event_attr : Ok # Fixes: eb2eac0c7b618033 ("perf evsel: Fallback to "task-clock" when not system wide") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Ian Rogers <irogers@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Sumanth Korikkar <sumanthk@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Link: https://lore.kernel.org/r/20231219143235.1075522-1-tmricht@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04perf db-export: Fix missing reference count get in call_path_from_sample()Ben Gainey1-2/+2
The addr_location map and maps fields in the inner loop were missing calls to map__get()/maps__get(). The subsequent addr_location__exit() call in each loop puts the map/maps fields causing use-after-free aborts. This issue reproduces on at least arm64 and x86_64 with something simple like `perf record -g ls` followed by `perf script -s script.py` with the following script: perf_db_export_mode = True perf_db_export_calls = False perf_db_export_callchains = True def sample_table(*args): print(f'sample_table({args})') def call_path_table(*args): print(f'call_path_table({args}') Committer testing: This test, just introduced by Ian Rogers, now passes, not segfaulting anymore: # perf test "perf script tests" 95: perf script tests : Ok # Fixes: 0dd5041c9a0eaf8c ("perf addr_location: Add init/exit/copy functions") Signed-off-by: Ben Gainey <ben.gainey@arm.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20231207140911.3240408-1-ben.gainey@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04perf tests: Add perf script testIan Rogers1-0/+66
Start a new set of shell tests for testing perf script. The initial contribution is checking that some perf db-export functionality works as reported in this regression by Ben Gainey <ben.gainey@arm.com>: https://lore.kernel.org/lkml/20231207140911.3240408-1-ben.gainey@arm.com/ Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Ben Gainey <ben.gainey@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20231207174057.1482161-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04perf TUI: Don't ignore job controlAhelenia Ziemiańska4-0/+26
In its infinite wisdom, by default, SLang sets susp undef, and this can only be un-done by calling SLtty_set_suspend_state(true). After every SLang_init_tty(). Additionally, no provisions are made for maintaining the teletype attributes across suspend/continue (outside of curses emulation mode(?!), which provides full support, naturally), so we need to save and restore the flags ourselves, as well as reset the text colours when going under. We need to also re-draw the screen, and raising SIGWINCH, shockingly, Just Works. The correct solution would be to Not Use SLang, but as a stop-gap, this makes TUI 'perf report' usable. Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Namhyung Kim <namhyung@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: yaowenbin <yaowenbin1@huawei.com> Link: https://lore.kernel.org/r/0354dcae23a8713f75f4fed609e0caec3c6e3cd5.1672174189.git.nabijaczleweli@nabijaczleweli.xyz Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04perf vendor events intel: Update sapphirerapids events to v1.17Ian Rogers5-9/+60
Update to v1.17 released in: https://github.com/intel/perfmon/pull/123 Add events FP_ARITH_DISPATCHED.V0, FP_ARITH_DISPATCHED.V1, FP_ARITH_DISPATCHED.V2, UNC_IIO_IOMMU0.1G_HITS, UNC_IIO_IOMMU0.2M_HITS and UNC_IIO_IOMMU0.4K_HITS. Description updates. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240104074259.653219-4-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04perf vendor events intel: Update icelakex events to v1.23Ian Rogers4-6/+6
Update to v1.23 released in: https://github.com/intel/perfmon/pull/123 Updates to event descriptions. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240104074259.653219-3-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04perf vendor events intel: Update emeraldrapids events to v1.02Ian Rogers5-25/+60
Update to v1.02 released in: https://github.com/intel/perfmon/pull/123 Removes events AMX_OPS_RETIRED.BF16 and AMX_OPS_RETIRED.INT8. Add events FP_ARITH_DISPATCHED.V0, FP_ARITH_DISPATCHED.V1, FP_ARITH_DISPATCHED.V2, UNC_IIO_IOMMU0.1G_HITS, UNC_IIO_IOMMU0.2M_HITS and UNC_IIO_IOMMU0.4K_HITS. Description updates. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240104074259.653219-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-04perf vendor events intel: Alderlake/rocketlake metric fixesIan Rogers2-8/+9
Fix that the core PMU is being specified for 2 uncore events. Specify a PMU for the alderlake UNCORE_FREQ metric. Conversion script updated in: https://github.com/intel/perfmon/pull/126 Committer testing: Before this patch the "perf all metricgroups test" was failing, now: root@number:~# perf test metric 10: PMU events : 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok 10.5: Parsing of metric thresholds with fake PMUs : Ok 61: Parse and process metrics : Ok 98: perf stat metrics (shadow stat) test : Skip 101: perf all metricgroups test : Ok 102: perf all metrics test : FAILED! 107: perf metrics value validation : Ok root@number:~# Test 102 is failing for another reason, not being able to get as many counters as needed, Ian Rogers suggested disabling the NMI watchdog to have more counters available: root@number:/home/acme# cat /proc/sys/kernel/nmi_watchdog 1 root@number:/home/acme# echo 0 > /proc/sys/kernel/nmi_watchdog root@number:/home/acme# perf test 102 102: perf all metrics test : Ok root@number:/home/acme# Closes: https://lore.kernel.org/lkml/ZZWOdHXJJ_oecWwm@kernel.org/ Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Edward Baker <edward.baker@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240104074259.653219-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-03perf x86 test: Add hybrid test for conflicting legacy/sysfs eventIan Rogers1-0/+23
The cpu-cycles event is both a legacy event and declared in /sys/devices/cpu_core/events/cpu-cycles. The cycles event is a legacy event but with no sysfs version. Add a test that the sysfs version is preferred to the legacy for cpu-cycles, while for cycles we use the legacy version. Suggested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20240103170159.1435753-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-03perf x86 test: Update hybrid expectationsIan Rogers1-7/+7
The legacy events cpu-cycles and instructions have sysfs event equivalents on x86 (see /sys/devices/cpu_core/events). As sysfs/JSON events are now higher in priority than legacy events this causes the hybrid test expectations not to be met. To fix this switch to legacy events that don't have sysfs versions, namely cpu-cycles becomes cycles and instructions becomes branches. Fixes: a24d9d9dc096fc0d ("perf parse-events: Make legacy events lower priority than sysfs/JSON") Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Closes: https://lore.kernel.org/lkml/ZYbm5L7tw7bdpDpE@kernel.org/ Link: https://lore.kernel.org/r/20240103170159.1435753-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-03perf vendor events amd: Add Zen 4 memory controller eventsSandipan Das3-0/+187
Make the jevents parser aware of the Unified Memory Controller (UMC) PMU and add events taken from Section 8.2.1 "UMC Performance Monitor Events" of the Processor Programming Reference (PPR) for AMD Family 19h Model 11h processors. The events capture UMC command activity such as CAS, ACTIVATE, PRECHARGE etc. while the metrics derive data bus utilization and memory bandwidth out of these events. Signed-off-by: Sandipan Das <sandipan.das@amd.com> Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ananth Narayan <ananth.narayan@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Stephane Eranian <eranian@google.com> Link: https://lore.kernel.org/r/e0d8a7e8ca8ee3e378d8029e80b456ac327d6419.1701238314.git.sandipan.das@amd.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-03perf stat: Fix hard coded LL miss unitsIan Rogers1-1/+1
Copy-paste error where LL cache misses are reported as l1i. Fixes: 0a57b910807ad163 ("perf stat: Use counts rather than saved_value") Suggested-by: Guillaume Endignoux <guillaumee@google.com> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Garry <john.g.garry@oracle.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20231211181242.1721059-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-03perf record: Reduce memory for recording PERF_RECORD_LOST_SAMPLES eventIan Rogers1-2/+4
Reduce from PERF_SAMPLE_MAX_SIZE to "sizeof(*lost) + session->machines.host.id_hdr_size". Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20231207021627.1322884-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2024-01-03perf env: Avoid recursively taking env->bpf_progs.lockIan Rogers5-32/+50
Add variants of perf_env__insert_bpf_prog_info(), perf_env__insert_btf() and perf_env__find_btf prefixed with __ to indicate the env->bpf_progs.lock is assumed held. Call these variants when the lock is held to avoid recursively taking it and potentially having a thread deadlock with itself. Fixes: f8dfeae009effc0b ("perf bpf: Show more BPF program info in print_bpf_prog_info()") Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ming Wang <wangming01@loongson.cn> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20231207014655.1252484-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-23perf annotate: Add --insn-stat option for debuggingNamhyung Kim3-0/+87
This is for a debugging purpose. It'd be useful to see per-instrucion level success/failure stats. $ perf annotate --data-type --insn-stat Annotate Instruction stats total 264, ok 143 (54.2%), bad 121 (45.8%) Name : Good Bad ----------------------------------------------------------- movq : 45 31 movl : 22 11 popq : 0 19 cmpl : 16 3 addq : 8 7 cmpq : 11 3 cmpxchgl : 3 7 cmpxchgq : 8 0 incl : 3 3 movzbl : 4 2 incq : 4 2 decl : 6 0 ... Committer notes: So these are about being able to find the type for accesses from these instructions, we should improve the naming, but it is for debugging, we can improve this later: @@ -3726,6 +3759,10 @@ struct annotated_data_type *hist_entry__get_data_type(struct hist_entry *he) continue; mem_type = find_data_type(ms, ip, op_loc->reg, op_loc->offset); + if (mem_type) + istat->good++; + else + istat->bad++; Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: linux-toolchains@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Link: https://lore.kernel.org/r/20231213001323.718046-18-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-23perf annotate: Add --type-stat option for debuggingNamhyung Kim5-7/+108
The --type-stat option is to be used with --data-type and to print detailed failure reasons for the data type annotation. $ perf annotate --data-type --type-stat Annotate data type stats: total 294, ok 116 (39.5%), bad 178 (60.5%) ----------------------------------------------------------- 30 : no_sym 40 : no_insn_ops 33 : no_mem_ops 63 : no_var 4 : no_typeinfo 8 : bad_offset Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: linux-toolchains@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Link: https://lore.kernel.org/r/20231213001323.718046-17-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-23perf annotate: Support event group displayNamhyung Kim1-12/+77
When events are grouped together, it'd be natural to show them at once like in other mode. Handle group leaders with members to collect the number of samples together and display like below: $ perf annotate --data-type --group ... Annotate type: 'struct page' in vmlinux (1 samples): event[0] = cpu/mem-loads,ldlat=30/P event[1] = cpu/mem-stores/P event[2] = dummy:u ============================================================================ samples offset size field 1 0 0 0 64 struct page { 0 0 0 0 8 long unsigned int flags; 0 0 0 8 40 union { 0 0 0 8 40 struct { 0 0 0 8 16 union { 0 0 0 8 16 struct list_head lru { 0 0 0 8 8 struct list_head* next; 0 0 0 16 8 struct list_head* prev; }; 0 0 0 8 16 struct { 0 0 0 8 8 void* __filler; 0 0 0 16 4 unsigned int mlock_count; }; 0 0 0 8 16 struct list_head buddy_list { 0 0 0 8 8 struct list_head* next; 0 0 0 16 8 struct list_head* prev; }; Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: linux-toolchains@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Link: https://lore.kernel.org/r/20231213001323.718046-16-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-23perf annotate: Add --data-type optionNamhyung Kim6-11/+118
Support data type annotation with new --data-type option. It internally uses type sort key to collect sample histogram for the type and display every members like below. $ perf annotate --data-type ... Annotate type: 'struct cfs_rq' in [kernel.kallsyms] (13 samples): ============================================================================ samples offset size field 13 0 640 struct cfs_rq { 2 0 16 struct load_weight load { 2 0 8 unsigned long weight; 0 8 4 u32 inv_weight; }; 0 16 8 unsigned long runnable_weight; 0 24 4 unsigned int nr_running; 1 28 4 unsigned int h_nr_running; ... For simplicity it prints the number of samples per field for now. But it should be easy to show the overhead percentage instead. The number at the outer struct is a sum of the numbers of the inner members. For example, struct cfs_rq got total 13 samples, and 2 came from the load (struct load_weight) and 1 from h_nr_running. Similarly, the struct load_weight got total 2 samples and they all came from the weight field. I've added two new flags in the symbol_conf for this. The annotate_data_member is to get the members of the type. This is also needed for perf report with typeoff sort key. The annotate_data_sample is to update sample stats for each offset and used only in annotate. Currently it only support stdio output mode, TUI support can be added later. Committer testing: With the perf.data from the previous csets, a very simple, short duration one: # perf annotate --data-type Annotate type: 'struct list_head' in [kernel.kallsyms] (1 samples): ============================================================================ samples offset size field 1 0 16 struct list_head { 0 0 8 struct list_head* next; 1 8 8 struct list_head* prev; }; Annotate type: 'char' in [kernel.kallsyms] (1 samples): ============================================================================ samples offset size field 1 0 1 char ; # Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: linux-toolchains@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Link: https://lore.kernel.org/r/20231213001323.718046-15-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-23perf report: Add 'symoff' sort keyNamhyung Kim4-0/+50
The symoff sort key is to print symbol and offset of sample. This is useful for data type profiling to show exact instruction in the function which refers the data. $ perf report -s type,sym,typeoff,symoff --hierarchy ... # Overhead Data Type / Symbol / Data Type Offset / Symbol Offset # .............. ..................................................... # 1.23% struct cfs_rq 0.84% update_blocked_averages 0.19% struct cfs_rq +336 (leaf_cfs_rq_list.next) 0.19% [k] update_blocked_averages+0x96 0.19% struct cfs_rq +0 (load.weight) 0.14% [k] update_blocked_averages+0x104 0.04% [k] update_blocked_averages+0x31c 0.17% struct cfs_rq +404 (throttle_count) 0.12% [k] update_blocked_averages+0x9d 0.05% [k] update_blocked_averages+0x1f9 0.08% struct cfs_rq +272 (propagate) 0.07% [k] update_blocked_averages+0x3d3 0.02% [k] update_blocked_averages+0x45b ... Committer testing: # perf report --stdio -s type,typeoff,symoff # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 4 of event 'cpu_atom/mem-loads,ldlat=30/P' # Event count (approx.): 7 # # Overhead Data Type Data Type Offset Symbol Offset # ........ ......... ................ ............. # 42.86% struct list_head struct list_head +8 (prev) [k] __list_del_entry_valid_or_report+0x7 28.57% (unknown) (unknown) +0 (no field) [.] _nl_intern_locale_data+0x25 14.29% char char +0 (no field) [k] strncpy_from_user+0xa5 14.29% (unknown) (unknown) +0 (no field) [.] _dl_lookup_symbol_x+0x50 # # (Tip: To change sampling frequency to 100 Hz: perf record -F 100) # Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: linux-toolchains@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Link: https://lore.kernel.org/r/20231213001323.718046-14-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-12-23perf report: Add 'typeoff' sort keyNamhyung Kim5-1/+87
The typeoff sort key shows the data type name, offset and the name of the field. This is useful to see which field in the struct is accessed most frequently. $ perf report -s type,typeoff --hierarchy --stdio ... # Overhead Data Type / Data Type Offset # ............ ............................ # ... 1.23% struct cfs_rq 0.19% struct cfs_rq +404 (throttle_count) 0.19% struct cfs_rq +0 (load.weight) 0.19% struct cfs_rq +336 (leaf_cfs_rq_list.next) 0.09% struct cfs_rq +272 (propagate) 0.09% struct cfs_rq +196 (removed.nr) 0.09% struct cfs_rq +80 (curr) 0.09% struct cfs_rq +544 (lt_b_children_throttled) 0.06% struct cfs_rq +320 (rq) Committer testing: Again with the perf.data from the previous csets: # perf report --stdio -s type,typeoff # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 4 of event 'cpu_atom/mem-loads,ldlat=30/P' # Event count (approx.): 7 # # Overhead Data Type Data Type Offset # ........ ......... ................ # 42.86% struct list_head struct list_head +8 (prev) 42.86% (unknown) (unknown) +0 (no field) 14.29% char char +0 (no field) # # (Tip: To see callchains in a more compact form: perf report -g folded) # # perf report --stdio -s dso,type,typeoff # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 4 of event 'cpu_atom/mem-loads,ldlat=30/P' # Event count (approx.): 7 # # Overhead Shared Object Data Type Data Type Offset # ........ .................... ......... ................ # 42.86% [kernel.kallsyms] struct list_head struct list_head +8 (prev) 28.57% libc.so.6 (unknown) (unknown) +0 (no field) 14.29% [kernel.kallsyms] char char +0 (no field) 14.29% ld-linux-x86-64.so.2 (unknown) (unknown) +0 (no field) # # (Tip: If you have debuginfo enabled, try: perf report -s sym,srcline) # # Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: linux-toolchains@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Link: https://lore.kernel.org/r/20231213001323.718046-13-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>