path: root/kernel
2016-02-12  cgroup: provide cgroup_no_v1= to disable controllers in v1 mounts  (Johannes Weiner, 1 file, -1/+42)
Testing cgroup2 can be painful with system software automatically mounting and populating all cgroup controllers in v1 mode. Sometimes they can be unmounted from rc.local, sometimes even that is too late. Provide a commandline option to disable certain controllers in v1 mounts, so that they remain available for cgroup2 mounts.

Example use:

  cgroup_no_v1=memory,cpu
  cgroup_no_v1=all

Disabling will be confirmed at boot-time as such:

  [ 0.013770] Disabling cpu control group subsystem in v1 mounts
  [ 0.016004] Disabling memory control group subsystem in v1 mounts

Signed-off-by: Johannes Weiner <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2016-02-11  mm: fix pfn_t vs highmem  (Dan Williams, 1 file, -1/+1)
The pfn_t type uses an unsigned long to store a pfn + flags value. On a 64-bit platform the upper 12 bits of an unsigned long are never used for storing the value of a pfn. However, this is not true on highmem platforms, where all 32 bits of a pfn value are used to address a 44-bit physical address space. A pfn_t needs to store a 64-bit value. Link: https://bugzilla.kernel.org/show_bug.cgi?id=112211 Fixes: 01c8f1c44b83 ("mm, dax, gpu: convert vm_insert_mixed to pfn_t") Signed-off-by: Dan Williams <[email protected]> Reported-by: Stuart Foster <[email protected]> Reported-by: Julian Margetson <[email protected]> Tested-by: Julian Margetson <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2016-02-11  kernel/locking/lockdep.c: convert hash tables to hlists  (Andrew Morton, 1 file, -23/+19)
Mike said:

: CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
: kernel with CONFIG_UBSAN_ALIGNMENT fails to load without even any error
: message.
:
: The problem is that ubsan callbacks use spinlocks and might be called
: before lockdep is initialized. Particularly this line in the
: reserve_ebda_region function causes problem:
:
: lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
:
: If i put lockdep_init() before reserve_ebda_region call in
: x86_64_start_reservations kernel loads well.

Fix this ordering issue permanently: change lockdep so that it uses hlists for the hash tables. Unlike a list_head, an hlist_head is in its initialized state when it is all-zeroes, so lockdep is ready for operation immediately upon boot - lockdep_init() need not have run. The patch will also save some memory. lockdep_init() and lockdep_initialized can be done away with now - a 4.6 patch has been prepared to do this.

Reported-by: Mike Krinkin <[email protected]> Suggested-by: Mike Krinkin <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
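The property being relied on is visible straight from the two list types; definitions as in include/linux/types.h:

  struct list_head {                       /* empty state: next = prev = &head, */
          struct list_head *next, *prev;   /* so a static table needs init      */
  };

  struct hlist_head {                      /* empty state: first = NULL, which  */
          struct hlist_node *first;        /* is exactly a zeroed static table  */
  };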
2016-02-11  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds, 1 file, -1/+1)
Pull networking fixes from David Miller:
 1) Fix BPF handling of branch offset adjustments on backjumps, from Daniel Borkmann.
 2) Make sure selinux knows about SOCK_DESTROY netlink messages, from Lorenzo Colitti.
 3) Fix openvswitch tunnel mtu regression, from David Wragg.
 4) Fix ICMP handling of TCP sockets in syn_recv state, from Eric Dumazet.
 5) Fix SCTP user hmacid byte ordering bug, from Xin Long.
 6) Fix recursive locking in ipv6 addrconf, from Subash Abhinov Kasiviswanathan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  bpf: fix branch offset adjustment on backjumps after patching ctx expansion
  vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
  geneve: Relax MTU constraints
  vxlan: Relax MTU constraints
  flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
  of: of_mdio: Add marvell,88e1145 to whitelist of PHY compatibilities.
  selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables
  sctp: translate network order to host order when users get a hmacid
  enic: increment devcmd2 result ring in case of timeout
  tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
  net:Add sysctl_max_skb_frags
  tcp: do not drop syn_recv on all icmp reports
  ipv6: fix a lockdep splat
  unix: correctly track in-flight fds in sending process user_struct
  update be2net maintainers' email addresses
  dwc_eth_qos: Reset hardware before PHY start
  ipv6: addrconf: Fix recursive spin lock call
2016-02-11  PM / suspend: replacing printk  (Saurabh Sengar, 1 file, -3/+3)
Replace printk() calls with the appropriate pr_info() and pr_err() helpers in order to fix checkpatch.pl warnings. Signed-off-by: Saurabh Sengar <[email protected]> Acked-by: Pavel Machek <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-02-11  PM/freezer: y2038, use boottime to compare tstamps  (Abhilash Jindal, 1 file, -7/+5)
Wall time obtained from do_gettimeofday gives a 32-bit timeval, which can only represent time until January 2038. This patch moves to ktime_t, a 64-bit time representation. Also, wall time is susceptible to sudden jumps due to the user setting the time or due to NTP. Boot time is constantly increasing and better suited for subtracting two timestamps. Signed-off-by: Abhilash Jindal <[email protected]> Acked-by: Pavel Machek <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
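A sketch of the pattern the patch moves to (illustrative, not the verbatim diff):

  ktime_t start, end;

  start = ktime_get_boottime();   /* 64-bit, monotonic, immune to settimeofday/NTP jumps */
  /* ... freeze/thaw work ... */
  end = ktime_get_boottime();
  elapsed_msecs = ktime_to_ms(ktime_sub(end, start));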
2016-02-11  Merge branch 'pm-opp' into pm-cpufreq  (Rafael J. Wysocki, 3 files, -10/+23)
2016-02-10  bpf: fix branch offset adjustment on backjumps after patching ctx expansion  (Daniel Borkmann, 1 file, -1/+1)
When ctx access is used, the kernel often needs to expand/rewrite instructions, so after that patching, branch offsets have to be adjusted for both forward and backward jumps in the new eBPF program, but for backward jumps it fails to account for the delta. Meaning, for example, if the expansion happens exactly on the insn that sits at the jump target, it doesn't fix up the back jump offset.

Analysis of what the check in adjust_branches() is currently doing:

  /* adjust offset of jmps if necessary */
  if (i < pos && i + insn->off + 1 > pos)
          insn->off += delta;
  else if (i > pos && i + insn->off + 1 < pos)
          insn->off -= delta;

First condition (forward jumps):

  Before:                       After:
  insns[0]                      insns[0]
  insns[1] <--- i/insn          insns[1] <--- i/insn
  insns[2] <--- pos             insns[P] <--- pos
  insns[3]                      insns[P] `------| delta
  insns[4] <--- target_X        insns[P]  `-----|
  insns[5]                      insns[3]
                                insns[4] <--- target_X
                                insns[5]

The first case is if we cross the pos boundary and the jump instruction was before pos. This is handled correctly. I.e. if i == pos, then this would mean our jump that we currently check was the patchlet itself that we just injected. Since such patchlets are self-contained and have no awareness of any insns before or after the patched one, the delta is correctly not adjusted. Also, for the second part of the condition, i + insn->off + 1 == pos means we jump to that newly patched instruction, so no offset adjustment is needed. That part is correct.

Second condition (backward jumps):

  Before:                        After:
  insns[0]                       insns[0]
  insns[1] <--- target_X         insns[1] <--- target_X
  insns[2] <--- pos <-- target_Y insns[P] <--- pos <-- target_Y
  insns[3]                       insns[P] `------| delta
  insns[4] <--- i/insn           insns[P]  `-----|
  insns[5]                       insns[3]
                                 insns[4] <--- i/insn
                                 insns[5]

The second interesting case is where we cross the pos boundary and the jump instruction is after pos. A backward jump with i == pos would be impossible and pose a bug somewhere in the patchlet, so the first condition checking i > pos is okay by itself. However, i + insn->off + 1 < pos does not always work as intended to trigger the adjustment. It works when jump targets are far off and the delta doesn't matter. But, for example, where the fixed insn->off before pointed to pos (target_Y), it now points to pos + delta, so that additional room needs to be taken into account for the check. This means that i) both tests here need to be adjusted to pos + delta, and ii) for the second condition, the test needs to be <=, as pos itself can be a target in the backjump, too.

Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields") Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: David S. Miller <[email protected]>
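Put together, the fixed check shifts both backward-jump comparisons by delta and makes the target test inclusive; a sketch of the corrected logic:

  /* adjust offset of jmps if necessary */
  if (i < pos && i + insn->off + 1 > pos)
          insn->off += delta;
  else if (i > pos + delta && i + insn->off + 1 <= pos + delta)
          insn->off -= delta;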
2016-02-10  Merge branch 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds, 2 files, -30/+72)
Pull cgroup fixes from Tejun Heo:
 - The destruction path of cgroup objects is asynchronous and multi-staged, and some of them ended up destroying parents before children, leading to failures in the cpu and memory controllers. Ensure that parents are always destroyed after children.
 - cpuset mm node migration was performed synchronously while holding threadgroup and cgroup mutexes, and the recent threadgroup locking update resulted in a possible deadlock. The migration is best effort and shouldn't have been performed under those locks to begin with. Made asynchronous.
 - Minor documentation fix.

* 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Documentation: cgroup: Fix 'cgroup-legacy' -> 'cgroup-v1'
  cgroup: make sure a parent css isn't freed before its children
  cgroup: make sure a parent css isn't offlined before its children
  cpuset: make mm migration asynchronous
2016-02-10  Merge branch 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds, 1 file, -7/+67)
Pull workqueue fixes from Tejun Heo: "Workqueue fixes for v4.5-rc3.
 - Remove a spurious triggering of flush dependency warning.
 - Officially break local execution guarantee of unbound work items and add a debug feature to flush out usages which depend on it.
 - Work around CPU -> NODE mapping becoming invalid on CPU offline.
The branch is young but pushing out early as stable kernels are being affected"

* 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
  workqueue: implement "workqueue.debug_force_rr_cpu" debug feature
  workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs
  Revert "workqueue: make sure delayed work run in local cpu"
  workqueue: skip flush dependency checks for legacy workqueues
2016-02-10  workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup  (Tejun Heo, 1 file, -0/+10)
When looking up the pool_workqueue to use for an unbound workqueue, workqueue assumes that the target CPU is always bound to a valid NUMA node. However, currently, when a CPU goes offline, the mapping is destroyed and cpu_to_node() returns NUMA_NO_NODE. This has always been broken but hasn't triggered often enough before 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu"). After the commit, workqueue forcefully assigns the local CPU for delayed work items without an explicit target CPU, to fix a different issue. This widens the window in which a CPU can go offline while a delayed work item is pending, causing delayed work items to be dispatched with their target CPU set to an already offlined CPU. The resulting NUMA_NO_NODE mapping makes workqueue try to queue the work item on a NULL pool_workqueue and thus crash. While 874bbfe600a6 has been reverted for a different reason, making the bug less visible again, it can still happen. Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node(). This is a temporary workaround. The long term solution is keeping the CPU -> NODE mapping stable across CPU off/online cycles, which is being worked on. Signed-off-by: Tejun Heo <[email protected]> Reported-by: Mike Galbraith <[email protected]> Cc: Tang Chen <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Len Brown <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/g/[email protected] Link: http://lkml.kernel.org/g/[email protected]
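The workaround amounts to one early return in the lookup; a sketch based on the description above:

  static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
                                                    int node)
  {
          /*
           * The CPU -> NODE mapping can be gone for an offlined CPU, in
           * which case cpu_to_node() handed us NUMA_NO_NODE.  Fall back
           * to the default pwq instead of indexing numa_pwq_tbl[] out of
           * range.
           */
          if (unlikely(node == NUMA_NO_NODE))
                  return wq->dfl_pwq;

          return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
  }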
2016-02-09  Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux  (Linus Torvalds, 1 file, -45/+75)
Pull module fixes from Rusty Russell: "Fix for async_probe module param added in 4.3 (clearly not widely used yet), and a much more interesting kallsyms race which has been around approximately forever. This fix is more invasive, and will require some care in backporting, but I hated all the bandaids I could think of, so... There are some more coming, which are only for breakages introduced this cycle (livepatch), but wanted these in now"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  modules: fix longstanding /proc/kallsyms vs module insertion race.
  module: wrapper for symbol name.
  modules: fix modparam async_probe request
2016-02-09  workqueue: implement "workqueue.debug_force_rr_cpu" debug feature  (Tejun Heo, 1 file, -2/+21)
Workqueue used to guarantee local execution for work items queued without explicit target CPU. The guarantee is gone now which can break some usages in subtle ways. To flush out those cases, this patch implements a debug feature which forces round-robin CPU selection for all such work items. The debug feature defaults to off and can be enabled with a kernel parameter. The default can be flipped with a debug config option. If you hit this commit during bisection, please refer to 041bd12e272c ("Revert "workqueue: make sure delayed work run in local cpu"") for more information and ping me. Signed-off-by: Tejun Heo <[email protected]>
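For reference, turning the feature on looks like this (the boot parameter is named in the subject; the config option below is my reading of the patch and should be double-checked):

  workqueue.debug_force_rr_cpu=1      (kernel command line, default off)
  CONFIG_DEBUG_WQ_FORCE_RR_CPU=y      (build-time flip of the default)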
2016-02-09  workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs  (Mike Galbraith, 1 file, -2/+32)
WORK_CPU_UNBOUND work items queued to a bound workqueue always run locally. This is a good thing normally, but not when the user has asked us to keep unbound work away from certain CPUs. Round robin these to wq_unbound_cpumask CPUs instead, as perturbation avoidance trumps performance. tj: Cosmetic and comment changes. WARN_ON_ONCE() dropped from empty (wq_unbound_cpumask AND cpu_online_mask). If we want that, it should be done when config changes. Signed-off-by: Mike Galbraith <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
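A simplified sketch of the round-robin selection described above (helper and per-cpu cursor names as I read them from the patch):

  static int wq_select_unbound_cpu(int cpu)
  {
          int new_cpu;

          /* the local CPU is fine unless the user excluded it */
          if (cpumask_test_cpu(cpu, wq_unbound_cpumask))
                  return cpu;
          if (cpumask_empty(wq_unbound_cpumask))
                  return cpu;

          /* otherwise round-robin over allowed online CPUs */
          new_cpu = __this_cpu_read(wq_rr_cpu_last);
          new_cpu = cpumask_next_and(new_cpu, wq_unbound_cpumask, cpu_online_mask);
          if (unlikely(new_cpu >= nr_cpu_ids)) {
                  new_cpu = cpumask_first_and(wq_unbound_cpumask, cpu_online_mask);
                  if (unlikely(new_cpu >= nr_cpu_ids))
                          return cpu;
          }
          __this_cpu_write(wq_rr_cpu_last, new_cpu);

          return new_cpu;
  }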
2016-02-09  Revert "workqueue: make sure delayed work run in local cpu"  (Tejun Heo, 1 file, -4/+4)
This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76. Workqueue used to implicitly guarantee that work items queued without an explicit CPU specified are put on the local CPU. Recent changes in timer broke the guarantee and led to vmstat breakage which was fixed by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU we need it to run on"). vmstat is the most likely to expose the issue and it's quite possible that there are other similar problems which are a lot more difficult to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") was applied to restore the local CPU guarantee. Unfortunately, the change exposed a bug in timer code which got fixed by 22b886dd1018 ("timers: Use proper base migration in add_timer_on()"). Due to code restructuring, the commit couldn't be backported beyond a certain point and stable kernels which only had 874bbfe600a6 started crashing. The local CPU guarantee was accidental more than anything else and we want to get rid of it anyway. As, with the vmstat case fixed, 874bbfe600a6 is causing more problems than it's fixing, it has been decided to take the chance and officially break the guarantee by reverting the commit. A debug feature will be added to force foreign CPU assignment to expose cases relying on the guarantee, and fixes for the individual cases will be backported to stable as necessary. Signed-off-by: Tejun Heo <[email protected]> Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") Link: http://lkml.kernel.org/g/[email protected] Cc: [email protected] Cc: Mike Galbraith <[email protected]> Cc: Henrique de Moraes Holschuh <[email protected]> Cc: Daniel Bilik <[email protected]> Cc: Jan Kara <[email protected]> Cc: Shaohua Li <[email protected]> Cc: Sasha Levin <[email protected]> Cc: Ben Hutchings <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Daniel Bilik <[email protected]> Cc: Jiri Slaby <[email protected]> Cc: Michal Hocko <[email protected]>
2016-02-09  sched/numa: Spread memory according to CPU and memory use  (Rik van Riel, 1 file, -40/+47)
The pseudo-interleaving in NUMA placement has a fundamental problem: using hard usage thresholds to spread memory equally between nodes can prevent workloads from converging, or keep memory "trapped" on nodes where the workload is barely running any more. In order for workloads to properly converge, the memory migration should not be stopped when nodes reach parity, but instead be distributed according to how heavily memory is used from each node. This way memory migration and task migration reinforce each other, instead of one putting the brakes on the other. Remove the hard thresholds from the pseudo-interleaving code, and instead use a more gradual policy on memory placement. This also seems to improve convergence of workloads that do not run flat out, but sleep in between bursts of activity. We still want to slow down NUMA scanning and migration once a workload has settled on a few actively used nodes, so keep the 3/4 hysteresis in place. Keep track of whether a workload is actively running on multiple nodes, so task_numa_migrate does a full scan of the system for better task placement. In the case of running 3 SPECjbb2005 instances on a 4 node system, this code seems to result in fairer distribution of memory between nodes, with more memory bandwidth for each instance. Signed-off-by: Rik van Riel <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] [ Minor readability tweaks. ] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-09  locking/lockdep: Eliminate lockdep_init()  (Andrey Ryabinin, 1 file, -59/+0)
Lockdep is initialized at compile time now. Get rid of lockdep_init(). Signed-off-by: Andrey Ryabinin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Krinkin <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-09  locking/lockdep: Convert hash tables to hlists  (Andrew Morton, 1 file, -23/+19)
Mike said:

: CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
: kernel with CONFIG_UBSAN_ALIGNMENT=y fails to load without even any error
: message.
:
: The problem is that ubsan callbacks use spinlocks and might be called
: before lockdep is initialized. Particularly this line in the
: reserve_ebda_region function causes problem:
:
: lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
:
: If i put lockdep_init() before reserve_ebda_region call in
: x86_64_start_reservations kernel loads well.

Fix this ordering issue permanently: change lockdep so that it uses hlists for the hash tables. Unlike a list_head, an hlist_head is in its initialized state when it is all-zeroes, so lockdep is ready for operation immediately upon boot - lockdep_init() need not have run. The patch will also save some memory. Probably lockdep_init() and lockdep_initialized can be done away with now.

Suggested-by: Mike Krinkin <[email protected]> Reported-by: Mike Krinkin <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Cc: Andrey Ryabinin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-09  sched/debug: Make schedstats a runtime tunable that is disabled by default  (Mel Gorman, 8 files, -90/+232)
schedstats is very useful during debugging and performance tuning but it incurs overhead to calculate the stats. As such, even though it can be disabled at build time, it is often enabled as the information is useful. This patch adds a kernel command-line and sysctl tunable to enable or disable schedstats on demand (when it's built in). It is disabled by default, as someone who knows they need it can also learn to enable it when necessary.

The benefits are dependent on how scheduler-intensive the workload is. If it is, then the patch reduces the number of cycles spent calculating the stats, with a small benefit from reducing the cache footprint of the scheduler. These measurements were taken from a 48-core 2-socket machine with Xeon(R) E5-2670 v3 cpus, although they were also tested on a single-socket 8-core machine with Intel i7-3770 processors.

netperf-tcp
                         4.5.0-rc1             4.5.0-rc1
                           vanilla          nostats-v3r1
Hmean    64       560.45 (  0.00%)      575.98 (  2.77%)
Hmean    128      766.66 (  0.00%)      795.79 (  3.80%)
Hmean    256      950.51 (  0.00%)      981.50 (  3.26%)
Hmean    1024    1433.25 (  0.00%)     1466.51 (  2.32%)
Hmean    2048    2810.54 (  0.00%)     2879.75 (  2.46%)
Hmean    3312    4618.18 (  0.00%)     4682.09 (  1.38%)
Hmean    4096    5306.42 (  0.00%)     5346.39 (  0.75%)
Hmean    8192   10581.44 (  0.00%)    10698.15 (  1.10%)
Hmean    16384  18857.70 (  0.00%)    18937.61 (  0.42%)

Small gains here, UDP_STREAM showed nothing interesting and neither did the TCP_RR tests. The gains on the 8-core machine were very similar.

tbench4
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
Hmean    mb/sec-1       500.85 (  0.00%)      522.43 (  4.31%)
Hmean    mb/sec-2       984.66 (  0.00%)     1018.19 (  3.41%)
Hmean    mb/sec-4      1827.91 (  0.00%)     1847.78 (  1.09%)
Hmean    mb/sec-8      3561.36 (  0.00%)     3611.28 (  1.40%)
Hmean    mb/sec-16     5824.52 (  0.00%)     5929.03 (  1.79%)
Hmean    mb/sec-32    10943.10 (  0.00%)    10802.83 ( -1.28%)
Hmean    mb/sec-64    15950.81 (  0.00%)    16211.31 (  1.63%)
Hmean    mb/sec-128   15302.17 (  0.00%)    15445.11 (  0.93%)
Hmean    mb/sec-256   14866.18 (  0.00%)    15088.73 (  1.50%)
Hmean    mb/sec-512   15223.31 (  0.00%)    15373.69 (  0.99%)
Hmean    mb/sec-1024  14574.25 (  0.00%)    14598.02 (  0.16%)
Hmean    mb/sec-2048  13569.02 (  0.00%)    13733.86 (  1.21%)
Hmean    mb/sec-3072  12865.98 (  0.00%)    13209.23 (  2.67%)

Small gains of 2-4% at low thread counts and otherwise flat. The gains on the 8-core machine were slightly different.

tbench4 on 8-core i7-3770 single socket machine
Hmean    mb/sec-1     442.59 (  0.00%)      448.73 (  1.39%)
Hmean    mb/sec-2     796.68 (  0.00%)      794.39 ( -0.29%)
Hmean    mb/sec-4    1322.52 (  0.00%)     1343.66 (  1.60%)
Hmean    mb/sec-8    2611.65 (  0.00%)     2694.86 (  3.19%)
Hmean    mb/sec-16   2537.07 (  0.00%)     2609.34 (  2.85%)
Hmean    mb/sec-32   2506.02 (  0.00%)     2578.18 (  2.88%)
Hmean    mb/sec-64   2511.06 (  0.00%)     2569.16 (  2.31%)
Hmean    mb/sec-128  2313.38 (  0.00%)     2395.50 (  3.55%)
Hmean    mb/sec-256  2110.04 (  0.00%)     2177.45 (  3.19%)
Hmean    mb/sec-512  2072.51 (  0.00%)     2053.97 ( -0.89%)

In contrast, this shows a relatively steady 2-3% gain at higher thread counts. Due to the nature of the patch and the type of workload, it's not a surprise that the result will depend on the CPU used.

hackbench-pipes
                      4.5.0-rc1             4.5.0-rc1
                        vanilla          nostats-v3r1
Amean    1    0.0637 (  0.00%)      0.0660 ( -3.59%)
Amean    4    0.1229 (  0.00%)      0.1181 (  3.84%)
Amean    7    0.1921 (  0.00%)      0.1911 (  0.52%)
Amean    12   0.3117 (  0.00%)      0.2923 (  6.23%)
Amean    21   0.4050 (  0.00%)      0.3899 (  3.74%)
Amean    30   0.4586 (  0.00%)      0.4433 (  3.33%)
Amean    48   0.5910 (  0.00%)      0.5694 (  3.65%)
Amean    79   0.8663 (  0.00%)      0.8626 (  0.43%)
Amean    110  1.1543 (  0.00%)      1.1517 (  0.22%)
Amean    141  1.4457 (  0.00%)      1.4290 (  1.16%)
Amean    172  1.7090 (  0.00%)      1.6924 (  0.97%)
Amean    192  1.9126 (  0.00%)      1.9089 (  0.19%)

Some small gains and losses, and while the variance data is not included, it's close to the noise. The UMA machine did not show anything particularly different.

pipetest
                              4.5.0-rc1             4.5.0-rc1
                                vanilla          nostats-v2r2
Min           Time   4.13 (  0.00%)        3.99 (  3.39%)
1st-qrtle     Time   4.38 (  0.00%)        4.27 (  2.51%)
2nd-qrtle     Time   4.46 (  0.00%)        4.39 (  1.57%)
3rd-qrtle     Time   4.56 (  0.00%)        4.51 (  1.10%)
Max-90%       Time   4.67 (  0.00%)        4.60 (  1.50%)
Max-93%       Time   4.71 (  0.00%)        4.65 (  1.27%)
Max-95%       Time   4.74 (  0.00%)        4.71 (  0.63%)
Max-99%       Time   4.88 (  0.00%)        4.79 (  1.84%)
Max           Time   4.93 (  0.00%)        4.83 (  2.03%)
Mean          Time   4.48 (  0.00%)        4.39 (  1.91%)
Best99%Mean   Time   4.47 (  0.00%)        4.39 (  1.91%)
Best95%Mean   Time   4.46 (  0.00%)        4.38 (  1.93%)
Best90%Mean   Time   4.45 (  0.00%)        4.36 (  1.98%)
Best50%Mean   Time   4.36 (  0.00%)        4.25 (  2.49%)
Best10%Mean   Time   4.23 (  0.00%)        4.10 (  3.13%)
Best5%Mean    Time   4.19 (  0.00%)        4.06 (  3.20%)
Best1%Mean    Time   4.13 (  0.00%)        4.00 (  3.39%)

Small improvement, and similar gains were seen on the UMA machine.

The gain is small but it stands to reason that doing less work in the scheduler is a good thing. The downside is that the lack of schedstats and tracepoints may be surprising to experts doing performance analysis until they find the existence of the schedstats= parameter or schedstats sysctl. It will be automatically activated for latencytop and sleep profiling to alleviate the problem. For tracepoints, there is a simple warning, as it's not safe to activate schedstats in the context when it's known the tracepoint may be wanted but is unavailable.

Signed-off-by: Mel Gorman <[email protected]> Reviewed-by: Matt Fleming <[email protected]> Reviewed-by: Srikar Dronamraju <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
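For anyone later hunting for the knobs this adds, usage looks like the following (a schedstats= boot parameter and a kernel.sched_schedstats sysctl, as introduced here; treat the exact spellings as a sketch):

  schedstats=enable                                (kernel command line)
  echo 1 > /proc/sys/kernel/sched_schedstats       (runtime)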
2016-02-09  kprobes: Optimize hot path by using percpu counter to collect 'nhit' statistics  (Martin KaFai Lau, 1 file, -4/+15)
When doing ebpf+kprobe on some hot TCP functions (e.g. tcp_rcv_established), the kprobe_dispatcher() function shows up in 'perf report'. In kprobe_dispatcher(), there is a lot of cache bouncing on 'tk->nhit++'. 'tk->nhit' and 'tk->tp.flags' also share the same cacheline.

perf report (cycles:pp):
  8.30%  ipv4_dst_check
  4.74%  copy_user_enhanced_fast_string
  3.93%  dst_release
  2.80%  tcp_v4_rcv
  2.31%  queued_spin_lock_slowpath
  2.30%  _raw_spin_lock
  1.88%  mlx4_en_process_rx_cq
  1.84%  eth_get_headlen
  1.81%  ip_rcv_finish
  ~~~~
  1.71%  kprobe_dispatcher
  ~~~~
  1.55%  mlx4_en_xmit
  1.09%  __probe_kernel_read

perf report after patch:
  9.15%  ipv4_dst_check
  5.00%  copy_user_enhanced_fast_string
  4.12%  dst_release
  2.96%  tcp_v4_rcv
  2.50%  _raw_spin_lock
  2.39%  queued_spin_lock_slowpath
  2.11%  eth_get_headlen
  2.03%  mlx4_en_process_rx_cq
  1.69%  mlx4_en_xmit
  1.19%  ip_rcv_finish
  1.12%  __probe_kernel_read
  1.02%  ehci_hcd_cleanup

Signed-off-by: Martin KaFai Lau <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Acked-by: Steven Rostedt <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Kernel Team <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
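The change boils down to making 'nhit' per-cpu and folding it only on the slow (read) path; a sketch:

  /* was: unsigned long nhit; shared, bounced between CPUs */
  unsigned long __percpu *nhit;

  /* hot path in kprobe_dispatcher(): no cross-CPU cacheline traffic */
  raw_cpu_inc(*tk->nhit);

  /* slow path (profile read): fold the per-cpu counters */
  unsigned long nhit = 0;
  int cpu;

  for_each_possible_cpu(cpu)
          nhit += *per_cpu_ptr(tk->nhit, cpu);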
2016-02-09  locking/lockdep: Fix stack trace caching logic  (Dmitry Vyukov, 1 file, -6/+10)
check_prev_add() caches the saved stack trace in a static trace variable to avoid duplicate save_trace() calls in dependencies involving trylocks. But that caching logic contains a bug. We may not save the trace on the first iteration due to an early return from check_prev_add(). Then on the second iteration, when we actually need the trace, we don't save it because we think we've already saved it. Let check_prev_add() itself control when the stack is saved. There is another bug: the trace variable is protected by the graph lock, but we can temporarily release the graph lock during printing. Fix this by invalidating the cached stack trace when we release the graph lock. Signed-off-by: Dmitry Vyukov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-02-08  audit: Fix typo in comment  (Wei Yuan, 2 files, -4/+4)
Signed-off-by: Weiyuan <[email protected]> Signed-off-by: Paul Moore <[email protected]>
2016-02-08  genirq: Add default affinity mask command line option  (Thomas Gleixner, 1 file, -2/+19)
If we isolate CPUs, then we don't want random device interrupts on them. Even without the user space irq balancer enabled we can end up with irqs on non-boot cpus, and chasing newly requested interrupts is a tedious task. Allow restricting the default irq affinity mask. Signed-off-by: Thomas Gleixner <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Frederic Weisbecker <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Sebastian Siewior <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1602031948190.25254@nanos Signed-off-by: Thomas Gleixner <[email protected]>
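The option this adds takes a cpulist like other mask parameters (irqaffinity=, as I read the patch); example use, assuming CPUs 2-3 are isolated and IRQs should default to the first two:

  irqaffinity=0-1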
2016-02-06  bpf: add lookup/update support for per-cpu hash and array maps  (Alexei Starovoitov, 3 files, -26/+178)
The functions bpf_map_lookup_elem(map, key, value) and bpf_map_update_elem(map, key, value, flags) need to get/set values from all-cpus for per-cpu hash and array maps, so that user space can aggregate/update them as necessary.

Example of single counter aggregation in user space:

  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  long values[nr_cpus];
  long value = 0;

  bpf_lookup_elem(fd, key, values);
  for (i = 0; i < nr_cpus; i++)
          value += values[i];

The user space must provide a round_up(value_size, 8) * nr_cpus array to get/set values, since the kernel will use 'long' copy of per-cpu values to try to copy good counters atomically. It's a best-effort, since bpf programs and user space are racing to access the same memory. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-02-06  bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map  (Alexei Starovoitov, 1 file, -11/+91)
Primary use case is a histogram array of latency where bpf program computes the latency of block requests or other events and stores histogram of latency into array of 64 elements. All cpus are constantly running, so normal increment is not accurate, bpf_xadd causes cache ping-pong and this per-cpu approach allows fastest collision-free counters. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
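A samples/bpf-style sketch of the histogram use case (map layout per the description above; the attach point and the bucket helper are illustrative, not from the patch):

  struct bpf_map_def SEC("maps") lat_hist = {
          .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
          .key_size    = sizeof(u32),
          .value_size  = sizeof(u64),
          .max_entries = 64,
  };

  SEC("kprobe/blk_account_io_done")
  int log_latency(struct pt_regs *ctx)
  {
          u32 slot = latency_log2_bucket(ctx);    /* hypothetical helper */
          u64 *cnt = bpf_map_lookup_elem(&lat_hist, &slot);

          if (cnt)
                  (*cnt)++;    /* this_cpu slot: collision-free, no bpf_xadd */
          return 0;
  }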
2016-02-06  bpf: introduce BPF_MAP_TYPE_PERCPU_HASH map  (Alexei Starovoitov, 1 file, -47/+228)
Introduce the BPF_MAP_TYPE_PERCPU_HASH map type, which is used to do accurate counters without need to use the BPF_XADD instruction, which turned out to be too costly for high-performance network monitoring. In the typical use case the 'key' is the flow tuple or other long living object that sees a lot of events per second. bpf_map_lookup_elem() returns a per-cpu area. Example:

  struct {
          u32 packets;
          u32 bytes;
  } *ptr = bpf_map_lookup_elem(&map, &key);

  /* ptr points to this_cpu area of the value, so the following
   * increments will not collide with other cpus
   */
  ptr->packets++;
  ptr->bytes += skb->len;

bpf_update_elem() atomically creates a new element where all per-cpu values are zero initialized and the this_cpu value is populated with the given 'value'. Note that a non-per-cpu hash map always allocates a new element and then deletes the old one after an rcu grace period to maintain atomicity of the update. The per-cpu hash map updates element values in-place. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-02-05  signals: avoid random wakeups in sigsuspend()  (Sasha Levin, 1 file, -2/+4)
A random wakeup can get us out of sigsuspend() without TIF_SIGPENDING being set. Avoid that by making sure we were signaled, like sys_pause() does. Signed-off-by: Sasha Levin <[email protected]> Acked-by: Oleg Nesterov <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
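The fix mirrors sys_pause(): only return once a signal is actually pending. A sketch of the resulting sigsuspend() core:

  static int sigsuspend(sigset_t *set)
  {
          current->saved_sigmask = current->blocked;
          set_current_blocked(set);

          /* loop: a spurious wakeup without a pending signal just re-sleeps */
          while (!signal_pending(current)) {
                  __set_current_state(TASK_INTERRUPTIBLE);
                  schedule();
          }
          set_restore_sigmask();
          return -ERESTARTNOHAND;
  }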
2016-02-05  sched/isolcpus: Output warning when the 'isolcpus=' kernel parameter is invalid  (Prarit Bhargava, 1 file, -2/+7)
The isolcpus= kernel boot parameter restricts userspace from scheduling on the specified CPUs. If a CPU is specified that is outside the range of 0 to nr_cpu_ids, cpulist_parse() will return -ERANGE, return an empty cpulist, and fail silently. This patch adds an error message to isolated_cpu_setup() to indicate to the user that something has gone awry, and returns 0 on error. Signed-off-by: Prarit Bhargava <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Link: http://lkml.kernel.org/r/[email protected] [ Twiddled some details. ] Signed-off-by: Ingo Molnar <[email protected]>
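A sketch of the resulting setup helper (message text approximate):

  /* Setup the mask of cpus configured for isolated domains */
  static int __init isolated_cpu_setup(char *str)
  {
          int ret;

          alloc_bootmem_cpumask_var(&cpu_isolated_map);
          ret = cpulist_parse(str, cpu_isolated_map);
          if (ret) {
                  pr_err("sched: Error, all isolcpus= values must be between 0 and %d\n",
                         nr_cpu_ids);
                  return 0;    /* flag the option as bad instead of failing silently */
          }
          return 1;
  }
  __setup("isolcpus=", isolated_cpu_setup);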
2016-02-05  cpufreq: powernv/tracing: Add powernv_throttle tracepoint  (Shilpasri G Bhat, 1 file, -0/+1)
This patch adds the powernv_throttle tracepoint to trace the CPU frequency throttling event, which is used by the powernv-cpufreq driver in POWER8. Signed-off-by: Shilpasri G Bhat <[email protected]> Reviewed-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-02-03  Merge tag 'trace-v4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds, 1 file, -0/+7)
Pull tracing fix from Steven Rostedt: "A cleanup to the stack tracer broke stack tracing on s390. Here's a simple fix to correct that issue"

* tag 'trace-v4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/stacktrace: Show entire trace if passed in function not found
2016-02-03  modules: fix longstanding /proc/kallsyms vs module insertion race.  (Rusty Russell, 1 file, -43/+69)
For CONFIG_KALLSYMS, we keep two symbol tables and two string tables. There's one full copy, marked SHF_ALLOC and laid out at the end of the module's init section. There's also a cut-down version that only contains core symbols and strings, and lives in the module's core section. After module init (and before we free the module memory), we switch the mod->symtab, mod->num_symtab and mod->strtab to point to the core versions. We do this under the module_mutex. However, kallsyms doesn't take the module_mutex: it uses preempt_disable() and rcu tricks to walk through the modules, because it's used in the oops path. It's also used in /proc/kallsyms. There's nothing atomic about the change of these variables, so we can get the old (larger!) num_symtab and the new symtab pointer; in fact this is what I saw when trying to reproduce. By grouping these variables together, we can use a carefully-dereferenced pointer to ensure we always get one or the other (the free of the module init section is already done in an RCU callback, so that's safe). We allocate the init one at the end of the module init section, and keep the core one inside the struct module itself (it could also have been allocated at the end of the module core, but that's probably overkill). Reported-by: Weilong Chen <[email protected]> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541 Cc: [email protected] Signed-off-by: Rusty Russell <[email protected]>
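The grouping it describes looks roughly like this (a sketch of the data-structure side, simplified):

  struct mod_kallsyms {
          Elf_Sym *symtab;
          unsigned int num_symtab;
          char *strtab;
  };

  struct module {
          ...
          /* swapped from the init-section copy to the core copy in a
           * single pointer store, so readers see one or the other */
          struct mod_kallsyms *kallsyms;
          struct mod_kallsyms core_kallsyms;
          ...
  };

  /* readers (oops path, /proc/kallsyms) dereference it once: */
  struct mod_kallsyms *kallsyms = rcu_dereference_sched(mod->kallsyms);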
2016-02-03  module: wrapper for symbol name.  (Rusty Russell, 1 file, -11/+15)
This trivial wrapper adds clarity and makes the following patch smaller. Cc: [email protected] Signed-off-by: Rusty Russell <[email protected]>
2016-02-03  modules: fix modparam async_probe request  (Luis R. Rodriguez, 1 file, -1/+1)
Commit f2411da746985 ("driver-core: add driver module asynchronous probe support") added async probe support, in two forms:
 * in-kernel driver specification annotation
 * generic async_probe module parameter (modprobe foo async_probe)
To support the generic kernel parameter, parse_args() was extended via commit ecc8617053e0 ("module: add extra argument for parse_params() callback"); however, commit f2411da746985 failed to add the required argument. This causes a crash whenever the generic async_probe module parameter is used. This was overlooked when the in-kernel async probe support was reworked a bit. Fix this as originally intended. Cc: Hannes Reinecke <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: [email protected] (4.2+) Signed-off-by: Luis R. Rodriguez <[email protected]> Signed-off-by: Rusty Russell <[email protected]> [minimized]
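Reading the changelog, the missing piece is the void *arg cookie that parse_args() now hands to its unknown-parameter callback; the one-line fix is plausibly of this shape (a sketch, not the verbatim diff):

  -       ret = parse_args(mod->name, mod->args, mod->kp, mod->num_kp,
  -                        -32768, 32767, NULL, unknown_module_param_cb);
  +       ret = parse_args(mod->name, mod->args, mod->kp, mod->num_kp,
  +                        -32768, 32767, mod, unknown_module_param_cb);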
2016-02-01  Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm  (Linus Torvalds, 1 file, -8/+12)
Pull libnvdimm fixes from Dan Williams:
 1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved area for storing a struct page array.
 2/ Fixes for dax operations on a raw block device to prevent pagecache collisions with dax mappings.
 3/ A fix for pfn_t usage in vm_insert_mixed that led to a null pointer de-reference.
These have received build success notification from the kbuild robot across 153 configs and pass the latest ndctl tests.

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  phys_to_pfn_t: use phys_addr_t
  mm: fix pfn_t to page conversion in vm_insert_mixed
  block: use DAX for partition table reads
  block: revert runtime dax control of the raw block device
  fs, block: force direct-I/O for dax-enabled block devices
  devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
  libnvdimm, pfn: fix restoring memmap location
  libnvdimm: fix mode determination for e820 devices
2016-02-01  Merge 4.5-rc2 into tty-next  (Greg Kroah-Hartman, 22 files, -765/+833)
We want the tty/serial fixes in here as well to make merges easier. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2016-01-31  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 6 files, -30/+61)
Pull timer fixes from Thomas Gleixner: "The timer department delivers:
 - a regression fix for the NTP code along with a proper selftest
 - prevent a spurious timer interrupt in the NOHZ lowres code
 - a fix for user space interfaces returning the remaining time on architectures with CONFIG_TIME_LOW_RES=y
 - a few patches to fix COMPILE_TEST fallout"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/nohz: Set the correct expiry when switching to nohz/lowres mode
  clocksource: Fix dependencies for archs w/o HAS_IOMEM
  clocksource: Select CLKSRC_MMIO where needed
  tick/sched: Hide unused oneshot timer code
  kselftests: timers: Add adjtimex SETOFFSET validity tests
  ntp: Fix ADJ_SETOFFSET being used w/ ADJ_NANO
  itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
  posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
  timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
  hrtimer: Handle remaining time proper for TIME_LOW_RES
  clockevents/tcb_clksrc: Prevent disabling an already disabled clock
2016-01-31  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 3 files, -9/+25)
Pull scheduler fixes from Thomas Gleixner: "Three small fixes in the scheduler/core:
 - use after free in the numa code
 - crash in the numa init code
 - a simple spelling fix"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  pid: Fix spelling in comments
  sched/numa: Fix use-after-free bug in the task_numa_compare
  sched: Fix crash in sched_init_numa()
2016-01-31  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 5 files, -635/+641)
Pull perf fixes from Thomas Gleixner: "This is much bigger than typical fixes, but Peter found a category of races that spurred more fixes and more debugging enhancements. Work started before the merge window, but got finished only now. Aside of that this contains the usual small fixes to perf and tools. Nothing particularly exciting"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  perf: Remove/simplify lockdep annotation
  perf: Synchronously clean up child events
  perf: Untangle 'owner' confusion
  perf: Add flags argument to perf_remove_from_context()
  perf: Clean up sync_child_event()
  perf: Robustify event->owner usage and SMP ordering
  perf: Fix STATE_EXIT usage
  perf: Update locking order
  perf: Remove __free_event()
  perf/bpf: Convert perf_event_array to use struct file
  perf: Fix NULL deref
  perf/x86: De-obfuscate code
  perf/x86: Fix uninitialized value usage
  perf: Fix race in perf_event_exit_task_context()
  perf: Fix orphan hole
  perf stat: Do not clean event's private stats
  perf hists: Fix HISTC_MEM_DCACHELINE width setting
  perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
  perf tests: Remove wrong semicolon in while loop in CQM test
  perf: Synchronously free aux pages in case of allocation failure
  ...
2016-01-31  Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -72/+81)
Pull locking fix from Thomas Gleixner: "A single commit, which makes the rtmutex.wait_lock an irq safe lock. This prevents a potential deadlock which can be triggered by the rcu boosting code from rcu_read_unlock()"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rtmutex: Make wait_lock irq safe
2016-01-31  Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, -5/+11)
Pull IRQ fixes from Ingo Molnar: "Mostly irqchip driver fixes, but also an irq core crash fix and a build fix"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mxs: Add missing set_handle_irq()
  irqchip/atmel-aic: Fix wrong bit operation for IRQ priority
  irqchip/gic-v3-its: Recompute the number of pages on page size change
  base: Export platform_msi_domain_[alloc,free]_irqs
  of: MSI: Simplify irqdomain lookup
  irqdomain: Allow domain lookup with DOMAIN_BUS_WIRED token
  irqchip: Fix dependencies for archs w/o HAS_IOMEM
  irqchip/s3c24xx: Mark init_eint as __maybe_unused
  genirq: Validate action before dereferencing it in handle_irq_event_percpu()
2016-01-31  phys_to_pfn_t: use phys_addr_t  (Dan Williams, 1 file, -1/+1)
A dma_addr_t is potentially smaller than a phys_addr_t on some archs. Don't truncate the address when doing the pfn conversion. Cc: Ross Zwisler <[email protected]> Reported-by: Matthew Wilcox <[email protected]> [willy: fix pfn_t_to_phys as well] Signed-off-by: Dan Williams <[email protected]>
2016-01-31  kernel/Makefile: remove the useless CFLAGS_REMOVE_cgroup-debug.o  (Li Bin, 1 file, -2/+1)
The file cgroup-debug.c was removed by commit fe6934354f8e (cgroups: move the cgroup debug subsys into cgroup.c to access internal state), which left the CFLAGS_REMOVE_cgroup-debug.o = $(CC_FLAGS_FTRACE) line in kernel/Makefile useless. Remove it. Signed-off-by: Li Bin <[email protected]> Acked-by: Zefan Li <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2016-01-30  resource: Kill walk_iomem_res()  (Toshi Kani, 1 file, -44/+5)
walk_iomem_res_desc() replaced walk_iomem_res() and there is no caller to walk_iomem_res() any more. Kill it. Also remove @name from find_next_iomem_res() as it is no longer used. Signed-off-by: Toshi Kani <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Dave Young <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dan Williams <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Jakub Sitnicki <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vinod Koul <[email protected]> Cc: [email protected] Cc: linux-mm <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-01-30  x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search  (Toshi Kani, 1 file, -4/+4)
Change the callers of walk_iomem_res() scanning for the following resources by name to use walk_iomem_res_desc() instead. "ACPI Tables" "ACPI Non-volatile Storage" "Persistent Memory (legacy)" "Crash kernel" Note, the caller of walk_iomem_res() with "GART" will be removed in a later patch. Signed-off-by: Toshi Kani <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Young <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Chun-Yi <[email protected]> Cc: Dan Williams <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Don Zickus <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Lee, Chun-Yi <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Minfei Huang <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Ross Zwisler <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Takao Indoh <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: [email protected] Cc: [email protected] Cc: linux-mm <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-01-30  resource: Add walk_iomem_res_desc()  (Toshi Kani, 1 file, -10/+56)
Add a new interface, walk_iomem_res_desc(), which walks through the iomem table by identifying a target with @flags and @desc. This interface provides the same functionality as walk_iomem_res(), but does not use strcmp() to @name for better efficiency. walk_iomem_res() is deprecated and will be removed in a later patch. Requested-by: Borislav Petkov <[email protected]> Signed-off-by: Toshi Kani <[email protected]> [ Fixup comments. ] Signed-off-by: Borislav Petkov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dan Williams <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Hanjun Guo <[email protected]> Cc: Jakub Sitnicki <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: [email protected] Cc: linux-mm <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
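A hedged usage sketch of the switch-over (the descriptor constant comes from the kexec patch later in this series; the callback-based signature is as it stands at this point in the series):

  /* old: match by name, strcmp() against every entry */
  walk_iomem_res("Crash kernel", IORESOURCE_MEM, start, end, arg, func);

  /* new: match by flags + descriptor, no string compares */
  walk_iomem_res_desc(IORES_DESC_CRASH_KERNEL, IORESOURCE_MEM,
                      start, end, arg, func);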
2016-01-30  memremap: Change region_intersects() to take @flags and @desc  (Toshi Kani, 2 files, -17/+22)
Change region_intersects() to identify a target with @flags and @desc, instead of @name with strcmp(). Change the callers of region_intersects(), memremap() and devm_memremap(), to set IORESOURCE_SYSTEM_RAM in @flags and IORES_DESC_NONE in @desc when searching System RAM. Also, export region_intersects() so that the ACPI EINJ error injection driver can call this function in a later patch. Signed-off-by: Toshi Kani <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Acked-by: Dan Williams <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jakub Sitnicki <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Naoya Horiguchi <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: [email protected] Cc: linux-mm <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
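The memremap() call site then reads roughly like this (a sketch):

  int is_ram = region_intersects(offset, size,
                                 IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);

  if (is_ram == REGION_MIXED) {
          WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
                    &offset, (unsigned long)size);
          return NULL;
  }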
2016-01-30  resource: Change walk_system_ram() to use System RAM type  (Toshi Kani, 1 file, -13/+13)
Now that all System RAM resource entries have been initialized to IORESOURCE_SYSTEM_RAM type, change walk_system_ram_res() and walk_system_ram_range() to call find_next_iomem_res() by setting @res.flags to IORESOURCE_SYSTEM_RAM and @name to NULL. With this change, they walk through the iomem table to find System RAM ranges without the need to do strcmp() on the resource names. No functional change is made to the interfaces. Signed-off-by: Toshi Kani <[email protected]> [ Boris: fixup comments. ] Signed-off-by: Borislav Petkov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dan Williams <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jakub Sitnicki <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: [email protected] Cc: linux-mm <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-01-30  kexec: Set IORESOURCE_SYSTEM_RAM for System RAM  (Toshi Kani, 2 files, -4/+6)
Set proper ioresource flags and types for crash kernel reservation areas. Signed-off-by: Toshi Kani <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Reviewed-by: Dave Young <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: HATAYAMA Daisuke <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Minfei Huang <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: [email protected] Cc: [email protected] Cc: linux-mm <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-01-30  resource: Add I/O resource descriptor  (Toshi Kani, 1 file, -0/+5)
walk_iomem_res() and region_intersects() still need to use strcmp() for searching a resource entry by @name in the iomem table. This patch introduces I/O resource descriptor 'desc' in struct resource for the iomem search interfaces. Drivers can assign their unique descriptor to a range when they support the search interfaces. Otherwise, 'desc' is set to IORES_DESC_NONE (0). This avoids changing most of the drivers as they typically allocate resource entries statically, or by calling alloc_resource(), kzalloc(), or alloc_bootmem_low(), which set the field to zero by default. A later patch will address some drivers that use kmalloc() without zero'ing the field. Also change release_mem_region_adjustable() to set 'desc' when its resource entry gets separated. Other resource interfaces are also changed to initialize 'desc' explicitly although alloc_resource() sets it to 0. Signed-off-by: Toshi Kani <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dan Williams <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jakub Sitnicki <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: [email protected] Cc: linux-mm <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
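The resulting structure, fields abridged:

  struct resource {
          resource_size_t start;
          resource_size_t end;
          const char *name;
          unsigned long flags;
          unsigned long desc;    /* new: IORES_DESC_* identifier */
          struct resource *parent, *sibling, *child;
  };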
2016-01-30  resource: Handle resource flags properly  (Toshi Kani, 1 file, -3/+4)
I/O resource flags consist of I/O resource types and modifier bits. Therefore, checking an I/O resource type in 'flags' must be performed with a bitwise operation. Fix find_next_iomem_res() and region_intersects() that simply compare 'flags' against a given value. Also change __request_region() to set 'res->flags' from resource_type() and resource_ext_type() of the parent, so that children nodes will inherit the extended I/O resource type. Signed-off-by: Toshi Kani <[email protected]> Signed-off-by: Borislav Petkov <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dan Williams <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: Jakub Sitnicki <[email protected]> Cc: Jiang Liu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luis R. Rodriguez <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Toshi Kani <[email protected]> Cc: Vinod Koul <[email protected]> Cc: [email protected] Cc: linux-mm <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
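The distinction, as a sketch:

  /* broken: exact comparison fails once modifier bits are set */
  if (res->flags != flags)
          continue;

  /* fixed: check that all requested type bits are present */
  if ((res->flags & flags) != flags)
          continue;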