aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2017-10-31userns: use union in {g,u}idmap structChristian Brauner1-12/+18
- Add a struct containing two pointer to extents and wrap both the static extent array and the struct into a union. This is done in preparation for bumping the {g,u}idmap limits for user namespaces. - Add brackets around anonymous union when using designated initializers to initialize members in order to please gcc <= 4.4. Signed-off-by: Christian Brauner <[email protected]> Signed-off-by: Eric W. Biederman <[email protected]>
2017-10-31Merge branch 'fortglx/4.15/time' of ↵Thomas Gleixner5-183/+227
https://git.linaro.org/people/john.stultz/linux into timers/core Pull timekeeping updates from John Stultz: - More y2038 work from Arnd Bergmann - A new mechanism to allow RTC drivers to specify the resolution of the RTC so the suspend/resume code can make informed decisions whether to inject the suspended time or not in case of fast suspend/resume cycles.
2017-10-31fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_tElena Reshetova1-1/+1
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable fsnotify_mark.refcnt is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Signed-off-by: Jan Kara <[email protected]>
2017-10-31irq/work: Don't reinvent the wheel but use existing llist APIByungchul Park1-5/+1
Use the proper llist APIs instead of open-coded variants of them. Signed-off-by: Byungchul Park <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-30time: Move time_t conversion helpers to time32.hArnd Bergmann1-2/+3
On 64-bit architectures, the timespec64 based helpers in linux/time.h are defined as macros pointing to their timespec based counterparts. This made sense when they were first introduced, but as we are migrating away from timespec in general, it's much less intuitive now. This changes the macros to work in the exact opposite way: we always provide the timespec64 based helpers and define the old interfaces as macros for them. Now we can move those macros into linux/time32.h, which already contains the respective helpers for 32-bit architectures. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-10-30time: Remove unused functionsArnd Bergmann1-18/+0
The (slow but) ongoing work on conversion from timespec to timespec64 has led some timespec based helper functions to become unused. No new code should use them, so we can remove the functions entirely. I'm planning to obsolete additional interfaces next and remove more of these. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-10-30timekeeping: Use timespec64 in timekeeping_inject_offsetArnd Bergmann1-47/+25
As part of changing all the timekeeping code to use 64-bit time_t consistently, this removes the uses of timeval and timespec as much as possible from do_adjtimex() and timekeeping_inject_offset(). The timeval_inject_offset_valid() and timespec_inject_offset_valid() just complicate this, so I'm folding them into the respective callers. This leaves the actual 'struct timex' definition, which is part of the user-space ABI and should be dealt with separately when we have agreed on the ABI change. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-10-30timekeeping: Consolidate timekeeping_inject_offset codeArnd Bergmann5-100/+123
The code to check the adjtimex() or clock_adjtime() arguments is spread out across multiple files for presumably only historic reasons. As a preparatation for a rework to get rid of the use of 'struct timeval' and 'struct timespec' in there, this moves all the portions into kernel/time/timekeeping.c and marks them as 'static'. The warp_clock() function here is not as closely related as the others, but I feel it still makes sense to move it here in order to consolidate all callers of timekeeping_inject_offset(). Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]> [jstultz: Whitespace fixup] Signed-off-by: John Stultz <[email protected]>
2017-10-30rtc: Allow rtc drivers to specify the tv_nsec value for ntpJason Gunthorpe1-53/+113
ntp is currently hardwired to try and call the rtc set when wall clock tv_nsec is 0.5 seconds. This historical behaviour works well with certain PC RTCs, but is not universal to all rtc hardware. Change how this works by introducing the driver specific concept of set_offset_nsec, the delay between current wall clock time and the target time to set (with a 0 tv_nsecs). For x86-style CMOS set_offset_nsec should be -0.5 s which causes the last second to be written 0.5 s after it has started. For compat with the old rtc_set_ntp_time, the value is defaulted to + 0.5 s, which causes the next second to be written 0.5s before it starts, as things were before this patch. Testing shows many non-x86 RTCs would like set_offset_nsec ~= 0, so ultimately each RTC driver should set the set_offset_nsec according to its needs, and non x86 architectures should stop using update_persistent_clock64 in order to access this feature. Future patches will revise the drivers as needed. Since CMOS and RTC now have very different handling they are split into two dedicated code paths, sharing the support code, and ifdefs are replaced with IS_ENABLED. Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Miroslav Lichvar <[email protected]> Cc: Richard Cochran <[email protected]> Cc: Prarit Bhargava <[email protected]> Cc: Stephen Boyd <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: John Stultz <[email protected]>
2017-10-30added new line symbol after warning about dropped messagesMaxim Akristiniy1-1/+1
so this message will not mess with the next one Cc: Steven Rostedt <[email protected]> Signed-off-by: Maxim Akristiniy <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2017-10-30cgroup: mark @cgrp __maybe_unused in cpu_stat_show()Tejun Heo1-1/+1
The local variable @cgrp isn't used if !CONFIG_CGROUP_SCHED. Mark the variable with __maybe_unused to avoid a compile warning. Reported-by: "[email protected]" <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2017-10-30workqueue: Fix NULL pointer dereferenceLi Bin1-1/+2
When queue_work() is used in irq (not in task context), there is a potential case that trigger NULL pointer dereference. ---------------------------------------------------------------- worker_thread() |-spin_lock_irq() |-process_one_work() |-worker->current_pwq = pwq |-spin_unlock_irq() |-worker->current_func(work) |-spin_lock_irq() |-worker->current_pwq = NULL |-spin_unlock_irq() //interrupt here |-irq_handler |-__queue_work() //assuming that the wq is draining |-is_chained_work(wq) |-current_wq_worker() //Here, 'current' is the interrupted worker! |-current->current_pwq is NULL here! |-schedule() ---------------------------------------------------------------- Avoid it by checking for task context in current_wq_worker(), and if not in task context, we shouldn't use the 'current' to check the condition. Reported-by: Xiaofei Tan <[email protected]> Signed-off-by: Li Bin <[email protected]> Reviewed-by: Lai Jiangshan <[email protected]> Signed-off-by: Tejun Heo <[email protected]> Fixes: 8d03ecfe4718 ("workqueue: reimplement is_chained_work() using current_wq_worker()") Cc: [email protected] # v3.9+
2017-10-30printk: fix typo in printk_safe.cBaoquan He1-1/+1
Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Sergey Senozhatsky <[email protected]> Signed-off-by: Petr Mladek <[email protected]>
2017-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller4-26/+42
Several conflicts here. NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to nfp_fl_output() needed some adjustments because the code block is in an else block now. Parallel additions to net/pkt_cls.h and net/sch_generic.h A bug fix in __tcp_retransmit_skb() conflicted with some of the rbtree changes in net-next. The tc action RCU callback fixes in 'net' had some overlap with some of the recent tcf_block reworking. Signed-off-by: David S. Miller <[email protected]>
2017-10-30perf/cgroup: Fix perf cgroup hierarchy supportTejun Heo1-2/+4
The following commit: 864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups") made list_update_cgroup_event() skip setting cpuctx->cgrp if no cgroup event targets %current's cgroup. This breaks perf_event's hierarchical support because events which target one of the ancestors get ignored. Fix it by using cgroup_is_descendant() test instead of equality. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: David Carrillo-Cisneros <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] # v4.9+ Fixes: 864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-29genirq: Document vcpu_info usage for percpu_devid interruptsChristoffer Dall1-1/+2
It is currently unclear how to set the VCPU affinity for a percpu_devid interrupt , since the Linux irq_data structure describes the state for multiple interrupts, one for each physical CPU on the system. Since each such interrupt can be associated with different VCPUs or none at all, associating a single VCPU state with such an interrupt does not capture the necessary semantics. The implementers of irq_set_affinity are the Intel and AMD IOMMUs, and the ARM GIC irqchip. The Intel and AMD callers do not appear to use percpu_devid interrupts, and the ARM GIC implementation only checks the pointer against NULL vs. non-NULL. Therefore, simply update the function documentation to explain the expected use in the context of percpu_devid interrupts, allowing future changes or additions to irqchip implementers to do the right thing. Signed-off-by: Christoffer Dall <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Acked-by: Marc Zyngier <[email protected]> Cc: [email protected] Cc: Catalin Marinas <[email protected]> Cc: Will Deacon <[email protected]> Cc: Eric Auger <[email protected]> Cc: [email protected] Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected]
2017-10-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-3/+12
Pull networking fixes from David Miller: 1) Fix route leak in xfrm_bundle_create(). 2) In mac80211, validate user rate mask before configuring it. From Johannes Berg. 3) Properly enforce memory limits in fair queueing code, from Toke Hoiland-Jorgensen. 4) Fix lockdep splat in inet_csk_route_req(), from Eric Dumazet. 5) Fix TSO header allocation and management in mvpp2 driver, from Yan Markman. 6) Don't take socket lock in BH handler in strparser code, from Tom Herbert. 7) Don't show sockets from other namespaces in AF_UNIX code, from Andrei Vagin. 8) Fix double free in error path of tap_open(), from Girish Moodalbail. 9) Fix TX map failure path in igb and ixgbe, from Jean-Philippe Brucker and Alexander Duyck. 10) Fix DCB mode programming in stmmac driver, from Jose Abreu. 11) Fix err_count handling in various tunnels (ipip, ip6_gre). From Xin Long. 12) Properly align SKB head before building SKB in tuntap, from Jason Wang. 13) Avoid matching qdiscs with a zero handle during lookups, from Cong Wang. 14) Fix various endianness bugs in sctp, from Xin Long. 15) Fix tc filter callback races and add selftests which trigger the problem, from Cong Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits) selftests: Introduce a new test case to tc testsuite selftests: Introduce a new script to generate tc batch file net_sched: fix call_rcu() race on act_sample module removal net_sched: add rtnl assertion to tcf_exts_destroy() net_sched: use tcf_queue_work() in tcindex filter net_sched: use tcf_queue_work() in rsvp filter net_sched: use tcf_queue_work() in route filter net_sched: use tcf_queue_work() in u32 filter net_sched: use tcf_queue_work() in matchall filter net_sched: use tcf_queue_work() in fw filter net_sched: use tcf_queue_work() in flower filter net_sched: use tcf_queue_work() in flow filter net_sched: use tcf_queue_work() in cgroup filter net_sched: use tcf_queue_work() in bpf filter net_sched: use tcf_queue_work() in basic filter net_sched: introduce a workqueue for RCU callbacks of tc filter sctp: fix some type cast warnings introduced since very beginning sctp: fix a type cast warnings that causes a_rwnd gets the wrong value sctp: fix some type cast warnings introduced by transport rhashtable sctp: fix some type cast warnings introduced by stream reconf ...
2017-10-29bpf: rename sk_actions to align with bpf infrastructureJohn Fastabend1-1/+2
Recent additions to support multiple programs in cgroups impose a strict requirement, "all yes is yes, any no is no". To enforce this the infrastructure requires the 'no' return code, SK_DROP in this case, to be 0. To apply these rules to SK_SKB program types the sk_actions return codes need to be adjusted. This fix adds SK_PASS and makes 'SK_DROP = 0'. Finally, remove SK_ABORTED to remove any chance that the API may allow aborted program flows to be passed up the stack. This would be incorrect behavior and allow programs to break existing policies. Signed-off-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-29bpf: bpf_compute_data uses incorrect cb structureJohn Fastabend1-2/+10
SK_SKB program types use bpf_compute_data to store the end of the packet data. However, bpf_compute_data assumes the cb is stored in the qdisc layer format. But, for SK_SKB this is the wrong layer of the stack for this type. It happens to work (sort of!) because in most cases nothing happens to be overwritten today. This is very fragile and error prone. Fortunately, we have another hole in tcp_skb_cb we can use so lets put the data_end value there. Note, SK_SKB program types do not use data_meta, they are failed by sk_skb_is_valid_access(). Signed-off-by: John Fastabend <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27bpf: remove tail_call and get_stackid helper declarations from bpf.hGianluca Borello1-0/+2
commit afdb09c720b6 ("security: bpf: Add LSM hooks for bpf object related syscall") included linux/bpf.h in linux/security.h. As a result, bpf programs including bpf_helpers.h and some other header that ends up pulling in also security.h, such as several examples under samples/bpf, fail to compile because bpf_tail_call and bpf_get_stackid are now "redefined as different kind of symbol". >From bpf.h: u64 bpf_tail_call(u64 ctx, u64 r2, u64 index, u64 r4, u64 r5); u64 bpf_get_stackid(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); Whereas in bpf_helpers.h they are: static void (*bpf_tail_call)(void *ctx, void *map, int index); static int (*bpf_get_stackid)(void *ctx, void *map, int flags); Fix this by removing the unused declaration of bpf_tail_call and moving the declaration of bpf_get_stackid in bpf_trace.c, which is the only place where it's needed. Signed-off-by: Gianluca Borello <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2017-10-27perf/core: Rewrite event timekeepingPeter Zijlstra1-256/+129
The current even timekeeping, which computes enabled and running times, uses 3 distinct timestamps to reflect the various event states: OFF (stopped), INACTIVE (enabled) and ACTIVE (running). Furthermore, the update rules are such that even INACTIVE events need their timestamps updated. This is undesirable because we'd like to not touch INACTIVE events if at all possible, this makes event scheduling (much) more expensive than needed. Rewrite the timekeeping to directly use event->state, this greatly simplifies the code and results in only having to update things when we change state, or an up-to-date value is requested (read). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/core: Fix perf_event_read()Peter Zijlstra1-16/+39
perf_event_read() has a number of issues regarding the timekeeping bits. - The IPI didn't update group times when it found INACTIVE - The direct call would not re-check ->state after taking ctx->lock which can result in ->count and timestamps getting out of sync. And we can make use of the ordering introduced for perf_event_stop() to make it more accurate for ACTIVE. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/core: Remove wrong barrierPeter Zijlstra1-5/+0
The barrier and comment make no sense: - if what the barrier says is true, it should be wmb() but that should then be part of the arch driver, not the generic code. - if it is an SMP barrier, there must be a matching barrier, and there isn't one. So kill it. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/core: Rename 'enum perf_event_active_state'Peter Zijlstra1-1/+1
Its a weird name, active is one of the states, it should not be part of the name, also, its too long. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/core: Make sure to update ctx time before using itPeter Zijlstra1-2/+10
We should make sure to update ctx time before we use it to update event times. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/core: Fix __perf_read_group_add() lockingPeter Zijlstra1-2/+2
Event timestamps are serialized using ctx->lock, make sure to hold it over reading all values. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/core: Update ctx time before detaching eventsPeter Zijlstra1-0/+1
We should make sure the ctx time is updated before we detach events; which will want to update event times. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/core: Fix perf_event_read_value() lockingPeter Zijlstra1-2/+14
perf_event_read_value() is an external accessor, just like perf_event_{en,dis}able() and should thus use perf_event_ctx_lock(). Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: f63a8daa5812 ("perf: Fix event->ctx locking") Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf ↵Yonghong Song3-4/+15
event change needed for subsequent bpf helpers" eBPF programs would like access to the (perf) event enabled and running times along with the event value, such that they can deal with event multiplexing (among other things). This patch extends the interface; a future eBPF patch will utilize the new functionality. [ Note, there's a same-content commit with a poor changelog and a meaningless title in the networking tree as well - but we need this change for subsequent perf work, so apply it here as well, with a proper changelog. Hopefully Git will be able to sort out this somewhat messy workflow, if there are no other, conflicting changes to these files. ] Signed-off-by: Yonghong Song <[email protected]> [ Rewrote the changelog. ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: David S. Miller <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar9-59/+111
Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Add basic isolcpus flagsFrederic Weisbecker1-1/+25
Add flags to control NOHZ and domain isolation from "isolcpus=", in order to centralize the isolation features to a common interface. Domain isolation remains the default so not to break the existing isolcpus boot paramater behaviour. Further flags in the future may include 0hz (1hz tick offload) and timers, workqueue, RCU, kthread, watchdog, likely all merged together in a common flag ("async"?). In any case, this will have to be modifiable by cpusets. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Move isolcpus= handling to the housekeeping codeFrederic Weisbecker4-56/+62
We want to centralize the isolation features, to be done by the housekeeping subsystem and scheduler domain isolation is a significant part of it. No intended behaviour change, we just reuse the housekeeping cpumask and core code. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Handle the nohz_full= parameterFrederic Weisbecker2-22/+33
We want to centralize the isolation management, done by the housekeeping subsystem. Therefore we need to handle the nohz_full= parameter from there. Since nohz_full= so far has involved unbound timers, watchdog, RCU and tilegx NAPI isolation, we keep that default behaviour. nohz_full= will be deprecated in the future. We want to control the isolation features from the isolcpus= parameter. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Introduce housekeeping flagsFrederic Weisbecker6-19/+24
Before we implement isolcpus under housekeeping, we need the isolation features to be more finegrained. For example some people want NOHZ_FULL without the full scheduler isolation, others want full scheduler isolation without NOHZ_FULL. So let's cut all these isolation features piecewise, at the risk of overcutting it right now. We can still merge some flags later if they always make sense together. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Split out new CONFIG_CPU_ISOLATION=y config from ↵Frederic Weisbecker1-1/+1
CONFIG_NO_HZ_FULL Split the housekeeping config from CONFIG_NO_HZ_FULL. This way we finally separate the isolation code from NOHZ. Although a dependency to CONFIG_NO_HZ_FULL remains for now, while the housekeeping code still deals with NOHZ internals. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Rename is_housekeeping_cpu() to housekeeping_cpu()Frederic Weisbecker2-4/+4
Fit it into the housekeeping_*() namespace. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Use its own static keyFrederic Weisbecker1-4/+9
Housekeeping code still depends on the nohz_full static key. Since we want to decouple housekeeping from NOHZ, let's create a housekeeping specific static key. It's mostly relevant for calls to is_housekeeping_cpu() from the scheduler. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Make the housekeeping cpumask privateFrederic Weisbecker1-1/+35
Nobody needs to access this detail. housekeeping_cpumask() already takes care of it. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation, watchdog: Use housekeeping_cpumask() instead of ad-hoc versionFrederic Weisbecker1-8/+3
While trying to disable the watchog on nohz_full CPUs, the watchdog implements an ad-hoc version of housekeeping_cpumask(). Lets replace those re-invented lines. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-27sched/isolation: Move housekeeping related code to its own fileFrederic Weisbecker8-18/+39
The housekeeping code is currently tied to the NOHZ code. As we are planning to make housekeeping independent from it, start with moving the relevant code to its own file. Signed-off-by: Frederic Weisbecker <[email protected]> Acked-by: Thomas Gleixner <[email protected]> Acked-by: Paul E. McKenney <[email protected]> Cc: Chris Metcalf <[email protected]> Cc: Christoph Lameter <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Luiz Capitulino <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Wanpeng Li <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-26cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.statTejun Heo4-16/+68
The basic cpu stat is currently shown with "cpu." prefix in cgroup.stat, and the same information is duplicated in cpu.stat when cpu controller is enabled. This is ugly and not very scalable as we want to expand the coverage of stat information which is always available. This patch makes cgroup core always create "cpu.stat" file and show the basic cpu stat there and calls the cpu controller to show the extra stats when enabled. This ensures that the same information isn't presented in multiple places and makes future expansion of basic stats easier. Signed-off-by: Tejun Heo <[email protected]> Acked-by: Peter Zijlstra (Intel) <[email protected]>
2017-10-26livepatch: __klp_disable_patch() should never be called for disabled patchesPetr Mladek1-1/+4
__klp_disable_patch() should never be called when the patch is not enabled. Let's add the same warning that we have in __klp_enable_patch(). This allows to remove the check when calling klp_pre_unpatch_callback(). It was strange anyway because it repeatedly checked per-patch flag for each patched object. Signed-off-by: Petr Mladek <[email protected]> Acked-by: Joe Lawrence <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2017-10-26livepatch: Correctly call klp_post_unpatch_callback() in error pathsPetr Mladek2-6/+6
The post_unpatch_enabled flag in struct klp_callbacks is set when a pre-patch callback successfully executes, indicating that we need to call a corresponding post-unpatch callback when the patch is reverted. This is true for ordinary patch disable as well as the error paths of klp_patch_object() callers. As currently coded, we inadvertently execute the post-patch callback twice in klp_module_coming() when klp_patch_object() fails: - We explicitly call klp_post_unpatch_callback() for the failed object - We call it again for the same object (and all the others) via klp_cleanup_module_patches_limited() We should clear the flag in klp_post_unpatch_callback() to make sure that the callback is not called twice. It makes the API more safe. (We could have removed the callback from the former error path as it would be covered by the latter call, but I think that is is cleaner to clear the post_unpatch_enabled after its invoked. For example, someone might later decide to call the callback only when obj->patched flag is set.) There is another mistake in the error path of klp_coming_module() in which it skips the post-unpatch callback for the klp_transition_patch. However, the pre-patch callback was called even for this patch, so be sure to make the corresponding callbacks for all patches. Finally, I used this opportunity to make klp_pre_patch_callback() more readable. [[email protected]: incorporate changelog wording changes proposed by Joe Lawrence] Signed-off-by: Petr Mladek <[email protected]> Acked-by: Joe Lawrence <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2017-10-26sched/idle: Micro-optimize the idle loopCheng Jian1-1/+2
Move the loop-invariant calculation of 'cpu' in do_idle() out of the loop body, because the current CPU is always constant. This improves the generated code both on x86-64 and ARM64: x86-64: Before patch (execution in loop): 864: 0f ae e8 lfence 867: 65 8b 05 c2 38 f1 7e mov %gs:0x7ef138c2(%rip),%eax 86e: 89 c0 mov %eax,%eax 870: 48 0f a3 05 68 19 08 bt %rax,0x1081968(%rip) 877: 01 After patch (execution in loop): 872: 0f ae e8 lfence 875: 4c 0f a3 25 63 19 08 bt %r12,0x1081963(%rip) 87c: 01 ARM64: Before patch (execution in loop): c58: d5033d9f dsb ld c5c: d538d080 mrs x0, tpidr_el1 c60: b8606a61 ldr w1, [x19,x0] c64: 1100fc20 add w0, w1, #0x3f c68: 7100003f cmp w1, #0x0 c6c: 1a81b000 csel w0, w0, w1, lt c70: 13067c00 asr w0, w0, #6 c74: 93407c00 sxtw x0, w0 c78: f8607a80 ldr x0, [x20,x0,lsl #3] c7c: 9ac12401 lsr x1, x0, x1 c80: 36000581 tbz w1, #0, d30 <do_idle+0x128> After patch (execution in loop): c84: d5033d9f dsb ld c88: f9400260 ldr x0, [x19] c8c: ea14001f tst x0, x20 c90: 54000580 b.eq d40 <do_idle+0x138> Signed-off-by: Cheng Jian <[email protected]> [ Rewrote the title and the changelog. ] Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25workqueue: Remove now redundant lock acquisitions wrt. workqueue flushesByungchul Park1-16/+3
The workqueue code added manual lock acquisition annotations to catch deadlocks. After lockdepcrossrelease was introduced, some of those became redundant, since wait_for_completion() already does the acquisition and tracking. Remove the duplicate annotations. Signed-off-by: Byungchul Park <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25locking/lockdep: Introduce CONFIG_BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK=yByungchul Park1-0/+4
Add a Kconfig knob that enables the lockdep "crossrelease_fullstack" boot parameter. Suggested-by: Ingo Molnar <[email protected]> Signed-off-by: Byungchul Park <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25locking/lockdep: Add a boot parameter allowing unwind in cross-release and ↵Byungchul Park1-2/+17
disable it by default Johan Hovold reported a heavy performance regression caused by lockdep cross-release: > Boot time (from "Linux version" to login prompt) had in fact doubled > since 4.13 where it took 17 seconds (with my current config) compared to > the 35 seconds I now see with 4.14-rc4. > > I quick bisect pointed to lockdep and specifically the following commit: > > 28a903f63ec0 ("locking/lockdep: Handle non(or multi)-acquisition > of a crosslock") > > which I've verified is the commit which doubled the boot time (compared > to 28a903f63ec0^) (added by lockdep crossrelease series [1]). Currently cross-release performs unwind on every acquisition, but that is very expensive. This patch makes unwind optional and disables it by default and only records acquire_ip. Full stack traces are sometimes required for full analysis, in which case a boot paramter, crossrelease_fullstack, can be specified. On my qemu Ubuntu machine (x86_64, 4 cores, 512M), the regression was fixed. We measure boot times with 'perf stat --null --repeat 10 $QEMU', where $QEMU launches a kernel with init=/bin/true: 1. No lockdep enabled: Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs): 2.756558155 seconds time elapsed ( +- 0.09% ) 2. Lockdep enabled: Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs): 2.968710420 seconds time elapsed ( +- 0.12% ) 3. Lockdep enabled + cross-release enabled: Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs): 3.153839636 seconds time elapsed ( +- 0.31% ) 4. Lockdep enabled + cross-release enabled + this patch applied: Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs): 2.963669551 seconds time elapsed ( +- 0.11% ) I.e. lockdep cross-release performance is now indistinguishable from vanilla lockdep. Bisected-by: Johan Hovold <[email protected]> Analyzed-by: Thomas Gleixner <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Reported-by: Johan Hovold <[email protected]> Signed-off-by: Byungchul Park <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland8-11/+11
to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25locking/atomics, workqueue: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()Mark Rutland1-2/+2
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't currently harmful. However, for some features it is necessary to instrument reads and writes separately, which is not possible with ACCESS_ONCE(). This distinction is critical to correct operation. It's possible to transform the bulk of kernel code using the Coccinelle script below. However, this doesn't handle comments, leaving references to ACCESS_ONCE() instances which have been removed. As a preparatory step, this patch converts the workqueue code and comments to use {READ,WRITE}_ONCE() consistently. ---- virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Acked-by: Tejun Heo <[email protected]> Cc: Lai Jiangshan <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2017-10-25locking/qrwlock: Prevent slowpath writers getting held up by fastpathWill Deacon1-15/+5
When a prospective writer takes the qrwlock locking slowpath due to the lock being held, it attempts to cmpxchg the wmode field from 0 to _QW_WAITING so that concurrent lockers also take the slowpath and queue on the spinlock accordingly, allowing the lockers to drain. Unfortunately, this isn't fair, because a fastpath writer that comes in after the lock is made available but before the _QW_WAITING flag is set can effectively jump the queue. If there is a steady stream of prospective writers, then the waiter will be held off indefinitely. This patch restores fairness by separating _QW_WAITING and _QW_LOCKED into two distinct fields: _QW_LOCKED continues to occupy the bottom byte of the lockword so that it can be cleared unconditionally when unlocking, but _QW_WAITING now occupies what used to be the bottom bit of the reader count. This then forces the slow-path for concurrent lockers. Tested-by: Waiman Long <[email protected]> Tested-by: Jeremy Linton <[email protected]> Tested-by: Adam Wallis <[email protected]> Tested-by: Jan Glauber <[email protected]> Signed-off-by: Will Deacon <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Boqun Feng <[email protected]> Cc: [email protected] Cc: Linus Torvalds <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>