aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2016-06-28Revert "HID: multitouch: enable palm rejection if device implements ↵Allen Hung1-5/+0
confidence usage" This reverts commit 25a84db15b3f ("HID: multitouch: enable palm rejection if device implements confidence usage") The commit enables palm rejection for Win8 Precision Touchpad devices but the quirk MT_QUIRK_VALID_IS_CONFIDENCE it is using is not working very properly. This quirk is originally designed for some WIn7 touchscreens. Use of this for a Win8 Precision Touchpad will cause unexpected pointer jumping problem. Cc: [email protected] # v4.5+ Reviewed-by: Benjamin Tissoires <[email protected]> Tested-by: Andy Lutomirski <[email protected]> # XPS 13 9350, BIOS 1.4.3 Signed-off-by: Allen Hung <[email protected]> Signed-off-by: Jiri Kosina <[email protected]>
2016-06-28powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()Gavin Shan1-1/+1
When calling eeh_rmv_device() in eeh_reset_device() for partial hotplug case, @rmv_data instead of its address is the proper argument. Otherwise, the stack frame is corrupted when writing to @rmv_data (actually its address) in eeh_rmv_device(). It results in kernel crash as observed. This fixes the issue by passing @rmv_data, not its address to eeh_rmv_device() in eeh_reset_device(). Fixes: 67086e32b564 ("powerpc/eeh: powerpc/eeh: Support error recovery for VF PE") Reported-by: Pridhiviraj Paidipeddi <[email protected]> Signed-off-by: Gavin Shan <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2016-06-28mac80211: Fix mesh estab_plinks counting in STA removal caseJouni Malinen1-2/+5
If a user space program (e.g., wpa_supplicant) deletes a STA entry that is currently in NL80211_PLINK_ESTAB state, the number of established plinks counter was not decremented and this could result in rejecting new plink establishment before really hitting the real maximum plink limit. For !user_mpm case, this decrementation is handled by mesh_plink_deactive(). Fix this by decrementing estab_plinks on STA deletion (mesh_sta_cleanup() gets called from there) so that the counter has a correct value and the Beacon frame advertisement in Mesh Configuration element shows the proper value for capability to accept additional peers. Cc: [email protected] Signed-off-by: Jouni Malinen <[email protected]> Signed-off-by: Johannes Berg <[email protected]>
2016-06-28net/mlx5: use mlx5_buf_alloc_node instead of mlx5_buf_alloc in mlx5_wq_ll_createWang Sheng-Hui1-7/+8
Commit 311c7c71c9bb ("net/mlx5e: Allocate DMA coherent memory on reader NUMA node") introduced mlx5_*_alloc_node() but missed changing some calling and warn messages. This patch introduces 2 changes: * Use mlx5_buf_alloc_node() instead of mlx5_buf_alloc() in mlx5_wq_ll_create() * Update the failure warn messages with _node postfix for mlx5_*_alloc function names Fixes: 311c7c71c9bb ("net/mlx5e: Allocate DMA coherent memory on reader NUMA node") Signed-off-by: Wang Sheng-Hui <[email protected]> Acked-By: Saeed Mahameed <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-28Merge branch 'bgmac-fixes'David S. Miller1-3/+5
Florian Fainelli says: ==================== net: bgmac: Random fixes This patch series fixes a few issues spotted by code inspection and actual testing. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-06-28net: bgmac: Remove superflous netif_carrier_on()Florian Fainelli1-2/+0
bgmac_open() calls phy_start() to initialize the PHY state machine, which will set the interface's carrier state accordingly, no need to force that as this could be conflicting with the PHY state determined by PHYLIB. Fixes: dd4544f05469 ("bgmac: driver for GBit MAC core on BCMA bus") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-28net: bgmac: Start transmit queue in bgmac_openFlorian Fainelli1-0/+3
The driver does not start the transmit queue in bgmac_open(). If the queue was stopped prior to closing then re-opening the interface, we would never be able to wake-up again. Fixes: dd4544f05469 ("bgmac: driver for GBit MAC core on BCMA bus") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-28net: bgmac: Fix SOF bit checkingFlorian Fainelli1-2/+3
We are checking for the Start of Frame bit in the ctl1 word, while this bit is set in the ctl0 word instead. Read the ctl0 word and update the check to verify that. Fixes: 9cde94506eac ("bgmac: implement scatter/gather support") Signed-off-by: Florian Fainelli <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-28bonding: fix 802.3ad aggregator reselectionJay Vosburgh1-19/+45
Since commit 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to be updated."), the logic in bonding to handle selection between multiple aggregators has not functioned. This affects only configurations wherein the bonding slaves connect to two discrete aggregators (e.g., two independent switches, each with LACP enabled), thus creating two separate aggregation groups within a single bond. The cause is a change in 7bb11dc9f59d to no longer set AD_PORT_BEGIN on a port after a link state change, which would cause the port to be reselected for attachment to an aggregator as if were newly added to the bond. We cannot restore the prior behavior, as it contradicts IEEE 802.1AX 5.4.12, which requires ports that "become inoperable" (lose carrier, setting port_enabled=false as per 802.1AX 5.4.7) to remain selected (i.e., assigned to the aggregator). As the port now remains selected, the aggregator selection logic is not invoked. A side effect of this change is that aggregators in bonding will now contain ports that are link down. The aggregator selection logic does not currently handle this situation correctly, causing incorrect aggregator selection. This patch makes two changes to repair the aggregator selection logic in bonding to function as documented and within the confines of the standard: First, the aggregator selection and related logic now utilizes the number of active ports per aggregator, not the number of selected ports (as some selected ports may be down). The ad_select "bandwidth" and "count" options only consider ports that are link up. Second, on any carrier state change of any slave, the aggregator selection logic is explicitly called to insure the correct aggregator is active. Reported-by: Veli-Matti Lintu <[email protected]> Fixes: 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to be updated.") Signed-off-by: Jay Vosburgh <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-28ipmr/ip6mr: Initialize the last assert time of mfc entries.Tom Goff2-1/+4
This fixes wrong-interface signaling on 32-bit platforms for entries created when jiffies > 2^31 + MFC_ASSERT_THRESH. Signed-off-by: Tom Goff <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-28s390: fix test_fp_ctl inline assembly contraintsMartin Schwidefsky1-1/+1
The test_fp_ctl function is used to test if a given value is a valid floating-point control. The inline assembly in test_fp_ctl uses an incorrect constraint for the 'orig_fpc' variable. If the compiler chooses the same register for 'fpc' and 'orig_fpc' the test_fp_ctl() function always returns true. This allows user space to trigger kernel oopses with invalid floating-point control values on the signal stack. This problem has been introduced with git commit 4725c86055f5bbdcdf "s390: fix save and restore of the floating-point-control register" Cc: [email protected] # v3.13+ Reviewed-by: Heiko Carstens <[email protected]> Signed-off-by: Martin Schwidefsky <[email protected]>
2016-06-28Revert "s390/kdump: Clear subchannel ID to signal non-CCW/SCSI IPL"Michael Holzheu1-7/+0
This reverts commit 852ffd0f4e23248b47531058e531066a988434b5. There are use cases where an intermediate boot kernel (1) uses kexec to boot the final production kernel (2). For this scenario we should provide the original boot information to the production kernel (2). Therefore clearing the boot information during kexec() should not be done. Cc: [email protected] # v3.17+ Reported-by: Steffen Maier <[email protected]> Signed-off-by: Michael Holzheu <[email protected]> Reviewed-by: Heiko Carstens <[email protected]> Signed-off-by: Martin Schwidefsky <[email protected]>
2016-06-28arc: unwind: warn only once if DW2_UNWIND is disabledAlexey Brodkin1-1/+1
If CONFIG_ARC_DW2_UNWIND is disabled every time arc_unwind_core() gets called following message gets printed in debug console: ----------------->8--------------- CONFIG_ARC_DW2_UNWIND needs to be enabled ----------------->8--------------- That message makes sense if user indeed wants to see a backtrace or get nice function call-graphs in perf but what if user disabled unwinder for the purpose? Why pollute his debug console? So instead we'll warn user about possibly missing feature once and let him decide if that was what he or she really wanted. Signed-off-by: Alexey Brodkin <[email protected]> Cc: [email protected] Signed-off-by: Vineet Gupta <[email protected]>
2016-06-28ARC: unwind: ensure that .debug_frame is generated (vs. .eh_frame)Vineet Gupta1-2/+0
With recent binutils update to support dwarf CFI pseudo-ops in gas, we now get .eh_frame vs. .debug_frame. Although the call frame info is exactly the same in both, the CIE differs, which the current kernel unwinder can't cope with. This broke both the kernel unwinder as well as loadable modules (latter because of a new unhandled relo R_ARC_32_PCREL from .rela.eh_frame in the module loader) The ideal solution would be to switch unwinder to .eh_frame. For now however we can make do by just ensureing .debug_frame is generated by removing -fasynchronous-unwind-tables .eh_frame generated with -gdwarf-2 -fasynchronous-unwind-tables .debug_frame generated with -gdwarf-2 Fixes STAR 9001058196 Cc: [email protected] Signed-off-by: Vineet Gupta <[email protected]>
2016-06-27Merge tag 'for-v4.7-rc' of ↵Linus Torvalds2-12/+21
git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply Pull power supply fixes from Sebastian Reichel. * tag 'for-v4.7-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply: power_supply: tps65217-charger: Fix NULL deref during property export power_supply: power_supply_read_temp only if use_cnt > 0
2016-06-27Merge branch 'for-linus' of ↵Linus Torvalds7-46/+70
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input fixes from Dmitry Torokhov. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: vmmouse - remove port reservation Input: elantech - add more IC body types to the list Input: wacom_w8001 - ignore invalid pen data packets Input: wacom_w8001 - w8001_MAX_LENGTH should be 13 Input: xpad - fix oops when attaching an unknown Xbox One gamepad MAINTAINERS: add Pali Rohár as reviewer of ALPS PS/2 touchpad driver Input: add HDMI CEC specific keycodes Input: add BUS_CEC type Input: xpad - fix rumble on Xbox One controllers with 2015 firmware
2016-06-28cpufreq: Avoid false-positive WARN_ON()s in cpufreq_update_policy()Rafael J. Wysocki1-0/+4
CPU notifications from the firmware coming in when cpufreq is suspended cause cpufreq_update_current_freq() to return 0 which triggers the WARN_ON() in cpufreq_update_policy() for no reason. Avoid that by checking cpufreq_suspended before calling cpufreq_update_current_freq(). Fixes: c9d9c929e674 (cpufreq: Abort cpufreq_update_current_freq() for cpufreq_suspended set) Signed-off-by: Rafael J. Wysocki <[email protected]> Acked-by: Viresh Kumar <[email protected]> Cc: 4.6+ <[email protected]> # 4.6+
2016-06-27cpufreq: dt: call of_node_put() before error outMasahiro Yamada1-3/+4
If of_match_node() fails, this init function bails out without calling of_node_put(). Also change of_node_put(of_root) to of_node_put(np); both of them hold the same pointer, but it seems better to call of_node_put() against the node returned by of_find_node_by_path(). Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-06-27intel_pstate: Do not clear utilization update hooks on policy changesRafael J. Wysocki1-2/+3
intel_pstate_set_policy() is invoked by the cpufreq core during driver initialization, on changes of policy attributes (minimim and maximum frequency, for example) via sysfs and via CPU notifications from the platform firmware. On some platforms the latter may occur relatively often. Commit bb6ab52f2bef (intel_pstate: Do not set utilization update hook too early) made intel_pstate_set_policy() clear the CPU's utilization update hook before updating the policy attributes for it (and set the hook again after doind that), but that involves invoking synchronize_sched() and adds overhead to the CPU notifications mentioned above and to the sched-RCU handling in general. That extra overhead is arguably not necessary, because updating policy attributes when the CPU's utilization update hook is active should not lead to any adverse effects, so drop the clearing of the hook from intel_pstate_set_policy() and make it check if the hook has been set already when attempting to set it. Fixes: bb6ab52f2bef (intel_pstate: Do not set utilization update hook too early) Reported-by: Jisheng Zhang <[email protected]> Tested-by: Jisheng Zhang <[email protected]> Tested-by: Doug Smythies <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2016-06-27Merge branch 'rc-fixes' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild regression fix from Michal Marek: "The problem is that commit 9c8fa9bc08f6 ("kbuild: fix if_change and friends to consider argument order") fixed a potential missed rebuild, but this results in unnnecessary rebuilds with the packaging targets. Which is still more correct than the previous logic, but also very annoying" * 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kbuild: Initialize exported variables
2016-06-27ALSA: echoaudio: Fix memory allocationChristophe JAILLET1-2/+2
'commpage_bak' is allocated with 'sizeof(struct echoaudio)' bytes. We then copy 'sizeof(struct comm_page)' bytes in it. On my system, smatch complains because one is 2960 and the other is 3072. This would result in memory corruption or a oops. Signed-off-by: Christophe JAILLET <[email protected]> Cc: <[email protected]> Signed-off-by: Takashi Iwai <[email protected]>
2016-06-27dax: fix offset overflow in dax_ioEric Sandeen1-1/+6
This isn't functionally apparent for some reason, but when we test io at extreme offsets at the end of the loff_t rang, such as in fstests xfs/071, the calculation of "max" in dax_io() can be wrong due to pos + size overflowing. For example, # xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file enters dax_io with: start 0x7ffffffffffff000 end 0x7ffffffffffff200 and the rounded up "size" variable is 0x1000. This yields: pos + size 0x8000000000000000 (overflows loff_t) end 0x7ffffffffffff200 Due to the overflow, the min() function picks the wrong value for the "max" variable, and when we send (max - pos) into i.e. copy_from_iter_pmem() it is also the wrong value. This somehow(tm) gets magically absorbed without incident, probably because iter->count is correct. But it seems best to fix it up properly by comparing the two values as unsigned. Signed-off-by: Eric Sandeen <[email protected]> Signed-off-by: Dan Williams <[email protected]>
2016-06-27Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds9-52/+124
Pull cifs fixes from Steve French: "Various small cifs/smb3 fixes, include some for stable, and some from the recent SMB3 test event" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: File names with trailing period or space need special case conversion Fix reconnect to not defer smb3 session reconnect long after socket reconnect cifs: check hash calculating succeeded cifs: dynamic allocation of ntlmssp blob cifs: use CIFS_MAX_DOMAINNAME_LEN when converting the domain name cifs: stuff the fl_owner into "pid" field in the lock request
2016-06-27Merge branch 'linus' of ↵Linus Torvalds5-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes the following issues: - Missing length check for user-space GETALG request - Bogus memmove length in ux500 driver - Incorrect priority setting for vmx driver - Incorrect ABI selection for vmx driver" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: user - re-add size check for CRYPTO_MSG_GETALG crypto: ux500 - memmove the right size crypto: vmx - Increase priority of aes-cbc cipher crypto: vmx - Fix ABI detection
2016-06-27USB: don't free bandwidth_mutex too earlyAlan Stern1-10/+7
The USB core contains a bug that can show up when a USB-3 host controller is removed. If the primary (USB-2) hcd structure is released before the shared (USB-3) hcd, the core will try to do a double-free of the common bandwidth_mutex. The problem was described in graphical form by Chung-Geol Kim, who first reported it: ================================================= At *remove USB(3.0) Storage sequence <1> --> <5> ((Problem Case)) ================================================= VOLD ------------------------------------|------------ (uevent) ________|_________ |<1> | |dwc3_otg_sm_work | |usb_put_hcd | |peer_hcd(kref=2)| |__________________| ________|_________ |<2> | |New USB BUS #2 | | | |peer_hcd(kref=1) | | | --(Link)-bandXX_mutex| | |__________________| | ___________________ | |<3> | | |dwc3_otg_sm_work | | |usb_put_hcd | | |primary_hcd(kref=1)| | |___________________| | _________|_________ | |<4> | | |New USB BUS #1 | | |hcd_release | | |primary_hcd(kref=0)| | | | | |bandXX_mutex(free) |<- |___________________| (( VOLD )) ______|___________ |<5> | | SCSI | |usb_put_hcd | |peer_hcd(kref=0) | |*hcd_release | |bandXX_mutex(free*)|<- double free |__________________| ================================================= This happens because hcd_release() frees the bandwidth_mutex whenever it sees a primary hcd being released (which is not a very good idea in any case), but in the course of releasing the primary hcd, it changes the pointers in the shared hcd in such a way that the shared hcd will appear to be primary when it gets released. This patch fixes the problem by changing hcd_release() so that it deallocates the bandwidth_mutex only when the _last_ hcd structure referencing it is released. The patch also removes an unnecessary test, so that when an hcd is released, both the shared_hcd and primary_hcd pointers in the hcd's peer will be cleared. Signed-off-by: Alan Stern <[email protected]> Reported-by: Chung-Geol Kim <[email protected]> Tested-by: Chung-Geol Kim <[email protected]> CC: <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
2016-06-27vsock: make listener child lock ordering explicitStefan Hajnoczi1-2/+10
There are several places where the listener and pending or accept queue child sockets are accessed at the same time. Lockdep is unhappy that two locks from the same class are held. Tell lockdep that it is safe and document the lock ordering. Originally Claudio Imbrenda <[email protected]> sent a similar patch asking whether this is safe. I have audited the code and also covered the vsock_pending_work() function. Suggested-by: Claudio Imbrenda <[email protected]> Signed-off-by: Stefan Hajnoczi <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-27ipv6: enforce egress device match in per table nexthop lookupsPaolo Abeni1-1/+1
with the commit 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups"), net hop lookup is first performed on route creation in the passed-in table. However device match is not enforced in table lookup, so the found route can be later discarded due to egress device mismatch and no global lookup will be performed. This cause the following to fail: ip link add dummy1 type dummy ip link add dummy2 type dummy ip link set dummy1 up ip link set dummy2 up ip route add 2001:db8:8086::/48 dev dummy1 metric 20 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy1 metric 20 ip route add 2001:db8:8086::/48 dev dummy2 metric 21 ip route add 2001:db8:d34d::/64 via 2001:db8:8086::2 dev dummy2 metric 21 RTNETLINK answers: No route to host This change fixes the issue enforcing device lookup in ip6_nh_lookup_table() v1->v2: updated commit message title Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups") Reported-and-tested-by: Beniamino Galvani <[email protected]> Signed-off-by: Paolo Abeni <[email protected]> Acked-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2016-06-27Merge tag 'linux-can-fixes-for-4.7-20160623' of ↵David S. Miller3-1/+18
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2016-06-23 this is a pull request of 3 patches for the upcoming linux-4.7 release. The first two patches are by Oliver Hartkopp fixing oopes in the generic CAN device netlink handling. Jimmy Assarsson's patch for the kvaser_usb driver adds support for more devices by adding their USB product ids. ==================== Signed-off-by: David S. Miller <[email protected]>
2016-06-27KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode.Quentin Casasnovas1-12/+11
I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was getting a GP(0) on a rather unspecial vmread from Xen: (XEN) ----[ Xen-4.7.0-rc x86_64 debug=n Not tainted ]---- (XEN) CPU: 1 (XEN) RIP: e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450 (XEN) RFLAGS: 0000000000010202 CONTEXT: hypervisor (d1v0) (XEN) rax: ffff82d0801e6288 rbx: ffff83003ffbfb7c rcx: fffffffffffab928 (XEN) rdx: 0000000000000000 rsi: 0000000000000000 rdi: ffff83000bdd0000 (XEN) rbp: ffff83000bdd0000 rsp: ffff83003ffbfab0 r8: ffff830038813910 (XEN) r9: ffff83003faf3958 r10: 0000000a3b9f7640 r11: ffff83003f82d418 (XEN) r12: 0000000000000000 r13: ffff83003ffbffff r14: 0000000000004802 (XEN) r15: 0000000000000008 cr0: 0000000080050033 cr4: 00000000001526e0 (XEN) cr3: 000000003fc79000 cr2: 0000000000000000 (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008 (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450): (XEN) 00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00 (XEN) Xen stack trace from rsp=ffff83003ffbfab0: ... (XEN) Xen call trace: (XEN) [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450 (XEN) [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300 (XEN) [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60 (XEN) [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70 (XEN) [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0 (XEN) [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0 (XEN) [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0 (XEN) [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270 (XEN) [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0 (XEN) [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90 (XEN) [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140 (XEN) [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140 (XEN) [<ffff82d080164aeb>] context_switch+0x13b/0xe40 (XEN) [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570 (XEN) [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90 (XEN) [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50 (XEN) (XEN) (XEN) **************************************** (XEN) Panic on CPU 1: (XEN) GENERAL PROTECTION FAULT (XEN) [error_code=0000] (XEN) **************************************** Tracing my host KVM showed it was the one injecting the GP(0) when emulating the VMREAD and checking the destination segment permissions in get_vmx_mem_address(): 3) | vmx_handle_exit() { 3) | handle_vmread() { 3) | nested_vmx_check_permission() { 3) | vmx_get_segment() { 3) 0.074 us | vmx_read_guest_seg_base(); 3) 0.065 us | vmx_read_guest_seg_selector(); 3) 0.066 us | vmx_read_guest_seg_ar(); 3) 1.636 us | } 3) 0.058 us | vmx_get_rflags(); 3) 0.062 us | vmx_read_guest_seg_ar(); 3) 3.469 us | } 3) | vmx_get_cs_db_l_bits() { 3) 0.058 us | vmx_read_guest_seg_ar(); 3) 0.662 us | } 3) | get_vmx_mem_address() { 3) 0.068 us | vmx_cache_reg(); 3) | vmx_get_segment() { 3) 0.074 us | vmx_read_guest_seg_base(); 3) 0.068 us | vmx_read_guest_seg_selector(); 3) 0.071 us | vmx_read_guest_seg_ar(); 3) 1.756 us | } 3) | kvm_queue_exception_e() { 3) 0.066 us | kvm_multiple_exception(); 3) 0.684 us | } 3) 4.085 us | } 3) 9.833 us | } 3) + 10.366 us | } Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine Control Structure", I found that we're enforcing that the destination operand is NOT located in a read-only data segment or any code segment when the L1 is in long mode - BUT that check should only happen when it is in protected mode. Shuffling the code a bit to make our emulation follow the specification allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests without problems. Fixes: f9eb4af67c9d ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions") Signed-off-by: Quentin Casasnovas <[email protected]> Cc: Eugene Korenevsky <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Radim Krčmář <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: H. Peter Anvin <[email protected]> Cc: linux-stable <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2016-06-27KVM: LAPIC: cap __delay at lapic_timer_advance_nsMarcelo Tosatti1-1/+2
The host timer which emulates the guest LAPIC TSC deadline timer has its expiration diminished by lapic_timer_advance_ns nanoseconds. Therefore if, at wait_lapic_expire, a difference larger than lapic_timer_advance_ns is encountered, delay at most lapic_timer_advance_ns. This fixes a problem where the guest can cause the host to delay for large amounts of time. Reported-by: Alan Jenkins <[email protected]> Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2016-06-27KVM: x86: move nsec_to_cycles from x86.c to x86.hMarcelo Tosatti2-6/+7
Move the inline function nsec_to_cycles from x86.c to x86.h, as the next patch uses it from lapic.c. Signed-off-by: Marcelo Tosatti <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2016-06-27pvclock: Get rid of __pvclock_read_cycles in function pvclock_read_flagsMinfei Huang1-2/+5
There is a generic function __pvclock_read_cycles to be used to get both flags and cycles. For function pvclock_read_flags, it's useless to get cycles value. To make this function be more effective, get this variable flags directly in function. Signed-off-by: Minfei Huang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2016-06-27pvclock: Cleanup to remove function pvclock_get_nsec_offsetMinfei Huang1-16/+7
Function __pvclock_read_cycles is short enough, so there is no need to have another function pvclock_get_nsec_offset to calculate tsc delta. It's better to combine it into function __pvclock_read_cycles. Remove useless variables in function __pvclock_read_cycles. Signed-off-by: Minfei Huang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2016-06-27pvclock: Add CPU barriers to get correct version valueMinfei Huang2-0/+6
Protocol for the "version" fields is: hypervisor raises it (making it uneven) before it starts updating the fields and raises it again (making it even) when it is done. Thus the guest can make sure the time values it got are consistent by checking the version before and after reading them. Add CPU barries after getting version value just like what function vread_pvclock does, because all of callees in this function is inline. Fixes: 502dfeff239e8313bfbe906ca0a1a6827ac8481b Cc: [email protected] Signed-off-by: Minfei Huang <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
2016-06-27make nfs_atomic_open() call d_drop() on all ->open_context() errors.Al Viro1-1/+1
In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code" unconditional d_drop() after the ->open_context() had been removed. It had been correct for success cases (there ->open_context() itself had been doing dcache manipulations), but not for error ones. Only one of those (ENOENT) got a compensatory d_drop() added in that commit, but in fact it should've been done for all errors. As it is, the case of O_CREAT non-exclusive open on a hashed negative dentry racing with e.g. symlink creation from another client ended up with ->open_context() getting an error and proceeding to call nfs_lookup(). On a hashed dentry, which would've instantly triggered BUG_ON() in d_materialise_unique() (or, these days, its equivalent in d_splice_alias()). Cc: [email protected] # v3.10+ Tested-by: Oleg Drokin <[email protected]> Signed-off-by: Al Viro <[email protected]> Signed-off-by: Trond Myklebust <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
2016-06-27iommu/amd: Initialize devid variable before using itNicolas Iooss1-1/+1
Commit 2a0cb4e2d423 ("iommu/amd: Add new map for storing IVHD dev entry type HID") added a call to DUMP_printk in init_iommu_from_acpi() which used the value of devid before this variable was initialized. Fixes: 2a0cb4e2d423 ('iommu/amd: Add new map for storing IVHD dev entry type HID') Signed-off-by: Nicolas Iooss <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2016-06-27iommu/vt-d: Fix overflow of iommu->domains arrayJan Niehusmann1-1/+1
The valid range of 'did' in get_iommu_domain(*iommu, did) is 0..cap_ndoms(iommu->cap), so don't exceed that range in free_all_cpu_cached_iovas(). The user-visible impact of the out-of-bounds access is the machine hanging on suspend-to-ram. It is, in fact, a kernel panic, but due to already suspended devices, that's often not visible to the user. Fixes: 22e2f9fa63b0 ("iommu/vt-d: Use per-cpu IOVA caching") Signed-off-by: Jan Niehusmann <[email protected]> Tested-By: Marius Vlad <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2016-06-27KVM: arm/arm64: Stop leaking vcpu pid referencesJames Morse1-0/+1
kvm provides kvm_vcpu_uninit(), which amongst other things, releases the last reference to the struct pid of the task that was last running the vcpu. On arm64 built with CONFIG_DEBUG_KMEMLEAK, starting a guest with kvmtool, then killing it with SIGKILL results (after some considerable time) in: > cat /sys/kernel/debug/kmemleak > unreferenced object 0xffff80007d5ea080 (size 128): > comm "lkvm", pid 2025, jiffies 4294942645 (age 1107.776s) > hex dump (first 32 bytes): > 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > backtrace: > [<ffff8000001b30ec>] create_object+0xfc/0x278 > [<ffff80000071da34>] kmemleak_alloc+0x34/0x70 > [<ffff80000019fa2c>] kmem_cache_alloc+0x16c/0x1d8 > [<ffff8000000d0474>] alloc_pid+0x34/0x4d0 > [<ffff8000000b5674>] copy_process.isra.6+0x79c/0x1338 > [<ffff8000000b633c>] _do_fork+0x74/0x320 > [<ffff8000000b66b0>] SyS_clone+0x18/0x20 > [<ffff800000085cb0>] el0_svc_naked+0x24/0x28 > [<ffffffffffffffff>] 0xffffffffffffffff On x86 kvm_vcpu_uninit() is called on the path from kvm_arch_destroy_vm(), on arm no equivalent call is made. Add the call to kvm_arch_vcpu_free(). Signed-off-by: James Morse <[email protected]> Fixes: 749cf76c5a36 ("KVM: ARM: Initial skeleton to compile KVM support") Cc: <[email protected]> # 3.10+ Acked-by: Marc Zyngier <[email protected]> Signed-off-by: Christoffer Dall <[email protected]>
2016-06-27iommu/iova: Disable preemption around use of this_cpu_ptr()Chris Wilson1-2/+6
Between acquiring the this_cpu_ptr() and using it, ideally we don't want to be preempted and work on another CPU's private data. this_cpu_ptr() checks whether or not preemption is disable, and get_cpu_ptr() provides a convenient wrapper for operating on the cpu ptr inside a preemption disabled critical section (which currently is provided by the spinlock). [ 167.997877] BUG: using smp_processor_id() in preemptible [00000000] code: usb-storage/216 [ 167.997940] caller is debug_smp_processor_id+0x17/0x20 [ 167.997945] CPU: 7 PID: 216 Comm: usb-storage Tainted: G U 4.7.0-rc1-gfxbench-RO_Patchwork_1057+ #1 [ 167.997948] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012 [ 167.997951] 0000000000000000 ffff880118b7f9c8 ffffffff8140dca5 0000000000000007 [ 167.997958] ffffffff81a3a7e9 ffff880118b7f9f8 ffffffff8142a927 0000000000000000 [ 167.997965] ffff8800d499ed58 0000000000000001 00000000000fffff ffff880118b7fa08 [ 167.997971] Call Trace: [ 167.997977] [<ffffffff8140dca5>] dump_stack+0x67/0x92 [ 167.997981] [<ffffffff8142a927>] check_preemption_disabled+0xd7/0xe0 [ 167.997985] [<ffffffff8142a947>] debug_smp_processor_id+0x17/0x20 [ 167.997990] [<ffffffff81507e17>] alloc_iova_fast+0xb7/0x210 [ 167.997994] [<ffffffff8150c55f>] intel_alloc_iova+0x7f/0xd0 [ 167.997998] [<ffffffff8151021d>] intel_map_sg+0xbd/0x240 [ 167.998002] [<ffffffff810e5efd>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 167.998009] [<ffffffff81596059>] usb_hcd_map_urb_for_dma+0x4b9/0x5a0 [ 167.998013] [<ffffffff81596d19>] usb_hcd_submit_urb+0xe9/0xaa0 [ 167.998017] [<ffffffff810cff2f>] ? mark_held_locks+0x6f/0xa0 [ 167.998022] [<ffffffff810d525c>] ? __raw_spin_lock_init+0x1c/0x50 [ 167.998025] [<ffffffff810e5efd>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 167.998028] [<ffffffff815988f3>] usb_submit_urb+0x3f3/0x5a0 [ 167.998032] [<ffffffff810d0082>] ? trace_hardirqs_on_caller+0x122/0x1b0 [ 167.998035] [<ffffffff81599ae7>] usb_sg_wait+0x67/0x150 [ 167.998039] [<ffffffff815dc202>] usb_stor_bulk_transfer_sglist.part.3+0x82/0xd0 [ 167.998042] [<ffffffff815dc29c>] usb_stor_bulk_srb+0x4c/0x60 [ 167.998045] [<ffffffff815dc42e>] usb_stor_Bulk_transport+0x17e/0x420 [ 167.998049] [<ffffffff815dcf32>] usb_stor_invoke_transport+0x242/0x540 [ 167.998052] [<ffffffff810e5efd>] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 167.998058] [<ffffffff815dba19>] usb_stor_transparent_scsi_command+0x9/0x10 [ 167.998061] [<ffffffff815de518>] usb_stor_control_thread+0x158/0x260 [ 167.998064] [<ffffffff815de3c0>] ? fill_inquiry_response+0x20/0x20 [ 167.998067] [<ffffffff815de3c0>] ? fill_inquiry_response+0x20/0x20 [ 167.998071] [<ffffffff8109ddfa>] kthread+0xea/0x100 [ 167.998078] [<ffffffff817ac6af>] ret_from_fork+0x1f/0x40 [ 167.998081] [<ffffffff8109dd10>] ? kthread_create_on_node+0x1f0/0x1f0 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96293 Signed-off-by: Chris Wilson <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Joerg Roedel <[email protected]> Cc: [email protected] Cc: [email protected] Fixes: 9257b4a206fc ('iommu/iova: introduce per-cpu caching to iova allocation') Signed-off-by: Joerg Roedel <[email protected]>
2016-06-27arm64: KVM: fix build with CONFIG_ARM_PMU disabledSudeep Holla1-2/+2
When CONFIG_ARM_PMU is disabled, we get the following build error: arch/arm64/kvm/sys_regs.c: In function 'pmu_counter_idx_valid': arch/arm64/kvm/sys_regs.c:564:27: error: 'ARMV8_PMU_CYCLE_IDX' undeclared (first use in this function) if (idx >= val && idx != ARMV8_PMU_CYCLE_IDX) ^ arch/arm64/kvm/sys_regs.c:564:27: note: each undeclared identifier is reported only once for each function it appears in arch/arm64/kvm/sys_regs.c: In function 'access_pmu_evcntr': arch/arm64/kvm/sys_regs.c:592:10: error: 'ARMV8_PMU_CYCLE_IDX' undeclared (first use in this function) idx = ARMV8_PMU_CYCLE_IDX; ^ arch/arm64/kvm/sys_regs.c: In function 'access_pmu_evtyper': arch/arm64/kvm/sys_regs.c:638:14: error: 'ARMV8_PMU_CYCLE_IDX' undeclared (first use in this function) if (idx == ARMV8_PMU_CYCLE_IDX) ^ arch/arm64/kvm/hyp/switch.c:86:15: error: 'ARMV8_PMU_USERENR_MASK' undeclared (first use in this function) write_sysreg(ARMV8_PMU_USERENR_MASK, pmuserenr_el0); This patch fixes the build with CONFIG_ARM_PMU disabled. Cc: Christoffer Dall <[email protected]> Cc: Marc Zyngier <[email protected]> Signed-off-by: Sudeep Holla <[email protected]> Signed-off-by: Christoffer Dall <[email protected]>
2016-06-27sched/fair: Rework throttle_count syncPeter Zijlstra2-21/+20
Since we already take rq->lock when creating a cgroup, use it to also sync the throttle_count and avoid the extra state and enqueue path branch. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] [ Fixed build warning. ] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27sched/core: Fix sched_getaffinity() return value kerneldoc commentZev Weiss1-1/+2
Previous version was probably written referencing the man page for glibc's wrapper, but the wrapper's behavior differs from that of the syscall itself in this case. Signed-off-by: Zev Weiss <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27powerpc/tm: Always reclaim in start_thread() for exec() class syscallsCyril Bur1-0/+10
Userspace can quite legitimately perform an exec() syscall with a suspended transaction. exec() does not return to the old process, rather it load a new one and starts that, the expectation therefore is that the new process starts not in a transaction. Currently exec() is not treated any differently to any other syscall which creates problems. Firstly it could allow a new process to start with a suspended transaction for a binary that no longer exists. This means that the checkpointed state won't be valid and if the suspended transaction were ever to be resumed and subsequently aborted (a possibility which is exceedingly likely as exec()ing will likely doom the transaction) the new process will jump to invalid state. Secondly the incorrect attempt to keep the transactional state while still zeroing state for the new process creates at least two TM Bad Things. The first triggers on the rfid to return to userspace as start_thread() has given the new process a 'clean' MSR but the suspend will still be set in the hardware MSR. The second TM Bad Thing triggers in __switch_to() as the processor is still transactionally suspended but __switch_to() wants to zero the TM sprs for the new process. This is an example of the outcome of calling exec() with a suspended transaction. Note the first 700 is likely the first TM bad thing decsribed earlier only the kernel can't report it as we've loaded userspace registers. c000000000009980 is the rfid in fast_exception_return() Bad kernel stack pointer 3fffcfa1a370 at c000000000009980 Oops: Bad kernel stack pointer, sig: 6 [#1] CPU: 0 PID: 2006 Comm: tm-execed Not tainted NIP: c000000000009980 LR: 0000000000000000 CTR: 0000000000000000 REGS: c00000003ffefd40 TRAP: 0700 Not tainted MSR: 8000000300201031 <SF,ME,IR,DR,LE,TM[SE]> CR: 00000000 XER: 00000000 CFAR: c0000000000098b4 SOFTE: 0 PACATMSCRATCH: b00000010000d033 GPR00: 0000000000000000 00003fffcfa1a370 0000000000000000 0000000000000000 GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR12: 00003fff966611c0 0000000000000000 0000000000000000 0000000000000000 NIP [c000000000009980] fast_exception_return+0xb0/0xb8 LR [0000000000000000] (null) Call Trace: Instruction dump: f84d0278 e9a100d8 7c7b03a6 e84101a0 7c4ff120 e8410170 7c5a03a6 e8010070 e8410080 e8610088 e8810090 e8210078 <4c000024> 48000000 e8610178 88ed023b Kernel BUG at c000000000043e80 [verbose debug info unavailable] Unexpected TM Bad Thing exception at c000000000043e80 (msr 0x201033) Oops: Unrecoverable exception, sig: 6 [#2] CPU: 0 PID: 2006 Comm: tm-execed Tainted: G D task: c0000000fbea6d80 ti: c00000003ffec000 task.ti: c0000000fb7ec000 NIP: c000000000043e80 LR: c000000000015a24 CTR: 0000000000000000 REGS: c00000003ffef7e0 TRAP: 0700 Tainted: G D MSR: 8000000300201033 <SF,ME,IR,DR,RI,LE,TM[SE]> CR: 28002828 XER: 00000000 CFAR: c000000000015a20 SOFTE: 0 PACATMSCRATCH: b00000010000d033 GPR00: 0000000000000000 c00000003ffefa60 c000000000db5500 c0000000fbead000 GPR04: 8000000300001033 2222222222222222 2222222222222222 00000000ff160000 GPR08: 0000000000000000 800000010000d033 c0000000fb7e3ea0 c00000000fe00004 GPR12: 0000000000002200 c00000000fe00000 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 c0000000fbea7410 00000000ff160000 GPR24: c0000000ffe1f600 c0000000fbea8700 c0000000fbea8700 c0000000fbead000 GPR28: c000000000e20198 c0000000fbea6d80 c0000000fbeab680 c0000000fbea6d80 NIP [c000000000043e80] tm_restore_sprs+0xc/0x1c LR [c000000000015a24] __switch_to+0x1f4/0x420 Call Trace: Instruction dump: 7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8 4e800020 e80304a8 7c0023a6 e80304b0 <7c0223a6> e80304b8 7c0123a6 4e800020 This fixes CVE-2016-5828. Fixes: bc2a9408fa65 ("powerpc: Hook in new transactional memory code") Cc: [email protected] # v3.9+ Signed-off-by: Cyril Bur <[email protected]> Signed-off-by: Michael Ellerman <[email protected]>
2016-06-27sched/fair: Reorder cgroup creation codePeter Zijlstra3-4/+21
A future patch needs rq->lock held _after_ we link the task_group into the hierarchy. In order to avoid taking every rq->lock twice, reorder things a little and create online_fair_sched_group() to be called after we link the task_group. All this code is still ran from css_alloc() so css_online() isn't in fact used for this. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Konstantin Khlebnikov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27sched/fair: Apply more PELT fixesPeter Zijlstra1-4/+49
One additional 'rule' for using update_cfs_rq_load_avg() is that one should call update_tg_load_avg() if it returns true. Add a bunch of comments to hopefully clarify some of the rules: o You need to update cfs_rq _before_ any entity attach/detach, this is important, because while for mathmatical consisency this isn't strictly needed, it is required for the physical interpretation of the model, you attach/detach _now_. o When you modify the cfs_rq avg, you have to then call update_tg_load_avg() in order to propagate changes upwards. o (Fair) entities are always attached, switched_{to,from}_fair() deal with !fair. This directly follows from the definition of the cfs_rq averages, namely that they are a direct sum of all (runnable or blocked) entities on that rq. It is the second rule that this patch enforces, but it adds comments pertaining to all of them. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27sched/fair: Fix PELT integrity for new tasksPeter Zijlstra3-15/+63
Vincent and Yuyang found another few scenarios in which entity tracking goes wobbly. The scenarios are basically due to the fact that new tasks are not immediately attached and thereby differ from the normal situation -- a task is always attached to a cfs_rq load average (such that it includes its blocked contribution) and are explicitly detached/attached on migration to another cfs_rq. Scenario 1: switch to fair class p->sched_class = fair_class; if (queued) enqueue_task(p); ... enqueue_entity() enqueue_entity_load_avg() migrated = !sa->last_update_time (true) if (migrated) attach_entity_load_avg() check_class_changed() switched_from() (!fair) switched_to() (fair) switched_to_fair() attach_entity_load_avg() If @p is a new task that hasn't been fair before, it will have !last_update_time and, per the above, end up in attach_entity_load_avg() _twice_. Scenario 2: change between cgroups sched_move_group(p) if (queued) dequeue_task() task_move_group_fair() detach_task_cfs_rq() detach_entity_load_avg() set_task_rq() attach_task_cfs_rq() attach_entity_load_avg() if (queued) enqueue_task(); ... enqueue_entity() enqueue_entity_load_avg() migrated = !sa->last_update_time (true) if (migrated) attach_entity_load_avg() Similar as with scenario 1, if @p is a new task, it will have !load_update_time and we'll end up in attach_entity_load_avg() _twice_. Furthermore, notice how we do a detach_entity_load_avg() on something that wasn't attached to begin with. As stated above; the problem is that the new task isn't yet attached to the load tracking and thereby violates the invariant assumption. This patch remedies this by ensuring a new task is indeed properly attached to the load tracking on creation, through post_init_entity_util_avg(). Of course, this isn't entirely as straightforward as one might think, since the task is hashed before we call wake_up_new_task() and thus can be poked at. We avoid this by adding TASK_NEW and teaching cpu_cgroup_can_attach() to refuse such tasks. Reported-by: Yuyang Du <[email protected]> Reported-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27sched/cgroup: Fix cpu_cgroup_fork() handlingVincent Guittot3-24/+67
A new fair task is detached and attached from/to task_group with: cgroup_post_fork() ss->fork(child) := cpu_cgroup_fork() sched_move_task() task_move_group_fair() Which is wrong, because at this point in fork() the task isn't fully initialized and it cannot 'move' to another group, because its not attached to any group as yet. In fact, cpu_cgroup_fork() needs a small part of sched_move_task() so we can just call this small part directly instead sched_move_task(). And the task doesn't really migrate because it is not yet attached so we need the following sequence: do_fork() sched_fork() __set_task_cpu() cgroup_post_fork() set_task_rq() # set task group and runqueue wake_up_new_task() select_task_rq() can select a new cpu __set_task_cpu post_init_entity_util_avg attach_task_cfs_rq() activate_task enqueue_task This patch makes that happen. Signed-off-by: Vincent Guittot <[email protected]> [ Added TASK_SET_GROUP to set depth properly. ] Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27sched/fair: Fix PELT integrity for new groupsPeter Zijlstra1-0/+10
Vincent reported that when a new task is moved into a new cgroup it gets attached twice to the load tracking: sched_move_task() task_move_group_fair() detach_task_cfs_rq() set_task_rq() attach_task_cfs_rq() attach_entity_load_avg() se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0 enqueue_entity() enqueue_entity_load_avg() update_cfs_rq_load_avg() now = clock() __update_load_avg(&cfs_rq->avg) cfs_rq->avg.last_load_update = now // ages load/util for: now - 0, load/util -> 0 if (migrated) attach_entity_load_avg() se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0 The problem is that we don't update cfs_rq load_avg before all entity attach/detach operations. Only enqueue_task() and migrate_task() do this. By fixing this, the above will not happen, because the sched_move_task() attach will have updated cfs_rq's last_load_update time before attach, and in turn the attach will have set the entity's last_load_update stamp. Note that there is a further problem with sched_move_task() calling detach on a task that hasn't yet been attached; this will be taken care of in a subsequent patch. Reported-by: Vincent Guittot <[email protected]> Tested-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Yuyang Du <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27sched/fair: Fix and optimize the fork() pathPeter Zijlstra2-26/+17
The task_fork_fair() callback already calls __set_task_cpu() and takes rq->lock. If we move the sched_class::task_fork callback in sched_fork() under the existing p->pi_lock, right after its set_task_cpu() call, we can avoid doing two such calls and omit the IRQ disabling on the rq->lock. Change to __set_task_cpu() to skip the migration bits, this is a new task, not a migration. Similarly, make wake_up_new_task() use __set_task_cpu() for the same reason, the task hasn't actually migrated as it hasn't ever ran. This cures the problem of calling migrate_task_rq_fair(), which does remove_entity_from_load_avg() on tasks that have never been added to the load avg to begin with. This bug would result in transiently messed up load_avg values, averaged out after a few dozen milliseconds. This is probably the reason why this bug was not found for such a long time. Reported-by: Vincent Guittot <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Mike Galbraith <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Signed-off-by: Ingo Molnar <[email protected]>
2016-06-27Merge branch 'sched/urgent' into sched/core, to pick up fixesIngo Molnar466-2419/+4140
Signed-off-by: Ingo Molnar <[email protected]>