Age | Commit message (Collapse) | Author | Files | Lines |
|
Reclaim pressure balance between anon and file pages is calculated
through a tuple of numerators and a shared denominator.
Exceptional cases that want to force-scan anon or file pages configure
the numerators and denominator such that one list is preferred, which is
not necessarily the most obvious way:
fraction[0] = 1;
fraction[1] = 0;
denominator = 1;
goto out;
Make this easier by making the force-scan cases explicit and use the
fractionals only in case they are calculated from reclaim history.
[[email protected]: avoid using unintialized_var()]
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix comment style and elaborate on why anonymous memory is force-scanned
when file cache runs low.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
A swappiness of 0 has a slightly different meaning for global reclaim
(may swap if file cache really low) and memory cgroup reclaim (never
swap, ever).
In addition, global reclaim at highest priority will scan all LRU lists
equal to their size and ignore other balancing heuristics. UNLESS
swappiness forbids swapping, then the lists are balanced based on recent
reclaim effectiveness. UNLESS file cache is running low, then anonymous
pages are force-scanned.
This (total mess of a) behaviour is implicit and not obvious from the
way the code is organized. At least make it apparent in the code flow
and document the conditions. It will be it easier to come up with sane
semantics later.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Satoru Moriya <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
amount of pages is scanned from the LRU lists on each iteration, to make
progress.
Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages
that are not there.
Empty LRU lists are quite common with memory cgroups in NUMA
environments because there exists a set of LRU lists for each zone for
each memory cgroup, while the memory of a single cgroup is expected to
stay on just one node. The number of expected empty LRU lists is thus
memcgs * (nodes - 1) * lru types
Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc. Avoid
that.
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Commit e9868505987a ("mm, vmscan: only evict file pages when we have
plenty") makes a point of not going for anonymous memory while there is
still enough inactive cache around.
The check was added only for global reclaim, but it is just as useful to
reduce swapping in memory cgroup reclaim:
200M-memcg-defconfig-j2
vanilla patched
Real time 454.06 ( +0.00%) 453.71 ( -0.08%)
User time 668.57 ( +0.00%) 668.73 ( +0.02%)
System time 128.92 ( +0.00%) 129.53 ( +0.46%)
Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%)
Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%)
Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%)
Major faults 681.50 ( +0.00%) 593.70 ( -12.86%)
THP faults 237.20 ( +0.00%) 242.40 ( +2.18%)
THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%)
THP splits 157.30 ( +0.00%) 161.40 ( +2.59%)
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Satoru Moriya <[email protected]>
Cc: Simon Jeons <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
As per documentation and other places calling putback_lru_pages(),
putback_lru_pages() is called on error only. Make the CMA code behave
consistently.
[[email protected]: remove a test-n-branch in the wrapup code]
Signed-off-by: Srinivas Pandruvada <[email protected]>
Acked-by: Michal Nazarewicz <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Bartlomiej Zolnierkiewicz <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Acked-by: Sha Zhengju <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Currently when a memcg oom is happening the oom dump messages is still
global state and provides few useful info for users. This patch prints
more pointed memcg page statistics for memcg-oom and take hierarchy into
consideration:
Based on Michal's advice, we take hierarchy into consideration: supppose
we trigger an OOM on A's limit
root_memcg
|
A (use_hierachy=1)
/ \
B C
|
D
then the printed info will be:
Memory cgroup stats for /A:...
Memory cgroup stats for /A/B:...
Memory cgroup stats for /A/C:...
Memory cgroup stats for /A/B/D:...
Following are samples of oom output:
(1) Before change:
mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
mal-80 cpuset=/ mems_allowed=0
Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
Call Trace:
[<ffffffff8167fbfb>] dump_header+0x83/0x1ca
..... (call trace)
[<ffffffff8168a818>] page_fault+0x28/0x30
<<<<<<<<<<<<<<<<<<<<< memcg specific information
Task in /A/B/D killed as a result of limit of /A
memory: usage 101376kB, limit 101376kB, failcnt 57
memory+swap: usage 101376kB, limit 101376kB, failcnt 0
kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
<<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
......
CPU 3: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 173
......
CPU 3: hi: 186, btch: 31 usd: 130
<<<<<<<<<<<<<<<<<<<<< print global page state
active_anon:92963 inactive_anon:40777 isolated_anon:0
active_file:33027 inactive_file:51718 isolated_file:0
unevictable:0 dirty:3 writeback:0 unstable:0
free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
mapped:20278 shmem:35971 pagetables:5885 bounce:0
free_cma:0
<<<<<<<<<<<<<<<<<<<<< print per zone page state
Node 0 DMA free:15836kB ... all_unreclaimable? no
lowmem_reserve[]: 0 3175 3899 3899
Node 0 DMA32 free:2888564kB ... all_unrelaimable? no
lowmem_reserve[]: 0 0 724 724
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
120710 total pagecache pages
0 pages in swap cache
<<<<<<<<<<<<<<<<<<<<< print global swap cache stat
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 499708kB
Total swap = 499708kB
1040368 pages RAM
58678 pages reserved
169065 pages shared
173632 pages non-shared
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 2693] 0 2693 6005 1324 17 0 0 god
[ 2754] 0 2754 6003 1320 16 0 0 god
[ 2811] 0 2811 5992 1304 18 0 0 god
[ 2874] 0 2874 6005 1323 18 0 0 god
[ 2935] 0 2935 8720 7742 21 0 0 mal-30
[ 2976] 0 2976 21520 17577 42 0 0 mal-80
Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
We can see that messages dumped by show_free_areas() are longsome and can
provide so limited info for memcg that just happen oom.
(2) After change
mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
mal-80 cpuset=/ mems_allowed=0
Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
Call Trace:
[<ffffffff8167fd0b>] dump_header+0x83/0x1d1
.......(call trace)
[<ffffffff8168a918>] page_fault+0x28/0x30
Task in /A/B/D killed as a result of limit of /A
<<<<<<<<<<<<<<<<<<<<< memcg specific information
memory: usage 102400kB, limit 102400kB, failcnt 140
memory+swap: usage 102400kB, limit 102400kB, failcnt 0
kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 2260] 0 2260 6006 1325 18 0 0 god
[ 2383] 0 2383 6003 1319 17 0 0 god
[ 2503] 0 2503 6004 1321 18 0 0 god
[ 2622] 0 2622 6004 1321 16 0 0 god
[ 2695] 0 2695 8720 7741 22 0 0 mal-30
[ 2704] 0 2704 21520 17839 43 0 0 mal-80
Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
This version provides more pointed info for memcg in "Memory cgroup stats
for XXX" section.
Signed-off-by: Sha Zhengju <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
Fix the warning:
drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
from include/linux/blkdev.h:216,
from drivers/md/persistent-data/dm-block-manager.h:11,
from drivers/md/persistent-data/dm-transaction-manager.h:10,
from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition
Cc: Alasdair Kergon <[email protected]>
Cc: Neil Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
"So from the depth of frozen Minnesota, here's the powerpc pull request
for 3.9. It has a few interesting highlights, in addition to the
usual bunch of bug fixes, minor updates, embedded device tree updates
and new boards:
- Hand tuned asm implementation of SHA1 (by Paulus & Michael
Ellerman)
- Support for Doorbell interrupts on Power8 (kind of fast
thread-thread IPIs) by Ian Munsie
- Long overdue cleanup of the way we handle relocation of our open
firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard
- Support for saving/restoring & context switching the PPR (Processor
Priority Register) on server processors that support it. This
allows the kernel to preserve thread priorities established by
userspace. By Haren Myneni.
- DAWR (new watchpoint facility) support on Power8 by Michael Neuling
- Ability to change the DSCR (Data Stream Control Register) which
controls cache prefetching on a running process via ptrace by
Alexey Kardashevskiy
- Support for context switching the TAR register on Power8 (new
branch target register meant to be used by some new specific
userspace perf event interrupt facility which is yet to be enabled)
by Ian Munsie.
- Improve preservation of the CFAR register (which captures the
origin of a branch) on various exception conditions by Paulus.
- Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
it belongs by Philippe De Muyter
- Support for Transactional Memory on Power8 by Michael Neuling
(based on original work by Matt Evans). For those curious about
the feature, the patch contains a pretty good description."
(See commit db8ff907027b: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
powerpc/kexec: Disable hard IRQ before kexec
powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
powerpc/85xx: bsc9131 - Correct typo in SDHC device node
powerpc/e500/qemu-e500: enable coreint
powerpc/mpic: allow coreint to be determined by MPIC version
powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
powerpc/85xx: Board support for ppa8548
powerpc/fsl: remove extraneous DIU platform functions
arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
powerpc: Documentation for transactional memory on powerpc
powerpc: Add transactional memory to pseries and ppc64 defconfigs
powerpc: Add config option for transactional memory
powerpc: Add transactional memory to POWER8 cpu features
powerpc: Add new transactional memory state to the signal context
powerpc: Hook in new transactional memory code
powerpc: Routines for FP/VSX/VMX unavailable during a transaction
powerpc: Add transactional memory unavaliable execption handler
powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
powerpc: Add FP/VSX and VMX register load functions for transactional memory
powerpc: Add helper functions for transactional memory context switching
...
|
|
* acpi-pm:
ACPI / PM: Take unusual configurations of power resources into account
|
|
Commit d2e5f0c (ACPI / PCI: Rework the setup and cleanup of device
wakeup) moved the initial disabling of system wakeup for PCI devices
into a place where it can actually work and that exposed a hidden old
issue with crap^Wunusual system designs where the same power
resources are used for both wakeup power and device power control at
run time.
Namely, say there is one power resource such that the ACPI power
state D0 of a PCI device depends on that power resource (i.e. the
device is in D0 when that power resource is "on") and it is used
as a wakeup power resource for the same device. Then, calling
acpi_pci_sleep_wake(pci_dev, false) for the device in question will
cause the reference counter of that power resource to drop to 0,
which in turn will cause it to be turned off. As a result, the
device will go into D3cold at that point, although it should have
stayed in D0.
As it turns out, that happens to USB controllers on some laptops
and USB becomes unusable on those machines as a result, which is
a major regression from v3.8.
To fix this problem, (1) increment the reference counters of wakup
power resources during their initialization if they are "on"
initially, (2) prevent acpi_disable_wakeup_device_power() from
decrementing the reference counters of wakeup power resources that
were not enabled for wakeup power previously, and (3) prevent
acpi_enable_wakeup_device_power() from incrementing the reference
counters of wakeup power resources that already are enabled for
wakeup power.
In addition to that, if it is impossible to determine the initial
states of wakeup power resources, avoid enabling wakeup for devices
whose wakeup power depends on those power resources.
Reported-by: Dave Jones <[email protected]>
Reported-by: Fabio Baltieri <[email protected]>
Tested-by: Fabio Baltieri <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Trivial, but WRITE SAME is an SBC command so it seems strange for a
related function (defined in target_core_sbc.c) to be in the spc_
namespace.
Signed-off-by: Roland Dreier <[email protected]>
Signed-off-by: Nicholas Bellinger <[email protected]>
|
|
The sock_diag_lock_handler() and sock_diag_unlock_handler() actually
make the code less readable. Get rid of them and make the lock usage
and access to sock_diag_handlers[] clear on the first sight.
Signed-off-by: Mathias Krause <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Userland can send a netlink message requesting SOCK_DIAG_BY_FAMILY
with a family greater or equal then AF_MAX -- the array size of
sock_diag_handlers[]. The current code does not test for this
condition therefore is vulnerable to an out-of-bound access opening
doors for a privilege escalation.
Signed-off-by: Mathias Krause <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
Signed-off-by: Kees Cook <[email protected]>
Cc: Stephen Hemminger <[email protected]>
Cc: David S. Miller <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The mlx4_en driver allocates the number of objects for the CPU affinity
reverse-map based on the number of rx rings of the device. However,
mlx4_assign_eq() calls irq_cpu_rmap_add() as many times as IRQ's are
assigned to EQ's, which can be as large as mlx4_dev->caps.comp_pool. If
caps.comp_pool is larger than rx_ring_num we will eventually hit the
BUG_ON() in cpu_rmap_add().
Fix this problem by allocating space for the maximum number of CPU
affinity reverse-map objects we might want to add.
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Acked-by: Amir Vadai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The memory to hold the network device tx_cq is not being allocated with
the correct size in mlx4_en_init_netdev(). It should use MAX_TX_RINGS
instead of MAX_RX_RINGS. This can cause problems if the number of tx
rings being used is greater than MAX_RX_RINGS.
Signed-off-by: Kleber Sacilotto de Souza <[email protected]>
Acked-by: Amir Vadai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Disable hard IRQ before kexec a new kernel image.
Not doing it can result in corrupted data in the memory segments
reserved for the new kernel.
Signed-off-by: Phileas Fogg <[email protected]>
CC: <[email protected]>
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lliubbo/blackfin
Pull small blackfin update from Bob Liu.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lliubbo/blackfin:
blackfin: time-ts: Remove duplicate assignment
blackfin: pm: fix build error
blackfin: sync data in blackfin write buffer
blackfin: use bitmap library functions
blackfin: mem_init: update dmc config register
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller.
The bulk of this is optimized page coping/clearing and cache flushing
(virtual caches are lovely) by John David Anglin.
* 'parisc-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: (31 commits)
arch/parisc/include/asm: use ARRAY_SIZE macro in mmzone.h
parisc: remove empty lines and unnecessary #ifdef coding in include/asm/signal.h
parisc: sendfile and sendfile64 syscall cleanups
parisc: switch to available compat_sched_rr_get_interval implementation
parisc: fix fallocate syscall
parisc: fix error return codes for rt_sigaction and rt_sigprocmask
parisc: convert msgrcv and msgsnd syscalls to use compat layer
parisc: correctly wire up mq_* functions for CONFIG_COMPAT case
parisc: fix personality on 32bit kernel
parisc: wire up process_vm_readv, process_vm_writev, kcmp and finit_module syscalls
parisc: led driver requires CONFIG_VM_EVENT_COUNTERS
parisc: remove unused compat_rt_sigframe.h header
parisc/mm/fault.c: Port OOM changes to do_page_fault
parisc: space register variables need to be in native length (unsigned long)
parisc: fix ptrace breakage
parisc: always detect multiple physical ranges
parisc: ensure that mmapped shared pages are aligned at SHMLBA addresses
parisc: disable preemption while flushing D- or I-caches through TMPALIAS region
parisc: remove IRQF_DISABLED
parisc: fixes and cleanups in page cache flushing (4/4)
...
|
|
It's safe only under namespace_sem or vfsmount_lock; all places
in fs/namespace.c that want mnt->mnt_ns->user_ns actually want to use
current->nsproxy->mnt_ns->user_ns (note the calls of check_mnt() in
there).
Cc: [email protected]
Signed-off-by: Al Viro <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull ia64 build breakage fix from Tony Luck.
* tag 'please-pull-fix-ia64-build' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
sched: move RR_TIMESLICE from sysctl.h to rt.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking changes from Ingo Molnar:
"The biggest change is the rwsem lock-steal improvements, both to the
assembly optimized and the spinlock based variants.
The other notable change is the clean up of the seqlock implementation
to be based on the seqcount infrastructure.
The rest is assorted smaller debuggability, cleanup and continued -rt
locking changes."
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rwsem-spinlock: Implement writer lock-stealing for better scalability
futex: Revert "futex: Mark get_robust_list as deprecated"
generic: Use raw local irq variant for generic cmpxchg
lockdep: Selftest: convert spinlock to raw spinlock
seqlock: Use seqcount infrastructure
seqlock: Remove unused functions
ntp: Make ntp_lock raw
intel_idle: Convert i7300_idle_lock to raw_spinlock
locking: Various static lock initializer fixes
lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
rwsem: Implement writer lock-stealing for better scalability
lockdep: Silence warning if CONFIG_LOCKDEP isn't set
watchdog: Use local_clock for get_timestamp()
lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
locking/stat: Fix a typo
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading update from Peter Anvin:
"This patchset lets us update the CPU microcode very, very early in
initialization if the BIOS fails to do so (never happens, right?)
This is handy for dealing with things like the Atom erratum where we
have to run without PSE because microcode loading happens too late.
As I mentioned in the x86/mm push request it depends on that
infrastructure but it is otherwise a standalone feature."
* 'x86/microcode' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Kconfig: Make early microcode loading a configuration feature
x86/mm/init.c: Copy ucode from initrd image to kernel memory
x86/head64.c: Early update ucode in 64-bit
x86/head_32.S: Early update ucode in 32-bit
x86/microcode_intel_early.c: Early update ucode on Intel's CPU
x86/tlbflush.h: Define __native_flush_tlb_global_irq_disabled()
x86/microcode_intel_lib.c: Early update ucode on Intel's CPU
x86/microcode_core_early.c: Define interfaces for early loading ucode
x86/common.c: load ucode in 64 bit or show loading ucode info in 32 bit on AP
x86/common.c: Make have_cpuid_p() a global function
x86/microcode_intel.h: Define functions and macros for early loading ucode
x86, doc: Documentation for early microcode loading
|
|
With commit 8170e6bed465 ("x86, 64bit: Use a #PF handler to materialize
early mappings on demand") we started hitting an early bootup crash
where the Xen hypervisor would inform us that:
(XEN) d7:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffea000005b2d0:
(XEN) L4[0x1d4] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 7 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-4.2.0 x86_64 debug=n Not tainted ]----
.. that Xen was unable to context switch back to dom0.
Looking at the calling stack we find:
[<ffffffff8103feba>] xen_get_user_pgd+0x5a <--
[<ffffffff8103feba>] xen_get_user_pgd+0x5a
[<ffffffff81042d27>] xen_write_cr3+0x77
[<ffffffff81ad2d21>] init_mem_mapping+0x1f9
[<ffffffff81ac293f>] setup_arch+0x742
[<ffffffff81666d71>] printk+0x48
We are trying to figure out whether we need to up-date the user PGD as
well. Please keep in mind that under 64-bit PV guests we have a limited
amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
and user-space. As such the Linux pvops'fied version of write_cr3
checks if it has to update the user-space cr3 as well.
That clearly is not needed during early bootup. The recent changes (see
above git commit) streamline the x86 page table allocation to be much
simpler (And also incidentally the #PF handler ends up in spirit being
similar to how the Xen toolstack sets up the initial page-tables).
The fix is to have an early-bootup version of cr3 that just loads the
kernel %cr3. The later version - which also handles user-page
modifications will be used after the initial page tables have been
setup.
[ hpa: removed a redundant #ifdef and made the new function __init.
Also note that x86-32 already has such an early xen_write_cr3. ]
Tested-by: "H. Peter Anvin" <[email protected]>
Reported-by: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
* acpi-cleanup:
ACPI / APEI: Fix crash in apei_hest_parse() for acpi=off
|
|
* pm-cpufreq:
imx6q-cpufreq: fix return value check in imx6q_cpufreq_probe()
PM / OPP: fix condition for empty of_init_opp_table()
|
|
* pm-cpuidle:
microblaze idle: Fix compile error
blackfin idle: Fix compile error
|
|
In case of error, the function devm_regulator_get() returns
ERR_PTR() and never returns NULL. The NULL test in the return
value check should be replaced with IS_ERR().
Signed-off-by: Wei Yongjun <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Acked-by: Shawn Guo <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
randconfig build reports the following error which is caused by
CONFIG_PM_OPP being unset.
CC arch/arm/mach-imx/mach-imx6q.o
arch/arm/mach-imx/mach-imx6q.c: In function ‘imx6q_opp_init’:
arch/arm/mach-imx/mach-imx6q.c:248:2: error: implicit declaration of function ‘of_init_opp_table’ [-Werror=implicit-function-declaration]
Fix the error by giving a more correct condition for empty
of_init_opp_table() implementation.
Reported-by: Rob Herring <[email protected]>
Signed-off-by: Shawn Guo <[email protected]>
Acked-by: Nishanth Menon <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
After commit 92ef2a2 (ACPI: Change the ordering of PCI root bridge
driver registrarion), acpi_hest_init() is never called for acpi=off
(acpi_disabled), so hest_disable is not set, but hest_tab is NULL,
which causes apei_hest_parse() to crash when it is called from
aer_acpi_firmware_first().
Fix that by making apei_hest_parse() check if hest_tab is not NULL
in addition to checking hest_disable. Also remove the now useless
acpi_disabled check from apei_hest_parse().
Reported-by: Thomas Gleixner <[email protected]>
Tested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Commit def8203 ("microblaze idle: delete pm_idle") introduced the following
compile error:
arch/microblaze/kernel/process.c: In function 'cpu_idle':
arch/microblaze/kernel/process.c:100: error: 'idle' undeclared (first use in this function)
arch/microblaze/kernel/process.c:100: error: (Each undeclared identifier is reported only once
arch/microblaze/kernel/process.c:100: error: for each function it appears in.)
arch/microblaze/kernel/process.c:106: error: implicit declaration of function 'idle'
This patch fixes it.
Signed-off-by: Lars-Peter Clausen <[email protected]>
Acked-by: Len Brown <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Commit 26bab0c ("blackfin idle: delete pm_idle") introduced the following
compile error:
arch/blackfin/kernel/process.c: In function ‘cpu_idle’:
arch/blackfin/kernel/process.c:83: error: ‘idle’ undeclared (first use in this function)
arch/blackfin/kernel/process.c:83: error: (Each undeclared identifier is reported only once
arch/blackfin/kernel/process.c:83: error: for each function it appears in.)
arch/blackfin/kernel/process.c:88: error: implicit declaration of function ‘idle’
This patch fixes it.
Signed-off-by: Lars-Peter Clausen <[email protected]>
Acked-by: Len Brown <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
|
|
Update the ia64 hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
|
|
Update the ia64 arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.
Signed-off-by: Michel Lespinasse <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
|
|
The code requires the use of the proper per-exception-vector stub
functions (set up as the early_idt_handlers[] array - note the 's') that
make sure to set up the error vector number. This is true regardless of
whether CONFIG_EARLY_PRINTK is set or not.
Why? The stack offset for the comparison of __KERNEL_CS won't be right
otherwise, nor will the new check (from commit 8170e6bed465: "x86,
64bit: Use a #PF handler to materialize early mappings on demand") for
the page fault exception vector.
Acked-by: H. Peter Anvin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
|
|
ext4_has_free_clusters() should tell us whether there is enough free
clusters to allocate, however number of free clusters in the file system
is converted to blocks using EXT4_C2B() which is not only wrong use of
the macro (we should have used EXT4_NUM_B2C) but it's also completely
wrong concept since everything else is in cluster units.
Moreover when calculating number of root clusters we should be using
macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
off by one. However r_blocks_count should always be a multiple of the
cluster ratio so doing a plain bit shift should be enough here. We
avoid using EXT4_B2C() because it's confusing.
As a result of the first problem number of free clusters is much bigger
than it should have been and ext4_has_free_clusters() would return 1 even
if there is really not enough free clusters available.
Fix this by removing the EXT4_C2B() conversion of free clusters and
using bit shift when calculating number of root clusters. This bug
affects number of xfstests tests covering file system ENOSPC situation
handling. With this patch most of the ENOSPC problems with bigalloc file
system disappear, especially the errors caused by delayed allocation not
having enough space when the actual allocation is finally requested.
Signed-off-by: Lukas Czerner <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
|
|
len is 0 means no extent needs to be removed, so return immediately.
Otherwise it could trigger the following BUG_ON() in
ext4_es_remove_extent()
end = lblk + len - 1;
BUG_ON(end < lblk);
This could be reproduced by a simple truncate(1) command by an
unprivileged user
truncate -s $(($((2**32 - 1)) * 4096)) /mnt/ext4/testfile
The same is true for __es_insert_extent().
Patched kernel passed xfstests regression test.
Signed-off-by: Eryu Guan <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Reviewed-by: Zheng Liu <[email protected]>
|
|
In fast open the sender unncessarily reduces the space available
for data in SYN by 12 bytes. This is because in the sender
incorrectly reserves space for TS option twice in tcp_send_syn_data():
tcp_mtu_to_mss() already accounts for TS option space. But it further
reserves MAX_TCP_OPTION_SPACE when computing the payload space.
Signed-off-by: Yuchung Cheng <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
The mem cgroup socket limit is only used if the config option is
enabled. Found with sparse
Signed-off-by: Stephen Hemminger <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Dave Jones reported a lockdep splat occurring in IP defrag code.
commit 6d7b857d541ecd1d (net: use lib/percpu_counter API for
fragmentation mem accounting) added a possible deadlock.
Because percpu_counter_sum_positive() needs to acquire
a lock that can be used from softirq, we need to disable BH
in sum_frag_mem_limit()
Reported-by: Dave Jones <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
commit 7f7d6c282 (net: fec: Ensure that initialization is done prior to
request_irq()) placed fec_ptp_init() into a point that ptp clock was not
available, which causes a division by zero in fec_ptp_start_cyclecounter():
[ 17.895723] Division by zero in kernel.
[ 17.899571] Backtrace:
[ 17.902094] [<80012564>] (dump_backtrace+0x0/0x10c) from [<8056deec>]
(dump_stack+0x18/0x1c)
[ 17.910539] r6:bfba8500 r5:8075c950 r4:bfba8000 r3:bfbd0000
[ 17.916284] [<8056ded4>] (dump_stack+0x0/0x1c) from [<80012688>]
(__div0+0x18/0x20)
[ 17.923968] [<80012670>] (__div0+0x0/0x20) from [<802829c4>] (Ldiv0+0x8/0x10)
[ 17.931140] [<80398534>] (fec_ptp_start_cyclecounter+0x0/0x110) from
[<80394f64>] (fec_restart+0x6c8/0x754)
[ 17.940898] [<8039489c>] (fec_restart+0x0/0x754) from [<803969a0>]
(fec_enet_adjust_link+0xdc/0x108)
[ 17.950046] [<803968c4>] (fec_enet_adjust_link+0x0/0x108) from [<80390bc4>]
(phy_state_machine+0x178/0x534)
...
Fix this by rearraging the code so that fec_ptp_init() is called only after
the clocks have been properly acquired.
Tested on both mx53 and mx6 platforms.
Reported-by: Jim Baxter <[email protected]>
Signed-off-by: Fabio Estevam <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Smatch found a locking bug in netif_set_xps_queue in which we were not
releasing the lock in the case of an allocation failure.
This change corrects that so that we release the xps_map_mutex before
returning -ENOMEM in the case of an allocation failure.
Signed-off-by: Alexander Duyck <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Now we handle icmp errors in each transport protocol's err_handler,
for icmp protocols, that is ping_err. Since this handler only care
of those icmp errors triggered by echo request, errors triggered
by echo reply(which sent by kernel) are sliently ignored.
So wrap ping_err() with icmp_err() to deal with those icmp errors.
Signed-off-by: Li Wei <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Add the missing unlock before return from function brcmf_notify_vif_event()
in the error handling case.
Signed-off-by: Wei Yongjun <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Acked-by: Arend van Spriel <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
|
|
Unload sequence for mwifiex PCIE driver is as follows:
1. Invoking cleanup module from kernel results into
pci_unregister_driver
2. Kernel invokes PCIE remove() handler which disconnects all
interfaces.
3. One step during disconnect is to clean PCIE TX rings.
During this we read txbd_rdptr from FW.
While loading driver next time, we see pci_enable_device()
results into system freeze. This may have happened because we
accessed PCI device after unregistering from bus driver.
Removing this ioread() operation resolves this bug.
Signed-off-by: Avinash Patil <[email protected]>
Signed-off-by: Bing Zhao <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
|
|
If the system suspends with mwifiex wifi powered on, and is then woken
by an ICMP ping packet, the ping response is discarded by the kernel
because the kernel incorrectly thinks there is no carrier.
I can't see any valid reason to want to report loss of carrier here,
so remove the offending code.
Fixes http://dev.laptop.org/ticket/12554
Signed-off-by: Daniel Drake <[email protected]>
Acked-by: Bing Zhao <[email protected]>
Signed-off-by: John W. Linville <[email protected]>
|
|
Originally submitted by Clark Williams as part of a cleanup,
but happens also to fix an ia64 build problem:
arch/ia64/kernel/init_task.c:38: error: 'RR_TIMESLICE' undeclared here (not in a function)
Signed-off-by: Clark Williams <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
|