aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2018-04-12btrfs: replace GPL boilerplate by SPDX -- headersDavid Sterba35-475/+133
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Unify the include protection macros to match the file names. Signed-off-by: David Sterba <[email protected]>
2018-04-12powerpc/mm/radix: Fix checkstops caused by invalid tlbielMichael Ellerman1-3/+2
In tlbiel_radix_set_isa300() we use the PPC_TLBIEL() macro to construct tlbiel instructions. The instruction takes 5 fields, two of which are registers, and the others are constants. But because it's constructed with inline asm the compiler doesn't know that. We got the constraint wrong on the 'r' field, using "r" tells the compiler to put the value in a register. The value we then get in the macro is the *register number*, not the value of the field. That means when we mask the register number with 0x1 we get 0 or 1 depending on which register the compiler happens to put the constant in, eg: li r10,1 tlbiel r8,r9,2,0,0 li r7,1 tlbiel r10,r6,0,0,1 If we're unlucky we might generate an invalid instruction form, for example RIC=0, PRS=1 and R=0, tlbiel r8,r7,0,1,0, this has been observed to cause machine checks: Oops: Machine check, sig: 7 [#1] CPU: 24 PID: 0 Comm: swapper NIP: 00000000000385f4 LR: 000000000100ed00 CTR: 000000000000007f REGS: c00000000110bb40 TRAP: 0200 MSR: 9000000000201003 <SF,HV,ME,RI,LE> CR: 48002222 XER: 20040000 CFAR: 00000000000385d0 DAR: 0000000000001c00 DSISR: 00000200 SOFTE: 1 If the machine check happens early in boot while we have MSR_ME=0 it will escalate into a checkstop and kill the box entirely. To fix it we could change the inline asm constraint to "i" which tells the compiler the value is a constant. But a better fix is to just pass a literal 1 into the macro, which bypasses any problems with inline asm constraints. Fixes: d4748276ae14 ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9") Cc: [email protected] # v4.16+ Signed-off-by: Michael Ellerman <[email protected]> Reviewed-by: Nicholas Piggin <[email protected]>
2018-04-12Btrfs: fix loss of prealloc extents past i_size after fsync log replayFilipe Manana1-5/+58
Currently if we allocate extents beyond an inode's i_size (through the fallocate system call) and then fsync the file, we log the extents but after a power failure we replay them and then immediately drop them. This behaviour happens since about 2009, commit c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log"), because it marks the inode as an orphan instead of dropping any extents beyond i_size before replaying logged extents, so after the log replay, and while the mount operation is still ongoing, we find the inode marked as an orphan and then perform a truncation (drop extents beyond the inode's i_size). Because the processing of orphan inodes is still done right after replaying the log and before the mount operation finishes, the intention of that commit does not make any sense (at least as of today). However reverting that behaviour is not enough, because we can not simply discard all extents beyond i_size and then replay logged extents, because we risk dropping extents beyond i_size created in past transactions, for example: add prealloc extent beyond i_size fsync - clears the flag BTRFS_INODE_NEEDS_FULL_SYNC from the inode transaction commit add another prealloc extent beyond i_size fsync - triggers the fast fsync path power failure In that scenario, we would drop the first extent and then replay the second one. To fix this just make sure that all prealloc extents beyond i_size are logged, and if we find too many (which is far from a common case), fallback to a full transaction commit (like we do when logging regular extents in the fast fsync path). Trivial reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xab 0 256K" /mnt/foo $ sync $ xfs_io -c "falloc -k 256K 1M" /mnt/foo $ xfs_io -c "fsync" /mnt/foo <power failure> # mount to replay log $ mount /dev/sdb /mnt # at this point the file only has one extent, at offset 0, size 256K A test case for fstests follows soon, covering multiple scenarios that involve adding prealloc extents with previous shrinking truncates and without such truncates. Fixes: c71bf099abdd ("Btrfs: Avoid orphan inodes cleanup while replaying log") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-04-12Btrfs: clean up resources during umount after trans is abortedLiu Bo1-1/+2
Currently if some fatal errors occur, like all IO get -EIO, resources would be cleaned up when a) transaction is being committed or b) BTRFS_FS_STATE_ERROR is set However, in some rare cases, resources may be left alone after transaction gets aborted and umount may run into some ASSERT(), e.g. ASSERT(list_empty(&block_group->dirty_list)); For case a), in btrfs_commit_transaciton(), there're several places at the beginning where we just call btrfs_end_transaction() without cleaning up resources. For case b), it is possible that the trans handle doesn't have any dirty stuff, then only trans hanlde is marked as aborted while BTRFS_FS_STATE_ERROR is not set, so resources remain in memory. This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that all resources won't stay in memory after umount. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>
2018-04-12ovl: update documentation w.r.t "xino" featureAmir Goldstein1-6/+33
Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: add support for "xino" mount and config optionsAmir Goldstein3-3/+88
With mount option "xino=on", mounter declares that there are enough free high bits in underlying fs to hold the layer fsid. If overlayfs does encounter underlying inodes using the high xino bits reserved for layer fsid, a warning will be emitted and the original inode number will be used. The mount option name "xino" goes after a similar meaning mount option of aufs, but in overlayfs case, the mapping is stateless. An example for a use case of "xino=on" is when upper/lower is on an xfs filesystem. xfs uses 64bit inode numbers, but it currently never uses the upper 8bit for inode numbers exposed via stat(2) and that is not likely to change in the future without user opting-in for a new xfs feature. The actual number of unused upper bit is much larger and determined by the xfs filesystem geometry (64 - agno_log - agblklog - inopblog). That means that for all practical purpose, there are enough unused bits in xfs inode numbers for more than OVL_MAX_STACK unique fsid's. Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode numbers are allocated sequentially since boot, so they will practially never use the high inode number bits. For compatibility with applications that expect 32bit inodes, the feature can be disabled with "xino=off". The option "xino=auto" automatically detects underlying filesystem that use 32bit inodes and enables the feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of the same name, determine if the default mode for overlayfs mount is "xino=auto" or "xino=off". Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: consistent d_ino for non-samefs with xinoAmir Goldstein1-6/+39
When overlay layers are not all on the same fs, but all inode numbers of underlying fs do not use the high 'xino' bits, overlay st_ino values are constant and persistent. In that case, relax non-samefs constraint for consistent d_ino and always iterate non-merge dir using ovl_fill_real() actor so we can remap lower inode numbers to unique lower fs range. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: consistent i_ino for non-samefs with xinoAmir Goldstein4-14/+21
When overlay layers are not all on the same fs, but all inode numbers of underlying fs do not use the high 'xino' bits, overlay st_ino values are constant and persistent. In that case, set i_ino value to the same value as st_ino for nfsd readdirplus validator. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: constant st_ino for non-samefs with xinoAmir Goldstein5-10/+75
On 64bit systems, when overlay layers are not all on the same fs, but all inode numbers of underlying fs are not using the high bits, use the high bits to partition the overlay st_ino address space. The high bits hold the fsid (upper fsid is 0). This way overlay inode numbers are unique and all inodes use overlay st_dev. Inode numbers are also persistent for a given layer configuration. Currently, our only indication for available high ino bits is from a filesystem that supports file handles and uses the default encode_fh() operation, which encodes a 32bit inode number. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: allocate anon bdev per unique lower fsAmir Goldstein4-28/+72
Instead of allocating an anonymous bdev per lower layer, allocate one anonymous bdev per every unique lower fs that is different than upper fs. Every unique lower fs is assigned an fsid > 0 and the number of unique lower fs are stored in ofs->numlowerfs. The assigned fsid is stored in the lower layer struct and will be used also for inode number multiplexing. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: factor out ovl_map_dev_ino() helperAmir Goldstein3-39/+57
A helper for ovl_getattr() to map the values of st_dev and st_ino according to constant st_ino rules. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: cleanup ovl_update_time()Miklos Szeredi1-17/+11
No need to mess with an alias, the upperdentry can be retrieved directly from the overlay inode. Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: add WARN_ON() for non-dir redirect casesMiklos Szeredi1-0/+11
Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: cleanup setting OVL_INDEXVivek Goyal3-5/+3
Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: set d->is_dir and d->opaque for last path elementVivek Goyal1-2/+6
Certain properties in ovl_lookup_data should be set only for the last element of the path. IOW, if we are calling ovl_lookup_single() for an absolute redirect, then d->is_dir and d->opaque do not make much sense for intermediate path elements. Instead set them only if dentry being lookup is last path element. As of now we do not seem to be making use of d->opaque if it is set for a path/dentry in lower. But just define the semantics so that future code can make use of this assumption. Signed-off-by: Vivek Goyal <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: Do not check for redirect if this is last layerVivek Goyal1-1/+4
If we are looking in last layer, then there should not be any need to process redirect. redirect information is used only for lookup in next lower layer and there is no more lower layer to look into. So no need to process redirects. IOW, ignore redirects on lowest layer. Signed-off-by: Vivek Goyal <[email protected]> Reviewed-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: lookup in inode cache first when decoding lower file handleAmir Goldstein1-25/+33
When decoding a lower file handle, we need to check if lower file was copied up and indexed and if it has a whiteout index, we need to check if this is an unlinked but open non-dir before returning -ESTALE. To find out if this is an unlinked but open non-dir we need to lookup an overlay inode in inode cache by lower inode and that requires decoding the lower file handle before looking in inode cache. Before this change, if the lower inode turned out to be a directory, we may have paid an expensive cost to reconnect that lower directory for nothing. After this change, we start by decoding a disconnected lower dentry and using the lower inode for looking up an overlay inode in inode cache. If we find overlay inode and dentry in cache, we avoid the index lookup overhead. If we don't find an overlay inode and dentry in cache, then we only need to decode a connected lower dentry in case the lower dentry is a non-indexed directory. The xfstests group overlay/exportfs tests decoding overlayfs file handles after drop_caches with different states of the file at encode and decode time. Overall the tests in the group call ovl_lower_fh_to_d() 89 times to decode a lower file handle. Before this change, the tests called ovl_get_index_fh() 75 times and reconnect_one() 61 times. After this change, the tests call ovl_get_index_fh() 70 times and reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided are cases where a non-upper directory file handle was encoded, then the directory removed and then file handle was decoded. To demonstrate the affect on decoding file handles with hot inode/dentry cache, the drop_caches call in the tests was disabled. Without drop_caches, there are no reconnect_one() calls at all before or after the change. Before the change, there are 75 calls to ovl_get_index_fh(), exactly as the case with drop_caches. After the change, there are only 10 calls to ovl_get_index_fh(). Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: do not try to reconnect a disconnected origin dentryAmir Goldstein4-11/+23
On lookup of non directory, we try to decode the origin file handle stored in upper inode. The origin file handle is supposed to be decoded to a disconnected non-dir dentry, which is fine, because we only need the lower inode of a copy up origin. However, if the origin file handle somehow turns out to be a directory we pay the expensive cost of reconnecting the directory dentry, only to get a mismatch file type and drop the dentry. Optimize this case by explicitly opting out of reconnecting the dentry. Opting-out of reconnect is done by passing a NULL acceptable callback to exportfs_decode_fh(). While the case described above is a strange corner case that does not really need to be optimized, the API added for this optimization will be used by a following patch to optimize a more common case of decoding an overlayfs file handle. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: disambiguate ovl_encode_fh()Amir Goldstein4-16/+16
Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the exportfs function ovl_encode_inode_fh() and change the latter to ovl_encode_fh() to match the exportfs method name. Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency. Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: set lower layer st_dev only if setting lower st_inoAmir Goldstein1-5/+2
For broken hardlinks, we do not return lower st_ino, so we should also not return lower pseudo st_dev. Fixes: a0c5ad307ac0 ("ovl: relax same fs constraint for constant st_ino") Cc: <[email protected]> #v4.15 Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: fix lookup with middle layer opaque dir and absolute path redirectsAmir Goldstein1-0/+9
As of now if we encounter an opaque dir while looking for a dentry, we set d->last=true. This means that there is no need to look further in any of the lower layers. This works fine as long as there are no redirets or relative redircts. But what if there is an absolute redirect on the children dentry of opaque directory. We still need to continue to look into next lower layer. This patch fixes it. Here is an example to demonstrate the issue. Say you have following setup. upper: /redirect (redirect=/a/b/c) lower1: /a/[b]/c ([b] is opaque) (c has absolute redirect=/a/b/d/) lower0: /a/b/d/foo Now "redirect" dir should merge with lower1:/a/b/c/ and lower0:/a/b/d. Note, despite the fact lower1:/a/[b] is opaque, we need to continue to look into lower0 because children c has an absolute redirect. Following is a reproducer. Watch me make foo disappear: $ mkdir lower middle upper work work2 merged $ mkdir lower/origin $ touch lower/origin/foo $ mount -t overlay none merged/ \ -olowerdir=lower,upperdir=middle,workdir=work2 $ mkdir merged/pure $ mv merged/origin merged/pure/redirect $ umount merged $ mount -t overlay none merged/ \ -olowerdir=middle:lower,upperdir=upper,workdir=work $ mv merged/pure/redirect merged/redirect Now you see foo inside a twice redirected merged dir: $ ls merged/redirect foo $ umount merged $ mount -t overlay none merged/ \ -olowerdir=middle:lower,upperdir=upper,workdir=work After mount cycle you don't see foo inside the same dir: $ ls merged/redirect During middle layer lookup, the opaqueness of middle/pure is left in the lookup state and then middle/pure/redirect is wrongly treated as opaque. Fixes: 02b69b284cd7 ("ovl: lookup redirects") Cc: <[email protected]> #v4.10 Signed-off-by: Amir Goldstein <[email protected]> Signed-off-by: Vivek Goyal <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12ovl: Set d->last properly during lookupVivek Goyal1-2/+6
d->last signifies that this is the last layer we are looking into and there is no more. And that means this allows for some optimzation opportunities during lookup. For example, in ovl_lookup_single() we don't have to check for opaque xattr of a directory is this is the last layer we are looking into (d->last = true). But knowing for sure whether we are looking into last layer can be very tricky. If redirects are not enabled, then we can look at poe->numlower and figure out if the lookup we are about to is last layer or not. But if redircts are enabled then it is possible poe->numlower suggests that we are looking in last layer, but there is an absolute redirect present in found element and that redirects us to a layer in root and that means lookup will continue in lower layers further. For example, consider following. /upperdir/pure (opaque=y) /upperdir/pure/foo (opaque=y,redirect=/bar) /lowerdir/bar In this case pure is "pure upper". When we look for "foo", that time poe->numlower=0. But that alone does not mean that we will not search for a merge candidate in /lowerdir. Absolute redirect changes that. IOW, d->last should not be set just based on poe->numlower if redirects are enabled. That can lead to setting d->last while it should not have and that means we will not check for opaque xattr while we should have. So do this. - If redirects are not enabled, then continue to rely on poe->numlower information to determine if it is last layer or not. - If redirects are enabled, then set d->last = true only if this is the last layer in root ovl_entry (roe). Suggested-by: Amir Goldstein <[email protected]> Reviewed-by: Amir Goldstein <[email protected]> Signed-off-by: Vivek Goyal <[email protected]> Signed-off-by: Miklos Szeredi <[email protected]> Fixes: 02b69b284cd7 ("ovl: lookup redirects") Cc: <[email protected]> #v4.10
2018-04-12ovl: set i_ino to the value of st_ino for NFS exportAmir Goldstein2-5/+24
Eddie Horng reported that readdir of an overlayfs directory that was exported via NFSv3 returns entries with d_type set to DT_UNKNOWN. The reason is that while preparing the response for readdirplus, nfsd checks inside encode_entryplus_baggage() that a child dentry's inode number matches the value of d_ino returns by overlayfs readdir iterator. Because the overlayfs inodes use arbitrary inode numbers that are not correlated with the values of st_ino/d_ino, NFSv3 falls back to not encoding d_type. Although this is an allowed behavior, we can fix it for the case of all overlayfs layers on the same underlying filesystem. When NFS export is enabled and d_ino is consistent with st_ino (samefs), set the same value also to i_ino in ovl_fill_inode() for all overlayfs inodes, nfsd readdirplus sanity checks will pass. ovl_fill_inode() may be called from ovl_new_inode(), before real inode was created with ino arg 0. In that case, i_ino will be updated to real upper inode i_ino on ovl_inode_init() or ovl_inode_update(). Reported-by: Eddie Horng <[email protected]> Tested-by: Eddie Horng <[email protected]> Signed-off-by: Amir Goldstein <[email protected]> Fixes: 8383f1748829 ("ovl: wire up NFS export operations") Cc: <[email protected]> #v4.16 Signed-off-by: Miklos Szeredi <[email protected]>
2018-04-12perf/core: Need CAP_SYS_ADMIN to create k/uprobe with perf_event_open()Song Liu1-0/+8
Non-root user cannot create kprobe or uprobe through the text-based interface (kprobe_events, uprobe_events),so they should not be able to create probes via perf_event_open() either. Reported-by: Vince Weaver <[email protected]> Signed-off-by: Song Liu <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Fixes: 33ea4b24277b ("perf/core: Implement the 'perf_uprobe' PMU") Fixes: e12f03d7031a ("perf/core: Implement the 'perf_kprobe' PMU") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12Merge branch 'WIP.x86/asm' into x86/urgent, because the topic is readyIngo Molnar5001-520290/+188666
Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/pgtable: Don't set huge PUD/PMD on non-leaf entriesJoerg Roedel1-0/+9
The pmd_set_huge() and pud_set_huge() functions are used from the generic ioremap() code to establish large mappings where this is possible. But the generic ioremap() code does not check whether the PMD/PUD entries are already populated with a non-leaf entry, so that any page-table pages these entries point to will be lost. Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a BUG_ON() in vmalloc_sync_one() when PMD entries are synced from swapper_pg_dir to the current page-table. This happens because the PMD entry from swapper_pg_dir was promoted to a huge-page entry while the current PGD still contains the non-leaf entry. Because both entries are present and point to a different page, the BUG_ON() triggers. This was actually triggered with pti-x32 enabled in a KVM virtual machine by the graphics driver. A real and better fix for that would be to improve the page-table handling in the generic ioremap() code. But that is out-of-scope for this patch-set and left for later work. Reported-by: David H. Gutteridge <[email protected]> Signed-off-by: Joerg Roedel <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Boris Ostrovsky <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Laight <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Eduardo Valentin <[email protected]> Cc: Greg KH <[email protected]> Cc: Jiri Kosina <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Pavel Machek <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Waiman Long <[email protected]> Cc: Will Deacon <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/pti: Leave kernel text global for !PCIDDave Hansen3-4/+82
Global pages are bad for hardening because they potentially let an exploit read the kernel image via a Meltdown-style attack which makes it easier to find gadgets. But, global pages are good for performance because they reduce TLB misses when making user/kernel transitions, especially when PCIDs are not available, such as on older hardware, or where a hypervisor has disabled them for some reason. This patch implements a basic, sane policy: If you have PCIDs, you only map a minimal amount of kernel text global. If you do not have PCIDs, you map all kernel text global. This policy effectively makes PCIDs something that not only adds performance but a little bit of hardening as well. I ran a simple "lseek" microbenchmark[1] to test the benefit on a modern Atom microserver. Most of the benefit comes from applying the series before this patch ("entry only"), but there is still a signifiant benefit from this patch. No Global Lines (baseline ): 6077741 lseeks/sec 88 Global Lines (entry only): 7528609 lseeks/sec (+23.9%) 94 Global Lines (this patch): 8433111 lseeks/sec (+38.8%) [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c Signed-off-by: Dave Hansen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel imageDave Hansen3-10/+35
Summary: In current kernels, with PTI enabled, no pages are marked Global. This potentially increases TLB misses. But, the mechanism by which the Global bit is set and cleared is rather haphazard. This patch makes the process more explicit. In the end, it leaves us with Global entries in the page tables for the areas truly shared by userspace and kernel and increases TLB hit rates. The place this patch really shines in on systems without PCIDs. In this case, we are using an lseek microbenchmark[1] to see how a reasonably non-trivial syscall behaves. Higher is better: No Global pages (baseline): 6077741 lseeks/sec 88 Global Pages (this set): 7528609 lseeks/sec (+23.9%) On a modern Skylake desktop with PCIDs, the benefits are tangible, but not huge for a kernel compile (lower is better): No Global pages (baseline): 186.951 seconds time elapsed ( +- 0.35% ) 28 Global pages (this set): 185.756 seconds time elapsed ( +- 0.09% ) -1.195 seconds (-0.64%) I also re-checked everything using the lseek1 test[1]: No Global pages (baseline): 15783951 lseeks/sec 28 Global pages (this set): 16054688 lseeks/sec +270737 lseeks/sec (+1.71%) The effect is more visible, but still modest. Details: The kernel page tables are inherited from head_64.S which rudely marks them as _PAGE_GLOBAL. For PTI, we have been relying on the grace of $DEITY and some insane behavior in pageattr.c to clear _PAGE_GLOBAL. This patch tries to do better. First, stop filtering out "unsupported" bits from being cleared in the pageattr code. It's fine to filter out *setting* these bits but it is insane to keep us from clearing them. Then, *explicitly* go clear _PAGE_GLOBAL from the kernel identity map. Do not rely on pageattr to do it magically. After this patch, we can see that "GLB" shows up in each copy of the page tables, that we have the same number of global entries in each and that they are the *same* entries. /sys/kernel/debug/page_tables/current_kernel:11 /sys/kernel/debug/page_tables/current_user:11 /sys/kernel/debug/page_tables/kernel:11 9caae8ad6a1fb53aca2407ec037f612d current_kernel.GLB 9caae8ad6a1fb53aca2407ec037f612d current_user.GLB 9caae8ad6a1fb53aca2407ec037f612d kernel.GLB A quick visual audit also shows that all the entries make sense. 0xfffffe0000000000 is the cpu_entry_area and 0xffffffff81c00000 is the entry/exit text: 0xfffffe0000000000-0xfffffe0000002000 8K ro GLB NX pte 0xfffffe0000002000-0xfffffe0000003000 4K RW GLB NX pte 0xfffffe0000003000-0xfffffe0000006000 12K ro GLB NX pte 0xfffffe0000006000-0xfffffe0000007000 4K ro GLB x pte 0xfffffe0000007000-0xfffffe000000d000 24K RW GLB NX pte 0xfffffe000002d000-0xfffffe000002e000 4K ro GLB NX pte 0xfffffe000002e000-0xfffffe000002f000 4K RW GLB NX pte 0xfffffe000002f000-0xfffffe0000032000 12K ro GLB NX pte 0xfffffe0000032000-0xfffffe0000033000 4K ro GLB x pte 0xfffffe0000033000-0xfffffe0000039000 24K RW GLB NX pte 0xffffffff81c00000-0xffffffff81e00000 2M ro PSE GLB x pmd [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c Signed-off-by: Dave Hansen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/pti: Enable global pages for shared areasDave Hansen2-2/+35
The entry/exit text and cpu_entry_area are mapped into userspace and the kernel. But, they are not _PAGE_GLOBAL. This creates unnecessary TLB misses. Add the _PAGE_GLOBAL flag for these areas. Signed-off-by: Dave Hansen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/mm: Do not forbid _PAGE_RW before init for __ro_after_initDave Hansen1-2/+4
__ro_after_init data gets stuck in the .rodata section. That's normally fine because the kernel itself manages the R/W properties. But, if we run __change_page_attr() on an area which is __ro_after_init, the .rodata checks will trigger and force the area to be immediately read-only, even if it is early-ish in boot. This caused problems when trying to clear the _PAGE_GLOBAL bit for these area in the PTI code: it cleared _PAGE_GLOBAL like I asked, but also took it up on itself to clear _PAGE_RW. The kernel then oopses the next time it wrote to a __ro_after_init data structure. To fix this, add the kernel_set_to_readonly check, just like we have for kernel text, just a few lines below in this function. Signed-off-by: Dave Hansen <[email protected]> Acked-by: Kees Cook <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/mm: Comment _PAGE_GLOBAL mysteryDave Hansen1-1/+10
I was mystified as to where the _PAGE_GLOBAL in the kernel page tables for kernel text came from. I audited all the places I could find, but I missed one: head_64.S. The page tables that we create in here live for a long time, and they also have _PAGE_GLOBAL set, despite whether the processor supports it or not. It's harmless, and we got *lucky* that the pageattr code accidentally clears it when we wipe it out of __supported_pte_mask and then later try to mark kernel text read-only. Comment some of these properties to make it easier to find and understand in the future. Signed-off-by: Dave Hansen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/mm: Remove extra filtering in pageattr codeDave Hansen1-4/+2
The pageattr code has a mode where it can set or clear PTE bits in existing PTEs, so the page protections of the *new* PTEs come from one of two places: 1. The set/clear masks: cpa->mask_clr / cpa->mask_set 2. The existing PTE We filter ->mask_set/clr for supported PTE bits at entry to __change_page_attr() so we never need to filter them again. The only other place permissions can come from is an existing PTE and those already presumably have good bits. We do not need to filter them again. Signed-off-by: Dave Hansen <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-12x86/mm: Do not auto-massage page protectionsDave Hansen10-12/+75
A PTE is constructed from a physical address and a pgprotval_t. __PAGE_KERNEL, for instance, is a pgprot_t and must be converted into a pgprotval_t before it can be used to create a PTE. This is done implicitly within functions like pfn_pte() by massage_pgprot(). However, this makes it very challenging to set bits (and keep them set) if your bit is being filtered out by massage_pgprot(). This moves the bit filtering out of pfn_pte() and friends. For users of PAGE_KERNEL*, filtering will be done automatically inside those macros but for users of __PAGE_KERNEL*, they need to do their own filtering now. Note that we also just move pfn_pte/pmd/pud() over to check_pgprot() instead of massage_pgprot(). This way, we still *look* for unsupported bits and properly warn about them if we find them. This might happen if an unfiltered __PAGE_KERNEL* value was passed in, for instance. - printk format warning fix from: Arnd Bergmann <[email protected]> - boot crash fix from: Tom Lendacky <[email protected]> - crash bisected by: Mike Galbraith <[email protected]> Signed-off-by: Dave Hansen <[email protected]> Reported-and-fixed-by: Arnd Bergmann <[email protected]> Fixed-by: Tom Lendacky <[email protected]> Bisected-by: Mike Galbraith <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arjan van de Ven <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dan Williams <[email protected]> Cc: David Woodhouse <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Josh Poimboeuf <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kees Cook <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
2018-04-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2-1/+9
Pull virtio update from Michael Tsirkin: "This adds reporting hugepage stats to virtio-balloon" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_balloon: export hugetlb page allocation counts
2018-04-11Merge tag 'iommu-updates-v4.17' of ↵Linus Torvalds25-853/+1008
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU updates from Joerg Roedel: - OF_IOMMU support for the Rockchip iommu driver so that it can use generic DT bindings - rework of locking in the AMD IOMMU interrupt remapping code to make it work better in RT kernels - support for improved iotlb flushing in the AMD IOMMU driver - support for 52-bit physical and virtual addressing in the ARM-SMMU - various other small fixes and cleanups * tag 'iommu-updates-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (53 commits) iommu/io-pgtable-arm: Avoid warning with 32-bit phys_addr_t iommu/rockchip: Support sharing IOMMU between masters iommu/rockchip: Add runtime PM support iommu/rockchip: Fix error handling in init iommu/rockchip: Use OF_IOMMU to attach devices automatically iommu/rockchip: Use IOMMU device for dma mapping operations dt-bindings: iommu/rockchip: Add clock property iommu/rockchip: Control clocks needed to access the IOMMU iommu/rockchip: Fix TLB flush of secondary IOMMUs iommu/rockchip: Use iopoll helpers to wait for hardware iommu/rockchip: Fix error handling in attach iommu/rockchip: Request irqs in rk_iommu_probe() iommu/rockchip: Fix error handling in probe iommu/rockchip: Prohibit unbind and remove iommu/amd: Return proper error code in irq_remapping_alloc() iommu/amd: Make amd_iommu_devtable_lock a spin_lock iommu/amd: Drop the lock while allocating new irq remap table iommu/amd: Factor out setting the remap table for a devid iommu/amd: Use `table' instead `irt' as variable name in amd_iommu_update_ga() iommu/amd: Remove the special case from alloc_irq_table() ...
2018-04-11Merge tag 'pm-4.17-rc1-2' of ↵Linus Torvalds25-161/+438
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull more power management updates from Rafael Wysocki: "These include one big-ticket item which is the rework of the idle loop in order to prevent CPUs from spending too much time in shallow idle states. It reduces idle power on some systems by 10% or more and may improve performance of workloads in which the idle loop overhead matters. This has been in the works for several weeks and it has been tested and reviewed quite thoroughly. Also included are changes that finalize the cpufreq cleanup moving frequency table validation from drivers to the core, a few fixes and cleanups of cpufreq drivers, a cpuidle documentation update and a PM QoS core update to mark the expected switch fall-throughs in it. Specifics: - Rework the idle loop in order to prevent CPUs from spending too much time in shallow idle states by making it stop the scheduler tick before putting the CPU into an idle state only if the idle duration predicted by the idle governor is long enough. That required the code to be reordered to invoke the idle governor before stopping the tick, among other things (Rafael Wysocki, Frederic Weisbecker, Arnd Bergmann). - Add the missing description of the residency sysfs attribute to the cpuidle documentation (Prashanth Prakash). - Finalize the cpufreq cleanup moving frequency table validation from drivers to the core (Viresh Kumar). - Fix a clock leak regression in the armada-37xx cpufreq driver (Gregory Clement). - Fix the initialization of the CPU performance data structures for shared policies in the CPPC cpufreq driver (Shunyong Yang). - Clean up the ti-cpufreq, intel_pstate and CPPC cpufreq drivers a bit (Viresh Kumar, Rafael Wysocki). - Mark the expected switch fall-throughs in the PM QoS core (Gustavo Silva)" * tag 'pm-4.17-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits) tick-sched: avoid a maybe-uninitialized warning cpufreq: Drop cpufreq_table_validate_and_show() cpufreq: SCMI: Don't validate the frequency table twice cpufreq: CPPC: Initialize shared perf capabilities of CPUs cpufreq: armada-37xx: Fix clock leak cpufreq: CPPC: Don't set transition_latency cpufreq: ti-cpufreq: Use builtin_platform_driver() cpufreq: intel_pstate: Do not include debugfs.h PM / QoS: mark expected switch fall-throughs cpuidle: Add definition of residency to sysfs documentation time: hrtimer: Use timerqueue_iterate_next() to get to the next timer nohz: Avoid duplication of code related to got_idle_tick nohz: Gather tick_sched booleans under a common flag field cpuidle: menu: Avoid selecting shallow states with stopped tick cpuidle: menu: Refine idle state selection for running tick sched: idle: Select idle state before stopping the tick time: hrtimer: Introduce hrtimer_next_event_without() time: tick-sched: Split tick_nohz_stop_sched_tick() cpuidle: Return nohz hint from cpuidle_select() jiffies: Introduce USER_TICK_USEC and redefine TICK_USEC ...
2018-04-11Merge tag 'ktest-v4.17' of ↵Linus Torvalds3-272/+1091
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest Pull ktest updates from Steven Rostedt: "These commits have either been sitting in my INBOX or have been in my local tree for some time. I need to push them upstream: - Separate out config-bisect.pl from ktest.pl. This allows users to do config bisects without full ktest setup. - Email on status change. Allow the user to be emailed on test start, finish, failure, etc. - Other small fixes and enhancements" * tag 'ktest-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-ktest: (24 commits) ktest: Take submenu into account for grub2 menus ktest.pl: Add MAIL_COMMAND option to define how to send email ktest.pl: Use run_command to execute sending mail ktest.pl: Allow dodie be recursive ktest.pl: Kill test if mailer is not supported ktest.pl: Add MAIL_PATH option to define where to find the mailer ktest.pl: No need to print no mailer is specified when mailto is not Ktest: add email options to sample.config Ktest: Use dodie for critical falures Ktest: Add SigInt handling Ktest: Add email support ktest.pl: Detect if a config-bisect was interrupted ktest.pl: Make finding config-bisect.pl dynamic ktest.pl: Have ktest.pl pass -r to config-bisect.pl to reset bisect ktest.pl: Use diffconfig if available for failed config bisects ktest.pl: Allow for the config-bisect.pl output to display to console ktest: Use config-bisect.pl in ktest.pl ktest: Add standalone config-bisect.pl program ktest: Set do_not_reboot=y for CONFIG_BISECT_TYPE=build ktest: Set buildonly=1 for CONFIG_BISECT_TYPE=build ...
2018-04-11Merge tag 'tags/upstream-4.17-rc1' of git://git.infradead.org/linux-ubifsLinus Torvalds7-11/+24
Pull UBI and UBIFS updates from Richard Weinberger: "Minor bug fixes and improvements" * tag 'tags/upstream-4.17-rc1' of git://git.infradead.org/linux-ubifs: ubi: Reject MLC NAND ubifs: Remove useless parameter of lpt_heap_replace ubifs: Constify struct ubifs_lprops in scan_for_leb_for_idx ubifs: remove unnecessary assignment ubi: Fix error for write access ubi: fastmap: Don't flush fastmap work on detach ubifs: Check ubifs_wbuf_sync() return code
2018-04-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/rw/umlLinus Torvalds23-325/+3395
Pull UML updates from Richard Weinberger: - a new and faster epoll based IRQ controller and NIC driver - misc fixes and janitorial updates * git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: Fix vector raw inintialization logic Migrate vector timers to new timer API um: Compile with modern headers um: vector: Fix an error handling path in 'vector_parse()' um: vector: Fix a memory allocation check um: vector: fix missing unlock on error in vector_net_open() um: Add missing EXPORT for free_irq_by_fd() High Performance UML Vector Network Driver Epoll based IRQ controller um: Use POSIX ucontext_t instead of struct ucontext um: time: Use timespec64 for persistent clock um: Restore symbol versions for __memcpy and memcpy
2018-04-11Merge tag 'armsoc-fixes' of ↵Linus Torvalds2-3/+10
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC fixes from Arnd Bergmann: "Here is a very small set of fixes for inclusion in linux-4.17-rc1: Two changes for the maintainer file, and one more fix for the newly added npcm platform, to enable the level 2 cache controller" * tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: MAINTAINERS: Update ASPEED entry with details MAINTAINERS: Migrate oxnas list to groups.io arm: npcm: enable L2 cache in NPCM7xx architecture
2018-04-11Merge tag 'nios2-v4.17-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2 Pull nios2 update from Ley Foon Tan: "Use read_persistent_clock64() instead of read_persistent_clock()" * tag 'nios2-v4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2: nios2: Use read_persistent_clock64() instead of read_persistent_clock()
2018-04-11CIFS: add ONCE flag for cifs_dbg typeAurelien Aptel2-29/+22
* Since cifs_vfs_error was just using pr_debug_ratelimited like the rest of cifs_dbg, move it there too * Add a ONCE type flag to call the pr_xxx_once() debug function instead of the ratelimited ones. To convert existing printk_once() calls to this we can run: perl -i -pE \ 's/printk_once\s*\(([^" \n]+)(.*)/cifs_dbg(VFS|ONCE,$2/g' \ fs/cifs/*.c Signed-off-by: Aurelien Aptel <[email protected]> Signed-off-by: Steve French <[email protected]>
2018-04-11cifs: Use ULL suffix for 64-bit constantGeert Uytterhoeven1-1/+1
On 32-bit (e.g. with m68k-linux-gnu-gcc-4.1): fs/cifs/inode.c: In function ‘simple_hashstr’: fs/cifs/inode.c:713: warning: integer constant is too large for ‘long’ type Fixes: 7ea884c77e5c97f1 ("smb3: Fix root directory when server returns inode number of zero") Signed-off-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Steve French <[email protected]> Reviewed-by: Aurelien Aptel <[email protected]>
2018-04-11SMB3: Log at least once if tree connect fails during reconnectSteve French2-2/+7
Adding an extra debug message to show if a tree connect failure during reconnect (and made it a log once so it doesn't spam the logs). Saw a case recently where tree connect repeatedly returned access denied on reconnect and it wasn't as easy to spot as it should have been. Signed-off-by: Steve French <[email protected]> Reviewed-by: Aurelien Aptel <[email protected]>
2018-04-11cifs: smb2pdu: Fix potential NULL pointer dereferenceGustavo A. R. Silva1-1/+3
tcon->ses is being dereferenced before it is null checked, hence there is a potential null pointer dereference. Fix this by moving the pointer dereference after tcon->ses has been properly null checked. Addresses-Coverity-ID: 1467426 ("Dereference before null check") Fixes: 93012bf98416 ("cifs: add server->vals->header_preamble_size") Signed-off-by: Gustavo A. R. Silva <[email protected]> Reviewed-by: Aurelien Aptel <[email protected]> Signed-off-by: Steve French <[email protected]>
2018-04-11Merge branch 'l2tp-tunnel-creation-fixes'David S. Miller4-137/+123
Guillaume Nault says: ==================== l2tp: tunnel creation fixes L2TP tunnel creation is racy. We need to make sure that the tunnel returned by l2tp_tunnel_create() isn't going to be freed while the caller is using it. This is done in patch #1, by separating tunnel creation from tunnel registration. With the tunnel registration code in place, we can now check for duplicate tunnels in a race-free way. This is done in patch #2, which incidentally removes the last use of l2tp_tunnel_find(). ==================== Signed-off-by: David S. Miller <[email protected]>
2018-04-11l2tp: fix race in duplicate tunnel detectionGuillaume Nault3-28/+14
We can't use l2tp_tunnel_find() to prevent l2tp_nl_cmd_tunnel_create() from creating a duplicate tunnel. A tunnel can be concurrently registered after l2tp_tunnel_find() returns. Therefore, searching for duplicates must be done at registration time. Finally, remove l2tp_tunnel_find() entirely as it isn't use anywhere anymore. Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP") Signed-off-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-11l2tp: fix races in tunnel creationGuillaume Nault4-110/+110
l2tp_tunnel_create() inserts the new tunnel into the namespace's tunnel list and sets the socket's ->sk_user_data field, before returning it to the caller. Therefore, there are two ways the tunnel can be accessed and freed, before the caller even had the opportunity to take a reference. In practice, syzbot could crash the module by closing the socket right after a new tunnel was returned to pppol2tp_create(). This patch moves tunnel registration out of l2tp_tunnel_create(), so that the caller can safely hold a reference before publishing the tunnel. This second step is done with the new l2tp_tunnel_register() function, which is now responsible for associating the tunnel to its socket and for inserting it into the namespace's list. While moving the code to l2tp_tunnel_register(), a few modifications have been done. First, the socket validation tests are done in a helper function, for clarity. Also, modifying the socket is now done after having inserted the tunnel to the namespace's tunnels list. This will allow insertion to fail, without having to revert theses modifications in the error path (a followup patch will check for duplicate tunnels before insertion). Either the socket is a kernel socket which we control, or it is a user-space socket for which we have a reference on the file descriptor. In any case, the socket isn't going to be closed from under us. Reported-by: [email protected] Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by: Guillaume Nault <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2018-04-11i2c: add param sanity check to i2c_transfer()Ard Biesheuvel1-0/+3
The API docs describe i2c_transfer() as taking a pointer to an array of i2c_msg containing at least 1 entry, but leaves it to the individual drivers to sanity check the msgs and num parameters. Let's do this in core code instead. Signed-off-by: Ard Biesheuvel <[email protected]> [wsa: changed '<= 0' to '< 1'] Signed-off-by: Wolfram Sang <[email protected]>
2018-04-11MAINTAINERS: add maintainer for Renesas I2C related driversWolfram Sang1-0/+16
Intentionally missing i2c-riic here, Chris Brandt will add himself for that one later. Signed-off-by: Wolfram Sang <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>