aboutsummaryrefslogtreecommitdiff
path: root/arch/ia64/kernel/syscalls
AgeCommit message (Collapse)AuthorFilesLines
2023-09-11arch: Remove Itanium (IA-64) architectureArd Biesheuvel2-407/+0
The Itanium architecture is obsolete, and an informal survey [0] reveals that any residual use of Itanium hardware in production is mostly HP-UX or OpenVMS based. The use of Linux on Itanium appears to be limited to enthusiasts that occasionally boot a fresh Linux kernel to see whether things are still working as intended, and perhaps to churn out some distro packages that are rarely used in practice. None of the original companies behind Itanium still produce or support any hardware or software for the architecture, and it is listed as 'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers that contributed on behalf of those companies (nor anyone else, for that matter) have been willing to support or maintain the architecture upstream or even be responsible for applying the odd fix. The Intel firmware team removed all IA-64 support from the Tianocore/EDK2 reference implementation of EFI in 2018. (Itanium is the original architecture for which EFI was developed, and the way Linux supports it deviates significantly from other architectures.) Some distros, such as Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have dropped support years ago. While the argument is being made [1] that there is a 'for the common good' angle to being able to build and run existing projects such as the Grid Community Toolkit [2] on Itanium for interoperability testing, the fact remains that none of those projects are known to be deployed on Linux/ia64, and very few people actually have access to such a system in the first place. Even if there were ways imaginable in which Linux/ia64 could be put to good use today, what matters is whether anyone is actually doing that, and this does not appear to be the case. There are no emulators widely available, and so boot testing Itanium is generally infeasible for ordinary contributors. GCC still supports IA-64 but its compile farm [3] no longer has any IA-64 machines. GLIBC would like to get rid of IA-64 [4] too because it would permit some overdue code cleanups. In summary, the benefits to the ecosystem of having IA-64 be part of it are mostly theoretical, whereas the maintenance overhead of keeping it supported is real. So let's rip off the band aid, and remove the IA-64 arch code entirely. This follows the timeline proposed by the Debian/ia64 maintainer [5], which removes support in a controlled manner, leaving IA-64 in a known good state in the most recent LTS release. Other projects will follow once the kernel support is removed. [0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/ [1] https://lore.kernel.org/all/[email protected]/ [2] https://gridcf.org/gct-docs/latest/index.html [3] https://cfarm.tetaneutral.net/machines/list/ [4] https://lore.kernel.org/all/[email protected]/ [5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/ Acked-by: Tony Luck <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
2023-07-27arch: Register fchmodat2, usually as syscall 452Palmer Dabbelt1-0/+1
This registers the new fchmodat2 syscall in most places as nuber 452, with alpha being the exception where it's 562. I found all these sites by grepping for fspick, which I assume has found me everything. Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Alexey Gladkov <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Message-Id: <a677d521f048e4ca439e7080a5328f21eb8e960e.1689092120.git.legion@kernel.org> Signed-off-by: Christian Brauner <[email protected]>
2023-06-09cachestat: wire up cachestat for other architecturesNhat Pham1-0/+1
cachestat is previously only wired in for x86 (and architectures using the generic unistd.h table): https://lore.kernel.org/lkml/[email protected]/ This patch wires cachestat in for all the other architectures. [[email protected]: wire up cachestat for arm64] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nhat Pham <[email protected]> Tested-by: Michael Ellerman <[email protected]> [powerpc] Acked-by: Geert Uytterhoeven <[email protected]> [m68k] Reviewed-by: Arnd Bergmann <[email protected]> Acked-by: Heiko Carstens <[email protected]> [s390] Cc: Alexander Gordeev <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christophe Leroy <[email protected]> Cc: Chris Zankel <[email protected]> Cc: David S. Miller <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: "James E.J. Bottomley" <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Paul Adrian Glaubitz <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michal Simek <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Yoshinori Sato <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-09-11ia64: fix clock_getres(CLOCK_MONOTONIC) to report ITC frequencySergei Trofimovich1-1/+1
clock_gettime(CLOCK_MONOTONIC, &tp) is very precise on ia64 as it uses ITC (similar to rdtsc on x86). It's not quite a hrtimer as it is a few times slower than 1ns. Usually 2-3ns. clock_getres(CLOCK_MONOTONIC, &res) never reflected that fact and reported 0.04s precision (1/HZ value). In https://bugs.gentoo.org/596382 gstreamer's test suite failed loudly when it noticed precision discrepancy. Before the change: clock_getres(CLOCK_MONOTONIC, &res) reported 250Hz precision. After the change: clock_getres(CLOCK_MONOTONIC, &res) reports ITC (400Mhz) precision. The patch is based on matoro's fix. I added a bit of explanation why we need to special-case arch-specific clock_getres(). [[email protected]: coding-style cleanups] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sergei Trofimovich <[email protected]> Cc: matoro <[email protected]> Cc: Émeric Maschino <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2022-03-31arch: syscalls: simplify uapi/kapi directory creationMasahiro Yamada1-2/+1
$(shell ...) expands to empty. There is no need to assign it to _dummy. Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]>
2022-01-15mm/mempolicy: wire up syscall set_mempolicy_home_nodeAneesh Kumar K.V1-0/+1
Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Aneesh Kumar K.V <[email protected]> Cc: Ben Widawsky <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Feng Tang <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Dan Williams <[email protected]> Cc: Huang Ying <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-11-25futex: Wireup futex_waitv syscallAndré Almeida1-0/+1
Wireup futex_waitv syscall for all remaining archs. Signed-off-by: André Almeida <[email protected]> Acked-by: Max Filippov <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Tested-by: Michael Ellerman <[email protected]> (powerpc) Signed-off-by: Arnd Bergmann <[email protected]>
2021-09-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-0/+2
Merge misc updates from Andrew Morton: "173 patches. Subsystems affected by this series: ia64, ocfs2, block, and mm (debug, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock, oom-kill, migration, ksm, percpu, vmstat, and madvise)" * emailed patches from Andrew Morton <[email protected]>: (173 commits) mm/madvise: add MADV_WILLNEED to process_madvise() mm/vmstat: remove unneeded return value mm/vmstat: simplify the array size calculation mm/vmstat: correct some wrong comments mm/percpu,c: remove obsolete comments of pcpu_chunk_populated() selftests: vm: add COW time test for KSM pages selftests: vm: add KSM merging time test mm: KSM: fix data type selftests: vm: add KSM merging across nodes test selftests: vm: add KSM zero page merging test selftests: vm: add KSM unmerge test selftests: vm: add KSM merge test mm/migrate: correct kernel-doc notation mm: wire up syscall process_mrelease mm: introduce process_mrelease system call memblock: make memblock_find_in_range method private mm/mempolicy.c: use in_task() in mempolicy_slab_node() mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies mm/mempolicy: advertise new MPOL_PREFERRED_MANY mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY ...
2021-09-03mm: wire up syscall process_mreleaseSuren Baghdasaryan1-0/+2
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Rientjes <[email protected]> Cc: Florian Weimer <[email protected]> Cc: Jan Engelhardt <[email protected]> Cc: Jann Horn <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tim Murray <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2021-07-12exit/bdflush: Remove the deprecated bdflush system callEric W. Biederman1-1/+1
The bdflush system call has been deprecated for a very long time. Recently Michael Schmitz tested[1] and found that the last known caller of of the bdflush system call is unaffected by it's removal. Since the code is not needed delete it. [1] https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/87sg10quue.fsf_-_@disp2133 Tested-by: Michael Schmitz <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Acked-by: Cyril Hrubis <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]>
2021-06-07quota: Wire up quotactl_fd syscallJan Kara1-1/+1
Wire up the quotactl_fd syscall. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jan Kara <[email protected]>
2021-05-17quota: Disable quotactl_path syscallJan Kara1-1/+1
In commit fa8b90070a80 ("quota: wire up quotactl_path") we have wired up new quotactl_path syscall. However some people in LWN discussion have objected that the path based syscall is missing dirfd and flags argument which is mostly standard for contemporary path based syscalls. Indeed they have a point and after a discussion with Christian Brauner and Sascha Hauer I've decided to disable the syscall for now and update its API. Since there is no userspace currently using that syscall and it hasn't been released in any major release, we should be fine. CC: Christian Brauner <[email protected]> CC: Sascha Hauer <[email protected]> Link: https://lore.kernel.org/lkml/20210512153621.n5u43jsytbik4yze@wittgenstein Signed-off-by: Jan Kara <[email protected]>
2021-05-01Merge tag 'landlock_v34' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull Landlock LSM from James Morris: "Add Landlock, a new LSM from Mickaël Salaün. Briefly, Landlock provides for unprivileged application sandboxing. From Mickaël's cover letter: "The goal of Landlock is to enable to restrict ambient rights (e.g. global filesystem access) for a set of processes. Because Landlock is a stackable LSM [1], it makes possible to create safe security sandboxes as new security layers in addition to the existing system-wide access-controls. This kind of sandbox is expected to help mitigate the security impact of bugs or unexpected/malicious behaviors in user-space applications. Landlock empowers any process, including unprivileged ones, to securely restrict themselves. Landlock is inspired by seccomp-bpf but instead of filtering syscalls and their raw arguments, a Landlock rule can restrict the use of kernel objects like file hierarchies, according to the kernel semantic. Landlock also takes inspiration from other OS sandbox mechanisms: XNU Sandbox, FreeBSD Capsicum or OpenBSD Pledge/Unveil. In this current form, Landlock misses some access-control features. This enables to minimize this patch series and ease review. This series still addresses multiple use cases, especially with the combined use of seccomp-bpf: applications with built-in sandboxing, init systems, security sandbox tools and security-oriented APIs [2]" The cover letter and v34 posting is here: https://lore.kernel.org/linux-security-module/[email protected]/ See also: https://landlock.io/ This code has had extensive design discussion and review over several years" Link: https://lore.kernel.org/lkml/[email protected]/ [1] Link: https://lore.kernel.org/lkml/[email protected]/ [2] * tag 'landlock_v34' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: landlock: Enable user space to infer supported features landlock: Add user and kernel documentation samples/landlock: Add a sandbox manager example selftests/landlock: Add user space tests landlock: Add syscall implementations arch: Wire up Landlock syscalls fs,security: Add sb_delete hook landlock: Support filesystem access-control LSM: Infrastructure management of the superblock landlock: Add ptrace restrictions landlock: Set up the security framework and manage credentials landlock: Add ruleset and domain management landlock: Add object management
2021-04-29Merge tag 'kbuild-v5.13' of ↵Linus Torvalds3-80/+4
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Evaluate $(call cc-option,...) etc. only for build targets - Add CONFIG_VMLINUX_MAP to generate .map file when linking vmlinux - Remove unnecessary --gcc-toolchains Clang flag because the --prefix flag finds the toolchains - Do not pass Clang's --prefix flag when using the integrated as - Check the assembler version in Kconfig time - Add new CONFIG options, AS_VERSION, AS_IS_GNU, AS_IS_LLVM to clean up some dependencies in Kconfig - Fix invalid Module.symvers creation when building only modules without vmlinux - Fix false-positive modpost warnings when CONFIG_TRIM_UNUSED_KSYMS is set, but there is no module to build - Refactor module installation Makefile - Support zstd for module compression - Convert alpha and ia64 to use generic shell scripts to generate the syscall headers - Add a new elfnote to indicate if the kernel was built with LTO, which will be used by pahole - Flatten the directory structure under include/config/ so CONFIG options and filenames match - Change the deb source package name from linux-$(KERNELRELEASE) to linux-upstream * tag 'kbuild-v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (42 commits) kbuild: Add $(KBUILD_HOSTLDFLAGS) to 'has_libelf' test kbuild: deb-pkg: change the source package name to linux-upstream tools: do not include scripts/Kbuild.include kbuild: redo fake deps at include/config/*.h kbuild: remove TMPO from try-run MAINTAINERS: add pattern for dummy-tools kbuild: add an elfnote for whether vmlinux is built with lto ia64: syscalls: switch to generic syscallhdr.sh ia64: syscalls: switch to generic syscalltbl.sh alpha: syscalls: switch to generic syscallhdr.sh alpha: syscalls: switch to generic syscalltbl.sh sysctl: use min() helper for namecmp() kbuild: add support for zstd compressed modules kbuild: remove CONFIG_MODULE_COMPRESS kbuild: merge scripts/Makefile.modsign to scripts/Makefile.modinst kbuild: move module strip/compression code into scripts/Makefile.modinst kbuild: refactor scripts/Makefile.modinst kbuild: rename extmod-prefix to extmod_prefix kbuild: check module name conflict for external modules as well kbuild: show the target directory for depmod log ...
2021-04-25ia64: syscalls: switch to generic syscallhdr.shMasahiro Yamada2-42/+2
Many architectures duplicate similar shell scripts. This commit converts ia64 to use scripts/syscallhdr.sh. Signed-off-by: Masahiro Yamada <[email protected]>
2021-04-25ia64: syscalls: switch to generic syscalltbl.shMasahiro Yamada2-38/+2
Many architectures duplicate similar shell scripts. This commit converts ia64 to use scripts/syscalltbl.sh. Signed-off-by: Masahiro Yamada <[email protected]>
2021-04-22arch: Wire up Landlock syscallsMickaël Salaün1-0/+3
Wire up the following system calls for all architectures: * landlock_create_ruleset(2) * landlock_add_rule(2) * landlock_restrict_self(2) Cc: Arnd Bergmann <[email protected]> Cc: James Morris <[email protected]> Cc: Jann Horn <[email protected]> Cc: Kees Cook <[email protected]> Cc: Serge E. Hallyn <[email protected]> Signed-off-by: Mickaël Salaün <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: James Morris <[email protected]>
2021-03-17quota: wire up quotactl_pathSascha Hauer1-0/+1
Wire up the quotactl_path syscall added in the previous patch. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sascha Hauer <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Jan Kara <[email protected]>
2021-02-25Merge tag 'kbuild-v5.12' of ↵Linus Torvalds1-6/+7
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Fix false-positive build warnings for ARCH=ia64 builds - Optimize dictionary size for module compression with xz - Check the compiler and linker versions in Kconfig - Fix misuse of extra-y - Support DWARF v5 debug info - Clamp SUBLEVEL to 255 because stable releases 4.4.x and 4.9.x exceeded the limit - Add generic syscall{tbl,hdr}.sh for cleanups across arches - Minor cleanups of genksyms - Minor cleanups of Kconfig * tag 'kbuild-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (38 commits) initramfs: Remove redundant dependency of RD_ZSTD on BLK_DEV_INITRD kbuild: remove deprecated 'always' and 'hostprogs-y/m' kbuild: parse C= and M= before changing the working directory kbuild: reuse this-makefile to define abs_srctree kconfig: unify rule of config, menuconfig, nconfig, gconfig, xconfig kconfig: omit --oldaskconfig option for 'make config' kconfig: fix 'invalid option' for help option kconfig: remove dead code in conf_askvalue() kconfig: clean up nested if-conditionals in check_conf() kconfig: Remove duplicate call to sym_get_string_value() Makefile: Remove # characters from compiler string Makefile: reuse CC_VERSION_TEXT kbuild: check the minimum linker version in Kconfig kbuild: remove ld-version macro scripts: add generic syscallhdr.sh scripts: add generic syscalltbl.sh arch: syscalls: remove $(srctree)/ prefix from syscall tables arch: syscalls: add missing FORCE and fix 'targets' to make if_changed work gen_compile_commands: prune some directories kbuild: simplify access to the kernel's version ...
2021-02-22arch: syscalls: remove $(srctree)/ prefix from syscall tablesMasahiro Yamada1-1/+1
The 'syscall' variables are not directly used in the commands. Remove the $(srctree)/ prefix because we can rely on VPATH. Signed-off-by: Masahiro Yamada <[email protected]>
2021-02-22arch: syscalls: add missing FORCE and fix 'targets' to make if_changed workMasahiro Yamada1-5/+6
The rules in these Makefiles cannot detect the command line change because the prerequisite 'FORCE' is missing. Adding 'FORCE' will result in the headers being rebuilt every time because the 'targets' additions are also wrong; the file paths in 'targets' must be relative to the current Makefile. Fix all of them so the if_changed rules work correctly. Signed-off-by: Masahiro Yamada <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]>
2021-01-24fs: add mount_setattr()Christian Brauner1-0/+1
This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. [1]: commit 2e4b7fcd9260 ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/[email protected] Cc: David Howells <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2020-12-19epoll: wire up syscall epoll_pwait2Willem de Bruijn1-0/+1
Split off from prev patch in the series that implements the syscall. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Willem de Bruijn <[email protected]> Cc: Al Viro <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2020-10-18mm/madvise: introduce process_madvise() syscall: an external memory hinting APIMinchan Kim1-0/+1
There is usecase that System Management Software(SMS) want to give a memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the case of Android, it is the ActivityManagerService. The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon(ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall process_madvise(2). It uses pidfd of an external process to give the hint. It also supports vector address range because Android app has thousands of vmas due to zygote so it's totally waste of CPU and power if we should call the syscall one by one for each vma.(With testing 2000-vma syscall vs 1-vector syscall, it showed 15% performance improvement. I think it would be bigger in real practice because the testing ran very cache friendly environment). Another potential use case for the vector range is to amortize the cost ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In future, we could find more usecases for other advises so let's make it happens as API since we introduce a new syscall at this moment. With that, existing madvise(2) user could replace it with process_madvise(2) with their own pid if they want to have batch address ranges support feature. ince it could affect other process's address range, only privileged process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same UID) gives it the right to ptrace the process could use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting all hints madvise has/will supported/support to process_madvise is rather risky. Because we are not sure all hints make sense from external process and implementation for the hint may rely on the caller being in the current context so it could be error-prone. Thus, I just limited hints as MADV_[COLD|PAGEOUT] in this patch. If someone want to add other hints, we could hear the usecase and review it for each hint. It's safer for maintenance rather than introducing a buggy syscall but hard to fix it later. So finally, the API is as follows, ssize_t process_madvise(int pidfd, const struct iovec *iovec, unsigned long vlen, int advice, unsigned int flags); DESCRIPTION The process_madvise() system call is used to give advice or directions to the kernel about the address ranges from external process as well as local process. It provides the advice to address ranges of process described by iovec and vlen. The goal of such advice is to improve system or application performance. The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidofd_open(2) for further information) The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as: struct iovec { void *iov_base; /* starting address */ size_t iov_len; /* number of bytes to be advised */ }; The iovec describes address ranges beginning at address(iov_base) and with size length of bytes(iov_len). The vlen represents the number of elements in iovec. The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external. MADV_COLD MADV_PAGEOUT Permission to provide a hint to external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2). The process_madvise supports every advice madvise(2) has if target process is in same thread group with calling process so user could use process_madvise(2) to extend existing madvise(2) to support vector address ranges. RETURN VALUE On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes, if an error occurred. The caller should check return value to determine whether a partial advice occurred. FAQ: Q.1 - Why does any external entity have better knowledge? Quote from Sandeep "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp Q.2 - How to guarantee the race(i.e., object validation) between when giving a hint from an external process and get the hint from the target process? process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the space target process can run between the time the process_madvise process inspects the target process address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. It also apply other APIs like move_pages, process_vm_write. The race isn't really a problem though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap. It never guarantees newly allocated address space is still valid when the user tries to access it because other threads could unmap the memory right before. That's where we need synchronization by using other API or design from userside. It shouldn't be part of API itself. If someone needs more fine-grained synchronization rather than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via using last reserved argument of the API but I don't think it's necessary right now since we have already ways to prevent the race so don't want to add additional complexity with more fine-grained optimization model. To make the API extend, it reserved an unsigned long as last argument so we could support it in future if someone really needs it. Q.3 - Why doesn't ptrace work? Injecting an madvise in the target process using ptrace would not work for us because such injected madvise would have to be executed by the target process, which means that process would have to be runnable and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to act the hint in caller's context, not the callee's, because the callee is usually limited in cpuset/cgroups or even freezed state so they can't act by themselves quick enough, which causes more thrashing/kill. It doesn't work if the target process are ptraced(e.g., strace, debugger, minidump) because a process can have at most one ptracer. [1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/[email protected]/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/[email protected]/ [[email protected]: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/[email protected] [[email protected]: fix patch ordering issue] [[email protected]: fix arm64 whoops] [[email protected]: make process_madvise() vlen arg have type size_t, per Florian] [[email protected]: fix i386 build] [[email protected]: fix syscall numbering] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: fix mips build] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: pidfd_get_pid() gained an argument] [[email protected]: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Minchan Kim <[email protected]> Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Stephen Rothwell <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Suren Baghdasaryan <[email protected]> Reviewed-by: Vlastimil Babka <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Alexander Duyck <[email protected]> Cc: Brian Geffon <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Daniel Colascione <[email protected]> Cc: Jann Horn <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: John Dias <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oleksandr Natalenko <[email protected]> Cc: Sandeep Patil <[email protected]> Cc: SeongJae Park <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Sonny Rao <[email protected]> Cc: Tim Murray <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Florian Weimer <[email protected]> Cc: <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-09-11ia64: Remove perfmonChristoph Hellwig1-1/+1
perfmon has been marked broken and thus been disabled for all builds for more than two years. Remove it entirely. Cc: Anant Thazhemadam <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]> Enthusiastically-ACKed-by: Al Viro <[email protected]> Signed-off-by: Tony Luck <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2020-08-14all arch: remove system call sys_sysctlXiaoming Ni1-1/+1
Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"), sys_sysctl is actually unavailable: any input can only return an error. We have been warning about people using the sysctl system call for years and believe there are no more users. Even if there are users of this interface if they have not complained or fixed their code by now they probably are not going to, so there is no point in warning them any longer. So completely remove sys_sysctl on all architectures. [[email protected]: s390: fix build error for sys_call_table_emu] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Xiaoming Ni <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Acked-by: Will Deacon <[email protected]> [arm/arm64] Acked-by: "Eric W. Biederman" <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Al Viro <[email protected]> Cc: Andi Kleen <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Arnaldo Carvalho de Melo <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Bin Meng <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: chenzefeng <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Chris Zankel <[email protected]> Cc: David Howells <[email protected]> Cc: David S. Miller <[email protected]> Cc: Diego Elio Pettenò <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Dominik Brodowski <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Geert Uytterhoeven <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Helge Deller <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Iurii Zaikin <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Bottomley <[email protected]> Cc: Jens Axboe <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kars de Jong <[email protected]> Cc: Kees Cook <[email protected]> Cc: Krzysztof Kozlowski <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Marco Elver <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Martin K. Petersen <[email protected]> Cc: Masahiro Yamada <[email protected]> Cc: Matt Turner <[email protected]> Cc: Max Filippov <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Michal Simek <[email protected]> Cc: Miklos Szeredi <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Naveen N. Rao <[email protected]> Cc: Nick Piggin <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Olof Johansson <[email protected]> Cc: Paul Burton <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Peter Zijlstra (Intel) <[email protected]> Cc: Randy Dunlap <[email protected]> Cc: Ravi Bangoria <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Rich Felker <[email protected]> Cc: Russell King <[email protected]> Cc: Sami Tolvanen <[email protected]> Cc: Sargun Dhillon <[email protected]> Cc: Stephen Rothwell <[email protected]> Cc: Sudeep Holla <[email protected]> Cc: Sven Schnelle <[email protected]> Cc: Thiago Jung Bauermann <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tony Luck <[email protected]> Cc: Vasily Gorbik <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yoshinori Sato <[email protected]> Cc: Zhou Yanjie <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Linus Torvalds <[email protected]>
2020-06-17arch: wire-up close_range()Christian Brauner1-0/+1
This wires up the close_range() syscall into all arches at once. Suggested-by: Arnd Bergmann <[email protected]> Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Acked-by: Michael Ellerman <[email protected]> (powerpc) Cc: Jann Horn <[email protected]> Cc: David Howells <[email protected]> Cc: Dmitry V. Levin <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]> Cc: Florian Weimer <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected]
2020-05-14vfs: add faccessat2 syscallMiklos Szeredi1-0/+1
POSIX defines faccessat() as having a fourth "flags" argument, while the linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken. Add a new faccessat(2) syscall with the added flags argument and implement both flags. The value of AT_EACCESS is defined in glibc headers to be the same as AT_REMOVEDIR. Use this value for the kernel interface as well, together with the explanatory comment. Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can be useful and is trivial to implement. Signed-off-by: Miklos Szeredi <[email protected]>
2020-04-07asm-generic: fix unistd_32.h generation formatMichal Simek1-1/+1
Generated files are also checked by sparse that's why add newline to remove sparse (C=1) warning. The issue was found on Microblaze and reported like this: ./arch/microblaze/include/generated/uapi/asm/unistd_32.h:438:45: warning: no newline at end of file Mips and PowerPC have it already but let's align with style used by m68k. Signed-off-by: Michal Simek <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Reviewed-by: Stefan Asserhall <[email protected]> Acked-by: Max Filippov <[email protected]> (xtensa) Cc: Arnd Bergmann <[email protected]> Cc: Max Filippov <[email protected]> Cc: Benjamin Herrenschmidt <[email protected]> Cc: Chris Zankel <[email protected]> Cc: David S. Miller <[email protected]> Cc: Fenghua Yu <[email protected]> Cc: Helge Deller <[email protected]> Cc: Ivan Kokshaysky <[email protected]> Cc: James Bottomley <[email protected]> Cc: Matt Turner <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Paul Burton <[email protected]> Cc: Paul Mackerras <[email protected]> Cc: Ralf Baechle <[email protected]> Cc: Rich Felker <[email protected]> Cc: Richard Henderson <[email protected]> Cc: Tony Luck <[email protected]> Cc: Yoshinori Sato <[email protected]> Link: http://lkml.kernel.org/r/4d32ab4e1fb2edb691d2e1687e8fb303c09fd023.1581504803.git.michal.simek@xilinx.com Signed-off-by: Linus Torvalds <[email protected]>
2020-01-29Merge tag 'threads-v5.6' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull thread management updates from Christian Brauner: "Sargun Dhillon over the last cycle has worked on the pidfd_getfd() syscall. This syscall allows for the retrieval of file descriptors of a process based on its pidfd. A task needs to have ptrace_may_access() permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and Andy) on the target. One of the main use-cases is in combination with seccomp's user notification feature. As a reminder, seccomp's user notification feature was made available in v5.0. It allows a task to retrieve a file descriptor for its seccomp filter. The file descriptor is usually handed of to a more privileged supervising process. The supervisor can then listen for syscall events caught by the seccomp filter of the supervisee and perform actions in lieu of the supervisee, usually emulating syscalls. pidfd_getfd() is needed to expand its uses. There are currently two major users that wait on pidfd_getfd() and one future user: - Netflix, Sargun said, is working on a service mesh where users should be able to connect to a dns-based VIP. When a user connects to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be redirected to an envoy process. This service mesh uses seccomp user notifications and pidfd to intercept all connect calls and instead of connecting them to 1.2.3.4:80 connects them to e.g. 127.0.0.1:8080. - LXD uses the seccomp notifier heavily to intercept and emulate mknod() and mount() syscalls for unprivileged containers/processes. With pidfd_getfd() more uses-cases e.g. bridging socket connections will be possible. - The patchset has also seen some interest from the browser corner. Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a broker process. In the future glibc will start blocking all signals during dlopen() rendering this type of sandbox impossible. Hence, in the future Firefox will switch to a seccomp-user-nofication based sandbox which also makes use of file descriptor retrieval. The thread for this can be found at https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html With pidfd_getfd() it is e.g. possible to bridge socket connections for the supervisee (binding to a privileged port) and taking actions on file descriptors on behalf of the supervisee in general. Sargun's first version was using an ioctl on pidfds but various people pushed for it to be a proper syscall which he duely implemented as well over various review cycles. Selftests are of course included. I've also added instructions how to deal with merge conflicts below. There's also a small fix coming from the kernel mentee project to correctly annotate struct sighand_struct with __rcu to fix various sparse warnings. We've received a few more such fixes and even though they are mostly trivial I've decided to postpone them until after -rc1 since they came in rather late and I don't want to risk introducing build warnings. Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is needed to avoid allocation recursions triggerable by storage drivers that have userspace parts that run in the IO path (e.g. dm-multipath, iscsi, etc). These allocation recursions deadlock the device. The new prctl() allows such privileged userspace components to avoid allocation recursions by setting the PF_MEMALLOC_NOIO and PF_LESS_THROTTLE flags. The patch carries the necessary acks from the relevant maintainers and is routed here as part of prctl() thread-management." * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim sched.h: Annotate sighand_struct with __rcu test: Add test for pidfd getfd arch: wire up pidfd_getfd syscall pid: Implement pidfd_getfd syscall vfs, fdtable: Add fget_task helper
2020-01-18open: introduce openat2(2) syscallAleksa Sarai1-0/+1
/* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/[email protected]/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <[email protected]> Signed-off-by: Aleksa Sarai <[email protected]> Signed-off-by: Al Viro <[email protected]>
2020-01-13arch: wire up pidfd_getfd syscallSargun Dhillon1-0/+1
This wires up the pidfd_getfd syscall for all architectures. Signed-off-by: Sargun Dhillon <[email protected]> Acked-by: Christian Brauner <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2019-07-15arch: mark syscall number 435 reserved for clone3Christian Brauner1-0/+1
A while ago Arnd made it possible to give new system calls the same syscall number on all architectures (except alpha). To not break this nice new feature let's mark 435 for clone3 as reserved on all architectures that do not yet implement it. Even if an architecture does not plan to implement it this ensures that new system calls coming after clone3 will have the same number on all architectures. Signed-off-by: Christian Brauner <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Arnd Bergmann <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2019-06-28arch: wire-up pidfd_open()Christian Brauner1-0/+1
This wires up the pidfd_open() syscall into all arches at once. Signed-off-by: Christian Brauner <[email protected]> Reviewed-by: David Howells <[email protected]> Reviewed-by: Oleg Nesterov <[email protected]> Acked-by: Arnd Bergmann <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Cc: Kees Cook <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Jann Horn <[email protected]> Cc: Andy Lutomirsky <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Aleksa Sarai <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Al Viro <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected]
2019-05-16uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]David Howells1-0/+6
Wire up the mount API syscalls on non-x86 arches. Reported-by: Arnd Bergmann <[email protected]> Signed-off-by: David Howells <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Signed-off-by: Al Viro <[email protected]>
2019-04-15arch: add pidfd and io_uring syscalls everywhereArnd Bergmann1-0/+4
Add the io_uring and pidfd_send_signal system calls to all architectures. These system calls are designed to handle both native and compat tasks, so all entries are the same across architectures, only arm-compat and the generic tale still use an old format. Acked-by: Michael Ellerman <[email protected]> (powerpc) Acked-by: Heiko Carstens <[email protected]> (s390) Acked-by: Geert Uytterhoeven <[email protected]> Signed-off-by: Arnd Bergmann <[email protected]>
2019-02-07y2038: add 64-bit time_t syscalls to all 32-bit architecturesArnd Bergmann1-0/+1
This adds 21 new system calls on each ABI that has 32-bit time_t today. All of these have the exact same semantics as their existing counterparts, and the new ones all have macro names that end in 'time64' for clarification. This gets us to the point of being able to safely use a C library that has 64-bit time_t in user space. There are still a couple of loose ends to tie up in various areas of the code, but this is the big one, and should be entirely uncontroversial at this point. In particular, there are four system calls (getitimer, setitimer, waitid, and getrusage) that don't have a 64-bit counterpart yet, but these can all be safely implemented in the C library by wrapping around the existing system calls because the 32-bit time_t they pass only counts elapsed time, not time since the epoch. They will be dealt with later. Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Heiko Carstens <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]> Acked-by: Catalin Marinas <[email protected]>
2019-01-25arch: add pkey and rseq syscall numbers everywhereArnd Bergmann1-0/+4
Most architectures define system call numbers for the rseq and pkey system calls, even when they don't support the features, and perhaps never will. Only a few architectures are missing these, so just define them anyway for consistency. If we decide to add them later to one of these, the system call numbers won't get out of sync then. Signed-off-by: Arnd Bergmann <[email protected]> Acked-by: Heiko Carstens <[email protected]> Acked-by: Geert Uytterhoeven <[email protected]>
2019-01-25ia64: assign syscall numbers for perf and seccompArnd Bergmann1-0/+2
Most architectures have assigned numbers for both seccomp and perf_event_open, even when they do not implement either. ia64 is an exception here, so for consistency lets add numbers for both of them. Unless CONFIG_PERF_EVENTS and CONFIG_SECCOMP are implemented, the system calls just return -ENOSYS. Signed-off-by: Arnd Bergmann <[email protected]>
2019-01-25ia64: add statx and io_pgetevents syscallsArnd Bergmann1-0/+2
All architectures should implement these two, so assign numbers and hook them up on ia64. Signed-off-by: Arnd Bergmann <[email protected]>
2019-01-25ia64: add __NR_umount2 definitionArnd Bergmann1-1/+1
Other architectures commonly use __NR_umount2 for sys_umount, only ia64 and alpha use __NR_umount here. In order to synchronize the generated tables, use umount2 like everyone else, and add back the old name from asm/unistd.h for compatibility. The __IGNORE_* lines are now all obsolete and can be removed as a side-effect. Signed-off-by: Arnd Bergmann <[email protected]>
2018-11-13ia64: add system call table generation supportFiroz Khan4-0/+445
The system call tables are in different format in all architecture and it will be difficult to manually add, modify or delete the syscall table entries in the res- pective files. To make it easy by keeping a script and which will generate the uapi header and syscall table file. This change will also help to unify the implemen- tation across all architectures. The system call table generation script is added in kernel/syscalls directory which contain the scripts to generate both uapi header file and system call table files. The syscall.tbl will be input for the scripts. syscall.tbl contains the list of available system calls along with system call number and corresponding entry point. Add a new system call in this architecture will be possible by adding new entry in the syscall.tbl file. Adding a new table entry consisting of: - System call number. - ABI. - System call name. - Entry point name. syscallhdr.sh and syscalltbl.sh will generate uapi header unistd_64.h and syscall_table.h files respectively. Both .sh files will parse the content syscall.tbl to generate the header and table files. unistd_64.h will be included by uapi/asm/unistd.h and syscall_table.h is included by kernel/entry.S - the real system call table. ARM, s390 and x86 architecuture does have similar support. I leverage their implementation to come up with a generic solution. Signed-off-by: Firoz Khan <[email protected]> Signed-off-by: Tony Luck <[email protected]>