Age | Commit message (Collapse) | Author | Files | Lines |
|
Daniel Borkmann says:
====================
This set contains three more fixes for the bpf_msg_pull_data()
mainly for correcting scatterlist ring wrap-arounds as well as
fixing up data pointers. For details please see individual patches.
Thanks!
====================
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
When we perform the sg shift repair for the scatterlist ring, we
currently start out at i = first_sg + 1. However, this is not
correct since the first_sg could point to the sge sitting at slot
MAX_SKB_FRAGS - 1, and a subsequent i = MAX_SKB_FRAGS will access
the scatterlist ring (sg) out of bounds. Add the sk_msg_iter_var()
helper for iterating through the ring, and apply the same rule
for advancing to the next ring element as we do elsewhere. Later
work will use this helper also in other places.
Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
If first_sg and last_sg wraps around in the scatterlist ring, then we
need to account for that in the shift as well. E.g. crafting such msgs
where this is the case leads to a hang as shift becomes negative. E.g.
consider the following scenario:
first_sg := 14 |=> shift := -12 msg->sg_start := 10
last_sg := 3 | msg->sg_end := 5
round 1: i := 15, move_from := 3, sg[15] := sg[ 3]
round 2: i := 0, move_from := -12, sg[ 0] := sg[-12]
round 3: i := 1, move_from := -11, sg[ 1] := sg[-11]
round 4: i := 2, move_from := -10, sg[ 2] := sg[-10]
[...]
round 13: i := 11, move_from := -1, sg[ 2] := sg[ -1]
round 14: i := 12, move_from := 0, sg[ 2] := sg[ 0]
round 15: i := 13, move_from := 1, sg[ 2] := sg[ 1]
round 16: i := 14, move_from := 2, sg[ 2] := sg[ 2]
round 17: i := 15, move_from := 3, sg[ 2] := sg[ 3]
[...]
This means we will loop forever and never hit the msg->sg_end condition
to break out of the loop. When we see that the ring wraps around, then
the shift should be MAX_SKB_FRAGS - first_sg + last_sg - 1. Meaning,
the remainder slots from the tail of the ring and the head until last_sg
combined.
Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
In the current code, msg->data is set as sg_virt(&sg[i]) + start - offset
and msg->data_end relative to it as msg->data + bytes. Using iterator i
to point to the updated starting scatterlist element holds true for some
cases, however not for all where we'd end up pointing out of bounds. It
is /correct/ for these ones:
1) When first finding the starting scatterlist element (sge) where we
find that the page is already privately owned by the msg and where
the requested bytes and headroom fit into the sge's length.
However, it's /incorrect/ for the following ones:
2) After we made the requested area private and updated the newly allocated
page into first_sg slot of the scatterlist ring; when we find that no
shift repair of the ring is needed where we bail out updating msg->data
and msg->data_end. At that point i will point to last_sg, which in this
case is the next elem of first_sg in the ring. The sge at that point
might as well be invalid (e.g. i == msg->sg_end), which we use for
setting the range of sg_virt(&sg[i]). The correct one would have been
first_sg.
3) Similar as in 2) but when we find that a shift repair of the ring is
needed. In this case we fix up all sges and stop once we've reached the
end. In this case i will point to will point to the new msg->sg_end,
and the sge at that point will be invalid. Again here the requested
range sits in first_sg.
Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Pull NVMe fixes from Christoph.
* 'nvme-4.19' of git://git.infradead.org/nvme:
nvmet: free workqueue object if module init fails
nvme-fcloop: Fix dropped LS's to removed target port
nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event
|
|
Like d88b6d04: "cdrom: information leak in cdrom_ioctl_media_changed()"
There is another cast from unsigned long to int which causes
a bounds check to fail with specially crafted input. The value is
then used as an index in the slot array in cdrom_slot_status().
Signed-off-by: Scott Bauer <[email protected]>
Signed-off-by: Scott Bauer <[email protected]>
Cc: [email protected]
Signed-off-by: Jens Axboe <[email protected]>
|
|
Add of_get_compatible_child() helper that can be used to lookup
compatible child nodes.
Several drivers currently use of_find_compatible_node() to lookup child
nodes while failing to notice that the of_find_ functions search the
entire tree depth-first (from a given start node) and therefore can
match unrelated nodes. The fact that these functions also drop a
reference to the node they start searching from (e.g. the parent node)
is typically also overlooked, something which can lead to use-after-free
bugs.
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
|
|
Using a private template is problematic:
1. We can't assign both a zone and a timeout policy
(zone assigns a conntrack template, so we hit problem 1)
2. Using a template needs to take care of ct refcount, else we'll
eventually free the private template due to ->use underflow.
This patch reworks template policy to instead work with existing conntrack.
As long as such conntrack has not yet been placed into the hash table
(unconfirmed) we can still add the timeout extension.
The only caveat is that we now need to update/correct ct->timeout to
reflect the initial/new state, otherwise the conntrack entry retains the
default 'new' timeout.
Side effect of this change is that setting the policy must
now occur from chains that are evaluated *after* the conntrack lookup
has taken place.
No released kernel contains the timeout policy feature yet, so this change
should be ok.
Changes since v2:
- don't handle 'ct is confirmed case'
- after previous patch, no need to special-case tcp/dccp/sctp timeout
anymore
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
tcp, sctp and dccp trackers re-use the userspace ctnetlink states
to index their timeout arrays, which means timeout[0] is never
used. Copy the 'new' state (syn-sent, dccp-request, ..) to 0 as well
so external users can simply read it off timeouts[0] without need to
differentiate dccp/sctp/tcp and udp/icmp/gre/generic.
The alternative is to map all array accesses to 'i - 1', but that
is a much more intrusive change.
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
|
|
An unfortunate consequence of having a strong typing for the input
values to the SMC call is that it also affects the type of the
return values, limiting r0 to 32 bits and r{1,2,3} to whatever
was passed as an input.
Let's turn everything into "unsigned long", which satisfies the
requirements of both architectures, and allows for the full
range of return values.
Reported-by: Julien Grall <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
- Fix mismatch between SVE registers (Z) and FPSIMD register (V)
- Don't prefix the path for [3] with Linux to stay consistent with
[1] and [2].
Signed-off-by: Julien Grall <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
|
|
When building building AMSDU from non-linear SKB, we hit a
kernel panic when trying to push the padding to the tail.
Instead, put the padding at the head of the next subframe.
This also fixes the A-MSDU subframes to not have the padding
accounted in the length field and not have pad at all for
the last subframe, both required by the spec.
Fixes: 6e0456b54545 ("mac80211: add A-MSDU tx support")
Signed-off-by: Sara Sharon <[email protected]>
Reviewed-by: Lorenzo Bianconi <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
|
|
IEEE 802.11-2016 14.10.8.3 HWMP sequence numbering says:
If it is a target mesh STA, it shall update its own HWMP SN to
maximum (current HWMP SN, target HWMP SN in the PREQ element) + 1
immediately before it generates a PREP element in response to a
PREQ element.
Signed-off-by: Yuan-Chi Pang <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
|
|
While recently going over bpf_msg_pull_data(), I noticed three
issues which are fixed in here:
1) When we attempt to find the first scatterlist element (sge)
for the start offset, we add len to the offset before we check
for start < offset + len, whereas it should come after when
we iterate to the next sge to accumulate the offsets. For
example, given a start offset of 12 with a sge length of 8
for the first sge in the list would lead us to determine this
sge as the first sge thinking it covers first 16 bytes where
start is located, whereas start sits in subsequent sges so
we would end up pulling in the wrong data.
2) After figuring out the starting sge, we have a short-cut test
in !msg->sg_copy[i] && bytes <= len. This checks whether it's
not needed to make the page at the sge private where we can
just exit by updating msg->data and msg->data_end. However,
the length test is not fully correct. bytes <= len checks
whether the requested bytes (end - start offsets) fit into the
sge's length. The part that is missing is that start must not
be sge length aligned. Meaning, the start offset into the sge
needs to be accounted as well on top of the requested bytes
as otherwise we can access the sge out of bounds. For example
the sge could have length of 8, our requested bytes could have
length of 8, but at a start offset of 4, so we also would need
to pull in 4 bytes of the next sge, when we jump to the out
label we do set msg->data to sg_virt(&sg[i]) + start - offset
and msg->data_end to msg->data + bytes which would be oob.
3) The subsequent bytes < copy test for finding the last sge has
the same issue as in point 2) but also it tests for less than
rather than less or equal to. Meaning if the sge length is of
8 and requested bytes of 8 while having the start aligned with
the sge, we would unnecessarily go and pull in the next sge as
well to make it private.
Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
Pull thermal fixes from Eduardo Valentin:
"Minor fixes to OF thermal, qoriq, and rcar drivers"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal:
thermal: of-thermal: disable passive polling when thermal zone is disabled
thermal: rcar_gen3_thermal: convert to SPDX identifiers
thermal: rcar_thermal: convert to SPDX identifiers
thermal: qoriq: Switch to SPDX identifier
thermal: qoriq: Simplify the 'site' variable assignment
thermal: qoriq: Use devm_thermal_zone_of_sensor_register()
|
|
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:
struct foo {
int stuff;
void *entry[];
};
instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count,
GFP_KERNEL);
Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:
instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
Notice that, currently, there is a bug during the allocation:
sizeof(npcm7xx_clk_data) should be sizeof(*npcm7xx_clk_data)
Fix this bug by using struct_size() in kzalloc()
This issue was detected with the help of Coccinelle.
Cc: [email protected]
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Avi Fishman <[email protected]>
Signed-off-by: Stephen Boyd <[email protected]>
|
|
Variable save_pud is being assigned but is never used hence it is
redundant and can be removed.
Cleans up clang warning:
variable 'save_pud' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Signed-off-by: Boris Ostrovsky <[email protected]>
|
|
Export device state to sysfs to allow for easier get device state.
Signed-off-by: Joe Jin <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Signed-off-by: Boris Ostrovsky <[email protected]>
|
|
Thanks to Christoph Hellwig for pointing out a cleaner way to do this,
as my approach was quite ugly.
CC: Christoph Hellwig <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
As of commit fd1102f0aade ("mm: mmu_notifier fix for tlb_end_vma"),
asm-generic/tlb.h now calls tlb_flush() from a static inline function,
so we need to make sure that it's declared before #including the
asm-generic header in the arch header.
Reported-by: Guenter Roeck <[email protected]>
Fixes: fd1102f0aade ("mm: mmu_notifier fix for tlb_end_vma")
Signed-off-by: Will Deacon <[email protected]>
[groeck: Use forward declaration instead of moving inline function]
Signed-off-by: Guenter Roeck <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
I managed to miss one of Rob's code reviews on the mailing list
<http://lists.infradead.org/pipermail/linux-riscv/2018-August/001139.html>.
The patch has already been merged, so I'm submitting a fixup.
Sorry!
Fixes: b67bc7cb4088 ("dt-bindings: interrupt-controller: RISC-V local interrupt controller")
Cc: Rob Herring <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Karsten Merker <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
|
|
We use kzalloc to allocate the write_buf that we use for
i2c transfer on hdcp write. But it seems that we are forgetting
to free the memory that is not needed after i2c transfer is
completed.
Reported-by: Brian J Wood <[email protected]>
Fixes: 2320175feb74 ("drm/i915: Implement HDCP for HDMI")
Cc: Ramalingam C <[email protected]>
Cc: Sean Paul <[email protected]>
Cc: Jani Nikula <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: <[email protected]> # v4.17+
Signed-off-by: Rodrigo Vivi <[email protected]>
Reviewed-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 62d3a8deaa10b8346d979d0dabde56c33b742afa)
Signed-off-by: Rodrigo Vivi <[email protected]>
|
|
The workaround was supposed to look at the plane destination
coordinates. Currently it's looking at some mixture of src
and dst coordinates that doesn't make sense. Fix it up.
Signed-off-by: Ville Syrjälä <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Fixes: 394676f05bee (drm/i915: Add WA for planes ending close to left screen edge)
Reviewed-by: Imre Deak <[email protected]>
(cherry picked from commit b1f1c2c11fc6c6cd3e361061e30f9b2839897b28)
Signed-off-by: Rodrigo Vivi <[email protected]>
|
|
Fix the VMC page fault when the running sequence is as below:
1.amdgpu_gem_create_ioctl
2.ttm_bo_swapout->amdgpu_vm_bo_invalidate, as not called
amdgpu_vm_bo_base_init, so won't called
list_add_tail(&base->bo_list, &bo->va). Even the bo was evicted,
it won't set the bo_base->moved.
3.drm_gem_open_ioctl->amdgpu_vm_bo_base_init, here only called
list_move_tail(&base->vm_status, &vm->evicted), but not set the
bo_base->moved.
4.amdgpu_vm_bo_map->amdgpu_vm_bo_insert_map, as the bo_base->moved is
not set true, the function amdgpu_vm_bo_insert_map will call
list_move(&bo_va->base.vm_status, &vm->moved)
5.amdgpu_cs_ioctl won't validate the swapout bo, as it is only in the
moved list, not in the evict list. So VMC page fault occurs.
Signed-off-by: Emily Deng <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
|
|
Otherwise we can get the following errors occasionally on some devices:
mmc1: tried to HW reset card, got error -110
mmcblk1: error -110 requesting status
mmcblk1: recovery failed!
print_req_error: I/O error, dev mmcblk1, sector 14329
...
I have one device that hits this error almost on every boot, and another
one that hits it only rarely with the other ones I've used behave without
problems. I'm not sure if the issue is related to a particular eMMC card
model, but in case it is, both of the machines with issues have:
# cat /sys/class/mmc_host/mmc1/mmc1:0001/manfid \
/sys/class/mmc_host/mmc1/mmc1:0001/oemid \
/sys/class/mmc_host/mmc1/mmc1:0001/name
0x000045
0x0100
SEM16G
and the working ones have:
0x000011
0x0100
016G92
Note that "ti,non-removable" is different as omap_hsmmc_reg_get() does not
call omap_hsmmc_disable_boot_regulators() if no_regulator_off_init is set.
And currently we set no_regulator_off_init only for "ti,non-removable" and
not for "non-removable". It seems that we should have "non-removable" with
some other mmc generic property behave in the same way instead of having to
use a non-generic property. But let's fix the issue first.
Fixes: 7e2f8c0ae670 ("ARM: dts: Add minimal support for motorola droid 4
xt894")
Cc: Marcel Partap <[email protected]>
Cc: Merlijn Wajer <[email protected]>
Cc: Michael Scott <[email protected]>
Cc: NeKit <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
|
|
|
|
Fix wrong mode for dts file added by commit bb3e3fbbac86
("ARM: dts: Add DT support for Octavo Systems OSD3358-SM-RED
based on TI AM335x").
Signed-off-by: Neeraj Dantu <[email protected]>
CC: Robert Nelson <[email protected]>
CC: Jason Kridner <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
|
|
freq_reg_info expects to get the frequency in kHz. Instead we
accidently pass it in MHz. Thus, currently the function always
return ERR rule. Fix that.
Fixes: 50f32718e125 ("nl80211: Add wmm rule attribute to NL80211_CMD_GET_WIPHY dump command")
Signed-off-by: Haim Dreyfuss <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
[fix kHz/MHz in commit message]
Signed-off-by: Johannes Berg <[email protected]>
|
|
TXOP (also known as Channel Occupancy Time) is u16 and should be
added using nla_put_u16 instead of u8, fix that.
Fixes: 50f32718e125 ("nl80211: Add wmm rule attribute to NL80211_CMD_GET_WIPHY dump command")
Signed-off-by: Haim Dreyfuss <[email protected]>
Signed-off-by: Luca Coelho <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
|
|
User controls @idx which to be used as index of hwsim_world_regdom_custom.
So, It can be exploited via Spectre-like attack. (speculative execution)
This kind of attack leaks address of hwsim_world_regdom_custom,
It leads an attacker to bypass security mechanism such as KASLR.
So sanitize @idx before using it to prevent attack.
I leveraged strategy [1] to find and exploit this gadget.
[1] https://github.com/jinb-park/linux-exploit/tree/master/exploit-remaining-spectre-gadget/
Signed-off-by: Jinbum Park <[email protected]>
[johannes: unwrap URL]
Signed-off-by: Johannes Berg <[email protected]>
|
|
I changed the way mac80211 updates the PM state of the peer.
I forgot that we could also have multicast frames from the
peer and that those frame should of course not change the
PM state of the peer: A peer goes to power save when it
needs to scan, but it won't send the broadcast Probe Request
with the PM bit set.
This made us mark the peer as awake when it wasn't and then
Intel's firmware would fail to transmit because the peer is
asleep according to its database. The driver warned about
this and it looked like this:
WARNING: CPU: 0 PID: 184 at /usr/src/linux-4.16.14/drivers/net/wireless/intel/iwlwifi/mvm/tx.c:1369 iwl_mvm_rx_tx_cmd+0x53b/0x860
CPU: 0 PID: 184 Comm: irq/124-iwlwifi Not tainted 4.16.14 #1
RIP: 0010:iwl_mvm_rx_tx_cmd+0x53b/0x860
Call Trace:
iwl_pcie_rx_handle+0x220/0x880
iwl_pcie_irq_handler+0x6c9/0xa20
? irq_forced_thread_fn+0x60/0x60
? irq_thread_dtor+0x90/0x90
The relevant code that spits the WARNING is:
case TX_STATUS_FAIL_DEST_PS:
/* the FW should have stopped the queue and not
* return this status
*/
WARN_ON(1);
info->flags |= IEEE80211_TX_STAT_TX_FILTERED;
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199967.
Fixes: 9fef65443388 ("mac80211: always update the PM state of a peer on MGMT / DATA frames")
Cc: <[email protected]> #4.16+
Signed-off-by: Emmanuel Grumbach <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
|
|
Make wmm_rule be part of the reg_rule structure. This simplifies the
code a lot at the cost of having bigger memory usage. However in most
cases we have only few reg_rule's and when we do have many like in
iwlwifi we do not save memory as it allocates a separate wmm_rule for
each channel anyway.
This also fixes a bug reported in various places where somewhere the
pointers were corrupted and we ended up doing a null-dereference.
Fixes: 230ebaa189af ("cfg80211: read wmm rules from regulatory database")
Signed-off-by: Stanislaw Gruszka <[email protected]>
[rephrase commit message slightly]
Signed-off-by: Johannes Berg <[email protected]>
|
|
The mac80211_hwsim driver intends to say that it supports up to four
STBC receive streams, but instead it ends up saying something undefined.
The IEEE80211_VHT_CAP_RXSTBC_X macros aren't independent bits that can
be ORed together, but values. In this case, _4 is the appropriate one
to use.
Signed-off-by: Danek Duvall <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
|
|
The mod mask for VHT capabilities intends to say that you can override
the number of STBC receive streams, and it does, but only by accident.
The IEEE80211_VHT_CAP_RXSTBC_X aren't bits to be set, but values (albeit
left-shifted). ORing the bits together gets the right answer, but we
should use the _MASK macro here instead.
Signed-off-by: Danek Duvall <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
|
|
Currently, when a redirect occurs in sockmap and an error occurs in
the redirect call we unwind the scatterlist once in the error path
of bpf_tcp_sendmsg_do_redirect() and then again in sendmsg(). Then
in the error path of sendmsg we decrement the copied count by the
send size.
However, its possible we partially sent data before the error was
generated. This can happen if do_tcp_sendpages() partially sends the
scatterlist before encountering a memory pressure error. If this
happens we need to decrement the copied value (the value tracking
how many bytes were actually sent to TCP stack) by the number of
remaining bytes _not_ the entire send size. Otherwise we risk
confusing userspace.
Also we don't need two calls to free the scatterlist one is
good enough. So remove the one in bpf_tcp_sendmsg_do_redirect() and
then properly reduce copied by the number of remaining bytes which
may in fact be the entire send size if no bytes were sent.
To do this use bool to indicate if free_start_sg() should do mem
accounting or not.
Signed-off-by: John Fastabend <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
|
|
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
When a targetport is removed from the config, fcloop will avoid calling
the LS done() routine thinking the targetport is gone. This leaves the
initiator reset/reconnect hanging as it waits for a status on the
Create_Association LS for the reconnect.
Change the filter in the LS callback path. If tport null (set when
failed validation before "sending to remote port"), be sure to call
done. This was the main bug. But, continue the logic that only calls
done if tport was set but there is no remoteport (e.g. case where
remoteport has been removed, thus host doesn't expect a completion).
Signed-off-by: James Smart <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
In many architectures loads may be reordered with older stores to
different locations. In the nvme driver the following two operations
could be reordered:
- Write shadow doorbell (dbbuf_db) into memory.
- Read EventIdx (dbbuf_ei) from memory.
This can result in a potential race condition between driver and VM host
processing requests (if given virtual NVMe controller has a support for
shadow doorbell). If that occurs, then the NVMe controller may decide to
wait for MMIO doorbell from guest operating system, and guest driver may
decide not to issue MMIO doorbell on any of subsequent commands.
This issue is purely timing-dependent one, so there is no easy way to
reproduce it. Currently the easiest known approach is to run "Oracle IO
Numbers" (orion) that is shipped with Oracle DB:
orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
concat -write 40 -duration 120 -matrix row -testname nvme_test
Where nvme_test is a .lun file that contains a list of NVMe block
devices to run test against. Limiting number of vCPUs assigned to given
VM instance seems to increase chances for this bug to occur. On test
environment with VM that got 4 NVMe drives and 1 vCPU assigned the
virtual NVMe controller hang could be observed within 10-20 minutes.
That correspond to about 400-500k IO operations processed (or about
100GB of IO read/writes).
Orion tool was used as a validation and set to run in a loop for 36
hours (equivalent of pushing 550M IO operations). No issues were
observed. That suggest that the patch fixes the issue.
Fixes: f9f38e33389c ("nvme: improve performance for virtual NVMe devices")
Signed-off-by: Michal Wnukowski <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
[hch: updated changelog and comment a bit]
Signed-off-by: Christoph Hellwig <[email protected]>
|
|
Building the newly introduced BPF_PROG_TYPE_SK_REUSEPORT leads to
a compile time error when building with clang:
net/core/filter.o: In function `sk_reuseport_convert_ctx_access':
../net/core/filter.c:7284: undefined reference to `__compiletime_assert_7284'
It seems that clang has issues resolving hweight_long at compile
time. Since SK_FL_PROTO_MASK is a constant, we can use the interface
for known constant arguments which works fine with clang.
Fixes: 2dbb9b9e6df6 ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Stefan Agner <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
In bpf_tcp_recvmsg() we first took a reference on the psock, however
once we find that there are skbs in the normal socket's receive queue
we return with processing them through tcp_recvmsg(). Problem is that
we leak the taken reference on the psock in that path. Given we don't
really do anything with the psock at this point, move the skb_queue_empty()
test before we fetch the psock to fix this case.
Fixes: 8934ce2fd081 ("bpf: sockmap redirect ingress support")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
bpf_tcp_close() we pop the psock linkage to a map via psock_map_pop().
A parallel update on the sock hash map can happen between psock_map_pop()
and lookup_elem_raw() where we override the element under link->hash /
link->key. In bpf_tcp_close()'s lookup_elem_raw() we subsequently only
test whether an element is present, but we do not test whether the
element is infact the element we were looking for.
We lock the sock in bpf_tcp_close() during that time, so do we hold
the lock in sock_hash_update_elem(). However, the latter locks the
sock which is newly updated, not the one we're purging from the hash
table. This means that while one CPU is doing the lookup from bpf_tcp_close(),
another CPU is doing the map update in parallel, dropped our sock from
the hlist and released the psock.
Subsequently the first CPU will find the new sock and attempts to drop
and release the old sock yet another time. Fix is that we need to check
the elements for a match after lookup, similar as we do in the sock map.
Note that the hash tab elems are freed via RCU, so access to their
link->hash / link->key is fine since we're under RCU read side there.
Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close")
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: John Fastabend <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
|
|
Currently, variable ref_count within the bsg_device struct is of
type atomic_t. For variables being used as reference counters,
the refcount API should be used instead of atomic. The newer
refcount API works to prevent counter overflows and use-after-free
bugs. So, move this varable from the atomic API to refcount,
potentially avoiding the issues mentioned.
Signed-off-by: John Pittman <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
kmem_cache_destroy() can handle NULL pointer correctly, so there is
no need to check e->icq_cache before calling kmem_cache_destroy().
Signed-off-by: Chengguang Xu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
|
|
In IPv4, the newly introduced rdma_read_gids is used to read the SGID/DGID
for the connection which returns GID correctly for RoCE transport as well.
In IPv6, rdma_read_gids is also used. The following are why rdma_read_gids
is introduced.
rdma_addr_get_dgid() for RoCE for client side connections returns MAC
address, instead of DGID.
rdma_addr_get_sgid() for RoCE doesn't return correct SGID for IPv6 and
when more than one IP address is assigned to the netdevice.
So the transport agnostic rdma_read_gids() API is provided by rdma_cm
module.
Signed-off-by: Zhu Yanjun <[email protected]>
Acked-by: Santosh Shilimkar <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Commit 52638f71fcff ("dsa: Move gpio reset into switch driver")
moved the GPIO handling into the switch drivers but forgot
to remove the GPIO header includes.
Signed-off-by: Linus Walleij <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
In function tipc_dest_push, the 32bit variables 'node' and 'port'
are stored separately in uppper and lower part of 64bit 'value'.
Then this value is assigned to dst->value which is a union like:
union
{
struct {
u32 port;
u32 node;
};
u64 value;
}
This works on little-endian machines like x86 but fails on big-endian
machines.
The fix remove the 'value' stack parameter and even the 'value'
member of the union in tipc_dest, assign the 'node' and 'port' member
directly with the input parameter to avoid the endian issue.
Fixes: a80ae5306a73 ("tipc: improve destination linked list")
Signed-off-by: Zhenbo Gao <[email protected]>
Acked-by: Jon Maloy <[email protected]>
Signed-off-by: Haiqing Bai <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Jiri Pirko says:
====================
net: sched: couple of small fixes
Jiri Pirko (2):
net: sched: fix extack error message when chain is failed to be
created
net: sched: return -ENOENT when trying to remove filter from
non-existent chain
====================
Acked-by: Cong Wang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
When chain 0 was implicitly created, removal of non-existent filter from
chain 0 gave -ENOENT. Once chain 0 became non-implicit, the same call is
giving -EINVAL. Fix this by returning -ENOENT in that case.
Reported-by: Roman Mashak <[email protected]>
Fixes: f71e0ca4db18 ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
Instead "Cannot find" say "Cannot create".
Fixes: c35a4acc2985 ("net: sched: cls_api: handle generic cls errors")
Signed-off-by: Jiri Pirko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|
|
After erspan_ver is introudced, if erspan_ver is not set in iproute, its
value will be left 0 by default. Since Commit 02f99df1875c ("erspan: fix
invalid erspan version."), it has broken the traffic due to the version
check in erspan_xmit if users are not aware of 'erspan_ver' param, like
using an old version of iproute.
To fix this compatibility problem, it sets erspan_ver to 1 by default
when adding an erspan dev in erspan_setup. Note that we can't do it in
ipgre_netlink_parms, as this function is also used by ipgre_changelink.
Fixes: 02f99df1875c ("erspan: fix invalid erspan version.")
Reported-by: Jianlin Shi <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
|