The code assumes log_num < in_num everywhere, and that is true as long as
in_num is incremented by descriptor iov count, and log_num by 1. However
this breaks if there's a zero sized descriptor.
As a result, if a malicious guest creates a vring desc with desc.len = 0,
it may cause the host kernel to crash by overflowing the log array. This
bug can be triggered during the VM migration.
There's no need to log when desc.len = 0, so just don't increment log_num
in this case.
Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server") Cc: stable@vger.kernel.org Reviewed-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: ruippan <ruippan@tencent.com> Signed-off-by: yongduan <yongduan@tencent.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Trinity and other fuzzers can hit this WARN on far too easily,
resulting in a tainted kernel that hinders automated fuzzing.
Replace it with a rate-limited printk.
Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Clang warns after commit 8985167ecf57 ("clk: s2mps11: Fix matching when
built as module and DT node contains compatible"):
drivers/clk/clk-s2mps11.c:242:34: warning: variable 's2mps11_dt_match'
is not needed and will not be emitted [-Wunneeded-internal-declaration]
static const struct of_device_id s2mps11_dt_match[] = {
^
1 warning generated.
This warning happens when a variable is used in some construct that
doesn't require a reference to that variable to be emitted in the symbol
table; in this case, it's MODULE_DEVICE_TABLE, which only needs to hold
the data of the variable, not the variable itself.
$ nm -S drivers/clk/clk-s2mps11.o | rg s2mps11_dt_match 00000078000003d4 R __mod_of__s2mps11_dt_match_device_table
Normally, with device ID table variables, it means that the variable
just needs to be tied to the device declaration at the bottom of the
file, like s2mps11_clk_id:
$ nm -S drivers/clk/clk-s2mps11.o | rg s2mps11_clk_id 0000000000000078 R __mod_platform__s2mps11_clk_id_device_table 0000000000000078 r s2mps11_clk_id
However, because the comment above this deliberately doesn't want this
variable added to .of_match_table, we need to mark s2mps11_dt_match as
__used to silence this warning. This makes it clear to Clang that the
variable is used for something, even if a reference to it isn't being
emitted.
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Fixes: 8985167ecf57 ("clk: s2mps11: Fix matching when built as module and DT node contains compatible") Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The basepath may contain special characters, which would confuse the regex
matcher. ${var#prefix} does the right thing.
Link: http://lkml.kernel.org/r/20190518055946.181563-1-drinkcat@chromium.org Fixes: 67a28de47faa8358 ("scripts/decode_stacktrace: only strip base path when a prefix of the path") Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Since vhost_exceeds_weight() was introduced, callers need to specify
the packet weight and byte weight in vhost_dev_init(). Note that, the
packet weight isn't counted in this patch to keep the original behavior
unchanged.
Fixes: e82b9b0727ff ("vhost: introduce vhost_exceeds_weight()") Cc: stable@vger.kernel.org Signed-off-by: Tiwei Bie <tiwei.bie@intel.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit 6a53b7593233 ("xfrm: check id proto in validate_tmpl()")
I introduced a check for xfrm protocol, but according to Herbert
IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so
it should be removed from validate_tmpl().
And, IPSEC_PROTO_ANY is expected to only match 3 IPSec-specific
protocols, this is why xfrm_state_flush() could still miss
IPPROTO_ROUTING, which leads that those entries are left in
net->xfrm.state_all before exit net. Fix this by replacing
IPSEC_PROTO_ANY with zero.
This patch also extracts the check from validate_tmpl() to
xfrm_id_proto_valid() and uses it in parse_ipsecrequest().
With this, no other protocols should be added into xfrm.
Fixes: 6a53b7593233 ("xfrm: check id proto in validate_tmpl()") Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Zubin Mithra <zsm@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The recent change to shuffle the codec initialization procedure for
Realtek via commit 607ca3bd220f ("ALSA: hda/realtek - EAPD turn on
later") caused the silent output on some machines. This change was
supposed to be safe, but it isn't actually; some devices have quirk
setups to override the EAPD via COEF or BTL in the additional verb
table, which is applied at the beginning of snd_hda_gen_init(). And
this EAPD setup is again overridden in alc_auto_init_amp().
For recovering from the regression, tell snd_hda_gen_init() not to
apply the verbs there by a new flag, then apply the verbs in
alc_init().
Since the chained quirks via chained_before flag is applied before the
depth check, it may lead to the endless recursive calls, when the
chain were set up incorrectly. Fix it by moving the depth check at
the beginning of the loop.
Fixes: 1f57825077dc ("ALSA: hda - Add chained_before flag to the fixup entry") Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The devicetree binding lists the phy phy as optional. As such, the
driver should not bail out if it can't find a regulator. Instead it
should just skip the remaining regulator related code and continue
on normally.
Skip the remainder of phy_power_on() if a regulator supply isn't
available. This also gets rid of the bogus return code.
Fixes: 2e12f536635f ("net: stmmac: dwmac-rk: Use standard devicetree property for phy regulator") Signed-off-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After commit baeababb5b85d5c4e6c917efe2a1504179438d3b
("tun: return NET_XMIT_DROP for dropped packets"),
when tun_net_xmit drop packets, it will free skb and return NET_XMIT_DROP,
netpoll_send_skb_on_dev will run into following use after free cases:
1. retry netpoll_start_xmit with freed skb;
2. queue freed skb in npinfo->txq.
queue_process will also run into use after free case.
hit netpoll_send_skb_on_dev first case with following kernel log:
Signed-off-by: Feng Sun <loyou85@gmail.com> Signed-off-by: Xiaojun Zhao <xiaojunzhao141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Chris Wilson reports that it breaks his CPU hotplug test scripts. In
particular, it breaks offlining and then re-onlining the boot CPU, which
we treat specially (and the BIOS does too).
The symptoms are that we can offline the CPU, but it then does not come
back online again:
smpboot: CPU 0 is now offline
smpboot: Booting Node 0 Processor 0 APIC 0x0
smpboot: do_boot_cpu failed(-1) to wakeup CPU#0
Thomas says he knows why it's broken (my personal suspicion: our magic
handling of the "cpu0_logical_apicid" thing), but for 5.3 the right fix
is to just revert it, since we've never touched the LDR bits before, and
it's not worth the risk to do anything else at this stage.
[ Hotpluging of the boot CPU is special anyway, and should be off by
default. See the "BOOTPARAM_HOTPLUG_CPU0" config option and the
cpu0_hotplug kernel parameter.
In general you should not do it, and it has various known limitations
(hibernate and suspend require the boot CPU, for example).
But it should work, even if the boot CPU is special and needs careful
treatment - Linus ]
To fix this use BCM2835_AUX_SPI_STAT_RX_LVL to determine when we may
read data from the fifo reliably without any corruption.
Surprisingly the only values ever empirically read in
BCM2835_AUX_SPI_STAT_RX_LVL are 0x00, 0x10, 0x20 and 0x30.
So whenever the mask is not 0 we can read from the fifo in a safe manner.
The patch has now been tested intensively and we are no longer
able to reproduce the "RX" issue any longer.
Fixes: 1ea29b39f4c812ec ("spi: bcm2835aux: add bcm2835 auxiliary spi device...") Reported-by: Hubert Denkmair <h.denkmair@intence.de> Signed-off-by: Martin Sperl <kernel@martin.sperl.org> Acked-by: Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This read of the fifo is a potential candidate for a race condition
as the spi transfer is not necessarily finished and so can lead to
an early read of the fifo that still misses data.
So it has been removed.
Fixes: 1ea29b39f4c812ec ("spi: bcm2835aux: add bcm2835 auxiliary spi device...") Suggested-by: Hubert Denkmair <h.denkmair@intence.de> Signed-off-by: Martin Sperl <kernel@martin.sperl.org> Acked-by: Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Sharing more code between polling and interrupt-driven mode.
Signed-off-by: Martin Sperl <kernel@martin.sperl.org> Acked-by: Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The BCM2835 AUX SPI has a shared interrupt line (with AUX UART).
Downstream fixes this with an AUX irqchip to demux the IRQ sources and a
DT change which breaks compatibility with older kernels. The AUX irqchip
was already rejected for upstream[1] and the DT change would break
working systems if the DTB is updated to a newer one. The latter issue
was brought to my attention by Alex Graf.
The root cause however is a bug in the shared handler. Shared handlers
must check that interrupts are actually enabled before servicing the
interrupt. Add a check that the TXEMPTY or IDLE interrupts are enabled.
Cc: Alexander Graf <agraf@suse.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Eric Anholt <eric@anholt.net> Cc: Stefan Wahren <stefan.wahren@i2se.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Ray Jui <rjui@broadcom.com> Cc: Scott Branden <sbranden@broadcom.com> Cc: bcm-kernel-feedback-list@broadcom.com Cc: linux-spi@vger.kernel.org Cc: linux-rpi-kernel@lists.infradead.org Cc: linux-arm-kernel@lists.infradead.org Signed-off-by: Rob Herring <robh@kernel.org> Reviewed-by: Eric Anholt <eric@anholt.net> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
If after an MMIO exit to userspace a VCPU is immediately run with an
immediate_exit request, such as when a signal is delivered or an MMIO
emulation completion is needed, then the VCPU completes the MMIO
emulation and immediately returns to userspace. As the exit_reason
does not get changed from KVM_EXIT_MMIO in these cases we have to
be careful not to complete the MMIO emulation again, when the VCPU is
eventually run again, because the emulation does an instruction skip
(and doing too many skips would be a waste of guest code :-) We need
to use additional VCPU state to track if the emulation is complete.
As luck would have it, we already have 'mmio_needed', which even
appears to be used in this way by other architectures already.
Fixes: 0d640732dbeb ("arm64: KVM: Skip MMIO insn after emulation") Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
i_xattrs.prealloc_blob buffer while holding the i_ceph_lock. This can be
fixed by postponing the call until later, when the lock is released.
The following backtrace was triggered by fstests generic/117.
BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
3 locks held by fsstress/650:
#0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
#1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0
#2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x67/0x90
___might_sleep.cold+0x9f/0xb1
vfree+0x4b/0x60
ceph_buffer_release+0x1b/0x60
__ceph_setxattr+0x2b4/0x810
__vfs_setxattr+0x66/0x80
__vfs_setxattr_noperm+0x59/0xf0
vfs_setxattr+0x81/0xa0
setxattr+0x115/0x230
? filename_lookup+0xc9/0x140
? rcu_read_lock_sched_held+0x74/0x80
? rcu_sync_lockdep_assert+0x2e/0x60
? __sb_start_write+0x142/0x1a0
? mnt_want_write+0x20/0x50
path_setxattr+0xba/0xd0
__x64_sys_lsetxattr+0x24/0x30
do_syscall_64+0x50/0x1c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7ff23514359a
Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
In mlx4_ib_alloc_pv_bufs(), 'tun_qp->tx_ring' is allocated through
kcalloc(). However, it is not always deallocated in the following execution
if an error occurs, leading to memory leaks. To fix this issue, free
'tun_qp->tx_ring' whenever an error occurs.
hv_kvp_daemon.c: In function ‘kvp_get_ip_info.constprop’:
hv_kvp_daemon.c:812:30: warning: ‘ip_buffer’ may be used uninitialized in this function [-Wmaybe-uninitialized]
struct hv_kvp_ipaddr_value *ip_buffer;
this seems to be a false positive: we only use ip_buffer when
op == KVP_OP_GET_IP_INFO and it is only unset when op == KVP_OP_ENUMERATE.
Silence the warning by initializing ip_buffer to NULL.
When a Tx timestamp is requested, a pointer to the skb is stored in the
ravb_tstamp_skb struct. This was done without an skb_get. There exists
the possibility that the skb could be freed by ravb_tx_free (when
ravb_tx_free is called from ravb_start_xmit) before the timestamp was
processed, leading to a use-after-free bug.
Use skb_get when filling a ravb_tstamp_skb struct, and add appropriate
frees/consumes when a ravb_tstamp_skb struct is freed.
Fixes: c156633f1353 ("Renesas Ethernet AVB driver proper") Signed-off-by: Tho Vu <tho.vu.wh@rvc.renesas.com> Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com> Signed-off-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In i2400m_barker_db_init(), 'options_orig' is allocated through kstrdup()
to hold the original command line options. Then, the options are parsed.
However, if an error occurs during the parsing process, 'options_orig' is
not deallocated, leading to a memory leak bug. To fix this issue, free
'options_orig' before returning the error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In kalmia_init_and_get_ethernet_addr(), 'usb_buf' is allocated through
kmalloc(). In the following execution, if the 'status' returned by
kalmia_send_init_packet() is not 0, 'usb_buf' is not deallocated, leading
to memory leaks. To fix this issue, add the 'out' label to free 'usb_buf'.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In cx82310_bind(), 'dev->partial_data' is allocated through kmalloc().
Then, the execution waits for the firmware to become ready. If the firmware
is not ready in time, the execution is terminated. However, the allocated
'dev->partial_data' is not deallocated on this path, leading to a memory
leak bug. To fix this issue, free 'dev->partial_data' before returning the
error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In myri10ge_probe(), myri10ge_alloc_slices() is invoked to allocate slices
related structures. Later on, myri10ge_request_irq() is used to get an irq.
However, if this process fails, the allocated slices related structures are
not deallocated, leading to memory leaks. To fix this issue, revise the
target label of the goto statement to 'abort_with_slices'.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In blocked_fl_write(), 't' is not deallocated if bitmap_parse_user() fails,
leading to a memory leak bug. To fix this issue, free t before returning
the error.
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
In file included from drivers/hwmon/pmbus/ucd9000.c:19:0:
./include/linux/gpio/driver.h:576:1: error: redefinition of gpiochip_add_pin_range
gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
^~~~~~~~~~~~~~~~~~~~~~
In file included from drivers/hwmon/pmbus/ucd9000.c:18:0:
./include/linux/gpio.h:245:1: note: previous definition of gpiochip_add_pin_range was here
gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
^~~~~~~~~~~~~~~~~~~~~~
The ibm,mac-address-filters property defines the maximum number of
addresses the hypervisor's multicast filter list can support. It is
encoded as a big-endian integer in the OF device tree, but the virtual
ethernet driver does not convert it for use by little-endian systems.
As a result, the driver is not behaving as it should on affected systems
when a large number of multicast addresses are assigned to the device.
Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
It seems the controller needs a short time after downloading the
firmware before it is ready for the NVM. A delay as short as 1 ms
seems sufficient, make it 10 ms just in case. No event is received
during the delay, hence we don't just silently drop an extra event.
drivers/net/ethernet/toshiba/tc35815.c:1507:30: warning: use of logical
'&&' with constant operand [-Wconstant-logical-operand]
if (!HAVE_DMA_RXALIGN(lp) && NET_IP_ALIGN)
^ ~~~~~~~~~~~~
drivers/net/ethernet/toshiba/tc35815.c:1507:30: note: use '&' for a
bitwise operation
if (!HAVE_DMA_RXALIGN(lp) && NET_IP_ALIGN)
^~
&
drivers/net/ethernet/toshiba/tc35815.c:1507:30: note: remove constant to
silence this warning
if (!HAVE_DMA_RXALIGN(lp) && NET_IP_ALIGN)
~^~~~~~~~~~~~~~~
1 warning generated.
Explicitly check that NET_IP_ALIGN is not zero, which matches how this
is checked in other parts of the tree. Because NET_IP_ALIGN is a build
time constant, this check will be constant folded away during
optimization.
Fixes: 82a9928db560 ("tc35815: Enable StripCRC feature") Link: https://github.com/ClangBuiltLinux/linux/issues/608 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
As spin_unlock_irq will enable interrupts.
Function tsi108_stat_carry is called from interrupt handler tsi108_irq.
Interrupts are enabled in interrupt handler.
Use spin_lock_irqsave/spin_unlock_irqrestore instead of spin_(un)lock_irq
in IRQ context to avoid this.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
I incorrectly merged commit 31a2fbb390fe ("x86/ptrace: Fix possible
spectre-v1 in ptrace_get_debugreg()") when backporting it, as was
graciously pointed out at
https://grsecurity.net/teardown_of_a_failed_linux_lts_spectre_fix.php
Resolve the upstream difference with the stable kernel merge to properly
protect things.
If TDLS station addition is rejected, the sta memory is leaked.
Avoid this by moving the check before the allocation.
Cc: stable@vger.kernel.org Fixes: 7ed5285396c2 ("mac80211: don't initiate TDLS connection if station is not associated to AP") Link: https://lore.kernel.org/r/20190801073033.7892-1-johannes@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 96cce12ff6e0 ("cfg80211: fix processing world
regdomain when non modular").
Re-triggering a reg_process_hint with the last request on all events,
can make the regulatory domain fail in case of multiple WiFi modules. On
slower boards (espacially with mdev), enumeration of the WiFi modules
can end up in an intersected regulatory domain, and user cannot set it
with 'iw reg set' anymore.
This is happening, because:
- 1st module enumerates, queues up a regulatory request
- request gets processed by __reg_process_hint_driver():
- checks if previous was set by CORE -> yes
- checks if regulator domain changed -> yes, from '00' to e.g. 'US'
-> sends request to the 'crda'
- 2nd module enumerates, queues up a regulator request (which triggers
the reg_todo() work)
- reg_todo() -> reg_process_pending_hints() sees, that the last request
is not processed yet, so it tries to process it again.
__reg_process_hint driver() will run again, and:
- checks if the last request's initiator was the core -> no, it was
the driver (1st WiFi module)
- checks, if the previous initiator was the driver -> yes
- checks if the regulator domain changed -> yes, it was '00' (set by
core, and crda call did not return yet), and should be changed to 'US'
------> __reg_process_hint_driver calls an intersect
Besides, the reg_process_hint call with the last request is meaningless
since the crda call has a timeout work. If that timeout expires, the
first module's request will lost.
Cc: stable@vger.kernel.org Fixes: 96cce12ff6e0 ("cfg80211: fix processing world regdomain when non modular") Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com> Link: https://lore.kernel.org/r/20190614131600.GA13897@a1-hr Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Francois reported that VMware balloon gets stuck after a balloon reset,
when the VMCI doorbell is removed. A similar error can occur when the
balloon driver is removed with the following splat:
The cause for the bug is that when the "delayed" doorbell is invoked, it
takes a reference on the doorbell entry and schedules work that is
supposed to run the appropriate code and drop the doorbell entry
reference. The code ignores the fact that if the work is already queued,
it will not be scheduled to run one more time. As a result one of the
references would not be dropped. When the code waits for the reference
to get to zero, during balloon reset or module removal, it gets stuck.
Fix it. Drop the reference if schedule_work() indicates that the work is
already queued.
Note that this bug got more apparent (or apparent at all) due to
commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").
In the error path of stm_source_register_device(), the kfree is
unnecessary, as the put_device() before it ends up calling
stm_source_device_release() to free stm_source_device, leading to
a double free at the outer kfree() call. Remove it.
The OCR register defines the supported range of VDD voltages for SD cards.
However, it has turned out that some SD cards reports an invalid voltage
range, for example having bit7 set.
When a host supports MMC_CAP2_FULL_PWR_CYCLE and some of the voltages from
the invalid VDD range, this triggers the core to run a power cycle of the
card to try to initialize it at the lowest common supported voltage.
Obviously this fails, since the card can't support it.
Let's fix this problem, by clearing invalid bits from the read OCR register
for SD cards, before proceeding with the VDD voltage negotiation.
Cc: stable@vger.kernel.org Reported-by: Philip Langdale <philipl@overt.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Philip Langdale <philipl@overt.org> Tested-by: Philip Langdale <philipl@overt.org> Tested-by: Manuel Presnitz <mail@mpy.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
HS200 is not implemented in the driver, but the controller claims it
through caps. Remove it via a quirk, to make sure the mmc core do not try
to enable HS200, as it causes the eMMC initialization to fail.
Signed-off-by: Eugen Hristev <eugen.hristev@microchip.com> Acked-by: Ludovic Desroches <ludovic.desroches@microchip.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Fixes: bb5f8ea4d514 ("mmc: sdhci-of-at91: introduce driver for the Atmel SDMMC") Cc: stable@vger.kernel.org # v4.4+ Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
32-bit processes running on a 64-bit kernel are not always detected
correctly, causing the process to crash when uretprobes are installed.
The reason for the crash is that in_ia32_syscall() is used to determine the
process's mode, which only works correctly when called from a syscall.
In the case of uretprobes, however, the function is called from a exception
and always returns 'false' on a 64-bit kernel. In consequence this leads to
corruption of the process's return address.
Fix this by using user_64bit_mode() instead of in_ia32_syscall(), which
is correct in any situation.
[ tglx: Add a comment and the following historical info ]
This should have been detected by the rename which happened in commit
abfb9498ee13 ("x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()")
which states in the changelog:
The is_ia32_task()/is_x32_task() function names are a big misnomer: they
suggests that the compat-ness of a system call is a task property, which
is not true, the compatness of a system call purely depends on how it
was invoked through the system call layer.
.....
and then it went and blindly renamed every call site.
Sadly enough this was already mentioned here:
8faaed1b9f50 ("uprobes/x86: Introduce sizeof_long(), cleanup adjust_ret_addr() and
arch_uretprobe_hijack_return_addr()")
where the changelog says:
TODO: is_ia32_task() is not what we actually want, TS_COMPAT does
not necessarily mean 32bit. Fortunately syscall-like insns can't be
probed so it actually works, but it would be better to rename and
use is_ia32_frame().
and goes all the way back to:
0326f5a94dde ("uprobes/core: Handle breakpoint and singlestep exceptions")
Oh well. 7+ years until someone actually tried a uretprobe on a 32bit
process on a 64bit kernel....
In its current form, user_64bit_mode() can only be used when CONFIG_X86_64
is selected. This implies that code built with CONFIG_X86_64=n cannot use
it. If a piece of code needs to be built for both CONFIG_X86_64=y and
CONFIG_X86_64=n and wants to use this function, it needs to wrap it in
an #ifdef/#endif; potentially, in multiple places.
This can be easily avoided with a single #ifdef/#endif pair within
user_64bit_mode() itself.
Suggested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: ricardo.neri@intel.com Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Qiaowei Ren <qiaowei.ren@intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Colin Ian King <colin.king@canonical.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Adam Buchbinder <adam.buchbinder@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Garnier <thgarnie@google.com> Link: https://lkml.kernel.org/r/1509135945-13762-4-git-send-email-ricardo.neri-calderon@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
The option named "auto_delink_en" is a bit misleading, as setting it to
false doesn't really disable auto-delink but let auto-delink be firmware
controlled.
Update the description to reflect the real usage of this parameter.
This patch fixes an issue that the following error is
possible to happen when ohci hardware causes an interruption
and the system is shutting down at the same time.
ohci_shutdown() disables all the interrupt and rh_state is set to
OHCI_RH_HALTED. In other hand, ohci_irq() is possible to enable
OHCI_INTR_SF and OHCI_INTR_MIE on ohci_irq(). Note that OHCI_INTR_SF
is possible to be set by start_ed_unlink() which is called:
ohci_irq()
-> process_done_list()
-> takeback_td()
-> start_ed_unlink()
So, ohci_irq() has the following condition, the issue happens by
&ohci->regs->intrenable = OHCI_INTR_MIE | OHCI_INTR_SF and
ohci->rh_state = OHCI_RH_HALTED:
/* interrupt for some other device? */
if (ints == 0 || unlikely(ohci->rh_state == OHCI_RH_HALTED))
return IRQ_NOTMINE;
To fix the issue, ohci_shutdown() holds the spin lock while disabling
the interruption and changing the rh_state flag to prevent reenable
the OHCI_INTR_MIE unexpectedly. Note that io_watchdog_func() also
calls the ohci_shutdown() and it already held the spin lock, so that
the patch makes a new function as _ohci_shutdown().
This patch is inspired by a Renesas R-Car Gen3 BSP patch
from Tho Vu.
In case of a disconnect an ongoing flush() has to be made fail.
Nevertheless we cannot be sure that any pending URB has already
finished, so although they will never succeed, they still must
not be touched.
The clean solution for this is to check for WDM_IN_USE
and WDM_DISCONNECTED in flush(). There is no point in ever
clearing WDM_IN_USE, as no further writes make sense.
Although APIC initialization will typically clear out the LDR before
setting it, the APIC cleanup code should reset the LDR.
This was discovered with a 32-bit KVM guest jumping into a kdump
kernel. The stale bits in the LDR triggered a bug in the KVM APIC
implementation which caused the destination mapping for VCPUs to be
corrupted.
Note that this isn't intended to paper over the KVM APIC bug. The kernel
has to clear the LDR when resetting the APIC registers except when X2APIC
is enabled.
This lacks a Fixes tag because missing to clear LDR goes way back into pre
git history.
[ tglx: Made x2apic_enabled a function call as required ]
Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190826101513.5080-3-bsd@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Legacy apic init uses bigsmp for smp systems with 8 and more CPUs. The
bigsmp APIC implementation uses physical destination mode, but it
nevertheless initializes LDR and DFR. The LDR even ends up incorrectly with
multiple bit being set.
This does not cause a functional problem because LDR and DFR are ignored
when physical destination mode is active, but it triggered a problem on a
32-bit KVM guest which jumps into a kdump kernel.
The multiple bits set unearthed a bug in the KVM APIC implementation. The
code which creates the logical destination map for VCPUs ignores the
disabled state of the APIC and ends up overwriting an existing valid entry
and as a result, APIC calibration hangs in the guest during kdump
initialization.
Remove the bogus LDR/DFR initialization.
This is not intended to work around the KVM APIC bug. The LDR/DFR
ininitalization is wrong on its own.
The issue goes back into the pre git history. The fixes tag is the commit
in the bitkeeper import which introduced bigsmp support in 2003.
Don't advance RIP or inject a single-step #DB if emulation signals a
fault. This logic applies to all state updates that are conditional on
clean retirement of the emulation instruction, e.g. updating RFLAGS was
previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update
EFLAGS on faulting emulation").
Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with
ctxt->_eip until emulation "retires" anyways. Skipping #DB injection
fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to
invalid state with EFLAGS.TF=1 would loop indefinitely due to emulation
overwriting the #UD with #DB and thus restarting the bad SYSCALL over
and over.
Cc: Nadav Amit <nadav.amit@gmail.com> Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@kernel.org> Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The input pool of a client might be deleted via the resize ioctl, the
the access to it should be covered by the proper locks. Currently the
only missing place is the call in snd_seq_ioctl_get_client_pool(), and
this patch papers over it.
As Jason Baron explained in commit 790ba4566c1a ("tcp: set SOCK_NOSPACE
under memory pressure"), it is crucial we properly set SOCK_NOSPACE
when needed.
However, Jason patch had a bug, because the 'nonblocking' status
as far as sk_stream_wait_memory() is concerned is governed
by MSG_DONTWAIT flag passed at sendmsg() time :
long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(),
and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE
cleared, if sk->sk_sndtimeo has been set to a small (but not zero)
value.
This patch removes the 'noblock' variable since we must always
set SOCK_NOSPACE if -EAGAIN is returned.
It also renames the do_nonblock label since we might reach this
code path even if we were in blocking mode.
Fixes: 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Reported-by: Vladimir Rutsky <rutsky@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The `uac_mixer_unit_descriptor` shown as below is read from the
device side. In `parse_audio_mixer_unit`, `baSourceID` field is
accessed from index 0 to `bNrInPins` - 1, the current implementation
assumes that descriptor is always valid (the length of descriptor
is no shorter than 5 + `bNrInPins`). If a descriptor read from
the device side is invalid, it may trigger out-of-bound memory
access.
`check_input_term` recursively calls itself with input from
device side (e.g., uac_input_terminal_descriptor.bCSourceID)
as argument (id). In `check_input_term`, if `check_input_term`
is called with the same `id` argument as the caller, it triggers
endless recursive call, resulting kernel space stack overflow.
This patch fixes the bug by adding a bitmap to `struct mixer_build`
to keep track of the checked ids and stop the execution if some id
has been checked (similar to how parse_audio_unit handles unitid
argument).
tcp_rtx_queue_tail() (called by tcp_fragment()) can call
tcp_write_queue_prev() on the first packet in the queue, which will trigger
the BUG in tcp_write_queue_prev(), because there is no previous packet.
This happens when the retransmit queue is empty, for example in case of a
zero window.
Commit 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") was not a
simple cherry-pick of the original one from master (b617158dc096)
because there is a specific TCP rtx queue only since v4.15. For more
details, please see the commit message of b617158dc096 ("tcp: be more
careful in tcp_fragment()").
The BUG() is hit due to the specific code added to versions older than
v4.15. The comment in skb_queue_prev() (include/linux/skbuff.h:1406),
just before the BUG_ON() somehow suggests to add a check before using
it, what Tim did.
In master, this code path causing the issue will not be taken because
the implementation of tcp_rtx_queue_tail() is different:
Fixes: 8c3088f895a0 ("tcp: be more careful in tcp_fragment()") Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The commit 5e6acc3e678e ("bcm2835-pm: Move bcm2835-watchdog's DT probe
to an MFD.") broke module autoloading on Raspberry Pi. So add a
module alias this fix this.
clang-9 points out that there are two variables that depending on the
configuration may only be used in an ARRAY_SIZE() expression but not
referenced:
drivers/dma/ste_dma40.c:145:12: error: variable 'd40_backup_regs' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
static u32 d40_backup_regs[] = {
^
drivers/dma/ste_dma40.c:214:12: error: variable 'd40_backup_regs_chan' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
static u32 d40_backup_regs_chan[] = {
Voltage regulators are optional, so functions must check they exist
before dereferencing.
Note this issue is hidden if CONFIG_REGULATORS is not set, because the
offending code is optimised away.
Notes for stable:
The issue first appears in commit 57d104c153d3 ("ufs: add UFS power
management support") but is inadvertently fixed in commit 60f0187031c0
("scsi: ufs: disable vccq if it's not needed by UFS device") which in
turn was reverted by commit 730679817d83 ("Revert "scsi: ufs: disable vccq
if it's not needed by UFS device""). So fix applies v3.18 to v4.5 and
v5.1+
Fixes: 57d104c153d3 ("ufs: add UFS power management support") Fixes: 730679817d83 ("Revert "scsi: ufs: disable vccq if it's not needed by UFS device"") Cc: stable@vger.kernel.org Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
There have been reports of RDRAND issues after resuming from suspend on
some AMD family 15h and family 16h systems. This issue stems from a BIOS
not performing the proper steps during resume to ensure RDRAND continues
to function properly.
RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
support using CPUID, including the kernel, will believe that RDRAND is
not supported.
Update the CPU initialization to clear the RDRAND CPUID bit for any family
15h and 16h processor that supports RDRAND. If it is known that the family
15h or family 16h system does not have an RDRAND resume issue or that the
system will not be placed in suspend, the "rdrand=force" kernel parameter
can be used to stop the clearing of the RDRAND CPUID bit.
Additionally, update the suspend and resume path to save and restore the
MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
place after resuming from suspend.
Note, that clearing the RDRAND CPUID bit does not prevent a processor
that normally supports the RDRAND instruction from executing it. So any
code that determined the support based on family and model won't #UD.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chen Yu <yu.c.chen@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org> Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
[sl: adjust context in docs] Signed-off-by: Sasha Levin <sashal@kernel.org>
A bug was reported that on certain Broadwell platforms, after
resuming from S3, the CPU is running at an anomalously low
speed.
It turns out that the BIOS has modified the value of the
THERM_CONTROL register during S3, and changed it from 0 to 0x10,
thus enabled clock modulation(bit4), but with undefined CPU Duty
Cycle(bit1:3) - which causes the problem.
Here is a simple scenario to reproduce the issue:
1. Boot up the system
2. Get MSR 0x19a, it should be 0
3. Put the system into sleep, then wake it up
4. Get MSR 0x19a, it shows 0x10, while it should be 0
Although some BIOSen want to change the CPU Duty Cycle during
S3, in our case we don't want the BIOS to do any modification.
Fix this issue by introducing a more generic x86 framework to
save/restore specified MSR registers(THERM_CONTROL in this case)
for suspend/resume. This allows us to fix similar bugs in a much
simpler way in the future.
When the kernel wants to protect certain MSRs during suspending,
we simply add a quirk entry in msr_save_dmi_table, and customize
the MSR registers inside the quirk callback, for example:
and the quirk mechanism ensures that, once resumed from suspend,
the MSRs indicated by these IDs will be restored to their
original, pre-suspend values.
Since both 64-bit and 32-bit kernels are affected, this patch
covers the common 64/32-bit suspend/resume code path. And
because the MSRs specified by the user might not be available or
readable in any situation, we use rdmsrl_safe() to safely save
these MSRs.
Reported-and-tested-by: Marcin Kaszewski <marcin.kaszewski@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Cc: len.brown@intel.com Cc: linux@horizon.com Cc: luto@kernel.org Cc: rjw@rjwysocki.net Link: http://lkml.kernel.org/r/c9abdcbc173dd2f57e8990e304376f19287e92ba.1448382971.git.yu.c.chen@intel.com
[ More edits to the naming of data structures. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Change ct id hash calculation to only use invariants.
Currently the ct id hash calculation is based on some fields that can
change in the lifetime on a conntrack entry in some corner cases. The
current hash uses the whole tuple which contains an hlist pointer which
will change when the conntrack is placed on the dying list resulting in
a ct id change.
This patch also removes the reply-side tuple and extension pointer from
the hash calculation so that the ct id will will not change from
initialization until confirmation.
Fixes: 3c79107631db1f7 ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id") Signed-off-by: Dirk Morris <dmorris@metaloft.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
else, we leak the addresses to userspace via ctnetlink events
and dumps.
Compute an ID on demand based on the immutable parts of nf_conn struct.
Another advantage compared to using an address is that there is no
immediate re-use of the same ID in case the conntrack entry is freed and
reallocated again immediately.
According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.
Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.
It is time to switch to siphash and its 128bit keys.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
HalfSipHash, or hsiphash, is a shortened version of SipHash, which
generates 32-bit outputs using a weaker 64-bit key. It has *much* lower
security margins, and shouldn't be used for anything too sensitive, but
it could be used as a hashtable key function replacement, if the output
is never exposed, and if the security requirement is not too high.
The goal is to make this something that performance-critical jhash users
would be willing to use.
On 64-bit machines, HalfSipHash1-3 is slower than SipHash1-3, so we alias
SipHash1-3 to HalfSipHash1-3 on those systems.
hsiphash() is roughly 3 times slower than jhash(), but comes with a
considerable security improvement.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4 to avoid regression for WireGuard with only half
the siphash API present] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function, or as a
general PRF for short input use cases, such as sequence numbers or RNG
chaining.
For the first usage:
There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then preceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector. Currently
hashtables use jhash, which is fast but not secure, and some kind of
rotating key scheme (or none at all, which isn't good). SipHash is meant
as a replacement for jhash in these cases.
There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.
While SipHash is extremely fast for a cryptographically secure function,
it is likely a bit slower than the insecure jhash, and so replacements
will be evaluated on a case-by-case basis based on whether or not the
difference in speed is negligible and whether or not the current jhash usage
poses a real security risk.
For the second usage:
A few places in the kernel are using MD5 or SHA1 for creating secure
sequence numbers, syn cookies, port numbers, or fast random numbers.
SipHash is a faster and more fitting, and more secure replacement for MD5
in those situations. Replacing MD5 and SHA1 with SipHash for these uses is
obvious and straight-forward, and so is submitted along with this patch
series. There shouldn't be much of a debate over its efficacy.
Dozens of languages are already using this internally for their hash
tables and PRFs. Some of the BSDs already use this in their kernels.
SipHash is a widely known high-speed solution to a widely known set of
problems, and it's time we catch-up.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: David Laight <David.Laight@aculab.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4 as dependency of commits df453700e8d8 "inet: switch
IP ID generator to siphash" and 3c79107631db "netfilter: ctnetlink: don't
use conntrack/expect object addresses as id":
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
This patch will check the weight and exit the loop if we exceeds the
weight. This is useful for preventing scsi kthread from hogging cpu
which is guest triggerable.
This addresses CVE-2019-3900.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Fixes: 057cbf49a1f0 ("tcm_vhost: Initial merge for vhost level target fabric driver") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[bwh: Backported to 4.4:
- Drop changes in vhost_scsi_ctl_handle_vq()
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
When the rx buffer is too small for a packet, we will discard the vq
descriptor and retry it for the next packet:
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
&busyloop_intr))) {
...
/* On overrun, truncate and discard */
if (unlikely(headcount > UIO_MAXIOV)) {
iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
err = sock->ops->recvmsg(sock, &msg,
1, MSG_DONTWAIT | MSG_TRUNC);
pr_debug("Discarded rx packet: len %zd\n", sock_len);
continue;
}
...
}
This makes it possible to trigger a infinite while..continue loop
through the co-opreation of two VMs like:
1) Malicious VM1 allocate 1 byte rx buffer and try to slow down the
vhost process as much as possible e.g using indirect descriptors or
other.
2) Malicious VM2 generate packets to VM1 as fast as possible
Fixing this by checking against weight at the end of RX and TX
loop. This also eliminate other similar cases when:
- userspace is consuming the packets in the meanwhile
- theoretical TOCTOU attack if guest moving avail index back and forth
to hit the continue after vhost find guest just add new buffers
This addresses CVE-2019-3900.
Fixes: d8316f3991d20 ("vhost: fix total length when packets are too short") Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[bwh: Backported to 4.4:
- Both Tx modes are handled in one loop in handle_tx()
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
We used to have vhost_exceeds_weight() for vhost-net to:
- prevent vhost kthread from hogging the cpu
- balance the time spent between TX and RX
This function could be useful for vsock and scsi as well. So move it
to vhost.c. Device must specify a weight which counts the number of
requests, or it can also specific a byte_weight which counts the
number of bytes that has been processed.
Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[bwh: Backported to 4.4:
- Drop changes to vhost_vsock
- In vhost_net, both Tx modes are handled in one loop in handle_tx()
- Adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Similar to commit a2ac99905f1e ("vhost-net: set packet weight of
tx polling to 2 * vq size"), we need a packet-based limit for
handler_rx, too - elsewhere, under rx flood with small packets,
tx can be delayed for a very long time, even without busypolling.
The pkt limit applied to handle_rx must be the same applied by
handle_tx, or we will get unfair scheduling between rx and tx.
Tying such limit to the queue length makes it less effective for
large queue length values and can introduce large process
scheduler latencies, so a constant valued is used - likewise
the existing bytes limit.
The selected limit has been validated with PVP[1] performance
test with different queue sizes:
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
polling udp packets with small length(e.g. 1byte udp payload), because setting
VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
Ping-Latencies shown below were tested between two Virtual Machines using
netperf (UDP_STREAM, len=1), and then another machine pinged the client:
Seems pretty consistent, a small dip at 2 VQ sizes.
Ring size is a hint from device about a burst size it can tolerate. Based on
benchmarks, set the weight to 2 * vq size.
To evaluate this change, another tests were done using netperf(RR, TX) between
two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
tweaked through qemu. Results shown below does not show obvious changes.
Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com> Signed-off-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Before this patch, function read_rindex_entry would set a rgrp
glock's gl_object pointer to itself before inserting the rgrp into
the rgrp rbtree. The problem is: if another process was also reading
the rgrp in, and had already inserted its newly created rgrp, then
the second call to read_rindex_entry would overwrite that value,
then return a bad return code to the caller. Later, other functions
would reference the now-freed rgrp memory by way of gl_object.
In some cases, that could result in gfs2_rgrp_brelse being called
twice for the same rgrp: once for the failed attempt and once for
the "real" rgrp release. Eventually the kernel would panic.
There are also a number of other things that could go wrong when
a kernel module is accessing freed storage. For example, this could
result in rgrp corruption because the fake rgrp would point to a
fake bitmap in memory too, causing gfs2_inplace_reserve to search
some random memory for free blocks, and find some, since we were
never setting rgd->rd_bits to NULL before freeing it.
This patch fixes the problem by not setting gl_object until we
have successfully inserted the rgrp into the rbtree. Also, it sets
rd_bits to NULL as it frees them, which will ensure any accidental
access to the wrong rgrp will result in a kernel panic rather than
file system corruption, which is preferred.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
[bwh: Backported to 4.4: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt
context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid
this problem - and other problems of sharing a spinlock with an
interrupt.
Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: cgroups@vger.kernel.org Cc: stable@vger.kernel.org # 4.5+ Cc: linux-kernel@vger.kernel.org Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zubin Mithra <zsm@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the sector number is too high, dm_table_find_target() should return a
pointer to a zeroed dm_target structure (the caller should test it with
dm_target_is_valid).
However, for some table sizes, the code in dm_table_find_target() that
performs btree lookup will access out of bound memory structures.
Fix this bug by testing the sector number at the beginning of
dm_table_find_target(). Also, add an "inline" keyword to the function
dm_table_get_size() because this is a hot path.
In commit 6096d91af0b6 ("dm space map metadata: fix occasional leak
of a metadata block on resize"), we refactor the commit logic to a new
function 'apply_bops'. But when that logic was replaced in out() the
return value was not stored. This may lead out() returning a wrong
value to the caller.
Fixes: 6096d91af0b6 ("dm space map metadata: fix occasional leak of a metadata block on resize") Cc: stable@vger.kernel.org Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When btree_split_beneath() splits a node to two new children, it will
allocate two blocks: left and right. If right block's allocation
failed, the left block will be unlocked and marked dirty. If this
happened, the left block'ss content is zero, because it wasn't
initialized with the btree struct before the attempot to allocate the
right block. Upon return, when flushing the left block to disk, the
validator will fail when check this block. Then a BUG_ON is raised.
Fix this by completely initializing the left block before allocating and
initializing the right block.
Fixes: 4dcb8b57df359 ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path") Cc: stable@vger.kernel.org Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit a90118c445cc ("x86/boot: Save fields explicitly, zero out everything
else") had two errors:
* It preserved boot_params.acpi_rsdp_addr, and
* It failed to preserve boot_params.hdr
Therefore, zero out acpi_rsdp_addr, and preserve hdr.
Fixes: a90118c445cc ("x86/boot: Save fields explicitly, zero out everything else") Reported-by: Neil MacLeod <neil@nmacleod.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neil MacLeod <neil@nmacleod.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190821192513.20126-1-jhubbard@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recent gcc compilers (gcc 9.1) generate warnings about an out of bounds
memset, if the memset goes accross several fields of a struct. This
generated a couple of warnings on x86_64 builds in sanitize_boot_params().
Fix this by explicitly saving the fields in struct boot_params
that are intended to be preserved, and zeroing all the rest.
[ tglx: Tagged for stable as it breaks the warning free build there as well ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190731054627.5627-2-jhubbard@nvidia.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some newer machines do not advertise legacy timers. The kernel can handle
that situation if the TSC and the CPU frequency are enumerated by CPUID or
MSRs and the CPU supports TSC deadline timer. If the CPU does not support
TSC deadline timer the local APIC timer frequency has to be known as well.
Some Ryzens machines do not advertize legacy timers, but there is no
reliable way to determine the bus frequency which feeds the local APIC
timer when the machine allows overclocking of that frequency.
As there is no legacy timer the local APIC timer calibration crashes due to
a NULL pointer dereference when accessing the not installed global clock
event device.
Switch the calibration loop to a non interrupt based one, which polls
either TSC (if frequency is known) or jiffies. The latter requires a global
clockevent. As the machines which do not have a global clockevent installed
have a known TSC frequency this is a non issue. For older machines where
TSC frequency is not known, there is no known case where the legacy timers
do not exist as that would have been reported long ago.
Use 'lea' instead of 'add' when adjusting %rsp in CALL_NOSPEC so as to
avoid clobbering flags.
KVM's emulator makes indirect calls into a jump table of sorts, where
the destination of the CALL_NOSPEC is a small blob of code that performs
fast emulation by executing the target instruction with fixed operands.
adcb_al_dl:
0x000339f8 <+0>: adc %dl,%al
0x000339fa <+2>: ret
A major motiviation for doing fast emulation is to leverage the CPU to
handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is
both an input and output to the target of CALL_NOSPEC. Clobbering flags
results in all sorts of incorrect emulation, e.g. Jcc instructions often
take the wrong path. Sans the nops...
ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
0x000359a8 <+136>: mov -0x10(%ebp),%eax
0x000359ab <+139>: and $0x8d5,%edi
0x000359b4 <+148>: and $0xfffff72a,%eax
0x000359b9 <+153>: or %eax,%edi
0x000359bd <+157>: mov %edi,0x4(%ebx)
For the most part this has gone unnoticed as emulation of guest code
that can trigger fast emulation is effectively limited to MMIO when
running on modern hardware, and MMIO is rarely, if ever, accessed by
instructions that affect or consume flags.
Breakage is almost instantaneous when running with unrestricted guest
disabled, in which case KVM must emulate all instructions when the guest
has invalid state, e.g. when the guest is in Big Real Mode during early
BIOS.
Fixes: 776b043848fd2 ("x86/retpoline: Add initial retpoline support") Fixes: 1a29b5b7f347a ("KVM: x86: Make indirect calls in emulator speculation safe") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20190822211122.27579-1-sean.j.christopherson@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Note that the fix offered by commit bd293d071ffe doesn't really prevent
the deadlock from occuring - if we look at the stacktrace reported by
Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex -
i.e. it has already successfully taken the mutex. Changing the mutex
from mutex_lock to mutex_trylock won't help with deadlocks that happen
afterwards.
Abort processing of a command if we run out of mapped data in the
SG list. This should never happen, but a previous bug caused it to
be possible. Play it safe and attempt to abort nicely if we don't
have more SG segments left.
On the arm64 platform, executing "ifconfig eth0 up" will fail,
returning "ifconfig: SIOCSIFFLAGS: Input/output error."
ndev->dev is not initialized, dma_map_single->get_dma_ops->
dummy_dma_ops->__dummy_map_page will return DMA_ERROR_CODE
directly, so when we use dma_map_single, the first parameter
is to use the device of platform_device.
Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
TX_DESC_NUM is 256, in tx_count, the maximum value of
mod(TX_DESC_NUM - 1) is 254, the variable "count" in
the hip04_mac_start_xmit function is never equal to
(TX_DESC_NUM - 1), so hip04_mac_start_xmit never
return NETDEV_TX_BUSY.
tx_count is modified to mod(TX_DESC_NUM) so that
the maximum value of tx_count can reach
(TX_DESC_NUM - 1), then hip04_mac_start_xmit can reurn
NETDEV_TX_BUSY.
Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
If hip04_tx_reclaim is interrupted while it is running
and then __napi_schedule continues to execute
hip04_rx_poll->hip04_tx_reclaim, reentrancy occurs
and oops is generated. So you need to mask the interrupt
during the hip04_tx_reclaim run.
Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
A call to 'kfree_skb()' is missing in the error handling path of
'init_one()'.
This is already present in 'remove_one()' but is missing here.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
John Hubbard reports seeing the following stack trace:
nfs4_do_reclaim
rcu_read_lock /* we are now in_atomic() and must not sleep */
nfs4_purge_state_owners
nfs4_free_state_owner
nfs4_destroy_seqid_counter
rpc_destroy_wait_queue
cancel_delayed_work_sync
__cancel_work_timer
__flush_work
start_flush_work
might_sleep:
(kernel/workqueue.c:2975: BUG)
The solution is to separate out the freeing of the state owners
from nfs4_purge_state_owners(), and perform that outside the atomic
context.
Reported-by: John Hubbard <jhubbard@nvidia.com> Fixes: 0aaaf5c424c7f ("NFS: Cache state owners after files are closed") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
strncpy() does not ensure NULL-termination when the input string size
equals to the destination buffer size IFNAMSIZ. The output string is
passed to dev_info() which relies on the NULL-termination.
Use strlcpy() instead.
This issue is identified by a Coccinelle script.
Signed-off-by: Wang Xiayang <xywang.sjtu@sjtu.edu.cn> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
strncpy() does not ensure NULL-termination when the input string size
equals to the destination buffer size IFNAMSIZ. The output string
'name' is passed to dev_info which relies on NULL-termination.
Use strlcpy() instead.
This issue is identified by a Coccinelle script.
Signed-off-by: Wang Xiayang <xywang.sjtu@sjtu.edu.cn> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>