It has been reported to be causing problems, and is being reworked
upstream and has been dropped from the current 5.15.y stable queue until
it gets resolved.
It has been reported to be causing problems, and is being reworked
upstream and has been dropped from the current 5.15.y stable queue until
it gets resolved.
Reported-by: Chris Rankin <rankincj@gmail.com> Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Link: https://lore.kernel.org/r/ed000478-2a60-0066-c337-a04bffc112b1@leemhuis.info Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It has been reported to be causing problems, and is being reworked
upstream and has been dropped from the current 5.15.y stable queue until
it gets resolved.
Commit a4b83deb3e76 ("media: videobuf2: rework vb2_mem_ops API")
added a new vb member to struct vb2_dma_sg_buf, but it only added
code setting this to the vb2_dma_sg_alloc() function and not to the
vb2_dma_sg_get_userptr() and vb2_dma_sg_attach_dmabuf() which also
create vb2_dma_sg_buf objects.
This is causing a crash due to a NULL pointer deref when using
libcamera on devices with an Intel IPU3 (qcam app).
Fix these crashes by assigning buf->vb in the other 2 functions too,
note libcamera tests the vb2_dma_sg_get_userptr() path, the change
to the vb2_dma_sg_attach_dmabuf() path is untested.
Fixes: a4b83deb3e76 ("media: videobuf2: rework vb2_mem_ops API") Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The size of the exception stacks was increased by the commit in Fixes,
resulting in stack sizes greater than a page in size. The #VC exception
handling was only mapping the first (bottom) page, resulting in an
SEV-ES guest failing to boot.
Make the #VC exception stacks part of the default exception stacks
storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map
them only when a SEV-ES guest has been detected.
Rip out the custom VC stacks mapping and storage code.
In preparation for other confidential computing technologies, introduce
a generic helper function, cc_platform_has(), that can be used to
check for specific active confidential computing attributes, like
memory encryption. This is intended to eliminate having to add multiple
technology-specific checks to the code (e.g. if (sev_active() ||
tdx_active() || ... ).
The new function validate_hash_algo() assumed that ima_get_hash_algo()
always return a valid 'enum hash_algo', but it returned the
user-supplied value present in the digital signature without
any bounds checks.
Update ima_get_hash_algo() to always return a valid hash algorithm,
defaulting on 'ima_hash_algo' when the user-supplied value inside
the xattr is invalid.
Signed-off-by: THOBY Simon <Simon.THOBY@viveris.fr> Reported-by: syzbot+e8bafe7b82c739eaf153@syzkaller.appspotmail.com Fixes: 50f742dd9147 ("IMA: block writes of the security.ima xattr with unsupported algorithms") Reviewed-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Previous fix aded bpf_clamp_umax() helper use to re-validate boundaries.
While that works correctly, it introduces more branches, which blows up
past 1 million instructions in no-alu32 variant of strobemeta selftests.
Switching len variable from u32 to u64 also fixes the issue and reduces
the number of validated instructions, so use that instead. Fix this
patch and bpf_clamp_umax() removed, both alu32 and no-alu32 selftests
pass.
Commit in Fixes changed the iopl emulation to not #GP on CLI and STI
because it would break some insane luserspace tools which would toggle
interrupts.
The corresponding selftest would rely on the fact that executing CLI/STI
would trigger a #GP and thus detect it this way but since that #GP is
not happening anymore, the detection is now wrong too.
Extend the test to actually look at the IF flag and whether executing
those insns had any effect on it. The STI detection needs to have the
fact that interrupts were previously disabled, passed in so do that from
the previous CLI test, i.e., STI test needs to follow a previous CLI one
for it to make sense.
There are several error return paths that dereference the null pointer
host because the pointer has not yet been set to a valid value.
Fix this by adding a new out_mmc label and exiting via this label
to avoid the host clean up and hence the null pointer dereference.
Addresses-Coverity: ("Explicit null dereference") Fixes: 8105c2abbf36 ("mmc: moxart: Fix reference count leaks in moxart_probe") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20211013100052.125461-1-colin.king@canonical.com Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Using a kernel pointer in place of a dma_addr_t token can
lead to undefined behavior if that makes it into cache
management functions. The compiler caught one such attempt
in a cast:
drivers/net/wireless/ath/ath10k/mac.c: In function 'ath10k_add_interface':
drivers/net/wireless/ath/ath10k/mac.c:5586:47: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
5586 | arvif->beacon_paddr = (dma_addr_t)arvif->beacon_buf;
| ^
Looking through how this gets used down the way, I'm fairly
sure that beacon_paddr is never accessed again for ATH10K_DEV_TYPE_HL
devices, and if it was accessed, that would be a bug.
Change the assignment to use a known-invalid address token
instead, which avoids the warning and makes it easier to catch
bugs if it does end up getting used.
Fixes: e263bdab9c0e ("ath10k: high latency fixes for beacon buffer") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20211014075153.3655910-1-arnd@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 7be3248f3139 ("cifs: To match file servers, make sure the server hostname matches") Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Adding kfree(dvb) to vidtv_bridge_remove() will remove the memory
too soon: if an application still has an open filehandle to the device
when the driver is unloaded, then when that filehandle is closed, a
use-after-free access takes place to the freed memory.
Move the kfree(dvb) to vidtv_bridge_dev_release() instead.
commit 652de07addd2 ("drm/amd/display: Fully switch to dmub for all dcn21
asics") switched over to using dmub on Renoir to fix Gitlab 1735, but this
implied a new dependency on newer firmware which might not be met on older
kernel versions.
Since sw_init runs before hw_init, there is an opportunity to determine
whether or not the firmware version is new to adjust the behavior.
Cc: Roman.Li@amd.com BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1772 BugLink: https://gitlab.freedesktop.org/drm/amd/-/issues/1735 Fixes: 652de07addd2 ("drm/amd/display: Fully switch to dmub for all dcn21 asics") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Roman Li <Roman.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The premise of commit 6f9f17287e78 ("SUNRPC: Mitigate cond_resched() in
xprt_transmit()") was that cond_resched() is expensive and unnecessary
when there has been just a single send.
The point of cond_resched() is to ensure that tasks that should pre-empt
this one get a chance to do so when it is safe to do so. The code prior
to commit 6f9f17287e78 failed to take into account that it was keeping a
rpc_task pinned for longer than it needed to, and so rather than doing a
full revert, let's just move the cond_resched.
Change PCIe Max Payload Size setting in PCIe Device Control register to 512
bytes to align with PCIe Link Initialization sequence as defined in Marvell
Armada 3700 Functional Specification. According to the specification,
maximal Max Payload Size supported by this device is 512 bytes.
Without this kernel prints suspicious line:
pci 0000:01:00.0: Upstream bridge's Max Payload Size set to 256 (was 16384, max 512)
With this change it changes to:
pci 0000:01:00.0: Upstream bridge's Max Payload Size set to 256 (was 512, max 512)
Macros SUN8I_CSC_CTRL() and SUN8I_CSC_COEFF() don't follow usual
recommendation of having arguments enclosed in parenthesis. While that
didn't change anything for quite sometime, it actually become important
after CSC code rework with commit ea067aee45a8 ("drm/sun4i: de2/de3:
Remove redundant CSC matrices").
Without this fix, colours are completely off for supported YVU formats
on SoCs with DE2 (A64, H3, R40, etc.).
Fix the issue by enclosing macro arguments in parenthesis.
When CONFIG_SMP=y, timebase synchronization is required when the second
kernel is started.
arch/powerpc/kernel/smp.c:
int __cpu_up(unsigned int cpu, struct task_struct *tidle)
{
...
if (smp_ops->give_timebase)
smp_ops->give_timebase();
...
}
void start_secondary(void *unused)
{
...
if (smp_ops->take_timebase)
smp_ops->take_timebase();
...
}
When CONFIG_HOTPLUG_CPU=n and CONFIG_KEXEC_CORE=n,
smp_85xx_ops.give_timebase is NULL,
smp_85xx_ops.take_timebase is NULL,
As a result, the timebase is not synchronized.
Timebase synchronization does not depend on CONFIG_HOTPLUG_CPU.
Fixes: 56f1ba280719 ("powerpc/mpc85xx: refactor the PM operations") Cc: stable@vger.kernel.org # v4.6+ Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210929033646.39630-3-nixiaoming@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On VMs with NX encryption, compression, and/or RNG offload, these
capabilities are described by nodes in the ibm,platform-facilities device
tree hierarchy:
The acceleration functions that these nodes describe are not disrupted by
live migration, not even temporarily.
But the post-migration ibm,update-nodes sequence firmware always sends
"delete" messages for this hierarchy, followed by an "add" directive to
reconstruct it via ibm,configure-connector (log with debugging statements
enabled in mobility.c):
Note we receive a single "add" message for the entire hierarchy, and what
we receive from the ibm,configure-connector sequence is the top-level
platform-facilities node along with its three children. The debug message
simply reports the parent node and not the whole subtree.
Also, significantly, the nodes added are almost completely equivalent to
the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
the leaf nodes is the only property I've observed to differ, and Linux does
not use that. So in practice, the sum of update messages Linux receives for
this hierarchy is equivalent to minor property updates.
We succeed in removing the original hierarchy from the device tree. But the
vio bus code is ignorant of this, and does not unbind or relinquish its
references. The leaf nodes, still reachable through sysfs, of course still
refer to the now-freed ibm,platform-facilities parent node, which makes
use-after-free possible:
Moreover, the "new" replacement subtree is not correctly added to the
device tree, resulting in ibm,platform-facilities parent node without the
appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy:
$ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
/sys/firmware/devicetree/base/ibm,platform-facilities/
0 directories
$ cd /sys/devices/vio ; find . -xtype l -exec file {} +
./ibm,sym-encryption-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
./ibm,random-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
./ibm,compression-v1/of_node: broken symbolic link to
../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
This is because add_dt_node() -> dlpar_attach_node() attaches only the
parent node returned from configure-connector, ignoring any children. This
should be corrected for the general case, but fixing that won't help with
the stale OF node references, which is the more urgent problem.
One way to address that would be to make the drivers respond to node
removal notifications, so that node references can be dropped
appropriately. But this would likely force the drivers to disrupt active
clients for no useful purpose: equivalent nodes are immediately re-added.
And recall that the acceleration capabilities described by the nodes remain
available throughout the whole process.
The solution I believe to be robust for this situation is to convert
remove+add of a node with an unchanged phandle to an update of the node's
properties in the Linux device tree structure. That would involve changing
and adding a fair amount of code, and may take several iterations to land.
Until that can be realized we have a confirmed use-after-free and the
possibility of memory corruption. So add a limited workaround that
discriminates on the node type, ignoring adds and removes. This should be
amenable to backporting in the meantime.
Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel") Cc: stable@vger.kernel.org Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The check_return_regs_valid() can cause a false positive if the return
regs are marked as norestart and they are an HSRR type interrupt,
because the low bit in the bottom of regs->trap causes interrupt type
matching to fail.
This can occcur for example on bare metal with a HV privileged doorbell
interrupt that causes a signal, but do_signal returns early because
get_signal() fails, and takes the "No signal to deliver" path. In this
case no signal was delivered so the return location is not changed so
return SRRs are not invalidated, yet set_trap_norestart is called, which
messes up the match. Building go-1.16.6 is known to reproduce this.
Fix it by using the TRAP() accessor which masks out the low bit.
Fixes: 6eaaf9de3599 ("powerpc/64s/interrupt: Check and fix srr_valid without crashing") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211026122531.3599918-1-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The mitigation-patching.sh script in the powerpc selftests toggles
all mitigations on and off simultaneously, revealing that rfi_flush
and stf_barrier cannot safely operate at the same time due to races
in updating the static key.
On some systems, the static key code throws a warning and the kernel
remains functional. On others, the kernel will hang or crash.
Fix this by slapping on a mutex.
Fixes: 13799748b957 ("powerpc/64: use interrupt restart table to speed up return from interrupt") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Russell Currey <ruscur@russell.cc> Acked-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211027072410.40950-1-ruscur@russell.cc Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 587164cd, introduced new opal message type (OPAL_MSG_PRD2) and
added opal notifier. But I missed to unregister the notifier during
module unload path. This results in below call trace if you try to
unload and load opal_prd module.
Also add new notifier_block for OPAL_MSG_PRD2 message.
Sample calltrace (modprobe -r opal_prd; modprobe opal_prd)
BUG: Unable to handle kernel data access on read at 0xc0080000192200e0
Faulting instruction address: 0xc00000000018d1cc
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
CPU: 66 PID: 7446 Comm: modprobe Kdump: loaded Tainted: G E 5.14.0prd #759
NIP: c00000000018d1cc LR: c00000000018d2a8 CTR: c0000000000cde10
REGS: c0000003c4c0f0a0 TRAP: 0300 Tainted: G E (5.14.0prd)
MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 24224824 XER: 20040000
CFAR: c00000000018d2a4 DAR: c0080000192200e0 DSISR: 40000000 IRQMASK: 1
...
NIP notifier_chain_register+0x2c/0xc0
LR atomic_notifier_chain_register+0x48/0x80
Call Trace:
0xc000000002090610 (unreliable)
atomic_notifier_chain_register+0x58/0x80
opal_message_notifier_register+0x7c/0x1e0
opal_prd_probe+0x84/0x150 [opal_prd]
platform_probe+0x78/0x130
really_probe+0x110/0x5d0
__driver_probe_device+0x17c/0x230
driver_probe_device+0x60/0x130
__driver_attach+0xfc/0x220
bus_for_each_dev+0xa8/0x130
driver_attach+0x34/0x50
bus_add_driver+0x1b0/0x300
driver_register+0x98/0x1a0
__platform_driver_register+0x38/0x50
opal_prd_driver_init+0x34/0x50 [opal_prd]
do_one_initcall+0x60/0x2d0
do_init_module+0x7c/0x320
load_module+0x3394/0x3650
__do_sys_finit_module+0xd4/0x160
system_call_exception+0x140/0x290
system_call_common+0xf4/0x258
A e5500 machine running a 32-bit kernel sometimes hangs at boot,
seemingly going into an infinite loop of instruction storage interrupts.
The ESR (Exception Syndrome Register) has a value of 0x800000 (store)
when this happens, which is likely set by a previous store. An
instruction TLB miss interrupt would then leave ESR unchanged, and if no
PTE exists it calls directly to the instruction storage interrupt
handler without changing ESR.
access_error() does not cause a segfault due to a store to a read-only
vma because is_exec is true. Most subsequent fault handling does not
check for a write fault on a read-only vma, and might do strange things
like create a writeable PTE or call page_mkwrite on a read only vma or
file. It's not clear what happens here to cause the infinite faulting in
this case, a fault handler failure or low level PTE or TLB handling.
In any case this can be fixed by having the instruction storage
interrupt zero regs->dsisr rather than storing the ESR value to it.
Fixes: a01a3f2ddbcd ("powerpc: remove arguments from fault handler functions") Cc: stable@vger.kernel.org # v5.12+ Reported-by: Jacques de Laval <jacques.delaval@protonmail.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Jacques de Laval <jacques.delaval@protonmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211028133043.4159501-1-npiggin@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
Fixes: d525914b5bd8 ("mtd: rawnand: xway: Move the ECC initialization to ->attach_chip()") Cc: stable@vger.kernel.org Cc: Jan Hoffmann <jan@3e8.eu> Cc: Kestrel seventyfour <kestrelseventyfour@gmail.com> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Tested-by: Jan Hoffmann <jan@3e8.eu> Link: https://lore.kernel.org/linux-mtd/20210928222258.199726-10-miquel.raynal@bootlin.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Following the introduction of the generic ECC engine infrastructure, it
was necessary to reorganize the code and move the ECC configuration in
the ->attach_chip() hook. Failing to do that properly lead to a first
series of fixes supposed to stabilize the situation. Unfortunately, this
only fixed the use of software ECC engines, preventing any other kind of
engine to be used, including on-die ones.
It is now time to (finally) fix the situation by ensuring that we still
provide a default (eg. software ECC) but will still support different
ECC engines such as on-die ECC engines if properly described in the
device tree.
There are no changes needed on the core side in order to do this, but we
just need to leverage the logic there which allows:
1- a subsystem default (set to Host engines in the raw NAND world)
2- a driver specific default (here set to software ECC engines)
3- any type of engine requested by the user (ie. described in the DT)
As the raw NAND subsystem has not yet been fully converted to the ECC
engine infrastructure, in order to provide a default ECC engine for this
driver we need to set chip->ecc.engine_type *before* calling
nand_scan(). During the initialization step, the core will consider this
entry as the default engine for this driver. This value may of course
be overloaded by the user if the usual DT properties are provided.
The introduction of the generic ECC engine API lead to a number of
changes in various drivers which broke some of them. Here is a typical
example: I expected the SM_ORDER option to be handled by the Hamming ECC
engine internals. Problem: the fsmc driver does not instantiate (yet) a
real ECC engine object so we had to use a 'bare' ECC helper instead of
the shiny rawnand functions. However, when not intializing this engine
properly and using the bare helpers, we do not get the SM ORDER feature
handled automatically. It looks like this was lost in the process so
let's ensure we use the right SM ORDER now.
Since commit 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and
classic notifiers") we were supposed to make sure that
virtio_ccw_release_dev() completes before the ccw device and the
attached dma pool are torn down, but unfortunately we did not. Before
that commit it used to be OK to delay cleaning up the memory allocated
by virtio-ccw indefinitely (which isn't really intuitive for guys used
to destruction happens in reverse construction order), but now we
trigger a BUG_ON if the genpool is destroyed before all memory allocated
from it is deallocated. Which brings down the guest. We can observe this
problem, when unregister_virtio_device() does not give up the last
reference to the virtio_device (e.g. because a virtio-scsi attached scsi
disk got removed without previously unmounting its previously mounted
partition).
To make sure that the genpool is only destroyed after all the necessary
freeing is done let us take a reference on the ccw device on each
ccw_device_dma_zalloc() and give it up on each ccw_device_dma_free().
Actually there are multiple approaches to fixing the problem at hand
that can work. The upside of this one is that it is the safest one while
remaining simple. We don't crash the guest even if the driver does not
pair allocations and frees. The downside is the reference counting
overhead, that the reference counting for ccw devices becomes more
complex, in a sense that we need to pair the calls to the aforementioned
functions for it to be correct, and that if we happen to leak, we leak
more than necessary (the whole ccw device instead of just the genpool).
Some alternatives to this approach are taking a reference in
virtio_ccw_online() and giving it up in virtio_ccw_release_dev() or
making sure virtio_ccw_release_dev() completes its work before
virtio_ccw_remove() returns. The downside of these approaches is that
these are less safe against programming errors.
Cc: <stable@vger.kernel.org> # v5.3 Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Fixes: 48720ba56891 ("virtio/s390: use DMA memory for ccw I/O and classic notifiers") Reported-by: bfu@redhat.com Reviewed-by: Vineeth Vijayan <vneethv@linux.ibm.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a queue is switched to soft offline during heavy load and later
switched to soft online again and now used, it may be that the caller
is blocked forever in the ioctl call.
The failure occurs because there is a pending reply after the queue(s)
have been switched to offline. This orphaned reply is received when
the queue is switched to online and is accidentally counted for the
outstanding replies. So when there was a valid outstanding reply and
this orphaned reply is received it counts as the outstanding one thus
dropping the outstanding counter to 0. Voila, with this counter the
receive function is not called any more and the real outstanding reply
is never received (until another request comes in...) and the ioctl
blocks.
The fix is simple. However, instead of readjusting the counter when an
orphaned reply is detected, I check the queue status for not empty and
compare this to the outstanding counter. So if the queue is not empty
then the counter must not drop to 0 but at least have a value of 1.
commit 9c6c273aa424 ("timer: Remove init_timer_on_stack() in favor
of timer_setup_on_stack()") changed the timer setup from
init_timer_on_stack(() to timer_setup(), but missed to change the
mod_timer() call. And while at it, use msecs_to_jiffies() instead
of the open coded timeout calculation.
Cc: stable@vger.kernel.org Fixes: 9c6c273aa424 ("timer: Remove init_timer_on_stack() in favor of timer_setup_on_stack()") Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The CPU hotplug off took place after 3 seconds.
The issue is the read of the event count value after 4 seconds when
the CPU is not available and the read of the counter returns an
error. This is treated as a counter value of zero. This results
in a very large value (0 - previous_value).
Fix this by detecting the hotplugged off CPU and report 0 instead
of a very large number.
It is generally unsafe to call put_device() with dpm_list_mtx held,
because the given device's release routine may carry out an action
depending on that lock which then may deadlock, so modify the
system-wide suspend and resume of devices to always drop dpm_list_mtx
before calling put_device() (and adjust white space somewhat while
at it).
For instance, this prevents the following splat from showing up in
the kernel log after a system resume in certain configurations:
[ 3290.969514] ======================================================
[ 3290.969517] WARNING: possible circular locking dependency detected
[ 3290.969519] 5.15.0+ #2420 Tainted: G S
[ 3290.969523] ------------------------------------------------------
[ 3290.969525] systemd-sleep/4553 is trying to acquire lock:
[ 3290.969529] ffff888117ab1138 ((wq_completion)hci0#2){+.+.}-{0:0}, at: flush_workqueue+0x87/0x4a0
[ 3290.969554]
but task is already holding lock:
[ 3290.969556] ffffffff8280fca8 (dpm_list_mtx){+.+.}-{3:3}, at: dpm_resume+0x12e/0x3e0
[ 3290.969571]
which lock already depends on the new lock.
Cc: All applicable <stable@vger.kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The note in c2adda27d202f ("video: backlight: Add of_find_backlight helper
in backlight.c") says that gpio-backlight uses brightness as power state.
This has been fixed since in ec665b756e6f7 ("backlight: gpio-backlight:
Correct initial power state handling") and other backlight drivers do not
require this workaround. Drop the workaround.
This fixes the case where e.g. pwm-backlight can perfectly well be set to
brightness 0 on boot in DT, which without this patch leads to the display
brightness to be max instead of off.
Fixes: c2adda27d202f ("video: backlight: Add of_find_backlight helper in backlight.c") Cc: <stable@vger.kernel.org> # 5.4+ Cc: <stable@vger.kernel.org> # 4.19.x: ec665b756e6f7: backlight: gpio-backlight: Correct initial power state handling Signed-off-by: Marek Vasut <marex@denx.de> Acked-by: Noralf Trønnes <noralf@tronnes.org> Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch extends the DLN2 driver; adding cell for adc_dln2 module.
The original patch[1] fell through the cracks when the driver was added
so ADC has never actually been usable. That patch did not have ACPI
support which was added in v5.9, so the oldest supported version this
current patch can be backported to is 5.10.
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for 2
different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.
The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has been already invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.
Moreover if the allocation failed and the OOM killer hasn't been invoked
then we are unlikely to do the right thing from the #PF context because
we have already lost the allocation context and restictions and
therefore might oom kill a task from a different NUMA domain.
This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.
[VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into a
real disaster. This has been broken since kmem accounting has been
introduced for cgroup v1 (3.8). There was no kmem specific reclaim for
the separate limit so the only way to handle kmem hard limit was to return
with ENOMEM. In upstream the problem will be fixed by removing the
outdated kmem limit, however stable and LTS kernels cannot do it and are
still affected. This patch fixes the problem and should be backported
into stable/LTS.]
Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It can be misused and allowed to trigger global OOM from inside
a memcg-limited container. On the other hand if memcg fails allocation,
called from inside #PF handler it triggers global OOM from inside
pagefault_out_of_memory().
To prevent these problems this patchset:
(a) removes execution of out_of_memory() from
pagefault_out_of_memory(), becasue nobody can explain why it is
necessary.
(b) allow memcg to fail allocation of dying/killed tasks.
This patch (of 3):
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory which in turn executes
out_out_memory() and can kill a random task.
An allocation might fail when the current task is the oom victim and
there are no memory reserves left. The OOM killer is already handled at
the page allocator level for the global OOM and at the charging level
for the memcg one. Both have much more information about the scope of
allocation/charge request. This means that either the OOM killer has
been invoked properly and didn't lead to the allocation success or it
has been skipped because it couldn't have been invoked. In both cases
triggering it from here is pointless and even harmful.
It makes much more sense to let the killed task die rather than to wake
up an eternally hungry oom-killer and send him to choose a fatter victim
for breakfast.
Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We need to ensure that we serialize the stalled and hash bits with the
wait_queue wait handler, or we could be racing with someone modifying
the hashed state after we find it busy, but before we then give up and
wait for it to be cleared. This can cause random delays or stalls when
handling buffered writes for many files, where some of these files cause
hash collisions between the worker threads.
Cc: stable@vger.kernel.org Reported-by: Daniel Black <daniel@mariadb.org> Fixes: e941894eae31 ("io-wq: make buffered file write hashed work map per-ctx") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We need to set the stalled bit early, before we drop the lock for adding
us to the stall hash queue. If not, then we can race with new work being
queued between adding us to the stall hash and io_worker_handle_work()
marking us stalled.
A previous commit removed the IRQ safety of the worker and wqe locks,
but that left one spot of the hash wait lock now being done without
already having IRQs disabled.
Ensure that we use the right locking variant for the hashed waitqueue
lock.
Fixes: a9a4aa9fbfc5 ("io-wq: wqe and worker locks no longer need to be IRQ safe") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This is resembling a heuristic for the global OOM situation
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level so the memcg heuristic is more relieved.
The above assumption is overly optimistic though. E.g. vmalloc can
scale to really large requests and the heuristic would allow that. We
used to have an early break in the vmalloc allocator for killed tasks
but this has been reverted by commit b8c8a338f75e ("Revert "vmalloc:
back off when the current task is killed""). There are likely other
similar code paths which do not check for fatal signals in an
allocation&charge loop. Also there are some kernel objects charged to a
memcg which are not bound to a process life time.
It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.
One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This
is certainly possible but it is not really clear how much of an excess
is desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.
This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot
fail or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the failure
(ENOMEM) and tell the caller to back off.
In addition, this patch renames should_force_charge() helper to
task_is_dying() because now its use is not associated witch forced
charging.
This patch depends on pagefault_out_of_memory() to not trigger
out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
and cause a global OOM killer.
Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It is not safe to check page->index without holding the page lock. It
can be changed if the page is moved between the swap cache and the page
cache for a shmem file, for example. There is a VM_BUG_ON below which
checks page->index is correct after taking the page lock.
Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org Fixes: 5c211ba29deb ("mm: add and use find_lock_entries") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: <syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If cgroup_sk_alloc() is called from interrupt context, then just assign the
root cgroup to skcd->cgroup. Prior to commit 8520e224f547 ("bpf, cgroups:
Fix cgroup v2 fallback on v1/v2 mixed mode") we would just return, and later
on in sock_cgroup_ptr(), we were NULL-testing the cgroup in fast-path, and
iff indeed NULL returning the root cgroup (v ?: &cgrp_dfl_root.cgrp). Rather
than re-adding the NULL-test to the fast-path we can just assign it once from
cgroup_sk_alloc() given v1/v2 handling has been simplified. The migration from
NULL test with returning &cgrp_dfl_root.cgrp to assigning &cgrp_dfl_root.cgrp
directly does /not/ change behavior for callers of sock_cgroup_ptr().
syzkaller was able to trigger a splat in the legacy netrom code base, where
the RX handler in nr_rx_frame() calls nr_make_new() which calls sk_alloc()
and therefore cgroup_sk_alloc() with in_interrupt() condition. Thus the NULL
skcd->cgroup, where it trips over on cgroup_sk_free() side given it expects
a non-NULL object. There are a few other candidates aside from netrom which
have similar pattern where in their accept-like implementation, they just call
to sk_alloc() and thus cgroup_sk_alloc() instead of sk_clone_lock() with the
corresponding cgroup_sk_clone() which then inherits the cgroup from the parent
socket. None of them are related to core protocols where BPF cgroup programs
are running from. However, in future, they should follow to implement a similar
inheritance mechanism.
Additionally, with a !CONFIG_CGROUP_NET_PRIO and !CONFIG_CGROUP_NET_CLASSID
configuration, the same issue was exposed also prior to 8520e224f547 due to
commit e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated
cgroup") which added the early in_interrupt() return back then.
Fixes: 8520e224f547 ("bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode") Fixes: e876ecc67db8 ("cgroup: memcg: net: do not associate sock with unrelated cgroup") Reported-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com Reported-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot+df709157a4ecaf192b03@syzkaller.appspotmail.com Tested-by: syzbot+533f389d4026d86a2a95@syzkaller.appspotmail.com Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/bpf/20210927123921.21535-1-daniel@iogearbox.net Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
Back in the days, commit bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
embedded per-socket cgroup information into sock->sk_cgrp_data and in order
to save 8 bytes in struct sock made both mutually exclusive, that is, when
cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
The assumption made was "there is no reason to mix the two and this is in line
with how legacy and v2 compatibility is handled" as stated in bd1060a1d671.
However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
this assumption no longer holds, and the possibility of the v1/v2 mixed mode
with the v2 root fallback being hit becomes a real security issue.
Many of the cgroup v2 BPF programs are also used for policy enforcement, just
to pick _one_ example, that is, to programmatically deny socket related system
calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
a policy bypass for the affected Pods.
In production environments, we have recently seen this case due to various
circumstances: i) a different 3rd party agent and/or ii) a container runtime
such as [0] in the user's environment configuring legacy cgroup v1 net_cls
tags, which triggered implicitly mentioned root fallback. Another case is
Kubernetes projects like kind [1] which create Kubernetes nodes in a container
and also add cgroup namespaces to the mix, meaning programs which are attached
to the cgroup v2 root of the cgroup namespace get attached to a non-root
cgroup v2 path from init namespace point of view. And the latter's root is
out of reach for agents on a kind Kubernetes node to configure. Meaning, any
entity on the node setting cgroup v1 net_cls tag will trigger the bypass
despite cgroup v2 BPF programs attached to the namespace root.
Generally, this mutual exclusiveness does not hold anymore in today's user
environments and makes cgroup v2 usage from BPF side fragile and unreliable.
This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
sock_cgroup_data in order to address these issues; this implicitly also fixes
the tradeoffs being made back then with regards to races and refcount leaks
as stated in bd1060a1d671, and removes the fallback, so that cgroup v2 BPF
programs always operate as expected.
Currently, it is not possible to migrate a neighbor entry between NUD_PERMANENT
state and NTF_USE flag with a dynamic NUD state from a user space control plane.
Similarly, it is not possible to add/remove NTF_EXT_LEARNED flag from an existing
neighbor entry in combination with NTF_USE flag.
This is due to the latter directly calling into neigh_event_send() without any
meta data updates as happening in __neigh_update(). Thus, to enable this use
case, extend the latter with a NEIGH_UPDATE_F_USE flag where we break the
NUD_PERMANENT state in particular so that a latter neigh_event_send() is able
to re-resolve a neighbor entry.
Before fix, NUD_PERMANENT -> NUD_* & NTF_USE:
# ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
[...]
# ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
[...]
As can be seen, despite the admin-triggered replace, the entry remains in the
NUD_PERMANENT state.
After fix, NUD_PERMANENT -> NUD_* & NTF_USE:
# ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
[...]
# ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
[...]
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn STALE
[...]
# ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
[...]
After the fix, the admin-triggered replace switches to a dynamic state from
the NTF_USE flag which triggered a new neighbor resolution. Likewise, we can
transition back from there, if needed, into NUD_PERMANENT.
Similar before/after behavior can be observed for below transitions:
Before fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE:
# ./ip/ip n replace 192.168.178.30 dev enp5s0 use
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
[...]
# ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
[...]
After fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE:
# ./ip/ip n replace 192.168.178.30 dev enp5s0 use
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
[...]
# ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
[...]
# ./ip/ip n replace 192.168.178.30 dev enp5s0 use
# ./ip/ip n
192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
[..]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
memset() and memcpy() on an MMIO region like here results in a
lockup at startup on mpc5200 platform (since this first happens
during probing of the ATA and Ethernet drivers). Use memset_io()
and memcpy_toio() instead.
udma_get_*() checks if rchan/tchan/rflow is already allocated by checking
if it has a NON NULL value. For the error cases, rchan/tchan/rflow will
have error value and udma_get_*() considers this as already allocated
(PASS) since the error values are NON NULL. This results in NULL pointer
dereference error while de-referencing rchan/tchan/rflow.
Reset the value of rchan/tchan/rflow to NULL if a channel request fails.
CC: stable@vger.kernel.org Acked-by: Peter Ujfalusi <peter.ujfalusi@gmail.com> Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Link: https://lore.kernel.org/r/20211031032411.27235-3-kishon@ti.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
bcdma_get_*() checks if bchan is already allocated by checking if it
has a NON NULL value. For the error cases, bchan will have error value
and bcdma_get_*() considers this as already allocated (PASS) since the
error values are NON NULL. This results in NULL pointer dereference
error while de-referencing bchan.
Reset the value of bchan to NULL if a channel request fails.
CC: stable@vger.kernel.org Acked-by: Peter Ujfalusi <peter.ujfalusi@gmail.com> Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Link: https://lore.kernel.org/r/20211031032411.27235-2-kishon@ti.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are pclusters in runtime marked with Z_EROFS_PCLUSTER_TAIL
before actual I/O submission. Thus, the decompression chain can be
extended if the following pcluster chain hooks such tail pcluster.
As the related comment mentioned, if some page is made of a hooked
pcluster and another followed pcluster, it can be reused for in-place
I/O (since I/O should be submitted anyway):
_______________________________________________________________
| tail (partial) page | head (partial) page |
|_____PRIMARY_HOOKED___|____________PRIMARY_FOLLOWED____________|
However, it's by no means safe to reuse as pagevec since if such
PRIMARY_HOOKED pclusters finally move into bypass chain without I/O
submission. It's somewhat hard to reproduce with LZ4 and I just found
it (general protection fault) by ro_fsstressing a LZMA image for long
time.
I'm going to actively clean up related code together with multi-page
folio adaption in the next few months. Let's address it directly for
easier backporting for now.
It invokes kthread_run to create a thread and run issue_discard_thread.
However, if f2fs_build_node_manager fails, the control flow goes to
free_nm and calls f2fs_destroy_node_manager. This function will free
sbi->nm_info. However, if issue_discard_thread accesses sbi->nm_info
after the deallocation, but before the f2fs_stop_discard_thread, it will
cause UAF(Use-after-free).
We use inline_dentry which requires to allocate dentry page when adding a link.
If we allow to reclaim memory from filesystem, we do down_read(&sbi->cp_rwsem)
twice by f2fs_lock_op(). I think this should be okay, but how about stopping
the lockdep complaint [1]?
When using "devm_request_threaded_irq(,,,,IRQF_ONESHOT,,)" in a driver,
only the first interrupt is handled, and following interrupts are never
delivered (initially reported in [1]).
That's because the RISC-V PLIC cannot EOI masked interrupts, as explained
in the description of Interrupt Completion in the PLIC spec [2]:
<quote>
The PLIC signals it has completed executing an interrupt handler by
writing the interrupt ID it received from the claim to the claim/complete
register. The PLIC does not check whether the completion ID is the same
as the last claim ID for that target. If the completion ID does not match
an interrupt source that *is currently enabled* for the target, the
completion is silently ignored.
</quote>
Re-enable the interrupt before completion if it has been masked during
the handling, and remask it afterwards.
copy_process currently copies task_struct.posix_cputimers_work as-is. If a
timer interrupt arrives while handling clone and before dup_task_struct
completes then the child task will have:
copy_process clears task_struct.task_works, so (2) will have no effect and
posix_cpu_timers_work will never run (not to mention it doesn't make sense
for two tasks to share a common linked list).
Since posix_cpu_timers_work never runs, posix_cputimers_work.scheduled is
never cleared. Since scheduled is set, future timer interrupts will skip
scheduling work, with the ultimate result that the task will never receive
timer expirations.
Together, the complete flow is:
1. Task 1 calls clone(), enters kernel.
2. Timer interrupt fires, schedules task work on Task 1.
2a. task_struct.posix_cputimers_work.scheduled = true
2b. task_struct.posix_cputimers_work.work added to
task_struct.task_works.
3. dup_task_struct() copies Task 1 to Task 2.
4. copy_process() clears task_struct.task_works for Task 2.
5. Future timer interrupts on Task 2 see
task_struct.posix_cputimers_work.scheduled = true and skip scheduling
work.
Fix this by explicitly clearing contents of task_struct.posix_cputimers_work
in copy_process(). This was never meant to be shared or inherited across
tasks in the first place.
Fixes: 1fb497dd0030 ("posix-cpu-timers: Provide mechanisms to defer timer handling to task_work") Reported-by: Rhys Hiltner <rhys@justin.tv> Signed-off-by: Michael Pratt <mpratt@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20211101210615.716522-1-mpratt@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When UBSAN is enabled, the code emitted for the call to guest_pv_has
includes a call to __ubsan_handle_load_invalid_value. objtool
complains that this call happens with UACCESS enabled; to avoid
the warning, pull the calls to user_access_begin into both arms
of the "if" statement, after the check for guest_pv_has.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Errata SKX37 is word-for-word identical to the other errata listed in
this workaround. I happened to notice this after investigating a CMCI
storm on a Skylake host. While I can't confirm this was the root cause,
spurious corrected errors does sound like a likely suspect.
Fixes: 2976908e4198 ("x86/mce: Do not log spurious corrected mce errors") Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211029205759.GA7385@codemonkey.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
{standard input}: Assembler messages:
{standard input}:287: Error: opcode not supported on this processor: mips3 (mips3) `dins $10,$7,32,32'
{standard input}:680: Error: opcode not supported on this processor: mips3 (mips3) `dins $10,$7,32,32'
{standard input}:1274: Error: opcode not supported on this processor: mips3 (mips3) `dins $12,$9,32,32'
{standard input}:2175: Error: opcode not supported on this processor: mips3 (mips3) `dins $10,$7,32,32'
make[1]: *** [scripts/Makefile.build:277: mm/highmem.o] Error 1
with code produced from `__cmpxchg64' for MIPS64r2 CPU configurations
using CONFIG_32BIT and CONFIG_PHYS_ADDR_T_64BIT.
This is due to MIPS_ISA_ARCH_LEVEL downgrading the assembly architecture
to `r4000' i.e. MIPS III for MIPS64r2 configurations, while there is a
block of code containing a DINS MIPS64r2 instruction conditionalized on
MIPS_ISA_REV >= 2 within the scope of the downgrade.
The assembly architecture override code pattern has been put there for
LL/SC instructions, so that code compiles for configurations that select
a processor to build for that does not support these instructions while
still providing run-time support for processors that do, dynamically
switched by non-constant `cpu_has_llsc'. It went in with linux-mips.org
commit aac8aa7717a2 ("Enable a suitable ISA for the assembler around
ll/sc so that code builds even for processors that don't support the
instructions. Plus minor formatting fixes.") back in 2005.
Fix the problem by wrapping these instructions along with the adjacent
SYNC instructions only, following the practice established with commit cfd54de3b0e4 ("MIPS: Avoid move psuedo-instruction whilst using
MIPS_ISA_LEVEL") and commit 378ed6f0e3c5 ("MIPS: Avoid using .set mips0
to restore ISA"). Strictly speaking the SYNC instructions do not have
to be wrapped as they are only used as a Loongson3 erratum workaround,
so they will be enabled in the assembler by default, but do this so as
to keep code consistent with other places.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Fixes: c7e2d71dda7a ("MIPS: Fix set_pte() for Netlogic XLR using cmpxchg64()") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For years, there have been random segmentation faults in userspace on
SMP PA-RISC machines. It occurred to me that this might be a problem in
set_pte_at(). MIPS and some other architectures do cache flushes when
installing PTEs with the present bit set.
Here I have adapted the code in update_mmu_cache() to flush the kernel
mapping when the kernel flush is deferred, or when the kernel mapping
may alias with the user mapping. This simplifies calls to
update_mmu_cache().
I also changed the barrier in set_pte() from a compiler barrier to a
full memory barrier. I know this change is not sufficient to fix the
problem. It might not be needed.
I have had a few days of operation with 5.14.16 to 5.15.1 and haven't
seen any random segmentation faults on rp3440 or c8000 so far.
Signed-off-by: John David Anglin <dave.anglin@bell.net> Signed-off-by: Helge Deller <deller@gmx.de> Cc: stable@kernel.org # 5.12+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I noticed that sometimes at kernel startup the backtraces did not
included the function names of init functions. Their address were not
resolved to function names and instead only the address was printed.
Debugging shows that the culprit is is_ksym_addr() which is called
by the backtrace functions to check if an address belongs to a function in
the kernel. The problem occurs only for CONFIG_KALLSYMS_ALL=y.
When looking at is_ksym_addr() one can see that for CONFIG_KALLSYMS_ALL=y
the function only tries to resolve the address via is_kernel() function,
which checks like this:
if (addr >= _stext && addr <= _end)
return 1;
On parisc the init functions are located before _stext, so this check fails.
Other platforms seem to have all functions (including init functions)
behind _stext.
The following patch moves the _stext symbol at the beginning of the
kernel and thus includes the init section. This fixes the check and does
not seem to have any negative side effects on where the kernel mapping
happens in the map_pages() function in arch/parisc/mm/init.c.
Naresh and Antonio ran into a build failure with latest Debian
armhf compilers, with lots of output like
tmp/ccY3nOAs.s:2215: Error: selected processor does not support `cpsid i' in ARM mode
As it turns out, $(cc-option) fails early here when the FPU is not
selected before CPU architecture is selected, as the compiler
option check runs before enabling -msoft-float, which causes
a problem when testing a target architecture level without an FPU:
cc1: error: '-mfloat-abi=hard': selected architecture lacks an FPU
Passing e.g. -march=armv6k+fp in place of -march=armv6k would avoid this
issue, but the fallback logic is already broken because all supported
compilers (gcc-5 and higher) are much more recent than these options,
and building with -march=armv5t as a fallback no longer works.
The best way forward that I see is to just remove all the checks, which
also has the nice side-effect of slightly improving the startup time for
'make'.
The -mtune=marvell-f option was apparently never supported by any mainline
compiler, and the custom Codesourcery gcc build that did support is
now too old to build kernels, so just use -mtune=xscale unconditionally
for those.
This should be safe to apply on all stable kernels, and will be required
in order to keep building them with gcc-11 and higher.
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996419 Reported-by: Antonio Terceiro <antonio.terceiro@linaro.org> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Tested-by: Sebastian Reichel <sebastian.reichel@collabora.com> Tested-by: Klaus Kudielka <klaus.kudielka@gmail.com> Cc: Matthias Klose <doko@debian.org> Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently __set_fixmap() bails out with a warning when called in early boot
from early_iounmap(). Fix it, and while at it, make the comment a bit easier
to understand.
Cc: <stable@vger.kernel.org> Fixes: b089c31c519c ("ARM: 8667/3: Fix memory attribute inconsistencies when using fixmap") Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Linux allows doing a flush/fsync on a file open for read-only,
but the protocol does not allow that. If the file passed in
on the flush is read-only try to find a writeable handle for
the same inode, if that is not possible skip sending the
fsync call to the server to avoid breaking the apps.
Reported-by: Julian Sikorski <belegdol@gmail.com> Tested-by: Julian Sikorski <belegdol@gmail.com> Suggested-by: Jeremy Allison <jra@samba.org> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The below commit added optional support for passing a bind address.
It configures the sockaddr bind arguments before parsing options and
reconfigures on options -b and -4.
This broke support for passing port (-p) on its own.
Configure sockaddr after parsing all arguments.
Fixes: 3327a9c46352 ("selftests: add functionals test for UDP GRO") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Ensure diagnostics monitoring support is implemented for the SFF 8472
compliant port module and set the correct length for ethtool port
module eeprom read.
Fixes: f56ec6766dcf ("cxgb4: Add support for ethtool i2c dump") Signed-off-by: Manoj Malviya <manojmalviya@chelsio.com> Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
We got the following WARNING when running ab/nginx
test with RDMA link flapping (up-down-up).
The reason is when smc_sock fallback and at linkdown
happens simultaneously, we may got the following situation:
smc_sock was set to SMC_CLOSED and sock_put() been called
when terminate the link group. But later application call
close() on the socket, then we got:
__smc_release():
if (smc_sock->fallback)
smc_sock->sk_state = SMC_CLOSED
sock_put(smc_sock)
Again we set the smc_sock to CLOSED through it's already
in CLOSED state, and double put the refcnt, so the following
warning happens:
refcount_t: underflow; use-after-free.
WARNING: CPU: 5 PID: 860 at lib/refcount.c:28 refcount_warn_saturate+0x8d/0xf0
Modules linked in:
CPU: 5 PID: 860 Comm: nginx Not tainted 5.10.46+ #403
Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/01/2014
RIP: 0010:refcount_warn_saturate+0x8d/0xf0
Code: 05 5c 1e b5 01 01 e8 52 25 bc ff 0f 0b c3 80 3d 4f 1e b5 01 00 75 ad 48
This patch adds check in __smc_release() to make
sure we won't do an extra sock_put() and set the
socket to CLOSED when its already in CLOSED state.
Fixes: 51f1de79ad8e (net/smc: replace sock_put worker by socket refcounting) Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently vosck_connect() increments sock refcount for nonblocking
socket each time it's called, which can lead to memory leak if
it's called multiple times because connect timeout function decrements
sock refcount only once.
Fixes it by making vsock_connect() return -EALREADY immediately when
sock state is already SS_CONNECTING.
Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit bfe301ebbc94 ("net: mvpp2: convert to use
mac_prepare()/mac_finish()") introduced a bug wherein it leaves the MAC
RESET register asserted after mac_finish(), due to wrong order of
function calls.
Before it was:
.mac_config()
mvpp22_mode_reconfigure()
assert reset
mvpp2_xlg_config()
deassert reset
Now it is:
.mac_prepare()
.mac_config()
mvpp2_xlg_config()
deassert reset
.mac_finish()
mvpp2_xlg_config()
assert reset
Obviously this is wrong.
This bug is triggered when phylink tries to change the PHY interface
mode from a GMAC mode (sgmii, 1000base-x, 2500base-x) to XLG mode
(10gbase-r, xaui). The XLG mode does not work since reset is left
asserted. Only after
ifconfig down && ifconfig up
is called will the XLG mode work.
Move the call to mvpp22_mode_reconfigure() to .mac_prepare()
implementation. Since some of the subsequent functions need to know
whether the interface is being changed, we unfortunately also need to
pass around the new interface mode before setting port->phy_interface.
Fixes: bfe301ebbc94 ("net: mvpp2: convert to use mac_prepare()/mac_finish()") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit fe28c53ed71d ("net: stmmac: fix taprio configuration when
base_time is in the past") allowed some base time values in the past,
but apparently not all, the base-time value of 0 (Jan 1st 1970) is still
explicitly denied by the driver.
Remove the bogus check.
Fixes: b60189e0392f ("net: stmmac: Integrate EST with TAPRIO scheduler API") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, driver only allow configuring ETS bandwidth of TCs according
to the max TC number queried from firmware. However, the hardware actually
supports 8 TCs and users may need to configure ETS bandwidth of all TCs,
so remove the restriction.
Fixes: 330baff5423b ("net: hns3: add ETS TC weight setting in SSU module") Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, driver will send command to firmware to query pfc packet number
when user uses dcb tool to get pfc parameters. However, the periodic
service task will also periodically query and record MAC statistics,
including pfc packet number.
As the hardware registers of statistics is cleared after reading, it will
cause pfc packet number of MAC statistics are not correct after using dcb
tool to get pfc parameters.
To fix this problem, when user uses dcb tool to get pfc parameters, driver
updates MAC statistics firstly and then get pfc packet number from MAC
statistics.
Fixes: 64fd2300fcc1 ("net: hns3: add support for querying pfc puase packets statistic") Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently, NIC init ROCE interrupt vector with MSIX interrupt. But ROCE use
pci_irq_vector() to get interrupt vector, which adds the relative interrupt
vector again and gets wrong interrupt vector.
So fixes it by assign relative interrupt vector to ROCE instead of MSIX
interrupt vector and delete the unused struct member base_msi_vector
declaration of hclgevf_dev.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
1) if q->tk_offset == TK_OFFS_MAX, then get_tcp_tstamp() calls
ktime_mono_to_any() with out-of-bound value.
2) if q->tk_offset is changed in taprio_parse_clockid(),
taprio_get_time() might also call ktime_mono_to_any()
with out-of-bound value as sysbot found:
UBSAN: array-index-out-of-bounds in kernel/time/timekeeping.c:908:27
index 3 is out of range for type 'ktime_t *[3]'
CPU: 1 PID: 25668 Comm: kworker/u4:0 Not tainted 5.15.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
ubsan_epilogue+0xb/0x5a lib/ubsan.c:151
__ubsan_handle_out_of_bounds.cold+0x62/0x6c lib/ubsan.c:291
ktime_mono_to_any+0x1d4/0x1e0 kernel/time/timekeeping.c:908
get_tcp_tstamp net/sched/sch_taprio.c:322 [inline]
get_packet_txtime net/sched/sch_taprio.c:353 [inline]
taprio_enqueue_one+0x5b0/0x1460 net/sched/sch_taprio.c:420
taprio_enqueue+0x3b1/0x730 net/sched/sch_taprio.c:485
dev_qdisc_enqueue+0x40/0x300 net/core/dev.c:3785
__dev_xmit_skb net/core/dev.c:3869 [inline]
__dev_queue_xmit+0x1f6e/0x3630 net/core/dev.c:4194
batadv_send_skb_packet+0x4a9/0x5f0 net/batman-adv/send.c:108
batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:393 [inline]
batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:421 [inline]
batadv_iv_send_outstanding_bat_ogm_packet+0x6d7/0x8e0 net/batman-adv/bat_iv_ogm.c:1701
process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
kthread+0x405/0x4f0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
Fixes: 7ede7b03484b ("taprio: make clock reference conversions easier") Fixes: 54002066100b ("taprio: Adjust timestamps for TCP packets") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vedang Patel <vedang.patel@intel.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Link: https://lore.kernel.org/r/20211108180815.1822479-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Model 88E6191X only supports >1G speeds on port 10. Port 0 and 9 are
only 1G.
Fixes: de776d0d316f ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Marek Behún <kabel@kernel.org> Cc: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20211104171747.10509-1-kabel@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
DEFINE_PROC_SHOW_ATTRIBUTE() is supposed to be used to define a series
of functions and variables to register proc file easily. And the users
can use proc_create_data() to pass their own private data and get it
via seq->private in the callback. Unfortunately, the proc file system
use PDE_DATA() to get private data instead of inode->i_private. So fix
it. Fortunately, there only one user of it which does not pass any
private data, so this bug does not break any in-tree codes.
Link: https://lkml.kernel.org/r/20211029032638.84884-1-songmuchun@bytedance.com Fixes: 97a32539b956 ("proc: convert everything to "struct proc_ops"") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Florent Revest <revest@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
For NV12 FBs with odd main surface tile-row height the CCS surface
height was incorrectly calculated 1 less than the actual value. Fix this
by rounding up the result of divison. For consistency do the same for
the CCS surface width calculation.
The priv->ntfy_blocks[] has "priv->num_ntfy_blks" elements so this >
needs to be >= to prevent an off by one bug. The priv->ntfy_blocks[]
array is allocated in gve_alloc_notify_blocks().
Fixes: 87a7f321bb6a ("gve: Recover from queue stall due to missed IRQ") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Using the % operator on a 64-bit variable is expensive and can
cause a link failure:
arm-linux-gnueabi-ld: drivers/dma/stm32-dma.o: in function `stm32_dma_get_max_width':
stm32-dma.c:(.text+0x170): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/dma/stm32-dma.o: in function `stm32_dma_set_xfer_param':
stm32-dma.c:(.text+0x1cd4): undefined reference to `__aeabi_uldivmod'
As we know that we just want to check the alignment in
stm32_dma_get_max_width(), there is no need for a full division, and
using a simple mask is a faster replacement.
Same in stm32_dma_set_xfer_param(), change this to only allow burst
transfers if the address is a multiple of the length.
stm32_dma_get_best_burst just after will take buf_len into account to fix
burst in case of misalignment.
Theorically, address pointers used by STM32 DMA must be chosen so as to
ensure that all transfers within a burst block are aligned on the address
boundary equal to the size of the transfer.
If this is always the case for peripheral addresses on STM32, it is not for
memory addresses if the user doesn't respect this alignment constraint.
To avoid a weird behavior of the DMA controller in this case (no error
triggered but data are not transferred as expected), force no burst.