In unprivileged Xen guests event handling can cause a deadlock with
Xen console handling. The evtchn_rwlock and the hvc_lock are taken in
opposite sequence in __hvc_poll() and in Xen console IRQ handling.
Normally this is no problem, as the evtchn_rwlock is taken as a reader
in both paths, but as soon as an event channel is being closed, the
lock will be taken as a writer, which will cause read_lock() to block:
read_lock(evtchn_rwlock)
spin_lock(hvc_lock)
write_lock(evtchn_rwlock)
[blocks]
spin_lock(hvc_lock)
[blocks]
read_lock(evtchn_rwlock)
[blocks due to writer waiting,
and not in_interrupt()]
This issue can be avoided by replacing evtchn_rwlock with RCU in
xen_free_irq(). Note that RCU is used only to delay freeing of the
irq_info memory. There is no RCU based dereferencing or replacement of
pointers involved.
In order to avoid potential races between removing the irq_info
reference and handling of interrupts, set the irq_info pointer to NULL
only when freeing its memory. The IRQ itself must be freed at that
time, too, as otherwise the same IRQ number could be allocated again
before handling of the old instance would have been finished.
This is XSA-441 / CVE-2023-34324.
Fixes: 54c9de89895e ("xen/events: add a new "late EOI" evtchn framework") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Negative ifindexes are illegal, but the kernel does not validate the
ifindex in the ancillary header of RTM_NEWLINK messages, resulting in
the kernel generating a warning [1] when such an ifindex is specified.
It was improperly backported to 4.19.y, and applied to the wrong
function, which obviously causes problems. A fixed version will be
applied as a separate commit later.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/ZSQeA8fhUT++iZvz@ostr-mac Cc: Ido Schimmel <idosch@nvidia.com> Cc: Jiri Pirko <jiri@nvidia.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Back in 2005, Kyle McMartin removed the 16-byte alignment for
ldcw semaphores on PA 2.0 machines (CONFIG_PA20). This broke
spinlocks on pre PA8800 processors. The main symptom was random
faults in mmap'd memory (e.g., gcc compilations, etc).
Unfortunately, the errata for this ldcw change is lost.
The issue is the 16-byte alignment required for ldcw semaphore
instructions can only be reduced to natural alignment when the
ldcw operation can be handled coherently in cache. Only PA8800
and PA8900 processors actually support doing the operation in
cache.
Aligning the spinlock dynamically adds two integer instructions
to each spinlock.
The following compilation error is false alarm as RDMA devices don't
have such large amount of ports to actually cause to format truncation.
drivers/infiniband/core/cma_configfs.c: In function ‘make_cma_ports’:
drivers/infiniband/core/cma_configfs.c:223:57: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=]
223 | snprintf(port_str, sizeof(port_str), "%u", i + 1);
| ^
drivers/infiniband/core/cma_configfs.c:223:17: note: ‘snprintf’ output between 2 and 11 bytes into a destination of size 10
223 | snprintf(port_str, sizeof(port_str), "%u", i + 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[5]: *** [scripts/Makefile.build:243: drivers/infiniband/core/cma_configfs.o] Error 1
pinctrl_gpio_set_config() expects the GPIO number from the global GPIO
numberspace, not the controller-relative offset, which needs to be added
to the chip base.
In order to be sure that 'buff' is never truncated, its size should be
12, not 11.
When building with W=1, this fixes the following warnings:
drivers/infiniband/hw/mlx4/sysfs.c: In function ‘add_port_entries’:
drivers/infiniband/hw/mlx4/sysfs.c:268:34: error: ‘sprintf’ may write a terminating nul past the end of the destination [-Werror=format-overflow=]
268 | sprintf(buff, "%d", i);
| ^
drivers/infiniband/hw/mlx4/sysfs.c:268:17: note: ‘sprintf’ output between 2 and 12 bytes into a destination of size 11
268 | sprintf(buff, "%d", i);
| ^~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/mlx4/sysfs.c:286:34: error: ‘sprintf’ may write a terminating nul past the end of the destination [-Werror=format-overflow=]
286 | sprintf(buff, "%d", i);
| ^
drivers/infiniband/hw/mlx4/sysfs.c:286:17: note: ‘sprintf’ output between 2 and 12 bytes into a destination of size 11
286 | sprintf(buff, "%d", i);
| ^~~~~~~~~~~~~~~~~~~~~~
Currently, when hb_interval is changed by users, it won't take effect
until the next expiry of hb timer. As the default value is 30s, users
have to wait up to 30s to wait its hb_interval update to work.
This becomes pretty bad in containers where a much smaller value is
usually set on hb_interval. This patch improves it by resetting the
hb timer immediately once the value of hb_interval is updated by users.
Note that we don't address the already existing 'problem' when sending
a heartbeat 'on demand' if one hb has just been sent(from the timer)
mentioned in:
During the 4-way handshake, the transport's state is set to ACTIVE in
sctp_process_init() when processing INIT_ACK chunk on client or
COOKIE_ECHO chunk on server.
when processing COOKIE_ECHO on 192.168.1.2, as it's in COOKIE_WAIT state,
sctp_sf_do_dupcook_b() is called by sctp_sf_do_5_2_4_dupcook() where it
creates a new association and sets its transport to ACTIVE then updates
to the old association in sctp_assoc_update().
However, in sctp_assoc_update(), it will skip the transport update if it
finds a transport with the same ipaddr already existing in the old asoc,
and this causes the old asoc's transport state not to move to ACTIVE
after the handshake.
This means if DATA retransmission happens at this moment, it won't be able
to enter PF state because of the check 'transport->state == SCTP_ACTIVE'
in sctp_do_8_2_transport_strike().
This patch fixes it by updating the transport in sctp_assoc_update() with
sctp_assoc_add_peer() where it updates the transport state if there is
already a transport with the same ipaddr exists in the old asoc.
This commit fixes poor delayed ACK behavior that can cause poor TCP
latency in a particular boundary condition: when an application makes
a TCP socket write that is an exact multiple of the MSS size.
The problem is that there is painful boundary discontinuity in the
current delayed ACK behavior. With the current delayed ACK behavior,
we have:
(1) If an app reads data when > 1*MSS is unacknowledged, then
tcp_cleanup_rbuf() ACKs immediately because of:
(2) If an app reads all received data, and the packets were < 1*MSS,
and either (a) the app is not ping-pong or (b) we received two
packets < 1*MSS, then tcp_cleanup_rbuf() ACKs immediately beecause
of:
(3) *However*: if an app reads exactly 1*MSS of data,
tcp_cleanup_rbuf() does not send an immediate ACK. This is true
even if the app is not ping-pong and the 1*MSS of data had the PSH
bit set, suggesting the sending application completed an
application write.
Thus if the app is not ping-pong, we have this painful case where
>1*MSS gets an immediate ACK, and <1*MSS gets an immediate ACK, but a
write whose last skb is an exact multiple of 1*MSS can get a 40ms
delayed ACK. This means that any app that transfers data in one
direction and takes care to align write size or packet size with MSS
can suffer this problem. With receive zero copy making 4KB MSS values
more common, it is becoming more common to have application writes
naturally align with MSS, and more applications are likely to
encounter this delayed ACK problem.
The fix in this commit is to refine the delayed ACK heuristics with a
simple check: immediately ACK a received 1*MSS skb with PSH bit set if
the app reads all data. Why? If an skb has a len of exactly 1*MSS and
has the PSH bit set then it is likely the end of an application
write. So more data may not be arriving soon, and yet the data sender
may be waiting for an ACK if cwnd-bound or using TX zero copy. Thus we
set ICSK_ACK_PUSHED in this case so that tcp_cleanup_rbuf() will send
an ACK immediately if the app reads all of the data and is not
ping-pong. Note that this logic is also executed for the case where
len > MSS, but in that case this logic does not matter (and does not
hurt) because tcp_cleanup_rbuf() will always ACK immediately if the
app reads data and there is more than an MSS of unACKed data.
This commit fixes quick-ack counting so that it only considers that a
quick-ack has been provided if we are sending an ACK that newly
acknowledges data.
The code was erroneously using the number of data segments in outgoing
skbs when deciding how many quick-ack credits to remove. This logic
does not make sense, and could cause poor performance in
request-response workloads, like RPC traffic, where requests or
responses can be multi-segment skbs.
When a TCP connection decides to send N quick-acks, that is to
accelerate the cwnd growth of the congestion control module
controlling the remote endpoint of the TCP connection. That quick-ack
decision is purely about the incoming data and outgoing ACKs. It has
nothing to do with the outgoing data or the size of outgoing data.
And in particular, an ACK only serves the intended purpose of allowing
the remote congestion control to grow the congestion window quickly if
the ACK is ACKing or SACKing new data.
The fix is simple: only count packets as serving the goal of the
quickack mechanism if they are ACKing/SACKing new data. We can tell
whether this is the case by checking inet_csk_ack_scheduled(), since
we schedule an ACK exactly when we are ACKing/SACKing new data.
Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231001151239.1866845-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The STM32MP1 keeps clk_rx enabled during suspend, and therefore the
driver does not enable the clock in stm32_dwmac_init() if the device was
suspended. The problem is that this same code runs on STM32 MCUs, which
do disable clk_rx during suspend, causing the clock to never be
re-enabled on resume.
This patch adds a variant flag to indicate that clk_rx remains enabled
during suspend, and uses this to decide whether to enable the clock in
stm32_dwmac_init() if the device was suspended.
This approach fixes this specific bug with limited opportunity for
unintended side-effects, but I have a follow up patch that will refactor
the clock configuration and hopefully make it less error prone.
Fixes: 6528e02cc9ff ("net: ethernet: stmmac: add adaptation for stm32mp157c.") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230927175749.1419774-1-ben.wolsieffer@hefring.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
This issue is caused because usbnet_read_cmd() reads less bytes than requested
(zero byte in the reproducer). In this case, 'buf' is not properly filled.
This patch fixes the issue by returning -ENODATA if usbnet_read_cmd() reads
less bytes than requested.
Fixes: d0cad871703b ("smsc75xx: SMSC LAN75xx USB gigabit ethernet adapter driver") Reported-and-tested-by: syzbot+6966546b78d050bb0b5d@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6966546b78d050bb0b5d Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230923173549.3284502-1-syoshida@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Including the transhdrlen in length is a problem when the packet is
partially filled (e.g. something like send(MSG_MORE) happened previously)
when appending to an IPv4 or IPv6 packet as we don't want to repeat the
transport header or account for it twice. This can happen under some
circumstances, such as splicing into an L2TP socket.
The symptom observed is a warning in __ip6_append_data():
WARNING: CPU: 1 PID: 5042 at net/ipv6/ip6_output.c:1800 __ip6_append_data.isra.0+0x1be8/0x47f0 net/ipv6/ip6_output.c:1800
that occurs when MSG_SPLICE_PAGES is used to append more data to an already
partially occupied skbuff. The warning occurs when 'copy' is larger than
the amount of data in the message iterator. This is because the requested
length includes the transport header length when it shouldn't. This can be
triggered by, for example:
Fix this by only adding transhdrlen into the length if the write queue is
empty in l2tp_ip6_sendmsg(), analogously to how UDP does things.
l2tp_ip_sendmsg() looks like it won't suffer from this problem as it builds
the UDP packet itself.
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6") Reported-by: syzbot+62cbf263225ae13ff153@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/0000000000001c12b30605378ce8@google.com/ Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: David Ahern <dsahern@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
cc: syzkaller-bugs@googlegroups.com Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Without this 'else' statement, an "usb" name goes into two handlers:
the first/previous 'if' statement _AND_ the for-loop over 'devtable',
but the latter is useless as it has no 'usb' device_id entry anyway.
Tested with allmodconfig before/after patch; no changes to *.mod.c:
git checkout v6.6-rc3
make -j$(nproc) allmodconfig
make -j$(nproc) olddefconfig
make -j$(nproc)
find . -name '*.mod.c' | cpio -pd /tmp/before
# apply patch
make -j$(nproc)
find . -name '*.mod.c' | cpio -pd /tmp/after
diff -r /tmp/before/ /tmp/after/
# no difference
Fixes: acbef7b76629 ("modpost: fix module autoloading for OF devices with generic compatible property") Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The following call trace shows a deadlock issue due to recursive locking of
mutex "device_mutex". First lock acquire is in target_for_each_device() and
second in target_free_device().
When regcache_rbtree_write() creates a new rbtree_node it was passing the
wrong bit number to regcache_rbtree_set_register(). The bit number is the
offset __in number of registers__, but in the case of creating a new block
regcache_rbtree_write() was not dividing by the address stride to get the
number of registers.
Fix this by dividing by map->reg_stride.
Compare with regcache_rbtree_read() where the bit is checked.
This bug meant that the wrong register was marked as present. The register
that was written to the cache could not be read from the cache because it
was not marked as cached. But a nearby register could be marked as having
a cached value even if it was never written to the cache.
Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com> Fixes: 3f4ff561bc88 ("regmap: rbtree: Make cache_present bitmap per node") Link: https://lore.kernel.org/r/20230922153711.28103-1-rf@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Process the result of hdlc_open() and call uhdlc_close()
in case of an error. It is necessary to pass the error
code up the control flow, similar to a possible
error in request_irq().
Also add a hdlc_close() call to the uhdlc_close()
because the comment to hdlc_close() says it must be called
by the hardware driver when the HDLC device is being closed
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: c19b6d246a35 ("drivers/net: support hdlc function for QE-UCC") Signed-off-by: Alexandra Diupina <adiupina@astralinux.ru> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Only skip the code path trying to access the rfc1042 headers when the
buffer is too small, so the driver can still process packets without
rfc1042 headers.
Fixes: 119585281617 ("wifi: mwifiex: Fix OOB and integer underflow when rx packets") Signed-off-by: Pin-yen Lin <treapking@chromium.org> Acked-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Matthew Wang <matthewmwang@chromium.org> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20230908104308.1546501-1-treapking@chromium.org Signed-off-by: Sasha Levin <sashal@kernel.org>
There exists mtd devices with zero erasesize, which will trigger a
divide-by-zero exception while attaching ubi device.
Fix it by refusing attaching if mtd's erasesize is 0.
commit 0bdf399342c5 ("net: Avoid address overwrite in kernel_connect")
ensured that kernel_connect() will not overwrite the address parameter
in cases where BPF connect hooks perform an address rewrite. This change
replaces direct calls to sock->ops->connect() in net with kernel_connect()
to make these call safe.
Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/ Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect") Cc: stable@vger.kernel.org Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jordan Rife <jrife@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In a TLV encoding scheme, the Length part represents the length after
the header containing the values for type and length. In this case,
`tlv_len` should be:
Notice that the `- 1` accounts for the one-element array `bitmap`, which
1-byte size is already included in `sizeof(*tlv_rxba)`.
So, if the above is correct, there is a double-counting of some members
in `struct mwifiex_ie_types_rxba_sync`, when `tlv_buf_left` and `tmp`
are calculated:
The above reflects the desired change: avoid counting 13 too many bytes;
which is the total size of the double-counted members in
`struct mwifiex_ie_types_rxba_sync`:
The flexible structure (a structure that contains a flexible-array member
at the end) `qed_ll2_tx_packet` is nested within the second layer of
`struct qed_ll2_info`:
The problem is that member `cbs` in `struct qed_ll2_info` is placed just
after an object of type `struct qed_ll2_tx_queue`, which is in itself
an implicit flexible structure, which by definition ends in a flexible
array member, in this case `bds_set`. This causes an undefined behavior
bug at run-time when dynamic memory is allocated for `bds_set`, which
could lead to a serious issue if `cbs` in `struct qed_ll2_info` is
overwritten by the contents of `bds_set`. Notice that the type of `cbs`
is a structure full of function pointers (and a cookie :) ):
Fix this by moving the declaration of `cbs` to the middle of its
containing structure `qed_ll2_info`, preventing it from being
overwritten by the contents of `bds_set` at run-time.
This bug was introduced in 2017, when `bds_set` was converted to a
one-element array, and started to be used as a Variable Length Object
(VLO) at run-time.
Fixes: f5823fe6897c ("qed: Add ll2 option to limit the number of bds per packet") Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/ZQ+Nz8DfPg56pIzr@work Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When device_register() fails, zfcp_port_release() will be called after
put_device(). As a result, zfcp_ccw_adapter_put() will be called twice: one
in zfcp_port_release() and one in the error path after device_register().
So the reference on the adapter object is doubly put, which may lead to a
premature free. Fix this by adjusting the error tag after
device_register().
Fixes: f3450c7b9172 ("[SCSI] zfcp: Replace local reference counting with common kref") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Link: https://lore.kernel.org/r/20230923103723.10320-1-dinghao.liu@zju.edu.cn Acked-by: Benjamin Block <bblock@linux.ibm.com> Cc: stable@vger.kernel.org # v2.6.33+ Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In commit f296b374b9c1 ("media: dvb: symbol fixup for dvb_attach()") in
the 4.19.y tree, a few symbols were missed due to files being renamed in
newer kernel versions. Fix this up by properly marking up the
sp8870_attach and xc2028_attach symbols.
Ben writes:
When I looked into the referenced security issue, it seemed to only be
exploitable through wakelock names, and in the upstream kernel only
after commit c8377adfa781 "PM / wakeup: Show wakeup sources stats in
sysfs" (first included in 5.4). So I would be interested to know if
and why a fix was needed for 4.19.
More importantly, this backported version uniformly converts to
sysfs_emit(), but there are 3 places sysfs_emit_at() must be used
instead:
In AHCI 1.3.1, the register description for CAP.SSC:
"When cleared to ‘0’, software must not allow the HBA to initiate
transitions to the Slumber state via agressive link power management nor
the PxCMD.ICC field in each port, and the PxSCTL.IPM field in each port
must be programmed to disallow device initiated Slumber requests."
In AHCI 1.3.1, the register description for CAP.PSC:
"When cleared to ‘0’, software must not allow the HBA to initiate
transitions to the Partial state via agressive link power management nor
the PxCMD.ICC field in each port, and the PxSCTL.IPM field in each port
must be programmed to disallow device initiated Partial requests."
Ensure that we always set the corresponding bits in PxSCTL.IPM, such that
a device is not allowed to initiate transitions to power states which are
unsupported by the HBA.
DevSleep is always initiated by the HBA, however, for completeness, set the
corresponding bit in PxSCTL.IPM such that agressive link power management
cannot transition to DevSleep if DevSleep is not supported.
sata_link_scr_lpm() is used by libahci, ata_piix and libata-pmp.
However, only libahci has the ability to read the CAP/CAP2 register to see
if these features are supported. Therefore, in order to not introduce any
regressions on ata_piix or libata-pmp, create flags that indicate that the
respective feature is NOT supported. This way, the behavior for ata_piix
and libata-pmp should remain unchanged.
This change is based on a patch originally submitted by Runa Guo-oc.
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Fixes: 1152b2617a6e ("libata: implement sata_link_scr_lpm() and make ata_dev_set_feature() global") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
With the configuration PAGE_SIZE 64k and filesystem blocksize 64k,
a problem occurred when more than 13 million files were directly created
under a directory:
EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D.
EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D.
EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum
When enough files are created, the fake_dirent->reclen will be 0xffff.
it doesn't equal to the blocksize 65536, i.e. 0x10000.
But it is not the same condition when blocksize equals to 4k.
when enough files are created, the fake_dirent->reclen will be 0x1000.
it equals to the blocksize 4k, i.e. 0x1000.
The problem seems to be related to the limitation of the 16-bit field
when the blocksize is set to 64k.
To address this, helpers like ext4_rec_len_{from,to}_disk has already
been introduced to complete the conversion between the encoded and the
plain form of rec_len.
So fix this one by using the helper, and all the other in this file too.
Cc: stable@kernel.org Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Suggested-by: Andreas Dilger <adilger@dilger.ca> Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The elf-fdpic loader hard sets the process personality to either
PER_LINUX_FDPIC for true elf-fdpic binaries or to PER_LINUX for normal ELF
binaries (in this case they would be constant displacement compiled with
-pie for example). The problem with that is that it will lose any other
bits that may be in the ELF header personality (such as the "bug
emulation" bits).
On the ARM architecture the ADDR_LIMIT_32BIT flag is used to signify a
normal 32bit binary - as opposed to a legacy 26bit address binary. This
matters since start_thread() will set the ARM CPSR register as required
based on this flag. If the elf-fdpic loader loses this bit the process
will be mis-configured and crash out pretty quickly.
Modify elf-fdpic loader personality setting so that it preserves the upper
three bytes by using the SET_PERSONALITY macro to set it. This macro in
the generic case sets PER_LINUX and preserves the upper bytes.
Architectures can override this for their specific use case, and ARM does
exactly this.
The problem shows up quite easily running under qemu using the ARM
architecture, but not necessarily on all types of real ARM hardware. If
the underlying ARM processor does not support the legacy 26-bit addressing
mode then everything will work as expected.
Link: https://lkml.kernel.org/r/20230907011808.2985083-1-gerg@kernel.org Fixes: 1bde925d23547 ("fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries") Signed-off-by: Greg Ungerer <gerg@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg Ungerer <gerg@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On certain SATA controllers, softreset fails after wakeup from S2RAM with
the message "softreset failed (1st FIS failed)", sometimes resulting in
drives not being detected again. With the increased timeout, this issue
is avoided. Instead, "softreset failed (device not ready)" is now
logged 1-2 times; this later failure seems to cause fewer problems
however, and the drives are detected reliably once they've spun up and
the probe is retried.
The issue was observed with the primary SATA controller of the QNAP
TS-453B, which is an "Intel Corporation Celeron/Pentium Silver Processor
SATA Controller [8086:31e3] (rev 06)" integrated in the Celeron J4125 CPU,
and the following drives:
The SATA controller seems to be more relevant to this issue than the
drives, as the same drives are always detected reliably on the secondary
SATA controller on the same board (an ASMedia 106x) without any "softreset
failed" errors even without the increased timeout.
Fixes: e7d3ef13d52a ("libata: change drive ready wait after hard reset to 5s") Cc: stable@vger.kernel.org Signed-off-by: Matthias Schiffer <mschiffer@universe-factory.net> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
libsas does its own domain based power management of ports. For such
ports, libata should not use a device type defining power management
operations as executing these operations for suspend/resume in addition
to libsas calls to ata_sas_port_suspend() and ata_sas_port_resume() is
not necessary (and likely dangerous to do, even though problems are not
seen currently).
Introduce the new ata_port_sas_type device_type for ports managed by
libsas. This new device type is used in ata_tport_add() and is defined
without power management operations.
Fixes: 2fcbdcb4c802 ("[SCSI] libata: export ata_port suspend/resume infrastructure for sas") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Whenever an ATA adapter driver is removed (e.g. rmmod),
ata_port_detach() is called repeatedly for all the adapter ports to
remove (unload) the devices attached to the port and delete the port
device itself. Removing of devices is done using libata EH with the
ATA_PFLAG_UNLOADING port flag set. This causes libata EH to execute
ata_eh_unload() which disables all devices attached to the port.
ata_port_detach() finishes by calling scsi_remove_host() to remove the
scsi host associated with the port. This function will trigger the
removal of all scsi devices attached to the host and in the case of
disks, calls to sd_shutdown() which will flush the device write cache
and stop the device. However, given that the devices were already
disabled by ata_eh_unload(), the synchronize write cache command and
start stop unit commands fail. E.g. running "rmmod ahci" with first
removing sd_mod results in error messages like:
The function ata_port_request_pm() checks the port flag
ATA_PFLAG_PM_PENDING and calls ata_port_wait_eh() if this flag is set to
ensure that power management operations for a port are not scheduled
simultaneously. However, this flag check is done without holding the
port lock.
Fix this by taking the port lock on entry to the function and checking
the flag under this lock. The lock is released and re-taken if
ata_port_wait_eh() needs to be called. The two WARN_ON() macros checking
that the ATA_PFLAG_PM_PENDING flag was cleared are removed as the first
call is racy and the second one done without holding the port lock.
Fixes: 5ef41082912b ("ata: add ata port system PM callbacks") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Alex reported that running ssh over IPv6 does not work with
Thunderbolt/USB4 networking driver. The reason for that is that driver
should call skb_is_gso() before calling skb_is_gso_v6(), and it should
not return false after calculates the checksum successfully. This probably
was a copy paste error from the original driver where it was done properly.
Reported-by: Alex Balcanquall <alex@alexbal.com> Fixes: e69b6c02b4c3 ("net: Add support for networking over Thunderbolt cable") Cc: stable@vger.kernel.org Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
A user reported some issues with smaller file systems that get very
full. While investigating this issue I noticed that df wasn't showing
100% full, despite having 0 chunk space and having < 1MiB of available
metadata space.
This turns out to be an overflow issue, we're doing:
to determine if there's not enough space to make metadata allocations,
which overflows if total_available_metadata_space is < 4M. Fix this by
checking to see if our available space is greater than the 4M threshold.
This makes df properly report 100% usage on the file system.
CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For REPORT SUPPORTED OPERATION CODES command, the service action field is
defined as bits 0-4 in the second byte in the CDB. Bits 5-7 in the second
byte are reserved.
Only look at the service action field in the second byte when determining
if the MAINTENANCE IN opcode is a REPORT SUPPORTED OPERATION CODES command.
This matches how we only look at the service action field in the second
byte when determining if the SERVICE ACTION IN(16) opcode is a READ
CAPACITY(16) command (reserved bits 5-7 in the second byte are ignored).
Fixes: 7b2030942859 ("libata: Add support for SCT Write Same") Cc: stable@vger.kernel.org Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In nilfs_gccache_submit_read_data(), brelse(bh) is called to drop the
reference count of bh when the call to nilfs_dat_translate() fails. If
the reference count hits 0 and its owner page gets unlocked, bh may be
freed. However, bh->b_page is dereferenced to put the page after that,
which may result in a use-after-free bug. This patch moves the release
operation after unlocking and putting the page.
NOTE: The function in question is only called in GC, and in combination
with current userland tools, address translation using DAT does not occur
in that function, so the code path that causes this issue will not be
executed. However, it is possible to run that code path by intentionally
modifying the userland GC library or by calling the GC ioctl directly.
[konishi.ryusuke@gmail.com: NOTE added to the commit log] Link: https://lkml.kernel.org/r/1543201709-53191-1-git-send-email-bianpan2016@163.com Link: https://lkml.kernel.org/r/20230921141731.10073-1-konishi.ryusuke@gmail.com Fixes: a3d93f709e89 ("nilfs2: block cache for garbage collection") Signed-off-by: Pan Bian <bianpan2016@163.com> Reported-by: Ferry Meng <mengferry@linux.alibaba.com> Closes: https://lkml.kernel.org/r/20230818092022.111054-1-mengferry@linux.alibaba.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case the leaf driver wants to use IRQ polling (irq = 0) and
IIR register shows that an interrupt happened in the 8250 hardware
the IRQ data can be NULL. In such a case we need to skip the wake
event as we came to this path from the timer interrupt and quite
likely system is already awake.
Without this fix we have got an Oops:
serial8250: ttyS0 at I/O 0x3f8 (irq = 0, base_baud = 115200) is a 16550A
...
BUG: kernel NULL pointer dereference, address: 0000000000000010
RIP: 0010:serial8250_handle_irq+0x7c/0x240
Call Trace:
? serial8250_handle_irq+0x7c/0x240
? __pfx_serial8250_timeout+0x10/0x10
smack_dentry_create_files_as() determines whether transmuting should occur
based on the label of the parent directory the new inode will be added to,
and not the label of the directory where it is created.
This helps for example to do transmuting on overlayfs, since the latter
first creates the inode in the working directory, and then moves it to the
correct destination.
However, despite smack_dentry_create_files_as() provides the correct label,
smack_inode_init_security() does not know from passed information whether
or not transmuting occurred. Without this information,
smack_inode_init_security() cannot set SMK_INODE_CHANGED in smk_flags,
which will result in the SMACK64TRANSMUTE xattr not being set in
smack_d_instantiate().
Thus, add the smk_transmuted field to the task_smack structure, and set it
in smack_dentry_create_files_as() to smk_task if transmuting occurred. If
smk_task is equal to smk_transmuted in smack_inode_init_security(), act as
if transmuting was successful but without taking the label from the parent
directory (the inode label was already set correctly from the current
credentials in smack_inode_alloc_security()).
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[4.19: adjusted for the lack of helper functions] Fixes: d6d80cb57be4 ("Smack: Base support for overlayfs") Signed-off-by: Munehisa Kamata <kamatam@amazon.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently in "smack_inode_copy_up()" function, process label is
changed with the label on parent inode. Due to which,
process is assigned directory label and whatever file or directory
created by the process are also getting directory label
which is wrong label.
Changes has been done to use label of overlay inode instead
of parent inode.
Signed-off-by: Vishal Goel <vishal.goel@samsung.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[4.19: adjusted for the lack of helper functions] Fixes: d6d80cb57be4 ("Smack: Base support for overlayfs") Signed-off-by: Munehisa Kamata <kamatam@amazon.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Load balancing IO completions across all available MSI-X vectors should be
enabled for Invader and later generation controllers only. This needs to
be disabled for older controllers. Add an adapter type check before
setting msix_load_balance flag.
Fixes: 1d15d9098ad1 ("scsi: megaraid_sas: Load balance completions across all MSI-X") Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
When converting net_device_stats to rtnl_link_stats64 sign extension
is triggered on ILP32 machines as 6c1c509778 changed the previous
"ulong -> u64" conversion to "long -> u64" by accessing the
net_device_stats fields through a (signed) atomic_long_t.
This causes for example the received bytes counter to jump to 16EiB after
having received 2^31 bytes. Casting the atomic value to "unsigned long"
beforehand converting it into u64 avoids this.
Fixes: 6c1c5097781f ("net: add atomic_long_t to net_device_stats fields") Signed-off-by: Felix Riemann <felix.riemann@sma.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Daniel reported that the commit 1ae3e78c0820 ("watchdog: iTCO_wdt: No
need to stop the timer in probe") makes QEMU implementation of the iTCO
watchdog not to trigger reboot anymore when NO_REBOOT flag is initially
cleared using this option (in QEMU command line):
-global ICH9-LPC.noreboot=false
The problem with the commit is that it left the unconditional setting of
NO_REBOOT that is not cleared anymore when the kernel keeps pinging the
watchdog (as opposed to the previous code that called iTCO_wdt_stop()
that cleared it).
Fix this so that we only set NO_REBOOT if the watchdog was not initially
running.
Fixes: 1ae3e78c0820 ("watchdog: iTCO_wdt: No need to stop the timer in probe") Reported-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Tested-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20221028062750.45451-1-mika.westerberg@linux.intel.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The watchdog core can handle pinging of the watchdog before userspace
opens the device. For this reason instead of stopping the timer, just
mark it as running and let the watchdog core take care of it.
If a device has no NUMA node information associated with it, the driver
puts the device in node first_memory_node (say node 0). Not having a
NUMA node and being associated with node 0 are completely different
things and it makes little sense to mix the two.
Fix linker error if FB=m about missing fb_io_read and fb_io_write. The
linker's error message suggests that this config setting has already
been broken for other symbols.
All errors (new ones prefixed by >>):
sh4-linux-ld: drivers/video/fbdev/sh7760fb.o: in function `sh7760fb_probe':
sh7760fb.c:(.text+0x374): undefined reference to `framebuffer_alloc'
sh4-linux-ld: sh7760fb.c:(.text+0x394): undefined reference to `fb_videomode_to_var'
sh4-linux-ld: sh7760fb.c:(.text+0x39c): undefined reference to `fb_alloc_cmap'
sh4-linux-ld: sh7760fb.c:(.text+0x3a4): undefined reference to `register_framebuffer'
sh4-linux-ld: sh7760fb.c:(.text+0x3ac): undefined reference to `fb_dealloc_cmap'
sh4-linux-ld: sh7760fb.c:(.text+0x434): undefined reference to `framebuffer_release'
sh4-linux-ld: drivers/video/fbdev/sh7760fb.o: in function `sh7760fb_remove':
sh7760fb.c:(.text+0x800): undefined reference to `unregister_framebuffer'
sh4-linux-ld: sh7760fb.c:(.text+0x804): undefined reference to `fb_dealloc_cmap'
sh4-linux-ld: sh7760fb.c:(.text+0x814): undefined reference to `framebuffer_release'
>> sh4-linux-ld: drivers/video/fbdev/sh7760fb.o:(.rodata+0xc): undefined reference to `fb_io_read'
>> sh4-linux-ld: drivers/video/fbdev/sh7760fb.o:(.rodata+0x10): undefined reference to `fb_io_write'
sh4-linux-ld: drivers/video/fbdev/sh7760fb.o:(.rodata+0x2c): undefined reference to `cfb_fillrect'
sh4-linux-ld: drivers/video/fbdev/sh7760fb.o:(.rodata+0x30): undefined reference to `cfb_copyarea'
sh4-linux-ld: drivers/video/fbdev/sh7760fb.o:(.rodata+0x34): undefined reference to `cfb_imageblit'
Suggested-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309130632.LS04CPWu-lkp@intel.com/ Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Link: https://patchwork.freedesktop.org/patch/msgid/20230918090400.13264-1-tzimmermann@suse.de Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit 151e887d8ff9 ("veth: Fixing transmit return status for dropped
packets") exposed the fact that bpf_clone_redirect is capable of
returning raw NET_XMIT_XXX return codes.
This is in the conflict with its UAPI doc which says the following:
"0 on success, or a negative error in case of failure."
Update the UAPI to reflect the fact that bpf_clone_redirect can
return positive error numbers, but don't explicitly define
their meaning.
Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230911194731.286342-1-sdf@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
ata_scsi_port_error_handler() starts off by clearing ATA_PFLAG_EH_PENDING,
before calling ap->ops->error_handler() (without holding the ap->lock).
If an error IRQ is received while ap->ops->error_handler() is running,
the irq handler will set ATA_PFLAG_EH_PENDING.
Once ap->ops->error_handler() returns, ata_scsi_port_error_handler()
checks if ATA_PFLAG_EH_PENDING is set, and if it is, another iteration
of ATA EH is performed.
The problem is that ATA_PFLAG_EH_PENDING is not only cleared by
ata_scsi_port_error_handler(), it is also cleared by ata_eh_reset().
ata_eh_reset() is called by ap->ops->error_handler(). This additional
clearing done by ata_eh_reset() breaks the whole retry logic in
ata_scsi_port_error_handler(). Thus, if an error IRQ is received while
ap->ops->error_handler() is running, the port will currently remain
frozen and will never get re-enabled.
The additional clearing in ata_eh_reset() was introduced in commit 1e641060c4b5 ("libata: clear eh_info on reset completion").
Looking at the original error report:
https://marc.info/?l=linux-ide&m=124765325828495&w=2
We can see the following happening:
[ 1.074659] ata3: XXX port freeze
[ 1.074700] ata3: XXX hardresetting link, stopping engine
[ 1.074746] ata3: XXX flipping SControl
[ 1.420049] ata3: XXX starting engine
[ 1.420096] ata3: XXX rc=0, class=1
[ 1.420142] ata3: XXX clearing IRQs for thawing
[ 1.420188] ata3: XXX port thawed
[ 1.420234] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
We are not supposed to be able to receive an error IRQ while the port is
frozen (PxIE is set to 0, i.e. all IRQs for the port are disabled).
AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) states:
"Each bit location can be thought of as reporting a '1' if the virtual
"interrupt line" for that port is indicating it wishes to generate an
interrupt. That is, if a port has one or more interrupt status bit set,
and the enables for those status bits are set, then this bit shall be set."
Additionally, AHCI state P:ComInit clearly shows that the state machine
will only jump to P:ComInitSetIS (which sets IS.IPS(x) to '1'), if PxIE.PCE
is set to '1'. In our case, PxIE is set to 0, so IS.IPS(x) won't get set.
So IS.IPS(x) only gets set if PxIS and PxIE is set.
AHCI 1.3.1 section 10.7.1.1 First Tier (IS Register) also states:
"The bits in this register are read/write clear. It is set by the level of
the virtual interrupt line being a set, and cleared by a write of '1' from
the software."
So if IS.IPS(x) is set, you need to explicitly clear it by writing a 1 to
IS.IPS(x) for that port.
Since PxIE is cleared, the only way to get an interrupt while the port is
frozen, is if IS.IPS(x) is set, and the only way IS.IPS(x) can be set when
the port is frozen, is if it was set before the port was frozen.
However, since commit 737dd811a3db ("ata: libahci: clear pending interrupt
status"), we clear both PxIS and IS.IPS(x) after freezing the port, but
before the COMRESET, so the problem that commit 1e641060c4b5 ("libata:
clear eh_info on reset completion") fixed can no longer happen.
Thus, revert commit 1e641060c4b5 ("libata: clear eh_info on reset
completion"), so that the retry logic in ata_scsi_port_error_handler()
works once again. (The retry logic is still needed, since we can still
get an error IRQ _after_ the port has been thawed, but before
ata_scsi_port_error_handler() takes the ap->lock in order to check
if ATA_PFLAG_EH_PENDING is set.)
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When user resize all trace ring buffer through file 'buffer_size_kb',
then in ring_buffer_resize(), kernel allocates buffer pages for each
cpu in a loop.
If the kernel preemption model is PREEMPT_NONE and there are many cpus
and there are many buffer pages to be allocated, it may not give up cpu
for a long time and finally cause a softlockup.
To avoid it, call cond_resched() after each cpu buffer allocation.
Function instance_set() expects to enable event 'sched_switch', so we
should set 1 to its 'enable' file.
Testcase passed after this patch:
# ./ftracetest test.d/instances/instance-event.tc
=== Ftrace unit tests ===
[1] Test creation and deletion of trace instances while setting an event
[PASS]
# of passed: 1
# of failed: 0
# of unresolved: 0
# of untested: 0
# of unsupported: 0
# of xfailed: 0
# of undefined(test bug): 0
The drivers uses a mutex and I2C bus access in its PMIC EIC chip
get implementation. This means these functions can sleep and the PMIC EIC
chip should set the can_sleep property to true.
This will ensure that a warning is printed when trying to get the
value from a context that potentially can't sleep.
Commit 0840242e8875 ("ARM: dts: Configure clock parent for pwm vibra")
attempted to fix the PWM settings but ended up causin an additional clock
reparenting error:
clk: failed to reparent abe-clkctrl:0060:24 to sys_clkin_ck: -22
Only timer9 is in the PER domain and can use the sys_clkin_ck clock source.
For timer8, the there is no sys_clkin_ck available as it's in the ABE
domain, instead it should use syc_clk_div_ck. However, for power
management, we want to use the always on sys_32k_ck instead.
Cc: Ivaylo Dimitrov <ivo.g.dimitrov.75@gmail.com> Cc: Carl Philipp Klemm <philipp@uvos.xyz> Cc: Merlijn Wajer <merlijn@wizzup.org> Cc: Pavel Machek <pavel@ucw.cz> Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com> Fixes: 0840242e8875 ("ARM: dts: Configure clock parent for pwm vibra")
Depends-on: 61978617e905 ("ARM: dts: Add minimal support for Droid Bionic xt875") Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
tegra-bpmp clocks driver makes implicit conversion of signed error
code to unsigned value in recalc_rate operation. The behavior for
recalc_rate, according to it's specification, should be that "If the
driver cannot figure out a rate for this clock, it must return 0."
Fixes: ca6f2796eef7 ("clk: tegra: Add BPMP clock driver") Signed-off-by: Timo Alho <talho@nvidia.com> Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com> Link: https://lore.kernel.org/r/20230912112951.2330497-1-cyndis@kapsi.fi Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
While commit d4a5c59a955b ("mmc: au1xmmc: force non-modular build and
remove symbol_get usage") to be built in, it can still build a kernel
without MMC support and thuse no mmc_detect_change symbol at all.
Add ifdefs to build the mmc support code in the alchemy arch code
conditional on mmc support.
Fixes: d4a5c59a955b ("mmc: au1xmmc: force non-modular build and remove symbol_get usage") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
Len Brown has reported that system suspend sometimes fail due to
inability to freeze a task working in ext4_trim_fs() for one minute.
Trimming a large filesystem on a disk that slowly processes discard
requests can indeed take a long time. Since discard is just an advisory
call, it is perfectly fine to interrupt it at any time and the return
number of discarded blocks until that moment. Do that when we detect the
task is being frozen.
Cc: stable@kernel.org Reported-by: Len Brown <lenb@kernel.org> Suggested-by: Dave Chinner <david@fromorbit.com>
References: https://bugzilla.kernel.org/show_bug.cgi?id=216322 Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently we set the group's trimmed bit in ext4_trim_all_free() based
on return value of ext4_try_to_trim_range(). However when we will want
to abort trimming because of suspend attempt, we want to return success
from ext4_try_to_trim_range() but not set the trimmed bit. Instead
implementing awkward propagation of this information, just move setting
of trimmed bit into ext4_try_to_trim_range() when the whole group is
trimmed.
Otherwise nonaligned fstrim calls will works inconveniently for iterative
scanners, for example:
// trim [0,16MB] for group-1, but mark full group as trimmed
fstrim -o $((1024*1024*128)) -l $((1024*1024*16)) ./m
// handle [16MB,16MB] for group-1, do nothing because group already has the flag.
fstrim -o $((1024*1024*144)) -l $((1024*1024*16)) ./m
[ Update function documentation for ext4_trim_all_free -- TYT ]
There is no good reason for the s_last_trim_minblks to be atomic. There is
no data integrity needed and there is no real danger in setting and
reading it in a racy manner. Change it to be unsigned long, the same type
as s_clusters_per_group which is the maximum that's allowed.
Signed-off-by: Lukas Czerner <lczerner@redhat.com> Suggested-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20211103145122.17338-1-lczerner@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
As commit 6920b3913235 ("ext4: add new helper interface
ext4_try_to_trim_range()") moves some code into the separate function
ext4_try_to_trim_range(), the use of the variable ret within that
function is more limited and can be adjusted as well.
Scope the use of the variable ret locally and drop dead assignments.
There is no functional change in this patch but just split the
codes, which serachs free block and does trim, into a new function
ext4_try_to_trim_range. This is preparing for the following async
backgroup discard.
Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210724074124.25731-3-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
Get rid of the 'group' parameter of ext4_trim_extent as we can get
it from the 'e4b'.
Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Wang Jianchao <wangjianchao@kuaishou.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210724074124.25731-2-jianchao.wan9@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
The following processes run into a deadlock. CPU 41 was waiting for CPU 29
to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29
was hung by that spinlock with IRQs disabled.
The lock is used to synchronize different sysfs operations, it doesn't
protect any resource that will be touched by an interrupt. Consequently
it's not required to disable IRQs. Replace the spinlock with a mutex to fix
the deadlock.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com Reviewed-by: Mike Christie <michael.christie@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Driver will use "reply descriptor post queues" in round robin fashion when
the combined MSI-X mode is not enabled. With this IO completions are
distributed and load balanced across all the available reply descriptor
post queues equally.
This is enabled only if combined MSI-X mode is not enabled in firmware.
This improves performance and also fixes soft lockups.
When load balancing is enabled, IRQ affinity from driver needs to be
disabled.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Stable-dep-of: 0b0747d507bf ("scsi: megaraid_sas: Fix deadlock on firmware crashdump") Signed-off-by: Sasha Levin <sashal@kernel.org>
User accidently passed module parameter ql2xenabledif=1 which is
unsupported. However, driver still initialized which lead to guard tag
errors during device discovery.
Remove unsupported ql2xenabledif=1 option and validate the user input.
The touchpad of this device is both connected via PS/2 and i2c. This causes
strange behavior when both driver fight for control. The easy fix is to
prevent the PS/2 driver from accessing the mouse port as the full feature
set of the touchpad is only supported in the i2c interface anyway.
The strange behavior in this case is, that when an external screen is
connected and the notebook is closed, the pointer on the external screen is
moving to the lower right corner. When the notebook is opened again, this
movement stops, but the touchpad clicks are unresponsive afterwards until
reboot.
If an error occurs after a successful irq_domain_add_linear() call, it
should be undone by a corresponding irq_domain_remove(), as already done
in the remove function.
[1]
$ teamd -t team0 -d -c '{"runner": {"name": "loadbalance"}}'
$ ip link add name t-dummy type dummy
$ ip link add link t-dummy name t-dummy.100 type vlan id 100
$ ip link add name t-nlmon type nlmon
$ ip link set t-nlmon master team0
$ ip link set t-nlmon nomaster
$ ip link set t-dummy up
$ ip link set team0 up
$ ip link set t-dummy.100 down
$ ip link set t-dummy.100 master team0
When enslave a vlan device to team device and team device type is changed
from non-ether to ether, header_ops of team device is changed to
vlan_header_ops. That is incorrect and will trigger null-ptr-deref
for vlan->real_dev in vlan_dev_hard_header() because team device is not
a vlan device.
Cache eth_header_ops in team_setup(), then assign cached header_ops to
header_ops of team net device when its type is changed from non-ether
to ether to fix the bug.
Fixes: 1d76efe1577b ("team: add support for non-ethernet devices") Suggested-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230918123011.1884401-1-william.xuanziyang@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Long standing KCSAN issues are caused by data-race around
some dev->stats changes.
Most performance critical paths already use per-cpu
variables, or per-queue ones.
It is reasonable (and more correct) to use atomic operations
for the slow paths.
This patch adds an union for each field of net_device_stats,
so that we can convert paths that are not yet protected
by a spinlock or a mutex.
netdev_stats_to_stats64() no longer has an #if BITS_PER_LONG==64
Note that the memcpy() we were using on 64bit arches
had no provision to avoid load-tearing,
while atomic_long_read() is providing the needed protection
at no cost.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: 44bdb313da57 ("net: bridge: use DEV_STATS_INC()") Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently the reset process in hns3 and firmware watchdog init process is
asynchronous. we think firmware watchdog initialization is completed
before hns3 clear the firmware interrupt source. However, firmware
initialization may not complete early.
so we add delay before hns3 clear firmware interrupt source and 5 ms delay
is enough to avoid second firmware reset interrupt.
Fixes: c1a81619d73a ("net: hns3: Add mailbox interrupt handling to PF driver") Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Jijie Shao <shaojijie@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Valid domain value is in range 1 to HV_PERF_DOMAIN_MAX. Current code has
check for domain value greater than or equal to HV_PERF_DOMAIN_MAX. But
the check for domain value 0 is missing.
Fix this issue by adding check for domain value 0.
Before:
# ./perf stat -v -e hv_24x7/CPM_ADJUNCT_INST,domain=0,core=1/ sleep 1
Using CPUID 00800200
Control descriptor is not initialized
Error:
The sys_perf_event_open() syscall returned with 5 (Input/output error) for
event (hv_24x7/CPM_ADJUNCT_INST,domain=0,core=1/).
/bin/dmesg | grep -i perf may provide additional information.
Result from dmesg:
[ 37.819387] hv-24x7: hcall failed: [0 0x60040000 0x100 0] => ret
0xfffffffffffffffc (-4) detail=0x2000000 failing ix=0
After:
# ./perf stat -v -e hv_24x7/CPM_ADJUNCT_INST,domain=0,core=1/ sleep 1
Using CPUID 00800200
Control descriptor is not initialized
Warning:
hv_24x7/CPM_ADJUNCT_INST,domain=0,core=1/ event is not supported by the kernel.
failed to read counter hv_24x7/CPM_ADJUNCT_INST,domain=0,core=1/
Currently, we assume the skb is associated with a device before calling
__ip_options_compile, which is not always the case if it is re-routed by
ipvs.
When skb->dev is NULL, dev_net(skb->dev) will become null-dereference.
This patch adds a check for the edge case and switch to use the net_device
from the rtable when skb->dev is NULL.
Fixes: ed0de45a1008 ("ipv4: recompile ip options in ipv4_link_failure") Suggested-by: David Ahern <dsahern@kernel.org> Signed-off-by: Kyle Zeng <zengyhkyle@gmail.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Cc: Vadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
tls.sendmsg_large and tls.sendmsg_multiple are trying to send through
the self->cfd socket (only configured with TLS_RX) and to receive through
the self->fd socket (only configured with TLS_TX), so they're not using
kTLS at all. Swap the sockets.
Fixes: 7f657d5bf507 ("selftests: tls: add selftests for TLS sockets") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
Anonymous sets need to be populated once at creation and then they are
bound to rule since 938154b93be8 ("netfilter: nf_tables: reject unbound
anonymous set before commit phase"), otherwise transaction reports
EINVAL.
Userspace does not need to delete elements of anonymous sets that are
not yet bound, reject this with EOPNOTSUPP.
From flush command path, skip anonymous sets, they are expected to be
bound already. Otherwise, EINVAL is hit at the end of this transaction
for unbound sets.
When a CRC error occurs, the HBA asserts an interrupt to indicate an
interface fatal error (PxIS.IFS). The ISR clears PxIE and PxIS, then
does error recovery. If the adapter receives another SDB FIS
with an error (PxIS.TFES) from the device before the start of the EH
recovery process, the interrupt signaling the new SDB cannot be
serviced as PxIE was cleared already. This in turn results in the HBA
inability to issue any command during the error recovery process after
setting PxCMD.ST to 1 because PxIS.TFES is still set.
According to AHCI 1.3.1 specifications section 6.2.2, fatal errors
notified by setting PxIS.HBFS, PxIS.HBDS, PxIS.IFS or PxIS.TFES will
cause the HBA to enter the ERR:Fatal state. In this state, the HBA
shall not issue any new commands.
To avoid this situation, introduce the function
ahci_port_clear_pending_irq() to clear pending interrupts before
executing a COMRESET. This follows the AHCI 1.3.1 - section 6.2.2.2
specification.
Signed-off-by: Szuying Chen <Chloe_Chen@asmedia.com.tw> Fixes: e0bfd149973d ("[PATCH] ahci: stop engine during hard reset") Cc: stable@vger.kernel.org Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
With IPv6, connect() can occasionally return EINVAL if a route is
unavailable. If this happens during I/O to a data server, we want to
report it using LAYOUTERROR as an inability to connect.
Fixes: dd52128afdde ("NFSv4.1/pnfs Ensure flexfiles reports all connection related errors") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The rsvp classifier has served us well for about a quarter of a century but has
has not been getting much maintenance attention due to lack of known users.
When fw_change() is called on an existing filter, the whole
tcf_result struct is always copied into the new instance of the filter.
This causes a problem when updating a filter bound to a class,
as tcf_unbind_filter() is always called on the old instance in the
success path, decreasing filter_cnt of the still referenced class
and allowing it to be deleted, leading to a use-after-free.
Fix this by no longer copying the tcf_result struct from the old filter.
Fixes: e35a8ee5993b ("net: sched: fw use RCU") Reported-by: valis <sec@valis.email> Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[ Fixed small conflict as 'fnew->ifindex' assignment is not protected by
CONFIG_NET_CLS_IND on upstream since a51486266c3 ] Signed-off-by: Luiz Capitulino <luizcap@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
v7.2 controller has different ECC level field size and shift in the acc
control register than its predecessor and successor controller. It needs
to be set specifically.
When running delayed items we are holding a delayed node's mutex and then
we will attempt to modify a subvolume btree to insert/update/delete the
delayed items. However if have an error during the insertions for example,
btrfs_insert_delayed_items() may return with a path that has locked extent
buffers (a leaf at the very least), and then we attempt to release the
delayed node at __btrfs_run_delayed_items(), which requires taking the
delayed node's mutex, causing an ABBA type of deadlock. This was reported
by syzbot and the lockdep splat is the following:
WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Not tainted
------------------------------------------------------
syz-executor.2/13257 is trying to acquire lock: ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256
but task is already holding lock: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is: