When changing the cpu affinity of an event it can happen today that
(with some unlucky timing) the same event will be handled on the old
and the new cpu at the same time.
Avoid that by adding an "event active" flag to the per-event data and
call the handler only if this flag isn't set.
An event channel should be kept masked when an eoi is pending for it.
When being migrated to another cpu it might be unmasked, though.
In order to avoid this keep three different flags for each event channel
to be able to distinguish "normal" masking/unmasking from eoi related
masking/unmasking and temporary masking. The event channel should only
be able to generate an interrupt if all flags are cleared.
Cc: stable@vger.kernel.org Fixes: 54c9de89895e ("xen/events: add a new "late EOI" evtchn framework") Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Ross Lagerwall <ross.lagerwall@citrix.com> Link: https://lore.kernel.org/r/20210306161833.4552-3-jgross@suse.com
[boris -- corrected Fixed tag format]
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When creating a new event channel with 2-level events the affinity
needs to be reset initially in order to avoid using an old affinity
from earlier usage of the event channel port. So when tearing an event
channel down reset all affinity bits.
The same applies to the affinity when onlining a vcpu: all old
affinity settings for this vcpu must be reset. As percpu events get
initialized before the percpu event channel hook is called,
resetting of the affinities happens after offlining a vcpu (this is
working, as initial percpu memory is zeroed out).
When registering a memslot, we check the size and location of that
memslot against the IPA size to ensure that we can provide guest
access to the whole of the memory.
Unfortunately, this check rejects memslot that end-up at the exact
limit of the addressing capability for a given IPA size. For example,
it refuses the creation of a 2GB memslot at 0x8000000 with a 32bit
IPA space.
Fix it by relaxing the check to accept a memslot reaching the
limit of the IPA space.
Fixes: c3058d5da222 ("arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org # 4.4, 4.9 Reviewed-by: Andrew Jones <drjones@redhat.com> Link: https://lore.kernel.org/r/20210311100016.3830038-3-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Here, hdpvr_register_videodev() is responsible for setup and
register a video device. Also defining and initializing a worker.
hdpvr_register_videodev() is calling by hdpvr_probe at last.
So no need to flush any work here.
Unregister v4l2, free buffers and memory. If hdpvr_probe() will fail.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
[krzk: backport to v4.4, still using single thread workqueue which
is drained/destroyed now in proper step so it cannot be NULL] Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
An xattr 'get' handler is expected to return the length of the value on
success, yet _nfs4_get_security_label() (and consequently also
nfs4_xattr_get_nfs4_label(), which is used as an xattr handler) returns
just 0 on success.
Fix this by returning label.len instead, which contains the length of
the result.
Fixes: aa9c2669626c ("NFS: Client implementation of Labeled-NFS") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The analog input subdevice supports Comedi asynchronous commands that
use Comedi's 16-bit sample format. However, the call to
`comedi_buf_write_samples()` is passing the address of a 32-bit integer
parameter. On bigendian machines, this will copy 2 bytes from the wrong
end of the 32-bit value. Fix it by changing the type of the parameter
holding the sample value to `unsigned short`.
[Note: the bug was introduced in commit edf4537bcbf5 ("staging: comedi:
pcl818: use comedi_buf_write_samples()") but the patch applies better to
commit d615416de615 ("staging: comedi: pcl818: introduce
pcl818_ai_write_sample()").]
The analog input subdevice supports Comedi asynchronous commands that
use Comedi's 16-bit sample format. However, the call to
`comedi_buf_write_samples()` is passing the address of a 32-bit integer
variable. On bigendian machines, this will copy 2 bytes from the wrong
end of the 32-bit value. Fix it by changing the type of the variable
holding the sample value to `unsigned short`.
The analog input subdevice supports Comedi asynchronous commands that
use Comedi's 16-bit sample format. However, the calls to
`comedi_buf_write_samples()` are passing the address of a 32-bit integer
variable. On bigendian machines, this will copy 2 bytes from the wrong
end of the 32-bit value. Fix it by changing the type of the variable
holding the sample value to `unsigned short`.
The analog input subdevice supports Comedi asynchronous commands that
use Comedi's 16-bit sample format. However, the call to
`comedi_buf_write_samples()` is passing the address of a 32-bit integer
variable. On bigendian machines, this will copy 2 bytes from the wrong
end of the 32-bit value. Fix it by changing the type of the variable
holding the sample value to `unsigned short`.
[Note: the bug was introduced in commit 1700529b24cc ("staging: comedi:
dmm32at: use comedi_buf_write_samples()") but the patch applies better
to the later (but in the same kernel release) commit 0c0eadadcbe6e
("staging: comedi: dmm32at: introduce dmm32_ai_get_sample()").]
The analog input subdevice supports Comedi asynchronous commands that
use Comedi's 16-bit sample format. However, the call to
`comedi_buf_write_samples()` is passing the address of a 32-bit integer
variable. On bigendian machines, this will copy 2 bytes from the wrong
end of the 32-bit value. Fix it by changing the type of the variable
holding the sample value to `unsigned short`.
The analog input subdevice supports Comedi asynchronous commands that
use Comedi's 16-bit sample format. However, the call to
`comedi_buf_write_samples()` is passing the address of a 32-bit integer
variable. On bigendian machines, this will copy 2 bytes from the wrong
end of the 32-bit value. Fix it by changing the type of the variable
holding the sample value to `unsigned short`.
Fixes: d1d24cb65ee3 ("staging: comedi: das6402: read analog input samples in interrupt handler") Cc: <stable@vger.kernel.org> # 3.19+ Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Link: https://lore.kernel.org/r/20210223143055.257402-5-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The analog input subdevice supports Comedi asynchronous commands that
use Comedi's 16-bit sample format. However, the calls to
`comedi_buf_write_samples()` are passing the address of a 32-bit integer
variable. On bigendian machines, this will copy 2 bytes from the wrong
end of the 32-bit value. Fix it by changing the type of the variables
holding the sample value to `unsigned short`. The type of the `val`
parameter of `pci1710_ai_read_sample()` is changed to `unsigned short *`
accordingly. The type of the `val` variable in `pci1710_ai_insn_read()`
is also changed to `unsigned short` since its address is passed to
`pci1710_ai_read_sample()`.
The digital input subdevice supports Comedi asynchronous commands that
read interrupt status information. This uses 16-bit Comedi samples (of
which only the bottom 8 bits contain status information). However, the
interrupt handler is calling `comedi_buf_write_samples()` with the
address of a 32-bit variable `unsigned int status`. On a bigendian
machine, this will copy 2 bytes from the wrong end of the variable. Fix
it by changing the type of the variable to `unsigned short`.
Fixes: a8c66b684efa ("staging: comedi: addi_apci_1500: rewrite the subdevice support functions") Cc: <stable@vger.kernel.org> #4.0+ Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Link: https://lore.kernel.org/r/20210223143055.257402-3-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Change-Of-State (COS) subdevice supports Comedi asynchronous
commands to read 16-bit change-of-state values. However, the interrupt
handler is calling `comedi_buf_write_samples()` with the address of a
32-bit integer `&s->state`. On bigendian architectures, it will copy 2
bytes from the wrong end of the 32-bit integer. Fix it by transferring
the value via a 16-bit integer.
Function _rtl92e_wx_set_scan calls memcpy without checking the length.
A user could control that length and trigger a buffer overflow.
Fix by checking the length is within the maximum allowed size.
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Lee Gibson <leegib@gmail.com> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20210226145157.424065-1-leegib@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Function r8712_sitesurvey_cmd calls memcpy without checking the length.
A user could control that length and trigger a buffer overflow.
Fix by checking the length is within the maximum allowed size.
The "ie_len" is a value in the 1-255 range that comes from the user. We
have to cap it to ensure that it's not too large or it could lead to
memory corruption.
Fixes: 9a7fe54ddc3a ("staging: r8188eu: Add source files for new driver - part 1") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/YEHyQCrFZKTXyT7J@mwanda Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The memdup_user() function does not necessarily return a NUL terminated
string so this can lead to a read overflow. Switch from memdup_user()
to strndup_user() to fix this bug.
usbip_sockfd_store() is invoked when user requests attach (import)
detach (unimport) usb device from usbip host. vhci_hcd sends import
request and usbip_sockfd_store() exports the device if it is free
for export.
Export and unexport are governed by local state and shared state
- Shared state (usbip device status, sockfd) - sockfd and Device
status are used to determine if stub should be brought up or shut
down.
- Local state (tcp_socket, rx and tx thread task_struct ptrs)
A valid tcp_socket controls rx and tx thread operations while the
device is in exported state.
- While the device is exported, device status is marked used and socket,
sockfd, and thread pointers are valid.
Export sequence (stub-up) includes validating the socket and creating
receive (rx) and transmit (tx) threads to talk to the client to provide
access to the exported device. rx and tx threads depends on local and
shared state to be correct and in sync.
Unexport (stub-down) sequence shuts the socket down and stops the rx and
tx threads. Stub-down sequence relies on local and shared states to be
in sync.
There are races in updating the local and shared status in the current
stub-up sequence resulting in crashes. These stem from starting rx and
tx threads before local and global state is updated correctly to be in
sync.
1. Doesn't handle kthread_create() error and saves invalid ptr in local
state that drives rx and tx threads.
2. Updates tcp_socket and sockfd, starts stub_rx and stub_tx threads
before updating usbip_device status to SDEV_ST_USED. This opens up a
race condition between the threads and usbip_sockfd_store() stub up
and down handling.
Fix the above problems:
- Stop using kthread_get_run() macro to create/start threads.
- Create threads and get task struct reference.
- Add kthread_create() failure handling and bail out.
- Hold usbip_device lock to update local and shared states after
creating rx and tx threads.
- Update usbip_device status to SDEV_ST_USED.
- Update usbip_device tcp_socket, sockfd, tcp_rx, and tcp_tx
- Start threads after usbip_device (tcp_socket, sockfd, tcp_rx, tcp_tx,
and status) is complete.
Credit goes to syzbot and Tetsuo Handa for finding and root-causing the
kthread_get_run() improper error handling problem and others. This is a
hard problem to find and debug since the races aren't seen in a normal
case. Fuzzing forces the race window to be small enough for the
kthread_get_run() error path bug and starting threads before updating the
local and shared state bug in the stub-up sequence.
Tested with syzbot reproducer:
- https://syzkaller.appspot.com/text?tag=ReproC&x=14801034d00000
Fix attach_store() to validate the passed in file descriptor is a
stream socket. If the file descriptor passed was a SOCK_DGRAM socket,
sock_recvmsg() can't detect end of stream.
Fix usbip_sockfd_store() to validate the passed in file descriptor is
a stream socket. If the file descriptor passed was a SOCK_DGRAM socket,
sock_recvmsg() can't detect end of stream.
Add PID for CH340 that's found on cheap programmers.
The driver works flawlessly as soon as the new PID (0x9986) is added to it.
These look like ANU232MI but ship with a ch341 inside. They have no special
identifiers (mine only has the string "DB9D20130716" printed on the PCB and
nothing identifiable on the packaging. The merchant i bought it from
doesn't sell these anymore).
the lsusb -v output is:
Bus 001 Device 009: ID 9986:7523
Device Descriptor:
bLength 18
bDescriptorType 1
bcdUSB 1.10
bDeviceClass 255 Vendor Specific Class
bDeviceSubClass 0
bDeviceProtocol 0
bMaxPacketSize0 8
idVendor 0x9986
idProduct 0x7523
bcdDevice 2.54
iManufacturer 0
iProduct 0
iSerial 0
bNumConfigurations 1
Configuration Descriptor:
bLength 9
bDescriptorType 2
wTotalLength 0x0027
bNumInterfaces 1
bConfigurationValue 1
iConfiguration 0
bmAttributes 0x80
(Bus Powered)
MaxPower 96mA
Interface Descriptor:
bLength 9
bDescriptorType 4
bInterfaceNumber 0
bAlternateSetting 0
bNumEndpoints 3
bInterfaceClass 255 Vendor Specific Class
bInterfaceSubClass 1
bInterfaceProtocol 2
iInterface 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x82 EP 2 IN
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0020 1x 32 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x02 EP 2 OUT
bmAttributes 2
Transfer Type Bulk
Synch Type None
Usage Type Data
wMaxPacketSize 0x0020 1x 32 bytes
bInterval 0
Endpoint Descriptor:
bLength 7
bDescriptorType 5
bEndpointAddress 0x81 EP 1 IN
bmAttributes 3
Transfer Type Interrupt
Synch Type None
Usage Type Data
wMaxPacketSize 0x0008 1x 8 bytes
bInterval 1
A xHC USB 3 port might miss the first wake signal from a USB 3 device
if the port LFPS reveiver isn't enabled fast enough after xHC resume.
xHC host will anyway be resumed by a PME# signal, but will go back to
suspend if no port activity is seen.
The device resends the U3 LFPS wake signal after a 100ms delay, but
by then host is already suspended, starting all over from the
beginning of this issue.
USB 3 specs say U3 wake LFPS signal is sent for max 10ms, then device
needs to delay 100ms before resending the wake.
Don't suspend immediately if port activity isn't detected in resume.
Instead add a retry. If there is no port activity then delay for 120ms,
and re-check for port activity.
According to the datasheet, this controller has a restriction
which "set an endpoint number so that combinations of the DIR bit and
the EPNUM bits do not overlap.". However, since the udc core driver is
possible to assign a bulk pipe as an interrupt endpoint, an endpoint
number may not match the pipe number. After that, when user rebinds
another gadget driver, this driver broke the restriction because
the driver didn't clear any configuration in usb_ep_disable().
Example:
# modprobe g_ncm
Then, EP3 = pipe 3, EP4 = pipe 4, EP5 = pipe 6
# rmmod g_ncm
# modprobe g_hid
Then, EP3 = pipe 6, EP4 = pipe 7.
So, pipe 3 and pipe 6 are set as EP3.
As per UAC2 Audio Data Formats spec (2.3.1.1 USB Packets),
if the sampling rate is a constant, the allowable variation
of number of audio slots per virtual frame is +/- 1 audio slot.
It means that endpoint should be able to accept/send +1 audio
slot.
Previous endpoint max_packet_size calculation code
was adding sometimes +1 audio slot due to DIV_ROUND_UP
behaviour which was rounding up to closest integer.
However this doesn't work if the numbers are divisible.
It had no any impact with Linux hosts which ignore
this issue, but in case of more strict Windows it
caused rejected enumeration
Thus always add +1 audio slot to endpoint's max packet size
Fixes: 913e4a90b6f9 ("usb: gadget: f_uac2: finalize wMaxPacketSize according to bandwidth") Cc: Peter Chen <peter.chen@freescale.com> Cc: <stable@vger.kernel.org> #v4.3+ Signed-off-by: Ruslan Bilovol <ruslan.bilovol@gmail.com> Link: https://lore.kernel.org/r/1614599375-8803-2-git-send-email-ruslan.bilovol@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The CDC ACM driver is false matching the Goodix Fingerprint device
against the USB_CDC_ACM_PROTO_AT_V25TER.
The Goodix Fingerprint device is a biometrics sensor that should be
handled in user-space. libfprint has some support for Goodix
fingerprint sensors, although not for this particular one. It is
possible that the vendor allocates a PID per OEM (Lenovo, Dell etc).
If this happens to be the case then more devices from the same vendor
could potentially match the ACM modem module table.
alloc_workqueue is not checked for errors and as a result,
a potential NULL dereference could occur.
Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[krzk: backport applied to different path - without marvell subdir] Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When building with -ffunction-sections, the compiler will place each
function into its own ELF section, prefixed with ".text". For example,
a simple test module with functions test_module_do_work() and
test_module_wq_func():
Adjust the recordmcount scripts to look for ".text" as a section name
prefix. This will ensure that those functions will be included in the
__mcount_loc relocations:
% objdump --reloc --section __mcount_loc test_module.o
OFFSET TYPE VALUE 0000000000000000 R_X86_64_64 .text.test_module_do_work 0000000000000008 R_X86_64_64 .text.test_module_wq_func 0000000000000010 R_X86_64_64 .init.text
Link: http://lkml.kernel.org/r/1542745158-25392-2-git-send-email-joe.lawrence@redhat.com Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[Manoj: Resolve conflict on 4.4.y/4.9.y because of missing 42c269c88dc1] Signed-off-by: Manoj Gupta <manojgupta@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case of an unbind of the DASD device driver the function
dasd_generic_remove() is called which shuts down the device.
Among others this functions removes the int_handler from the cdev.
During shutdown the device cancels all outstanding IO requests and waits
for completion of the clear request.
Unfortunately the clear interrupt will never be received when there is no
interrupt handler connected.
Fix by moving the int_handler removal after the call to the state machine
where no request or interrupt is outstanding.
Cc: stable@vger.kernel.org Signed-off-by: Stefan Haberland <sth@linux.ibm.com> Tested-by: Bjoern Walk <bwalk@linux.ibm.com> Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dell AE515 sound bar (413c:a506) spews the error messages when the
driver tries to read the current sample frequency, hence it needs to
be on the list in snd_usb_get_sample_rate_quirk().
When HD-audio bus receives unsolicited events during its system
suspend/resume (S3 and S4) phase, the controller driver may still try
to process events although the codec chips are already (or yet)
powered down. This might screw up the codec communication, resulting
in CORB/RIRB errors. Such events should be rather skipped, as the
codec chip status such as the jack status will be fully refreshed at
the system resume time.
Since we're tracking the system suspend/resume state in codec
power.power_state field, let's add the check in the common unsol event
handler entry point to filter out such events.
The per_pin->work might be still floating at the suspend, and this may
hit the access to the hardware at an unexpected timing. Cancel the
work properly at the suspend callback for avoiding the buggy access.
Note that the bug doesn't trigger easily in the recent kernels since
the work is queued only when the repoll count is set, and usually it's
only at the resume callback, but it's still possible to hit in
theory.
If iscsi_prep_scsi_cmd_pdu() fails we try to add it back to the cmdqueue,
but we leave it partially setup. We don't have functions that can undo the
pdu and init task setup. We only have cleanup_task which can clean up both
parts. So this has us just fail the cmd and go through the standard cleanup
routine and then have the SCSI midlayer retry it like is done when it fails
in the queuecommand path.
Avoid a potentially large stack frame and overflow by making
"cpumask_t avail" a static variable. There is no concurrent
access due to the existing locking.
Fix a race where a pending interrupt could be received and the handler
called before the handler's data has been setup, by converting to
irq_set_chained_handler_and_data().
See also 2cf5a03cb29d ("PCI/keystone: Fix race in installing chained IRQ
handler").
Based on the mail discussion, it seems ok to drop the error handling.
While sampling for marked events, currently we record the sample only
if the SIAR valid bit of Sampled Instruction Event Register (SIER) is
set. SIAR_VALID bit is used for fetching the instruction address from
Sampled Instruction Address Register(SIAR). But there are some
usecases, where the user is interested only in the PMU stats at each
counter overflow and the exact IP of the overflow event is not
required. Dropping SIAR invalid samples will fail to record some of
the counter overflows in such cases.
Example of such usecase is dumping the PMU stats (event counts) after
some regular amount of instructions/events from the userspace (ex: via
ptrace). Here counter overflow is indicated to userspace via signal
handler, and captured by monitoring and enabling I/O signaling on the
event file descriptor. In these cases, we expect to get
sample/overflow indication after each specified sample_period.
Perf event attribute will not have PERF_SAMPLE_IP set in the
sample_type if exact IP of the overflow event is not requested. So
while profiling if SAMPLE_IP is not set, just record the counter
overflow irrespective of SIAR_VALID check.
when get request SW timeout, if CMD/DAT xfer done irq coming right now,
then there is race between the msdc_request_timeout work and irq handler,
and the host->cmd and host->data may set to NULL in irq handler. also,
current flow ensure that only one path can go to msdc_request_done(), so
no need check the return value of cancel_delayed_work().
usbtv doesn't support power management, so on system suspend the
.disconnect callback of the driver is called. The teardown sequence
includes a call to snd_card_free. Its implementation waits until the
refcount of the sound card device drops to zero, however, if its file is
open, snd_card_file_add takes a reference, which can't be dropped during
the suspend, because the userspace processes are already frozen at this
point. snd_card_free waits for completion forever, leading to a hang on
suspend.
This commit fixes this deadlock condition by replacing snd_card_free
with snd_card_free_when_closed, that doesn't wait until all references
are released, allowing suspend to progress.
We must disable the regulator that was enabled in the probe function.
Fixes: 7994fe55a4a2 ("dm9000: Add regulator and reset support to dm9000") Signed-off-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the probe fails or requests to be defered, we must disable the
regulator that was previously enabled.
Fixes: 7994fe55a4a2 ("dm9000: Add regulator and reset support to dm9000") Signed-off-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For the devices in this driver, the default qdisc is "noqueue",
because their "tx_queue_len" is 0.
In function "__dev_queue_xmit" in "net/core/dev.c", devices with the
"noqueue" qdisc are specially handled. Packets are transmitted without
being queued after a "dev->flags & IFF_UP" check. However, it's possible
that even if this check succeeds, "ops->ndo_stop" may still have already
been called. This is because in "__dev_close_many", "ops->ndo_stop" is
called before clearing the "IFF_UP" flag.
If we call "netif_stop_queue" in "ops->ndo_stop", then it's possible in
"__dev_queue_xmit", it sees the "IFF_UP" flag is present, and then it
checks "netif_xmit_stopped" and finds that the queue is already stopped.
In this case, it will complain that:
"Virtual device ... asks to queue packet!"
To prevent "__dev_queue_xmit" from generating this complaint, we should
not call "netif_stop_queue" in "ops->ndo_stop".
We also don't need to call "netif_start_queue" in "ops->ndo_open",
because after a netdev is allocated and registered, the
"__QUEUE_STATE_DRV_XOFF" flag is initially not set, so there is no need
to call "netif_start_queue" to clear it.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Xie He <xie.he.0141@gmail.com> Acked-by: Martin Schiller <ms@dev.tdt.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes a bug that the moderation config will not be
applied when calling mlx4_en_reset_config. For example, when
turning on rx timestamping, mlx4_en_reset_config() will be called,
causing the NIC to forget previous moderation config.
This fix is in phase with a previous fix:
commit 79c54b6bbf06 ("net/mlx4_en: Fix TX moderation info loss
after set_ringparam is called")
Tested: Before this patch, on a host with NIC using mlx4, run
netserver and stream TCP to the host at full utilization.
$ sar -I SUM 1
INTR intr/s
14:03:56 sum 48758.00
After rx hwtstamp is enabled:
$ sar -I SUM 1
14:10:38 sum 317771.00
We see the moderation is not working properly and issued 7x more
interrupts.
After the patch, and turned on rx hwtstamp, the rate of interrupts
is as expected:
$ sar -I SUM 1
14:52:11 sum 49332.00
Fixes: 79c54b6bbf06 ("net/mlx4_en: Fix TX moderation info loss after set_ringparam is called") Signed-off-by: Kevin(Yudong) Yang <yyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> CC: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The handle_exit_race() function is defined in commit 9c3f39860367
("futex: Cure exit race"), which never returns -EBUSY. This results
in a small piece of dead code in the attach_to_pi_owner() function:
int ret = handle_exit_race(uaddr, uval, p); /* Never return -EBUSY */
...
if (ret == -EBUSY)
*exiting = p; /* dead code */
The return value -EBUSY is added to handle_exit_race() in upsteam
commit ac31c7ff8624409 ("futex: Provide distinct return value when
owner is exiting"). This commit was incorporated into v4.9.255, before
the function handle_exit_race() was introduced, whitout Modify
handle_exit_race().
To fix dead code, extract the change of handle_exit_race() from
commit ac31c7ff8624409 ("futex: Provide distinct return value when owner
is exiting"), re-incorporated.
Lee writes:
This commit takes the remaining functional snippet of:
ac31c7ff8624409 ("futex: Provide distinct return value when owner is exiting")
... and is the correct fix for this issue.
Fixes: 9c3f39860367 ("futex: Cure exit race") Cc: stable@vger.kernel.org # v4.9.258 Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Stefan reported, that the glibc tst-robustpi4 test case fails
occasionally. That case creates the following race between
sys_exit() and sys_futex_lock_pi():
CPU0 CPU1
sys_exit() sys_futex()
do_exit() futex_lock_pi()
exit_signals(tsk) No waiters:
tsk->flags |= PF_EXITING; *uaddr == 0x00000PID
mm_release(tsk) Set waiter bit
exit_robust_list(tsk) { *uaddr = 0x80000PID;
Set owner died attach_to_pi_owner() {
*uaddr = 0xC0000000; tsk = get_task(PID);
} if (!tsk->flags & PF_EXITING) {
... attach();
tsk->flags |= PF_EXITPIDONE; } else {
if (!(tsk->flags & PF_EXITPIDONE))
return -EAGAIN;
return -ESRCH; <--- FAIL
}
ESRCH is returned all the way to user space, which triggers the glibc test
case assert. Returning ESRCH unconditionally is wrong here because the user
space value has been changed by the exiting task to 0xC0000000, i.e. the
FUTEX_OWNER_DIED bit is set and the futex PID value has been cleared. This
is a valid state and the kernel has to handle it, i.e. taking the futex.
Cure it by rereading the user space value when PF_EXITING and PF_EXITPIDONE
is set in the task which 'owns' the futex. If the value has changed, let
the kernel retry the operation, which includes all regular sanity checks
and correctly handles the FUTEX_OWNER_DIED case.
If it hasn't changed, then return ESRCH as there is no way to distinguish
this case from malfunctioning user space. This happens when the exiting
task did not have a robust list, the robust list was corrupted or the user
space value in the futex was simply bogus.
Reported-by: Stefan Liebler <stli@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467 Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[Lee: Required to satisfy functional dependency from futex back-port.
Re-add the missing handle_exit_race() parts from: 3d4775df0a89 ("futex: Replace PF_EXITPIDONE with a state")] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
creates another set of problems, especially priority inversions on RT where
hb->lock becomes a rt_mutex itself.
The rt_mutex::wait_lock is the most obvious protection for keeping the
futex user space value and the kernel internal pi_state in sync.
Rework and document the locking so rt_mutex::wait_lock is held accross all
operations which modify the user space value and the pi state.
This allows to invoke rt_mutex_unlock() (including deboost) without holding
hb->lock as a next step.
Nothing yet relies on the new locking rules.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Lee: Back-ported in support of a previous futex back-port attempt] Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The kernel test robot reports a huge performance regression due to the
commit, and the reason seems fairly straightforward: when there is
contention on the page list (which is what causes acquire_slab() to
fail), we do _not_ want to just loop and try again, because that will
transfer the contention to the 'n->list_lock' spinlock we hold, and
just make things even worse.
This is admittedly likely a problem only on big machines - the kernel
test robot report comes from a 96-thread dual socket Intel Xeon Gold
6252 setup, but the regression there really is quite noticeable:
-47.9% regression of stress-ng.rawpkt.ops_per_sec
and the commit that was marked as being fixed (7ced37197196: "slub:
Acquire_slab() avoid loop") actually did the loop exit early very
intentionally (the hint being that "avoid loop" part of that commit
message), exactly to avoid this issue.
The correct thing to do may be to pick some kind of reasonable middle
ground: instead of breaking out of the loop on the very first sign of
contention, or trying over and over and over again, the right thing may
be to re-try _once_, and then give up on the second failure (or pick
your favorite value for "once"..).
Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/ Cc: Jann Horn <jannh@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
floppy_revalidate() doesn't perform any error handling on lock_fdc()
result. lock_fdc() might actually be interrupted by a signal (it waits for
fdc becoming non-busy interruptibly). In such case, floppy_revalidate()
proceeds as if it had claimed the lock, but it fact it doesn't.
In case of multiple threads trying to open("/dev/fdX"), this leads to
serious corruptions all over the place, because all of a sudden there is
no critical section protection (that'd otherwise be guaranteed by locked
fd) whatsoever.
While at this, fix the fact that the 'interruptible' parameter to
lock_fdc() doesn't make any sense whatsoever, because we always wait
interruptibly anyway.
Most of the lock_fdc() callsites do properly handle error (and propagate
EINTR), but floppy_revalidate() and floppy_check_events() don't. Fix this.
In cifs_statfs(), if server->ops->queryfs is not NULL, then we should
use its return value rather than always returning 0. Instead, use rc
variable as it is properly set to 0 in case there is no
server->ops->queryfs.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
nested target/match_revfn() calls work with xt[NFPROTO_UNSPEC] lists
without taking xt[NFPROTO_UNSPEC].mutex. This can race with module unload
and cause host to crash:
Assert HALT bit to enter freeze mode, there is a premise that FRZ bit is
asserted. This patch asserts FRZ bit in flexcan_chip_freeze, although
the reset value is 1b'1. This is a prepare patch, later patch will
invoke flexcan_chip_freeze() to enter freeze mode, which polling freeze
mode acknowledge.
Fixes: b1aa1c7a2165b ("can: flexcan: fix transition from and to freeze mode in chip_{,un}freeze") Link: https://lore.kernel.org/r/20210218110037.16591-2-qiangqing.zhang@nxp.com Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are two ref count variables controlling the free()ing of a socket:
- struct sock::sk_refcnt - which is changed by sock_hold()/sock_put()
- struct sock::sk_wmem_alloc - which accounts the memory allocated by
the skbs in the send path.
In case there are still TX skbs on the fly and the socket() is closed,
the struct sock::sk_refcnt reaches 0. In the TX-path the CAN stack
clones an "echo" skb, calls sock_hold() on the original socket and
references it. This produces the following back trace:
We noticed a GRO issue for UDP-based encaps such as vxlan/geneve when the
csum for the UDP header itself is 0. In that case, GRO aggregation does
not take place on the phys dev, but instead is deferred to the vxlan/geneve
driver (see trace below).
The reason is essentially that GRO aggregation bails out in udp_gro_receive()
for such case when drivers marked the skb with CHECKSUM_UNNECESSARY (ice, i40e,
others) where for non-zero csums 2abb7cdc0dc8 ("udp: Add support for doing
checksum unnecessary conversion") promotes those skbs to CHECKSUM_COMPLETE
and napi context has csum_valid set. This is however not the case for zero
UDP csum (here: csum_cnt is still 0 and csum_valid continues to be false).
At the same time 57c67ff4bd92 ("udp: additional GRO support") added matches
on !uh->check ^ !uh2->check as part to determine candidates for aggregation,
so it certainly is expected to handle zero csums in udp_gro_receive(). The
purpose of the check added via 662880f44203 ("net: Allow GRO to use and set
levels of checksum unnecessary") seems to catch bad csum and stop aggregation
right away.
One way to fix aggregation in the zero case is to only perform the !csum_valid
check in udp_gro_receive() if uh->check is infact non-zero.
# netperf -H 10.55.10.4 -t TCP_STREAM -l 20
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.55.10.4 () port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 20.01 24576.53
Fixes: 57c67ff4bd92 ("udp: additional GRO support") Fixes: 662880f44203 ("net: Allow GRO to use and set levels of checksum unnecessary") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Tom Herbert <tom@herbertland.com> Acked-by: Willem de Bruijn <willemb@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20210226212248.8300-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When transmitting to a receiver in dynamic SMPS mode, all transmissions that
use multiple spatial streams need to be sent using CTS-to-self or RTS/CTS to
give the receiver's extra chains some time to wake up.
This fixes the tx rate getting stuck at <= MCS7 for some clients, especially
Intel ones, which make aggressive use of SMPS.
Cc: stable@vger.kernel.org Reported-by: Martin Kennedy <hurricos@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Link: https://lore.kernel.org/r/20210214184911.96702-1-nbd@nbd.name Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Apparently, <linux/netfilter/nfnetlink_cthelper.h> and
<linux/netfilter/nfnetlink_acct.h> could not be included into the same
compilation unit because of a cut-and-paste typo in the former header.
Add a new force_caps module parameter to allow overriding the drivers
builtin capability detection mechanism.
This can be used to for example:
-Disable rfkill functionality on devices where there is an AA OEM DMI
record advertising non functional rfkill switches
-Force loading of the driver on devices with a missing AA OEM DMI record
Note that force_caps is -1 when unset, this allows forcing the
capability field to 0, which results in acer-wmi only providing WMI
hotkey handling while disabling all other (led, rfkill, backlight)
functionality.
According to the definition of dm_iterate_devices_fn:
* This function must iterate through each section of device used by the
* target until it encounters a non-zero return code, which it then returns.
* Returns zero if no callout returned non-zero.
For some target type (e.g. dm-stripe), one call of iterate_devices() may
iterate multiple underlying devices internally, in which case a non-zero
return code returned by iterate_devices_callout_fn will stop the iteration
in advance. No iterate_devices_callout_fn should return non-zero unless
device iteration should stop.
Rename dm_table_requires_stable_pages() to dm_table_any_dev_attr() and
elevate it for reuse to stop iterating (and return non-zero) on the
first device that causes iterate_devices_callout_fn to return non-zero.
Use dm_table_any_dev_attr() to properly iterate through devices.
Rename device_is_nonrot() to device_is_rotational() and invert logic
accordingly to fix improper disposition.
[jeffle: backport notes]
No stable writes. Also convert the no_sg_merge capability check,
which is introduced by commit 200612ec33e5 ("dm table: propagate
QUEUE_FLAG_NO_SG_MERGE"), and removed since commit 2705c93742e9 ("block:
kill QUEUE_FLAG_NO_SG_MERGE") in v5.1.
Fixes: c3c4555edd10 ("dm table: clear add_random unless all devices have it set") Fixes: 4693c9668fdc ("dm table: propagate non rotational flag") Cc: stable@vger.kernel.org Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The copy_to_user() function returns the number of bytes remaining but
we want to return -EFAULT to the user if it can't complete the copy.
The "st" variable only holds zero on success or negative error codes on
failure so the type should be int.
Fixes: 36f988e978f8 ("rsxx: Adding in debugfs entries.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently the mask operation on variable conf is just 3 bits so
the switch statement case value of 8 is unreachable dead code.
The function daio_mgr_dao_init can be passed a 4 bit value,
function dao_rsc_init calls it with conf set to:
conf = (desc->msr & 0x7) | (desc->passthru << 3);
so clearly when desc->passthru is set to 1 then conf can be
at least 8.
So this patch is just an obvious quickfix for now.
Hint: the lock order is documented in 4.9.y and later. A similar
documenting is missing in 4.4.y. Please somebody either backport also,
or write a new description, if there would be some differences I cannot
easily see at the moment. Without reliable docs,
inspection of the locking correctness may become a pain.
Signed-off-by: Thomas Schoebel-Theuer <tst@1und1.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Lee Jones <lee.jones@linaro.org> Fixes: 394fc4981426 ("futex: Rework inconsistent rt_mutex/futex_q state") Fixes: 6510e4a2d04f ("futex,rt_mutex: Provide futex specific rt_mutex API") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
futex: fix irq self-deadlock and satisfy assertion
This patch and problem analysis is specific for 4.4 LTS, due to incomplete
backporting of other fixes. Later LTS series have different backports.
Since v4.4.257 when CONFIG_PROVE_LOCKING=y
the following triggers right after reboot of our pre-life systems
which equal our production setup:
Mar 03 11:27:33 icpu-test-bap10 kernel: =================================
Mar 03 11:27:33 icpu-test-bap10 kernel: [ INFO: inconsistent lock state ]
Mar 03 11:27:33 icpu-test-bap10 kernel: 4.4.259-rc1-grsec+ #730 Not tainted
Mar 03 11:27:33 icpu-test-bap10 kernel: ---------------------------------
Mar 03 11:27:33 icpu-test-bap10 kernel: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
Mar 03 11:27:33 icpu-test-bap10 kernel: apache2-ssl/9310 [HC0[0]:SC0[0]:HE1:SE1] takes:
Mar 03 11:27:33 icpu-test-bap10 kernel: (&p->pi_lock){?.-.-.}, at: [<ffffffff810abb68>] pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: {IN-HARDIRQ-W} state was registered at:
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088c4a>] __lock_acquire+0x3a7/0xe4a
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8170151c>] _raw_spin_lock_irqsave+0x3e/0x50
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810719a5>] try_to_wake_up+0x2c/0x210
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81071bf3>] default_wake_function+0xd/0xf
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083588>] autoremove_wake_function+0x11/0x35
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810830b2>] __wake_up_common+0x48/0x7c
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8108311a>] __wake_up+0x34/0x46
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814c2a23>] megasas_complete_int_cmd+0x31/0x33
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814c60a0>] megasas_complete_cmd+0x570/0x57b
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814d05bc>] complete_cmd_fusion+0x23e/0x33d
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff814d0768>] megasas_isr_fusion+0x67/0x74
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81091ae5>] handle_irq_event_percpu+0x134/0x311
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81091cf5>] handle_irq_event+0x33/0x51
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810948b9>] handle_edge_irq+0xa3/0xc2
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81005f7b>] handle_irq+0xf9/0x101
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81005700>] do_IRQ+0x80/0xf5
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81702228>] ret_from_intr+0x0/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8100cab0>] arch_cpu_idle+0xa/0xc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083a5a>] default_idle_call+0x1e/0x20
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81083b9d>] cpu_startup_entry+0x141/0x22f
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff816fb853>] rest_init+0x135/0x13b
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5ce99>] start_kernel+0x3fa/0x40a
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5c2af>] x86_64_start_reservations+0x2a/0x2c
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81d5c3d0>] x86_64_start_kernel+0x11f/0x12c
Mar 03 11:27:33 icpu-test-bap10 kernel: irq event stamp: 1457
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last enabled at (1457): [<ffffffff81042a69>] get_user_pages_fast+0xeb/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: hardirqs last disabled at (1456): [<ffffffff810429dd>] get_user_pages_fast+0x5f/0x14f
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last enabled at (1446): [<ffffffff815e127d>] release_sock+0x142/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel: softirqs last disabled at (1444): [<ffffffff815e116f>] release_sock+0x34/0x14d
Mar 03 11:27:33 icpu-test-bap10 kernel:
other info that might help us debug this:
Mar 03 11:27:33 icpu-test-bap10 kernel: Possible unsafe locking scenario:
Mar 03 11:27:33 icpu-test-bap10 kernel: CPU0
Mar 03 11:27:33 icpu-test-bap10 kernel: ----
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(&p->pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel: <Interrupt>
Mar 03 11:27:33 icpu-test-bap10 kernel: lock(&p->pi_lock);
Mar 03 11:27:33 icpu-test-bap10 kernel:
*** DEADLOCK ***
Mar 03 11:27:33 icpu-test-bap10 kernel: 2 locks held by apache2-ssl/9310:
Mar 03 11:27:33 icpu-test-bap10 kernel: #0: (&(&(__futex_data.queues)[i].lock)->rlock){+.+...}, at: [<ffffffff810ae4e6>] do
Mar 03 11:27:33 icpu-test-bap10 kernel: #1: (&lock->wait_lock){+.+...}, at: [<ffffffff810ae53a>] do_futex+0x639/0x809
Mar 03 11:27:33 icpu-test-bap10 kernel:
stack backtrace:
Mar 03 11:27:33 icpu-test-bap10 kernel: CPU: 13 PID: 9310 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #730
Mar 03 11:27:33 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019
Mar 03 11:27:33 icpu-test-bap10 kernel: 0000000000000000ffff883fb79bfc00ffffffff816f8fc2ffff883ffa66d300
Mar 03 11:27:33 icpu-test-bap10 kernel: ffffffff8eaa71f0ffff883fb79bfc50ffffffff810884840000000000000000
Mar 03 11:27:33 icpu-test-bap10 kernel: 000000000000000100000000000000010000000000000002ffff883ffa66db58
Mar 03 11:27:33 icpu-test-bap10 kernel: Call Trace:
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff816f8fc2>] dump_stack+0x94/0xca
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088484>] print_usage_bug+0x1bc/0x1d1
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81087d76>] ? check_usage_forwards+0x98/0x98
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810885a5>] mark_lock+0x10c/0x203
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81088cb9>] __lock_acquire+0x416/0xe4a
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81089b01>] ? lock_acquire+0x18d/0x1bc
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81700d12>] _raw_spin_lock+0x2a/0x39
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] ? pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810abb68>] pi_state_update_owner+0x51/0xd7
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810ae5af>] do_futex+0x6ae/0x809
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff810ae83d>] SyS_futex+0x133/0x143
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff8100158a>] ? syscall_trace_enter_phase2+0x1a2/0x1bb
Mar 03 11:27:33 icpu-test-bap10 kernel: [<ffffffff81701848>] tracesys_phase2+0x90/0x95
Bisecting detects 47e452fcf2f
in the above specific scenario using apache-ssl,
but apparently the missing *_irq() was introduced in 34c8e1c2c02.
However, just reverting the old _irq() variants to a similar status
than before 34c8e1c2c02,
or using _irqsave() / _irqrestore() as some other backports are doing
in various places, would not really help.
The fundamental problem is the following violation of the assertion
lockdep_assert_held(&pi_state->pi_mutex.wait_lock) in pi_state_update_owner():
Mar 03 12:50:03 icpu-test-bap10 kernel: ------------[ cut here ]------------
Mar 03 12:50:03 icpu-test-bap10 kernel: WARNING: CPU: 37 PID: 8488 at kernel/futex.c:844 pi_state_update_owner+0x3d/0xd7()
Mar 03 12:50:03 icpu-test-bap10 kernel: Modules linked in: xt_time xt_connlimit xt_connmark xt_NFLOG xt_limit xt_hashlimit veth ip_set_bitmap_port xt_DSCP xt_multiport ip_set_hash_ip xt_owner xt_set ip_set_hash_net xt_state xt_conntrack nf_conntrack_ftp mars lz4_decompress lz4_compress ipmi_devintf x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul hed ipmi_si ipmi_msghandler processor crc32c_intel ehci_pci ehci_hcd usbcore i40e usb_common
Mar 03 12:50:03 icpu-test-bap10 kernel: CPU: 37 PID: 8488 UID: 99 Comm: apache2-ssl Not tainted 4.4.259-rc1-grsec+ #737
Mar 03 12:50:03 icpu-test-bap10 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.11.0 11/02/2019
Mar 03 12:50:03 icpu-test-bap10 kernel: 0000000000000000ffff883f863f7c70ffffffff816f90020000000000000000
Mar 03 12:50:03 icpu-test-bap10 kernel: 0000000000000009ffff883f863f7ca8ffffffff8104cda2ffffffff810abac7
Mar 03 12:50:03 icpu-test-bap10 kernel: ffff883ffbfe5e800000000000000000ffff883f82ed4bc000007fc01c9bf000
Mar 03 12:50:03 icpu-test-bap10 kernel: Call Trace:
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff816f9002>] dump_stack+0x94/0xca
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8104cda2>] warn_slowpath_common+0x94/0xad
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abac7>] ? pi_state_update_owner+0x3d/0xd7
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8104ce5f>] warn_slowpath_null+0x15/0x17
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abac7>] pi_state_update_owner+0x3d/0xd7
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abea8>] free_pi_state+0x2d/0x73
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810abf0b>] unqueue_me_pi+0x1d/0x31
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ad735>] futex_lock_pi+0x27a/0x2e8
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff81088bca>] ? __lock_acquire+0x327/0xe4a
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ae6a9>] do_futex+0x784/0x809
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810cfa9a>] ? seccomp_phase1+0xde/0x1e7
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810a4503>] ? current_kernel_time64+0xb/0x31
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810d23c3>] ? current_kernel_time+0xb/0xf
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff810ae861>] SyS_futex+0x133/0x143
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff8100158a>] ? syscall_trace_enter_phase2+0x1a2/0x1bb
Mar 03 12:50:03 icpu-test-bap10 kernel: [<ffffffff81701888>] tracesys_phase2+0x90/0x95
Mar 03 12:50:03 icpu-test-bap10 kernel: ---[ end trace 968f95a458dea951 ]---
In order to both (1) prevent the self-deadlock, and (2) to satisfy the assertion
at pi_state_update_owner(), some locking with irq disable is needed,
at least in the specific call stack.
Interestingly, there existed a suchalike locking just before f08a4af5ccb.
This is just a quick hotfix, resurrecting some previous
locks at the old places, but now using ->wait_lock in place
of the previous ->pi_lock (which was in place before f08a4af5ccb).
The ->pi_lock is now also taken, by the new code
which had been introduced in 34c8e1c2c02.
When this patch is applied, both the above splats are
no longer triggering at my prelife machines.
Without this patch, I cannot ensure stable production at
1&1 Ionos.
Hint for further work: I have not yet tested other call paths,
since I am under time pressure for security reasons.
Hint for further hardening of 4.4.y and probably some more LTS series:
Probably some more systematic testing with CONFIG_PROVE_LOCKING
(and probably some more options) should be invested
in order to make the 4.4 LTS series really "stable" again.
Signed-off-by: Thomas Schoebel-Theuer <tst@1und1.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Lee Jones <lee.jones@linaro.org> Fixes: f08a4af5ccb2 ("futex: Use pi_state_update_owner() in put_pi_state()") Fixes: 34c8e1c2c025 ("futex: Provide and use pi_state_update_owner()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When an IOCTL with argument size larger than 128 that also used array
arguments were handled, two memory allocations were made but alas, only
the latter one of them was released. This happened because there was only
a single local variable to hold such a temporary allocation.
Fix this by adding separate variables to hold the pointers to the
temporary allocations.
We're not factoring in the start of the file for where to write and
read the swapfile, which leads to very unfortunate side effects of
writing where we should not be...
[This issue only affects swapfiles on filesystems on top of blockdevs
that implement rw_page ops (brd, zram, btt, pmem), and not on top of any
other block devices, in contrast to the upstream commit fix.]
There exists multiple path may do zram compaction concurrently.
1. auto-compaction triggered during memory reclaim
2. userspace utils write zram<id>/compaction node
So, multiple threads may call zs_shrinker_scan/zs_compact concurrently.
But pages_compacted is a per zsmalloc pool variable and modification
of the variable is not serialized(through under class->lock).
There are two issues here:
1. the pages_compacted may not equal to total number of pages
freed(due to concurrently add).
2. zs_shrinker_scan may not return the correct number of pages
freed(issued by current shrinker).
The fix is simple:
1. account the number of pages freed in zs_compact locally.
2. use actomic variable pages_compacted to accumulate total number.
Link: https://lkml.kernel.org/r/20210202122235.26885-1-wu-yan@tcl.com Fixes: 860c707dca155a56 ("zsmalloc: account the number of compacted pages") Signed-off-by: Rokudo Yan <wu-yan@tcl.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 3194a1746e8a ("xen-netback: don't "handle" error by BUG()")
dropped respective a BUG_ON() without noticing that with this the
variable's value wouldn't be consumed anymore. With gnttab_set_map_op()
setting all status fields to a non-zero value, in case of an error no
slot should have a status of GNTST_okay (zero).
Bailing immediately from set_foreign_p2m_mapping() upon a p2m updating
error leaves the full batch in an ambiguous state as far as the caller
is concerned. Instead flags respective slots as bad, unmapping what
was mapped there right away.
HYPERVISOR_grant_table_op()'s return value and the individual unmap
slots' status fields get used only for a one-time - there's not much we
can do in case of a failure.
Note that there's no GNTST_enomem or alike, so GNTST_general_error gets
used.
The map ops' handle fields get overwritten just to be on the safe side.
Open-iSCSI sends passthrough PDUs over netlink, but the kernel should be
verifying that the provided PDU header and data lengths fall within the
netlink message to prevent accessing beyond that in memory.
Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As the iSCSI parameters are exported back through sysfs, it should be
enforcing that they never are more than PAGE_SIZE (which should be more
than enough) before accepting updates through netlink.
Change all iSCSI sysfs attributes to use sysfs_emit().
Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Output defects can exist in sysfs content using sprintf and snprintf.
sprintf does not know the PAGE_SIZE maximum of the temporary buffer
used for outputting sysfs content and it's possible to overrun the
PAGE_SIZE buffer length.
Add a generic sysfs_emit function that knows that the size of the
temporary buffer and ensures that no overrun is done.
Add a generic sysfs_emit_at function that can be used in multiple
call situations that also ensures that no overrun is done.
Validate the output buffer argument to be page aligned.
Validate the offset len argument to be within the PAGE_SIZE buf.
Protect the iSCSI transport handle, available in sysfs, by requiring
CAP_SYS_ADMIN to read it. Also protect the netlink socket by restricting
reception of messages to ones sent with CAP_SYS_ADMIN. This disables
normal users from being able to end arbitrary iSCSI sessions.
Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
b21ebf2fb4cd ("x86: Treat R_X86_64_PLT32 as R_X86_64_PC32")
but for i386. As far as the kernel is concerned, R_386_PLT32 can be
treated the same as R_386_PC32.
R_386_PLT32/R_X86_64_PLT32 are PC-relative relocation types which
can only be used by branches. If the referenced symbol is defined
externally, a PLT will be used.
R_386_PC32/R_X86_64_PC32 are PC-relative relocation types which can be
used by address taking operations and branches. If the referenced symbol
is defined externally, a copy relocation/canonical PLT entry will be
created in the executable.
On x86-64, there is no PIC vs non-PIC PLT distinction and an
R_X86_64_PLT32 relocation is produced for both `call/jmp foo` and
`call/jmp foo@PLT` with newer (2018) GNU as/LLVM integrated assembler.
This avoids canonical PLT entries (st_shndx=0, st_value!=0).
On i386, there are 2 types of PLTs, PIC and non-PIC. Currently,
the GCC/GNU as convention is to use R_386_PC32 for non-PIC PLT and
R_386_PLT32 for PIC PLT. Copy relocations/canonical PLT entries
are possible ABI issues but GCC/GNU as will likely keep the status
quo because (1) the ABI is legacy (2) the change will drop a GNU
ld diagnostic for non-default visibility ifunc in shared objects.
clang-12 -fno-pic (since [1]) can emit R_386_PLT32 for compiler
generated function declarations, because preventing canonical PLT
entries is weighed over the rare ifunc diagnostic.
[84977.840894] ath10k_snoc a000000.wifi: wmi mgmt tx queue is full
[84977.840913] ath10k_snoc a000000.wifi: failed to transmit packet, dropping: -28
[84977.840924] ath10k_snoc a000000.wifi: failed to submit frame: -28
[84977.840932] ath10k_snoc a000000.wifi: failed to transmit frame: -28
This issue is caused by race condition between skb_dequeue and
__skb_queue_tail. The queue of ‘wmi_mgmt_tx_queue’ is protected by a
different lock: ar->data_lock vs list->lock, the result is no protection.
So when ath10k_mgmt_over_wmi_tx_work() and ath10k_mac_tx_wmi_mgmt()
running concurrently on different CPUs, there appear to be a rare corner
cases when the queue length is 1,
If the instruction ‘next = skb->next’ is executed before
‘WRITE_ONCE(prev->next, newsk)’, newsk will be lost, as CPUx get the
old ‘next’ pointer, but the length is still added by one. The final
result is the length of the queue will reach the maximum value but
the queue is empty.
So remove ar->data_lock, and use 'skb_queue_tail' instead of
'__skb_queue_tail' to prevent the potential race condition. Also switch
to use skb_queue_len_lockless, in case we queue a few SKBs simultaneously.
pktgen create threads for all online cpus and bond these threads to
relevant cpu repecivtily. when this thread firstly be woken up, it
will compare cpu currently running with the cpu specified at the time
of creation and if the two cpus are not equal, BUG_ON() will take effect
causing panic on the system.
Notice that these threads could be migrated to other cpus before start
running because of the cpu hotplug after these threads have created. so the
BUG_ON() used here seems unreasonable and we can replace it with WARN_ON()
to just printf a warning other than panic the system.
We can currently get a "command execute failure 19" error on beacon loss
if the signal is weak:
wlcore: Beacon loss detected. roles:0xff
wlcore: Connection loss work (role_id: 0).
...
wlcore: ERROR command execute failure 19
...
WARNING: CPU: 0 PID: 1552 at drivers/net/wireless/ti/wlcore/main.c:803
...
(wl12xx_queue_recovery_work.part.0 [wlcore])
(wl12xx_cmd_role_start_sta [wlcore])
(wl1271_op_bss_info_changed [wlcore])
(ieee80211_prep_connection [mac80211])
Error 19 is defined as CMD_STATUS_WRONG_NESTING from the wlcore firmware,
and seems to mean that the firmware no longer wants to see the quirk
handling for WLCORE_QUIRK_START_STA_FAILS done.
This quirk got added with commit 18eab430700d ("wlcore: workaround
start_sta problem in wl12xx fw"), and it seems that this already got fixed
in the firmware long time ago back in 2012 as wl18xx never had this quirk
in place to start with.
As we no longer even support firmware that early, to me it seems that it's
safe to just drop WLCORE_QUIRK_START_STA_FAILS to fix the error. Looks
like earlier firmware got disabled back in 2013 with commit 0e284c074ef9
("wl12xx: increase minimum singlerole firmware version required").
If it turns out we still need WLCORE_QUIRK_START_STA_FAILS with any
firmware that the driver works with, we can simply revert this patch and
add extra checks for firmware version used.
With this fix wlcore reconnects properly after a beacon loss.