Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.
Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given CUBIC bug has been there
forever before Jana noticed it.
Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.
This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.
Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Jana Iyengar <jri@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When we call btrfs_commit_transaction(), we splice the list "ordered"
of our transaction handle into the transaction's "pending_ordered"
list, but we don't re-initialize the "ordered" list of our transaction
handle, this means it still points to the same elements it used to
before the splice. Then we check if the current transaction's state is
>= TRANS_STATE_COMMIT_START and if it is we end up calling
btrfs_end_transaction() which simply splices again the "ordered" list
of our handle into the transaction's "pending_ordered" list, leaving
multiple pointers to the same ordered extents which results in list
corruption when we are iterating, removing and freeing ordered extents
at btrfs_wait_pending_ordered(), resulting in access to dangling
pointers / use-after-free issues.
Similarly, btrfs_end_transaction() can end up in some cases calling
btrfs_commit_transaction(), and both did a list splice of the transaction
handle's "ordered" list into the transaction's "pending_ordered" without
re-initializing the handle's "ordered" list, resulting in exactly the
same problem.
This produces the following warning on a kernel with linked list
debugging enabled:
On a non-debug kernel this leads to invalid memory accesses, causing a
crash. Fix this by using list_splice_init() instead of list_splice() in
btrfs_commit_transaction() and btrfs_end_transaction().
Cc: stable@vger.kernel.org Fixes: 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3" Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Mike Galbraith [Sat, 23 Apr 2016 00:38:23 +0000 (20:38 -0400)]
Correct backport of fa3c776 ("Thermal: Ignore invalid trip points")
Backport of 81ad4276b505e987dd8ebbdf63605f92cd172b52 failed to adjust
for intervening ->get_trip_temp() argument type change, thus causing
stack protector to panic.
drivers/thermal/thermal_core.c: In function ‘thermal_zone_device_register’:
drivers/thermal/thermal_core.c:1569:41: warning: passing argument 3 of
‘tz->ops->get_trip_temp’ from incompatible pointer type [-Wincompatible-pointer-types]
if (tz->ops->get_trip_temp(tz, count, &trip_temp))
^
drivers/thermal/thermal_core.c:1569:41: note: expected ‘long unsigned int *’
but argument is of type ‘int *’
CC: <stable@vger.kernel.org> #3.18,#4.1 Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Jana Iyengar found an interesting issue on CUBIC :
The epoch is only updated/reset initially and when experiencing losses.
The delta "t" of now - epoch_start can be arbitrary large after app idle
as well as the bic_target. Consequentially the slope (inverse of
ca->cnt) would be really large, and eventually ca->cnt would be
lower-bounded in the end to 2 to have delayed-ACK slow-start behavior.
This particularly shows up when slow_start_after_idle is disabled
as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle
time.
Jana initial fix was to reset epoch_start if app limited,
but Neal pointed out it would ask the CUBIC algorithm to recalculate the
curve so that we again start growing steeply upward from where cwnd is
now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth
curve to be the same shape, just shifted later in time by the amount of
the idle period.
Reported-by: Jana Iyengar <jri@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Sangtae Ha <sangtae.ha@gmail.com> Cc: Lawrence Brakmo <lawrence@brakmo.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
On BXT platform Host Controller and Device Controller figure as
same PCI device but with different device function. HCD should
not pass data to Device Controller but only to Host Controllers.
Checking if companion device is Host Controller, otherwise skip.
Cc: <stable@vger.kernel.org> Signed-off-by: Robert Dobrowolski <robert.dobrowolski@linux.intel.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
PCI hotpluggable xhci controllers such as some Alpine Ridge solutions will
remove the xhci controller from the PCI bus when the last USB device is
disconnected.
Add a flag to indicate that the host is being removed to avoid queueing
configure_endpoint commands for the dropped endpoints.
For PCI hotplugged controllers this will prevent 5 second command timeouts
For static xhci controllers the configure_endpoint command is not needed
in the removal case as everything will be returned, freed, and the
controller is reset.
For now the flag is only set for PCI connected host controllers.
The problem seems to be that if a new device is detected
while we have already removed the shared HCD, then many of the
xhci operations (e.g. xhci_alloc_dev(), xhci_setup_device())
hang as command never completes.
I don't think XHCI can operate without the shared HCD as we've
already called xhci_halt() in xhci_only_stop_hcd() when shared HCD
goes away. We need to prevent new commands from being queued
not only when HCD is dying but also when HCD is halted.
The following lockup was detected while testing the otg state
machine.
[ 178.199951] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[ 178.205799] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 1
[ 178.214458] xhci-hcd xhci-hcd.0.auto: hcc params 0x0220f04c hci version 0x100 quirks 0x00010010
[ 178.223619] xhci-hcd xhci-hcd.0.auto: irq 400, io mem 0x48890000
[ 178.230677] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[ 178.237796] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 178.245358] usb usb1: Product: xHCI Host Controller
[ 178.250483] usb usb1: Manufacturer: Linux 4.0.0-rc1-00024-g6111320 xhci-hcd
[ 178.257783] usb usb1: SerialNumber: xhci-hcd.0.auto
[ 178.267014] hub 1-0:1.0: USB hub found
[ 178.272108] hub 1-0:1.0: 1 port detected
[ 178.278371] xhci-hcd xhci-hcd.0.auto: xHCI Host Controller
[ 178.284171] xhci-hcd xhci-hcd.0.auto: new USB bus registered, assigned bus number 2
[ 178.294038] usb usb2: New USB device found, idVendor=1d6b, idProduct=0003
[ 178.301183] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 178.308776] usb usb2: Product: xHCI Host Controller
[ 178.313902] usb usb2: Manufacturer: Linux 4.0.0-rc1-00024-g6111320 xhci-hcd
[ 178.321222] usb usb2: SerialNumber: xhci-hcd.0.auto
[ 178.329061] hub 2-0:1.0: USB hub found
[ 178.333126] hub 2-0:1.0: 1 port detected
[ 178.567585] dwc3 48890000.usb: usb_otg_start_host 0
[ 178.572707] xhci-hcd xhci-hcd.0.auto: remove, state 4
[ 178.578064] usb usb2: USB disconnect, device number 1
[ 178.586565] xhci-hcd xhci-hcd.0.auto: USB bus 2 deregistered
[ 178.592585] xhci-hcd xhci-hcd.0.auto: remove, state 1
[ 178.597924] usb usb1: USB disconnect, device number 1
[ 178.603248] usb 1-1: new high-speed USB device number 2 using xhci-hcd
[ 190.597337] INFO: task kworker/u4:0:6 blocked for more than 10 seconds.
[ 190.604273] Not tainted 4.0.0-rc1-00024-g6111320 #1058
[ 190.610228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 190.618443] kworker/u4:0 D c05c0ac0 0 6 2 0x00000000
[ 190.625120] Workqueue: usb_otg usb_otg_work
[ 190.629533] [<c05c0ac0>] (__schedule) from [<c05c10ac>] (schedule+0x34/0x98)
[ 190.636915] [<c05c10ac>] (schedule) from [<c05c1318>] (schedule_preempt_disabled+0xc/0x10)
[ 190.645591] [<c05c1318>] (schedule_preempt_disabled) from [<c05c23d0>] (mutex_lock_nested+0x1ac/0x3fc)
[ 190.655353] [<c05c23d0>] (mutex_lock_nested) from [<c046cf8c>] (usb_disconnect+0x3c/0x208)
[ 190.664043] [<c046cf8c>] (usb_disconnect) from [<c0470cf0>] (_usb_remove_hcd+0x98/0x1d8)
[ 190.672535] [<c0470cf0>] (_usb_remove_hcd) from [<c0485da8>] (usb_otg_start_host+0x50/0xf4)
[ 190.681299] [<c0485da8>] (usb_otg_start_host) from [<c04849a4>] (otg_set_protocol+0x5c/0xd0)
[ 190.690153] [<c04849a4>] (otg_set_protocol) from [<c0484b88>] (otg_set_state+0x170/0xbfc)
[ 190.698735] [<c0484b88>] (otg_set_state) from [<c0485740>] (otg_statemachine+0x12c/0x470)
[ 190.707326] [<c0485740>] (otg_statemachine) from [<c0053c84>] (process_one_work+0x1b4/0x4a0)
[ 190.716162] [<c0053c84>] (process_one_work) from [<c00540f8>] (worker_thread+0x154/0x44c)
[ 190.724742] [<c00540f8>] (worker_thread) from [<c0058f88>] (kthread+0xd4/0xf0)
[ 190.732328] [<c0058f88>] (kthread) from [<c000e810>] (ret_from_fork+0x14/0x24)
[ 190.739898] 5 locks held by kworker/u4:0/6:
[ 190.744274] #0: ("%s""usb_otg"){.+.+.+}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.752799] #1: ((&otgd->work)){+.+.+.}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.761326] #2: (&otgd->fsm.lock){+.+.+.}, at: [<c048562c>] otg_statemachine+0x18/0x470
[ 190.769934] #3: (usb_bus_list_lock){+.+.+.}, at: [<c0470ce8>] _usb_remove_hcd+0x90/0x1d8
[ 190.778635] #4: (&dev->mutex){......}, at: [<c046cf8c>] usb_disconnect+0x3c/0x208
[ 190.786700] INFO: task kworker/1:0:14 blocked for more than 10 seconds.
[ 190.793633] Not tainted 4.0.0-rc1-00024-g6111320 #1058
[ 190.799567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 190.807783] kworker/1:0 D c05c0ac0 0 14 2 0x00000000
[ 190.814457] Workqueue: usb_hub_wq hub_event
[ 190.818866] [<c05c0ac0>] (__schedule) from [<c05c10ac>] (schedule+0x34/0x98)
[ 190.826252] [<c05c10ac>] (schedule) from [<c05c4e40>] (schedule_timeout+0x13c/0x1ec)
[ 190.834377] [<c05c4e40>] (schedule_timeout) from [<c05c19f0>] (wait_for_common+0xbc/0x150)
[ 190.843062] [<c05c19f0>] (wait_for_common) from [<bf068a3c>] (xhci_setup_device+0x164/0x5cc [xhci_hcd])
[ 190.852986] [<bf068a3c>] (xhci_setup_device [xhci_hcd]) from [<c046b7f4>] (hub_port_init+0x3f4/0xb10)
[ 190.862667] [<c046b7f4>] (hub_port_init) from [<c046eb64>] (hub_event+0x704/0x1018)
[ 190.870704] [<c046eb64>] (hub_event) from [<c0053c84>] (process_one_work+0x1b4/0x4a0)
[ 190.878919] [<c0053c84>] (process_one_work) from [<c00540f8>] (worker_thread+0x154/0x44c)
[ 190.887503] [<c00540f8>] (worker_thread) from [<c0058f88>] (kthread+0xd4/0xf0)
[ 190.895076] [<c0058f88>] (kthread) from [<c000e810>] (ret_from_fork+0x14/0x24)
[ 190.902650] 5 locks held by kworker/1:0/14:
[ 190.907023] #0: ("usb_hub_wq"){.+.+.+}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.915454] #1: ((&hub->events)){+.+.+.}, at: [<c0053bf4>] process_one_work+0x124/0x4a0
[ 190.924070] #2: (&dev->mutex){......}, at: [<c046e490>] hub_event+0x30/0x1018
[ 190.931768] #3: (&port_dev->status_lock){+.+.+.}, at: [<c046eb50>] hub_event+0x6f0/0x1018
[ 190.940558] #4: (&bus->usb_address0_mutex){+.+.+.}, at: [<c046b458>] hub_port_init+0x58/0xb10
On some xHCI controllers (e.g. R-Car SoCs), the AC64 bit (bit 0) of
HCCPARAMS1 is set to 1. However, the xHCs don't support 64-bit
address memory pointers actually. So, in this case, this driver should
call dma_set_coherent_mask(dev, DMA_BIT_MASK(32)) in xhci_gen_setup().
Otherwise, the xHCI controller will be died after a usb device is
connected if it runs on above 4GB physical memory environment.
So, this patch adds a new quirk XHCI_NO_64BIT_SUPPORT to resolve
such an issue.
Give USB3 devices a better chance to enumerate at USB 3 speeds if
they are connected to a suspended host.
Solves an issue with NEC uPD720200 host hanging when partially
enumerating a USB3 device as USB2 after host controller runtime resume.
Based on Sergey's test patch [1], this fixes zram with lz4 compression
on big endian cpus.
Note that the 64-bit preprocessor test is not a cleanup, it's part of
the fix, since those identifiers are bogus (for example, __ppc64__
isn't defined anywhere else in the kernel, which means we'd fall into
the 32-bit definitions on ppc64).
The commit 895005202987 ("dmaengine: dw: apply both HS interfaces and remove
slave_id usage") cleaned up the code to avoid usage of depricated slave_id
member of generic slave configuration.
Meanwhile it broke the master selection by removing important call to
dwc_set_masters() in ->device_alloc_chan_resources() which copied masters from
custom slave configuration to the internal channel structure.
Everything works until now since there is no customized connection of
DesignWare DMA IP to the bus, i.e. one bus and one or more masters are in use.
The configurations where 2 masters are connected to the different masters are
not working anymore. We are expecting one user of such configuration and need
to select masters properly. Besides that it is obviously a performance
regression since only one master is in use in multi-master configuration.
Select masters in accordance with what user asked for. Keep this patch in a form
more suitable for back porting.
We are safe to take necessary data in ->device_alloc_chan_resources() because
we don't support generic slave configuration embedded into custom one, and thus
the only way to provide such is to use the parameter to a filter function which
is called exactly before channel resource allocation.
While here, replase BUG_ON to less noisy dev_warn() and prevent channel
allocation in case of error.
Fixes: 895005202987 ("dmaengine: dw: apply both HS interfaces and remove slave_id usage") Cc: stable@vger.kernel.org Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Vinod Koul <vinod.koul@intel.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
This is Dell usb dock audio workaround.
It was fixed the master volume keep lower.
[Some background: the patch essentially skips the controls of a couple
of FU volumes. Although the firmware exposes the dB and the value
information via the usb descriptor, changing the values (we set the
min volume as default) screws up the device. Although this has been
fixed in the newer firmware, the devices are shipped with the old
firmware, thus we need the workaround in the driver side. -- tiwai]
An interrupt handler that uses the fpu can kill a KVM VM, if it runs
under the following conditions:
- the guest's xcr0 register is loaded on the cpu
- the guest's fpu context is not loaded
- the host is using eagerfpu
Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".
Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:
if (irq_fpu_usable()) {
kernel_fpu_begin();
[... code that uses the fpu ...]
kernel_fpu_end();
}
As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.
kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.
kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
fault is trapped and SIGSEGV is delivered to the current process.
Only pre-4.2 kernels appear to be vulnerable to this sequence of
events. Commit 653f52c ("kvm,x86: load guest FPU context more eagerly")
from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
This patch fixes the bug by keeping the host's xcr0 loaded outside
of the interrupts-disabled region where KVM switches into guest mode.
Cc: stable@vger.kernel.org Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David Matlack <dmatlack@google.com>
[Move load after goto cancel_injection. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Handling exceptions from modules never worked on parisc.
It was just masked by the fact that exceptions from modules
don't happen during normal use.
When a module triggers an exception in get_user() we need to load the
main kernel dp value before accessing the exception_data structure, and
afterwards restore the original dp value of the module on exit.
The kernel module testcase (lib/test_user_copy.c) exhibited a kernel
crash on parisc if the parameters for copy_from_user were reversed
("illegal reversed copy_to_user" testcase).
Fix this potential crash by checking the fault handler if the faulting
address is in the exception table.
We want to avoid the kernel module loader to create function pointers
for the kernel fixup routines of get_user() and put_user(). Changing
the external reference from function type to int type fixes this.
This unbreaks exception handling for get_user() and put_user() when
called from a kernel module.
The current implementation only uses the first byte in val,
the second byte is always 0. Change it to use cpu_to_le16
to write the two bytes into the register
Cc: stable@vger.kernel.org Signed-off-by: Yong Li <sdliyong@gmail.com> Reviewed-by: Phil Reid <preid@electromag.com.au> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
A Fedora user reports that the ftdi_sio driver works properly for the
ICP DAS I-7561U device. Further, the user manual for these devices
instructs users to load the driver and add the ids using the sysfs
interface.
Add support for these in the driver directly so that the devices work
out of the box instead of needing manual configuration.
If we rename an inode A (be it a file or a directory), create a new
inode B with the old name of inode A and under the same parent directory,
fsync inode B and then power fail, at log tree replay time we end up
removing inode A completely. If inode A is a directory then all its files
are gone too.
Example scenarios where this happens:
This is reproducible with the following steps, taken from a couple of
test cases written for fstests which are going to be submitted upstream
soon:
The next time the fs is mounted, log tree replay happens and
the directory "y" does not exist nor do the files "foo" and
"bar" exist anywhere (neither in "y" nor in "x", nor the root
nor anywhere).
The next time the fs is mounted, log tree replay happens and the
file "bar" does not exists anymore. A file with the name "foo"
exists and it matches the second file we created.
Another related problem that does not involve file/data loss is when a
new inode is created with the name of a deleted snapshot and we fsync it:
The next time the fs is mounted the log replay procedure fails because
it attempts to delete the snapshot entry (which has dir item key type
of BTRFS_ROOT_ITEM_KEY) as if it were a regular (non-root) entry,
resulting in the following error that causes mount to fail:
Fix this by forcing a transaction commit when such cases happen.
This means we check in the commit root of the subvolume tree if there
was any other inode with the same reference when the inode we are
fsync'ing is a new inode (created in the current transaction).
Test cases for fstests, covering all the scenarios given above, were
submitted upstream for fstests:
* fstests: generic test for fsync after renaming directory
https://patchwork.kernel.org/patch/8694281/
* fstests: generic test for fsync after renaming file
https://patchwork.kernel.org/patch/8694301/
* fstests: add btrfs test for fsync after snapshot deletion
https://patchwork.kernel.org/patch/8670671/
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When we have the no_holes feature enabled, if a we truncate a file to a
smaller size, truncate it again but to a size greater than or equals to
its original size and fsync it, the log tree will not have any information
about the hole covering the range [truncate_1_offset, new_file_size[.
Which means if the fsync log is replayed, the file will remain with the
state it had before both truncate operations.
Without the no_holes feature this does not happen, since when the inode
is logged (full sync flag is set) it will find in the fs/subvol tree a
leaf with a generation matching the current transaction id that has an
explicit extent item representing the hole.
Fix this by adding an explicit extent item representing a hole between
the last extent and the inode's i_size if we are doing a full sync.
The issue is easy to reproduce with the following test case for fstests:
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
# This test was motivated by an issue found in btrfs when the btrfs
# no-holes feature is enabled (introduced in kernel 3.14). So enable
# the feature if the fs being tested is btrfs.
if [ $FSTYP == "btrfs" ]; then
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"
MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes"
fi
# Now truncate our file foo to a smaller size (64Kb) and then truncate
# it to the size it had before the shrinking truncate (125Kb). Then
# fsync our file. If a power failure happens after the fsync, we expect
# our file to have a size of 125Kb, with the first 64Kb of data having
# the value 0xaa and the second 61Kb of data having the value 0x00.
$XFS_IO_PROG -c "truncate 64K" \
-c "truncate 125K" \
-c "fsync" \
$SCRATCH_MNT/foo
# Do something similar to our file bar, but the first truncation sets
# the file size to 0 and the second truncation expands the size to the
# double of what it was initially.
$XFS_IO_PROG -c "truncate 0" \
-c "truncate 253K" \
-c "fsync" \
$SCRATCH_MNT/bar
# Allow writes again, mount to trigger log replay and validate file
# contents.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# We expect foo to have a size of 125Kb, the first 64Kb of data all
# having the value 0xaa and the remaining 61Kb to be a hole (all bytes
# with value 0x00).
echo "File foo content after log replay:"
od -t x1 $SCRATCH_MNT/foo
# We expect bar to have a size of 253Kb and no extents (any byte read
# from bar has the value 0x00).
echo "File bar content after log replay:"
od -t x1 $SCRATCH_MNT/bar
status=0
exit
The expected file contents in the golden output are:
File foo content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
* 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
* 0372000
File bar content after log replay: 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
* 0772000
Without this fix, their contents are:
File foo content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
* 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
* 0372000
File bar content after log replay: 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
* 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
* 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
* 0772000
A test case submission for fstests follows soon.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
After commit 4f764e515361 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.
This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree and therefore at log replay time, which makes the replay code delete
the xattr from the fs/subvol tree as it thinks that xattr was deleted
prior to the last fsync.
Fix this by always logging all xattrs, which is the simplest and most
reliable way to detect deleted xattrs and replay the deletes at log replay
time.
This issue is reproducible with the following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
. ./common/attr
# real QA test starts here
# We create a lot of xattrs for a single file. Only btrfs and xfs are currently
# able to store such a large mount of xattrs per file, other filesystems such
# as ext3/4 and f2fs for example, fail with ENOSPC even if we attempt to add
# less than 1000 xattrs with very small values.
_supported_fs btrfs xfs
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_attrs
_require_metadata_journaling $SCRATCH_DEV
# Create the test file with some initial data and make sure everything is
# durably persisted.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" $SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add many small xattrs to our file.
# We create such a large amount because it's needed to trigger the issue found
# in btrfs - we need to have an amount that causes the fs to have at least 3
# btree leafs with xattrs stored in them, and it must work on any leaf size
# (maximum leaf/node size is 64Kb).
num_xattrs=2000
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
$SETFATTR_PROG -n $name -v "val_$(printf "%04d" $i)" $SCRATCH_MNT/foo
done
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now update our file's data and fsync the file.
# After a successful fsync, if the fsync log/journal is replayed we expect to
# see all the xattrs we added before with the same values (and the updated file
# data of course). Btrfs used to delete some of these xattrs when it replayed
# its fsync log/journal.
$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
echo "File xattrs after crash and log replay:"
for ((i = 1; i <= $num_xattrs; i++)); do
name="user.attr_$(printf "%04d" $i)"
echo -n "$name="
$GETFATTR_PROG --absolute-names -n $name --only-values $SCRATCH_MNT/foo
echo
done
status=0
exit
The golden output expects all xattrs to be available, and with the correct
values, after the fsync log is replayed.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successsfully fsync'ed before.
This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.
Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).
This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
here=`pwd`
tmp=/tmp/$$
status=1 # failure is the default!
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_supported_fs generic
_supported_os Linux
_need_to_be_root
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
_crash_and_mount()
{
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount. This makes the fs replay its fsync log.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
}
# Create the test file with some initial data and then fsync it.
# The fsync here is only needed to trigger the issue in btrfs, as it causes the
# the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
sync
# Add a hard link to our file.
# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
# which is a necessary condition to trigger the issue.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
# Sync the filesystem to force a commit of the current btrfs transaction, this
# is a necessary condition to trigger the bug on btrfs.
sync
# Now append more data to our file, increasing its size, and fsync the file.
# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
# write path did not update the inode item in the btree nor the delayed inode
# item (in memory struture) in the current transaction (created by the fsync
# handler), the fsync did not record the inode's new i_size in the fsync
# log/journal. This made the data unavailable after the fsync log/journal is
# replayed.
$XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
-c "fsync" \
$SCRATCH_MNT/foo | _filter_xfs_io
echo "File content after fsync and before crash:"
od -t x1 $SCRATCH_MNT/foo
_crash_and_mount
echo "File content after crash and log replay:"
od -t x1 $SCRATCH_MNT/foo
status=0
exit
The expected file output before and after the crash/power failure expects the
appended data to be available, which is:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
* 0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
* 0200000
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Changes since V1: fixed the description and added KASan warning.
In assoc_array_insert_into_terminal_node(), we call the
compare_object() method on all non-empty slots, even when they're
not leaves, passing a pointer to an unexpected structure to
compare_object(). Currently it causes an out-of-bound read access
in keyring_compare_object detected by KASan (see below). The issue
is easily reproduced with keyutils testsuite.
Only call compare_object() when the slot is a leave.
KASan warning:
==================================================================
BUG: KASAN: slab-out-of-bounds in keyring_compare_object+0x213/0x240 at addr ffff880060a6f838
Read of size 8 by task keyctl/1655
=============================================================================
BUG kmalloc-192 (Not tainted): kasan: bad access detected
-----------------------------------------------------------------------------
Plantronics BT300 does not support reading the sample rate which leads
to many lines of "cannot get freq at ep 0x1". This patch adds the USB
ID of the BT300 to quirks.c and avoids those error messages.
As of 5a60e87603c4c533492c515b7f62578189b03c9c, RBD object request
allocations are made via rbd_obj_request_create() with GFP_NOIO.
However, subsequent OSD request allocations in rbd_osd_req_create*()
use GFP_ATOMIC.
With heavy page cache usage (e.g. OSDs running on same host as krbd
client), rbd_osd_req_create() order-1 GFP_ATOMIC allocations have been
observed to fail, where direct reclaim would have allowed GFP_NOIO
allocations to succeed.
Cc: stable@vger.kernel.org # 3.18+ Suggested-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Neil Brown <neilb@suse.com> Signed-off-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Marek <mmarek@suse.cz> Cc: stable@vger.kernel.org Cc: kvm@vger.kernel.org Reported-by: Linda Walsh <lkml@tlinx.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
This patch fixes an issue that usbhsg_queue_done() may cause kernel
panic when dma callback is running and usb_ep_disable() is called
by interrupt handler. (Especially, we can reproduce this issue using
g_audio with usb-dmac driver.)
For example of a flow:
usbhsf_dma_complete (on tasklet)
--> usbhsf_pkt_handler (on tasklet)
--> usbhsg_queue_done (on tasklet)
*** interrupt happened and usb_ep_disable() is called ***
--> usbhsg_queue_pop (on tasklet)
Then, oops happened.
According to the gadget.h, a "complete" function will always be called
with interrupts disabled. However, sometimes usbhsg_queue_pop() function
is called with interrupts enabled. So, this function should be held by
usbhs_lock() to disable interruption. Also, this driver has to call
spin_unlock() to avoid spinlock recursion by this driver before calling
usb_gadget_giveback_request().
Otherwise, there is possible to cause a spinlock suspected in a gadget
complete function.
If at this moment target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event since target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, source CPU will continue
executing its own handle_edge_irq().
With FIFO interrupt the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.
We can avoid this situation by moving and clearing the event while
keeping event masked.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Phoenix Audio TMX320 gives the similar error when the sample rate is
asked:
usb 2-1.3: 2:1: cannot get freq at ep 0x85
usb 2-1.3: 1:1: cannot get freq at ep 0x2
....
Add the corresponding USB-device ID (1de7:0014) to
snd_usb_get_sample_rate_quirk() list.
message every time playback starts. While USB DAC in the RR2150
supports reading the sample rate, it never returns a sample rate
other than zero in my observation with common sample rates.
Signed-off-by: Eric Wong <normalperson@yhbt.net> Cc: Joe Turner <joe@oampo.co.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Previously, ext4 would fail the mount if the file system had the quota
feature enabled and quota mount options (used for the older quota
setups) were present. This broke xfstests, since xfs silently ignores
the usrquote and grpquota mount options if they are specified. This
commit changes things so that we are consistent with xfs; having the
mount options specified is harmless, so no sense break users by
forbidding them.
Non maskable interrupts (NMI) are preferred to interrupts in current
implementation. If a NMI is pending and NMI is blocked by the result
of nmi_allowed(), pending interrupt is not injected and
enable_irq_window() is not executed, even if interrupts injection is
allowed.
In old kernel (e.g. 2.6.32), schedule() is often called in NMI context.
In this case, interrupts are needed to execute iret that intends end
of NMI. The flag of blocking new NMI is not cleared until the guest
execute the iret, and interrupts are blocked by pending NMI. Due to
this, iret can't be invoked in the guest, and the guest is starved
until block is cleared by some events (e.g. canceling injection).
This patch injects pending interrupts, when it's allowed, even if NMI
is blocked. And, If an interrupts is pending after executing
inject_pending_event(), enable_irq_window() is executed regardless of
NMI pending counter.
Cc: stable@vger.kernel.org Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
With the internal Quota feature, mke2fs creates empty quota inodes and
quota usage tracking is enabled as soon as the file system is mounted.
Since quotacheck is no longer preallocating all of the blocks in the
quota inode that are likely needed to be written to, we are now seeing
a lockdep false positive caused by needing to allocate a quota block
from inside ext4_map_blocks(), while holding i_data_sem for a data
inode. This results in this complaint:
During revalidate we check whether device capacity has changed before we
decide whether to output disk information or not.
The check for old capacity failed to take into account that we scaled
sdkp->capacity based on the reported logical block size. And therefore
the capacity test would always fail for devices with sectors bigger than
512 bytes and we would print several copies of the same discovery
information.
Avoid scaling sdkp->capacity and instead adjust the value on the fly
when setting the block device capacity and generating fake C/H/S
geometry.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: <stable@vger.kernel.org> Reported-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Hannes Reinicke <hare@suse.de> Reviewed-by: Ewan Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The SPICE protocol considers the position of a cursor to be the location
of its active pixel on the display, so the cursor is drawn with its
top-left corner at "(x - hot_spot_x, y - hot_spot_y)" but the DRM cursor
position gives the location where the top-left corner should be drawn,
with the hotspot being a hint for drivers that need it.
This fixes the location of the window resize cursors when using Fluxbox
with the QXL DRM driver and both the QXL and modesetting X drivers.
This patch adds a code to surely disable TX IRQ of the pipe before
starting TX DMAC transfer. Otherwise, a lot of unnecessary TX IRQs
may happen in rare cases when DMAC is used.
When unexpected situation happened (e.g. tx/rx irq happened while
DMAC is used), the usbhsf_pkt_handler() was possible to cause NULL
pointer dereference like the followings:
Commit 127500ccb766f ("ARM: OMAP2+: Only write the sysconfig on idle
when necessary") talks about verification of sysconfig cache value before
updating it, only during idle path. But the patch is adding the
verification in the enable path. So, adding the check in a proper place
as per the commit description.
Not keeping this check during enable path as there is a chance of losing
context and it is safe to do on idle as the context of the register will
never be lost while the device is active.
Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com> Acked-by: Tero Kristo <t-kristo@ti.com> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: <stable@vger.kernel.org> # 3.12+ Fixes: commit 127500ccb766 "ARM: OMAP2+: Only write the sysconfig on idle when necessary"
[paul@pwsan.com: appears to have been caused by my own mismerge of the
originally posted patch] Signed-off-by: Paul Walmsley <paul@pwsan.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The usbhid driver has inconsistently duplicated code in its post-reset,
resume, and reset-resume pathways.
reset-resume doesn't check HID_STARTED before trying to
restart the I/O queues.
resume fails to clear the HID_SUSPENDED flag if HID_STARTED
isn't set.
resume calls usbhid_restart_queues() with usbhid->lock held
and the others call it without holding the lock.
The first item in particular causes a problem following a reset-resume
if the driver hasn't started up its I/O. URB submission fails because
usbhid->urbin is NULL, and this triggers an unending reset-retry loop.
This patch fixes the problem by creating a new subroutine,
hid_restart_io(), to carry out all the common activities. It also
adds some checks that were missing in the original code:
After a reset, there's no need to clear any halted endpoints.
After a resume, if a reset is pending there's no need to
restart any I/O until the reset is finished.
After a resume, if the interrupt-IN endpoint is halted there's
no need to submit the input URB until the halt has been
cleared.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Daniel Fraga <fragabr@gmail.com> Tested-by: Daniel Fraga <fragabr@gmail.com> CC: <stable@vger.kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Some cipher implementations will crash if you try to use them
without calling setkey first. This patch adds a check so that
the accept(2) call will fail with -ENOKEY if setkey hasn't been
done on the socket yet.
Cc: stable@vger.kernel.org Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Dmitry Vyukov <dvyukov@google.com>
[backported to 3.18 by Milan Broz <gmazyland@gmail.com>]
The commit [bd48128539ab: ALSA: hda - Fix forgotten HDMI
monitor_present update] covered the missing update of monitor_present
flag, but this caused a regression for devices without the i915 eld
notifier. Since the old code supposed that pin_eld->monitor_present
was updated by the caller side, the hdmi_present_sense_via_verbs()
doesn't update the temporary eld->monitor_present but only
pin_eld->monitor_present, which is now overridden in update_eld().
The fix is to update pin_eld->monitor_present as well before calling
update_eld().
Note that this may still leave monitor_present flag in an inconsistent
state when the driver repolls, but this is at least the old behavior.
More proper fix will follow in the later patch.
GCC6 (and Linaro's 2015.12 snapshot of GCC5) has a new default that uses
adrp/ldr or adrp/add to address literal pools. When CONFIG_ARM64_ERRATUM_843419
is enabled, modules built with this toolchain fail to load:
module libahci: unsupported RELA relocation: 275
This patch fixes the problem by passing '-mpc-relative-literal-loads'
to the compiler.
Hanjun Guo has reported that a CMA stress test causes broken accounting of
CMA and free pages:
> Before the test, I got:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree: 195044 kB
>
>
> After running the test:
> -bash-4.3# cat /proc/meminfo | grep Cma
> CmaTotal: 204800 kB
> CmaFree: 6602584 kB
>
> So the freed CMA memory is more than total..
>
> Also the the MemFree is more than mem total:
>
> -bash-4.3# cat /proc/meminfo
> MemTotal: 16342016 kB
> MemFree: 22367268 kB
> MemAvailable: 22370528 kB
Laura Abbott has confirmed the issue and suspected the freepage accounting
rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is
caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
pageblocks:
> CMA isolates MAX_ORDER aligned blocks, but, during the process,
> partialy isolated block exists. If MAX_ORDER is 11 and
> pageblock_order is 9, two pageblocks make up MAX_ORDER
> aligned block and I can think following scenario because pageblock
> (un)isolation would be done one by one.
>
> (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
> MIGRATE_ISOLATE, respectively.
>
> CC -> IC -> II (Isolation)
> II -> CI -> CC (Un-isolation)
>
> If some pages are freed at this intermediate state such as IC or CI,
> that page could be merged to the other page that is resident on
> different type of pageblock and it will cause wrong freepage count.
This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
but since it doesn't hold the zone->lock between pageblocks, a race
window does exist.
It's also likely that unexpected merging can occur between
MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in
__free_one_page() since commit 3c605096d315 ("mm/page_alloc: restrict
max order of merging on isolated pageblock"). However, we only check
the migratetype of the pageblock where buddy merging has been initiated,
not the migratetype of the buddy pageblock (or group of pageblocks)
which can be MIGRATE_ISOLATE.
Joonsoo has suggested checking for buddy migratetype as part of
page_is_buddy(), but that would add extra checks in allocator hotpath
and bloat-o-meter has shown significant code bloat (the function is
inline).
This patch reduces the bloat at some expense of more complicated code.
The buddy-merging while-loop in __free_one_page() is initially bounded
to pageblock_border and without any migratetype checks. The checks are
placed outside, bumping the max_order if merging is allowed, and
returning to the while-loop with a statement which can't be possibly
considered harmful.
This fixes the accounting bug and also removes the arguably weird state
in the original commit 3c605096d315 where buddies could be left
unmerged.
Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock") Link: https://lkml.org/lkml/2016/3/2/280 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Hanjun Guo <guohanjun@huawei.com> Tested-by: Hanjun Guo <guohanjun@huawei.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Debugged-by: Laura Abbott <labbott@redhat.com> Debugged-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Let's try to be consistent about data type of page order.
[sfr@canb.auug.org.au: fix build (type of pageblock_order)]
[hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
__free_pages_bootmem prepares a page for release to the buddy allocator
and assumes that the struct page is initialised. Parallel initialisation
of struct pages defers initialisation and __free_pages_bootmem can be
called for struct pages that cannot yet map struct page to PFN. This
patch passes PFN to __free_pages_bootmem with no other functional change.
Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Nate Zimmer <nzimmer@sgi.com> Tested-by: Waiman Long <waiman.long@hp.com> Tested-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Robin Holt <robinmholt@gmail.com> Cc: Nate Zimmer <nzimmer@sgi.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Waiman Long <waiman.long@hp.com> Cc: Scott Norton <scott.norton@hp.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
When master handles convert request, it queues ast first and then
returns status. This may happen that the ast is sent before the request
status because the above two messages are sent by two threads. And
right after the ast is sent, if master down, it may trigger BUG in
dlm_move_lockres_to_recovery_list in the requested node because ast
handler moves it to grant list without clear lock->convert_pending. So
remove BUG_ON statement and check if the ast is processed in
dlmconvert_remote.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
There is a race window between dlmconvert_remote and
dlm_move_lockres_to_recovery_list, which will cause a lock with
OCFS2_LOCK_BUSY in grant list, thus system hangs.
status = dlm_send_remote_convert_request();
>>>>>> race window, master has queued ast and return DLM_NORMAL,
and then down before sending ast.
this node detects master down and calls
dlm_move_lockres_to_recovery_list, which will revert the
lock to grant list.
Then OCFS2_LOCK_BUSY won't be cleared as new master won't
send ast any more because it thinks already be authorized.
In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set
(res is still in recovering) or res master changed (new master has
finished recovery), reset the status to DLM_RECOVERING, then it will
retry convert.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The ati_remote2 driver expects at least two interfaces with one
endpoint each. If given malicious descriptor that specify one
interface or no endpoints, it will crash in the probe function.
Ensure there is at least two interfaces and one endpoint for each
interface before using it.
The full disclosure: http://seclists.org/bugtraq/2016/Mar/90
Some Lenovo ideapad models lack a physical rfkill switch.
On Lenovo models ideapad Y700 Touch-15ISK and ideapad Y700-15ISK,
ideapad-laptop would wrongly report all radios as blocked by
hardware which caused wireless network connections to fail.
Add these models without an rfkill switch to the no_hw_rfkill list.
Fix deadlocking during concurrent receive and transmit operations on SMP
platforms caused by the use of incorrect lock: on transmit 'tx_lock'
spinlock should be used instead of 'lock' which is used for receive
operation.
This fix is applicable to kernel versions starting from v2.15.
Signed-off-by: Aurelien Jacquiot <a-jacquiot@ti.com> Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:
- The fs.suid_dumpable sysctl is set to 2.
- The kernel.core_pattern sysctl's value starts with "/". (Systems
where kernel.core_pattern starts with "|/" are not affected.)
- Unprivileged user namespace creation is permitted. (This is
true on Linux >=3.8, but some distributions disallow it by
default using a distro patch.)
Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.
To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.
Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The trace_printk() code will allocate extra buffers if the compile detects
that a trace_printk() is used. To do this, the format of the trace_printk()
is saved to the __trace_printk_fmt section, and if that section is bigger
than zero, the buffers are allocated (along with a message that this has
happened).
If trace_printk() uses a format that is not a constant, and thus something
not guaranteed to be around when the print happens, the compiler optimizes
the fmt out, as it is not used, and the __trace_printk_fmt section is not
filled. This means the kernel will not allocate the special buffers needed
for the trace_printk() and the trace_printk() will not write anything to the
tracing buffer.
Adding a "__used" to the variable in the __trace_printk_fmt section will
keep it around, even though it is set to NULL. This will keep the string
from being printed in the debugfs/tracing/printk_formats section as it is
not needed.
Moving the initialization earlier is needed in 4.6 because
kvm_arch_init_vm is now using mmu_lock, causing lockdep to
complain:
[ 284.440294] INFO: trying to register non-static key.
[ 284.445259] the code is fine but needs lockdep annotation.
[ 284.450736] turning off the locking correctness validator.
...
[ 284.528318] [<ffffffff810aecc3>] lock_acquire+0xd3/0x240
[ 284.533733] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
[ 284.541467] [<ffffffff81715581>] _raw_spin_lock+0x41/0x80
[ 284.546960] [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
[ 284.554707] [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm]
[ 284.562281] [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm]
[ 284.568381] [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm]
[ 284.574740] [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm]
However, it also helps fixing a preexisting problem, which is why this
patch is also good for stable kernels: kvm_create_vm was incrementing
current->mm->mm_count but not decrementing it at the out_err label (in
case kvm_init_mmu_notifier failed). The new initialization order makes
it possible to add the required mmdrop without adding a new error label.
This patch fixes an active I/O shutdown bug for fabric
drivers using target_wait_for_sess_cmds(), where se_cmd
descriptor shutdown would result in hung tasks waiting
indefinitely for se_cmd->cmd_wait_comp to complete().
To address this bug, drop the incorrect list_del_init()
usage in target_wait_for_sess_cmds() and always complete()
during se_cmd target_release_cmd_kref() put, in order to
let caller invoke the final fabric release callback
into se_cmd->se_tfo->release_cmd() code.
__clear_bit_unlock() is a special little snowflake. While it carries the
non-atomic '__' prefix, it is specifically documented to pair with
test_and_set_bit() and therefore should be 'somewhat' atomic.
Therefore the generic implementation of __clear_bit_unlock() cannot use
the fully non-atomic __clear_bit() as a default.
If an arch is able to do better; is must provide an implementation of
__clear_bit_unlock() itself.
Specifically, this came up as a result of hackbench livelock'ing in
slab_lock() on ARC with SMP + SLUB + !LLSC.
The non serializing __clear_bit() was getting "lost"
80543b8e: ld_s r2,[r13,0] <--- (A) Finds PG_locked is set 80543b90: or r3,r2,1 <--- (B) other core unlocks right here 80543b94: st_s r3,[r13,0] <--- (C) sets PG_locked (overwrites unlock)
The Microsoft HD-5001 webcam microphone does not support sample rate
reading as the HD-5000 one.
This results in dmesg errors and sound hanging with pulseaudio.
(busybox's cat uses sendfile(2), unlike the coreutils version)
This is because tracing_splice_read_pipe() can call splice_to_pipe()
with spd->nr_pages == 0. spd_pages underflows in splice_to_pipe() and
we fill the page pointers and the other fields of the pipe_buffers with
garbage.
All other callers of splice_to_pipe() avoid calling it when nr_pages ==
0, and we could make tracing_splice_read_pipe() do that too, but it
seems reasonable to have splice_to_page() handle this condition
gracefully.
Cc: stable@vger.kernel.org Signed-off-by: Rabin Vincent <rabin@rab.in> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
If tracing contains data and the trace_pipe file is read with sendfile(),
then it can trigger a NULL pointer dereference and various BUG_ON within the
VM code.
There's a patch to fix this in the splice_to_pipe() code, but it's also a
good idea to not let that happen from trace_pipe either.
The uas driver can never queue more then MAX_CMNDS (- 1) tags and tags
are shared between luns, so there is no need to claim that we can_queue
some random large number.
Not claiming that we can_queue 65536 commands, fixes the uas driver
failing to initialize while allocating the tag map with a "Page allocation
failure (order 7)" error on systems which have been running for a while
and thus have fragmented memory.
Cc: stable@vger.kernel.org Reported-and-tested-by: Yves-Alexis Perez <corsac@corsac.net> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
An attack has become available which pretends to be a quirky
device circumventing normal sanity checks and crashes the kernel
by an insufficient number of interfaces. This patch adds a check
to the code path for quirky devices.
Attacks that trick drivers into passing a NULL pointer
to usb_driver_claim_interface() using forged descriptors are
known. This thwarts them by sanity checking.
The iowarrior driver expects at least one valid endpoint. If given
malicious descriptors that specify 0 for the number of endpoints,
it will crash in the probe function. Ensure there is at least
one endpoint on the interface before using it.
The full report of this issue can be found here:
http://seclists.org/bugtraq/2016/Mar/87
RCU used illegally from idle CPU!
rcu_scheduler_active = 1, debug_locks = 1
RCU used illegally from extended quiescent state!
no locks held by swapper/3/0.
Move the entering_irq() call before ack_APIC_irq(), because entering_irq()
tells the RCU susbstems to end the extended quiescent state, so that the
following trace call in ack_APIC_irq() works correctly.
Suggested-by: Andi Kleen <ak@linux.intel.com> Fixes: 4787c368a9bc "x86/tracing: Add irq_enter/exit() in smp_trace_reschedule_interrupt()" Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
In some cases, platform thermal driver may report invalid trip points,
thermal core should not take any action for these trip points.
This fixed a regression that bogus trip point starts to screw up thermal
control on some Lenovo laptops, after
commit bb431ba26c5cd0a17c941ca6c3a195a3a6d5d461
Author: Zhang Rui <rui.zhang@intel.com>
Date: Fri Oct 30 16:31:47 2015 +0800
Thermal: initialize thermal zone device correctly
After thermal zone device registered, as we have not read any
temperature before, thus tz->temperature should not be 0,
which actually means 0C, and thermal trend is not available.
In this case, we need specially handling for the first
thermal_zone_device_update().
Both thermal core framework and step_wise governor is
enhanced to handle this. And since the step_wise governor
is the only one that uses trends, so it's the only thermal
governor that needs to be updated.
Normally the timeout clock frequency is read from the capabilities
register. It is also possible to set the value prior to calling
sdhci_add_host() in which case that value will override the
capabilities register value. However that was being done after
calculating max_busy_timeout so that max_busy_timeout was being
calculated using the wrong value of timeout_clk.
Fix that by moving the override before max_busy_timeout is
calculated.
The result is that the max_busy_timeout and max_discard
increase for BSW devices so that, for example, the time for
mkfs.ext4 on a 64GB eMMC drops from about 1 minute 40 seconds
to about 20 seconds.
Note, in the future, the capabilities setting will be tidied up
and this override won't be used anymore. However this fix is
needed for stable.
The touchscreen controller in the A13 and later has a different temperature
curve than the one in the original A10, change the compatible for the A13 and
later so that the kernel will use the correct curve.
Reported-by: Tong Zhang <lovewilliam@gmail.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Maxime Ripard <maxime.ripard@free-electrons.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
nfsd_lookup_dentry exits with the parent filehandle locked. fh_put also
unlocks if necessary (nfsd filehandle locking is probably too lenient),
so it gets unlocked eventually, but if the following op in the compound
needs to lock it again, we can deadlock.
A fuzzer ran into this; normal clients don't send a secinfo followed by
a readdir in the same compound.
Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
create_fixed_stream_quirk() may cause a NULL-pointer dereference by
accessing the non-existing endpoint when a USB device with a malformed
USB descriptor is used.
This patch avoids it simply by adding a sanity check of bNumEndpoints
before the accesses.
# dmesg | grep mmc
mmc_spi spi32766.0: SD/MMC host mmc0, no WP, no poweroff, cd polling
mmc0: host does not support reading read-only switch, assuming write-enable
mmc0: new SDHC card on SPI
mmcblk0: mmc0:0000 SU04G 3.69 GiB
mmcblk0: p1
With this patch applied the "cd polling" portion above disappears.
Cirrus HD-audio driver may adjust GPIO pins for EAPD dynamically
depending on the jack plug state. This works fine for the auto-mute
mode where the speaker gets muted upon the HP jack plug. OTOH, when
the auto-mute mode is off, this turns off the EAPD unexpectedly
depending on the jack state, which results in the silent speaker
output.
This patch fixes the silent speaker output issue by setting GPIO bits
constantly when the auto-mute mode is off.
Even though hid_hw_* checks that passed in data_len is less than
HID_MAX_BUFFER_SIZE it is not enough, as i2c-hid does not necessarily
allocate buffers of HID_MAX_BUFFER_SIZE but rather checks all device
reports and select largest size. In-kernel users normally just send as much
data as report needs, so there is no problem, but hidraw users can do
whatever they please:
Function eth_prepare_mac_addr_change() is called as part of MAC
address change. This function check if interface is running.
To enable change MAC address when interface is running:
IFF_LIVE_ADDR_CHANGE flag must be set to dev->priv_flags field
Fixes: c5aff18204da ("net: mvneta: driver for Marvell Armada 370/XP
network unit") Cc: stable@vger.kernel.org Signed-off-by: Dmitri Epshtein <dima@marvell.com> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Inside multipath_make_request(), multipath maps the incoming
bio into low level device's bio, but it is totally wrong to
copy the bio into mapped bio via '*mapped_bio = *bio'. For
example, .__bi_remaining is kept in the copy, especially if
the incoming bio is chained to via bio splitting, so .bi_end_io
can't be called for the mapped bio at all in the completing path
in this kind of situation.
This patch fixes the issue by using clone style.
Cc: stable@vger.kernel.org (v3.14+) Reported-and-tested-by: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
The powermate driver expects at least one valid USB endpoint in its
probe function. If given malicious descriptors that specify 0 for
the number of endpoints, it will crash. Validate the number of
endpoints on the interface before using them.
The full report for this issue can be found here:
http://seclists.org/bugtraq/2016/Mar/85