This patch is based on an earlier one submitted
by Jon Maxwell with the following commit message:
"We recently encountered a bug where a few customers using ibmveth on the
same LPAR hit an issue where a TCP session hung when large receive was
enabled. Closer analysis revealed that the session was stuck because the
one side was advertising a zero window repeatedly.
We narrowed this down to the fact the ibmveth driver did not set gso_size
which is translated by TCP into the MSS later up the stack. The MSS is
used to calculate the TCP window size and as that was abnormally large,
it was calculating a zero window, even although the sockets receive buffer
was completely empty."
We rely on the Virtual I/O Server partition in a pseries
environment to provide the MSS through the TCP header checksum
field. The stipulation is that users should not disable checksum
offloading if rx packet aggregation is enabled through VIOS.
Some firmware offerings provide the MSS in the RX buffer.
This is signalled by a bit in the RX queue descriptor.
Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Reviewed-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Jonathan Maxwell <jmaxwell37@gmail.com> Reviewed-by: David Dai <zdai@us.ibm.com> Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Save the qp context flags byte containing the flag disabling vlan stripping
in the RESET to INIT qp transition, rather than in the INIT to RTR
transition. Per the firmware spec, the flags in this byte are active
in the RESET to INIT transition.
As a result of saving the flags in the incorrect qp transition, when
switching dynamically from VGT to VST and back to VGT, the vlan
remained stripped (as is required for VST) and did not return to
not-stripped (as is required for VGT).
Fixes: f0f829bf42cd ("net/mlx4_core: Add immediate activate for VGT->VST->VGT") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In function mlx4_cq_completion() and mlx4_cq_event(), the
radix_tree_lookup requires a rcu_read_lock.
This is mandatory: if another core frees the CQ, it could
run the radix_tree_node_rcu_free() call_rcu() callback while
its being used by the radix tree lookup function.
Additionally, in function mlx4_cq_event(), since we are adding
the rcu lock around the radix-tree lookup, we no longer need to take
the spinlock. Also, the synchronize_irq() call for the async event
eliminates the need for incrementing the cq reference count in
mlx4_cq_event().
Other changes:
1. In function mlx4_cq_free(), replace spin_lock_irq with spin_lock:
we no longer take this spinlock in the interrupt context.
The spinlock here, therefore, simply protects against different
threads simultaneously invoking mlx4_cq_free() for different cq's.
2. In function mlx4_cq_free(), we move the radix tree delete to before
the synchronize_irq() calls. This guarantees that we will not
access this cq during any subsequent interrupts, and therefore can
safely free the CQ after the synchronize_irq calls. The rcu_read_lock
in the interrupt handlers only needs to protect against corrupting the
radix tree; the interrupt handlers may access the cq outside the
rcu_read_lock due to the synchronize_irq calls which protect against
premature freeing of the cq.
3. In function mlx4_cq_event(), we change the mlx_warn message to mlx4_dbg.
4. We leave the cq reference count mechanism in place, because it is
still needed for the cq completion tasklet mechanism.
Fixes: 6d90aa5cf17b ("net/mlx4_core: Make sure there are no pending async events when freeing CQ") Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On a system with a defective USB device connected to an USB hub,
an endless sequence of port connect events was observed. The sequence
of events as observed is as follows:
- Port reports connected event (port status=USB_PORT_STAT_CONNECTION).
- Event handler debounces port and resets it by calling hub_port_reset().
- hub_port_reset() calls hub_port_wait_reset() to wait for the reset
to complete.
- The reset completes, but USB_PORT_STAT_CONNECTION is not immediately
set in the port status register.
- hub_port_wait_reset() returns -ENOTCONN.
- Port initialization sequence is aborted.
- A few milliseconds later, the port again reports a connected event,
and the sequence repeats.
This continues either forever or, randomly, stops if the connection
is already re-established when the port status is read. It results in
a high rate of udev events. This in turn destabilizes userspace since
the above sequence holds the device mutex pretty much continuously
and prevents userspace from actually reading the device status.
To prevent the problem from happening, let's wait for the connection
to be re-established after a port reset. If the device was actually
disconnected, the code will still return an error, but it will do so
only after the long reset timeout.
Cc: Douglas Anderson <dianders@chromium.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While stressing memory and IO at the same time we changed SMT settings,
we were able to consistently trigger deadlocks in the mm system, which
froze the entire machine.
I think that under memory stress conditions, the large allocations
performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
waiting on the block layer remmaping completion, thus deadlocking the
system. The trace below was collected after the machine stalled,
waiting for the hotplug event completion.
The simplest fix for this is to make allocations in this path
non-reclaimable, with GFP_NOIO. With this patch, We couldn't hit the
issue anymore.
This should apply on top of Jens's for-next branch cleanly.
Changes since v1:
- Use GFP_NOIO instead of GFP_NOWAIT.
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less then the other doesn't always work
(both of them are unsigned ints).
Compare them as is instead.
Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
RNG instantiation was previously fixed by
commit 62743a4145bb9 ("crypto: caam - fix RNG init descriptor ret. code checking")
while deinstantiation was not addressed.
Since the descriptors used are similar, in the sense that they both end
with a JUMP HALT command, checking for errors should be similar too,
i.e. status code 7000_0000h should be considered successful.
Fixes: 1005bccd7a4a6 ("crypto: caam - enable instantiation of all RNG4 state handles") Signed-off-by: Horia Geantă <horia.geanta@nxp.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit dda45f701c9d ("MIPS: Switch to the irq_stack in interrupts")
changed both the normal and vectored interrupt handlers. Unfortunately
the vectored version, "except_vec_vi_handler", was incorrectly modified
to unconditionally jal to plat_irq_dispatch, rather than doing a jalr to
the vectored handler that has been set up. This is ok for many platforms
which set the vectored handler to plat_irq_dispatch anyway, but will
cause problems with platforms that use other handlers.
Fixes: dda45f701c9d ("MIPS: Switch to the irq_stack in interrupts") Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15110/ Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Since do_IRQ is now invoked on a separate IRQ stack, we select
HAVE_IRQ_EXIT_ON_IRQ_STACK so that softirq's may be invoked directly
from irq_exit(), rather than requiring do_softirq_own_stack.
Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com> Acked-by: Jason A. Donenfeld <jason@zx2c4.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14744/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When enterring interrupt context via handle_int or except_vec_vi, switch
to the irq_stack of the current CPU if it is not already in use.
The current stack pointer is masked with the thread size and compared to
the base or the irq stack. If it does not match then the stack pointer
is set to the top of that stack, otherwise this is a nested irq being
handled on the irq stack so the stack pointer should be left as it was.
The in-use stack pointer is placed in the callee saved register s1. It
will be saved to the stack when plat_irq_dispatch is invoked and can be
restored once control returns here.
Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com> Acked-by: Jason A. Donenfeld <jason@zx2c4.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: James Hogan <james.hogan@imgtec.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14743/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The SAVE_SOME macro is used to save the execution context on all
exceptions.
If an exception occurs while executing user code, the stack is switched
to the kernel's stack for the current task, and register $28 is switched
to point to the current_thread_info, which is at the bottom of the stack
region.
If the exception occurs while executing kernel code, the stack is left,
and this change ensures that register $28 is not updated. This is the
correct behaviour when the kernel can be executing on the separate irq
stack, because the thread_info will not be at the base of it.
With this change, register $28 is only switched to it's kernel
conventional usage of the currrent thread info pointer at the point at
which execution enters kernel space. Doing it on every exception was
redundant, but OK without an IRQ stack, but will be erroneous once that
is introduced.
Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com> Acked-by: Jason A. Donenfeld <jason@zx2c4.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: James Hogan <james.hogan@imgtec.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14742/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Within unwind stack, check if the stack pointer being unwound is within
the CPU's irq_stack and if so use that page rather than the task's stack
page.
Signed-off-by: Matt Redfearn <matt.redfearn@imgtec.com> Acked-by: Jason A. Donenfeld <jason@zx2c4.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Adam Buchbinder <adam.buchbinder@gmail.com> Cc: Maciej W. Rozycki <macro@imgtec.com> Cc: Marcin Nowakowski <marcin.nowakowski@imgtec.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14741/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After parsing TRX we should skip to the first block placed behind it.
Our code was working only with TRX with length not aligned to the
blocksize. In other cases (length aligned) it was missing the block
places right after TRX.
This fixes calculation and simplifies the comment.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Baytrail, we manually calculate busyness over the evaluation interval
to avoid issues with miscaluations with RC6 enabled. However, it turns
out that the DOWN_EI interrupt generator is completely bust - it
operates in two modes, continuous or never. Neither of which are
conducive to good behaviour. Stop unmask the DOWN_EI interrupt and just
compute everything from the UP_EI which does seem to correspond to the
desired interval.
v2: Fixup gen6_rps_pm_mask() as well
v3: Inline vlv_c0_above() to combine the now identical elapsed
calculation for up/down and simplify the threshold testing
Certain Baytrails, namely the 4 cpu core variants, have been
plaqued by spurious system hangs, mostly occurring with light loads.
Multiple bisects by various people point to a commit which changes the
reclocking strategy for Baytrail to follow its bigger brethen:
commit 8fb55197e64d ("drm/i915: Agressive downclocking on Baytrail")
There is also a review comment attached to this commit from Deepak S
on avoiding punit access on Cherryview and thus it was excluded on
common reclocking path. By taking the same approach and omitting
the punit access by not tweaking the thresholds when the hardware
has been asked to move into different frequency, considerable gains
in stability have been observed.
With J1900 box, light render/video load would end up in system hang
in usually less than 12 hours. With this patch applied, the cumulative
uptime has now been 34 days without issues. To provoke system hang,
light loads on both render and bsd engines in parallel have been used:
glxgears >/dev/null 2>/dev/null &
mpv --vo=vaapi --hwdec=vaapi --loop=inf vid.mp4
So far, author has not witnessed system hang with above load
and this patch applied. Reports from the tenacious people at
kernel bugzilla are also promising.
Considering that the punit access frequency with this patch is
considerably less, there is a possibility that this will push
the, still unknown, root cause past the triggering point on most loads.
But as we now can reliably reproduce the hang independently,
we can reduce the pain that users are having and use a
static thresholds until a root cause is found.
v3: don't break debugfs and simplification (Chris Wilson)
References: https://bugzilla.kernel.org/show_bug.cgi?id=109051 Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Len Brown <len.brown@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Jani Nikula <jani.nikula@intel.com> Cc: fritsch@xbmc.org Cc: miku@iki.fi Cc: Ezequiel Garcia <ezequiel@vanguardiasur.com.ar> CC: Michal Feix <michal@feix.cz> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Deepak S <deepak.s@linux.intel.com> Cc: Jarkko Nikula <jarkko.nikula@linux.intel.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Acked-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1487166779-26945-1-git-send-email-mika.kuoppala@intel.com
(cherry picked from commit 6067a27d1f0184596d51decbac1c1fdc4acb012f) Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In the case that compat_get_bitmap fails we do not want to copy the
bitmap to the user as it will contain uninitialized stack data and leak
sensitive data.
On VTLB+FTLB platforms (such as Loongson-3A R2), FTLB's pagesize is
usually configured the same as PAGE_SIZE. In such a case, Huge page
entry is not suitable to write in FTLB.
Unfortunately, when a huge page is created, its page table entries
haven't created immediately. Then the TLB refill handler will fetch an
invalid page table entry which has no "HUGE" bit, and this entry may be
written to FTLB. Since it is invalid, TLB load/store handler will then
use tlbwi to write the valid entry at the same place. However, the
valid entry is a huge page entry which isn't suitable for FTLB.
Our solution is to modify build_huge_handler_tail. Flush the invalid
old entry (whether it is in FTLB or VTLB, this is in order to reduce
branches) and use tlbwr to write the valid new entry.
Commit 08b3c894e565 ("MIPS: lantiq: Disable xbar fpi burst mode")
accidentally requested the resources from the pmu address region
instead of the xbar registers region, but the check for the return
value of request_mem_region() was wrong. Commit 98ea51cb0c8c ("MIPS:
Lantiq: Fix another request_mem_region() return code check") fixed the
check of the return value of request_mem_region() which made the kernel
panics.
This patch now makes use of the correct memory region for the cross bar.
When building for microMIPS we need to ensure that the assembler always
knows that there is code at the target of a branch or jump. Recent
toolchains will fail to link a microMIPS kernel when this isn't the case
due to what it thinks is a branch to non-microMIPS code.
mips-mti-linux-gnu-ld kernel/built-in.o: .spinlock.text+0x2fc: Unsupported branch between ISA modes.
mips-mti-linux-gnu-ld final link failed: Bad value
This is due to inline assembly labels in spinlock.h not being followed
by an instruction mnemonic, either due to a .subsection pseudo-op or the
end of the inline asm block.
Fix this with a .insn direction after such labels.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Signed-off-by: James Hogan <james.hogan@imgtec.com> Reviewed-by: Maciej W. Rozycki <macro@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/15325/ Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a 32-bit kernel is configured to support MIPS64r6 (CPU_MIPS64_R6),
MIPS_O32_FP64_SUPPORT won't be selected as it should be because
MIPS32_O32 is disabled (o32 is already the default ABI available on
32-bit kernels).
This results in userland FP breakage as CP0_Status.FR is read-only 1
since r6 (when an FPU is present) so __enable_fpu() will fail to clear
FR. This causes the FPU emulator to get used which will incorrectly
emulate 32-bit FPU registers.
Force o32 fp64 support in this case by also selecting
MIPS_O32_FP64_SUPPORT from CPU_MIPS64_R6 if 32BIT.
Fixes: 4e9d324d4288 ("MIPS: Require O32 FP64 support for MIPS64 with O32 compat") Signed-off-by: James Hogan <james.hogan@imgtec.com> Reviewed-by: Paul Burton <paul.burton@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15310/ Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit fd2d2b191fe7 ("s390: get_user() should zero on failure")
intended to fix s390's get_user() implementation which did not zero
the target operand if the read from user space faulted. Unfortunately
the patch has no effect: the corresponding inline assembly specifies
that the operand is only written to ("=") and the previous value is
discarded.
Therefore the compiler is free to and actually does omit the zero
initialization.
To fix this simply change the contraint modifier to "+", so the
compiler cannot omit the initialization anymore.
Fixes: c9ca78415ac1 ("s390/uaccess: provide inline variants of get_user/put_user") Fixes: fd2d2b191fe7 ("s390: get_user() should zero on failure") Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reorder the operations in decompress_kernel() to ensure initrd is moved
to a safe location before the bss section is zeroed.
During decompression bss can overlap with the initrd and this can
corrupt the initrd contents depending on the size of the compressed
kernel (which affects where the initrd is placed by the bootloader) and
the size of the bss section of the decompressor.
Also use the correct initrd size when checking for overlaps with
parmblock.
Make sure to reserve the boot memory for the flattened device tree.
Otherwise it might get overwritten, e.g. when initial_boot_params is
copied, leading to a corrupted FDT and a boot hang/crash:
bootconsole [early0] enabled
Early console on uart16650 initialized at 0xf8001600
OF: fdt: Error -11 processing FDT
Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
---[ end Kernel panic - not syncing: setup_cpuinfo: No CPU found in devicetree!
Guenter Roeck says:
> I think I found the problem. In unflatten_and_copy_device_tree(), with added
> debug information:
>
> OF: fdt: initial_boot_params=c861e400, dt=c861f000 size=28874 (0x70ca)
>
> ... and then initial_boot_params is copied to dt, which results in corrupted
> fdt since the memory overlaps. Looks like the initial_boot_params memory
> is not reserved and (re-)allocated by early_init_dt_alloc_memory_arch().
In the past, there was only one load-with-reservation instruction,
lwarx, and if a program attempted a lwarx on a misaligned address, it
would take an alignment interrupt and the kernel handler would emulate
it as though it was lwzx, which was not really correct, but benign since
it is loading the right amount of data, and the lwarx should be paired
with a stwcx. to the same address, which would also cause an alignment
interrupt which would result in a SIGBUS being delivered to the process.
We now have 5 different sizes of load-with-reservation instruction. Of
those, lharx and ldarx cause an immediate SIGBUS by luck since their
entries in aligninfo[] overlap instructions which were not fixed up, but
lqarx overlaps with lhz and will be emulated as such. lbarx can never
generate an alignment interrupt since it only operates on 1 byte.
To straighten this out and fix the lqarx case, this adds code to detect
the l[hwdq]arx instructions and return without fixing them up, resulting
in a SIGBUS being delivered to the process.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 4c6d9acce1f4 ("powerpc/mm: Add hooks for cxl") converted local
TLB invalidates to global if the cxl driver is active. This is necessary
because the CAPP snoops invalidations to forward them to the PSL on the
cxl adapter. However one path was forgotten. native_flush_hash_range()
still does local TLB invalidates, as found out the hard way recently.
This patch fixes it by following the same logic as previously: if the
cxl driver is active, the local TLB invalidates are 'upgraded' to
global.
The rapf copy loops in the Meta usercopy code is missing some extable
entries for HTP cores with unaligned access checking enabled, where
faults occur on the instruction immediately after the faulting access.
Add the fixup labels and extable entries for these cases so that corner
case user copy failures don't cause kernel crashes.
The fixup code to rewind the source pointer in
__asm_copy_from_user_{32,64}bit_rapf_loop() always rewound the source by
a single unit (4 or 8 bytes), however this is insufficient if the fault
didn't occur on the first load in the loop, as the source pointer will
have been incremented but nothing will have been stored until all 4
register [pairs] are loaded.
Read the LSM_STEP field of TXSTATUS (which is already loaded into a
register), a bit like the copy_to_user versions, to determine how many
iterations of MGET[DL] have taken place, all of which need rewinding.
The fixup code for the copy_to_user rapf loops reads TXStatus.LSM_STEP
to decide how far to rewind the source pointer. There is a special case
for the last execution of an MGETL/MGETD, since it leaves LSM_STEP=0
even though the number of MGETLs/MGETDs attempted was 4. This uses ADDZ
which is conditional upon the Z condition flag, but the AND instruction
which masked the TXStatus.LSM_STEP field didn't set the condition flags
based on the result.
Fix that now by using ANDS which does set the flags, and also marking
the condition codes as clobbered by the inline assembly.
Currently we try to zero the destination for a failed read from userland
in fixup code in the usercopy.c macros. The rest of the destination
buffer is then zeroed from __copy_user_zeroing(), which is used for both
copy_from_user() and __copy_from_user().
Unfortunately we fail to zero in the fixup code as D1Ar1 is set to 0
before the fixup code entry labels, and __copy_from_user() shouldn't even
be zeroing the rest of the buffer.
Move the zeroing out into copy_from_user() and rename
__copy_user_zeroing() to raw_copy_from_user() since it no longer does
any zeroing. This also conveniently matches the name needed for
RAW_COPY_USER support in a later patch.
Fixes: 373cd784d0fc ("metag: Memory handling") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: linux-metag@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When copying to userland on Meta, if any faults are encountered
immediately abort the copy instead of continuing on and repeatedly
faulting, and worse potentially copying further bytes successfully to
subsequent valid pages.
Fixes: 373cd784d0fc ("metag: Memory handling") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: linux-metag@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix the error checking of the alignment adjustment code in
raw_copy_from_user(), which mistakenly considers it safe to skip the
error check when aligning the source buffer on a 2 or 4 byte boundary.
If the destination buffer was unaligned it may have started to copy
using byte or word accesses, which could well be at the start of a new
(valid) source page. This would result in it appearing to have copied 1
or 2 bytes at the end of the first (invalid) page rather than none at
all.
Metag's lib/usercopy.c has a bunch of copy_from_user macros for larger
copies between 5 and 16 bytes which are completely unused. Before fixing
zeroing lets drop these macros so there is less to fix.
Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-metag@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In case of error, the function kthread_run() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check
should be replaced with IS_ERR().
In PT_SEIZED + LISTEN mode STOP/CONT signals cause a wakeup against
__TASK_TRACED. If this races with the ptrace_unfreeze_traced at the end
of a PTRACE_LISTEN, this can wake the task /after/ the check against
__TASK_TRACED, but before the reset of state to TASK_TRACED. This
causes it to instead clobber TASK_WAKING, allowing a subsequent wakeup
against TRACED while the task is still on the rq wake_list, corrupting
it.
Oleg said:
"The kernel can crash or this can lead to other hard-to-debug problems.
In short, "task->state = TASK_TRACED" in ptrace_unfreeze_traced()
assumes that nobody else can wake it up, but PTRACE_LISTEN breaks the
contract. Obviusly it is very wrong to manipulate task->state if this
task is already running, or WAKING, or it sleeps again"
[akpm@linux-foundation.org: coding-style fixes] Fixes: 9899d11f ("ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL") Link: http://lkml.kernel.org/r/xm26y3vfhmkp.fsf_-_@bsegall-linux.mtv.corp.google.com Signed-off-by: Ben Segall <bsegall@google.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
vfs_llseek will check whether the file mode has
FMODE_LSEEK, no return failure. But ashmem can be
lseek, so add FMODE_LSEEK to ashmem file.
Comment From Greg Hackmann:
ashmem_llseek() passes the llseek() call through to the backing
shmem file. 91360b02ab48 ("ashmem: use vfs_llseek()") changed
this from directly calling the file's llseek() op into a VFS
layer call. This also adds a check for the FMODE_LSEEK bit, so
without that bit ashmem_llseek() now always fails with -ESPIPE.
ops->show() can return a negative error code.
Commit 65da3484d9be ("sysfs: correctly handle short reads on PREALLOC attrs.")
(in v4.4) caused this to be stored in an unsigned 'size_t' variable, so errors
would look like large numbers.
As a result, if an error is returned, sysfs_kf_read() will return the
value of 'count', typically 4096.
Commit 17d0774f8068 ("sysfs: correctly handle read offset on PREALLOC attrs")
(in v4.8) extended this error to use the unsigned large 'len' as a size for
memmove().
Consequently, if ->show returns an error, then the first read() on the
sysfs file will return 4096 and could return uninitialized memory to
user-space.
If the application performs a subsequent read, this will trigger a memmove()
with extremely large count, and is likely to crash the machine is bizarre ways.
This bug can currently only be triggered by reading from an md
sysfs attribute declared with __ATTR_PREALLOC() during the
brief period between when mddev_put() deletes an mddev from
the ->all_mddevs list, and when mddev_delayed_delete() - which is
scheduled on a workqueue - completes.
Before this, an error won't be returned by the ->show()
After this, the ->show() won't be called.
I can reproduce it reliably only by putting delay like
usleep_range(500000,700000);
early in mddev_delayed_delete(). Then after creating an
md device md0 run
echo clear > /sys/block/md0/md/array_state; cat /sys/block/md0/md/array_state
The bug can be triggered without the usleep.
Fixes: 65da3484d9be ("sysfs: correctly handle short reads on PREALLOC attrs.") Fixes: 17d0774f8068 ("sysfs: correctly handle read offset on PREALLOC attrs") Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In vmw_surface_define_ioctl(), the 'num_sizes' is the sum of the
'req->mip_levels' array. This array can be assigned any value from
the user space. As both the 'num_sizes' and the array is uint32_t,
it is easy to make 'num_sizes' overflow. The later 'mip_levels' is
used as the loop count. This can lead an oob write. Add the check of
'req->mip_levels' to avoid this.
Signed-off-by: Li Qiang <liqiang6-s@360.cn> Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Previously, when a surface was opened using a legacy (non prime) handle,
it was verified to have been created by a client in the same master realm.
Relax this so that opening is also allowed recursively if the client
already has the surface open.
This works around a regression in svga mesa where opening of a shared
surface is used recursively to obtain surface information.
In vmw_get_cap_3d_ioctl(), a user can supply 0 for a size that is
used in vzalloc(). This eventually calls dump_stack() (in warn_alloc()),
which can leak useful addresses to dmesg.
This therefore leaves only place where such inconsistency could occur:
if padata->list is added at the same time on two different threads.
This pdata pointer comes from the function call to
padata_get_next(pd), which has in it the following block:
I strongly suspect that the problem here is that two threads can race
on reorder list. Even though the deletion is locked, call to
list_entry is not locked, which means it's feasible that two threads
pick up the same padata object and subsequently call list_add_tail on
them at the same time. The fix is thus be hoist that lock outside of
that block.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.
There are two places which walk the list and requeue selected bios,
and others that check if the list is empty. These are no longer
correct.
So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.
Signed-off-by: NeilBrown <neilb@suse.com> Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()") Signed-off-by: Jens Axboe <axboe@fb.com>
[jwang: backport to 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Restore changes in device-mapper from upstream version] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
To avoid recursion on the kernel stack when stacked block devices
are in use, generic_make_request() will, when called recursively,
queue new requests for later handling. They will be handled when the
make_request_fn for the current bio completes.
If any bios are submitted by a make_request_fn, these will ultimately
be handled seqeuntially. If the handling of one of those generates
further requests, they will be added to the end of the queue.
This strict first-in-first-out behaviour can lead to deadlocks in
various ways, normally because a request might need to wait for a
previous request to the same device to complete. This can happen when
they share a mempool, and can happen due to interdependencies
particular to the device. Both md and dm have examples where this happens.
These deadlocks can be erradicated by more selective ordering of bios.
Specifically by handling them in depth-first order. That is: when the
handling of one bio generates one or more further bios, they are
handled immediately after the parent, before any siblings of the
parent. That way, when generic_make_request() calls make_request_fn
for some particular device, we can be certain that all previously
submited requests for that device have been completely handled and are
not waiting for anything in the queue of requests maintained in
generic_make_request().
An easy way to achieve this would be to use a last-in-first-out stack
instead of a queue. However this will change the order of consecutive
bios submitted by a make_request_fn, which could have unexpected consequences.
Instead we take a slightly more complex approach.
A fresh queue is created for each call to a make_request_fn. After it completes,
any bios for a different device are placed on the front of the main queue, followed
by any bios for the same device, followed by all bios that were already on
the queue before the make_request_fn was called.
This provides the depth-first approach without reordering bios on the same level.
This, by itself, it not enough to remove all deadlocks. It just makes
it possible for drivers to take the extra step required themselves.
To avoid deadlocks, drivers must never risk waiting for a request
after submitting one to generic_make_request. This includes never
allocing from a mempool twice in the one call to a make_request_fn.
A common pattern in drivers is to call bio_split() in a loop, handling
the first part and then looping around to possibly split the next part.
Instead, a driver that finds it needs to split a bio should queue
(with generic_make_request) the second part, handle the first part,
and then return. The new code in generic_make_request will ensure the
requests to underlying bios are processed first, then the second bio
that was split off. If it splits again, the same process happens. In
each case one bio will be completely handled before the next one is attempted.
With this is place, it should be possible to disable the
punt_bios_to_recover() recovery thread for many block devices, and
eventually it may be possible to remove it completely.
Ref: http://www.spinics.net/lists/raid/msg54680.html Tested-by: Jinpu Wang <jinpu.wang@profitbricks.com> Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
[jwang: backport to 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
No caller currently checks the return value of
kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on
freeing their device. A stale reference will remain in the io_bus,
getting at least used again, when the iobus gets teared down on
kvm_destroy_vm() - leading to use after free errors.
There is nothing the callers could do, except retrying over and over
again.
So let's simply remove the bus altogether, print an error and make
sure no one can access this broken bus again (returning -ENOMEM on any
attempt to access it).
Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU") Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On some QNAP NAS devices the rtc can wake the machine. Several people
noticed that once the machine was woken this way it fails to shut down.
That's because the driver fails to acknowledge the interrupt and so it
keeps active and restarts the machine immediatly after shutdown. See
https://bugs.debian.org/794266 for a bug report.
Doing this correctly requires to interpret the INT2 flag of the first read
of the STATUS1 register because this bit is cleared by read.
Note this is not maximally robust though because a pending irq isn't
detected when the STATUS1 register was already read (and so INT2 is not
set) but the irq was not disabled. But that is a hardware imposed problem
that cannot easily be fixed by software.
There were two deviations from the reference manual: you have to wait
half a second when POC is active and you might have to repeat
initialization when POC or BLD are still set after the sequence.
Note however that as POC and BLD are cleared by read the driver might
not be able to detect that a reset is necessary. I don't have a good
idea how to fix this.
Additionally report the value read from STATUS1 to the caller. This
prepares the next patch.
rtc: s35390a: make sure all members in the output are set
The rtc core calls the .read_alarm with all fields initialized to 0. As
the s35390a driver doesn't touch some fields the returned date is
interpreted as a date in January 1900. So make sure all fields are set
to -1; some of them are then overwritten with the right data depending
on the hardware state.
In mainline this is done by commit d68778b80dd7 ("rtc: initialize output
parameter for read alarm to "uninitialized"") in the core. This is
considered to dangerous for stable as it might have side effects for
other rtc drivers that might for example rely on alarm->time.tm_sec
being initialized to 0.
- When alarm isn't enabled, set .enabled to zero instead of returning
-EINVAL.
- Ignore how IRQ1 is configured when determining if IRQ2 is on.
- The three alarm registers have an enable flag which must be
evaluated.
- The chip always triggers when the seconds register gets 0.
Note that the rtc framework however doesn't handle the result correctly
because it doesn't check wday being initialized and so interprets an
alarm being set for 10:00 AM in three days as 10:00 AM tomorrow (or
today if that's not over yet).
With the IRQ stack changes integrated, the XRX200 devices started
emitting a constant stream of kernel messages like this:
[ 565.415310] Spurious IRQ: CAUSE=0x1100c300
This is caused by IP0 getting handled by plat_irq_dispatch() rather than
its vectored interrupt handler, which is fixed by commit de856416e714
("MIPS: IRQ Stack: Fix erroneous jal to plat_irq_dispatch").
Fix plat_irq_dispatch() to handle non-vectored IPI interrupts correctly
by setting up IP2-6 as proper chained IRQ handlers and calling do_IRQ
for all MIPS CPU interrupts.
Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: John Crispin <john@phrozen.org> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15077/
[james.hogan@imgtec.com: tweaked commit message] Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Amit Pundir <amit.pundir@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This bug is triggered when pmd_present() returns true for non-present
hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
Using pmd_present() to determine present/non-present for hugetlb is not
correct, because pmd_present() checks multiple bits (not only
_PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()") Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We were accidentally only overriding the first VRAM placement. For BOs
with the RADEON_GEM_NO_CPU_ACCESS flag set,
radeon_ttm_placement_from_domain creates a second VRAM placment with
fpfn == 0. If VRAM is almost full, the first VRAM placement with
fpfn > 0 may not work, but the second one with fpfn == 0 always will
(the BO's current location trivially satisfies it). Because "moving"
the BO to its current location puts it back on the LRU list, this
results in an infinite loop.
Fixes: 2a85aedd117c ("drm/radeon: Try evicting from CPU accessible to
inaccessible VRAM first") Reported-by: Zachary Michaels <zmichaels@oblong.com> Reported-and-Tested-by: Julien Isorce <jisorce@oblong.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When releasing the bus, let's clear the bus pointers to mark it out. If
any further device unregister happens on this bus, we know that we're
done if we found the bus being released already.
Using KASAN, Dmitry found a bug in the rh_call_control() routine: If
buffer allocation fails, the routine returns immediately without
unlinking its URB from the control endpoint, eventually leading to
linked-list corruption.
This patch fixes the problem by jumping to the end of the routine
(where the URB is unlinked) when an allocation failure occurs.
A side effect of 89d8232411a8 ("tty/serial: atmel_serial: BUG: stop DMA
from transmitting in stop_tx") is that the console can be called with
TX path disabled. Then the system would hang trying to push charecters
out in atmel_console_putchar().
Signed-off-by: Nicolas Ferre <nicolas.ferre@microchip.com> Fixes: 89d8232411a8 ("tty/serial: atmel_serial: BUG: stop DMA from transmitting in stop_tx") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If uart_flush_buffer() is called between atmel_tx_dma() and
atmel_complete_tx_dma(), the circular buffer has been cleared, but not
atmel_port->tx_len.
That leads to a circular buffer overflow (dumping (UART_XMIT_SIZE -
atmel_port->tx_len) bytes).
Tested-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Richard Genoud <richard.genoud@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
WARNING: CPU: 0 PID: 774 at /build/linux-ROBWaj/linux-4.9.13/kernel/trace/trace_functions_graph.c:233 ftrace_return_to_handler+0x1aa/0x1e0
Bad frame pointer: expected f6919d98, received f6919db0
from func acpi_pm_device_sleep_wake return to c43b6f9d
The warning means that function graph tracing is broken for the
acpi_pm_device_sleep_wake() function. That's because the ACPI Makefile
unconditionally sets the '-Os' gcc flag to optimize for size. That's an
issue because mcount-based function graph tracing is incompatible with
'-Os' on x86, thanks to the following gcc bug:
I have another patch pending which will ensure that mcount-based
function graph tracing is never used with CONFIG_CC_OPTIMIZE_FOR_SIZE on
x86.
But this patch is needed in addition to that one because the ACPI
Makefile overrides that config option for no apparent reason. It has
had this flag since the beginning of git history, and there's no related
comment, so I don't know why it's there. As far as I can tell, there's
no reason for it to be there. The appropriate behavior is for it to
honor CONFIG_CC_OPTIMIZE_FOR_{SIZE,PERFORMANCE} like the rest of the
kernel.
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On this Dell AIO machine, the lineout jack does not work.
We found the pin 0x1a is assigned to lineout on this machine, and in
the past, we applied ALC298_FIXUP_DELL1_MIC_NO_PRESENCE to fix the
heaset-set mic problem for this machine, this fixup will redefine
the pin 0x1a to headphone-mic, as a result the lineout doesn't
work anymore.
After consulting with Dell, they told us this machine doesn't support
microphone via headset jack, so we add a new fixup which only defines
the pin 0x18 as the headset-mic.
[rearranged the fixup insertion position by tiwai in order to make the
merge with other branches easier -- tiwai]
Fixes: 59ec4b57bcae ("ALSA: hda - Fix headset mic detection problem for two dell machines") Signed-off-by: Hui Wang <hui.wang@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When a new event is queued while processing to resize the FIFO in
snd_seq_fifo_clear(), it may lead to a use-after-free, as the old pool
that is being queued gets removed. For avoiding this race, we need to
close the pool to be deleted and sync its usage before actually
deleting it.
The total ata xfer length may not be calculated properly, in that we do
not use the proper method to get an sg element dma length.
According to the code comment, sg_dma_len() should be used after
dma_map_sg() is called.
This issue was found by turning on the SMMUv3 in front of the hisi_sas
controller in hip07. Multiple sg elements were being combined into a
single element, but the original first element length was being use as
the total xfer length.
Fixes: ff2aeb1eb64c8a4770a6 ("libata: convert to chained sg") Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The user can control the size of the next command passed along, but the
value passed to the ioctl isn't checked against the usable max command
size.
Signed-off-by: Peter Chang <dpf@google.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mpt3sas has a firmware failure where it can only handle one pass through
ATA command at a time. If another comes in, contrary to the SAT
standard, it will hang until the first one completes (causing long
commands like secure erase to timeout). The original fix was to block
the device when an ATA command came in, but this caused a regression
with
scsi: srp_transport: Move queuecommand() wait code to SCSI core
So fix the original fix of the secure erase timeout by properly
returning SAM_STAT_BUSY like the SAT recommends. The original patch
also had a concurrency problem since scsih_qcmd is lockless at that
point (this is fixed by using atomic bitops to set and test the flag).
[mkp: addressed feedback wrt. test_bit and fixed whitespace]
Fixes: 18f6084a989ba1b (mpt3sas: Fix secure erase premature termination) Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Sreekanth Reddy <Sreekanth.Reddy@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: Joe Korty <joe.korty@ccur.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When relocating the p2m, take special care not to relocate it so
that is overlaps with the current location of the p2m/initrd. This is
needed since the full extent of the current location is not marked as a
reserved region in the e820.
This was seen to happen to a dom0 with a large initial p2m and a small
reserved region in the middle of the initial p2m.
... we don't reschedule a task under certain circumstances:
Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.
The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.
The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: fd7a4bed1835 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks") Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Filesystem encryption ostensibly supported revoking a keyring key that
had been used to "unlock" encrypted files, causing those files to become
"locked" again. This was, however, buggy for several reasons, the most
severe of which was that when key revocation happened to be detected for
an inode, its fscrypt_info was immediately freed, even while other
threads could be using it for encryption or decryption concurrently.
This could be exploited to crash the kernel or worse.
This patch fixes the use-after-free by removing the code which detects
the keyring key having been revoked, invalidated, or expired. Instead,
an encrypted inode that is "unlocked" now simply remains unlocked until
it is evicted from memory. Note that this is no worse than the case for
block device-level encryption, e.g. dm-crypt, and it still remains
possible for a privileged user to evict unused pages, inodes, and
dentries by running 'sync; echo 3 > /proc/sys/vm/drop_caches', or by
simply unmounting the filesystem. In fact, one of those actions was
already needed anyway for key revocation to work even somewhat sanely.
This change is not expected to break any applications.
In the future I'd like to implement a real API for fscrypt key
revocation that interacts sanely with ongoing filesystem operations ---
waiting for existing operations to complete and blocking new operations,
and invalidating and sanitizing key material and plaintext from the VFS
caches. But this is a hard problem, and for now this bug must be fixed.
This bug affected almost all versions of ext4, f2fs, and ubifs
encryption, and it was potentially reachable in any kernel configured
with encryption support (CONFIG_EXT4_ENCRYPTION=y,
CONFIG_EXT4_FS_ENCRYPTION=y, CONFIG_F2FS_FS_ENCRYPTION=y, or
CONFIG_UBIFS_FS_ENCRYPTION=y). Note that older kernels did not use the
shared fs/crypto/ code, but due to the potential security implications
of this bug, it may still be worthwhile to backport this fix to them.
Fixes: b7236e21d55f ("ext4 crypto: reorganize how we store keys in the inode") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Michael Halcrow <mhalcrow@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
to fill TXSTATUS, a well-defined default value is used, based on the
task's current value.
Suggested-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Ensure that if userspace supplies insufficient data to PTRACE_SETREGSET
to fill all the registers, the thread's old registers are preserved.
Signed-off-by: Dave Martin <Dave.Martin@arm.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
regs_set() and regs_get() are vulnerable to an off-by-1 buffer overrun
if CONFIG_CPU_H8S is set, since this adds an extra entry to
register_offset[] but not to user_regs_struct.
So, iterate over user_regs_struct based on its actual size, not based on
the length of register_offset[].
Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
gpr_set won't work correctly and can never have been tested, and the
correct behaviour is not clear due to the endianness-dependent task
layout.
So, just remove it. The core code will now return -EOPNOTSUPPORT when
trying to set NT_PRSTATUS on this architecture until/unless a correct
implementation is supplied.
Signed-off-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clearing the status bit on irq_unmask will discard any pending interrupt
that did arrive after the irq_ack, i.e. while the IRQ handler function
was executing.
Fixes: f365be092572 ("pinctrl: Add Qualcomm TLMM driver") Cc: Stephen Boyd <sboyd@codeaurora.org> Reported-by: Timur Tabi <timur@codeaurora.org> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When init_vqs runs, virtio_balloon.stats is either uninitialized or
contains stale values. The host updates its state with garbage data
because it has no way of knowing that this is just a marker buffer
used for signaling.
This patch updates the stats before pushing the initial buffer.
Alternative fixes:
* Push an empty buffer in init_vqs. Not easily done with the current
virtio implementation and violates the spec "Driver MUST supply the
same subset of statistics in all buffers submitted to the statsq".
* Push a buffer with invalid tags in init_vqs. Violates the same
spec clause, plus "invalid tag" is not really defined.
Note: the spec says:
When using the legacy interface, the device SHOULD ignore all values in
the first buffer in the statsq supplied by the driver after device
initialization. Note: Historically, drivers supplied an uninitialized
buffer in the first buffer.
Unfortunately QEMU does not seem to implement the recommendation
even for the legacy interface.
Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues. To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer. However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call. There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents. We do
not at this point check that the replay_window is within the allocated
memory. This leads to out-of-bounds reads and writes triggered by
netlink packets. This leads to memory corruption and the potential for
priviledge escalation.
We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn. It however does not check the replay_window
remains within that buffer. Add validation of the contained
replay_window.
Dmitry reports following splat:
INFO: trying to register non-static key.
the code is fine but needs lockdep annotation.
turning off the locking correctness validator.
CPU: 0 PID: 13059 Comm: syz-executor1 Not tainted 4.10.0-rc7-next-20170207 #1
[..]
spin_lock_bh include/linux/spinlock.h:304 [inline]
xfrm_policy_flush+0x32/0x470 net/xfrm/xfrm_policy.c:963
xfrm_policy_fini+0xbf/0x560 net/xfrm/xfrm_policy.c:3041
xfrm_net_init+0x79f/0x9e0 net/xfrm/xfrm_policy.c:3091
ops_init+0x10a/0x530 net/core/net_namespace.c:115
setup_net+0x2ed/0x690 net/core/net_namespace.c:291
copy_net_ns+0x26c/0x530 net/core/net_namespace.c:396
create_new_namespaces+0x409/0x860 kernel/nsproxy.c:106
unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
SYSC_unshare kernel/fork.c:2281 [inline]
Problem is that when we get error during xfrm_net_init we will call
xfrm_policy_fini which will acquire xfrm_policy_lock before it was
initialized. Just move it around so locks get set up first.
one can immediatelly see an UBSAN warning:
UBSAN: Undefined behaviour in crypto/algif_hash.c:187:7
variable length array bound value 0 <= 0
CPU: 0 PID: 15949 Comm: syz-executor Tainted: G E 4.4.30-0-default #1
...
Call Trace:
...
[<ffffffff81d598fd>] ? __ubsan_handle_vla_bound_not_positive+0x13d/0x188
[<ffffffff81d597c0>] ? __ubsan_handle_out_of_bounds+0x1bc/0x1bc
[<ffffffffa0e2204d>] ? hash_accept+0x5bd/0x7d0 [algif_hash]
[<ffffffffa0e2293f>] ? hash_accept_nokey+0x3f/0x51 [algif_hash]
[<ffffffffa0e206b0>] ? hash_accept_parent_nokey+0x4a0/0x4a0 [algif_hash]
[<ffffffff8235c42b>] ? SyS_accept+0x2b/0x40
It is a correct warning, as hash state is propagated to accept as zero,
but creating a zero-length variable array is not allowed in C.
Fix this as proposed by Herbert -- do "?: 1" on that site. No sizeof or
similar happens in the code there, so we just allocate one byte even
though we do not use the array.
fbcon can deal with vc_hi_font_mask (the upper 256 chars) and adjust
the vc attrs dynamically when vc_hi_font_mask is changed at
fbcon_init(). When the vc_hi_font_mask is set, it remaps the attrs in
the existing console buffer with one bit shift up (for 9 bits), while
it remaps with one bit shift down (for 8 bits) when the value is
cleared. It works fine as long as the font gets updated after fbcon
was initialized.
However, we hit a bizarre problem when the console is switched to
another fb driver (typically from vesafb or efifb to drmfb). At
switching to the new fb driver, we temporarily rebind the console to
the dummy console, then rebind to the new driver. During the
switching, we leave the modified attrs as is. Thus, the new fbcon
takes over the old buffer as if it were to contain 8 bits chars
(although the attrs are still shifted for 9 bits), and effectively
this results in the yellow color texts instead of the original white
color, as found in the bugzilla entry below.
An easy fix for this is to re-adjust the attrs before leaving the
fbcon at con_deinit callback. Since the code to adjust the attrs is
already present in the current fbcon code, in this patch, we simply
factor out the relevant code, and call it from fbcon_deinit().
During a PCI error recovery, like the ones provoked by EEH in the ppc64
platform, all IO to the device must be blocked while the recovery is
completed. Current 8250_pci implementation only suspends the port
instead of detaching it, which doesn't prevent incoming accesses like
TIOCMGET and TIOCMSET calls from reaching the device. Those end up
racing with the EEH recovery, crashing it. Similar races were also
observed when opening the device and when shutting it down during
recovery.
This patch implements a more robust IO blockage for the 8250_pci
recovery by unregistering the port at the beginning of the procedure and
re-adding it afterwards. Since the port is detached from the uart
layer, we can be sure that no request will make through to the device
during recovery. This is similar to the solution used by the JSM serial
driver.
I thank Peter Hurley <peter@hurleysoftware.com> for valuable input on
this one over one year ago.