Will Deacon [Mon, 24 Aug 2020 11:29:40 +0000 (12:29 +0100)]
KVM: arm/arm64: Don't reschedule in unmap_stage2_range()
Upstream commits fdfe7cbd5880 ("KVM: Pass MMU notifier range flags to
kvm_unmap_hva_range()") and b5331379bc62 ("KVM: arm64: Only reschedule
if MMU_NOTIFIER_RANGE_BLOCKABLE is not set") fix a "sleeping from invalid
context" BUG caused by unmap_stage2_range() attempting to reschedule when
called on the OOM path.
Unfortunately, these patches rely on the MMU notifier callback being
passed knowledge about whether or not blocking is permitted, which was
introduced in 4.19. Rather than backport this considerable amount of
infrastructure just for KVM on arm, instead just remove the conditional
reschedule.
Cc: <stable@vger.kernel.org> # v4.9 only Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Juergen Gross [Thu, 20 Aug 2020 06:59:08 +0000 (08:59 +0200)]
xen: don't reschedule in preemption off sections
For support of long running hypercalls xen_maybe_preempt_hcall() is
calling cond_resched() in case a hypercall marked as preemptible has
been interrupted.
Normally this is no problem, as only hypercalls done via some ioctl()s
are marked to be preemptible. In rare cases when during such a
preemptible hypercall an interrupt occurs and any softirq action is
started from irq_exit(), a further hypercall issued by the softirq
handler will be regarded to be preemptible, too. This might lead to
rescheduling in spite of the softirq handler potentially having set
preempt_disable(), leading to splats like:
Firstly, the worst case scenario should assume the whole range was covered
by pmd sharing. The old algorithm might not work as expected for ranges
like (1g-2m, 1g+2m), where the adjusted range should be (0, 1g+2m) but the
expected range should be (0, 2g).
Since at it, remove the loop since it should not be required. With that,
the new code should be faster too when the invalidating range is huge.
Mike said:
: With range (1g-2m, 1g+2m) within a vma (0, 2g) the existing code will only
: adjust to (0, 1g+2m) which is incorrect.
:
: We should cc stable. The original reason for adjusting the range was to
: prevent data corruption (getting wrong page). Since the range is not
: always adjusted correctly, the potential for corruption still exists.
:
: However, I am fairly confident that adjust_range_if_pmd_sharing_possible
: is only gong to be called in two cases:
:
: 1) for a single page
: 2) for range == entire vma
:
: In those cases, the current code should produce the correct results.
:
: To be safe, let's just cc stable.
Fixes: 017b1660df89 ("mm: migration: fix migration of huge PMD shared pages") Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200730201636.74778-1-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When adding a new fd to an epoll, and that this new fd is an
epoll fd itself, we recursively scan the fds attached to it
to detect cycles, and add non-epool files to a "check list"
that gets subsequently parsed.
However, this check list isn't completely safe when deletions
can happen concurrently. To sidestep the issue, make sure that
a struct file placed on the check list sees its f_count increased,
ensuring that a concurrent deletion won't result in the file
disapearing from under our feet.
Cc: stable@vger.kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We have powerpc specific logic in our page fault handling to decide if
an access to an unmapped address below the stack pointer should expand
the stack VMA.
The code was originally added in 2004 "ported from 2.4". The rough
logic is that the stack is allowed to grow to 1MB with no extra
checking. Over 1MB the access must be within 2048 bytes of the stack
pointer, or be from a user instruction that updates the stack pointer.
The 2048 byte allowance below the stack pointer is there to cover the
288 byte "red zone" as well as the "about 1.5kB" needed by the signal
delivery code.
Unfortunately since then the signal frame has expanded, and is now
4224 bytes on 64-bit kernels with transactional memory enabled. This
means if a process has consumed more than 1MB of stack, and its stack
pointer lies less than 4224 bytes from the next page boundary, signal
delivery will fault when trying to expand the stack and the process
will see a SEGV.
The total size of the signal frame is the size of struct rt_sigframe
(which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
64-bit).
The 2048 byte allowance was correct until 2008 as the signal frame
was:
At this point we should have been exposed to the bug, though as far as
I know it was never reported. I no longer have a system old enough to
easily test on.
Then in 2010 commit 320b2b8de126 ("mm: keep a guard page below a
grow-down stack segment") caused our stack expansion code to never
trigger, as there was always a VMA found for a write up to PAGE_SIZE
below r1.
That meant the bug was hidden as we continued to expand the signal
frame in commit 2b0a576d15e0 ("powerpc: Add new transactional memory
state to the signal context") (Feb 2013):
Then finally in 2017, commit 1be7107fbe18 ("mm: larger stack guard
gap, between vmas") exposed us to the existing bug, because it changed
the stack VMA to be the correct/real size, meaning our stack expansion
code is now triggered.
Fix it by increasing the allowance to 4224 bytes.
Hard-coding 4224 is obviously unsafe against future expansions of the
signal frame in the same way as the existing code. We can't easily use
sizeof() because the signal frame structure is not in a header. We
will either fix that, or rip out all the custom stack expansion
checking logic entirely.
Fixes: ce48b2100785 ("powerpc: Add VSX context save/restore, ptrace and signal support") Cc: stable@vger.kernel.org # v2.6.27+ Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
As per PAPR we have to look for both EPOW sensor value and event
modifier to identify the type of event and take appropriate action.
In LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes":
SYSTEM_SHUTDOWN 3
The system must be shut down. An EPOW-aware OS logs the EPOW error
log information, then schedules the system to be shut down to begin
after an OS defined delay internal (default is 10 minutes.)
Then in section 10.3.2.2.8 there is table 146 "Platform Event Log
Format, Version 6, EPOW Section", which includes the "EPOW Event
Modifier":
For EPOW sensor value = 3
0x01 = Normal system shutdown with no additional delay
0x02 = Loss of utility power, system is running on UPS/Battery
0x03 = Loss of system critical functions, system should be shutdown
0x04 = Ambient temperature too high
All other values = reserved
We have a user space tool (rtas_errd) on LPAR to monitor for
EPOW_SHUTDOWN_ON_UPS. Once it gets an event it initiates shutdown
after predefined time. It also starts monitoring for any new EPOW
events. If it receives "Power restored" event before predefined time
it will cancel the shutdown. Otherwise after predefined time it will
shutdown the system.
Commit 79872e35469b ("powerpc/pseries: All events of
EPOW_SYSTEM_SHUTDOWN must initiate shutdown") changed our handling of
the "on UPS/Battery" case, to immediately shutdown the system. This
breaks existing setups that rely on the userspace tool to delay
shutdown and let the system run on the UPS.
Fixes: 79872e35469b ("powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must initiate shutdown") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[mpe: Massage change log and add PAPR references] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200820061844.306460-1-hegdevasant@linux.vnet.ibm.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
b53_common.c:1583:13: warning: The left expression of the compound
assignment is an uninitialized value. The computed value will
also be garbage
ent.port &= ~BIT(port);
~~~~~~~~ ^
ent is set by a successful call to b53_arl_read(). Unsuccessful
calls are caught by an switch statement handling specific returns.
b32_arl_read() calls b53_arl_op_wait() which fails with the
unhandled -ETIMEDOUT.
So add -ETIMEDOUT to the switch statement. Because
b53_arl_op_wait() already prints out a message, do not add another
one.
Fixes: 1da6df85c6fb ("net: dsa: b53: Implement ARL add/del/dump operations") Signed-off-by: Tom Rix <trix@redhat.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
When power_up_sst() fails, stream needs to be freed
just like when try_module_get() fails. However, current
code is returning directly and ends up leaking memory.
Fixes: 0121327c1a68b ("ASoC: Intel: mfld-pcm: add control for powering up/down dsp") Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Acked-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/20200813084112.26205-1-dinghao.liu@zju.edu.cn Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Trusted VF with unicast promiscuous mode set, could listen to TX
traffic of other VFs.
Set unicast promiscuous mode to RX traffic, if VSI has port VLAN
configured. Rename misleading I40E_AQC_SET_VSI_PROMISC_TX bit to
I40E_AQC_SET_VSI_PROMISC_RX_ONLY. Aligned unicast promiscuous with
VLAN to the one without VLAN.
Fixes: 6c41a7606967 ("i40e: Add promiscuous on VLAN support") Fixes: 3b1200891b7f ("i40e: When in promisc mode apply promisc mode to Tx Traffic as well") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
If for any reason a directory passed to do_split() does not have enough
active entries to exceed half the size of the block, we can end up
iterating over all "count" entries without finding a split point.
In this case, count == move, and split will be zero, and we will
attempt a negative index into map[].
Guard against this by detecting this case, and falling back to
split-to-half-of-count instead; in this case we will still have
plenty of space (> half blocksize) in each split block.
Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks") Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
These accessors must be used to read/write a big-endian bus. The value
returned or written is native-endian.
However, these accessors are defined using be{16,32}_to_cpu() or
cpu_to_be{16,32}() to make the endian conversion but these expect a
__be{16,32} when none is present. Keeping them would need a force cast
that would solve nothing at all.
So, do the conversion using swab{16,32}, like done in asm-generic for
similar situations.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/20200622114232.80039-1-luc.vanoostenryck@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
If xfs_sysfs_init is called with parent_kobj == NULL, UBSAN
shows the following warning:
UBSAN: null-ptr-deref in ./fs/xfs/xfs_sysfs.h:37:23
member access within null pointer of type 'struct xfs_kobj'
Call Trace:
dump_stack+0x10e/0x195
ubsan_type_mismatch_common+0x241/0x280
__ubsan_handle_type_mismatch_v1+0x32/0x40
init_xfs_fs+0x12b/0x28f
do_one_initcall+0xdd/0x1d0
do_initcall_level+0x151/0x1b6
do_initcalls+0x50/0x8f
do_basic_setup+0x29/0x2b
kernel_init_freeable+0x19f/0x20b
kernel_init+0x11/0x1e0
ret_from_fork+0x22/0x30
Fix it by checking parent_kobj before the code accesses its member.
Signed-off-by: Eiichi Tsukata <devel@etsukata.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: minor whitespace edits] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
The loop may exist if vq->broken is true,
virtqueue_get_buf_ctx_packed or virtqueue_get_buf_ctx_split
will return NULL, so virtnet_poll will reschedule napi to
receive packet, it will lead cpu usage(si) to 100%.
call trace as below:
virtnet_poll
virtnet_receive
virtqueue_get_buf_ctx
virtqueue_get_buf_ctx_packed
virtqueue_get_buf_ctx_split
virtqueue_napi_complete
virtqueue_poll //return true
virtqueue_napi_schedule //it will reschedule napi
to fix this, return false if vq is broken in virtqueue_poll.
The log of UAF problem is listed below.
BUG: KASAN: use-after-free in jffs2_rmdir+0xa4/0x1cc [jffs2] at addr c1f165fc
Read of size 4 by task rm/8283
=============================================================================
BUG kmalloc-32 (Tainted: P B O ): kasan: bad access detected
-----------------------------------------------------------------------------
The root cause is that we don't get "jffs2_inode_info.sem" before
we scan list "jffs2_inode_info.dents" in function jffs2_rmdir.
This patch add codes to get "jffs2_inode_info.sem" before we scan
"jffs2_inode_info.dents" to slove the UAF problem.
Signed-off-by: Zhe Li <lizhe67@huawei.com> Reviewed-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sasha Levin <sashal@kernel.org>
xfs_trans_dqresv is the function that we use to make reservations
against resource quotas. Each resource contains two counters: the
q_core counter, which tracks resources allocated on disk; and the dquot
reservation counter, which tracks how much of that resource has either
been allocated or reserved by threads that are working on metadata
updates.
For disk blocks, we compare the proposed reservation counter against the
hard and soft limits to decide if we're going to fail the operation.
However, for inodes we inexplicably compare against the q_core counter,
not the incore reservation count.
Since the q_core counter is always lower than the reservation count and
we unlock the dquot between reservation and transaction commit, this
means that multiple threads can reserve the last inode count before we
hit the hard limit, and when they commit, we'll be well over the hard
limit.
Fix this by checking against the incore inode reservation counter, since
we would appear to maintain that correctly (and that's what we report in
GETQUOTA).
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
The Cache Control Register (CACR) of the ColdFire V3 has bits that
control high level caching functions, and also enable/disable the use
of the alternate stack pointer register (the EUSP bit) to provide
separate supervisor and user stack pointer registers. The code as
it is today will blindly clear the EUSP bit on cache actions like
invalidation. So it is broken for this case - and that will result
in failed booting (interrupt entry and exit processing will be
completely hosed).
This only affects ColdFire V3 parts that support the alternate stack
register (like the 5329 for example) - generally speaking new parts do,
older parts don't. It has no impact on ColdFire V3 parts with the single
stack pointer, like the 5307 for example.
Fix the cache bit defines used, so they maintain the EUSP bit when
carrying out cache actions through the CACR register.
If platform_driver_register() fails within vpss_init() resources are not
cleaned up. The patch fixes this issue by introducing the corresponding
error handling.
Found by Linux Driver Verification project (linuxtesting.org).
It is confirmed that Micron device needs DELAY_BEFORE_LPM quirk to have a
delay before VCC is powered off. Sdd Micron vendor ID and this quirk for
Micron devices.
Link: https://lore.kernel.org/r/20200612012625.6615-2-stanley.chu@mediatek.com Reviewed-by: Bean Huo <beanhuo@micron.com> Reviewed-by: Alim Akhtar <alim.akhtar@samsung.com> Signed-off-by: Stanley Chu <stanley.chu@mediatek.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
ext4_search_dir() and ext4_generic_delete_entry() can be called both for
standard director blocks and for inline directories stored inside inode
or inline xattr space. For the second case we didn't call
ext4_check_dir_entry() with proper constraints that could result in
accepting corrupted directory entry as well as false positive filesystem
errors like:
EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400:
block 113246792: comm dockerd: bad entry in directory: directory entry too
close to block end - offset=0, inode=28320403, rec_len=32, name_len=8,
size=4096
Fix the arguments passed to ext4_check_dir_entry().
Fixes: 109ba779d6cc ("ext4: check for directory entries too close to block end") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
When ext4 encryption was originally merged, we were encrypting the
user-specified filename in ext4_match(), introducing a lot of additional
complexity into ext4_match() and its callers. This has since been
changed to encrypt the filename earlier, so we can remove the gunk
that's no longer needed. This more or less reverts ext4_search_dir()
and ext4_find_dest_de() to the way they were in the v4.0 kernel.
The following race is observed with the repeated online, offline and a
delay between two successive online of memory blocks of movable zone.
P1 P2
Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values,i.e., pcp->high = 0 &
pcp->batch = 1.
Allocate the pages from the
movable zone.
Try to Online the second memory
block in the movable zone thus it
entered the online_pages() but yet
to call zone_pcp_update().
This process is entered into
the exit path thus it tries
to release the order-0 pages
to pcp lists through
free_unref_page_commit().
As pcp->high = 0, pcp->count = 1
proceed to call the function
free_pcppages_bulk().
Update the pcp values thus the
new pcp values are like, say,
pcp->high = 378, pcp->batch = 63.
Read the pcp's batch value using
READ_ONCE() and pass the same to
free_pcppages_bulk(), pcp values
passed here are, batch = 63,
count = 1.
Since num of pages in the pcp
lists are less than ->batch,
then it will stuck in
while(list_empty(list)) loop
with interrupts disabled thus
a core hung.
Avoid this by ensuring free_pcppages_bulk() is called with proper count of
pcp list pages.
The mentioned race is some what easily reproducible without [1] because
pcp's are not updated for the first memory block online and thus there is
a enough race window for P2 between alloc+free and pcp struct values
update through onlining of second memory block.
With [1], the race still exists but it is very narrow as we update the pcp
struct values for the first memory block online itself.
This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory, or
no other memory yet).
The lowmem_reserve arrays provide a means of applying pressure against
allocations from lower zones that were targeted at higher zones. Its
values are a function of the number of pages managed by higher zones and
are assigned by a call to the setup_per_zone_lowmem_reserve() function.
The function is initially called at boot time by the function
init_per_zone_wmark_min() and may be called later by accesses of the
/proc/sys/vm/lowmem_reserve_ratio sysctl file.
The function init_per_zone_wmark_min() was moved up from a module_init to
a core_initcall to resolve a sequencing issue with khugepaged.
Unfortunately this created a sequencing issue with CMA page accounting.
The CMA pages are added to the managed page count of a zone when
cma_init_reserved_areas() is called at boot also as a core_initcall. This
makes it uncertain whether the CMA pages will be added to the managed page
counts of their zones before or after the call to
init_per_zone_wmark_min() as it becomes dependent on link order. With the
current link order the pages are added to the managed count after the
lowmem_reserve arrays are initialized at boot.
This means the lowmem_reserve values at boot may be lower than the values
used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
ratio values are unchanged.
In many cases the difference is not significant, but for example
an ARM platform with 1GB of memory and the following memory layout
cma: Reserved 256 MiB at 0x0000000030000000
Zone ranges:
DMA [mem 0x0000000000000000-0x000000002fffffff]
Normal empty
HighMem [mem 0x0000000030000000-0x000000003fffffff]
would result in 0 lowmem_reserve for the DMA zone. This would allow
userspace to deplete the DMA zone easily.
Funnily enough
$ cat /proc/sys/vm/lowmem_reserve_ratio
would fix up the situation because as a side effect it forces
setup_per_zone_lowmem_reserve.
This commit breaks the link order dependency by invoking
init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
have the chance to be properly accounted in their zone(s) and allowing
the lowmem_reserve arrays to receive consistent values.
Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization") Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jason Baron <jbaron@akamai.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
'chan->buf' is malloced in relay_open() by alloc_percpu() but not free
while destroy the relay channel. Fix it by adding free_percpu() before
return from relay_destroy_channel().
Fixes: 017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Axtens <dja@axtens.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Akash Goel <akash.goel@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200817122826.48518-1-weiyongjun1@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
romfs has a superblock field that limits the size of the filesystem; data
beyond that limit is never accessed.
romfs_dev_read() fetches a caller-supplied number of bytes from the
backing device. It returns 0 on success or an error code on failure;
therefore, its API can't represent short reads, it's all-or-nothing.
However, when romfs_dev_read() detects that the requested operation would
cross the filesystem size limit, it currently silently truncates the
requested number of bytes. This e.g. means that when the content of a
file with size 0x1000 starts one byte before the filesystem size limit,
->readpage() will only fill a single byte of the supplied page while
leaving the rest uninitialized, leaking that uninitialized memory to
userspace.
Fix it by returning an error code instead of truncating the read when the
requested read operation would go beyond the end of the filesystem.
Fixes: da4458bda237 ("NOMMU: Make it possible for RomFS to use MTD devices directly") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200818013202.2246365-1-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Chris Murphy reported a problem where rpm ostree will bind mount a bunch
of things for whatever voodoo it's doing. But when it does this
/proc/mounts shows something like
Despite subvolid=256 being subvol=/foo. This is because we're just
spitting out the dentry of the mount point, which in the case of bind
mounts is the source path for the mountpoint. Instead we should spit
out the path to the actual subvol. Fix this by looking up the name for
the subvolid we have mounted. With this fix the same test looks like
this
The functions will be used outside of export.c and super.c to allow
resolving subvolume name from a given id, eg. for subvolume deletion by
id ioctl.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ split from the next patch ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
__khugepaged_enter(): yes, when one thread is about to dump core, has set
core_state, and is waiting for others, another might do something calling
__khugepaged_enter(), which now crashes because I lumped the core_state
test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
think it's best to lump them together, so just in this exceptional case,
check mm->mm_users directly instead of khugepaged_test_exit().
Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Yang Shi <shy828301@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Eric Dumazet <edumazet@google.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Move collapse_huge_page()'s mmget_still_valid() check into
khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP
only, and earned its mmget_still_valid() check because it inserts a huge
pmd entry in place of the page table's pmd entry; whereas
collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
merely clears the page table's pmd entry. But core dumping without mmap
lock must have been as open to mistaking a racily cleared pmd entry for a
page table at physical page 0, as exit_mmap() was. And we certainly have
no interest in mapping as a THP once dumping core.
Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
In calculation of the cpu mask for the hwlat kernel thread, the wrong
cpu mask is used instead of the tracing_cpumask, this causes the
tracing/tracing_cpumask useless for hwlat tracer. Fixes it.
Link: https://lkml.kernel.org/r/20200730082318.42584-2-haokexin@gmail.com Cc: Ingo Molnar <mingo@redhat.com> Cc: stable@vger.kernel.org Fixes: 0330f7aa8ee6 ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs") Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Instead of initializing the affinity of the hwlat kthread in the thread
itself, simply set up the initial affinity at thread creation. This
simplifies the code.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix the memory leakage in debuginfo__find_trace_events() when the probe
point is not found in the debuginfo. If there is no probe point found in
the debuginfo, debuginfo__find_probes() will NOT return -ENOENT, but 0.
Thus the caller of debuginfo__find_probes() must check the tf.ntevs and
release the allocated memory for the array of struct probe_trace_event.
The current code releases the memory only if the debuginfo__find_probes()
hits an error but not checks tf.ntevs. In the result, the memory allocated
on *tevs are not released if tf.ntevs == 0.
This fixes the memory leakage by checking tf.ntevs == 0 in addition to
ret < 0.
Both of the two LVDS channels should be disabled for split mode
in the encoder's ->disable() callback, because they are enabled
in the encoder's ->enable() callback.
Fixes: 6556f7f82b9c ("drm: imx: Move imx-drm driver out of staging") Cc: Philipp Zabel <p.zabel@pengutronix.de> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Pengutronix Kernel Team <kernel@pengutronix.de> Cc: NXP Linux Team <linux-imx@nxp.com> Cc: <stable@vger.kernel.org> Signed-off-by: Liu Ying <victor.liu@nxp.com> Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
Omitting suffixes from instructions in AT&T mode is bad practice when
operand size cannot be determined by the assembler from register
operands, and is likely going to be warned about by upstream gas in the
future (mine does already). Add the missing suffixes here. Note that for
64-bit this means some operations change from being 32-bit to 64-bit.
Oscar Salvador [Tue, 18 Aug 2020 11:00:46 +0000 (13:00 +0200)]
mm: Avoid calling build_all_zonelists_init under hotplug context
Recently a customer of ours experienced a crash when booting the
system while enabling memory-hotplug.
The problem is that Normal zones on different nodes don't get their private
zone->pageset allocated, and keep sharing the initial boot_pageset.
The sharing between zones is normally safe as explained by the comment for
boot_pageset - it's a percpu structure, and manipulations are done with
disabled interrupts, and boot_pageset is set up in a way that any page placed
on its pcplist is immediately flushed to shared zone's freelist, because
pcp->high == 1.
However, the hotplug operation updates pcp->high to a higher value as it
expects to be operating on a private pageset.
The problem is in build_all_zonelists(), which is called when the first range
of pages is onlined for the Normal zone of node X or Y:
if (system_state == SYSTEM_BOOTING) {
build_all_zonelists_init();
} else {
#ifdef CONFIG_MEMORY_HOTPLUG
if (zone)
setup_zone_pageset(zone);
#endif
/* we have to stop all cpus to guarantee there is no user
of zonelist */
stop_machine(__build_all_zonelists, pgdat, NULL);
/* cpuset refresh routine should be here */
}
When called during hotplug, it should execute the setup_zone_pageset(zone)
which allocates the private pageset.
However, with memhp_default_state=online, this happens early while
system_state == SYSTEM_BOOTING is still true, hence this step is skipped.
(and build_all_zonelists_init() is probably unsafe anyway at this point).
Another hotplug operation on the same zone then leads to zone_pcp_update(zone)
called from online_pages(), which updates the pcp->high for the shared
boot_pageset to a value higher than 1.
At that point, pages freed from Node X and Y Normal zones can end up on the same
pcplist and from there they can be freed to the wrong zone's freelist,
leading to the corruption and crashes.
Please, note that upstream has fixed that differently (and unintentionally) by
adding another boot state (SYSTEM_SCHEDULING), which is set before smp_init().
That should happen before memory hotplug events even with memhp_default_state=online.
Backporting that would be too intrusive.
Signed-off-by: Oscar Salvador <osalvador@suse.de> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> # for stable trees Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Only once have I seen this scenario (and forgot even to notice what forced
the eventual crash): a sequence of "BUG: Bad page map" alerts from
vm_normal_page(), from zap_pte_range() servicing exit_mmap();
pmd:00000000, pte values corresponding to data in physical page 0.
The pte mappings being zapped in this case were supposed to be from a huge
page of ext4 text (but could as well have been shmem): my belief is that
it was racing with collapse_file()'s retract_page_tables(), found *pmd
pointing to a page table, locked it, but *pmd had become 0 by the time
start_pte was decided.
In most cases, that possibility is excluded by holding mmap lock; but
exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
checks khugepaged_test_exit() after acquiring mmap lock:
khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
for example. But retract_page_tables() did not: fix that.
The fix is for retract_page_tables() to check khugepaged_test_exit(),
after acquiring mmap lock, before doing anything to the page table.
Getting the mmap lock serializes with __mmput(), which briefly takes and
drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
mm_users makes sure we don't touch the page table once exit_mmap() might
reach it, since exit_mmap() will be proceeding without mmap lock, not
expecting anyone to be racing with it.
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Song Liu <songliubraving@fb.com> Cc: <stable@vger.kernel.org> [4.8+] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Landisk setup code maps the CF IDE area using ioremap_prot(), and
passes the resulting virtual addresses to the pata_platform driver,
disguising them as I/O port addresses. Hence the pata_platform driver
translates them again using ioport_map().
As CONFIG_GENERIC_IOMAP=n, and CONFIG_HAS_IOPORT_MAP=y, the
SuperH-specific mapping code in arch/sh/kernel/ioport.c translates
I/O port addresses to virtual addresses by adding sh_io_port_base, which
defaults to -1, thus breaking the assumption of an identity mapping.
Recently xHCI driver switched to tasklets in the commit 36dc01657b49
("usb: host: xhci: Support running urb giveback in tasklet context").
The handle_irq_event_* functions are expected to be called with interrupts
disabled and they rightfully complain here because we run in tasklet context
with interrupts enabled.
Use a event spinlock to protect event handler from being interrupted.
Note, that there are only two users of this GPIO and ADC drivers and both of
them are using generic_handle_irq() which makes above happen.
Fixes: 338a12814297 ("mfd: Add support for Diolan DLN-2 devices") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The 64 bit ino is being compared to the product of two u32 values,
however, the multiplication is being performed using a 32 bit multiply so
there is a potential of an overflow. To be fully safe, cast uspi->s_ncg
to a u64 to ensure a 64 bit multiplication occurs to avoid any chance of
overflow.
Fixes: f3e2a520f5fb ("ufs: NFS support") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/20200715170355.1081713-1-colin.king@canonical.com
Addresses-Coverity: ("Unintentional integer overflow") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Fix the missing clk_disable_unprepare() before return
from emac_clks_phase1_init() in the error handling case.
Fixes: b9b17debc69d ("net: emac: emac gigabit ethernet controller driver") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Acked-by: Timur Tabi <timur@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
These if statements are supposed to be true if we ended the
list_for_each_entry() loops without hitting a break statement but they
don't work.
In the first loop, we increment "i" after the "if (i == unit)" condition
so we don't necessarily know that "i" is not equal to unit at the end of
the loop.
In the second loop we exit when mode is not pointing to a valid
drm_display_mode struct so it doesn't make sense to check "mode->type".
Fixes: a278724aa23c ("drm/vmwgfx: Implement fbdev on kms v2") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Roland Scheidegger <sroland@vmware.com> Signed-off-by: Roland Scheidegger <sroland@vmware.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Currently when the call to fsp_reg_write fails -EIO is not being returned
because the count is being returned instead of the return value in retval.
Fix this by returning the value in retval instead of count.
Addresses-Coverity: ("Unused value") Fixes: fc69f4a6af49 ("Input: add new driver for Sentelic Finger Sensing Pad") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20200603141218.131663-1-colin.king@canonical.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
In case of error, the function clk_register() returns ERR_PTR()
and never returns NULL. The NULL test in the return value check
should be replaced with IS_ERR().
Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Link: https://lore.kernel.org/r/20200713032143.21362-1-vulab@iscas.ac.cn Acked-by: Barry Song <baohua@kernel.org> Fixes: 7bf21bc81f28 ("clk: sirf: re-arch to make the codes support both prima2 and atlas6") Signed-off-by: Stephen Boyd <sboyd@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
When the SSR interrupt is activated, it will detect every STOP condition
on the bus, not only the ones after we have been addressed. So, enable
this interrupt only after we have been addressed, and disable it
otherwise.
Fixes: de20d1857dd6 ("i2c: rcar: add slave support") Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
Set proper masks to avoid invalid input spillover to reserved bits.
Signed-off-by: Liu Yi L <yi.l.liu@intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20200724014925.15523-2-baolu.lu@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
It is possible for the call to omap_iommu_dump_ctx to return
a negative error number, so check for the failure and return
the error number rather than pass the negative value to
simple_read_from_buffer.
Fixes: 14e0e6796a0d ("OMAP: iommu: add initial debugfs support") Signed-off-by: Colin Ian King <colin.king@canonical.com> Link: https://lore.kernel.org/r/20200714192211.744776-1-colin.king@canonical.com
Addresses-Coverity: ("Improper use of negative value") Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
Whilst it doesn't matter if the internal 32k clock register settings
are cleaned up on exit, as the part will be turned off losing any
settings, hence the driver hasn't historially bothered. The external
clock should however be cleaned up, as it could cause clocks to be
left on, and will at best generate a warning on unbind.
Add clean up on both the probe error path and unbind for the 32k
clock.
Fixes: cdd8da8cc66b ("mfd: arizona: Add gating of external MCLKn clocks") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
The flag indicating a watchdog timeout having occurred normally persists
till Power-On Reset of the Fintek Super I/O chip. The user can clear it
by writing a `1' to the bit.
The driver doesn't offer a restart method, so regular system reboot
might not reset the Super I/O and if the watchdog isn't enabled, we
won't touch the register containing the bit on the next boot.
In this case all subsequent regular reboots will be wrongly flagged
by the driver as being caused by the watchdog.
Fix this by having the flag cleared after read. This is also done by
other drivers like those for the i6300esb and mpc8xxx_wdt.
The flags that should be or-ed into the watchdog_info.options by drivers
all start with WDIOF_, e.g. WDIOF_SETTIMEOUT, which indicates that the
driver's watchdog_ops has a usable set_timeout.
WDIOC_SETTIMEOUT was used instead, which expands to 0xc0045706, which
equals:
These were so far indicated to userspace on WDIOC_GETSUPPORT.
As the driver has not yet been migrated to the new watchdog kernel API,
the constant can just be dropped without substitute.
Fixes: 96cb4eb019ce ("watchdog: f71808e_wdt: new watchdog driver for Fintek F71808E and F71882FG") Cc: stable@vger.kernel.org Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20200611191750.28096-4-a.fatoum@pengutronix.de Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The driver supports populating bootstatus with WDIOF_CARDRESET, but so
far userspace couldn't portably determine whether absence of this flag
meant no watchdog reset or no driver support. Or-in the bit to fix this.
The tcpa_statistic_send is the function being kprobed. After analysis,
the root cause is that the fourth parameter regs of kprobe_ftrace_handler
is NULL. Why regs is NULL? We use the crash tool to analyze the kdump.
The tcpa_statistic_send calls ftrace_caller instead of ftrace_regs_caller.
So it is reasonable that the fourth parameter regs of kprobe_ftrace_handler
is NULL. In theory, we should call the ftrace_regs_caller instead of the
ftrace_caller. After in-depth analysis, we found a reproducible path.
Writing a simple kernel module which starts a periodic timer. The
timer's handler is named 'kprobe_test_timer_handler'. The module
name is kprobe_test.ko.
We mark the kprobe as GONE but not disarm the kprobe in the step 4).
The step 5) also do not disarm the kprobe when unregister kprobe. So
we do not remove the ip from the filter. In this case, when the module
loads again in the step 6), we will replace the code to ftrace_caller
via the ftrace_module_enable(). When we register kprobe again, we will
not replace ftrace_caller to ftrace_regs_caller because the ftrace is
disabled in the step 3). So the step 7) will trigger kernel panic. Fix
this problem by disarming the kprobe when the module is going away.
When module loaded and enabled, we will use __ftrace_replace_code
for module if any ftrace_ops referenced it found. But we will get
wrong ftrace_addr for module rec in ftrace_get_addr_new, because
rec->flags has not been setup correctly. It can cause the callback
function of a ftrace_ops has FTRACE_OPS_FL_SAVE_REGS to be called
with pt_regs set to NULL.
So setup correct FTRACE_FL_REGS flags for rec when we call
referenced_filters to find ftrace_ops references it.
Link: https://lkml.kernel.org/r/20200728180554.65203-1-zhouchengming@bytedance.com Cc: stable@vger.kernel.org Fixes: 8c4f3c3fa9681 ("ftrace: Check module functions being traced on reload") Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Dan Carpenter reported the following static checker warning.
fs/ocfs2/super.c:1269 ocfs2_parse_options() warn: '(-1)' 65535 can't fit into 32767 'mopt->slot'
fs/ocfs2/suballoc.c:859 ocfs2_init_inode_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_inode_steal_slot'
fs/ocfs2/suballoc.c:867 ocfs2_init_meta_steal_slot() warn: '(-1)' 65535 can't fit into 32767 'osb->s_meta_steal_slot'
That's because OCFS2_INVALID_SLOT is (u16)-1. Slot number in ocfs2 can be
never negative, so change s16 to u16.
Fixes: 9277f8334ffc ("ocfs2: fix value of OCFS2_INVALID_SLOT") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200627001259.19757-1-junxiao.bi@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sbi->s_freeinodes_counter is only decreased by the ext2 code, it is never
increased. This patch fixes it.
Note that sbi->s_freeinodes_counter is only used in the algorithm that
tries to find the group for new allocations, so this bug is not easily
visible (the only visibility is that the group finding algorithm selects
inoptinal result).
There are some meta data of bcache are allocated by multiple pages,
and they are used as bio bv_page for I/Os to the cache device. for
example cache_set->uuids, cache->disk_buckets, journal_write->data,
bset_tree->data.
For such meta data memory, all the allocated pages should be treated
as a single memory block. Then the memory management and underlying I/O
code can treat them more clearly.
This patch adds __GFP_COMP flag to all the location allocating >0 order
pages for the above mentioned meta data. Then their pages are treated
as compound pages now.
In degraded raid5, we need to read parity to do reconstruct-write when
data disks fail. However, we can not read parity from
handle_stripe_dirtying() in force reconstruct-write mode.
Reproducible Steps:
1. Create degraded raid5
mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing
2. Set rmw_level to 0
echo 0 > /sys/block/md2/md/rmw_level
3. IO to raid5
Now some io may be stuck in raid5. We can use handle_stripe_fill() to read
the parity in this situation.
Cc: <stable@vger.kernel.org> # v4.4+ Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: Danny Shih <dannyshih@synology.com> Signed-off-by: ChangSyun Peng <allenpeng@synology.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add missed sock updates to compat path via a new helper, which will be
used more in coming patches. (The net/core/scm.c code is left as-is here
to assist with -stable backports for the compat path.)
Cc: Christoph Hellwig <hch@lst.de> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jakub Kicinski <kuba@kernel.org> Cc: stable@vger.kernel.org Fixes: 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly") Fixes: d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we don't have a hardware multicast filter available then instead of
silently failing to listen for the requested ethernet broadcast
addresses fall back to receiving all multicast packets, in a similar
fashion to other drivers with no multicast filter.
Cc: stable@vger.kernel.org Signed-off-by: Jonathan McDowell <noodles@earth.li> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The IPQ806x does not appear to have a functional multicast ethernet
address filter. This was observed as a failure to correctly receive IPv6
packets on a LAN to the all stations address. Checking the vendor driver
shows that it does not attempt to enable the multicast filter and
instead falls back to receiving all multicast packets, internally
setting ALLMULTI.
Use the new fallback support in the dwmac1000 driver to correctly
achieve the same with the mainline IPQ806x driver. Confirmed to fix IPv6
functionality on an RB3011 router.
Cc: stable@vger.kernel.org Signed-off-by: Jonathan McDowell <noodles@earth.li> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Recently random.h started including percpu.h (see commit f227e3ec3b5c ("random32: update the net random state on interrupt and
activity")), which broke corenet64_smp_defconfig:
In file included from /linux/arch/powerpc/include/asm/paca.h:18,
from /linux/arch/powerpc/include/asm/percpu.h:13,
from /linux/include/linux/random.h:14,
from /linux/lib/uuid.c:14:
/linux/arch/powerpc/include/asm/mmu.h:139:22: error: unknown type name 'next_tlbcam_idx'
139 | DECLARE_PER_CPU(int, next_tlbcam_idx);
This is due to a circular header dependency:
asm/mmu.h includes asm/percpu.h, which includes asm/paca.h, which
includes asm/mmu.h
Which means DECLARE_PER_CPU() isn't defined when mmu.h needs it.
We can fix it by moving the include of paca.h below the include of
asm-generic/percpu.h.
This moves the include of paca.h out of the #ifdef __powerpc64__, but
that is OK because paca.h is almost entirely inside #ifdef
CONFIG_PPC64 anyway.
It also moves the include of paca.h out of the #ifdef CONFIG_SMP,
which could possibly break something, but seems to have no ill
effects.
Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity") Cc: stable@vger.kernel.org # v5.8 Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200804130558.292328-1-mpe@ellerman.id.au Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There are 2 exit paths where the lock isn't held, but try to unlock the
mutex when exiting. In these places we should just return from the
function.
A neater approach would be to cleanup the ad5592r_read_raw(), but that
would make this patch more difficult to backport to stable versions.
Fixes 56ca9db862bf3: ("iio: dac: Add support for the AD5592R/AD5593R ADCs/DACs") Reported-by: Charles Stanhope <charles.stanhope@gmail.com> Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com> Cc: <Stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
While logging an inode, at copy_items(), if we fail to lookup the checksums
for an extent we release the destination path, free the ins_data array and
then return immediately. However a previous iteration of the for loop may
have added checksums to the ordered_sums list, in which case we leak the
memory used by them.
So fix this by making sure we iterate the ordered_sums list and free all
its checksums before returning.
Fixes: 3650860b90cc2a ("Btrfs: remove almost all of the BUG()'s from tree-log.c") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In try_to_merge_free_space we attempt to find entries to the left and
right of the entry we are adding to see if they can be merged. We
search for an entry past our current info (saved into right_info), and
then if right_info exists and it has a rb_prev() we save the rb_prev()
into left_info.
However there's a slight problem in the case that we have a right_info,
but no entry previous to that entry. At that point we will search for
an entry just before the info we're attempting to insert. This will
simply find right_info again, and assign it to left_info, making them
both the same pointer.
Now if right_info _can_ be merged with the range we're inserting, we'll
add it to the info and free right_info. However further down we'll
access left_info, which was right_info, and thus get a use-after-free.
Fix this by only searching for the left entry if we don't find a right
entry at all.
The CVE referenced had a specially crafted file system that could
trigger this use-after-free. However with the tree checker improvements
we no longer trigger the conditions for the UAF. But the original
conditions still apply, hence this fix.
Reference: CVE-2019-19448 Fixes: 963030817060 ("Btrfs: use hybrid extents+bitmap rb tree for free space") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[CAUSE]
The error is EMFILE (Too many files open) and comes from the anonymous
block device allocation. The ids are in a shared pool of size 1<<20.
The ids are assigned to live subvolumes, ie. the root structure exists
in memory (eg. after creation or after the root appears in some path).
The pool could be exhausted if the numbers are not reclaimed fast
enough, after subvolume deletion or if other system component uses the
anon block devices.
[WORKAROUND]
Since it's not possible to completely solve the problem, we can only
minimize the time the id is allocated to a subvolume root.
Firstly, we can reduce the use of anon_dev by trees that are not
subvolume roots, like data reloc tree.
This patch will do extra check on root objectid, to skip roots that
don't need anon_dev. Currently it's only data reloc tree and orphan
roots.
Reported-by: Greed Rong <greedrong@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+UqX+NTrZ6boGnWHhSeZmEY5J76CTqmYjO2S+=tHJX7nb9DPw@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If context is not NULL in acpiphp_grab_context(), but the
is_going_away flag is set for the device's parent, the reference
counter of the context needs to be decremented before returning
NULL or the context will never be freed, so make that happen.
Fixes: edf5bf34d408 ("ACPI / dock: Use callback pointers from devices' ACPI hotplug contexts") Reported-by: Vasily Averin <vvs@virtuozzo.com> Cc: 3.15+ <stable@vger.kernel.org> # 3.15+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When mounting with Kerberos, users have been confused about the
default error returned in scenarios in which either keyutils is
not installed or the user did not properly acquire a krb5 ticket.
Log a warning message in the case that "ENOKEY" is returned
from the get_spnego_key upcall so that users can better understand
why mount failed in those two cases.
CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
target_unpopulated is incremented with nr_pages at the start of the
function, but the call to free_xenballooned_pages will only subtract
pgno number of pages, and thus the rest need to be subtracted before
returning or else accounting will be skewed.
Since clang does not push pc and sp in function prologues, the current
implementation of unwind_frame does not work. By using the previous
frame's lr/fp instead of saved pc/sp we get valid unwinds on clang-built
kernels.
The bounds check on next frame pointer must be changed as well since
there are 8 less bytes between frames.
This fixes /proc/<pid>/stack.
Link: https://github.com/ClangBuiltLinux/linux/issues/912 Reported-by: Miles Chen <miles.chen@mediatek.com> Tested-by: Miles Chen <miles.chen@mediatek.com> Cc: stable@vger.kernel.org Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When using kexec the SBA IOMMU IBASE might still have the RE
bit set. This triggers a WARN_ON when trying to write back the
IBASE register later, and it also makes some mask calculations fail.
Further investigation of the L-R swap problem on the MS2109 reveals that
the problem isn't that the channels are swapped, but rather that they
are swapped and also out of phase by one sample. In other words, the
issue is actually that the very first frame that comes from the hardware
is a half-frame containing only the right channel, and after that
everything becomes offset.
So introduce a new quirk field to drop the very first 2 bytes that come
in after the format is configured and a capture stream starts. This puts
the channels in phase and in the correct order.
If the minix filesystem tries to map a very large logical block number to
its on-disk location, block_to_path() can return offsets that are too
large, causing out-of-bounds memory accesses when accessing indirect index
blocks. This should be prevented by the check against the maximum file
size, but this doesn't work because the maximum file size is read directly
from the on-disk superblock and isn't validated itself.
Fix this by validating the maximum file size at mount time.
If an inode has no links, we need to mark it bad rather than allowing it
to be accessed. This avoids WARNINGs in inc_nlink() and drop_nlink() when
doing directory operations on a fuzzed filesystem.
Patch series "fs/minix: fix syzbot bugs and set s_maxbytes".
This series fixes all syzbot bugs in the minix filesystem:
KASAN: null-ptr-deref Write in get_block
KASAN: use-after-free Write in get_block
KASAN: use-after-free Read in get_block
WARNING in inc_nlink
KMSAN: uninit-value in get_block
WARNING in drop_nlink
It also fixes the minix filesystem to set s_maxbytes correctly, so that
userspace sees the correct behavior when exceeding the max file size.
Running the crypto manager self tests with
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS may result in several types of errors
when using the ccp-crypto driver:
alg: skcipher: cbc-des3-ccp encryption failed on test vector 0; expected_error=0, actual_error=-5 ...
alg: skcipher: ctr-aes-ccp decryption overran dst buffer on test vector 0 ...
alg: ahash: sha224-ccp test failed (wrong result) on test vector ...
These errors are the result of improper processing of scatterlists mapped
for DMA.
Given a scatterlist in which entries are merged as part of mapping the
scatterlist for DMA, the DMA length of a merged entry will reflect the
combined length of the entries that were merged. The subsequent
scatterlist entry will contain DMA information for the scatterlist entry
after the last merged entry, but the non-DMA information will be that of
the first merged entry.
The ccp driver does not take this scatterlist merging into account. To
address this, add a second scatterlist pointer to track the current
position in the DMA mapped representation of the scatterlist. Both the DMA
representation and the original representation of the scatterlist must be
tracked as while most of the driver can use just the DMA representation,
scatterlist_map_and_copy() must use the original representation and
expects the scatterlist pointer to be accurate to the original
representation.
In order to properly walk the original scatterlist, the scatterlist must
be walked until the combined lengths of the entries seen is equal to the
DMA length of the current entry being processed in the DMA mapped
representation.
Fixes: 63b945091a070 ("crypto: ccp - CCP device driver and interface support") Signed-off-by: John Allen <john.allen@amd.com> Cc: stable@vger.kernel.org Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
qat_uclo.c:297:3: warning: Attempt to free released memory
[unix.Malloc]
kfree(*init_tab_base);
^~~~~~~~~~~~~~~~~~~~~
When input *init_tab_base is null, the function allocates memory for
the head of the list. When there is problem allocating other list
elements the list is unwound and freed. Then a check is made if the
list head was allocated and is also freed.
Keeping track of the what may need to be freed is the variable 'tail_old'.
The unwinding/freeing block is
There is another problem.
When the input *init_tab_base is non null the tail_old is calculated by
traveling down the list to first non null entry.
tail_old = init_header;
while (tail_old->next)
tail_old = tail_old->next;
When the unwinding free happens, the last entry of the input list will
be freed.
So the freeing needs a general changed.
If locally allocated the first element of tail_old is freed, else it
is skipped. As a bit of cleanup, reset *init_tab_base if it came in
as null.
Fixes: b4b7e67c917f ("crypto: qat - Intel(R) QAT ucode part of fw loader") Cc: <stable@vger.kernel.org> Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Adds an entry for Creative USB X-Fi to the rc_config array in
mixer_quirks.c to allow use of volume knob on the device.
Adds support for newer X-Fi Pro card, known as "Model No. SB1095"
with USB ID "041e:3263"
Assign the .throttle and .unthrottle functions to be generic function
in the driver structure to prevent data loss that can otherwise occur
if the host does not enable USB throttling.
CP210x hardware disables auto-RTS but leaves auto-CTS when in hardware
flow control mode and UART on cp210x hardware is disabled. When
re-opening the port, if auto-CTS is enabled on the cp210x, then auto-RTS
must be re-enabled in the driver.
Signed-off-by: Brant Merryman <brant.merryman@silabs.com> Co-developed-by: Phu Luu <phu.luu@silabs.com> Signed-off-by: Phu Luu <phu.luu@silabs.com> Link: https://lore.kernel.org/r/ECCF8E73-91F3-4080-BE17-1714BC8818FB@silabs.com
[ johan: fix up tags and problem description ] Fixes: 39a66b8d22a3 ("[PATCH] USB: CP2101 Add support for flow control") Cc: stable <stable@vger.kernel.org> # 2.6.12 Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We should fput() file iff FDPUT_FPUT is set. So we should set fput_needed
accordingly.
Fixes: 00e188ef6a7e ("sockfd_lookup_light(): switch to fdget^W^Waway from fget_light") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>