When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.
Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.
Here's the full report:
https://lkml.org/lkml/2014/9/19/230
To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
v3:
- Kees pointed out that no_new_privs should never be cleared, so we
shouldn't define task_clear_no_new_privs(). we define 3 macros instead
of a single one.
v2:
- updated scripts/tags.sh, suggested by Peter
Cc: Ingo Molnar <mingo@kernel.org> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
[lizf: Backported to 3.4:
- adjust context
- remove no_new_priv code
- add atomic_flags to struct task_struct]
[bwh: Backported to 3.2:
- Drop changes in scripts/tags.sh
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
While trying out [1][2], I noticed that tc monitor doesn't show the
correct handle on delete:
$ tc monitor
qdisc clsact ffff: dev eno1 parent ffff:fff1
filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...]
deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80
some context to explain the above:
The user identity of any tc filter is represented by a 32-bit
identifier encoded in tcm->tcm_handle. Example 0x2a in the bpf filter
above. A user wishing to delete, get or even modify a specific filter
uses this handle to reference it.
Every classifier is free to provide its own semantics for the 32 bit handle.
Example: classifiers like u32 use schemes like 800:1:801 to describe
the semantics of their filters represented as hash table, bucket and
node ids etc.
Classifiers also have internal per-filter representation which is different
from this externally visible identity. Most classifiers set this
internal representation to be a pointer address (which allows fast retrieval
of said filters in their implementations). This internal representation
is referenced with the "fh" variable in the kernel control code.
When a user successfuly deletes a specific filter, by specifying the correct
tcm->tcm_handle, an event is generated to user space which indicates
which specific filter was deleted.
Before this patch, the "fh" value was sent to user space as the identity.
As an example what is shown in the sample bpf filter delete event above
is 0xf3be0c80. This is infact a 32-bit truncation of 0xffff8807f3be0c80
which happens to be a 64-bit memory address of the internal filter
representation (address of the corresponding filter's struct cls_bpf_prog);
After this patch the appropriate user identifiable handle as encoded
in the originating request tcm->tcm_handle is generated in the event.
One of the cardinal rules of netlink rules is to be able to take an
event (such as a delete in this case) and reflect it back to the
kernel and successfully delete the filter. This patch achieves that.
Note, this issue has existed since the original TC action
infrastructure code patch back in 2004 as found in:
https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/
Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Remove usage of the '__attribute__((alias("...")))' hack that aliased
to integer arrays containing micro-assembled instructions. This hack
breaks when building a microMIPS kernel. It also makes the code much
easier to understand.
[ralf@linux-mips.org: Added back export of the clear_page and copy_page
symbols so certain modules will work again. Also fixed build with
CONFIG_SIBYTE_DMA_PAGEOPS enabled.]
Signed-off-by: Steven J. Hill <sjhill@mips.com> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/3866/ Acked-by: David Daney <david.daney@cavium.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[bwh: Backported to 3.2: adjust context] Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
net->ct.nf_conntrack_cachep is created with SLAB_DESTROY_BY_RCU.
The hash is protected by rcu, so readers look up conntracks without
locks.
A conntrack is removed from the hash, but in this moment a few readers
still can use the conntrack. Then this conntrack is released and another
thread creates conntrack with the same address and the equal tuple.
After this a reader starts to validate the conntrack:
* It's not dying, because a new conntrack was created
* nf_ct_tuple_equal() returns true.
But this conntrack is not initialized yet, so it can not be used by two
threads concurrently. In this case BUG_ON may be triggered from
nf_nat_setup_info().
Florian Westphal suggested to check the confirm bit too. I think it's
right.
task 1 task 2 task 3
nf_conntrack_find_get
____nf_conntrack_find
destroy_conntrack
hlist_nulls_del_rcu
nf_conntrack_free
kmem_cache_free
__nf_conntrack_alloc
kmem_cache_alloc
memset(&ct->tuplehash[IP_CT_DIR_MAX],
if (nf_ct_is_dying(ct))
if (!nf_ct_tuple_equal()
I'm not sure, that I have ever seen this race condition in a real life.
Currently we are investigating a bug, which is reproduced on a few nodes.
In our case one conntrack is initialized from a few tasks concurrently,
we don't have any other explanation for this.
l2tp_ip_backlog_recv may not return -1 if the packet gets dropped.
The return value is passed up to ip_local_deliver_finish, which treats
negative values as an IP protocol number for resubmission.
Signed-off-by: Paul Hüber <phueber@kernsp.in> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Validate the output buffer length for L2CAP config requests and responses
to avoid overflowing the stack buffer used for building the option blocks.
Signed-off-by: Ben Seri <ben@armis.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2:
- Drop changes to handling of L2CAP_CONF_EFS, L2CAP_CONF_EWS
- Drop changes to l2cap_do_create(), l2cap_security_cfm(), and L2CAP_CONF_PENDING
case in l2cap_config_rsp()
- In l2cap_config_rsp(), s/buf/req/
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
It's caused by skb_shared_info at the end of sk_buff was overwritten by
ISCSI_KEVENT_IF_ERROR when parsing nlmsg info from skb in iscsi_if_rx.
During the loop if skb->len == nlh->nlmsg_len and both are sizeof(*nlh),
ev = nlmsg_data(nlh) will acutally get skb_shinfo(SKB) instead and set a
new value to skb_shinfo(SKB)->nr_frags by ev->type.
This patch is to fix it by checking nlh->nlmsg_len properly there to
avoid over accessing sk_buff.
Reported-by: ChunYu Wang <chunwang@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If using a kernel with CONFIG_XFS_RT=y and we set the RHINHERIT flag on
a directory in a filesystem that does not have a realtime device and
create a new file in that directory, it gets marked as a real time file.
When data is written and a fsync is issued, the filesystem attempts to
flush a non-existent rt device during the fsync process.
This results in a crash dereferencing a null buftarg pointer in
xfs_blkdev_issue_flush():
Setting RT inode flags does not require special privileges so any
unprivileged user can cause this oops to occur. To reproduce, confirm
kernel is compiled with CONFIG_XFS_RT=y and run:
'clk' is copied to a userland with padding byte(s) after 'vclk_post_div'
field unitialized, leaking data from the stack. Fix this ensuring all of
'clk' is initialized to zero.
If L1 does not specify the "use TPR shadow" VM-execution control in
vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store
exiting" VM-execution controls in vmcs02. Failure to do so will give
the L2 VM unrestricted read/write access to the hardware CR8.
This fixes CVE-2017-12154.
Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
nl80211_set_rekey_data() does not check if the required attributes
NL80211_REKEY_DATA_{REPLAY_CTR,KEK,KCK} are present when processing
NL80211_CMD_SET_REKEY_OFFLOAD request. This request can be issued by
users with CAP_NET_ADMIN privilege and may result in NULL dereference
and a system crash. Add a check for the required attributes presence.
This patch is based on the patch by bo Zhang.
This fixes CVE-2017-12153.
References: https://bugzilla.redhat.com/show_bug.cgi?id=1491046 Fixes: e5497d766ad ("cfg80211/nl80211: support GTK rekey offload") Reported-by: bo Zhang <zhangbo5891001@gmail.com> Signed-off-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
drivers/media/pci/saa7164/saa7164-core.c:97:18: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-core.c:122:31: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-core.c:122:31: warning: incorrect type in initializer (different address spaces)
drivers/media/pci/saa7164/saa7164-core.c:122:31: expected unsigned char [noderef] [usertype] <asn:2>*bufcpu
drivers/media/pci/saa7164/saa7164-core.c:122:31: got unsigned char [usertype] *<noident>
drivers/media/pci/saa7164/saa7164-core.c:282:44: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-core.c:286:38: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-core.c:286:35: warning: incorrect type in assignment (different address spaces)
drivers/media/pci/saa7164/saa7164-core.c:286:35: expected unsigned char [noderef] [usertype] <asn:2>*p
drivers/media/pci/saa7164/saa7164-core.c:286:35: got unsigned char [usertype] *<noident>
drivers/media/pci/saa7164/saa7164-core.c:352:44: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-core.c:527:53: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-core.c:129:30: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:133:38: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:133:72: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:134:35: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:287:61: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:288:65: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:289:65: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:290:65: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:291:65: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:292:65: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:293:65: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-core.c:294:65: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-fw.c:548:52: warning: incorrect type in argument 5 (different address spaces)
drivers/media/pci/saa7164/saa7164-fw.c:548:52: expected unsigned char [usertype] *dst
drivers/media/pci/saa7164/saa7164-fw.c:548:52: got unsigned char [noderef] [usertype] <asn:2>*
drivers/media/pci/saa7164/saa7164-fw.c:579:44: warning: incorrect type in argument 5 (different address spaces)
drivers/media/pci/saa7164/saa7164-fw.c:579:44: expected unsigned char [usertype] *dst
drivers/media/pci/saa7164/saa7164-fw.c:579:44: got unsigned char [noderef] [usertype] <asn:2>*
drivers/media/pci/saa7164/saa7164-fw.c:597:44: warning: incorrect type in argument 5 (different address spaces)
drivers/media/pci/saa7164/saa7164-fw.c:597:44: expected unsigned char [usertype] *dst
drivers/media/pci/saa7164/saa7164-fw.c:597:44: got unsigned char [noderef] [usertype] <asn:2>*
drivers/media/pci/saa7164/saa7164-bus.c:36:36: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-bus.c:41:36: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-bus.c:151:19: warning: incorrect type in assignment (different base types)
drivers/media/pci/saa7164/saa7164-bus.c:151:19: expected unsigned short [unsigned] [usertype] size
drivers/media/pci/saa7164/saa7164-bus.c:151:19: got restricted __le16 [usertype] <noident>
drivers/media/pci/saa7164/saa7164-bus.c:152:22: warning: incorrect type in assignment (different base types)
drivers/media/pci/saa7164/saa7164-bus.c:152:22: expected unsigned int [unsigned] [usertype] command
drivers/media/pci/saa7164/saa7164-bus.c:152:22: got restricted __le32 [usertype] <noident>
drivers/media/pci/saa7164/saa7164-bus.c:153:30: warning: incorrect type in assignment (different base types)
drivers/media/pci/saa7164/saa7164-bus.c:153:30: expected unsigned short [unsigned] [usertype] controlselector
drivers/media/pci/saa7164/saa7164-bus.c:153:30: got restricted __le16 [usertype] <noident>
drivers/media/pci/saa7164/saa7164-bus.c:172:20: warning: cast to restricted __le32
drivers/media/pci/saa7164/saa7164-bus.c:173:20: warning: cast to restricted __le32
drivers/media/pci/saa7164/saa7164-bus.c:206:28: warning: cast to restricted __le32
drivers/media/pci/saa7164/saa7164-bus.c:287:9: warning: incorrect type in argument 1 (different base types)
drivers/media/pci/saa7164/saa7164-bus.c:287:9: expected unsigned int [unsigned] val
drivers/media/pci/saa7164/saa7164-bus.c:287:9: got restricted __le32 [usertype] <noident>
drivers/media/pci/saa7164/saa7164-bus.c:339:20: warning: cast to restricted __le32
drivers/media/pci/saa7164/saa7164-bus.c:340:20: warning: cast to restricted __le32
drivers/media/pci/saa7164/saa7164-bus.c:463:9: warning: incorrect type in argument 1 (different base types)
drivers/media/pci/saa7164/saa7164-bus.c:463:9: expected unsigned int [unsigned] val
drivers/media/pci/saa7164/saa7164-bus.c:463:9: got restricted __le32 [usertype] <noident>
drivers/media/pci/saa7164/saa7164-bus.c:466:21: warning: cast to restricted __le16
drivers/media/pci/saa7164/saa7164-bus.c:467:24: warning: cast to restricted __le32
drivers/media/pci/saa7164/saa7164-bus.c:468:32: warning: cast to restricted __le16
drivers/media/pci/saa7164/saa7164-buffer.c:122:18: warning: incorrect type in assignment (different address spaces)
drivers/media/pci/saa7164/saa7164-buffer.c:122:18: expected unsigned long long [noderef] [usertype] <asn:2>*cpu
drivers/media/pci/saa7164/saa7164-buffer.c:122:18: got void *
drivers/media/pci/saa7164/saa7164-buffer.c:127:21: warning: incorrect type in assignment (different address spaces)
drivers/media/pci/saa7164/saa7164-buffer.c:127:21: expected unsigned long long [noderef] [usertype] <asn:2>*pt_cpu
drivers/media/pci/saa7164/saa7164-buffer.c:127:21: got void *
drivers/media/pci/saa7164/saa7164-buffer.c:134:20: warning: cast removes address space of expression
drivers/media/pci/saa7164/saa7164-buffer.c:156:63: warning: incorrect type in argument 3 (different address spaces)
drivers/media/pci/saa7164/saa7164-buffer.c:156:63: expected void *vaddr
drivers/media/pci/saa7164/saa7164-buffer.c:156:63: got unsigned long long [noderef] [usertype] <asn:2>*cpu
drivers/media/pci/saa7164/saa7164-buffer.c:179:57: warning: incorrect type in argument 3 (different address spaces)
drivers/media/pci/saa7164/saa7164-buffer.c:179:57: expected void *vaddr
drivers/media/pci/saa7164/saa7164-buffer.c:179:57: got unsigned long long [noderef] [usertype] <asn:2>*cpu
drivers/media/pci/saa7164/saa7164-buffer.c:180:56: warning: incorrect type in argument 3 (different address spaces)
drivers/media/pci/saa7164/saa7164-buffer.c:180:56: expected void *vaddr
drivers/media/pci/saa7164/saa7164-buffer.c:180:56: got unsigned long long [noderef] [usertype] <asn:2>*pt_cpu
drivers/media/pci/saa7164/saa7164-buffer.c:84:17: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-buffer.c:147:31: warning: dereference of noderef expression
drivers/media/pci/saa7164/saa7164-buffer.c:148:17: warning: dereference of noderef expression
Most are caused by pointers marked as __iomem when they aren't or not marked as
__iomem when they should.
Also note that readl/writel already do endian conversion, so there is no need to
do it again.
saa7164_bus_set/get were a bit tricky: you have to make sure the msg endian
conversion is done at the right time, and that the code isn't using fields that
are still little endian instead of cpu-endianness.
The approach chosen is to convert just before writing to the ring buffer
and to convert it back right after reading from the ring buffer.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Cc: Steven Toth <stoth@kernellabs.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
[bwh: Backported to 3.2: adjust filenames, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The msg->command field is 32 bits, and we should fill it with a call
to cpu_to_le32(). The current code is broke on big endian systems.
On little endian systems it truncates the 32 bit value to 16 bits
which probably still works fine.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Steven Toth <stoth@kernellabs.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.
Prevent this by restoring the original mode bits if __btrfs_set_acl
fails.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of
__ext4_set_acl() into ext4_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.
Fixes: 073931017b49d9458aa351605b43a7e34598caef Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
[bwh: Backported to 3.2: the __ext4_set_acl() function didn't exist,
so added it] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When changing a file's acl mask, __ext4_set_acl() will first set the group
bits of i_mode to the value of the mask, and only then set the actual
extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the file
had no acl attribute to begin with, the system will from now on assume
that the mask permission bits are actual group permission bits, potentially
granting access to the wrong users.
Prevent this by only changing the inode mode after the acl has been set.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When changing a file's acl mask, reiserfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.
Prevent this by only changing the inode mode after the acl has been set.
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of
__reiserfs_set_acl() into reiserfs_set_acl(). That way the function will
not be called when inheriting ACLs which is what we want as it prevents
SGID bit clearing and the mode has been properly set by
posix_acl_create() anyway.
Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: reiserfs-devel@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2: the __reiserfs_set_acl() function didn't exist,
so added it] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Ben Hutchings [Sun, 8 Oct 2017 13:48:44 +0000 (14:48 +0100)]
ext3: preserve i_mode if ext2_set_acl() fails
Based on Ernesto A. Fernández's fix for ext2 (commit fe26569eb919), from
which the following description is taken:
> When changing a file's acl mask, ext2_set_acl() will first set the group
> bits of i_mode to the value of the mask, and only then set the actual
> extended attribute representing the new acl.
>
> If the second part fails (due to lack of space, for example) and the file
> had no acl attribute to begin with, the system will from now on assume
> that the mask permission bits are actual group permission bits, potentially
> granting access to the wrong users.
>
> Prevent this by only changing the inode mode after the acl has been set.
Cc: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Ben Hutchings [Fri, 6 Oct 2017 02:18:40 +0000 (03:18 +0100)]
ext3: Don't clear SGID when inheriting ACLs
Based on Jan Kara's fix for ext2 (commit a992f2d38e4c), from which the
following description is taken:
> When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
> set, DIR1 is expected to have SGID bit set (and owning group equal to
> the owning group of 'DIR0'). However when 'DIR0' also has some default
> ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
> 'DIR1' to get cleared if user is not member of the owning group.
>
> Fix the problem by creating __ext2_set_acl() function that does not call
> posix_acl_update_mode() and use it when inheriting ACLs. That prevents
> SGID bit clearing and the mode has been properly set by
> posix_acl_create() anyway.
Fixes: 073931017b49 ("posix_acl: Clear SGID bit when setting file permissions") Cc: linux-ext4@vger.kernel.org Cc: Jan Kara <jack@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When changing a file's acl mask, ext2_set_acl() will first set the group
bits of i_mode to the value of the mask, and only then set the actual
extended attribute representing the new acl.
If the second part fails (due to lack of space, for example) and the file
had no acl attribute to begin with, the system will from now on assume
that the mask permission bits are actual group permission bits, potentially
granting access to the wrong users.
Prevent this by only changing the inode mode after the acl has been set.
[JK: Rebased on top of "ext2: Don't clear SGID when inheriting ACLs"] Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by creating __ext2_set_acl() function that does not call
posix_acl_update_mode() and use it when inheriting ACLs. That prevents
SGID bit clearing and the mode has been properly set by
posix_acl_create() anyway.
Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: linux-ext4@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2:
- Keep using CURRENT_TIME_SEC
- Change parameter order of ext2_set_acl() to match upstream
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Actually, at the moment this is not an issue, as every in-tree arch does
the same manual checks for VERIFY_READ vs VERIFY_WRITE, relying on the MMU
to tell them apart, but this wasn't the case in the past and may happen
again on some odd arch in the future.
If anyone cares about 3.7 and earlier, this is a security hole (untested)
on real 80386 CPUs.
Signed-off-by: Adam Borowski <kilobyte@angband.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Jörn Engel noticed that the expand_upwards() function might not return
-ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and
if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE.
Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
which all define TASK_SIZE as 0xffffffff, but since none of those have
an upwards-growing stack we currently have no actual issue.
Nevertheless let's fix this just in case any of the architectures with
an upward-growing stack (currently parisc, metag and partly ia64) define
TASK_SIZE similar.
Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit") Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Jörn Engel <joern@purestorage.com> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When UBIFS prepares data structures which will be written to the MTD it
ensues that their lengths are multiple of 8. Since it uses kmalloc() the
padded bytes are left uninitialized and we leak a few bytes of kernel
memory to the MTD.
To make sure that all bytes are initialized, let's switch to kzalloc().
Kzalloc() is fine in this case because the buffers are not huge and in
the IO path the performance bottleneck is anyway the MTD.
Fixes: 1e51764a3c2a ("UBIFS: add new flash file system") Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com> Signed-off-by: Richard Weinberger <richard@nod.at>
[bwh: Backported to 3.2:
- Drop change in ubifs_jnl_xrename()
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
UBIFS handles extended attributes just like files, as consequence of
that, they also have inodes.
Therefore UBIFS does all the inode machinery also for xattrs. Since new
inodes have i_nlink of 1, a file or xattr inode will be evicted
if i_nlink goes down to 0 after an unlink. UBIFS assumes this model also
for xattrs, which is not correct.
One can create a file "foo" with xattr "user.test". By reading
"user.test" an inode will be created, and by deleting "user.test" it
will get evicted later. The assumption breaks if the file "foo", which
hosts the xattrs, will be removed. VFS nor UBIFS does not remove each
xattr via ubifs_xattr_remove(), it just removes the host inode from
the TNC and all underlying xattr nodes too and the inode will remain
in the cache and wastes memory.
To solve this problem, remove xattr inodes from the VFS inode cache in
ubifs_xattr_remove() to make sure that they get evicted.
Fixes: 1e51764a3c2ac05a ("UBIFS: add new flash file system") Signed-off-by: Richard Weinberger <richard@nod.at>
[bwh: Backported to 3.2:
- xattr support is optional, so add an #ifdef around the call
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The driver checks port->exists twice in i8042_interrupt(), first when
trying to assign temporary "serio" variable, and second time when deciding
whether it should call serio_interrupt(). The value of port->exists may
change between the 2 checks, and we may end up calling serio_interrupt()
with a NULL pointer:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
IP: [<ffffffff8150feaf>] _spin_lock_irqsave+0x1f/0x40
PGD 0
Oops: 0002 [#1] SMP
last sysfs file:
CPU 0
Modules linked in:
To avoid the issue let's change the second check to test whether serio is
NULL or not.
Also, let's take i8042_lock in i8042_start() and i8042_stop() instead of
trying to be overly smart and using memory barriers.
Signed-off-by: Chen Hong <chenhong3@huawei.com>
[dtor: take lock in i8042_start()/i8042_stop()] Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
From POWER4 onwards, mfocrf() only places the specified CR field into
the destination GPR, and the rest of it is set to 0. The PowerPC AS
from version 3.0 now requires this behaviour.
The emulation code currently puts the entire CR into the destination GPR.
Fix it.
Fixes: 6888199f7fe5 ("[POWERPC] Emulate more instructions in software") Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The workaround for the CELL timebase bug does not correctly mark cr0 as
being clobbered. This means GCC doesn't know that the asm block changes cr0 and
might leave the result of an unrelated comparison in cr0 across the block, which
we then trash, leading to basically random behaviour.
Fixes: 859deea949c3 ("[POWERPC] Cell timebase bug workaround") Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Tweak change log and flag for stable] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer
duration if there are more number of items in the lru list. As per the
current code, it can hold the spin lock for upto maximum UINT_MAX
entries at a time. So if there are more number of items in the lru
list, then "BUG: spinlock lockup suspected" is observed in the below
path:
Fix this lockup by reducing the number of entries to be shrinked from
the lru list to 1024 at once. Also, add cond_resched() before
processing the lru list again.
Link: http://marc.info/?t=149722864900001&r=1&w=2 Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: we don't hold the spin-lock for long, but adding
cond_resched() looks like a good idea] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
introduced a regression in some rust and Java environments which are
trying to implement their own stack guard page. They are punching a new
MAP_FIXED mapping inside the existing stack Vma.
This will confuse expand_{downwards,upwards} into thinking that the
stack expansion would in fact get us too close to an existing non-stack
vma which is a correct behavior wrt safety. It is a real regression on
the other hand.
Let's work around the problem by considering PROT_NONE mapping as a part
of the stack. This is a gros hack but overflowing to such a mapping
would trap anyway an we only can hope that usespace knows what it is
doing and handle it propely.
Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas") Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Willy Tarreau <w@1wt.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
validate_scan_freqs() retrieves frequencies from attributes
nested in the attribute NL80211_ATTR_SCAN_FREQUENCIES with
nla_get_u32(), which reads 4 bytes from each attribute
without validating the size of data received. Attributes
nested in NL80211_ATTR_SCAN_FREQUENCIES don't have an nla policy.
Validate size of each attribute before parsing to avoid potential buffer
overread.
Fixes: 2a519311926 ("cfg80211/nl80211: scanning (and mac80211 update to use it)") Signed-off-by: Srinivas Dasari <dasaris@qti.qualcomm.com> Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
nla policy checks for only maximum length of the attribute data
when the attribute type is NLA_BINARY. If userspace sends less
data than specified, the wireless drivers may access illegal
memory. When type is NLA_UNSPEC, nla policy check ensures that
userspace sends minimum specified length number of bytes.
Remove type assignment to NLA_BINARY from nla_policy of
NL80211_ATTR_PMKID to make this NLA_UNSPEC and to make sure minimum
WLAN_PMKID_LEN bytes are received from userspace with
NL80211_ATTR_PMKID.
While cleaning up sysfs callback that prints EK we discovered a kernel
memory leak. This commit fixes the issue by zeroing the buffer used for
TPM command/response.
The leak happen when we use either tpm_vtpm_proxy, tpm_ibmvtpm or
xen-tpmfront.
Fixes: 0883743825e3 ("TPM: sysfs functions consolidation") Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Tested-by: Stefan Berger <stefanb@linux.vnet.ibm.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
[bwh: Backported to 3.2: adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The crash happens in syscall_get_arguments function for
syscalls with zero arguments, that will try to access
first argument (args[0]) in event entry, but it's not
allocated.
Bail out of there are no arguments.
Reported-by: Zorro Lang <zlang@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The ib_uverbs_create_ah() ind ib_uverbs_modify_qp() calls receive
the port number from user input as part of its attributes and assumes
it is valid. Down on the stack, that parameter is used to access kernel
data structures. If the value is invalid, the kernel accesses memory
it should not. To prevent this, verify the port number before using it.
BUG: KASAN: use-after-free in ib_uverbs_create_ah+0x6d5/0x7b0
Read of size 4 at addr ffff880018d67ab8 by task syz-executor/313
BUG: KASAN: slab-out-of-bounds in modify_qp.isra.4+0x19d0/0x1ef0
Read of size 4 at addr ffff88006c40ec58 by task syz-executor/819
Fixes: 67cdb40ca444 ("[IB] uverbs: Implement more commands") Fixes: 189aba99e70 ("IB/uverbs: Extend modify_qp and support packet pacing") Cc: <security@kernel.org> Cc: Yevgeny Kliteynik <kliteyn@mellanox.com> Cc: Tziporet Koren <tziporet@mellanox.com> Cc: Alex Polak <alexpo@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
[bwh: Backported to 3.2:
- In modify_qp(), command structure is cmd not cmd->base
- In modify_qp(), add release_qp label
- In ib_uverbs_create_ah(), add definition of ib_dev
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Previously start_port and end_port were defined in 2 places, cache.c and
device.c and this prevented their use in other modules.
Make these common functions, change the name to reflect the rdma
name space, and update existing users.
Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
[bwh: Backported to 3.2: drop one inapplicable change] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
We have pretty clear evidence that MSIs are getting lost on g4x and
somehow the interrupt logic doesn't seem to recover from that state
even if we try hard to clear the IIR.
Disabling IER around the normal IIR clearing in the irq handler isn't
sufficient to avoid this, so the problem really seems to be further
up the interrupt chain. This should guarantee that there's always
an edge if any IIR bits are set after the interrupt handler is done,
which should normally guarantee that the CPU interrupt is generated.
That approach seems to work perfectly on VLV/CHV, but apparently
not on g4x.
MSI is documented to be broken on 965gm at least. The chipset spec
says MSI is defeatured because interrupts can be delayed or lost,
which fits well with what we're seeing on g4x. Previously we've
already disabled GMBUS interrupts on g4x because somehow GMBUS
manages to raise legacy interrupts even when MSI is enabled.
Since there's such widespread MSI breakahge all over in the pre-gen5
land let's just give up on MSI on these platforms.
Seqno reporting might be negatively affected by this since the legcy
interrupts aren't guaranteed to be ordered with the seqno writes,
whereas MSI interrupts may be? But an occasioanlly missed seqno
seems like a small price to pay for generally working interrupts.
Cc: Diego Viola <diego.viola@gmail.com> Tested-by: Diego Viola <diego.viola@gmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101261 Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170626203051.28480-1-ville.syrjala@linux.intel.com Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
(cherry picked from commit e38c2da01f76cca82b59ca612529b81df82a7cc7) Signed-off-by: Jani Nikula <jani.nikula@intel.com>
[bwh: Backported to 3.2:
- Open-code INTEL_GEN()
- Adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Currently, when the link for $DEV is down, this command succeeds but the
address is removed immediately by DAD (1):
ip addr add 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
In the same situation, this will succeed and not remove the address (2):
ip addr add 1111::12/64 dev $DEV
ip addr change 1111::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
The comment in addrconf_dad_begin() when !IF_READY makes it look like
this is the intended behavior, but doesn't explain why:
* If the device is not ready:
* - keep it tentative if it is a permanent address.
* - otherwise, kill it.
We clearly cannot prevent userspace from doing (2), but we can make (1)
work consistently with (2).
addrconf_dad_stop() is only called in two cases: if DAD failed, or to
skip DAD when the link is down. In that second case, the fix is to avoid
deleting the address, like we already do for permanent addresses.
Fixes: 3c21edbd1137 ("[IPV6]: Defer IPv6 device initialization until the link becomes ready.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The userspace needs to know why is the address being removed so that it can
perhaps obtain a new address.
Without the DADFAILED flag it's impossible to distinguish removal of a
temporary and tentative address due to DAD failure from other reasons (device
removed, manual address removal).
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The enclosure_add_device() function should fail if it can't create the
relevant sysfs links.
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Tested-by: Douglas Miller <dougmill@linux.vnet.ibm.com> Acked-by: James Bottomley <jejb@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Currently we saw a lot of "No irq handler" errors during hibernation, which
caused the system hang finally:
ata4.00: qc timeout (cmd 0xec)
ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata4.00: revalidation failed (errno=-5)
ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
do_IRQ: 31.151 No irq handler for vector
According to above logs, there is an interrupt triggered and it is
dispatched to CPU31 with a vector number 151, but there is no handler for
it, thus this IRQ will not get acked and will cause an IRQ flood which
kills the system. To be more specific, the 31.151 is an interrupt from the
AHCI host controller.
After some investigation, the reason why this issue is triggered is because
the thaw_noirq() function does not restore the MSI/MSI-X settings across
hibernation.
The scenario is illustrated below:
1. Before hibernation, IRQ 34 is the handler for the AHCI device, which
is bound to CPU31.
2. Hibernation starts, the AHCI device is put into low power state.
3. All the nonboot CPUs are put offline, so IRQ 34 has to be migrated to
the last alive one - CPU0.
4. After the snapshot has been created, all the nonboot CPUs are brought
up again; IRQ 34 remains bound to CPU0.
5. AHCI devices are put into D0.
6. The snapshot is written to the disk.
The issue is triggered in step 6. The AHCI interrupt should be delivered
to CPU0, however it is delivered to the original CPU31 instead, which
causes the "No irq handler" issue.
Ying Huang has provided a clue that, in step 3 it is possible that writing
to the register might not take effect as the PCI devices have been
suspended.
In step 3, the IRQ 34 affinity should be modified from CPU31 to CPU0, but
in fact it is not. In __pci_write_msi_msg(), if the device is already in
low power state, the low level MSI message entry will not be updated but
cached. During the device restore process after a normal suspend/resume,
pci_restore_msi_state() writes the cached MSI back to the hardware.
But this is not the case for hibernation. pci_restore_msi_state() is not
currently called in pci_pm_thaw_noirq(), although pci_save_state() has
saved the necessary PCI cached information in pci_pm_freeze_noirq().
Restore the PCI status for the device during hibernation. Otherwise the
status might be lost across hibernation (for example, settings for MSI,
MSI-X, ATS, ACS, IOV, etc.), which might cause problems during hibernation.
Suggested-by: Ying Huang <ying.huang@intel.com> Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
[bhelgaas: changelog] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Ying Huang <ying.huang@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
set, DIR1 is expected to have SGID bit set (and owning group equal to
the owning group of 'DIR0'). However when 'DIR0' also has some default
ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
'DIR1' to get cleared if user is not member of the owning group.
Fix the problem by moving posix_acl_update_mode() out of
__btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
called when inheriting ACLs which is what we want as it prevents SGID
bit clearing and the mode has been properly set by posix_acl_create()
anyway.
Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: linux-btrfs@vger.kernel.org CC: David Sterba <dsterba@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: David Sterba <dsterba@suse.com>
[bwh: Backported to 3.2: move the call to posix_acl_update_mode() into
btrfs_xattr_acl_set()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The USB core and sysfs will attempt to enumerate certain parameters
which are unsupported by the au0828 - causing inconsistent behavior
and sometimes causing the chip to reset. Avoid making these calls.
This problem manifested as intermittent cases where the au8522 would
be reset on analog video startup, in particular when starting up ALSA
audio streaming in parallel - the sysfs entries created by
snd-usb-audio on streaming startup would result in unsupported control
messages being sent during tuning which would put the chip into an
unknown state.
Fix commit e50c0a8fa60d ("Support the MIPS32 / MIPS64 DSP ASE.") and
send SIGILL rather than SIGBUS whenever an unimplemented BPOSGE32 DSP
ASE instruction has been encountered in `__compute_return_epc_for_insn'
as our Reserved Instruction exception handler would in response to an
attempt to actually execute the instruction. Sending SIGBUS only makes
sense for the unaligned PC case, since moved to `__compute_return_epc'.
Adjust function documentation accordingly, correct formatting and use
`pr_info' rather than `printk' as the other exit path already does.
Fixes: e50c0a8fa60d ("Support the MIPS32 / MIPS64 DSP ASE.") Signed-off-by: Maciej W. Rozycki <macro@imgtec.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16396/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[bwh: Backported to 3.2:
- Drop the comment change
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
pm_genpd_remove_subdomain() iterates over domain's master_links list and
removes matching element thus it has to use safe version of list
iteration.
Fixes: f721889ff65a ("PM / Domains: Support for generic I/O PM domains (v8)") Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Neither soft poweroff (transition to ACPI power state S5) nor
suspend-to-RAM (transition to state S3) works on the Macbook Pro 11,4 and
11,5.
The problem is related to the [mem 0x7fa00000-0x7fbfffff] space. When we
use that space, e.g., by assigning it to the 00:1c.0 Root Port, the ACPI
Power Management 1 Control Register (PM1_CNT) at [io 0x1804] doesn't work
anymore.
Linux does a soft poweroff (transition to S5) by writing to PM1_CNT. The
theory about why this doesn't work is:
- The write to PM1_CNT causes an SMI
- The BIOS SMI handler depends on something in
[mem 0x7fa00000-0x7fbfffff]
- When Linux assigns [mem 0x7fa00000-0x7fbfffff] to the 00:1c.0 Port, it
covers up whatever the SMI handler uses, so the SMI handler no longer
works correctly
Reserve the [mem 0x7fa00000-0x7fbfffff] space so we don't assign it to
anything.
This is voodoo programming, since we don't know what the real conflict is,
but we've failed to find the root cause.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=103211 Tested-by: thejoe@gmail.com Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Lukas Wunner <lukas@wunner.de> Cc: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The Haswell Power Control Unit has a non-PCI register (CONFIG_TDP_NOMINAL)
where BAR 0 is supposed to be. This is erratum HSE43 in the spec update
referenced below:
The PCIe* Base Specification indicates that Configuration Space Headers
have a base address register at offset 0x10. Due to this erratum, the
Power Control Unit's CONFIG_TDP_NOMINAL CSR (Bus 1; Device 30; Function
3; Offset 0x10) is located where a base register is expected.
Mark the PCU as having non-compliant BARs so we don't try to probe any of
them. There are no other BARs on this device.
The inline asm retry check in the MIPS_ATOMIC_SET operation of the
sysmips system call has been backwards since commit f1e39a4a616c ("MIPS:
Rewrite sysmips(MIPS_ATOMIC_SET, ...) in C with inline assembler")
merged in v2.6.32, resulting in the non R10000_LLSC_WAR case retrying
until the operation was inatomic, before returning the new value that
was probably just written multiple times instead of the old value.
Invert the branch condition to fix that particular issue.
Fixes: f1e39a4a616c ("MIPS: Rewrite sysmips(MIPS_ATOMIC_SET, ...) in C with inline assembler") Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16148/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Make sure to drop the reference to the dma device taken by
of_find_device_by_node() on probe errors and on driver unbind.
Fixes: 334ae614772b ("sparc: Kill SBUS DVMA layer.") Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If bnx2i_map_ep_dbell_regs() then we accidentally return NULL instead of
an error pointer. It results in a NULL dereference in
iscsi_if_ep_connect().
Fixes: cf4e6363859d ("[SCSI] bnx2i: Add bnx2i iSCSI driver.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Verify that the caller-provided sockaddr structure is large enough to
contain the sa_family field, before accessing it in bind() and connect()
handlers of the AF_IUCV socket. Since neither syscall enforces a minimum
size of the corresponding memory region, very short sockaddrs (zero or
one byte long) result in operating on uninitialized memory while
referencing .sa_family.
Fixes: 52a82e23b9f2 ("af_iucv: Validate socket address length in iucv_sock_bind()") Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com>
[jwi: removed unneeded null-check for addr] Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
For AMD Promontory xHCI host, although you can disable USB 2.0 ports in
BIOS settings, those ports will be enabled anyway after you remove a
device on that port and re-plug it in again. It's a known limitation of
the chip. As a workaround we can clear the PORT_WAKE_BITS.
This will disable wake on connect, disconnect and overcurrent on
AMD Promontory USB2 ports
[checkpatch cleanup and commit message reword -Mathias] Cc: Tsai Nicholas <nicholas.tsai@amd.com> Signed-off-by: Jiahau Chang <Lars_Chang@asmedia.com.tw> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
udf_setsize() called truncate_setsize() with i_data_sem held. Thus
truncate_pagecache() called from truncate_setsize() could lock a page
under i_data_sem which can deadlock as page lock ranks below
i_data_sem - e. g. writeback can hold page lock and try to acquire
i_data_sem to map a block.
Fix the problem by moving truncate_setsize() calls from under
i_data_sem. It is safe for us to change i_size without holding
i_data_sem as all the places that depend on i_size being stable already
hold inode_lock.
Fixes: 7e49b6f2480cb9a9e7322a91592e56a5c85361f5 Signed-off-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
__udf_adinicb_readpage() uses i_size several times. When truncate
changes i_size while the function is running, it can observe several
different values and thus e.g. expose uninitialized parts of page to
userspace. Also use i_size_read() in the function since it does not hold
inode_lock. Since i_size is guaranteed to be small, this cannot really
cause any issues even on 32-bit archs but let's be careful.
Fixes: 9c2fc0de1a6e638fe58c354a463f544f42a90a09 Signed-off-by: Jan Kara <jack@suse.cz>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The function flush_signals clears all pending signals for the process. It
may be used by kernel threads when we need to prepare a kernel thread for
responding to signals. However using this function for an userspaces
processes is incorrect - clearing signals without the program expecting it
can cause misbehavior.
The raid1 and raid5 code uses flush_signals in its request routine because
it wants to prepare for an interruptible wait. This patch drops
flush_signals and uses sigprocmask instead to block all signals (including
SIGKILL) around the schedule() call. The signals are not lost, but the
schedule() call won't respond to them.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
PCI_STD_RESOURCE_END is (confusingly) the index of the last valid BAR, not
the *number* of BARs. To iterate through all possible BARs, we need to
include PCI_STD_RESOURCE_END.
Fixes: 9fe373f9997b ("PCI: Increase IBM ipr SAS Crocodile BARs to at least system page size") Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The usbip stack dynamically allocates the transfer_buffer and
setup_packet of each urb that got generated by the tcp to usb stub code.
As these pointers are always used only once we will set them to NULL
after use. This is done likewise to the free_urb code in vudc_dev.c.
This patch fixes double kfree situations where the usbip remote side
added the URB_FREE_BUFFER.
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Acked-by: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2: adjust filenames] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
- if (attr->inherit && (attr->sample_type & PERF_SAMPLE_GROUP))
+ if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
is a clear fail :/
While this changes user visible behaviour; it was previously possible
to create an inherited event with PERF_SAMPLE_READ; this is deemed
acceptible because its results were always incorrect.
Reported-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vince@deater.net> Fixes: 3dab77fb1bf8 ("perf: Rework/fix the whole read vs group stuff") Link: http://lkml.kernel.org/r/20170530094512.dy2nljns2uq7qa3j@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Fix urb and transfer-buffer leaks in an urb-submission error path which
may be hit when a device is disconnected.
Fixes: 66e89522aff7 ("V4L/DVB: IR: add mceusb IR receiver driver") Cc: Jarod Wilson <jarod@redhat.com> Signed-off-by: Johan Hovold <johan@kernel.org> Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
[bwh: Backported to 3.2:
- Add check on urb_type, as async_buf and async_urb aren't always allocated
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
In the stable linux-3.16 branch, I ran into a warning in the
wlcore driver:
drivers/net/wireless/ti/wlcore/spi.c: In function 'wl12xx_spi_raw_write':
drivers/net/wireless/ti/wlcore/spi.c:315:1: error: the frame size of 12848 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
Newer kernels no longer show the warning, but the bug is still there,
as the allocation is based on the CPU page size rather than the
actual capabilities of the hardware.
This replaces the PAGE_SIZE macro with the SZ_4K macro, i.e. 4096 bytes
per buffer.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[bwh: Backported to 3.2:
- Open-code SZ_4K as it is only defined on some architectures here(!)
- Adjust filename, context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
If we fail to add an interface in mwifiex_add_virtual_intf(), we might
hit a BUG_ON() in the networking code, because we didn't tear things
down properly. Among the problems:
(a) when failing to allocate workqueues, we fail to unregister the
netdev before calling free_netdev()
(b) even if we do try to unregister the netdev, we're still holding the
rtnl lock, so the device never properly unregistered; we'll be at
state NETREG_UNREGISTERING, and then hit free_netdev()'s:
BUG_ON(dev->reg_state != NETREG_UNREGISTERED);
(c) we're allocating some dependent resources (e.g., DFS workqueues)
after we've registered the interface; this may or may not cause
problems, but it's good practice to allocate these before registering
(d) we're not even trying to unwind anything when mwifiex_send_cmd() or
mwifiex_sta_init_cmd() fail
To fix these issues, let's:
* add a stacked set of error handling labels, to keep error handling
consistent and properly ordered (resolving (a) and (d))
* move the workqueue allocations before the registration (to resolve
(c); also resolves (b) by avoiding error cases where we have to
unregister)
[Incidentally, it's pretty easy to interrupt the alloc_workqueue() in,
e.g., the following:
iw phy phy0 interface add mlan0 type station
by sending it SIGTERM.]
This bugfix covers commits like commit 7d652034d1a0 ("mwifiex: channel
switch support for mwifiex"), but parts of this bug exist all the way
back to the introduction of dynamic interface handling in commit 93a1df48d224 ("mwifiex: add cfg80211 handlers add/del_virtual_intf").
Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
[bwh: Backported to 3.2:
- There is no workqueue allocation or cleanup needed here
- Add 'ret' variable
- Keep logging errors with wiphy_err()
- Adjust filename] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.
The implementation is slightly modified to reduce arguments.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Lauro Ramos Venancio <lvenanci@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: lwang@redhat.com Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.2: there's no old version of the function to delete] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The 88m1101 has an errata when configuring autoneg. However, it was
being applied to many other Marvell PHYs as well. Limit its scope to
just the 88m1101.
Fixes: 76884679c644 ("phylib: Add support for Marvell 88e1111S and 88e1145") Reported-by: Daniel Walker <danielwa@cisco.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Harini Katakam <harinik@xilinx.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The 'move_paghes()' system call was introduced long long ago with the
same permission checks as for sending a signal (except using
CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
That turns out to not be a great choice - while the system call really
only moves physical page allocations around (and you need other
capabilities to do a lot of it), you can check the return value to map
out some the virtual address choices and defeat ASLR of a binary that
still shares your uid.
So change the access checks to the more common 'ptrace_may_access()'
model instead.
This tightens the access checks for the uid, and also effectively
changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
anybody really _uses_ this legacy system call any more (we hav ebetter
NUMA placement models these days), so I expect nobody to notice.
Famous last words.
Reported-by: Otto Ebeling <otto.ebeling@iki.fi> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
added an odd construct where 'mm' is checked for being NULL, and if it is,
it would get dereferenced anyways by mput()ing it.
Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
added an odd construct where 'mm' is checked for being NULL, and if it is,
it would get dereferenced anyways by mput()ing it.
This would lead to the following NULL ptr deref and BUG() when calling
migrate_pages() with a pid that has no mm struct:
Migration functions perform the rcu_read_unlock too early. As a result
the task pointed to may change from under us. This can result in an oops,
as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.
The following patch extend the period of the rcu_read_lock until after the
permissions checks are done. We also take a refcount so that the task
reference is stable when calling security check functions and performing
cpuset node validation (which takes a mutex).
The refcount is dropped before actual page migration occurs so there is no
change to the refcounts held during page migration.
Also move the determination of the mm of the task struct to immediately
before the do_migrate*() calls so that it is clear that we switch from
handling the task during permission checks to the mm for the actual
migration. Since the determination is only done once and we then no
longer use the task_struct we can be sure that we operate on a specific
address space that will not change from under us.
[akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Christoph Lameter <cl@linux.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Reported-by: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.
To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.
The problem was that when a privileged task had temporarily dropped its
privileges, e.g. by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.
In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers
should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted
directories that contain files with lax permissions, e.g. in
this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).
[akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.2:
- Drop changes to kcmp, procfs map_files, procfs has_pid_permissions()
- Keep using uid_t, gid_t and == operator for IDs
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The current test for bio vec merging is not fully accurate and can be
tricked into merging bios when certain grant combinations are used.
The result of these malicious bio merges is a bio that extends past
the memory page used by any of the originating bios.
Take into account the following scenario, where a guest creates two
grant references that point to the same mfn, ie: grant 1 -> mfn A,
grant 2 -> mfn A.
These references are then used in a PV block request, and mapped by
the backend domain, thus obtaining two different pfns that point to
the same mfn, pfn B -> mfn A, pfn C -> mfn A.
If those grants happen to be used in two consecutive sectors of a disk
IO operation becoming two different bios in the backend domain, the
checks in xen_biovec_phys_mergeable will succeed, because bfn1 == bfn2
(they both point to the same mfn). However due to the bio merging,
the backend domain will end up with a bio that expands past mfn A into
mfn A + 1.
Fix this by making sure the check in xen_biovec_phys_mergeable takes
into account the offset and the length of the bio, this basically
replicates whats done in __BIOVEC_PHYS_MERGEABLE using mfns (bus
addresses). While there also remove the usage of
__BIOVEC_PHYS_MERGEABLE, since that's already checked by the callers
of xen_biovec_phys_mergeable.
Reported-by: "Jan H. Schönherr" <jschoenh@amazon.de> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[bwh: Backported to 3.2:
- s/bfn/mfn/g
- Adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The 'dir' parameter in xfrm_migrate() is a user-controlled byte which is used
as an array index. This can lead to an out-of-bound access, kernel lockup and
DoS. Add a check for the 'dir' value.
This fixes CVE-2017-11600.
References: https://bugzilla.redhat.com/show_bug.cgi?id=1474928 Fixes: 80c9abaabf42 ("[XFRM]: Extension for dynamic update of endpoint address(es)") Reported-by: "bo Zhang" <zhangbo5891001@gmail.com> Signed-off-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When tcp_disconnect() is called, inet_csk_delack_init() sets
icsk->icsk_ack.rcv_mss to 0.
This could potentially cause tcp_recvmsg() => tcp_cleanup_rbuf() =>
__tcp_select_window() call path to have division by 0 issue.
So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0.
Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Always try to parse an address, since kstrtoul() will safely fail when
given a symbol as input. If that fails (which will be the case for a
symbol), try to parse a symbol instead.
This allows creating a probe such as:
p:probe/vlan_gro_receive 8021q:vlan_gro_receive+0
Which is necessary for this command to work:
perf probe -m 8021q -a vlan_gro_receive
Link: http://lkml.kernel.org/r/fd72d666f45b114e2c5b9cf7e27b91de1ec966f1.1498122881.git.sd@queasysnail.net Fixes: 413d37d1e ("tracing: Add kprobe-based event tracer") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[bwh: Backported to 3.2: preserve the check that an addresses isn't used for
a kretprobe] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When the scheduler sets TIF_NEED_RESCHED & we call into the scheduler
from arch/mips/kernel/entry.S we disable interrupts. This is true
regardless of whether we reach work_resched from syscall_exit_work,
resume_userspace or by looping after calling schedule(). Although we
disable interrupts in these paths we don't call trace_hardirqs_off()
before calling into C code which may acquire locks, and we therefore
leave lockdep with an inconsistent view of whether interrupts are
disabled or not when CONFIG_PROVE_LOCKING & CONFIG_DEBUG_LOCKDEP are
both enabled.
Without tracing this interrupt state lockdep will print warnings such
as the following once a task returns from a syscall via
syscall_exit_partial with TIF_NEED_RESCHED set:
Fix this by using the TRACE_IRQS_OFF macro to call trace_hardirqs_off()
when CONFIG_TRACE_IRQFLAGS is enabled. This is done before invoking
schedule() following the work_resched label because:
1) Interrupts are disabled regardless of the path we take to reach
work_resched() & schedule().
2) Performing the tracing here avoids the need to do it in paths which
disable interrupts but don't call out to C code before hitting a
path which uses the RESTORE_SOME macro that will call
trace_hardirqs_on() or trace_hardirqs_off() as appropriate.
We call trace_hardirqs_on() using the TRACE_IRQS_ON macro before calling
syscall_trace_leave() for similar reasons, ensuring that lockdep has a
consistent view of state after we re-enable interrupts.
Signed-off-by: Paul Burton <paul.burton@imgtec.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15385/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Similar to the fix provided by Dominik Heidler in commit 9b3dc0a17d73 ("l2tp: cast l2tp traffic counter to unsigned")
we need to take care of 32bit kernels in dev_get_stats().
When using atomic_long_read(), we add a 'long' to u64 and
might misinterpret high order bit, unless we cast to unsigned.
Fixes: caf586e5f23ce ("net: add a core netdev->rx_dropped counter") Fixes: 015f0688f57ca ("net: net: add a core netdev->tx_dropped counter") Fixes: 6e7333d315a76 ("net: add rx_nohandler stat counter") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: only rx_dropped is updated here] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
When using get_options() it's possible to specify a range of numbers,
like 1-100500. The problem is that it doesn't track array size while
calling internally to get_range() which iterates over the range and
fills the memory with numbers.
Link: http://lkml.kernel.org/r/2613C75C-B04D-4BFF-82A6-12F97BA0F620@gmail.com Signed-off-by: Ilya V. Matveychikov <matvejchikov@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Emergency stacks have their thread_info mostly uninitialised, which in
particular means garbage preempt_count values.
Emergency stack code runs with interrupts disabled entirely, and is
used very rarely, so this has been unnoticed so far. It was found by a
proposed new powerpc watchdog that takes a soft-NMI directly from the
masked_interrupt handler and using the emergency stack. That crashed
at BUG_ON(in_nmi()) in nmi_enter(). preempt_count()s were found to be
garbage.
To fix this, zero the entire THREAD_SIZE allocation, and initialize
the thread_info.
Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Move it all into setup_64.c, use a function not a macro. Fix
crashes on Cell by setting preempt_count to 0 not HARDIRQ_OFFSET] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[bwh: Backported to 3.2:
- There's only one emergency stack
- No need to call klp_init_thread_info()
- Add the ti variable in emergency_stack_init()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The per netns loopback_dev->ip6_ptr is unregistered and set to
NULL when its mtu is set to smaller than IPV6_MIN_MTU, this
leads to that we could set rt->rt6i_idev NULL after a
rt6_uncached_list_flush_dev() and then crash after another
call.
In this case we should just bring its inet6_dev down, rather
than unregistering it, at least prior to commit 176c39af29bc
("netns: fix addrconf_ifdown kernel panic") we always
override the case for loopback.
Thanks a lot to Andrey for finding a reliable reproducer.
Fixes: 176c39af29bc ("netns: fix addrconf_ifdown kernel panic") Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Daniel Lezcano <dlezcano@fr.ibm.com> Cc: David Ahern <dsahern@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: David Ahern <dsahern@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: the NETDEV_CHANGEMTU case used to fall-through to the
NETDEV_DOWN case here, so replace that with a separate call to addrconf_ifdown()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Network interface groups support added while ago, however
there is no IFLA_GROUP attribute description in policy
and netlink message size calculations until now.
Add IFLA_GROUP attribute to the policy.
Fixes: cbda10fa97d7 ("net_device: add support for network device groups") Signed-off-by: Serhey Popovych <serhe.popovych@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Thomas Gleixner wrote:
> The CRIU support added a 'feature' which allows a user space task to send
> arbitrary (kernel) signals to itself. The changelog says:
>
> The kernel prevents sending of siginfo with positive si_code, because
> these codes are reserved for kernel. I think we can allow a task to
> send such a siginfo to itself. This operation should not be dangerous.
>
> Quite contrary to that claim, it turns out that it is outright dangerous
> for signals with info->si_code == SI_TIMER. The following code sequence in
> a user space task allows to crash the kernel:
>
> id = timer_create(CLOCK_XXX, ..... signo = SIGX);
> timer_set(id, ....);
> info->si_signo = SIGX;
> info->si_code = SI_TIMER:
> info->_sifields._timer._tid = id;
> info->_sifields._timer._sys_private = 2;
> rt_[tg]sigqueueinfo(..., SIGX, info);
> sigemptyset(&sigset);
> sigaddset(&sigset, SIGX);
> rt_sigtimedwait(sigset, info);
>
> For timers based on CLOCK_PROCESS_CPUTIME_ID, CLOCK_THREAD_CPUTIME_ID this
> results in a kernel crash because sigwait() dequeues the signal and the
> dequeue code observes:
>
> info->si_code == SI_TIMER && info->_sifields._timer._sys_private != 0
>
> which triggers the following callchain:
>
> do_schedule_next_timer() -> posix_cpu_timer_schedule() -> arm_timer()
>
> arm_timer() executes a list_add() on the timer, which is already armed via
> the timer_set() syscall. That's a double list add which corrupts the posix
> cpu timer list. As a consequence the kernel crashes on the next operation
> touching the posix cpu timer list.
>
> Posix clocks which are internally implemented based on hrtimers are not
> affected by this because hrtimer_start() can handle already armed timers
> nicely, but it's a reliable way to trigger the WARN_ON() in
> hrtimer_forward(), which complains about calling that function on an
> already armed timer.
This problem has existed since the posix timer code was merged into
2.5.63. A few releases earlier in 2.5.60 ptrace gained the ability to
inject not just a signal (which linux has supported since 1.0) but the
full siginfo of a signal.
The core problem is that the code will reschedule in response to
signals getting dequeued not just for signals the timers sent but
for other signals that happen to a si_code of SI_TIMER.
Avoid this confusion by testing to see if the queued signal was
preallocated as all timer signals are preallocated, and so far
only the timer code preallocates signals.
Move the check for if a timer needs to be rescheduled up into
collect_signal where the preallocation check must be performed,
and pass the result back to dequeue_signal where the code reschedules
timers. This makes it clear why the code cares about preallocated
timers.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Reference: 66dd34ad31e5 ("signal: allow to send any siginfo to itself")
Reference: 1669ce53e2ff ("Add PTRACE_GETSIGINFO and PTRACE_SETSIGINFO") Fixes: db8b50ba75f2 ("[PATCH] POSIX clocks & timers") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
This fixes a crash when function_graph and jprobes are used together.
This is essentially commit 237d28db036e ("ftrace/jprobes/x86: Fix
conflict between jprobes and function graph tracing"), but for powerpc.
Jprobes breaks function_graph tracing since the jprobe hook needs to use
jprobe_return(), which never returns back to the hook, but instead to
the original jprobe'd function. The solution is to momentarily pause
function_graph tracing before invoking the jprobe hook and re-enable it
when returning back to the original jprobe'd function.
Fixes: 6794c78243bf ("powerpc64: port of the function graph tracer") Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
[bwh: Backported to 3.2: include <linux/ftrace.h>, which apparently gets
included indirectly upstream] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
The default error code in pfkey_msg2xfrm_state() is -ENOBUFS. We
added a new call to security_xfrm_state_alloc() which sets "err" to zero
so there several places where we can return ERR_PTR(0) if kmalloc()
fails. The caller is expecting error pointers so it leads to a NULL
dereference.
Fixes: df71837d5024 ("[LSM-IPSec]: Security association restriction.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>