git.itanic.dy.fi Git - linux-stable/log
linux-stable
6 years ago: Linux 3.16.53 (tag: v3.16.53)
Ben Hutchings [Tue, 9 Jan 2018 00:35:23 +0000 (00:35 +0000)]
Linux 3.16.53

6 years ago: kaiser: x86: Fix NMI handling
Jiri Kosina [Wed, 3 Jan 2018 14:20:04 +0000 (15:20 +0100)]
kaiser: x86: Fix NMI handling

On Mon, 4 Dec 2017, Hugh Dickins wrote:

> kaiser-3.18.72.tar

Hi Hugh,

this hunk from 0024-kaiser-merged-update.patch:

-       SWITCH_KERNEL_CR3_NO_STACK
+       /*
+        * percpu variables are mapped with user CR3, so no need
+        * to switch CR3 here.
+        */
        cld
        movq    %rsp, %rdx
        movq    PER_CPU_VAR(kernel_stack), %rsp

is problematic, as the patchset never actually user-maps the kernel_stack
percpu variable, and therefore crashes on NMIs.

The patch below is needed to make NMIs work properly.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: KPTI: Report when enabled
Kees Cook [Wed, 3 Jan 2018 18:18:01 +0000 (10:18 -0800)]
KPTI: Report when enabled

Make sure dmesg reports when KPTI is enabled.
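
A minimal sketch of the kind of report this adds (the exact message text
and its location in the 3.16 backport are assumptions):

  /* e.g. at the end of KPTI/kaiser setup */
  if (kaiser_enabled)
          pr_info("Kernel/User page tables isolation: enabled\n");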

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: KPTI: Rename to PAGE_TABLE_ISOLATION
Kees Cook [Thu, 4 Jan 2018 01:14:24 +0000 (01:14 +0000)]
KPTI: Rename to PAGE_TABLE_ISOLATION

This renames CONFIG_KAISER to CONFIG_PAGE_TABLE_ISOLATION.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/kaiser: Move feature detection up
Borislav Petkov [Mon, 25 Dec 2017 12:57:16 +0000 (13:57 +0100)]
x86/kaiser: Move feature detection up

... before the first use of kaiser_enabled as otherwise funky
things happen:

  about to get started...
  (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000]
  (XEN) Pagetable walk from ffff88022a449090:
  (XEN)  L4[0x110] = 0000000229e0e067 0000000000001e0e
  (XEN)  L3[0x008] = 0000000000000000 ffffffffffffffff
  (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08
  entry.o#create_bounce_frame+0x135/0x14d
  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
  (XEN) ----[ Xen-4.9.1_02-3.21  x86_64  debug=n   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e033:[<ffffffff81007460>]
  (XEN) RFLAGS: 0000000000000286   EM: 1   CONTEXT: pv guest (d0v0)

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: kaiser: disabled on Xen PV
Jiri Kosina [Tue, 2 Jan 2018 13:19:49 +0000 (14:19 +0100)]
kaiser: disabled on Xen PV

Kaiser cannot be used on paravirtualized MMUs, where reading and writing CR3
is mediated by the hypervisor. This does not work with KAISER, as the CR3
switch to and from the user-space PGD would require mapping the whole XEN_PV
machinery into both page tables.

More importantly, enabling KAISER on Xen PV doesn't make much sense, as PV
guests already use distinct %cr3 values for kernel and user.
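
A hedged sketch of the resulting check, using xen_pv_domain() as the
backport note below says (the message text is an assumption):

  /* arch/x86/mm/kaiser.c, during kaiser setup */
  if (xen_pv_domain()) {
          pr_info("kaiser: disabled on Xen PV\n");
          kaiser_enabled = 0;
  }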

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16: use xen_pv_domain()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/kaiser: Reenable PARAVIRT
Borislav Petkov [Tue, 2 Jan 2018 13:19:49 +0000 (14:19 +0100)]
x86/kaiser: Reenable PARAVIRT

Now that the required bits have been addressed, reenable
PARAVIRT.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/paravirt: Dont patch flush_tlb_single
Thomas Gleixner [Mon, 4 Dec 2017 14:07:30 +0000 (15:07 +0100)]
x86/paravirt: Dont patch flush_tlb_single

commit a035795499ca1c2bd1928808d1a156eda1420383 upstream.

native_flush_tlb_single() will be changed with the upcoming
PAGE_TABLE_ISOLATION feature. This requires having more code in
there than just INVLPG.

Remove the paravirt patching for it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: kaiser: kaiser_flush_tlb_on_return_to_user() check PCID
Hugh Dickins [Sun, 5 Nov 2017 01:43:06 +0000 (18:43 -0700)]
kaiser: kaiser_flush_tlb_on_return_to_user() check PCID

Let kaiser_flush_tlb_on_return_to_user() do the X86_FEATURE_PCID
check, instead of each caller doing it inline first: nobody needs
to optimize for the noPCID case, it's clearer this way, and better
suits later changes.  Replace those no-op X86_CR3_PCID_KERN_FLUSH lines
by a BUILD_BUG_ON() in load_new_mm_cr3(), in case something changes.
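
A sketch of the resulting helper, assuming the names used elsewhere in
this series (x86_cr3_pcid_user, KAISER_SHADOW_PGD_OFFSET):

  void kaiser_flush_tlb_on_return_to_user(void)
  {
          if (this_cpu_has(X86_FEATURE_PCID))
                  this_cpu_write(x86_cr3_pcid_user,
                          X86_CR3_PCID_USER_FLUSH | KAISER_SHADOW_PGD_OFFSET);
  }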

(cherry picked from Change-Id: I9b528ed9d7c1ae4a3b4738c2894ee1740b6fb0b9)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: kaiser: asm/tlbflush.h handle noPGE at lower level
Hugh Dickins [Sun, 5 Nov 2017 01:23:24 +0000 (18:23 -0700)]
kaiser: asm/tlbflush.h handle noPGE at lower level

I found asm/tlbflush.h too twisty, and think it safer not to avoid
__native_flush_tlb_global_irq_disabled() in the kaiser_enabled case,
but instead let it handle kaiser_enabled along with cr3: it can just
use __native_flush_tlb() for that, no harm in re-disabling preemption.

(This is not the same change as Kirill and Dave have suggested for
upstream, flipping PGE in cr4: that's neat, but needs a cpu_has_pge
check; cr3 is enough for kaiser, and thought to be cheaper than cr4.)

Also delete the X86_FEATURE_INVPCID invpcid_flush_all_nonglobals()
preference from __native_flush_tlb(): unlike the invpcid_flush_all()
preference in __native_flush_tlb_global(), it's not seen in upstream
4.14, and was recently reported to be surprisingly slow.
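
A hedged sketch of the resulting shape (details of the 3.16 code may
differ):

  static inline void __native_flush_tlb_global_irq_disabled(void)
  {
          unsigned long cr4;

          if (kaiser_enabled) {
                  /* no global pages with kaiser: a CR3 write flushes all */
                  __native_flush_tlb();
                  return;
          }
          cr4 = native_read_cr4();
          native_write_cr4(cr4 & ~X86_CR4_PGE);   /* clear PGE: flush globals */
          native_write_cr4(cr4);                  /* restore PGE */
  }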

(cherry picked from Change-Id: I0da819a797ff46bca6590040b6480178dff6ba1e)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush
Hugh Dickins [Wed, 4 Oct 2017 03:49:04 +0000 (20:49 -0700)]
kaiser: use ALTERNATIVE instead of x86_cr3_pcid_noflush

Now that we're playing the ALTERNATIVE game, use that more efficient
method, instead of user-mapping an extra page and reading an extra
cacheline each time for x86_cr3_pcid_noflush.

Neel has found that __stringify(bts $X86_CR3_PCID_NOFLUSH_BIT, %rax)
is a working substitute for the "bts $63, %rax" in these ALTERNATIVEs;
but the one line with $63 in it looks clearer, so let's stick with that.

Worried about what happens with an ALTERNATIVE between the jump and
jump label in another ALTERNATIVE?  I was, but have checked the
combinations in SWITCH_KERNEL_CR3_NO_STACK at entry_SYSCALL_64,
and it does a good job.

(cherry picked from Change-Id: I46d06167615aa8d628eed9972125ab2faca93f05)

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/kaiser: Check boottime cmdline params
Borislav Petkov [Tue, 2 Jan 2018 13:19:48 +0000 (14:19 +0100)]
x86/kaiser: Check boottime cmdline params

AMD (and possibly other vendors) are not affected by the leak
KAISER is protecting against.

Keep the "nopti" for traditional reasons and add pti=<on|off|auto>
like upstream.
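
A hedged sketch of the boot-time decision this describes, modeled on the
upstream pti code; cmdline_find_option() is the helper added later in
this series, and the AMD check reflects the note above:

  static void __init kaiser_check_boottime_disable(void)
  {
          char arg[5];
          int ret;

          ret = cmdline_find_option(boot_command_line, "pti", arg, sizeof(arg));
          if (ret == 3 && !strncmp(arg, "off", 3))
                  goto disable;
          if (ret == 2 && !strncmp(arg, "on", 2))
                  return;                 /* forced on: skip the auto checks */
          if (cmdline_find_option_bool(boot_command_line, "nopti"))
                  goto disable;
          if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
                  goto disable;           /* auto: AMD is not affected */
          return;
  disable:
          kaiser_enabled = 0;
  }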

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling
Borislav Petkov [Tue, 2 Jan 2018 13:19:48 +0000 (14:19 +0100)]
x86/kaiser: Rename and simplify X86_FEATURE_KAISER handling

Concentrate it in arch/x86/mm/kaiser.c and use the upstream string "nopti".

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/boot: Add early cmdline parsing for options with arguments
Tom Lendacky [Mon, 17 Jul 2017 21:10:33 +0000 (16:10 -0500)]
x86/boot: Add early cmdline parsing for options with arguments

commit e505371dd83963caae1a37ead9524e8d997341be upstream.

Add a cmdline_find_option() function to look for cmdline options that
take arguments. The argument is returned in a supplied buffer and the
argument length (regardless of whether it fits in the supplied buffer)
is returned, with -1 indicating not found.
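
A short usage sketch of that contract (caller names are hypothetical):

  char arg[4];
  int len = cmdline_find_option(boot_command_line, "pti", arg, sizeof(arg));

  if (len < 0)
          return;                         /* "pti=" not on the command line */
  if (len >= (int)sizeof(arg))
          pr_warn("pti= argument truncated\n");   /* full length still returned */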

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/36b5f97492a9745dce27682305f990fc20e5cf8a.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/boot: Pass in size to early cmdline parsing
Dave Hansen [Tue, 22 Dec 2015 22:52:43 +0000 (14:52 -0800)]
x86/boot: Pass in size to early cmdline parsing

commit 8c0517759a1a100a8b83134cf3c7f254774aaeba upstream.

We will use this in a few patches to implement tests for early parsing.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Aligned args properly. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225243.5CC47EB6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/boot: Simplify early command line parsing
Dave Hansen [Tue, 22 Dec 2015 22:52:41 +0000 (14:52 -0800)]
x86/boot: Simplify early command line parsing

commit 4de07ea481361b08fe13735004dafae862482d38 upstream.

__cmdline_find_option_bool() tries to account for both NULL-terminated
and non-NULL-terminated strings. It keeps 'pos' to look for the end of
the buffer and also looks for '!c' in a bunch of places to look for NULL
termination.

But, it also calls strlen(). You can't call strlen on a
non-NULL-terminated string.

If !strlen(cmdline), then cmdline[0]=='\0'. In that case, we will go into
the while() loop, set c='\0', hit st_wordstart, notice !c, and will
immediately return 0.

So, remove the strlen().  It is unnecessary and unsafe.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225241.15365E43@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/boot: Fix early command-line parsing when partial word matches
Dave Hansen [Tue, 22 Dec 2015 22:52:39 +0000 (14:52 -0800)]
x86/boot: Fix early command-line parsing when partial word matches

commit abcdc1c694fa4055323cbec1cde4c2cb6b68398c upstream.

cmdline_find_option_bool() keeps track of position in two strings:

 1. the command-line
 2. the option we are searching for in the command-line

We plow through each character in the command-line one at a time, always
moving forward. We move forward in the option ('opptr') when we match
characters in 'cmdline'. We reset the 'opptr' only when we go into the
'st_wordstart' state.

But, if we fail to match an option because we see a space
(state=st_wordcmp, *opptr='\0',c=' '), we set state='st_wordskip' and
'break', moving to the next character. But, that move to the next
character is the one *after* the ' '. This means that we will miss a
'st_wordstart' state.

For instance, if we have

  cmdline = "foo fool";

and are searching for "fool", we have:

  "fool"
  opptr = ----^

           "foo fool"
   c = --------^

We see that 'l' != ' ', set state=st_wordskip, break, and then move 'c', so:

          "foo fool"
  c = ---------^

and are still in state=st_wordskip. We will stay in wordskip until we
have skipped "fool", thus missing the option we were looking for. This
*only* happens when you have a partially-matching word followed by a
matching one.

To fix this, we always fall *into* the 'st_wordskip' state when we set
it.
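
A simplified, userspace-runnable sketch of the fixed state machine (the
kernel's __cmdline_find_option_bool() also copes with non-NUL-terminated
buffers and COMMAND_LINE_SIZE); the fall-through into st_wordskip is the
fix described above:

  #include <stdio.h>

  static int find_option_bool(const char *cmdline, const char *option)
  {
          enum { st_wordstart, st_wordcmp, st_wordskip } state = st_wordstart;
          const char *opptr = NULL;
          char c;

          while ((c = *cmdline++)) {
                  switch (state) {
                  case st_wordstart:
                          if (c == ' ')
                                  break;
                          state = st_wordcmp;
                          opptr = option;
                          /* fall through: compare this same character */
                  case st_wordcmp:
                          if (!*opptr && c == ' ')
                                  return 1;       /* matched a whole word */
                          if (*opptr && c == *opptr) {
                                  opptr++;
                                  break;
                          }
                          state = st_wordskip;
                          /* fall through: do NOT skip past this character */
                  case st_wordskip:
                          if (c == ' ')
                                  state = st_wordstart;
                          break;
                  }
          }
          /* a match that runs to the end must exhaust the option string */
          return state == st_wordcmp && !*opptr;
  }

  int main(void)
  {
          printf("%d\n", find_option_bool("foo fool", "fool"));   /* 1 */
          printf("%d\n", find_option_bool("foo", "fool"));        /* 0 */
          return 0;
  }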

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225239.8E1DCA58@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/boot: Fix early command-line parsing when matching at end
Dave Hansen [Tue, 22 Dec 2015 22:52:38 +0000 (14:52 -0800)]
x86/boot: Fix early command-line parsing when matching at end

commit 02afeaae9843733a39cd9b11053748b2d1dc5ae7 upstream.

The x86 early command line parsing in cmdline_find_option_bool() is
buggy. If it matches a specified 'option' all the way to the end of the
command-line, it will consider it a match.

For instance,

  cmdline = "foo";
  cmdline_find_option_bool(cmdline, "fool");

will return 1. This is particularly annoying since we have actual FPU
options like "noxsave" and "noxsaves". So, the command line "foo bar noxsave"
will match *BOTH* "noxsave" and "noxsaves". (This turns out not to be
an actual problem because "noxsave" implies "noxsaves", but it's still
confusing.)

To fix this, we simplify the code and stop tracking 'len'. 'len'
was trying to indicate either the NULL terminator *OR* the end of a
non-NULL-terminated command line at 'COMMAND_LINE_SIZE'. But, each of the
three states is *already* checking 'cmdline' for a NULL terminator.

We _only_ need to check if we have overrun 'COMMAND_LINE_SIZE', and that
we can do without keeping 'len' around.

Also add some comments to clarify what is going on.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fenghua.yu@intel.com
Cc: yu-cheng.yu@intel.com
Link: http://lkml.kernel.org/r/20151222225238.9AEB560C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: kaiser: add "nokaiser" boot option, using ALTERNATIVE
Hugh Dickins [Sun, 24 Sep 2017 23:59:49 +0000 (16:59 -0700)]
kaiser: add "nokaiser" boot option, using ALTERNATIVE

Added "nokaiser" boot option: an early param like "noinvpcid".
Most places now check int kaiser_enabled (#defined 0 when not
CONFIG_KAISER) instead of #ifdef CONFIG_KAISER; but entry_64.S
and entry_64_compat.S are using the ALTERNATIVE technique, which
patches in the preferred instructions at runtime.  That technique
is tied to x86 cpu features, so X86_FEATURE_KAISER is fabricated
(with "" as its comment so that "kaiser" is not magicked into /proc/cpuinfo).

Prior to "nokaiser", Kaiser #defined _PAGE_GLOBAL 0: revert that,
but be careful with both _PAGE_GLOBAL and CR4.PGE: setting them when
nokaiser like when !CONFIG_KAISER, but not setting either when kaiser -
neither matters on its own, but it's hard to be sure that _PAGE_GLOBAL
won't get set in some obscure corner, or that something might add PGE into CR4.
By omitting _PAGE_GLOBAL from __supported_pte_mask when kaiser_enabled,
all page table setup which uses pte_pfn() masks it out of the ptes.

It's slightly shameful that the same declaration versus definition of
kaiser_enabled appears in not one, not two, but in three header files
(asm/kaiser.h, asm/pgtable.h, asm/tlbflush.h).  I felt safer that way,
than with #including any of those in any of the others; and did not
feel it worth an asm/kaiser_enabled.h - kernel/cpu/common.c includes
them all, so we shall hear about it if they get out of synch.

Cleanups while in the area: removed the silly #ifdef CONFIG_KAISER
from kaiser.c; removed the unused native_get_normal_pgd(); removed
the spurious reg clutter from SWITCH_*_CR3 macro stubs; corrected some
comments.  But more interestingly, set CR4.PSE in secondary_startup_64:
the manual is clear that it does not matter whether it's 0 or 1 when
4-level-pts are enabled, but I was distracted to find cr4 different on
BSP and auxiliaries - BSP alone was adding PSE, in probe_page_size_mask().

(cherry picked from Change-Id: I8e5bec716944444359cbd19f6729311eff943e9a)

Signed-off-by: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/alternatives: Use optimized NOPs for padding
Borislav Petkov [Sat, 10 Jan 2015 19:34:07 +0000 (20:34 +0100)]
x86/alternatives: Use optimized NOPs for padding

commit 4fd4b6e5537cec5b56db0b22546dd439ebb26830 upstream.

Alternatives now allow for an empty old instruction. In this case we go
and pad the space with NOPs at assembly time. However, there are the
optimal, longer NOPs which should be used instead. Do that at patching
time by adding alt_instr.padlen-sized NOPs at the old instruction address.

Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/alternatives: Make JMPs more robust
Borislav Petkov [Mon, 5 Jan 2015 12:48:41 +0000 (13:48 +0100)]
x86/alternatives: Make JMPs more robust

commit 48c7a2509f9e237d8465399d9cdfe487d3212a23 upstream.

Up until now we had to pay attention to relative JMPs in alternatives
about how their relative offset gets computed so that the jump target
is still correct. Or, as it is the case for near CALLs (opcode e8), we
still have to go and readjust the offset at patching time.

What is more, the static_cpu_has_safe() facility had to forcefully
generate 5-byte JMPs since we couldn't rely on the compiler to generate
properly sized ones, so we had to force the longest ones. Worse than
that, sometimes it would generate a replacement JMP which is longer than
the original one, thus overwriting the beginning of the next instruction
at patching time.

So, in order to alleviate all that and make using JMPs more
straightforward, we pad the original instruction in an alternative
block with NOPs at build time, should the replacement(s) be longer.
This way, alternatives users shouldn't have to pay special attention
to keeping original and replacement instruction sizes in sync; the
assembler simply adds padding where needed and does nothing otherwise.

As a second aspect, we go and recompute JMPs at patching time so that we
can try to make 5-byte JMPs into two-byte ones if possible. If not, we
still have to recompute the offsets as the replacement JMP gets put far
away in the .altinstr_replacement section leading to a wrong offset if
copied verbatim.

For example, on a locally generated kernel image

  old insn VA: 0xffffffff810014bd, CPU feat: X86_FEATURE_ALWAYS, size: 2
  __switch_to:
   ffffffff810014bd:      eb 21                   jmp ffffffff810014e0
  repl insn: size: 5
  ffffffff81d0b23c:       e9 b1 62 2f ff          jmpq ffffffff810014f2

gets corrected to a 2-byte JMP:

  apply_alternatives: feat: 3*32+21, old: (ffffffff810014bd, len: 2), repl: (ffffffff81d0b23c, len: 5)
  alt_insn: e9 b1 62 2f ff
  recompute_jumps: next_rip: ffffffff81d0b241, tgt_rip: ffffffff810014f2, new_displ: 0x00000033, ret len: 2
  converted to: eb 33 90 90 90

and a 5-byte JMP:

  old insn VA: 0xffffffff81001516, CPU feat: X86_FEATURE_ALWAYS, size: 2
  __switch_to:
   ffffffff81001516:      eb 30                   jmp ffffffff81001548
  repl insn: size: 5
   ffffffff81d0b241:      e9 10 63 2f ff          jmpq ffffffff81001556

gets shortened into a two-byte one:

  apply_alternatives: feat: 3*32+21, old: (ffffffff81001516, len: 2), repl: (ffffffff81d0b241, len: 5)
  alt_insn: e9 10 63 2f ff
  recompute_jumps: next_rip: ffffffff81d0b246, tgt_rip: ffffffff81001556, new_displ: 0x0000003e, ret len: 2
  converted to: eb 3e 90 90 90

... and so on.

This leads to a net win of around

40ish replacements * 3 bytes savings =~ 120 bytes of I$

on an AMD guest which means some savings of precious instruction cache
bandwidth. The padding to the shorter 2-byte JMPs are single-byte NOPs
which on smart microarchitectures means discarding NOPs at decode time
and thus freeing up execution bandwidth.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/alternatives: Add instruction padding
Borislav Petkov [Sat, 27 Dec 2014 09:41:52 +0000 (10:41 +0100)]
x86/alternatives: Add instruction padding

commit 4332195c5615bf748624094ce4ff6797e475024d upstream.

Up until now we have always paid attention to making sure the length of
the new instruction replacing the old one is less than or equal to the
length of the old instruction. If the new instruction is longer, at the
time it replaces the old instruction it will overwrite the beginning
of the next instruction in the kernel image and cause your pants to
catch fire.

So instead of having to pay attention, teach the alternatives framework
to pad shorter old instructions with NOPs at buildtime - but only in the
case when

  len(old instruction(s)) < len(new instruction(s))

and add nothing in the >= case. (In that case we do add_nops() when
patching).

This way the alternatives user shouldn't have to care about instruction
sizes and simply use the macros.

Add asm ALTERNATIVE* flavor macros too, while at it.

Also, we need to save the pad length in a separate struct alt_instr
member for NOP optimization, and the way to do that reliably is to carry
the pad length instead of trying to detect whether we're looking at
single-byte NOPs or at pathological instruction offsets like e9 90 90 90
90, for example, which is a valid instruction.

Thanks to Michael Matz for the great help with toolchain questions.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/alternatives: Cleanup DPRINTK macro
Borislav Petkov [Tue, 30 Dec 2014 19:27:09 +0000 (20:27 +0100)]
x86/alternatives: Cleanup DPRINTK macro

commit db477a3386dee183130916d6bbf21f5828b0b2e2 upstream.

Make it pass __func__ implicitly. Also, dump info about each replacement
we're doing. Fix up comments and style while at it.
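
A sketch of the resulting macro (passing __func__ implicitly):

  #define DPRINTK(fmt, args...)                                           \
  do {                                                                    \
          if (debug_alternative)                                          \
                  printk(KERN_DEBUG "%s: " fmt "\n", __func__, ##args);   \
  } while (0)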

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: kaiser: alloc_ldt_struct() use get_zeroed_page()
Hugh Dickins [Mon, 18 Dec 2017 03:53:01 +0000 (19:53 -0800)]
kaiser: alloc_ldt_struct() use get_zeroed_page()

Change the 3.2.96 and 3.18.72 alloc_ldt_struct() to allocate its entries
with get_zeroed_page(), as 4.3 onwards does since f454b4788613 ("x86/ldt:
Fix small LDT allocation for Xen").  This then matches the free_page()
I had misported in __free_ldt_struct(), and fixes the
"BUG: Bad page state in process ldt_gdt_32 ... flags: 0x80(slab)"
reported by Kees Cook and Jiri Kosina, and analysed by Jiri.
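
A sketch of the allocation after this change, following f454b4788613:

  if (alloc_size > PAGE_SIZE)
          new_ldt->entries = vzalloc(alloc_size);
  else
          new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
  if (!new_ldt->entries) {
          kfree(new_ldt);
          return NULL;
  }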

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86: kvmclock: Disable use from vDSO if KPTI is enabled
Ben Hutchings [Fri, 5 Jan 2018 03:09:26 +0000 (03:09 +0000)]
x86: kvmclock: Disable use from vDSO if KPTI is enabled

Currently the pvclock pages aren't being added to user-space page
tables, and my attempt to fix this didn't work.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: kaiser: Set _PAGE_USER of the vsyscall page
Borislav Petkov [Thu, 4 Jan 2018 16:42:45 +0000 (17:42 +0100)]
kaiser: Set _PAGE_USER of the vsyscall page

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16:
 - Drop the case for disabled CONFIG_X86_VSYSCALL_EMULATION
 - Adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: KAISER: Kernel Address Isolation
Richard Fellner [Thu, 4 May 2017 12:26:50 +0000 (14:26 +0200)]
KAISER: Kernel Address Isolation

This patch introduces our implementation of KAISER (Kernel Address Isolation to
have Side-channels Efficiently Removed), a kernel isolation technique to close
hardware side channels on kernel address information.

More information about the patch can be found on:

        https://github.com/IAIK/KAISER

From: Richard Fellner <richard.fellner@student.tugraz.at>
From: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode
Date: Thu, 4 May 2017 14:26:50 +0200
Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2
Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5

To: <linux-kernel@vger.kernel.org>
To: <kernel-hardening@lists.openwall.com>
Cc: <clementine.maurice@iaik.tugraz.at>
Cc: <moritz.lipp@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: <kirill.shutemov@linux.intel.com>
Cc: <anders.fogh@gdata-adan.de>
After several recent works [1,2,3] KASLR on x86_64 was basically
considered dead by many researchers. We have been working on an
efficient but effective fix for this problem and found that not mapping
the kernel space when running in user mode is the solution to this
problem [4] (the corresponding paper [5] will be presented at ESSoS17).
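
The core trick, sketched with names used later in this series: each PGD
page is allocated as a pair, with a "shadow" user copy one page above
the kernel one, and the shadow copy maps only the minimal entry/exit
code (KAISER_SHADOW_PGD_OFFSET is the 0x1000 offset named in a follow-up
patch):

  static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp)
  {
          return (pgd_t *)((unsigned long)pgdp | KAISER_SHADOW_PGD_OFFSET);
  }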

With this RFC patch we allow anybody to configure their kernel with the
flag CONFIG_KAISER to add our defense mechanism.

If there are any questions we would love to answer them.
We also appreciate any comments!

Cheers,
Daniel (+ the KAISER team from Graz University of Technology)

[1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
[2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf
[3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf
[4] https://github.com/IAIK/KAISER
[5] https://gruss.cc/files/kaiser.pdf

(cherry picked from Change-Id: I0eb000c33290af01fc4454ca0c701d00f1d30b1d)

Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
arch/x86/entry/entry_64_compat.S (not in this tree)
arch/x86/ia32/ia32entry.S (patched instead of that)
arch/x86/include/asm/hw_irq.h
arch/x86/include/asm/pgtable_types.h
arch/x86/include/asm/processor.h
arch/x86/kernel/irqinit.c
arch/x86/kernel/process.c
arch/x86/mm/Makefile
arch/x86/mm/pgtable.c
init/main.c

Signed-off-by: Hugh Dickins <hughd@google.com>
[bwh: Folded in the follow-up patches from Hugh:
 - kaiser: merged update
 - kaiser: do not set _PAGE_NX on pgd_none
 - kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
 - kaiser: fix build and FIXME in alloc_ldt_struct()
 - kaiser: KAISER depends on SMP
 - kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER
 - kaiser: fix perf crashes
 - kaiser: ENOMEM if kaiser_pagetable_walk() NULL
 - kaiser: tidied up asm/kaiser.h somewhat
 - kaiser: tidied up kaiser_add/remove_mapping slightly
 - kaiser: kaiser_remove_mapping() move along the pgd
 - kaiser: align addition to x86/mm/Makefile
 - kaiser: cleanups while trying for gold link
 - kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET
 - kaiser: delete KAISER_REAL_SWITCH option
 - kaiser: vmstat show NR_KAISERTABLE as nr_overhead
 - kaiser: enhanced by kernel and user PCIDs
 - kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
 - kaiser: PCID 0 for kernel and 128 for user
 - kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
 - kaiser: paranoid_entry pass cr3 need to paranoid_exit
 - kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls
 - kaiser: fix unlikely error in alloc_ldt_struct()
 - kaiser: drop is_atomic arg to kaiser_pagetable_walk()
 Backported to 3.16:
 - Add missing #include in arch/x86/mm/kaiser.c
 - Use variable PEBS buffer size since we have "perf/x86/intel: Use PAGE_SIZE
   for PEBS buffer size on Core2"
 - Renumber X86_FEATURE_INVPCID_SINGLE to avoid collision
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm/64: Fix reboot interaction with CR4.PCIDE
Andy Lutomirski [Mon, 9 Oct 2017 04:53:05 +0000 (21:53 -0700)]
x86/mm/64: Fix reboot interaction with CR4.PCIDE

commit 924c6b900cfdf376b07bccfd80e62b21914f8a5a upstream.

Trying to reboot via real mode fails with PCID on: long mode cannot
be exited while CR4.PCIDE is set.  (No, I have no idea why, but the
SDM and actual CPUs are in agreement here.)  The result is a GPF and
a hang instead of a reboot.

I didn't catch this in testing because neither my computer nor my VM
reboots this way.  I can trigger it with reboot=bios, though.
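
A sketch of the fix in the real-mode reboot path (the 3.16 backport uses
clear_in_cr4(), per the note below):

  /* machine_real_restart(): exiting long mode fails if CR4.PCIDE is set */
  if (boot_cpu_has(X86_FEATURE_PCID))
          clear_in_cr4(X86_CR4_PCIDE);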

Fixes: 660da7c9228f ("x86/mm: Enable CR4.PCIDE on supported systems")
Reported-and-tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/f1e7d965998018450a7a70c2823873686a8b21c0.1507524746.git.luto@kernel.org
Cc: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16: use clear_in_cr4()]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Enable CR4.PCIDE on supported systems
Andy Lutomirski [Thu, 29 Jun 2017 15:53:21 +0000 (08:53 -0700)]
x86/mm: Enable CR4.PCIDE on supported systems

commit 660da7c9228f685b2ebe664f9fd69aaddcc420b5 upstream.

We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect. A followup patch will start using PCID.
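
A hedged sketch of the setup logic in CPU init (the Xen PV case is
handled by clearing the feature bit there, per the backport notes):

  if (cpu_has(c, X86_FEATURE_PCID)) {
          if (cpu_has(c, X86_FEATURE_PGE)) {
                  set_in_cr4(X86_CR4_PCIDE);      /* 3.16 has no cr4_set_bits() */
          } else {
                  /* PCID without PGE is not worth using */
                  clear_cpu_cap(c, X86_FEATURE_PCID);
          }
  }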

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/6327ecd907b32f79d5aa0d466f04503bbec5df88.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Hugh Dickins: Backported to 3.18:
 - arch/x86/xen/enlighten_pv.c (not in this tree)
 - arch/x86/xen/enlighten.c (patched instead of that)]
Signed-off-by: Hugh Dickins <hughd@google.com>
[Borislav Petkov: Fix bad backport to disable PCID on Xen]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Add the 'nopcid' boot option to turn off PCID
Andy Lutomirski [Thu, 29 Jun 2017 15:53:20 +0000 (08:53 -0700)]
x86/mm: Add the 'nopcid' boot option to turn off PCID

commit 0790c9aad84901ca1bdc14746175549c8b5da215 upstream.

The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.
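
A sketch of the handler, closely following the upstream commit:

  #ifdef CONFIG_X86_64
  static int __init x86_nopcid_setup(char *s)
  {
          /* nopcid doesn't accept parameters */
          if (s)
                  return -EINVAL;

          /* do not emit a message if the feature is not present */
          if (!boot_cpu_has(X86_FEATURE_PCID))
                  return 0;

          setup_clear_cpu_cap(X86_FEATURE_PCID);
          pr_info("nopcid: PCID feature disabled\n");
          return 0;
  }
  early_param("nopcid", x86_nopcid_setup);
  #endif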

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/8bbb2e65bcd249a5f18bfb8128b4689f08ac2b60.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Hugh Dickins: Backported to 3.18:
 - Documentation/admin-guide/kernel-parameters.txt (not in this tree)
 - Documentation/kernel-parameters.txt (patched instead of that)]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Disable PCID on 32-bit kernels
Andy Lutomirski [Thu, 29 Jun 2017 15:53:19 +0000 (08:53 -0700)]
x86/mm: Disable PCID on 32-bit kernels

commit cba4671af7550e008f7a7835f06df0763825bf3e upstream.

32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode.  Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.
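
Upstream does this via the disabled-features mask rather than a runtime
check; a sketch (the 3.16 backport may do it differently):

  /* arch/x86/include/asm/disabled-features.h */
  #ifdef CONFIG_X86_64
  # define DISABLE_PCID           0
  #else
  /* PCID can only be used in 64-bit mode */
  # define DISABLE_PCID           (1 << (X86_FEATURE_PCID & 31))
  #endif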

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/2e391769192a4d31b808410c383c6bf0734bc6ea.1498751203.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code
Andy Lutomirski [Sun, 28 May 2017 17:00:14 +0000 (10:00 -0700)]
x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code

commit ce4a4e565f5264909a18c733b864c3f74467f69e upstream.

The UP asm/tlbflush.h generates somewhat nicer code than the SMP version.
Aside from that, it's fallen quite a bit behind the SMP code:

 - flush_tlb_mm_range() didn't flush individual pages if the range
   was small.

 - The lazy TLB code was much weaker.  This usually wouldn't matter,
   but, if a kernel thread flushed its lazy "active_mm" more than
   once (due to reclaim or similar), it wouldn't be unlazied and
   would instead pointlessly flush repeatedly.

 - Tracepoints were missing.

Aside from that, simply having the UP code around was a maintenance
burden, since it meant that any change to the TLB flush code had to
make sure not to break it.

Simplify everything by deleting the UP code.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Hugh Dickins: Backported to 3.18]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()
Andy Lutomirski [Mon, 22 May 2017 22:30:01 +0000 (15:30 -0700)]
x86/mm: Reimplement flush_tlb_page() using flush_tlb_mm_range()

commit ca6c99c0794875c6d1db6e22f246699691ab7e6b upstream.

flush_tlb_page() was very similar to flush_tlb_mm_range() except that
it had a couple of issues:

 - It was missing an smp_mb() in the case where
   current->active_mm != mm.  (This is a longstanding bug reported by Nadav Amit)

 - It was missing tracepoints and vm counter updates.

The only reason that I can see for keeping it as a separate
function is that it could avoid a few branches that
flush_tlb_mm_range() needs to decide to flush just one page.  This
hardly seems worthwhile.  If we decide we want to get rid of those
branches again, a better way would be to introduce an
__flush_tlb_mm_range() helper and make both flush_tlb_page() and
flush_tlb_mm_range() use it.
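
The reimplementation itself is a one-line wrapper, per the upstream
commit:

  static inline void flush_tlb_page(struct vm_area_struct *vma,
                                    unsigned long a)
  {
          flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
  }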

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/3cc3847cf888d8907577569b8bac3f01992ef8f9.1495492063.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Make flush_tlb_mm_range() more predictable
Andy Lutomirski [Sat, 22 Apr 2017 07:01:21 +0000 (00:01 -0700)]
x86/mm: Make flush_tlb_mm_range() more predictable

commit ce27374fabf553153c3f53efcaa9bfab9216bd8c upstream.

I'm about to rewrite the function almost completely, but first I
want to get a functional change out of the way.  Currently, if
flush_tlb_mm_range() does not flush the local TLB at all, it will
never do individual page flushes on remote CPUs.  This seems to be
an accident, and preserving it will be awkward.  Let's change it
first so that any regressions in the rewrite will be easier to
bisect and so that the rewrite can attempt to change no visible
behavior at all.

The fix is simple: we can simply avoid short-circuiting the
calculation of base_pages_to_flush.

As a side effect, this also eliminates a potential corner case: if
tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range()
could have ended up flushing the entire address space one page at a
time.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Remove flush_tlb() and flush_tlb_current_task()
Andy Lutomirski [Sat, 22 Apr 2017 07:01:20 +0000 (00:01 -0700)]
x86/mm: Remove flush_tlb() and flush_tlb_current_task()

commit 29961b59a51f8c6838a26a45e871a7ed6771809b upstream.

I was trying to figure out how flush_tlb_current_task() would
possibly work correctly if current->mm != current->active_mm, but I
realized I could spare myself the effort: it has no callers except
the unused flush_tlb() macro.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()
Andy Lutomirski [Sat, 22 Apr 2017 07:01:19 +0000 (00:01 -0700)]
x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()

commit 9ccee2373f0658f234727700e619df097ba57023 upstream.

mark_screen_rdonly() is the last remaining caller of flush_tlb().
flush_tlb_mm_range() is potentially faster and isn't obsolete.

Compile-tested only because I don't know whether software that uses
this mechanism even exists.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/791a644076fc3577ba7f7b7cafd643cc089baa7d.1492844372.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/irq: Do not substract irq_tlb_count from irq_call_count
Aaron Lu [Thu, 11 Aug 2016 07:44:30 +0000 (15:44 +0800)]
x86/irq: Do not substract irq_tlb_count from irq_call_count

commit 82ba4faca1bffad429f15c90c980ffd010366c25 upstream.

Since commit:

  52aec3308db8 ("x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR")

the TLB remote shootdown is done through call function vector. That
commit didn't take care of irq_tlb_count, which a later commit:

  fd0f5869724f ("x86: Distinguish TLB shootdown interrupts from other functions call interrupts")

... tried to fix.

The fix assumes every increase of irq_tlb_count has a corresponding
increase of irq_call_count. So the irq_call_count is always bigger than
irq_tlb_count and we could subtract irq_tlb_count from irq_call_count.

Unfortunately this is not true for the smp_call_function_single() case.
The IPI is only sent if the target CPU's call_single_queue is empty when
adding a csd into it in generic_exec_single. That means if two threads
are both adding flush tlb csds to the same CPU's call_single_queue, only
one IPI is sent. In other words, the irq_call_count is incremented by 1
but irq_tlb_count is incremented by 2. Over time, irq_tlb_count will be
bigger than irq_call_count and the subtraction will produce a very large
irq_call_count value due to overflow.

Considering that:

  1) it's not worth sending more IPIs for the sake of accurate counting of
     irq_call_count in generic_exec_single();

  2) it's not easy to tell if the call function interrupt is for TLB
     shootdown in __smp_call_function_single_interrupt().

Not excluding TLB shootdown from the call function count seems to be the
simplest fix, and this patch just does that.
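
A hedged sketch of the /proc/interrupts path before and after the fix:

  /* before: underflows once irq_tlb_count grows past irq_call_count */
  seq_printf(p, "%10u ",
             irq_stats(j)->irq_call_count - irq_stats(j)->irq_tlb_count);

  /* after: no subtraction, TLB shootdowns stay included in the count */
  seq_printf(p, "%10u ", irq_stats(j)->irq_call_count);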

This bug was found by LKP's cyclic performance regression tracking recently
with the vm-scalability test suite. I have bisected to commit:

  3dec0ba0be6a ("mm/rmap: share the i_mmap_rwsem")

This commit didn't do anything wrong but revealed the irq_call_count
problem. IIUC, the commit makes rwc->remap_one in rmap_walk_file
concurrent with multiple threads.  When remap_one is try_to_unmap_one(),
then multiple threads could queue flush TLB to the same CPU but only
one IPI will be sent.

Since the commit was added in Linux v3.19, the counting problem only
shows up from v3.19 onwards.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Link: http://lkml.kernel.org/r/20160811074430.GA18163@aaronlu.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()
Andy Lutomirski [Fri, 9 Jun 2017 18:49:15 +0000 (11:49 -0700)]
sched/core: Idle_task_exit() shouldn't use switch_mm_irqs_off()

commit 252d2a4117bc181b287eeddf848863788da733ae upstream.

idle_task_exit() can be called on x86 with IRQs on, and therefore
should use switch_mm(), not switch_mm_irqs_off().

This doesn't seem to cause any problems right now, but it will
confuse my upcoming TLB flush changes.  Nonetheless, I think it
should be backported because it's trivial.  There won't be any
meaningful performance impact because idle_task_exit() is only
used when offlining a CPU.
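
A sketch of the function after the change, per the upstream commit:

  void idle_task_exit(void)
  {
          struct mm_struct *mm = current->active_mm;

          BUG_ON(cpu_online(smp_processor_id()));

          if (mm != &init_mm)
                  switch_mm(mm, &init_mm, current);  /* was _irqs_off() */
          mmdrop(mm);
  }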

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/ca3d1a9fa93a0b49f5a8ff729eda3640fb6abdf9.1497034141.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: ARM: Hide finish_arch_post_lock_switch() from modules
Steven Rostedt [Fri, 13 May 2016 13:30:13 +0000 (15:30 +0200)]
ARM: Hide finish_arch_post_lock_switch() from modules

commit ef0491ea17f8019821c7e9c8e801184ecf17f85a upstream.

The introduction of switch_mm_irqs_off() brought back an old bug
regarding the use of preempt_enable_no_resched:

As part of:

  62b94a08da1b ("sched/preempt: Take away preempt_enable_no_resched() from modules")

the definition of preempt_enable_no_resched() is only available in
built-in code, not in loadable modules, so we can't generally use
it from header files.

However, the ARM version of finish_arch_post_lock_switch()
calls preempt_enable_no_resched() and is defined as a static
inline function in asm/mmu_context.h. This in turn means we cannot
include asm/mmu_context.h from modules.

With today's tip tree, asm/mmu_context.h gets included from
linux/mmu_context.h, which is normally the exact pattern one would
expect, but unfortunately, linux/mmu_context.h can be included from
the vhost driver that is a loadable module, now causing this compile
time error with modular configs:

  In file included from ../include/linux/mmu_context.h:4:0,
                   from ../drivers/vhost/vhost.c:18:
  ../arch/arm/include/asm/mmu_context.h: In function 'finish_arch_post_lock_switch':
  ../arch/arm/include/asm/mmu_context.h:88:3: error: implicit declaration of function 'preempt_enable_no_resched' [-Werror=implicit-function-declaration]
     preempt_enable_no_resched();

Andy already tried to fix the bug by including linux/preempt.h
from asm/mmu_context.h, but that didn't help. Arnd suggested reordering
the header files, which wasn't popular, so let's use this
workaround instead:

The finish_arch_post_lock_switch() definition is now also hidden
inside of #ifdef MODULE, so we don't see anything referencing
preempt_enable_no_resched() from a header file. I've built a
few hundred randconfig kernels with this, and did not see any
new problems.

Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux <linux@armlinux.org.uk>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-arm-kernel@lists.infradead.org
Fixes: f98db6013c55 ("sched/core: Add switch_mm_irqs_off() and use it in the scheduler")
Link: http://lkml.kernel.org/r/1463146234-161304-1-git-send-email-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm, sched/core: Turn off IRQs in switch_mm()
Andy Lutomirski [Tue, 26 Apr 2016 16:39:09 +0000 (09:39 -0700)]
x86/mm, sched/core: Turn off IRQs in switch_mm()

commit 078194f8e9fe3cf54c8fd8bded48a1db5bd8eb8a upstream.

Potential races between switch_mm() and TLB-flush or LDT-flush IPIs
could be very messy.  AFAICT the code is currently okay, whether by
accident or by careful design, but enabling PCID will make it
considerably more complicated and will no longer be obviously safe.

Fix it with a big hammer: run switch_mm() with IRQs off.

To avoid a performance hit in the scheduler, we take advantage of
our knowledge that the scheduler already has IRQs disabled when it
calls switch_mm().
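
A sketch of the resulting split: the generic entry point disables IRQs
itself, while the scheduler (already running with IRQs off) calls the
_irqs_off variant directly:

  void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                 struct task_struct *tsk)
  {
          unsigned long flags;

          local_irq_save(flags);
          switch_mm_irqs_off(prev, next, tsk);
          local_irq_restore(flags);
  }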

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f19baf759693c9dcae64bbff76189db77cb13398.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm, sched/core: Uninline switch_mm()
Andy Lutomirski [Tue, 26 Apr 2016 16:39:08 +0000 (09:39 -0700)]
x86/mm, sched/core: Uninline switch_mm()

commit 69c0319aabba45bcf33178916a2f06967b4adede upstream.

It's fairly large and it has quite a few callers.  This may also
help untangle some headers down the road.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/54f3367803e7f80b2be62c8a21879aa74b1a5f57.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: x86/mm: Build arch/x86/mm/tlb.c even on !SMP
Andy Lutomirski [Tue, 26 Apr 2016 16:39:07 +0000 (09:39 -0700)]
x86/mm: Build arch/x86/mm/tlb.c even on !SMP

commit e1074888c326038340a1ada9129d679e661f2ea6 upstream.

Currently all of the functions that live in tlb.c are inlined on
!SMP builds.  One can debate whether this is a good idea (in many
respects the code in tlb.c is better than the inlined UP code).

Regardless, I want to add code that needs to be built on UP and SMP
kernels and relates to tlb flushing, so arrange for tlb.c to be
compiled unconditionally.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f0d778f0d828fc46e5d1946bca80f0aaf9abf032.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: sched/core: Add switch_mm_irqs_off() and use it in the scheduler
Andy Lutomirski [Tue, 26 Apr 2016 16:39:06 +0000 (09:39 -0700)]
sched/core: Add switch_mm_irqs_off() and use it in the scheduler

commit f98db6013c557c216da5038d9c52045be55cd039 upstream.

By default, this is the same thing as switch_mm().

x86 will override it as an optimization.
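
The generic fallback is a simple alias; a sketch of the
include/linux/mmu_context.h hunk:

  /* Architectures that care about IRQ state in switch_mm can override this. */
  #ifndef switch_mm_irqs_off
  # define switch_mm_irqs_off switch_mm
  #endif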

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/df401df47bdd6be3e389c6f1e3f5310d70e81b2c.1461688545.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: drivers/vhost: Fix mmu_context.h assumption
Ben Hutchings [Fri, 5 Jan 2018 17:46:15 +0000 (17:46 +0000)]
drivers/vhost: Fix mmu_context.h assumption

Some architectures (such as Alpha) rely on include/linux/sched.h definitions
in their mmu_context.h files.

So include sched.h before mmu_context.h.

(This doesn't seem to be needed upstream, though a similar problem was
fixed by commit 8efd755ac2fe "mm/mmu_context, sched/core: Fix mmu_context.h
assumption".)

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years ago: mm/mmu_context, sched/core: Fix mmu_context.h assumption
Ingo Molnar [Thu, 28 Apr 2016 09:39:12 +0000 (11:39 +0200)]
mm/mmu_context, sched/core: Fix mmu_context.h assumption

commit 8efd755ac2fe262d4c8d5c9bbe054bb67dae93da upstream.

Some architectures (such as Alpha) rely on include/linux/sched.h definitions
in their mmu_context.h files.

So include sched.h before mmu_context.h.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
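
In both this commit and the vhost one above, the fix itself is only an
include-ordering change; an illustrative sketch (the comments are added here,
not part of the patch):

    #include <linux/sched.h>        /* must come first: some arches' */
    #include <linux/mmu_context.h>  /* mmu_context.h needs sched.h types */
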
6 years agox86/mm: If INVPCID is available, use it to flush global mappings
Andy Lutomirski [Fri, 29 Jan 2016 19:42:59 +0000 (11:42 -0800)]
x86/mm: If INVPCID is available, use it to flush global mappings

commit d8bced79af1db6734f66b42064cc773cada2ce99 upstream.

On my Skylake laptop, INVPCID function 2 (flush absolutely
everything) takes about 376ns, whereas saving flags, twiddling
CR4.PGE to flush global mappings, and restoring flags takes about
539ns.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/ed0ef62581c0ea9c99b9bf6df726015e96d44743.1454096309.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
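
A sketch of the resulting flush logic, following the commit description
(helper names per the upstream patch; the CR4-toggling fallback is elided):

    static inline void __native_flush_tlb_global(void)
    {
            if (static_cpu_has(X86_FEATURE_INVPCID)) {
                    /*
                     * INVPCID function 2 is considerably faster than a
                     * pair of CR4 writes inside an IRQ flag save/restore.
                     */
                    invpcid_flush_all();
                    return;
            }

            /* ... otherwise twiddle CR4.PGE with IRQs disabled ... */
    }
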
6 years agox86/mm: Add a 'noinvpcid' boot option to turn off INVPCID
Andy Lutomirski [Fri, 29 Jan 2016 19:42:58 +0000 (11:42 -0800)]
x86/mm: Add a 'noinvpcid' boot option to turn off INVPCID

commit d12a72b844a49d4162f24cefdab30bed3f86730e upstream.

This adds a chicken bit to turn off INVPCID in case something goes
wrong.  It's an early_param() because we do TLB flushes before we
parse __setup() parameters.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/f586317ed1bc2b87aee652267e515b90051af385.1454096309.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
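
A sketch of the chicken bit as described, an early_param() that clears the
CPU capability (names per the upstream patch):

    static int __init x86_noinvpcid_setup(char *s)
    {
            /* noinvpcid doesn't accept parameters */
            if (s)
                    return -EINVAL;

            /* do not emit a message if the feature is not present */
            if (!boot_cpu_has(X86_FEATURE_INVPCID))
                    return 0;

            setup_clear_cpu_cap(X86_FEATURE_INVPCID);
            pr_info("noinvpcid: INVPCID feature disabled\n");
            return 0;
    }
    early_param("noinvpcid", x86_noinvpcid_setup);
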
6 years agox86/mm: Fix INVPCID asm constraint
Borislav Petkov [Wed, 10 Feb 2016 14:51:16 +0000 (15:51 +0100)]
x86/mm: Fix INVPCID asm constraint

commit e2c7698cd61f11d4077fdb28148b2d31b82ac848 upstream.

So we want to specify the dependency on both @pcid and @addr so that the
compiler doesn't reorder accesses to them *before* the TLB flush. But
for that to work, we need to express this properly in the inline asm and
deref the whole desc array, not the pointer to it. See clwb() for an
example.

This fixes the build error on 32-bit:

  arch/x86/include/asm/tlbflush.h: In function ‘__invpcid’:
  arch/x86/include/asm/tlbflush.h:26:18: error: memory input 0 is not directly addressable

which gcc4.7 caught but 5.x didn't. Which is strange. :-\

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Michael Matz <matz@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
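
A sketch of the fixed helper: the whole descriptor is passed as a
dereferenced "m" input, so the compiler knows its contents are consumed
before the flush, while its address goes in a register for the encoded
memory operand:

    static inline void __invpcid(unsigned long pcid, unsigned long addr,
                                 unsigned long type)
    {
            struct { u64 d[2]; } desc = { { pcid, addr } };

            /*
             * The hex opcode is invpcid (%rcx), %rax in long mode; the
             * memory clobber keeps other accesses from being reordered
             * around the TLB flush.
             */
            asm volatile (".byte 0x66, 0x0f, 0x38, 0x82, 0x01"
                          : : "m" (desc), "a" (type), "c" (&desc) : "memory");
    }
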
6 years agox86/mm: Add INVPCID helpers
Andy Lutomirski [Fri, 29 Jan 2016 19:42:57 +0000 (11:42 -0800)]
x86/mm: Add INVPCID helpers

commit 060a402a1ddb551455ee410de2eadd3349f2801b upstream.

This adds helpers for each of the four currently-specified INVPCID
modes.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/8a62b23ad686888cee01da134c91409e22064db9.1454096309.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
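
The four modes map onto thin wrappers around __invpcid(); a sketch (names
per the upstream patch):

    #define INVPCID_TYPE_INDIV_ADDR       0
    #define INVPCID_TYPE_SINGLE_CTXT      1
    #define INVPCID_TYPE_ALL_INCL_GLOBAL  2
    #define INVPCID_TYPE_ALL_NON_GLOBAL   3

    /* Flush one mapping for a given pcid and addr, not including globals. */
    static inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
    {
            __invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
    }

    /* Flush all mappings for a given pcid, not including globals. */
    static inline void invpcid_flush_single_context(unsigned long pcid)
    {
            __invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
    }

    /* Flush all mappings, including globals, for all PCIDs. */
    static inline void invpcid_flush_all(void)
    {
            __invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
    }

    /* Flush all mappings for all PCIDs, except globals. */
    static inline void invpcid_flush_all_nonglobals(void)
    {
            __invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
    }
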
6 years agox86: Clean up cr4 manipulation
Andy Lutomirski [Fri, 24 Oct 2014 22:58:07 +0000 (15:58 -0700)]
x86: Clean up cr4 manipulation

commit 375074cc736ab1d89a708c0a8d7baa4a70d5d476 upstream.

CR4 manipulation was split, seemingly at random, between direct
(write_cr4) and using a helper (set/clear_in_cr4).  Unfortunately,
the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code,
which only a small subset of users actually wanted.

This patch replaces all cr4 access in functions that don't leave cr4
exactly the way they found it with new helpers cr4_set_bits,
cr4_clear_bits, and cr4_set_bits_and_update_boot.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vince Weaver <vince@deater.net>
Cc: "hillf.zj" <hillf.zj@alibaba-inc.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
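
A sketch of the new helpers: a read-modify-write that skips redundant CR4
writes (cr4_set_bits_and_update_boot additionally records the bit for the
boot/resume path, not shown):

    static inline void cr4_set_bits(unsigned long mask)
    {
            unsigned long cr4 = read_cr4();

            if ((cr4 | mask) != cr4)
                    write_cr4(cr4 | mask);
    }

    static inline void cr4_clear_bits(unsigned long mask)
    {
            unsigned long cr4 = read_cr4();

            if ((cr4 & ~mask) != cr4)
                    write_cr4(cr4 & ~mask);
    }
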
6 years agox86/mm: Fix sparse 'tlb_single_page_flush_ceiling' warning and make the variable...
Jeremiah Mahler [Sat, 9 Aug 2014 07:38:33 +0000 (00:38 -0700)]
x86/mm: Fix sparse 'tlb_single_page_flush_ceiling' warning and make the variable read-mostly

commit 86426851c38d3fe84dee34d7daa71d26c174d409 upstream.

A sparse warning is generated about
'tlb_single_page_flush_ceiling' not being declared.

  arch/x86/mm/tlb.c:177:15: warning: symbol
  'tlb_single_page_flush_ceiling' was not declared. Should it be static?

Since it isn't used anywhere outside this file, fix the warning
by making it static.

Also, optimize the use of this variable by adding the
__read_mostly directive, as suggested by David Rientjes.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jeremiah Mahler <jmmahler@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/r/1407569913-4035-1-git-send-email-jmmahler@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
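
The whole fix is a storage-class change on one declaration, roughly:

    /* was: unsigned long tlb_single_page_flush_ceiling = 33; */
    static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
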
6 years agox86/mm: Set TLB flush tunable to sane value (33)
Dave Hansen [Thu, 31 Jul 2014 15:41:03 +0000 (08:41 -0700)]
x86/mm: Set TLB flush tunable to sane value (33)

commit a5102476a24bce364b74f1110005542a2c964103 upstream.

This has been run through Intel's LKP tests across a wide range
of modern systems and workloads, and it wasn't shown to make a
measurable performance difference positive or negative.

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.  It breaks down like this:

 size   percent  percent<=
  V        V        V
GLOBAL:   2.20%   2.20% avg cycles:  2283
     1:  56.92%  59.12% avg cycles:  1276
     2:  13.78%  72.90% avg cycles:  1505
     3:   8.26%  81.16% avg cycles:  1880
     4:   7.41%  88.58% avg cycles:  2447
     5:   1.73%  90.31% avg cycles:  2358
     6:   1.32%  91.63% avg cycles:  2563
     7:   1.14%  92.77% avg cycles:  2862
     8:   0.62%  93.39% avg cycles:  3542
     9:   0.08%  93.47% avg cycles:  3289
    10:   0.43%  93.90% avg cycles:  3570
    11:   0.20%  94.10% avg cycles:  3767
    12:   0.08%  94.18% avg cycles:  3996
    13:   0.03%  94.20% avg cycles:  4077
    14:   0.02%  94.23% avg cycles:  4836
    15:   0.04%  94.26% avg cycles:  5699
    16:   0.06%  94.32% avg cycles:  5041
    17:   0.57%  94.89% avg cycles:  5473
    18:   0.02%  94.91% avg cycles:  5396
    19:   0.03%  94.95% avg cycles:  5296
    20:   0.02%  94.96% avg cycles:  6749
    21:   0.18%  95.14% avg cycles:  6225
    22:   0.01%  95.15% avg cycles:  6393
    23:   0.01%  95.16% avg cycles:  6861
    24:   0.12%  95.28% avg cycles:  6912
    25:   0.05%  95.32% avg cycles:  7190
    26:   0.01%  95.33% avg cycles:  7793
    27:   0.01%  95.34% avg cycles:  7833
    28:   0.01%  95.35% avg cycles:  8253
    29:   0.08%  95.42% avg cycles:  8024
    30:   0.03%  95.45% avg cycles:  9670
    31:   0.01%  95.46% avg cycles:  8949
    32:   0.01%  95.46% avg cycles:  9350
    33:   3.11%  98.57% avg cycles:  8534
    34:   0.02%  98.60% avg cycles: 10977
    35:   0.02%  98.62% avg cycles: 11400

We get into diminishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBridge.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter in
practice much.

Setting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  Probably counts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not
   by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches.
5. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages).
6. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16.

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

These show that a dtlb miss is 17.1ns (~45 cycles) and an itlb miss is 13.0ns
(~34 cycles).  At those rates, refilling the 512-entry dTLB takes
22,000 cycles.  On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.

cat perf.stat.txt | perl -pe 's/,//g'
| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
/itlb_misses_walk_completed/ { imiss+=$1 }
/dtlb_.*_walk_duration/ { dcyc+=$1 }
/dtlb_.*.*completed/ { dmiss+=$1 }
END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, "   -----    ", icyc,imiss, dcyc,dmiss }'

On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
about 100ns.  Being generous, that is over by a factor of 6 on the
refill side, although it is fairly close on the cost of an invlpg.
An increase of a single invlpg operation seems to lengthen the flush
range operation by about 200 cycles.  Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145

How to generate this table:

echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output in to this script:

http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set to
150 pages.  Only data points with >=50 samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
     1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
     2:  13.78%  72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
     3:   8.26%  81.16% avg cycles:  1880 cycles/page:  626 samples: 90131
     4:   7.41%  88.58% avg cycles:  2447 cycles/page:  611 samples: 80877
     5:   1.73%  90.31% avg cycles:  2358 cycles/page:  471 samples: 18885
     6:   1.32%  91.63% avg cycles:  2563 cycles/page:  427 samples: 14397
     7:   1.14%  92.77% avg cycles:  2862 cycles/page:  408 samples: 12441
     8:   0.62%  93.39% avg cycles:  3542 cycles/page:  442 samples: 6721
     9:   0.08%  93.47% avg cycles:  3289 cycles/page:  365 samples: 917
    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
    12:   0.08%  94.18% avg cycles:  3996 cycles/page:  333 samples: 864
    13:   0.03%  94.20% avg cycles:  4077 cycles/page:  313 samples: 289
    14:   0.02%  94.23% avg cycles:  4836 cycles/page:  345 samples: 236
    15:   0.04%  94.26% avg cycles:  5699 cycles/page:  379 samples: 390
    16:   0.06%  94.32% avg cycles:  5041 cycles/page:  315 samples: 643
    17:   0.57%  94.89% avg cycles:  5473 cycles/page:  321 samples: 6229
    18:   0.02%  94.91% avg cycles:  5396 cycles/page:  299 samples: 224
    19:   0.03%  94.95% avg cycles:  5296 cycles/page:  278 samples: 367
    20:   0.02%  94.96% avg cycles:  6749 cycles/page:  337 samples: 185
    21:   0.18%  95.14% avg cycles:  6225 cycles/page:  296 samples: 1964
    22:   0.01%  95.15% avg cycles:  6393 cycles/page:  290 samples: 83
    23:   0.01%  95.16% avg cycles:  6861 cycles/page:  298 samples: 61
    24:   0.12%  95.28% avg cycles:  6912 cycles/page:  288 samples: 1307
    25:   0.05%  95.32% avg cycles:  7190 cycles/page:  287 samples: 533
    26:   0.01%  95.33% avg cycles:  7793 cycles/page:  299 samples: 94
    27:   0.01%  95.34% avg cycles:  7833 cycles/page:  290 samples: 66
    28:   0.01%  95.35% avg cycles:  8253 cycles/page:  294 samples: 73
    29:   0.08%  95.42% avg cycles:  8024 cycles/page:  276 samples: 846
    30:   0.03%  95.45% avg cycles:  9670 cycles/page:  322 samples: 296
    31:   0.01%  95.46% avg cycles:  8949 cycles/page:  288 samples: 79
    32:   0.01%  95.46% avg cycles:  9350 cycles/page:  292 samples: 60
    33:   3.11%  98.57% avg cycles:  8534 cycles/page:  258 samples: 33936
    34:   0.02%  98.60% avg cycles: 10977 cycles/page:  322 samples: 268
    35:   0.02%  98.62% avg cycles: 11400 cycles/page:  325 samples: 177
    36:   0.01%  98.63% avg cycles: 11504 cycles/page:  319 samples: 161
    37:   0.02%  98.65% avg cycles: 11596 cycles/page:  313 samples: 182
    38:   0.02%  98.66% avg cycles: 11850 cycles/page:  311 samples: 195
    39:   0.01%  98.68% avg cycles: 12158 cycles/page:  311 samples: 128
    40:   0.01%  98.68% avg cycles: 11626 cycles/page:  290 samples: 78
    41:   0.04%  98.73% avg cycles: 11435 cycles/page:  278 samples: 477
    42:   0.01%  98.73% avg cycles: 12571 cycles/page:  299 samples: 74
    43:   0.01%  98.74% avg cycles: 12562 cycles/page:  292 samples: 78
    44:   0.01%  98.75% avg cycles: 12991 cycles/page:  295 samples: 108
    45:   0.01%  98.76% avg cycles: 13169 cycles/page:  292 samples: 78
    46:   0.02%  98.78% avg cycles: 12891 cycles/page:  280 samples: 261
    47:   0.01%  98.79% avg cycles: 13099 cycles/page:  278 samples: 67
    48:   0.01%  98.80% avg cycles: 13851 cycles/page:  288 samples: 77
    49:   0.01%  98.80% avg cycles: 13749 cycles/page:  280 samples: 66
    50:   0.01%  98.81% avg cycles: 13949 cycles/page:  278 samples: 73
    52:   0.00%  98.82% avg cycles: 14243 cycles/page:  273 samples: 52
    54:   0.01%  98.83% avg cycles: 15312 cycles/page:  283 samples: 87
    55:   0.01%  98.84% avg cycles: 15197 cycles/page:  276 samples: 109
    56:   0.02%  98.86% avg cycles: 15234 cycles/page:  272 samples: 208
    57:   0.00%  98.86% avg cycles: 14888 cycles/page:  261 samples: 53
    58:   0.01%  98.87% avg cycles: 15037 cycles/page:  259 samples: 59
    59:   0.01%  98.87% avg cycles: 15752 cycles/page:  266 samples: 63
    62:   0.00%  98.89% avg cycles: 16222 cycles/page:  261 samples: 54
    64:   0.02%  98.91% avg cycles: 17179 cycles/page:  268 samples: 248
    65:   0.12%  99.03% avg cycles: 18762 cycles/page:  288 samples: 1324
    85:   0.00%  99.10% avg cycles: 21649 cycles/page:  254 samples: 50
   127:   0.01%  99.18% avg cycles: 32397 cycles/page:  255 samples: 75
   128:   0.13%  99.31% avg cycles: 31711 cycles/page:  247 samples: 1466
   129:   0.18%  99.49% avg cycles: 33017 cycles/page:  255 samples: 1927
   181:   0.33%  99.84% avg cycles:  2489 cycles/page:   13 samples: 3547
   256:   0.05%  99.91% avg cycles:  2305 cycles/page:    9 samples: 550
   512:   0.03%  99.95% avg cycles:  2133 cycles/page:    4 samples: 304
  1512:   0.01%  99.99% avg cycles:  3038 cycles/page:    2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system.  It's better than IvyBridge, but probably
due to the larger caches since this was one of the 'X' extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

itlb ns/miss:  9.45338  dtlb ns/miss:  12.9716

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agox86/mm: New tunable for single vs full TLB flush
Dave Hansen [Thu, 31 Jul 2014 15:41:01 +0000 (08:41 -0700)]
x86/mm: New tunable for single vs full TLB flush

commit 2d040a1ce903ca5d6e7c983621fb29c6883c4c48 upstream.

Most of the logic here is in the documentation file.  Please take
a look at it.

I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler.  I challenge anyone to describe in one
sentence how the old one worked.  Here's the way the new one
works:

If we are flushing more pages than the ceiling, we use
the full flush, otherwise we use per-page flushes.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
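
The ceiling is exposed through debugfs; assuming the usual x86 debugfs
directory, it can be inspected and tuned much like the tracing knobs shown
above:

    cat /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
    echo 33 > /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
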
6 years agox86/mm: Fix missed global TLB flush stat
Dave Hansen [Thu, 31 Jul 2014 15:40:56 +0000 (08:40 -0700)]
x86/mm: Fix missed global TLB flush stat

commit 9dfa6dee5355f200cf19528ca7c678ef4007cec5 upstream.

If we take the

	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
		local_flush_tlb();
		goto out;
	}

path out of flush_tlb_mm_range(), we will have flushed the tlb,
but not incremented NR_TLB_LOCAL_FLUSH_ALL.  This unifies the
way out of the function so that we always take a single path when
doing a full tlb flush.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
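
A simplified sketch of the unified exit: the early bail-out now shares the
same label, so the counter can no longer be missed:

    if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB)
            goto out;            /* was: local_flush_tlb(); goto out; */
    ...
    out:
            if (base_pages_to_flush == TLB_FLUSH_ALL) {
                    count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
                    local_flush_tlb();
            }
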
6 years agox86/mm: Rip out complicated, out-of-date, buggy TLB flushing
Dave Hansen [Thu, 31 Jul 2014 15:40:55 +0000 (08:40 -0700)]
x86/mm: Rip out complicated, out-of-date, buggy TLB flushing

commit e9f4e0a9fe2723078b7a1a1169828dd46a7b2f9e upstream.

I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:

1. It is obviously buggy.  It uses mm->total_vm to judge the
   task's footprint in the TLB.  It should certainly be using
   some measure of RSS, *NOT* ->total_vm since only resident
   memory can populate the TLB.
2. Haswell and several other CPUs are missing from the
   intel_tlb_flushall_shift_set() function.  Thus, it has been
   demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:
[    0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.037444] tlb_flushall_shift: 6
   Which leads it to never use invlpg.
4. The assumptions about TLB refill costs are wrong:
http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
    (more on this in later patches)
5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
   I believe the sample times were too short.  Running the
   benchmark in a loop yields times that vary quite a bit.

Note that this leaves us with a static ceiling of 1 page.  This
is a conservative, dumb setting, and will be revised in a later
patch.

This also removes the code which attempts to predict whether we
are flushing data or instructions.  We expect instruction flushes
to be relatively rare and not worth tuning for explicitly.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agox86/mm: Clean up the TLB flushing code
Dave Hansen [Thu, 31 Jul 2014 15:40:54 +0000 (08:40 -0700)]
x86/mm: Clean up the TLB flushing code

commit 4995ab9cf512e9a6cc07dfd6b1d4e2fc48ce7fef upstream.

The

if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)

line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels.  This eliminates
one of the copy-n-paste versions.  It also gives us a unified
exit point for each path through this function.  We need this in
a minute for our tracepoint.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoLinux 3.16.52 v3.16.52
Ben Hutchings [Mon, 1 Jan 2018 20:52:14 +0000 (20:52 +0000)]
Linux 3.16.52

6 years agoKEYS: add missing permission check for request_key() destination
Eric Biggers [Fri, 8 Dec 2017 15:13:27 +0000 (15:13 +0000)]
KEYS: add missing permission check for request_key() destination

commit 4dca6ea1d9432052afb06baf2e3ae78188a4410b upstream.

When the request_key() syscall is not passed a destination keyring, it
links the requested key (if constructed) into the "default" request-key
keyring.  This should require Write permission to the keyring.  However,
there is actually no permission check.

This can be abused to add keys to any keyring to which only Search
permission is granted.  This is because Search permission allows joining
the keyring.  keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_SESSION_KEYRING)
then will set the default request-key keyring to the session keyring.
Then, request_key() can be used to add keys to the keyring.

Both negatively and positively instantiated keys can be added using this
method.  Adding negative keys is trivial.  Adding a positive key is a
bit trickier.  It requires that either /sbin/request-key positively
instantiates the key, or that another thread adds the key to the process
keyring at just the right time, such that request_key() misses it
initially but then finds it in construct_alloc_key().

Fix this bug by checking for Write permission to the keyring in
construct_get_dest_keyring() when the default keyring is being used.

We don't do the permission check for non-default keyrings because that
was already done by the earlier call to lookup_user_key().  Also,
request_key_and_link() is currently passed a 'struct key *' rather than
a key_ref_t, so the "possessed" bit is unavailable.

We also don't do the permission check for the "requestor keyring", to
continue to support the use case described by commit 8bbf4976b59f
("KEYS: Alter use of key instantiation link-to-keyring argument") where
/sbin/request-key recursively calls request_key() to add keys to the
original requestor's destination keyring.  (I don't know of any users
who actually do that, though...)

Fixes: 3e30148c3d52 ("[PATCH] Keys: Make request-key create an authorisation key")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
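
A sketch of the added check at the end of construct_get_dest_keyring()
(the permission constant's name varies between kernel versions; treat this
as illustrative):

    /* Require Write permission on the chosen default keyring. */
    ret = key_permission(make_key_ref(dest_keyring, 1), KEY_NEED_WRITE);
    if (ret) {
            key_put(dest_keyring);
            return ERR_PTR(ret);
    }
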
6 years agocrypto: hmac - require that the underlying hash algorithm is unkeyed
Eric Biggers [Wed, 29 Nov 2017 02:01:38 +0000 (18:01 -0800)]
crypto: hmac - require that the underlying hash algorithm is unkeyed

commit af3ff8045bbf3e32f1a448542e73abb4c8ceb6f1 upstream.

Because the HMAC template didn't check that its underlying hash
algorithm is unkeyed, trying to use "hmac(hmac(sha3-512-generic))"
through AF_ALG or through KEYCTL_DH_COMPUTE resulted in the inner HMAC
being used without having been keyed, resulting in sha3_update() being
called without sha3_init(), causing a stack buffer overflow.

This is a very old bug, but it seems to have only started causing real
problems when SHA-3 support was added (requires CONFIG_CRYPTO_SHA3)
because the innermost hash's state is ->import()ed from a zeroed buffer,
and it just so happens that other hash algorithms are fine with that,
but SHA-3 is not.  However, there could be arch or hardware-dependent
hash algorithms also affected; I couldn't test everything.

Fix the bug by introducing a function crypto_shash_alg_has_setkey()
which tests whether a shash algorithm is keyed.  Then update the HMAC
template to require that its underlying hash algorithm is unkeyed.

Here is a reproducer:

    #include <linux/if_alg.h>
    #include <sys/socket.h>

    int main()
    {
        int algfd;
        struct sockaddr_alg addr = {
            .salg_type = "hash",
            .salg_name = "hmac(hmac(sha3-512-generic))",
        };
        char key[4096] = { 0 };

        algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
        bind(algfd, (const struct sockaddr *)&addr, sizeof(addr));
        setsockopt(algfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
    }

Here was the KASAN report from syzbot:

    BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:341 [inline]
    BUG: KASAN: stack-out-of-bounds in sha3_update+0xdf/0x2e0 crypto/sha3_generic.c:161
    Write of size 4096 at addr ffff8801cca07c40 by task syzkaller076574/3044

    CPU: 1 PID: 3044 Comm: syzkaller076574 Not tainted 4.14.0-mm1+ #25
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  Google 01/01/2011
    Call Trace:
      __dump_stack lib/dump_stack.c:17 [inline]
      dump_stack+0x194/0x257 lib/dump_stack.c:53
      print_address_description+0x73/0x250 mm/kasan/report.c:252
      kasan_report_error mm/kasan/report.c:351 [inline]
      kasan_report+0x25b/0x340 mm/kasan/report.c:409
      check_memory_region_inline mm/kasan/kasan.c:260 [inline]
      check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
      memcpy+0x37/0x50 mm/kasan/kasan.c:303
      memcpy include/linux/string.h:341 [inline]
      sha3_update+0xdf/0x2e0 crypto/sha3_generic.c:161
      crypto_shash_update+0xcb/0x220 crypto/shash.c:109
      shash_finup_unaligned+0x2a/0x60 crypto/shash.c:151
      crypto_shash_finup+0xc4/0x120 crypto/shash.c:165
      hmac_finup+0x182/0x330 crypto/hmac.c:152
      crypto_shash_finup+0xc4/0x120 crypto/shash.c:165
      shash_digest_unaligned+0x9e/0xd0 crypto/shash.c:172
      crypto_shash_digest+0xc4/0x120 crypto/shash.c:186
      hmac_setkey+0x36a/0x690 crypto/hmac.c:66
      crypto_shash_setkey+0xad/0x190 crypto/shash.c:64
      shash_async_setkey+0x47/0x60 crypto/shash.c:207
      crypto_ahash_setkey+0xaf/0x180 crypto/ahash.c:200
      hash_setkey+0x40/0x90 crypto/algif_hash.c:446
      alg_setkey crypto/af_alg.c:221 [inline]
      alg_setsockopt+0x2a1/0x350 crypto/af_alg.c:254
      SYSC_setsockopt net/socket.c:1851 [inline]
      SyS_setsockopt+0x189/0x360 net/socket.c:1830
      entry_SYSCALL_64_fastpath+0x1f/0x96

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agocrypto: salsa20 - fix blkcipher_walk API usage
Eric Biggers [Wed, 29 Nov 2017 04:56:59 +0000 (20:56 -0800)]
crypto: salsa20 - fix blkcipher_walk API usage

commit ecaaab5649781c5a0effdaf298a925063020500e upstream.

When asked to encrypt or decrypt 0 bytes, both the generic and x86
implementations of Salsa20 crash in blkcipher_walk_done(), either when
doing 'kfree(walk->buffer)' or 'free_page((unsigned long)walk->page)',
because walk->buffer and walk->page have not been initialized.

The bug is that Salsa20 is calling blkcipher_walk_done() even when
nothing is in 'walk.nbytes'.  But blkcipher_walk_done() is only meant to
be called when a nonzero number of bytes have been provided.

The broken code is part of an optimization that tries to make only one
call to salsa20_encrypt_bytes() to process inputs that are not evenly
divisible by 64 bytes.  To fix the bug, just remove this "optimization"
and use the blkcipher_walk API the same way all the other users do.

Reproducer:

    #include <linux/if_alg.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main()
    {
            int algfd, reqfd;
            struct sockaddr_alg addr = {
                    .salg_type = "skcipher",
                    .salg_name = "salsa20",
            };
            char key[16] = { 0 };

            algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
            bind(algfd, (void *)&addr, sizeof(addr));
            reqfd = accept(algfd, 0, 0);
            setsockopt(algfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
            read(reqfd, key, sizeof(key));
    }

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: eb6f13eb9f81 ("[CRYPTO] salsa20_generic: Fix multi-page processing")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
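
A simplified sketch of the conventional walk loop the driver was converted
to, mirroring other blkcipher users; note that blkcipher_walk_done() is now
only called while walk.nbytes is nonzero:

    err = blkcipher_walk_virt_block(desc, &walk, 64);

    while (walk.nbytes >= 64) {
            salsa20_encrypt_bytes(ctx, walk.dst.virt.addr,
                                  walk.src.virt.addr,
                                  walk.nbytes - (walk.nbytes % 64));
            err = blkcipher_walk_done(desc, &walk, walk.nbytes % 64);
    }

    if (walk.nbytes) {
            /* final partial block */
            salsa20_encrypt_bytes(ctx, walk.dst.virt.addr,
                                  walk.src.virt.addr, walk.nbytes);
            err = blkcipher_walk_done(desc, &walk, 0);
    }
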
6 years agoKVM: Fix stack-out-of-bounds read in write_mmio
Wanpeng Li [Fri, 15 Dec 2017 01:40:50 +0000 (17:40 -0800)]
KVM: Fix stack-out-of-bounds read in write_mmio

commit e39d200fa5bf5b94a0948db0dae44c1b73b84a56 upstream.

Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The patched-vmmcall path patches the 3-byte opcode 0F 01 C1 (vmcall)
into guest memory; however, the write_mmio tracepoint always prints 8 bytes
through *(u64 *)val, since kvm splits the mmio access into 8-byte chunks.  This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[bwh: Backported to 3.16:
 - ARM implementation combines the KVM_TRACE_MMIO_WRITE and
   KVM_TRACE_MMIO_READ_UNSATISFIED cases
 - Adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
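
A sketch of the tracepoint assignment after the fix: only the bytes the
access actually covers are copied into the trace record (per the upstream
patch):

    __entry->val = 0;
    if (val)
            memcpy(&__entry->val, val,
                   min_t(u32, sizeof(__entry->val), len));
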
6 years agoptrace: Properly initialize ptracer_cred on fork
Eric W. Biederman [Mon, 22 May 2017 20:40:12 +0000 (15:40 -0500)]
ptrace: Properly initialize ptracer_cred on fork

commit c70d9d809fdeecedb96972457ee45c49a232d97f upstream.

When I introduced ptracer_cred I failed to consider the weirdness of
fork where the task_struct copies the old value by default.  This
winds up leaving ptracer_cred set even when a process forks and
the child process does not wind up being ptraced.

Because ptracer_cred is not set on non-ptraced processes whose
parents were ptraced, this has broken the ability of the Enlightenment
window manager to start setuid children.

Fix this by properly initializing ptracer_cred in ptrace_init_task.

This must be done with a little bit of care to preserve the current value
of ptracer_cred when ptrace carries through fork.  Re-reading the
ptracer_cred from the ptracing process at this point is inconsistent
with how PT_PTRACE_CAP has been maintained all of these years.

Tested-by: Takashi Iwai <tiwai@suse.de>
Fixes: 64b875f7ac8a ("ptrace: Capture the ptracer's creds not PT_PTRACE_CAP")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoptrace: Don't allow accessing an undumpable mm
Eric W. Biederman [Tue, 22 Nov 2016 18:06:50 +0000 (12:06 -0600)]
ptrace: Don't allow accessing an undumpable mm

commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3 upstream.

It is the reasonable expectation that if an executable file is not
readable there will be no way for a user without special privileges to
read the file.  This is enforced in ptrace_attach but if ptrace
is already attached before exec there is no enforcement for read-only
executables.

As the only way to read such an mm is through access_process_vm,
spin off a variant called ptrace_access_vm that will fail if the
target process is not being ptraced by the current process, or
the current process did not have sufficient privileges when ptracing
began to read the target process's mm.

In the ptrace implementations replace access_process_vm by
ptrace_access_vm.  There remain several ptrace sites that still use
access_process_vm, as they are reading the target executable's
instructions (for kernel consumption) or register stacks.  As such it
does not appear necessary to add a permission check to those calls.

This bug has always existed in Linux.

Fixes: v1.0
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 3.16:
 - Pass around only a write flag, not gup_flags
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoexec: Ensure mm->user_ns contains the execed files
Eric W. Biederman [Thu, 17 Nov 2016 04:06:51 +0000 (22:06 -0600)]
exec: Ensure mm->user_ns contains the execed files

commit f84df2a6f268de584a201e8911384a2d244876e3 upstream.

When the user namespace support was merged the need to prevent
ptrace from revealing the contents of an unreadable executable
was overlooked.

Correct this oversight by ensuring that the executed file
or files are in mm->user_ns, by adjusting mm->user_ns.

Use the new function privileged_wrt_inode_uidgid to see if
the executable is a member of the user namespace, and as such
if having CAP_SYS_PTRACE in the user namespace should allow
tracing the executable.  If not, update mm->user_ns to
the parent user namespace until an appropriate parent is found.

Reported-by: Jann Horn <jann@thejh.net>
Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 3.16:
 - Add #include <linux/user_namespace.h>
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoptrace: Capture the ptracer's creds not PT_PTRACE_CAP
Eric W. Biederman [Tue, 15 Nov 2016 00:48:07 +0000 (18:48 -0600)]
ptrace: Capture the ptracer's creds not PT_PTRACE_CAP

commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream.

When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
overlooked.  This can result in incorrect behavior when an application
like strace traces an exec of a setuid executable.

Further, PT_PTRACE_CAP does not have enough information for making good
security decisions, as it does not report which user namespace the
capability is in.  This has already allowed one mistake through
insufficient granularity.

I found this issue when I was testing another corner case of exec and
discovered that I could not get strace to set PT_PTRACE_CAP even when
running strace as root with a full set of caps.

This change fixes the above issue with strace allowing stracing as
root a setuid executable without disabling setuid.  More fundamentally,
this change allows what is allowable at all times, by using the correct
information in its decision.

Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agomm: Add a user_ns owner to mm_struct and fix ptrace permission checks
Eric W. Biederman [Fri, 14 Oct 2016 02:23:16 +0000 (21:23 -0500)]
mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

During exec dumpable is cleared if the file that is being executed is
not readable by the user executing the file.  A bug in
ptrace_may_access allows reading the file if the executable happens to
enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)).

This problem is fixed with only necessary userspace breakage by adding
a user namespace owner to mm_struct, captured at the time of exec, so
it is clear in which user namespace CAP_SYS_PTRACE must be present in
to be able to safely give read permission to the executable.

The function ptrace_may_access is modified to verify that the ptracer
has CAP_SYS_PTRACE in task->mm->user_ns instead of task->cred->user_ns.
This ensures that if the task changes its cred into a subordinate
user namespace it does not become ptraceable.

The function ptrace_attach is modified to only set PT_PTRACE_CAP when
CAP_SYS_PTRACE is held over task->mm->user_ns.  The intent of
PT_PTRACE_CAP is to be a flag to note that whatever permission changes
the task might go through the tracer has sufficient permissions for
it not to be an issue.  task->cred->user_ns is always the same
as, or a descendant of, mm->user_ns.  This guarantees that having
CAP_SYS_PTRACE over mm->user_ns is the worst case for the task's
credentials.

To prevent regressions mm->dumpable and mm->user_ns are not considered
when a task has no mm.  As simply failing ptrace_may_attach causes
regressions in privileged applications attempting to read things
such as /proc/<pid>/stat.

Acked-by: Kees Cook <keescook@chromium.org>
Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoptrace: change __ptrace_unlink() to clear ->ptrace under ->siglock
Oleg Nesterov [Tue, 22 Mar 2016 21:25:33 +0000 (14:25 -0700)]
ptrace: change __ptrace_unlink() to clear ->ptrace under ->siglock

commit 1333ab03150478df8d6f5673a91df1e50dc6ab97 upstream.

This test-case (a simplified version of one generated by syzkaller)

#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

void test(void)
{
	for (;;) {
		if (fork()) {
			wait(NULL);
			continue;
		}

		ptrace(PTRACE_SEIZE, getppid(), 0, 0);
		ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
		_exit(0);
	}
}

int main(void)
{
	int np;

	for (np = 0; np < 8; ++np)
		if (!fork())
			test();

	while (wait(NULL) > 0)
		;
	return 0;
}

triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap().  The
problem is that __ptrace_unlink() clears task->jobctl under siglock but
task->ptrace is cleared without this lock held; this fools the "else"
branch which assumes that !PT_SEIZED means PT_PTRACED.

Note also that most of other PTRACE_SEIZE checks can race with detach
from the exiting tracer too.  Say, the callers of ptrace_trap_notify()
assume that SEIZED can't go away after it was checked.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agosecurity: let security modules use PTRACE_MODE_* with bitmasks
Jann Horn [Wed, 20 Jan 2016 23:00:01 +0000 (15:00 -0800)]
security: let security modules use PTRACE_MODE_* with bitmasks

commit 3dfb7d8cdbc7ea0c2970450e60818bb3eefbad69 upstream.

It looks like smack and yama weren't aware that the ptrace mode
can have flags ORed into it - PTRACE_MODE_NOAUDIT until now, but
only for /proc/$pid/stat, and with the PTRACE_MODE_*CREDS patch,
all modes have flags ORed into them.

Signed-off-by: Jann Horn <jann@thejh.net>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoKVM: VMX: remove I/O port 0x80 bypass on Intel hosts
Andrew Honig [Fri, 1 Dec 2017 18:21:09 +0000 (10:21 -0800)]
KVM: VMX: remove I/O port 0x80 bypass on Intel hosts

commit d59d51f088014f25c2562de59b9abff4f42a7468 upstream.

This fixes CVE-2017-1000407.

KVM allows guests to directly access I/O port 0x80 on Intel hosts.  If
the guest floods this port with writes it generates exceptions and
instability in the host kernel, leading to a crash.  With this change
guest writes to port 0x80 on Intel will behave the same as they
currently behave on AMD systems.

Prevent the flooding by removing the code that sets port 0x80 as a
passthrough port.  This is essentially the same as upstream patch
99f85a28a78e96d28907fe036e1671a218fee597, except that patch was
for AMD chipsets and this patch is for Intel.

Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Fixes: fdef3ad1b386 ("KVM: VMX: Enable io bitmaps to avoid IO port 0x80 VMEXITs")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agomm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()
Kirill A. Shutemov [Mon, 27 Nov 2017 03:21:25 +0000 (06:21 +0300)]
mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()

commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 upstream.

Currently, we unconditionally make page table dirty in touch_pmd().
It may result in false-positive can_follow_write_pmd().

We may avoid the situation, if we would only make the page table entry
dirty if caller asks for write access -- FOLL_WRITE.

The patch also changes touch_pud() in the same way.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[carnil: backport for 3.16:
 - Adjust context
 - Drop specific part for PUD-sized transparent hugepages. Support
   for PUD-sized transparent hugepages was added in v4.11-rc1
]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoUSB: core: prevent malicious bNumInterfaces overflow
Alan Stern [Tue, 12 Dec 2017 19:25:13 +0000 (14:25 -0500)]
USB: core: prevent malicious bNumInterfaces overflow

commit 48a4ff1c7bb5a32d2e396b03132d20d552c0eca7 upstream.

A malicious USB device with crafted descriptors can cause the kernel
to access unallocated memory by setting the bNumInterfaces value too
high in a configuration descriptor.  Although the value is adjusted
during parsing, this adjustment is skipped in one of the error return
paths.

This patch prevents the problem by setting bNumInterfaces to 0
initially.  The existing code already sets it to the proper value
after parsing is complete.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
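
The fix is a two-line change where parsing starts: remember the descriptor
value, then keep the live field at 0 until parsing completes (a sketch per
the commit message):

    nintf = nintf_orig = config->desc.bNumInterfaces;
    config->desc.bNumInterfaces = 0;    /* restored to the real value later */
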
6 years agonetfilter: xt_osf: Add missing permission checks
Kevin Cernekee [Tue, 5 Dec 2017 23:42:41 +0000 (15:42 -0800)]
netfilter: xt_osf: Add missing permission checks

commit 916a27901de01446bcf57ecca4783f6cff493309 upstream.

The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, xt_osf_fingers is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    vpnns -- nfnl_osf -f /tmp/pf.os

    vpnns -- nfnl_osf -f /tmp/pf.os -d

These non-root operations successfully modify the systemwide OS
fingerprint list.  Add new capable() checks so that they can't.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
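
The added checks are plain capability tests at the top of the add/remove
callbacks; a sketch:

    if (!capable(CAP_NET_ADMIN))
            return -EPERM;
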
6 years agonetlink: Add netns check on taps
Kevin Cernekee [Wed, 6 Dec 2017 20:12:27 +0000 (12:12 -0800)]
netlink: Add netns check on taps

commit 93c647643b48f0131f02e45da3bd367d80443291 upstream.

Currently, a nlmon link inside a child namespace can observe systemwide
netlink activity.  Filter the traffic so that nlmon can only sniff
netlink messages from its own netns.

Test case:

    vpnns -- bash -c "ip link add nlmon0 type nlmon; \
                      ip link set nlmon0 up; \
                      tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
    sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
        spi 0x1 mode transport \
        auth sha1 0x6162633132330000000000000000000000000000 \
        enc aes 0x00000000000000000000000000000000
    grep --binary abc123 /tmp/nlmon.pcap

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agonetfilter: nfnetlink_cthelper: Add missing permission checks
Kevin Cernekee [Sun, 3 Dec 2017 20:12:45 +0000 (12:12 -0800)]
netfilter: nfnetlink_cthelper: Add missing permission checks

commit 4b380c42f7d00a395feede754f0bc2292eebe6e5 upstream.

The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, nfnl_cthelper_list is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    $ nfct helper list
    nfct v1.4.4: netlink error: Operation not permitted
    $ vpnns -- nfct helper list
    {
            .name = ftp,
            .queuenum = 0,
            .l3protonum = 2,
            .l4protonum = 6,
            .priv_data_len = 24,
            .status = enabled,
    };

Add capable() checks in nfnetlink_cthelper, as this is cleaner than
trying to generalize the solution.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoInput: ims-pcu - check if CDC union descriptor is sane
Dmitry Torokhov [Sat, 7 Oct 2017 18:07:47 +0000 (11:07 -0700)]
Input: ims-pcu - check if CDC union descriptor is sane

commit ea04efee7635c9120d015dcdeeeb6988130cb67a upstream.

Before trying to use the CDC union descriptor, try to validate that it
is sane by checking that intf->altsetting->extra is big enough and that
descriptor bLength is not too big and not too small.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoBluetooth: bnep: bnep_add_connection() should verify that it's dealing with l2cap...
Al Viro [Fri, 19 Dec 2014 06:20:59 +0000 (06:20 +0000)]
Bluetooth: bnep: bnep_add_connection() should verify that it's dealing with l2cap socket

commit 71bb99a02b32b4cc4265118e85f6035ca72923f0 upstream.

same story as cmtp

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoBluetooth: cmtp: cmtp_add_connection() should verify that it's dealing with l2cap...
Al Viro [Fri, 19 Dec 2014 06:20:58 +0000 (06:20 +0000)]
Bluetooth: cmtp: cmtp_add_connection() should verify that it's dealing with l2cap socket

commit 96c26653ce65bf84f3212f8b00d4316c1efcbf4c upstream.

... rather than relying on ciptool(8) never passing it anything else.  Give
it e.g. an AF_UNIX connected socket (from socketpair(2)) and it'll oops,
trying to evaluate &l2cap_pi(sock->sk)->chan->dst...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agodccp: CVE-2017-8824: use-after-free in DCCP code
Mohamed Ghannam [Tue, 5 Dec 2017 20:58:35 +0000 (20:58 +0000)]
dccp: CVE-2017-8824: use-after-free in DCCP code

commit 69c64866ce072dea1d1e59a0d61e0f66c0dffb76 upstream.

Whenever the sock object is in DCCP_CLOSED state,
dccp_disconnect() must free dccps_hc_tx_ccid and
dccps_hc_rx_ccid and set them to NULL.

Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
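
A sketch of the added cleanup in dccp_disconnect(), per the commit
description:

    ccid_hc_rx_delete(dp->dccps_hc_rx_ccid, sk);
    ccid_hc_tx_delete(dp->dccps_hc_tx_ccid, sk);
    dp->dccps_hc_rx_ccid = NULL;
    dp->dccps_hc_tx_ccid = NULL;
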
6 years agosched/topology: Optimize build_group_mask()
Lauro Ramos Venancio [Thu, 20 Apr 2017 19:51:40 +0000 (16:51 -0300)]
sched/topology: Optimize build_group_mask()

commit f32d782e31bf079f600dcec126ed117b0577e85c upstream.

The group mask is always used in intersection with the group CPUs. So,
when building the group mask, we don't have to care about CPUs that are
not part of the group.

Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: lwang@redhat.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16:
 - Update another reference to 'span' introduced by an earlier backport of
   sched/topology changes
 - Adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agosched/topology: Simplify build_overlap_sched_groups()
Peter Zijlstra [Fri, 14 Apr 2017 15:32:07 +0000 (17:32 +0200)]
sched/topology: Simplify build_overlap_sched_groups()

commit 91eaed0d61319f58a9f8e43d41a8cbb069b4f73d upstream.

Now that the first group will always be the previous domain of this
@cpu this can be simplified.

In fact, writing the code now removed should've been a big clue I was
doing it wrong :/

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: adjust filename, context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agosched/topology: Remove FORCE_SD_OVERLAP
Peter Zijlstra [Wed, 26 Apr 2017 15:36:41 +0000 (17:36 +0200)]
sched/topology: Remove FORCE_SD_OVERLAP

commit af85596c74de2fd9abb87501ae280038ac28a3f4 upstream.

It's an obsolete debug mechanism, and future code wants to rely on
properties this undermines.

Namely, it would be good to assume that SD_OVERLAP domains have
children, but if we build the entire hierarchy with SD_OVERLAP this is
obviously false.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agovlan: fix a use-after-free in vlan_device_event()
Cong Wang [Fri, 10 Nov 2017 00:43:13 +0000 (16:43 -0800)]
vlan: fix a use-after-free in vlan_device_event()

commit 052d41c01b3a2e3371d66de569717353af489d63 upstream.

After refcnt reaches zero, vlan_vid_del() could free
dev->vlan_info via RCU:

RCU_INIT_POINTER(dev->vlan_info, NULL);
call_rcu(&vlan_info->rcu, vlan_info_rcu_free);

However, the pointer 'grp' still points to that memory
since it is set before vlan_vid_del():

        vlan_info = rtnl_dereference(dev->vlan_info);
        if (!vlan_info)
                goto out;
        grp = &vlan_info->grp;

Depending on when that RCU callback is scheduled, we could
trigger a use-after-free in vlan_group_for_each_dev()
right after this vlan_vid_del().

Fix it by moving vlan_vid_del() before setting grp. This
is also symmetric to the vlan_vid_add() we call in
vlan_device_event().
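
A sketch of the reordering in vlan_device_event() (constants as in
mainline; the 3.16 backport context may differ):

        /* moved up: must run before dev->vlan_info is dereferenced,
         * because the last vlan_vid_del() may free it via RCU */
        if (event == NETDEV_DOWN &&
            (dev->features & NETIF_F_HW_VLAN_CTAG_FILTER))
                vlan_vid_del(dev, htons(ETH_P_8021Q), 0);

        vlan_info = rtnl_dereference(dev->vlan_info);
        if (!vlan_info)
                goto out;
        grp = &vlan_info->grp;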

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct")
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agocan: c_can: don't indicate triple sampling support for D_CAN
Richard Schütz [Sun, 29 Oct 2017 12:03:22 +0000 (13:03 +0100)]
can: c_can: don't indicate triple sampling support for D_CAN

commit fb5f0b3ef69b95e665e4bbe8a3de7201f09f1071 upstream.

The D_CAN controller doesn't provide a triple sampling mode, so don't set
the CAN_CTRLMODE_3_SAMPLES flag in ctrlmode_supported. Currently enabling
triple sampling is a no-op.
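
In the probe path this amounts to advertising the flag for C_CAN only
(a sketch; the enum names are assumptions based on the driver):

        switch (dev_id) {
        case BOSCH_C_CAN:
                priv->can.ctrlmode_supported |= CAN_CTRLMODE_3_SAMPLES;
                break;
        case BOSCH_D_CAN:
                /* no triple-sampling mode, so don't advertise it */
                break;
        }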

Signed-off-by: Richard Schütz <rschuetz@uni-koblenz.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agorbd: use GFP_NOIO for parent stat and data requests
Ilya Dryomov [Mon, 6 Nov 2017 10:33:36 +0000 (11:33 +0100)]
rbd: use GFP_NOIO for parent stat and data requests

commit 1e37f2f84680fa7f8394fd444b6928e334495ccc upstream.

rbd_img_obj_exists_submit() and rbd_img_obj_parent_read_full() are on
the writeback path for cloned images -- we attempt a stat on the parent
object to see if it exists and potentially read it in to call copyup.
GFP_NOIO should be used instead of GFP_KERNEL here.
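
The change itself is a gfp-flag swap on the allocations involved (a
sketch; ceph_alloc_page_vector() is the page-vector helper rbd uses):

        /* GFP_KERNEL may recurse into reclaim and issue further rbd
         * I/O, deadlocking the writeback path; GFP_NOIO cannot */
        pages = ceph_alloc_page_vector(num_pages, GFP_NOIO);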

Link: http://tracker.ceph.com/issues/22014
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoMIPS: AR7: Ensure that serial ports are properly set up
Oswald Buddenhagen [Sun, 29 Oct 2017 15:27:20 +0000 (16:27 +0100)]
MIPS: AR7: Ensure that serial ports are properly set up

commit b084116f8587b222a2c5ef6dcd846f40f24b9420 upstream.

Without UPF_FIXED_TYPE, the data from the PORT_AR7 uart_config entry is
never copied, resulting in a dead port.
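
A sketch of the port setup with the missing flag (the field layout is
illustrative, not the exact platform code):

        struct uart_port port = {
                .type   = PORT_AR7,
                /* UPF_FIXED_TYPE makes the core copy the PORT_AR7
                 * uart_config data instead of probing the type */
                .flags  = UPF_BOOT_AUTOCONF | UPF_FIXED_TYPE,
        };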

Fixes: 154615d55459 ("MIPS: AR7: Use correct UART port type")
Signed-off-by: Oswald Buddenhagen <oswald.buddenhagen@gmx.de>
[jonas.gorski: add Fixes tag]
Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Nicolas Schichan <nschichan@freebox.fr>
Cc: Oswald Buddenhagen <oswald.buddenhagen@gmx.de>
Cc: linux-mips@linux-mips.org
Cc: linux-serial@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/17543/
Signed-off-by: James Hogan <jhogan@kernel.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoKEYS: fix NULL pointer dereference during ASN.1 parsing [ver #2]
Eric Biggers [Tue, 7 Nov 2017 22:29:02 +0000 (22:29 +0000)]
KEYS: fix NULL pointer dereference during ASN.1 parsing [ver #2]

commit 624f5ab8720b3371367327a822c267699c1823b8 upstream.

syzkaller reported a NULL pointer dereference in asn1_ber_decoder().  It
can be reproduced by the following command, assuming
CONFIG_PKCS7_TEST_KEY=y:

        keyctl add pkcs7_test desc '' @s

The bug is that if the data buffer is empty, an integer underflow occurs
in the following check:

        if (unlikely(dp >= datalen - 1))
                goto data_overrun_error;

This results in the NULL data pointer being dereferenced.

Fix it by checking for 'datalen - dp < 2' instead.

Also fix the similar check for 'dp >= datalen - n' later in the same
function.  That one could possibly result in a buffer overread.

The NULL pointer dereference was reproducible using the "pkcs7_test" key
type but not the "asymmetric" key type because the "asymmetric" key type
checks for a 0-length payload before calling into the ASN.1 decoder but
the "pkcs7_test" key type does not.

The bug report was:

    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233
    PGD 7b708067 P4D 7b708067 PUD 7b6ee067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 522 Comm: syz-executor1 Not tainted 4.14.0-rc8 #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.3-20171021_125229-anatol 04/01/2014
    task: ffff9b6b3798c040 task.stack: ffff9b6b37970000
    RIP: 0010:asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233
    RSP: 0018:ffff9b6b37973c78 EFLAGS: 00010216
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000021c
    RDX: ffffffff814a04ed RSI: ffffb1524066e000 RDI: ffffffff910759e0
    RBP: ffff9b6b37973d60 R08: 0000000000000001 R09: ffff9b6b3caa4180
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    FS:  00007f10ed1f2700(0000) GS:ffff9b6b3ea00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000007b6f3000 CR4: 00000000000006f0
    Call Trace:
     pkcs7_parse_message+0xee/0x240 crypto/asymmetric_keys/pkcs7_parser.c:139
     verify_pkcs7_signature+0x33/0x180 certs/system_keyring.c:216
     pkcs7_preparse+0x41/0x70 crypto/asymmetric_keys/pkcs7_key_type.c:63
     key_create_or_update+0x180/0x530 security/keys/key.c:855
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0xbf/0x250 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x4585c9
    RSP: 002b:00007f10ed1f1bd8 EFLAGS: 00000216 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 00007f10ed1f2700 RCX: 00000000004585c9
    RDX: 0000000020000000 RSI: 0000000020008ffb RDI: 0000000020008000
    RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000216 R12: 00007fff1b2260ae
    R13: 00007fff1b2260af R14: 00007f10ed1f2700 R15: 0000000000000000
    Code: dd ca ff 48 8b 45 88 48 83 e8 01 4c 39 f0 0f 86 a8 07 00 00 e8 53 dd ca ff 49 8d 46 01 48 89 85 58 ff ff ff 48 8b 85 60 ff ff ff <42> 0f b6 0c 30 89 c8 88 8d 75 ff ff ff 83 e0 1f 89 8d 28 ff ff
    RIP: asn1_ber_decoder+0x17f/0xe60 lib/asn1_decoder.c:233 RSP: ffff9b6b37973c78
    CR2: 0000000000000000

Fixes: 42d5ec27f873 ("X.509: Add an ASN.1 decoder")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agox86/oprofile/ppro: Do not use __this_cpu*() in preemptible context
Borislav Petkov [Tue, 7 Nov 2017 17:53:07 +0000 (18:53 +0100)]
x86/oprofile/ppro: Do not use __this_cpu*() in preemptible context

commit a743bbeef27b9176987ec0cb7f906ab0ab52d1da upstream.

The warning below says it all:

  BUG: using __this_cpu_read() in preemptible [00000000] code: swapper/0/1
  caller is __this_cpu_preempt_check
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc8 #4
  Call Trace:
   dump_stack
   check_preemption_disabled
   ? do_early_param
   __this_cpu_preempt_check
   arch_perfmon_init
   op_nmi_init
   ? alloc_pci_root_info
   oprofile_arch_init
   oprofile_init
   do_one_initcall
   ...

These accessors should not have been used in the first place: this is a PPro,
so there are no mixed silicon revisions, and it can simply use boot_cpu_data.
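
That is, roughly (the model check is illustrative of the pattern, not
the exact condition):

        /* before: only legal with preemption disabled */
        model = __this_cpu_read(cpu_info.x86_model);

        /* after: every CPU is the same PPro, boot_cpu_data suffices */
        model = boot_cpu_data.x86_model;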

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Fix-creation-mandated-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Robert Richter <rric@kernel.org>
Cc: x86@kernel.org
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoALSA: seq: Fix OSS sysex delivery in OSS emulation
Takashi Iwai [Tue, 7 Nov 2017 15:05:24 +0000 (16:05 +0100)]
ALSA: seq: Fix OSS sysex delivery in OSS emulation

commit 132d358b183ac6ad8b3fea32ad5e0663456d18d1 upstream.

The SYSEX event delivery in the OSS sequencer emulation assumed that
the event is encoded in variable-length data with straight (flat)
buffering.  This was the normal behavior in the past, but during
development, chained buffers were introduced for carrying more data,
while the OSS code was left intact.  As a result, when a SYSEX event
with chained buffer data is passed to an OSS sequencer port, it may
end up with a wrong memory access, as if it had a too-large buffer.

This patch addresses the bug by expanding the buffer data with the
generic snd_seq_dump_var_event() helper function.
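
In outline: rather than treating the event data as one flat buffer, let
the helper walk the possibly-chained cells and hand over each chunk
(the state struct and its fields are assumptions for illustration):

        static int oss_sysex_dump(void *arg, void *buf, int count)
        {
                struct sysex_copy_state *st = arg;

                if (st->len + count > st->maxlen)
                        return -E2BIG;
                memcpy(st->buf + st->len, buf, count);
                st->len += count;
                return 0;
        }

        /* expands chained cells correctly, unlike a flat memcpy */
        err = snd_seq_dump_var_event(ev, oss_sysex_dump, &st);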

Reported-by: syzbot <syzkaller@googlegroups.com>
Reported-by: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoALSA: seq: Avoid invalid lockdep class warning
Takashi Iwai [Mon, 6 Nov 2017 19:16:50 +0000 (20:16 +0100)]
ALSA: seq: Avoid invalid lockdep class warning

commit 3510c7aa069aa83a2de6dab2b41401a198317bdc upstream.

The recent fix for adding rwsem nesting annotation was using the given
"hop" argument as the lock subclass key.  Although the idea itself
works, it may trigger a kernel warning like:
  BUG: looking up invalid subclass: 8
  ....
since lockdep has a smaller number of subclasses (8) than we
currently allow for the hops there (10).

The current definition is merely a sanity check for avoiding overly
deep delivery paths, and 8 hops are already enough.  So, as a
quick fix, just make the max hops match the max lockdep
subclasses.
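
In effect (a sketch of the resulting definition):

        /* keep delivery depth within lockdep's subclass limit */
        #define SNDRV_SEQ_MAX_HOPS      8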

Fixes: 1f20f9ff57ca ("ALSA: seq: Fix nested rwsem annotation for lockdep splat")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoARM: 8720/1: ensure dump_instr() checks addr_limit
Mark Rutland [Thu, 2 Nov 2017 17:44:28 +0000 (18:44 +0100)]
ARM: 8720/1: ensure dump_instr() checks addr_limit

commit b9dd05c7002ee0ca8b676428b2268c26399b5e31 upstream.

When CONFIG_DEBUG_USER is enabled, it's possible for a user to
deliberately trigger dump_instr() with a chosen kernel address.

Let's avoid problems resulting from this by using get_user() rather than
__get_user(), ensuring that we don't erroneously access kernel memory.

So that we can use the same code to dump user instructions and kernel
instructions, the common dumping code is factored out to __dump_instr(),
with the fs manipulated appropriately in dump_instr() around calls to
this.
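
The resulting structure (a sketch; __dump_instr() now uses get_user(),
which honours addr_limit):

        static void dump_instr(const char *lvl, struct pt_regs *regs)
        {
                mm_segment_t fs;

                if (!user_mode(regs)) {
                        /* dumping kernel text: widen the limit */
                        fs = get_fs();
                        set_fs(KERNEL_DS);
                        __dump_instr(lvl, regs);
                        set_fs(fs);
                } else {
                        __dump_instr(lvl, regs);
                }
        }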

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoALSA: timer: Limit max instances per timer
Takashi Iwai [Sun, 5 Nov 2017 09:07:43 +0000 (10:07 +0100)]
ALSA: timer: Limit max instances per timer

commit 9b7d869ee5a77ed4a462372bb89af622e705bfb8 upstream.

Currently we allow an unlimited number of timer instances, and it may
leave the system hogging way too much CPU when too many timer
instances are opened and processed concurrently.  This may end up with
a soft-lockup report as triggered by syzkaller, especially when the
hrtimer backend is deployed.

Since such an insane number of instances isn't demanded by the normal
use case of the ALSA sequencer and merely opens a risk of abuse,
this patch introduces an upper limit for the number of instances per
timer backend.  By default, it's set to 1000, but for a fine-grained
timer like hrtimer, it's set to 100.
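
A sketch of the mechanism (field and constant names are assumptions
based on the description):

        #define SNDRV_TIMER_DEFAULT_MAX_INSTANCES       1000
        /* the hrtimer backend lowers timer->max_instances to 100 */

        /* in snd_timer_open(): */
        if (timer->num_instances >= timer->max_instances)
                return -EBUSY;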

Reported-by: syzbot
Tested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoALSA: timer: Protect the whole snd_timer_close() with open race
Takashi Iwai [Wed, 10 Feb 2016 10:53:30 +0000 (11:53 +0100)]
ALSA: timer: Protect the whole snd_timer_close() with open race

commit 9984d1b5835ca29fc7025186a891ee7398d21cc7 upstream.

In order to make open/close more robust, widen the register_mutex
protection over the whole snd_timer_close() function.  Also, the close
procedure is slightly shuffled into a safer order, along with a bit of
code refactoring.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agol2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6
Guillaume Nault [Fri, 3 Nov 2017 15:49:00 +0000 (16:49 +0100)]
l2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6

commit 8f7dc9ae4a7aece9fbc3e6637bdfa38b36bcdf09 upstream.

Using l2tp_tunnel_find() in l2tp_ip_recv() is wrong for two reasons:

  * It doesn't take a reference on the returned tunnel, which makes the
    call racy wrt. concurrent tunnel deletion.

  * The lookup is only based on the tunnel identifier, so it can return
    a tunnel that doesn't match the packet's addresses or protocol.

For example, a packet sent to an L2TPv3 over IPv6 tunnel can be
delivered to an L2TPv2 over UDPv4 tunnel. This is worse than simple
cross-talk: when delivering the packet to an L2TP over UDP tunnel, the
corresponding socket is UDP, where ->sk_backlog_rcv() is NULL. Calling
sk_receive_skb() will then crash the kernel by trying to execute this
callback.

And l2tp_tunnel_find() isn't even needed here. __l2tp_ip_bind_lookup()
properly checks the socket binding and connection settings. It was used
as a fallback mechanism for finding tunnels that didn't have their data
path registered yet. But it's not limited to this case and can be used
to replace l2tp_tunnel_find() in the general case.

Fix l2tp_ip6 in the same way.

Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16:
 - In l2tp_ip6.c, always look up in init_net
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agol2tp: hold tunnel socket when handling control frames in l2tp_ip and l2tp_ip6
Guillaume Nault [Wed, 29 Mar 2017 06:44:59 +0000 (08:44 +0200)]
l2tp: hold tunnel socket when handling control frames in l2tp_ip and l2tp_ip6

commit 94d7ee0baa8b764cf64ad91ed69464c1a6a0066b upstream.

The code following l2tp_tunnel_find() expects that a new reference is
held on sk. Either sk_receive_skb() or the discard_put error path will
drop a reference from the tunnel's socket.

This issue exists in both l2tp_ip and l2tp_ip6.

Fixes: a3c18422a4b4 ("l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agol2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()
Guillaume Nault [Tue, 29 Nov 2016 12:09:45 +0000 (13:09 +0100)]
l2tp: hold socket before dropping lock in l2tp_ip{, 6}_recv()

commit a3c18422a4b4e108bcf6a2328f48867e1003fd95 upstream.

Socket must be held while under the protection of the l2tp lock; there
is no guarantee that sk remains valid after the read_unlock_bh() call.

Same issue for l2tp_ip and l2tp_ip6.
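
For l2tp_ip_recv() this looks roughly like (l2tp_ip6 follows the same
pattern; lookup arguments abbreviated):

        read_lock_bh(&l2tp_ip_lock);
        sk = __l2tp_ip_bind_lookup(net, iph->daddr, 0, tunnel_id);
        if (!sk) {
                read_unlock_bh(&l2tp_ip_lock);
                goto discard;
        }
        /* take the reference while sk is still guaranteed alive */
        sock_hold(sk);
        read_unlock_bh(&l2tp_ip_lock);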

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agonetfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
Ye Yin [Thu, 26 Oct 2017 08:57:05 +0000 (16:57 +0800)]
netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed

commit 2b5ec1a5f9738ee7bf8f5ec0526e75e00362c48f upstream.

When running ipvs in two different network namespaces on the same host,
with one ipvs forwarding network traffic to the ipvs in the other
namespace, the 'ipvs_property' flag will make the second ipvs have no
effect. So we should clear 'ipvs_property' when the SKB's network
namespace changes.
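
The fix adds a reset on the cross-netns path of skb_scrub_packet() (a
sketch of the tail of that function; ipvs_reset() clears
skb->ipvs_property under CONFIG_IP_VS):

        if (!xnet)
                return;

        ipvs_reset(skb);        /* new: clear the ipvs_property flag */
        skb_orphan(skb);
        skb->mark = 0;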

Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()")
Signed-off-by: Ye Yin <hustcat@gmail.com>
Signed-off-by: Wei Zhou <chouryzhou@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoocfs2: fstrim: Fix start offset of first cluster group during fstrim
Ashish Samant [Thu, 2 Nov 2017 22:59:37 +0000 (15:59 -0700)]
ocfs2: fstrim: Fix start offset of first cluster group during fstrim

commit 105ddc93f06ebe3e553f58563d11ed63dbcd59f0 upstream.

The first cluster group descriptor is not stored at the start of the
group but at an offset from the start.  We need to take this into
account while doing fstrim on the first cluster group.  Otherwise we
will wrongly start fstrim a few blocks after the desired start block and
the range can cross over into the next cluster group and zero out the
group descriptor there.  This can cause filesystem corruption that cannot
be fixed by fsck.

Link: http://lkml.kernel.org/r/1507835579-7308-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoarm64: ensure __dump_instr() checks addr_limit
Mark Rutland [Thu, 2 Nov 2017 16:12:03 +0000 (16:12 +0000)]
arm64: ensure __dump_instr() checks addr_limit

commit 7a7003b1da010d2b0d1dc8bf21c10f5c73b389f1 upstream.

It's possible for a user to deliberately trigger __dump_instr with a
chosen kernel address.

Let's avoid problems resulting from this by using get_user() rather than
__get_user(), ensuring that we don't erroneously access kernel memory.

Where we use __dump_instr() on kernel text, we already switch to
KERNEL_DS, so this shouldn't adversely affect those cases.

Fixes: 60ffc30d5652810d ("arm64: Exception handling")
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoarm64: fix dump_instr when PAN and UAO are in use
Mark Rutland [Mon, 13 Jun 2016 10:15:14 +0000 (11:15 +0100)]
arm64: fix dump_instr when PAN and UAO are in use

commit c5cea06be060f38e5400d796e61cfc8c36e52924 upstream.

If the kernel is set to show unhandled signals, and a user task does not
handle a SIGILL as a result of an instruction abort, we will attempt to
log the offending instruction with dump_instr before killing the task.

We use dump_instr to log the encoding of the offending userspace
instruction. However, dump_instr is also used to dump instructions from
kernel space, and internally always switches to KERNEL_DS before dumping
the instruction with get_user. When both PAN and UAO are in use, reading
a user instruction via get_user while in KERNEL_DS will result in a
permission fault, which leads to an Oops.

As we have regs corresponding to the context of the original instruction
abort, we can inspect this and only flip to KERNEL_DS if the original
abort was taken from the kernel, avoiding this issue. At the same time,
remove the redundant (and incorrect) comments regarding the order
dump_mem and dump_instr are called in.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Fixes: 57f4959bad0a154a ("arm64: kernel: Add support for User Access Override")
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
6 years agoKEYS: fix out-of-bounds read during ASN.1 parsing
Eric Biggers [Thu, 2 Nov 2017 00:47:19 +0000 (00:47 +0000)]
KEYS: fix out-of-bounds read during ASN.1 parsing

commit 2eb9eabf1e868fda15808954fb29b0f105ed65f1 upstream.

syzkaller with KASAN reported an out-of-bounds read in
asn1_ber_decoder().  It can be reproduced by the following command,
assuming CONFIG_X509_CERTIFICATE_PARSER=y and CONFIG_KASAN=y:

    keyctl add asymmetric desc $'\x30\x30' @s

The bug is that the length of an ASN.1 data value isn't validated in the
case where it is encoded using the short form, causing the decoder to
read past the end of the input buffer.  Fix it by validating the length.
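
The added check, roughly (dp and datalen as in the decoder; for the
short form, the single octet is the payload length itself):

        /* the short-form length must still fit in the remaining data */
        if (unlikely(len > datalen - dp))
                goto data_overrun_error;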

The bug report was:

    BUG: KASAN: slab-out-of-bounds in asn1_ber_decoder+0x10cb/0x1730 lib/asn1_decoder.c:233
    Read of size 1 at addr ffff88003cccfa02 by task syz-executor0/6818

    CPU: 1 PID: 6818 Comm: syz-executor0 Not tainted 4.14.0-rc7-00008-g5f479447d983 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0xb3/0x10b lib/dump_stack.c:52
     print_address_description+0x79/0x2a0 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x236/0x340 mm/kasan/report.c:409
     __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
     asn1_ber_decoder+0x10cb/0x1730 lib/asn1_decoder.c:233
     x509_cert_parse+0x1db/0x650 crypto/asymmetric_keys/x509_cert_parser.c:89
     x509_key_preparse+0x64/0x7a0 crypto/asymmetric_keys/x509_public_key.c:174
     asymmetric_key_preparse+0xcb/0x1a0 crypto/asymmetric_keys/asymmetric_type.c:388
     key_create_or_update+0x347/0xb20 security/keys/key.c:855
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0x1cd/0x340 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x447c89
    RSP: 002b:00007fca7a5d3bd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 00007fca7a5d46cc RCX: 0000000000447c89
    RDX: 0000000020006f4a RSI: 0000000020006000 RDI: 0000000020001ff5
    RBP: 0000000000000046 R08: fffffffffffffffd R09: 0000000000000000
    R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 00007fca7a5d49c0 R15: 00007fca7a5d4700

Fixes: 42d5ec27f873 ("X.509: Add an ASN.1 decoder")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>