linux-stable
9 years agoLinux 3.14.21 v3.14.21
Greg Kroah-Hartman [Thu, 9 Oct 2014 19:21:39 +0000 (12:21 -0700)]
Linux 3.14.21

9 years agomm: don't pointlessly use BUG_ON() for sanity check
Linus Torvalds [Mon, 28 Apr 2014 21:24:09 +0000 (14:24 -0700)]
mm: don't pointlessly use BUG_ON() for sanity check

commit 50f5aa8a9b248fa4262cf379863ec9a531b49737 upstream.

BUG_ON() is a big hammer, and should be used _only_ if there is some
major corruption that you cannot possibly recover from, making it
imperative that the current process (and possibly the whole machine) be
terminated with extreme prejudice.

The trivial sanity check in the vmacache code is *not* such a fatal
error.  Recovering from it is absolutely trivial, and using BUG_ON()
just makes it harder to debug for no actual advantage.

To make matters worse, the placement of the BUG_ON() (only if the range
check matched) actually makes it harder to hit the sanity check to begin
with, so _if_ there is a bug (and we just got a report from Srivatsa
Bhat that this can indeed trigger), it is harder to debug not just
because the machine is possibly dead, but because we don't have better
coverage.

BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
because it is simply just about the worst thing you can ever do if you
hit some "this cannot happen" situation.
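
As a contrast, a recoverable sanity check can warn once and fall back
instead of killing the process.  A hypothetical sketch (not the actual
patch; the function below is made up for illustration):

  struct vm_area_struct *cache_lookup(struct mm_struct *mm, unsigned long addr)
  {
          /* "This cannot happen" - but if it somehow does, warn once and
           * simply miss the cache; the caller falls back to the rbtree walk. */
          if (WARN_ON_ONCE(mm != current->mm))
                  return NULL;

          /* ... normal cache lookup ... */
          return NULL;
  }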

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: per-thread vma caching
Davidlohr Bueso [Mon, 7 Apr 2014 22:37:25 +0000 (15:37 -0700)]
mm: per-thread vma caching

commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

This patch is a continuation of efforts trying to optimize find_vma(),
avoiding potentially expensive rbtree walks to locate a vma upon faults.
The original approach (https://lkml.org/lkml/2013/11/1/410), where the
largest vma was also cached, ended up being too specific and random, so
further comparison with other approaches was needed.  There are two
things to consider when dealing with this: the cache hit rate and the
latency of find_vma().  Improving the hit rate does not necessarily
translate into finding the vma any faster, as the overhead of any fancy
caching scheme can be too high to consider.

We currently cache the last used vma for the whole address space, which
provides a nice optimization, reducing the total cycles in find_vma() by
up to 250%, for workloads with good locality.  On the other hand, this
simple scheme is pretty much useless for workloads with poor locality.
Analyzing ebizzy runs shows that, no matter how many threads are
running, the mmap_cache hit rate is less than 2%, and in many situations
below 1%.

The proposed approach is to replace this scheme with a small per-thread
cache, maximizing hit rates at a very low maintenance cost.
Invalidations are performed by simply bumping up a 32-bit sequence
number.  The only expensive operation is in the rare case of a seq
number overflow, where all caches that share the same address space are
flushed.  Upon a miss, the proposed replacement policy is based on the
page number that contains the virtual address in question.  Concretely,
the following results are seen on an 80 core, 8 socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread
   scheme improves on the ~50% hit rate by just adding a few more slots to
   the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current
   approach as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this
   approach always shows nearly perfect hit rates, while the baseline is
   just about non-existent.  The number of cycles can fluctuate anywhere
   from ~60 to ~116 billion for the baseline scheme, but this approach
   reduces it considerably.  For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+
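
The scheme described above can be modelled in a few lines of plain C
(hypothetical names, userspace sketch, not the kernel code): each thread
keeps a handful of slots plus a copy of a per-address-space sequence
number; invalidation just bumps that number, and the slot index comes
from the page number of the faulting address.

  #include <stdint.h>
  #include <stddef.h>

  #define PAGE_SHIFT  12
  #define CACHE_SLOTS 4

  struct vma;                              /* stand-in for vm_area_struct */
  struct mm { uint32_t seqnum; };          /* per-address-space sequence number */
  struct thread_cache {
          uint32_t seqnum;                 /* mm->seqnum when last refilled */
          struct vma *slot[CACHE_SLOTS];
  };

  /* Invalidate every per-thread cache sharing this mm: just bump the counter. */
  static void cache_invalidate(struct mm *mm) { mm->seqnum++; }

  static size_t slot_of(unsigned long addr)
  {
          return (addr >> PAGE_SHIFT) & (CACHE_SLOTS - 1);
  }

  static struct vma *cache_find(struct thread_cache *tc, struct mm *mm,
                                unsigned long addr)
  {
          if (tc->seqnum != mm->seqnum)    /* stale: treat the cache as empty */
                  return NULL;
          return tc->slot[slot_of(addr)];  /* caller still verifies the range */
  }

  static void cache_update(struct thread_cache *tc, struct mm *mm,
                           unsigned long addr, struct vma *vma)
  {
          if (tc->seqnum != mm->seqnum) {  /* flush stale entries first */
                  for (size_t i = 0; i < CACHE_SLOTS; i++)
                          tc->slot[i] = NULL;
                  tc->seqnum = mm->seqnum;
          }
          tc->slot[slot_of(addr)] = vma;
  }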

[akpm@linux-foundation.org: fix nommu build, per Davidlohr]
[akpm@linux-foundation.org: document vmacache_valid() logic]
[akpm@linux-foundation.org: attempt to untangle header files]
[akpm@linux-foundation.org: add vmacache_find() BUG_ON]
[hughd@google.com: add vmacache_valid_mm() (from Oleg)]
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: adjust and enhance comments]
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agovmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()
Christoph Lameter [Fri, 18 Apr 2014 22:07:10 +0000 (15:07 -0700)]
vmscan: reclaim_clean_pages_from_list() must use mod_zone_page_state()

commit 83da7510058736c09a14b9c17ec7d851940a4332 upstream.

reclaim_clean_pages_from_list() seems to be called with preemption
enabled.  Therefore it must use mod_zone_page_state() instead of the
non-preempt-safe __mod_zone_page_state().

Signed-off-by: Christoph Lameter <cl@linux.com>
Reported-by: Grygorii Strashko <grygorii.strashko@ti.com>
Tested-by: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: vmscan: shrink_slab: rename max_pass -> freeable
Vladimir Davydov [Thu, 3 Apr 2014 21:47:32 +0000 (14:47 -0700)]
mm: vmscan: shrink_slab: rename max_pass -> freeable

commit d5bc5fd3fcb7b8dfb431694a8c8052466504c10c upstream.

The name `max_pass' is misleading, because this variable actually keeps
the estimated number of freeable objects, not the maximal number of
objects we can scan in this pass, which can be twice that.  Rename it to
reflect its actual meaning.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim
Vladimir Davydov [Thu, 3 Apr 2014 21:47:19 +0000 (14:47 -0700)]
mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim

commit 99120b772b52853f9a2b829a21dd44d9b20558f1 upstream.

When direct reclaim is executed by a process bound to a set of NUMA
nodes, we should scan only those nodes when possible, but currently we
will scan kmem from all online nodes even if the kmem shrinker is NUMA
aware.  As a result, binding a process to a particular NUMA node won't
prevent it from shrinking inode/dentry caches from other nodes, which is
not good.  Fix this.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
Jens Axboe [Thu, 22 May 2014 18:54:16 +0000 (11:54 -0700)]
mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d upstream.

In some testing I ran today (some fio jobs that spread over two nodes),
we end up spending 40% of the time in filemap_check_errors().  That
smells fishy.  Looking further, this is basically what happens:

blkdev_aio_read()
    generic_file_aio_read()
        filemap_write_and_wait_range()
            if (!mapping->nr_pages)
                filemap_check_errors()

and filemap_check_errors() always attempts two test_and_clear_bit() on
the mapping flags, thus dirtying it for every single invocation.  The
patch below tests each of these bits before clearing them, avoiding this
issue.  In my test case (4-socket box), performance went from 1.7M IOPS
to 4.0M IOPS.
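
The resulting check is roughly of this shape (a sketch, not the verbatim
patch): only perform the cacheline-dirtying test_and_clear_bit() when a
plain test_bit() says the flag is actually set.

  int ret = 0;

  if (test_bit(AS_ENOSPC, &mapping->flags) &&
      test_and_clear_bit(AS_ENOSPC, &mapping->flags))
          ret = -ENOSPC;
  if (test_bit(AS_EIO, &mapping->flags) &&
      test_and_clear_bit(AS_EIO, &mapping->flags))
          ret = -EIO;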

Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: optimize put_mems_allowed() usage
Mel Gorman [Thu, 3 Apr 2014 21:47:24 +0000 (14:47 -0700)]
mm: optimize put_mems_allowed() usage

commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

Since put_mems_allowed() is strictly optional (it is just a seqcount
retry), we don't need to evaluate the function if the allocation was in
fact successful, saving an smp_rmb(), some loads and comparisons on some
relatively fast paths.

Since the naming get/put_mems_allowed() does suggest a mandatory
pairing, rename the interface, as suggested by Mel, to resemble the
seqcount interface.

This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
where it is important to note that the return value of the latter call
is inverted from its previous incarnation.
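
Typical usage then follows the familiar seqcount pattern; a sketch (the
allocation helper name is made up for illustration):

  unsigned int cookie;
  struct page *page;

  do {
          cookie = read_mems_allowed_begin();
          page = try_alloc_from_allowed_nodes(gfp, order);  /* hypothetical */
  } while (!page && read_mems_allowed_retry(cookie));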

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead...
Raghavendra K T [Thu, 3 Apr 2014 21:48:23 +0000 (14:48 -0700)]
mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages

commit 6d2be915e589b58cb11418cbe1f22ff90732b6ac upstream.

Currently max_sane_readahead() returns zero on a cpu whose NUMA node
has no local memory, which leads to readahead failure.  Fix this
readahead failure by returning the minimum of (requested pages, 512).
Users running applications that need readahead, such as streaming
applications, on a memoryless cpu see a considerable boost in
performance.

Result:

An fadvise experiment with FADV_WILLNEED on a PPC machine having a
memoryless CPU, with a 1GB testfile (12 iterations), yielded around a
46.66% improvement.

An fadvise experiment with FADV_WILLNEED on an x240 NUMA machine
(32GB * 4G RAM) with a 1GB testfile (12 iterations) showed no impact on
the normal NUMA cases with the patch.

  Kernel       Avg  Stddev
  base      7.4975   3.92%
  patched   7.4174   3.26%
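
The clamp itself is roughly (sketch, with the constant expressed
page-size independently as per the note below):

  #define MAX_READAHEAD   ((512 * 4096) / PAGE_CACHE_SIZE)

  unsigned long max_sane_readahead(unsigned long nr)
  {
          return min(nr, MAX_READAHEAD);
  }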

[Andrew: making return value PAGE_SIZE independent]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm, compaction: ignore pageblock skip when manually invoking compaction
David Rientjes [Thu, 3 Apr 2014 21:47:23 +0000 (14:47 -0700)]
mm, compaction: ignore pageblock skip when manually invoking compaction

commit 91ca9186484809c57303b33778d841cc28f696ed upstream.

The cached pageblock hint should be ignored when triggering compaction
through /proc/sys/vm/compact_memory so that all eligible memory is
isolated.  Manually invoked compaction is known to be expensive and is
mainly used for debugging, so there is no need to skip pageblocks based
on heuristics.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm, compaction: determine isolation mode only once
David Rientjes [Mon, 7 Apr 2014 22:37:34 +0000 (15:37 -0700)]
mm, compaction: determine isolation mode only once

commit da1c67a76f7cf2b3404823d24f9f10fa91aa5dc5 upstream.

The conditions that control the isolation mode in
isolate_migratepages_range() do not change during the iteration, so
extract them out and only define the value once.

This actually does have an effect; gcc doesn't optimize it away itself
because of cc->sync.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm/compaction: clean-up code on success of balloon isolation
Joonsoo Kim [Mon, 7 Apr 2014 22:37:07 +0000 (15:37 -0700)]
mm/compaction: clean-up code on success of balloon isolation

commit b6c750163c0d138f5041d95fcdbd1094b6928057 upstream.

This is just a clean-up to reduce code size and improve readability.
There is no functional change.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm/compaction: check pageblock suitability once per pageblock
Joonsoo Kim [Mon, 7 Apr 2014 22:37:06 +0000 (15:37 -0700)]
mm/compaction: check pageblock suitability once per pageblock

commit c122b2087ab94192f2b937e47b563a9c4e688ece upstream.

isolation_suitable() and migrate_async_suitable() are used to be sure
that this pageblock range is fine to be migrated.  It isn't necessary to
call them on every page.  The current code does well if the pageblock is
not suitable, but doesn't do well when it is suitable.

1) It re-checks isolation_suitable() on each page of a pageblock that was
   already established as suitable.
2) It re-checks migrate_async_suitable() on each page of a pageblock that
   was not entered through the next_pageblock: label, because
   last_pageblock_nr is not otherwise updated.

This patch fixes the situation by 1) calling isolation_suitable() only
once per pageblock and 2) always updating last_pageblock_nr to the
pageblock that was just checked.

Additionally, move the PageBuddy() check after the pageblock unit check,
since the pageblock check is the first thing we should do and this makes
things simpler.

[vbabka@suse.cz: rephrase commit description]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm/compaction: change the timing to check to drop the spinlock
Joonsoo Kim [Mon, 7 Apr 2014 22:37:05 +0000 (15:37 -0700)]
mm/compaction: change the timing to check to drop the spinlock

commit be1aa03b973c7dcdc576f3503f7a60429825c35d upstream.

It is odd to drop the spinlock when we scan the (SWAP_CLUSTER_MAX - 1)th
pfn page.  This may result in the situation below while isolating
migration pages.

1. try to isolate 0x0 ~ 0x200 pfn pages.
2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
   the spinlock.
3. Then, to complete the isolation, retry to acquire the lock.

I think it is better to use the SWAP_CLUSTER_MAX-th pfn for checking the
criteria for dropping the lock.  This does no harm for the 0x0 pfn,
because at this time the locked variable would be false.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agodrbd: fix regression 'out of mem, failed to invoke fence-peer helper'
Lars Ellenberg [Wed, 9 Jul 2014 19:18:32 +0000 (21:18 +0200)]
drbd: fix regression 'out of mem, failed to invoke fence-peer helper'

commit bbc1c5e8ad6dfebf9d13b8a4ccdf66c92913eac9 upstream.

Since linux kernel 3.13, kthread_run() internally uses
wait_for_completion_killable().  We may sometimes call kthread_run()
while we still have a signal pending (one we used to kick our threads
out of potentially blocking network functions), causing kthread_run()
to mistake it for a new fatal signal and fail.

Fix: flush_signals() before kthread_run().
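
In code, the fix is essentially (sketch; the worker function and names
are placeholders, not the actual drbd identifiers):

  flush_signals(current);  /* drop the signal we sent to our own threads */
  task = kthread_run(fence_peer_worker, connection, "drbd_fence");
  if (IS_ERR(task))
          err = PTR_ERR(task);  /* a genuine failure, not a stale signal */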

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm/compaction: do not call suitable_migration_target() on every page
Joonsoo Kim [Mon, 7 Apr 2014 22:37:04 +0000 (15:37 -0700)]
mm/compaction: do not call suitable_migration_target() on every page

commit 01ead5340bcf5f3a1cd2452c75516d0ef4d908d7 upstream.

suitable_migration_target() checks whether a pageblock is suitable as a
migration target.  In isolate_freepages_block() it is called on every
page, which is inefficient, so make it be called once per pageblock.

suitable_migration_target() also checks whether the page is high-order
or not, but its criterion for high-order is the pageblock order.  So
calling it once within a pageblock range is not a problem.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm/compaction: disallow high-order page for migration target
Joonsoo Kim [Mon, 7 Apr 2014 22:37:03 +0000 (15:37 -0700)]
mm/compaction: disallow high-order page for migration target

commit 7d348b9ea64db0a315d777ce7d4b06697f946503 upstream.

The purpose of compaction is to get a high-order page.  Currently, if
we find a high-order page while searching for a migration target page,
we break it into order-0 pages and use them as migration targets.  This
is contrary to the purpose of compaction, so disallow high-order pages
from being used as migration targets.

Additionally, clean up the logic in suitable_migration_target() to
simplify the code.  There are no functional changes from this clean-up.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm, compaction: avoid isolating pinned pages
David Rientjes [Thu, 3 Apr 2014 21:48:00 +0000 (14:48 -0700)]
mm, compaction: avoid isolating pinned pages

commit 119d6d59dcc0980dcd581fdadb6b2033b512a473 upstream.

Page migration will fail for memory that is pinned in memory with, for
example, get_user_pages().  In this case, it is unnecessary to take
zone->lru_lock or to isolate the page and pass it to page migration,
which will ultimately fail.

This is a racy check; the page can still change from under us, but in
that case we'll just fail later when attempting to move the page.

This avoids very expensive memory compaction when faulting transparent
hugepages after pinning a lot of memory with a Mellanox driver.

On a 128GB machine and pinning ~120GB of memory, before this patch we
see the enormous disparity in the number of page migration failures
because of the pinning (from /proc/vmstat):

compact_pages_moved 8450
compact_pagemigrate_failed 15614415

0.05% of pages isolated are successfully migrated and explicitly
triggering memory compaction takes 102 seconds.  After the patch:

compact_pages_moved 9197
compact_pagemigrate_failed 7

99.9% of pages isolated are now successfully migrated in this
configuration and memory compaction takes less than one second.
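
The added check is of this general shape (sketch): a page with more
references than mappings is likely pinned, e.g. by get_user_pages(), so
skip it without taking zone->lru_lock at all.

  /* Racy by design: the counts can change under us, in which case the
   * migration of this page simply fails later instead. */
  if (!page_mapping(page) &&
      page_count(page) > page_mapcount(page))
          continue;  /* skip the likely-pinned page */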

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoswap: change swap_list_head to plist, add swap_avail_head
Dan Streetman [Wed, 4 Jun 2014 23:09:59 +0000 (16:09 -0700)]
swap: change swap_list_head to plist, add swap_avail_head

commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream.

Originally get_swap_page() started iterating through the singly-linked
list of swap_info_structs using swap_list.next or highest_priority_index,
both of which were intended to point to the highest priority active swap
target that was not full.  The first patch in this series changed the
singly-linked list to a doubly-linked list, and removed the logic to start
at the highest priority non-full entry; it starts scanning at the highest
priority entry each time, even if the entry is full.

Replace the manually ordered swap_list_head with a plist, swap_active_head.
Add a new plist, swap_avail_head.  The original swap_active_head plist
contains all active swap_info_structs, as before, while the new
swap_avail_head plist contains only swap_info_structs that are active and
available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
the swap_avail_head list.

Mel Gorman suggested using plists since they internally handle ordering
the list entries based on priority, which is exactly what swap was doing
manually.  All the ordering code is now removed, and swap_info_struct
entries are simply added to their corresponding plist and automatically
ordered correctly.

Using a new plist for available swap_info_structs simplifies and
optimizes get_swap_page(), which no longer has to iterate over full
swap_info_structs.  Using a new spinlock for the swap_avail_head plist
allows each swap_info_struct to add or remove itself from the plist
when it becomes full or not-full; previously it could not do so because
the swap_info_struct->lock is held when it changes from full<->not-full,
and the swap_lock protecting the main swap_active_head must be ordered
before any swap_info_struct->lock.
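
In rough terms (sketch; the field name avail_list is illustrative), a
device leaves and rejoins the available plist under the new lock as it
fills up and drains:

  /* device just became full: drop it from the "available" plist only */
  spin_lock(&swap_avail_lock);
  plist_del(&si->avail_list, &swap_avail_head);
  spin_unlock(&swap_avail_lock);

  /* later, pages were freed on it: make it available again */
  spin_lock(&swap_avail_lock);
  plist_add(&si->avail_list, &swap_avail_head);
  spin_unlock(&swap_avail_lock);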

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Cc: Weijie Yang <weijieut@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agolib/plist: add plist_requeue
Dan Streetman [Wed, 4 Jun 2014 23:09:57 +0000 (16:09 -0700)]
lib/plist: add plist_requeue

commit a75f232ce0fe38bd01301899ecd97ffd0254316a upstream.

Add plist_requeue(), which moves the specified plist_node after all other
same-priority plist_nodes in the list.  This is essentially an optimized
plist_del() followed by plist_add().

This is needed by swap, which (with the next patch in this set) uses a
plist of available swap devices.  When a swap device (either a swap
partition or a swap file) is added to the system with swapon(), the
device is added to a plist, ordered by the swap device's priority.  When
swap needs to allocate a page from one of the swap devices, it takes the
page from the first swap device on the plist, which is the highest
priority swap device.  The swap device is left in the plist until all its
pages are used, and then removed from the plist when it becomes full.

However, as described in man 2 swapon, swap must allocate pages from swap
devices with the same priority in round-robin order; to do this, on each
swap page allocation, swap uses a page from the first swap device in the
plist, and then calls plist_requeue() to move that swap device entry to
after any other same-priority swap devices.  The next swap page allocation
will again use a page from the first swap device in the plist and requeue
it, and so on, resulting in round-robin usage of equal-priority swap
devices.

Also add a plist_test_requeue() test function, for use by plist_test()
to exercise plist_requeue().
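
A sketch of the allocation-side usage (names illustrative): take a page
from the device at the head of the available plist, then requeue that
entry behind its equal-priority peers.

  spin_lock(&swap_avail_lock);
  si = container_of(plist_first(&swap_avail_head),
                    struct swap_info_struct, avail_list);
  /* allocate from si, then rotate it behind any other devices of the same
   * priority so the next allocation round-robins across them */
  plist_requeue(&si->avail_list, &swap_avail_head);
  spin_unlock(&swap_avail_lock);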

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Cc: Weijie Yang <weijieut@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agolib/plist: add helper functions
Dan Streetman [Wed, 4 Jun 2014 23:09:55 +0000 (16:09 -0700)]
lib/plist: add helper functions

commit fd16618e12a05df79a3439d72d5ffdac5d34f3da upstream.

Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
define and initialize a struct plist_head.

Add plist_for_each_continue() and plist_for_each_entry_continue(),
equivalent to list_for_each_continue() and list_for_each_entry_continue(),
to iterate over a plist continuing after the current position.

Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
and ->next, implemented by list_prev_entry() and list_next_entry(), to
access the prev/next struct plist_node entry.  These are needed because
unlike struct list_head, direct access of the prev/next struct plist_node
isn't possible; the list must be navigated via the contained struct
list_head.  e.g.  instead of accessing the prev by list_prev_entry(node,
node_list) it can be accessed by plist_prev(node).

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Cc: Weijie Yang <weijieut@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoswap: change swap_info singly-linked list to list_head
Dan Streetman [Wed, 4 Jun 2014 23:09:53 +0000 (16:09 -0700)]
swap: change swap_info singly-linked list to list_head

commit adfab836f4908deb049a5128082719e689eed964 upstream.

The logic controlling the singly-linked list of swap_info_struct entries
for all active, i.e.  swapon'ed, swap targets is rather complex, because:

 - it stores the entries in priority order
 - there is a pointer to the highest priority entry
 - there is a pointer to the highest priority not-full entry
 - there is a highest_priority_index variable set outside the swap_lock
 - swap entries of equal priority should be used equally

This complexity leads to bugs such as https://lkml.org/lkml/2014/2/13/181,
where different priority swap targets are incorrectly used equally.

That bug probably could be solved with the existing singly-linked lists,
but I think it would only add more complexity to the already difficult to
understand get_swap_page() swap_list iteration logic.

The first patch changes from a singly-linked list to a doubly-linked list
using list_heads; the highest_priority_index and related code are removed
and get_swap_page() starts each iteration at the highest priority
swap_info entry, even if it's full.  While this does introduce unnecessary
list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
one or more of the highest priority entries are full, the iteration and
manipulation code is much simpler and behaves correctly re: the above bug;
and the fourth patch removes the unnecessary iteration.

The second patch adds some minor plist helper functions; nothing new
really, just functions to match existing regular list functions.  These
are used by the next two patches.

The third patch adds plist_requeue(), which is used by get_swap_page() in
the next patch - it performs the requeueing of same-priority entries
(which moves the entry to the end of its priority in the plist), so that
all equal-priority swap_info_structs get used equally.

The fourth patch converts the main list into a plist, and adds a new plist
that contains only swap_info entries that are both active and not full.
As Mel suggested using plists allows removing all the ordering code from
swap - plists handle ordering automatically.  The list naming is also
clarified now that there are two lists, with the original list changed
from swap_list_head to swap_active_head and the new list named
swap_avail_head.  A new spinlock is also added for the new list, so
swap_info entries can be added or removed from the new list immediately as
they become full or not full.

This patch (of 4):

Replace the singly-linked list tracking active, i.e.  swapon'ed,
swap_info_struct entries with a doubly-linked list using struct
list_heads.  Simplify the logic iterating and manipulating the list of
entries, especially get_swap_page(), by using standard list_head
functions, and removing the highest priority iteration logic.

The change fixes the bug:
https://lkml.org/lkml/2014/2/13/181
in which different priority swap entries after the highest priority entry
are incorrectly used equally in pairs.  The swap behavior is now as
advertised, i.e. different priority swap entries are used in order, and
equal priority swap targets are used concurrently.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Cc: Weijie Yang <weijieut@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: exclude memoryless nodes from zone_reclaim
Michal Hocko [Mon, 7 Apr 2014 22:37:01 +0000 (15:37 -0700)]
mm: exclude memoryless nodes from zone_reclaim

commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.

We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of free swap and tons of anonymous memory
which could have been swapped out.  In the end it turned out that the
OOM was a side effect of zone reclaim, which wasn't unmapping and
swapping out, and so the system was pushed to OOM.  Although this
sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct
reclaim interaction, numactl on the said hardware suggests that zone
reclaim should not have been enabled in the first place:

  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 2 cpus:
  node 2 size: 7168 MB
  node 2 free: 6019 MB
  node distances:
  node   0   2
  0:  10  40
  2:  40  10

So all the CPUs are associated with Node0, which doesn't have any
memory, while Node2 contains all the available memory.  The node
distances cause zone_reclaim_mode to be enabled automatically.

Zone reclaim is intended to keep the allocations local, but this doesn't
make any sense on memoryless nodes.  So let's exclude such nodes in
init_zone_allows_reclaim(), which evaluates zone reclaim behaviour and
suitable reclaim_nodes.
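
The exclusion amounts to walking only nodes that actually have memory, a
sketch along the lines of the existing helper:

  /* consider only nodes with memory when pre-computing which nodes this
   * node may reclaim from */
  for_each_node_state(nid, N_MEMORY)
          if (node_distance(node, nid) <= RECLAIM_DISTANCE)
                  node_set(nid, NODE_DATA(node)->reclaim_nodes);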

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agojiffies: Fix timeval conversion to jiffies
Andrew Hunter [Thu, 4 Sep 2014 21:17:16 +0000 (14:17 -0700)]
jiffies: Fix timeval conversion to jiffies

commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.

timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly when repeatedly stopping/starting an itimer:

setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.)  Doing
this repeatedly would cause unbounded growth in val.  So fix the math.

Here's what was wrong with the conversion: we essentially computed
(eliding seconds)

jiffies = usec  * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with a denominator of 2^USEC_JIFFIE_SC, call it
x/(2^USEC_JIFFIE_SC), and computed:

jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)

In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.

We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies.  This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.

Tested: the following program:

#include <stdio.h>
#include <stddef.h>
#include <sys/time.h>

int main() {
  struct itimerval zero = {{0, 0}, {0, 0}};
  /* Initially set to 10 ms. */
  struct itimerval initial = zero;
  initial.it_interval.tv_usec = 10000;
  setitimer(ITIMER_PROF, &initial, NULL);
  /* Save and restore several times. */
  for (size_t i = 0; i < 10; ++i) {
    struct itimerval prev;
    setitimer(ITIMER_PROF, &zero, &prev);
    /* on old kernels, this goes up by TICK_USEC every iteration */
    printf("previous value: %ld %ld %ld %ld\n",
           prev.it_interval.tv_sec, prev.it_interval.tv_usec,
           prev.it_value.tv_sec, prev.it_value.tv_usec);
    setitimer(ITIMER_PROF, &prev, NULL);
  }
  return 0;
}

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Reported-by: Aaron Jacobs <jacobsa@google.com>
Signed-off-by: Andrew Hunter <ahh@google.com>
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz <john.stultz@linaro.org>
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomedia: vb2: fix VBI/poll regression
Hans Verkuil [Sat, 20 Sep 2014 19:16:35 +0000 (16:16 -0300)]
media: vb2: fix VBI/poll regression

commit 58d75f4b1ce26324b4d809b18f94819843a98731 upstream.

The recent conversion of saa7134 to vb2 uncovered a poll() bug that
broke the teletext applications alevt and mtt. These applications
expect that calling poll() without having called VIDIOC_STREAMON will
cause poll() to return POLLERR. That did not happen in vb2.

This patch fixes that behavior. It also fixes what should happen when
poll() is called after STREAMON has been called but no buffers have
been queued. In that case poll() will also return POLLERR, but only
for capture queues, since output queues will always return POLLOUT
anyway in that situation.

This brings the vb2 behavior in line with the old videobuf behavior.

Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: numa: Do not mark PTEs pte_numa when splitting huge pages
Mel Gorman [Thu, 2 Oct 2014 18:47:42 +0000 (19:47 +0100)]
mm: numa: Do not mark PTEs pte_numa when splitting huge pages

commit abc40bd2eeb77eb7c2effcaf63154aad929a1d5f upstream.

This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
to a protection change and the tail will be in a PROT_NONE vma, then
NUMA hinting PTEs are temporarily created in the protected VMA.
 VM_RW|VM_PROTNONE
|-----------------|
      ^
      split here

In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.

Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm, thp: move invariant bug check out of loop in __split_huge_page_map
Waiman Long [Wed, 6 Aug 2014 23:05:36 +0000 (16:05 -0700)]
mm, thp: move invariant bug check out of loop in __split_huge_page_map

commit f8303c2582b889351e261ff18c4d8eb197a77db2 upstream.

In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop.  Because the macro is implemented using
atomic_read(), the redundant check cannot be optimized away by the
compiler, leading to unnecessary reads of the page structure.

This patch moves the invariant bug check out of the loop so that it will
be done only once.  On a 3.16-rc1 based kernel, a microbenchmark that
broke up 1000 transparent huge pages using munmap() had an execution
time of 38,245us and 38,548us with and without the patch respectively.
The performance gain is about 1%.
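
The transformation is the classic hoisting of a loop-invariant check; a
sketch:

  /* before: checked on every one of the HPAGE_PMD_NR iterations */
  for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
          BUG_ON(page_mapcount(page) != 1);  /* atomic_read(), not optimized out */
          /* ... set up one pte ... */
  }

  /* after: the page does not change inside the loop, so check it once */
  BUG_ON(page_mapcount(page) != 1);
  for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
          /* ... set up one pte ... */
  }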

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agohugetlb: ensure hugepage access is denied if hugepages are not supported
Nishanth Aravamudan [Tue, 6 May 2014 19:50:00 +0000 (12:50 -0700)]
hugetlb: ensure hugepage access is denied if hugepages are not supported

commit 457c1b27ed56ec472d202731b12417bff023594a upstream.

Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
related to the fact that hugetlbfs is not correctly setting itself up in
this state:

  Unable to handle kernel paging request for data at address 0x00000031
  Faulting instruction address: 0xc000000000245710
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  ....

In KVM guests on Power, in a guest not backed by hugepages, we see the
following:

  AnonHugePages:         0 kB
  HugePages_Total:       0
  HugePages_Free:        0
  HugePages_Rsvd:        0
  HugePages_Surp:        0
  Hugepagesize:         64 kB

HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init().  Extract the check to a helper function, and use it in a
few relevant places.

This does make hugetlbfs not supported (not registered at all) in this
environment.  I believe this is fine, as there are no valid hugepages
and that won't change at runtime.
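
Conceptually the helper is just a check on HPAGE_SHIFT (sketch, not
necessarily the exact definition):

  /* hugepages are only usable if the architecture set up a huge page size */
  #define hugepages_supported()  (HPAGE_SHIFT != 0)

  /* used e.g. on the hugetlbfs mount path: */
  if (!hugepages_supported())
          return -ENOTSUPP;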

[akpm@linux-foundation.org: use pr_info(), per Mel]
[akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoCIFS: Fix SMB2 readdir error handling
Pavel Shilovsky [Mon, 18 Aug 2014 16:49:57 +0000 (20:49 +0400)]
CIFS: Fix SMB2 readdir error handling

commit 52755808d4525f4d5b86d112d36ffc7a46f3fb48 upstream.

SMB2 servers indicate the end of a directory search with the
STATUS_NO_MORE_FILE error code, which is not processed now.
This causes the generic/257 xfstest to fail. Fix this by triggering
the end of the search on this error code in SMB2_query_directory.

Also, when negotiating the CIFS protocol we tell the server to close
the search automatically at the end, so there is no need to do
it ourselves. In the case of the SMB2 protocol, we need to close it
explicitly - so separate the close-directory checks for different
protocols.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoring-buffer: Fix infinite spin in reading buffer
Steven Rostedt (Red Hat) [Thu, 2 Oct 2014 20:51:18 +0000 (16:51 -0400)]
ring-buffer: Fix infinite spin in reading buffer

commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream.

Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page"
fixed one bug but in the process caused another one. The reset is to
update the header page, but that fix also changed the way the cached
reads were updated. The cache reads are used to test if an iterator
needs to be updated or not.

A ring buffer iterator, when created, disables writes to the ring buffer
but does not stop other readers or consuming reads from happening.
Although all readers are synchronized via a lock, they are only
synchronized when in the ring buffer functions. Those functions may
be called by any number of readers. The iterator continues down when
it's not interrupted by a consuming reader. If a consuming read
occurs, the iterator starts from the beginning of the buffer.

The way the iterator sees that a consuming read has happened since
its last read is by checking the reader "cache". The cache holds the
last counts of the read and the reader page itself.

Commit 651e22f2701b changed what was saved by the cache_read when
the rb_iter_reset() occurred, making the iterator never match the cache.
Then if the iterator calls rb_iter_reset(), it will go into an
infinite loop by checking if the cache doesn't match, doing the reset
and retrying, just to see that the cache still doesn't match!  This
should never happen, as the reset is supposed to set the cache to the
current value, and there are locks that keep a consuming reader from
having access to the data.

Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoinit/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu
Josh Triplett [Fri, 3 Oct 2014 23:19:24 +0000 (16:19 -0700)]
init/Kconfig: Fix HAVE_FUTEX_CMPXCHG to not break up the EXPERT menu

commit 62b4d2041117f35ab2409c9f5c4b8d3dc8e59d0f upstream.

commit 03b8c7b623c80af264c4c8d6111e5c6289933666 ("futex: Allow
architectures to skip futex_atomic_cmpxchg_inatomic() test") added the
HAVE_FUTEX_CMPXCHG symbol right below FUTEX.  This placed it right in
the middle of the options for the EXPERT menu.  However,
HAVE_FUTEX_CMPXCHG does not depend on EXPERT or FUTEX, so Kconfig stops
placing items in the EXPERT menu, and displays the remaining several
EXPERT items (starting with EPOLL) directly in the General Setup menu.

Since both users of HAVE_FUTEX_CMPXCHG only select it "if FUTEX", make
HAVE_FUTEX_CMPXCHG itself depend on FUTEX.  With this change, the
subsequent items display as part of the EXPERT menu again; the EMBEDDED
menu now appears as the next top-level item in the General Setup menu,
which makes General Setup much shorter and more usable.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoFix problem recognizing symlinks
Steve French [Thu, 25 Sep 2014 06:26:55 +0000 (01:26 -0500)]
Fix problem recognizing symlinks

commit 19e81573fca7b87ced7701e01ba164b968d929bd upstream.

Changeset eb85d94bd introduced a problem where, if a cifs open
fails during query info of a file, we
will still try to close the file (this happens with certain types
of reparse points) even though the file handle is not valid.

In addition, for SMB2/SMB3 we were not mapping the return code returned
by Windows when trying to open a file (like a Windows NFS symlink)
which is a reparse point.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agodrm/i915: Flush the PTEs after updating them before suspend
Chris Wilson [Thu, 25 Sep 2014 09:13:12 +0000 (10:13 +0100)]
drm/i915: Flush the PTEs after updating them before suspend

commit 91e56499304f3d612053a9cf17f350868182c7d8 upstream.

As we use WC updates of the PTE, we are responsible for notifying the
hardware when to flush its TLBs. Do so after we zap all the PTEs before
suspend (and the BIOS tries to read our GTT).

Fixes a regression from

commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date:   Wed Oct 16 09:21:30 2013 -0700

    drm/i915: Disable GGTT PTEs on GEN6+ suspend

that survived and continue to cause harm even after

commit e568af1c626031925465a5caaab7cca1303d55c7
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Mar 26 20:08:20 2014 +0100

    drm/i915: Undo gtt scratch pte unmapping again

v2: Trivial rebase.
v3: Fixes requires pointer dances.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=82340
Tested-by: ming.yao@intel.com
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: Todd Previte <tprevite@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid5: disable 'DISCARD' by default due to safety concerns.
NeilBrown [Thu, 2 Oct 2014 03:45:00 +0000 (13:45 +1000)]
md/raid5: disable 'DISCARD' by default due to safety concerns.

commit 8e0e99ba64c7ba46133a7c8a3e3f7de01f23bd93 upstream.

It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint.  Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero.  If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks.  If all were zero, this would
be the case.  If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to override this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site wants to enable DISCARD on some arrays but not on others, it
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
    raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Fixes: 620125f2bf8ff0c4969b79653b54d7bcc9d40637
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agocpufreq: integrator: fix integrator_cpufreq_remove return type
Arnd Bergmann [Fri, 26 Sep 2014 20:19:12 +0000 (22:19 +0200)]
cpufreq: integrator: fix integrator_cpufreq_remove return type

commit d62dbf77f7dfaa6fb455b4b9828069a11965929c upstream.

When building this driver as a module, we get a helpful warning
about the return type:

drivers/cpufreq/integrator-cpufreq.c:232:2: warning: initialization from incompatible pointer type
  .remove = __exit_p(integrator_cpufreq_remove),

If the remove callback returns void, the caller gets an undefined
value as it expects an integer to be returned. This fixes the
problem by passing down the value from cpufreq_unregister_driver.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: migrate: Close race between migration completion and mprotect
Mel Gorman [Thu, 2 Oct 2014 18:47:41 +0000 (19:47 +0100)]
mm: migrate: Close race between migration completion and mprotect

commit d3cb8bf6081b8b7a2dabb1264fe968fd870fa595 upstream.

A migration entry is marked as write if pte_write was true at the time
the entry was created. The VMA protections are not double checked when
migration entries are being removed, as mprotect marks
write-migration-entries as read. It means that potentially we take a
spurious fault to mark PTEs writable again, but that is straightforward.
However, there is a race between write migration entries being marked
read and migrations finishing. This potentially allows a PTE to be
writable that should have been read-only. Close this race by double
checking the VMA permissions using maybe_mkwrite when migration
completes.
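
Roughly, when the migration entry is removed the write bit is now derived
from the VMA instead of being set unconditionally (sketch):

  /* only make the pte writable if the vma still allows writes */
  if (is_write_migration_entry(entry))
          pte = maybe_mkwrite(pte, vma);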

[torvalds@linux-foundation.org: use maybe_mkwrite]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoperf: fix perf bug in fork()
Peter Zijlstra [Thu, 2 Oct 2014 23:17:02 +0000 (16:17 -0700)]
perf: fix perf bug in fork()

commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream.

Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by
calling perf_event_free_task() when failing sched_fork() we will not yet
have done the memset() on ->perf_event_ctxp[] and will therefore try to
'free' the inherited contexts, which are still in use by the parent
process.  This is bad.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoudf: Avoid infinite loop when processing indirect ICBs
Jan Kara [Thu, 4 Sep 2014 12:06:55 +0000 (14:06 +0200)]
udf: Avoid infinite loop when processing indirect ICBs

commit c03aa9f6e1f938618e6db2e23afef0574efeeb65 upstream.

We did not implement any bound on the number of indirect ICBs we follow
when loading an inode. Thus a corrupted medium could cause the kernel to
go into an infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limit the number of indirect ICBs we follow to
avoid infinite loops.
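
The shape of the fix (sketch; the cap and the indirect-ICB test are
illustrative placeholders): follow indirect ICBs iteratively with a
counter instead of recursing, and bail out past a sane nesting depth.

  unsigned int indirections = 0;

  reread:
          /* ... read this ICB; if it turns out to be an indirect ICB ... */
          if (icb_is_indirect) {
                  if (++indirections > UDF_MAX_ICB_NESTING) {  /* e.g. 1024 */
                          udf_err(sb, "too many ICBs in ICB hierarchy\n");
                          return;
                  }
                  goto reread;  /* iterate, do not recurse */
          }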

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoLinux 3.14.20 v3.14.20
Greg Kroah-Hartman [Sun, 5 Oct 2014 21:52:37 +0000 (14:52 -0700)]
Linux 3.14.20

9 years agoARM: DRA7: Add support for soc_is_dra74x() and soc_is_dra72x() variants
Rajendra Nayak [Thu, 28 Aug 2014 01:38:22 +0000 (19:38 -0600)]
ARM: DRA7: Add support for soc_is_dra74x() and soc_is_dra72x() variants

commit af438fec6cb99fc2e2faf8b16b865af26ce722e6 upstream.

Use the corresponding compatibles to identify the devices.

Signed-off-by: Rajendra Nayak <rnayak@ti.com>
Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com>
Acked-by: Nishanth Menon <nm@ti.com>
Tested-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoclk: qcom: mdp_lut_clk is a child of mdp_src
Stephen Boyd [Wed, 9 Jul 2014 01:36:06 +0000 (18:36 -0700)]
clk: qcom: mdp_lut_clk is a child of mdp_src

commit f87dfcabc6f173cc811d185d33327f50a8c88399 upstream.

The mdp_lut_clk isn't a child of the mdp_clk. Instead it's the
child of the mdp_src clock. Fix it.

Fixes: 6d00b56fe "clk: qcom: Add support for MSM8960's multimedia clock controller (MMCC)"
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoclk: qcom: Fix MN frequency tables, parent map, and jpegd
Stephen Boyd [Wed, 9 Jul 2014 01:36:06 +0000 (18:36 -0700)]
clk: qcom: Fix MN frequency tables, parent map, and jpegd

commit ff20783f7b9f35b29e768d8ecc7076c1ca1a60ca upstream.

Clocks that don't have a pre-divider don't list any pre-divider
in their frequency tables, but their tables are initialized using
aggregate initializers. Use tagged initializers so we properly
assign the m and n values for each frequency. Furthermore, the
mmcc_pxo_pll8_pll2_pll3 array improperly mapped the second
element to pll2 instead of pll8, causing the clock driver to
recalculate the wrong rate for any clocks using this array along
with a rate that uses pll2. Plus the .num_parents field is 3
instead of 4 so you can't even switch the parent to pll3. Finally
I noticed that the jpegd clock improperly indicates that the
pre-divider width is only 2, when it's actually 4 bits wide.
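
For clarity, the difference is plain C initializer behaviour; with an assumed
freq_tbl layout of { freq, src, pre_div, m, n }:

  /* aggregate: omitting pre_div shifts m and n into the wrong members */
  { 27000000, P_PLL8, 2, 8 },

  /* designated (tagged): each value lands in the intended member */
  { .freq = 27000000, .src = P_PLL8, .m = 2, .n = 8 },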

Fixes: 6d00b56fe "clk: qcom: Add support for MSM8960's multimedia clock controller (MMCC)"
Tested-by: Rob Clark <robdclark@gmail.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agostaging/lustre: disable virtual block device for 64K pages
Arnd Bergmann [Fri, 20 Jun 2014 12:23:28 +0000 (14:23 +0200)]
staging/lustre: disable virtual block device for 64K pages

commit 0bf22be0da8ea74bc7ccc5b07d7855830be16eca upstream.

The lustre virtual block device cannot handle 64K pages and fails at compile
time. To avoid running into this error, let's disable the Kconfig option
for this driver in the configurations it doesn't support.

Reported-by: Dann Frazier <dann.frazier@canonical.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoaio: block exit_aio() until all context requests are completed
Gu Zheng [Wed, 3 Sep 2014 09:45:44 +0000 (17:45 +0800)]
aio: block exit_aio() until all context requests are completed

commit 6098b45b32e6baeacc04790773ced9340601d511 upstream.

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in the current implementation, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
9 years agoclk: prevent erroneous parsing of children during rate change
Tero Kristo [Thu, 21 Aug 2014 13:47:45 +0000 (16:47 +0300)]
clk: prevent erroneous parsing of children during rate change

commit 067bb1741c27c8d3b74ac98c0b8fc12b31e67005 upstream.

In some cases clocks can switch their parent during clk_set_rate; for
example, clk_mux can do this. The current implementation of clk_change_rate
uses unsafe list iteration over the clock's children, which will cause the
wrong clocks to be parsed if any of the children change their parent during
the rate-change operation. Fix this by using the safe list iterator instead.
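
In other words, the children walk moves from the plain iterator to the _safe
variant, which samples the next node before the loop body runs (a sketch; the
clk core keeps children on an hlist):

  struct clk *child;
  struct hlist_node *tmp;

  /* unsafe: breaks if 'child' is re-parented inside the loop body */
  hlist_for_each_entry(child, &clk->children, child_node)
          clk_change_rate(child);

  /* safe: iteration no longer depends on 'child' staying on this list */
  hlist_for_each_entry_safe(child, tmp, &clk->children, child_node)
          clk_change_rate(child);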

The problem was detected due to some divide by zero errors generated
by clock init on dra7-evm board, see discussion under
http://article.gmane.org/gmane.linux.ports.arm.kernel/349180 for details.

Fixes: 71472c0c06cf ("clk: add support for clock reparent on set_rate")
Signed-off-by: Tero Kristo <t-kristo@ti.com>
Reported-by: Nishanth Menon <nm@ti.com>
Signed-off-by: Mike Turquette <mturquette@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoperf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU
Venkatesh Srinivas [Thu, 13 Mar 2014 19:36:26 +0000 (12:36 -0700)]
perf/x86/intel: Use rdmsrl_safe() when initializing RAPL PMU

commit 24223657806a0ebd0ae5c9caaf7b021091889cf2 upstream.

CPUs which should support the RAPL counters according to
Family/Model/Stepping may still issue #GP when attempting to access
the RAPL MSRs. This may happen when Linux is running under KVM and
we are passing-through host F/M/S data, for example. Use rdmsrl_safe
to first access the RAPL_POWER_UNIT MSR; if this fails, do not
attempt to use this PMU.
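
A sketch of the check (simplified from the init path):

  u64 rapl_unit;

  /* rdmsrl_safe() returns non-zero if the RDMSR faulted with #GP */
  if (rdmsrl_safe(MSR_RAPL_POWER_UNIT, &rapl_unit))
          return -1;    /* MSR not actually there: skip registering the PMU */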

Signed-off-by: Venkatesh Srinivas <venkateshs@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394739386-22260-1-git-send-email-venkateshs@google.com
Cc: zheng.z.yan@intel.com
Cc: eranian@google.com
Cc: ak@linux.intel.com
Cc: linux-kernel@vger.kernel.org
[ The patch also silently fixes another bug: rapl_pmu_init() didn't handle the memory alloc failure case previously. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[backport by whissi]
Cc: Thomas D <whissi@whissi.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agopartitions: aix.c: off by one bug
Dan Carpenter [Tue, 5 Aug 2014 08:09:59 +0000 (11:09 +0300)]
partitions: aix.c: off by one bug

commit d97a86c170b4e432f76db072a827fe30b4d6f659 upstream.

The lvip[] array has "state->limit" elements so the condition here
should be >= instead of >.
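
The classic bound check, sketched generically: with state->limit elements,
valid indexes are 0 .. state->limit - 1:

  if (lv_ix >= state->limit)    /* '>' allowed one index past the end */
          continue;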

Fixes: 6ceea22bbbc8 ('partitions: add aix lvm partition support files')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Philippe De Muyter <phdm@macqel.be>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agodmaengine: dw: don't perform DMA when dmaengine_submit is called
Andy Shevchenko [Wed, 18 Jun 2014 09:15:38 +0000 (12:15 +0300)]
dmaengine: dw: don't perform DMA when dmaengine_submit is called

commit dd8ecfcac66b4485416b2d1df0ec4798b198d7d6 upstream.

According to discussion [1] and the follow-up documentation, the DMA controller
driver shouldn't start any DMA operations when dmaengine_submit() is called.

This patch fixes the workflow in dw_dmac driver to follow the documentation.

[1] http://www.spinics.net/lists/arm-kernel/msg125987.html

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Cc: "Petallo, MauriceX R" <mauricex.r.petallo@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agodmaengine: dw: introduce dwc_dostart_first_queued() helper
Andy Shevchenko [Wed, 18 Jun 2014 09:15:36 +0000 (12:15 +0300)]
dmaengine: dw: introduce dwc_dostart_first_queued() helper

commit e7637c6c0382485f4d2e20715d058dae6f2b6a7c upstream.

We have duplicated code that starts the first descriptor in the queue. Let's
factor it out into a separate helper that can be used in the future as well.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Cc: "Petallo, MauriceX R" <mauricex.r.petallo@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoserial: 8250_dma: check the result of TX buffer mapping
Heikki Krogerus [Mon, 28 Apr 2014 12:59:56 +0000 (15:59 +0300)]
serial: 8250_dma: check the result of TX buffer mapping

commit d4089a332883ad969700aac5dd4dd5f1c4fee825 upstream.

Use dma_mapping_error() to make sure the mapping did not
fail.
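
The pattern, sketched (not the exact 8250_dma code):

  dma->tx_addr = dma_map_single(dev, xmit->buf, UART_XMIT_SIZE,
                                DMA_TO_DEVICE);
  if (dma_mapping_error(dev, dma->tx_addr))
          return -ENOMEM;    /* don't hand a bogus address to the engine */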

Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Cc: "Petallo, MauriceX R" <mauricex.r.petallo@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoPM / sleep: Use valid_state() for platform-dependent sleep states only
Rafael J. Wysocki [Mon, 26 May 2014 11:40:53 +0000 (13:40 +0200)]
PM / sleep: Use valid_state() for platform-dependent sleep states only

commit 43e8317b0bba1d6eb85f38a4a233d82d7c20d732 upstream.

Use the observation that, for platform-dependent sleep states
(PM_SUSPEND_STANDBY, PM_SUSPEND_MEM), a given state is either
always supported or always unsupported and store that information
in pm_states[] instead of calling valid_state() every time we
need to check it.

Also do not use valid_state() for PM_SUSPEND_FREEZE, which is always
valid, and move the pm_test_level validity check for PM_SUSPEND_FREEZE
directly into enter_state().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoPM / sleep: Add state field to pm_states[] entries
Rafael J. Wysocki [Mon, 26 May 2014 11:40:47 +0000 (13:40 +0200)]
PM / sleep: Add state field to pm_states[] entries

commit 27ddcc6596e50cb8f03d2e83248897667811d8f6 upstream.

To allow sleep states corresponding to the "mem", "standby" and
"freeze" lables to be different from the pm_states[] indexes of
those strings, introduce struct pm_sleep_state, consisting of
a string label and a state number, and turn pm_states[] into an
array of objects of that type.

This modification should not lead to any functional changes.
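
A sketch of the new entry type and table (field names assumed from the
description above):

  struct pm_sleep_state {
          const char *label;
          suspend_state_t state;
  };

  static struct pm_sleep_state pm_states[PM_SUSPEND_MAX] = {
          [PM_SUSPEND_FREEZE]  = { "freeze",  PM_SUSPEND_FREEZE },
          [PM_SUSPEND_STANDBY] = { "standby", PM_SUSPEND_STANDBY },
          [PM_SUSPEND_MEM]     = { "mem",     PM_SUSPEND_MEM },
  };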

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoipvs: fix ipv6 hook registration for local replies
Julian Anastasov [Fri, 22 Aug 2014 14:53:41 +0000 (17:53 +0300)]
ipvs: fix ipv6 hook registration for local replies

commit eb90b0c734ad793d5f5bf230a9e9a4dcc48df8aa upstream.

commit fc604767613b6d2036cdc35b660bc39451040a47
("ipvs: changes for local real server") from 2.6.37
introduced DNAT support to local real server but the
IPv6 LOCAL_OUT handler ip_vs_local_reply6() is
registered incorrectly as IPv4 hook causing any outgoing
IPv4 traffic to be dropped depending on the IP header values.

Chris tracked down the problem to CONFIG_IP_VS_IPV6=y
Bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349768

Reported-by: Chris J Arges <chris.j.arges@canonical.com>
Tested-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agonetfilter: x_tables: allow to use default cgroup match
Daniel Borkmann [Mon, 18 Aug 2014 13:46:28 +0000 (15:46 +0200)]
netfilter: x_tables: allow to use default cgroup match

commit caa8ad94edf686d02b555c65a6162c0d1b434958 upstream.

There's actually no good reason why we cannot use cgroup id 0,
so let's just remove this artificial barrier.

Reported-by: Alexey Perevalov <a.perevalov@samsung.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Alexey Perevalov <a.perevalov@samsung.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoipvs: Maintain all DSCP and ECN bits for ipv6 tun forwarding
Alex Gartrell [Wed, 16 Jul 2014 22:57:34 +0000 (15:57 -0700)]
ipvs: Maintain all DSCP and ECN bits for ipv6 tun forwarding

commit 76f084bc10004b3050b2cff9cfac29148f1f6088 upstream.

Previously, only the four high bits of the tclass were maintained in the
ipv6 case.  This matches the behavior of ipv4, though whether or not we
should reflect ECN bits may be up for debate.

Signed-off-by: Alex Gartrell <agartrell@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agonetfilter: xt_hashlimit: perform garbage collection from process context
Eric Dumazet [Thu, 24 Jul 2014 04:36:50 +0000 (06:36 +0200)]
netfilter: xt_hashlimit: perform garbage collection from process context

commit 7bd8490eef9776ced7632345df5133384b6be0fe upstream.

xt_hashlimit cannot be used with large hash tables, because the garbage
collector is run from a timer. If the table is really big, it's possible
to hold the CPU for more than 500 msec, which is unacceptable.

Switch to a work queue, and use proper scheduling points to remove
latency spikes.
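
Roughly, the gc moves from a timer to deferred work with explicit scheduling
points (a sketch; field names assumed):

  static void htable_gc(struct work_struct *work)
  {
          struct xt_hashlimit_htable *ht =
                  container_of(work, struct xt_hashlimit_htable, gc_work.work);
          unsigned int i;

          for (i = 0; i < ht->cfg.size; i++) {
                  /* ... expire stale entries in bucket i ... */
                  cond_resched();    /* fine here: we run in process context */
          }

          queue_delayed_work(system_power_efficient_wq, &ht->gc_work,
                             msecs_to_jiffies(ht->cfg.gc_interval));
  }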

Later, we also could switch to a smoother garbage collection done
at lookup time, one bucket at a time...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack
Julian Anastasov [Thu, 10 Jul 2014 06:24:01 +0000 (09:24 +0300)]
ipvs: avoid netns exit crash on ip_vs_conn_drop_conntrack

commit 2627b7e15c5064ddd5e578e4efd948d48d531a3f upstream.

commit 8f4e0a18682d91 ("IPVS netns exit causes crash in conntrack")
added a second ip_vs_conn_drop_conntrack call instead of just adding
the needed check. As a result, the first call can still cause a
crash on netns exit. Remove it.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Hans Schillstrom <hans@schillstrom.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: initialise start_next_window for READ case to avoid hang
NeilBrown [Mon, 22 Sep 2014 00:06:23 +0000 (10:06 +1000)]
md/raid1: initialise start_next_window for READ case to avoid hang

commit f0cc9a057151892b885be21a1d19b0185568281d upstream.

r1_bio->start_next_window is not initialised in the READ
case, so allow_barrier may incorrectly decrement
   conf->current_window_requests
which can cause raise_barrier() to block forever.

Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: fix_read_error should act on all non-faulty devices.
NeilBrown [Thu, 18 Sep 2014 01:09:04 +0000 (11:09 +1000)]
md/raid1: fix_read_error should act on all non-faulty devices.

commit b8cb6b4c121e1bf1963c16ed69e7adcb1bc301cd upstream.

If a device is being recovered it is not InSync and is not Faulty.

If a read error is experienced on that device, fix_read_error()
will be called, but it ignores non-InSync devices.  So it will
neither fix the error nor fail the device.

It is incorrect that fix_read_error() ignores non-InSync devices.
It should only ignore Faulty devices.  So fix it.
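
Conceptually the predicate flips from "only In_sync devices" to "everything
except Faulty devices" (a sketch, not the exact diff):

  /* before: recovering devices (not yet In_sync) were silently skipped */
  if (rdev && test_bit(In_sync, &rdev->flags))
          ...

  /* after: only genuinely Faulty devices are ignored */
  if (rdev && !test_bit(Faulty, &rdev->flags))
          ...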

This became a bug when we allowed reading from a device that was being
recovered.  It is suitable for any subsequent -stable kernel.

Fixes: da8840a747c0dbf49506ec906757a6b87b9741e9
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: count resync requests in nr_pending.
NeilBrown [Tue, 16 Sep 2014 02:14:14 +0000 (12:14 +1000)]
md/raid1: count resync requests in nr_pending.

commit 34e97f170149bfa14979581c4c748bc9b4b79d5b upstream.

Both normal IO and resync IO can be retried with reschedule_retry()
and so be counted into ->nr_queued, but only normal IO gets counted in
->nr_pending.

Before the recent improvement to RAID1 resync there could only
possibly have been one or the other on the queue.  When handling a
read failure it could only be normal IO.  So when handle_read_error()
called freeze_array() the fact that freeze_array only compares
->nr_queued against ->nr_pending was safe.

But now that these two types can interleave, we can have both normal
and resync IO requests queued, so we need to count them both in
nr_pending.

This error can lead to freeze_array() hanging if there is a read
error, so it is suitable for -stable.

Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: update next_resync under resync_lock.
NeilBrown [Wed, 10 Sep 2014 06:01:24 +0000 (16:01 +1000)]
md/raid1: update next_resync under resync_lock.

commit c2fd4c94deedb89ac1746c4a53219be499372c06 upstream.

raise_barrier() uses next_resync as part of its calculations, so it
really should be updated first, instead of afterwards.

next_resync is always used under resync_lock, so update it under the
resync_lock too, just before it is used.  That is safest.

This could cause normal IO and resync IO to interact badly, so it is
suitable for -stable.

Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: Don't use next_resync to determine how far resync has progressed
NeilBrown [Wed, 10 Sep 2014 05:56:57 +0000 (15:56 +1000)]
md/raid1: Don't use next_resync to determine how far resync has progressed

commit 235549605eb7f1c5a37cef8b09d12e6d412c5cd6 upstream.

next_resync is (approximately) the location for the next resync request.
However it does *not* reliably determine the earliest location
at which resync might be happening.
This is because resync requests can complete out of order, and
we only limit the number of current requests, not the distance
from the earliest pending request to the latest.

mddev->curr_resync_completed is a reliable indicator of the earliest
position at which resync could be happening.  It is updated less
frequently, but is actually reliable, which is more important.

So use it to determine if a write request is before the region
being resynced and so safe from conflict.

This error can allow resync IO to interfere with normal IO which
could lead to data corruption. Hence: stable.

Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: make sure resync waits for conflicting writes to complete.
NeilBrown [Wed, 10 Sep 2014 05:01:49 +0000 (15:01 +1000)]
md/raid1: make sure resync waits for conflicting writes to complete.

commit 2f73d3c55d09ce60647b96ad2a9b539c95a530ee upstream.

The resync/recovery process for raid1 was recently changed
so that writes could happen in parallel with resync providing
they were in different regions of the device.

There is a problem though:  While a write request will always
wait for conflicting resync to complete, a resync request
will *not* always wait for conflicting writes to complete.

Two changes are needed to fix this:

1/ raise_barrier (which waits until it is safe to do resync)
   must wait until current_window_requests is zero
2/ wait_barrier (which waits at the start of a new write request)
   must update current_window_requests if the request could
   possibly conflict with a concurrent resync.

As concurrent writes and resync can lead to data loss,
this patch is suitable for -stable.

Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Cc: majianpeng <majianpeng@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: be more cautious where we read-balance during resync.
NeilBrown [Tue, 9 Sep 2014 03:49:46 +0000 (13:49 +1000)]
md/raid1: be more cautious where we read-balance during resync.

commit c6d119cf1b5a778e9ed60a006e2a434fcc4471a2 upstream.

commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 made
it possible for reads to happen concurrently with resync.
This means that we need to be more careful where read_balancing
is allowed during resync - we can no longer be sure that any
resync that has already started will definitely finish.

So keep read_balancing to before recovery_cp, which is conservative
but safe.

This bug makes it possible to read from a device that doesn't
have up-to-date data, so it can cause data corruption.
So it is suitable for any kernel since 3.11.

Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomd/raid1: clean up request counts properly in close_sync()
NeilBrown [Thu, 4 Sep 2014 06:30:38 +0000 (16:30 +1000)]
md/raid1: clean up request counts properly in close_sync()

commit 669cc7ba77864e7b1ac39c9f2b2afb8730f341f4 upstream.

If there are outstanding writes when close_sync is called,
the change to ->start_next_window might cause them to
decrement the wrong counter when they complete.  Fix this
by merging the two counters into the one that will be decremented.

Having an incorrect value in a counter can cause raise_barrier()
to hang, so this is suitable for -stable.

Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomedia: adv7604: fix inverted condition
Hans Verkuil [Fri, 12 Sep 2014 09:02:02 +0000 (06:02 -0300)]
media: adv7604: fix inverted condition

commit 77639ff2b3404a913b8037d230a384798b854bae upstream.

The log_status function should show HDMI information, but the test checking for
an HDMI input was inverted. Fix this.

Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomedia: cx18: fix kernel oops with tda8290 tuner
Hans Verkuil [Tue, 26 Aug 2014 05:59:53 +0000 (02:59 -0300)]
media: cx18: fix kernel oops with tda8290 tuner

commit 6a03dc92cc2edfa2257502557b9f714893987383 upstream.

This was caused by an uninitialized setup.config field.

Based on a suggestion from Devin Heitmueller.

Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Thanks-to: Devin Heitmueller <dheitmueller@kernellabs.com>
Reported-by: Scott Robinson <scott.robinson55@gmail.com>
Tested-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomedia: af9035: new IDs: add support for PCTV 78e and PCTV 79e
Malcolm Priestley [Tue, 5 Aug 2014 09:19:16 +0000 (06:19 -0300)]
media: af9035: new IDs: add support for PCTV 78e and PCTV 79e

commit a04646c045cab08a9e62b9be8f01ecbb0632d24e upstream.

Add the following IDs:
USB_PID_PCTV_78E (0x025a) for PCTV 78e
USB_PID_PCTV_79E (0x0262) for PCTV 79e

Both are it9135 devices.

Signed-off-by: Malcolm Priestley <tvboxspy@gmail.com>
Cc: Antti Palosaari <crope@iki.fi>
Signed-off-by: Antti Palosaari <crope@iki.fi>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agocpufreq: release policy->rwsem on error
Prarit Bhargava [Wed, 10 Sep 2014 14:12:08 +0000 (10:12 -0400)]
cpufreq: release policy->rwsem on error

commit 7106e02baed4a72fb23de56b02ad4d31daa74d95 upstream.

While debugging a cpufreq-related hardware failure on a system I saw the
following lockdep warning:

 =========================
 [ BUG: held lock freed! ]
 3.17.0-rc4+ #1 Tainted: G            E
 -------------------------
 insmod/2247 is freeing memory ffff88006e1b1400-ffff88006e1b17ff, with a lock still held there!
  (&policy->rwsem){+.+...}, at: [<ffffffff8156d37d>] __cpufreq_add_dev.isra.21+0x47d/0xb80
 3 locks held by insmod/2247:
  #0:  (subsys mutex#5){+.+.+.}, at: [<ffffffff81485579>] subsys_interface_register+0x69/0x120
  #1:  (cpufreq_rwsem){.+.+.+}, at: [<ffffffff8156cf73>] __cpufreq_add_dev.isra.21+0x73/0xb80
  #2:  (&policy->rwsem){+.+...}, at: [<ffffffff8156d37d>] __cpufreq_add_dev.isra.21+0x47d/0xb80

 stack backtrace:
 CPU: 0 PID: 2247 Comm: insmod Tainted: G            E  3.17.0-rc4+ #1
 Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 08/24/2013
  0000000000000000 000000008f3063c4 ffff88006f87bb30 ffffffff8171b358
  ffff88006bcf3750 ffff88006f87bb68 ffffffff810e09e1 ffff88006e1b1400
  ffffea0001b86c00 ffffffff8156d327 ffff880073003500 0000000000000246
 Call Trace:
  [<ffffffff8171b358>] dump_stack+0x4d/0x66
  [<ffffffff810e09e1>] debug_check_no_locks_freed+0x171/0x180
  [<ffffffff8156d327>] ? __cpufreq_add_dev.isra.21+0x427/0xb80
  [<ffffffff8121412b>] kfree+0xab/0x2b0
  [<ffffffff8156d327>] __cpufreq_add_dev.isra.21+0x427/0xb80
  [<ffffffff81724cf7>] ? _raw_spin_unlock+0x27/0x40
  [<ffffffffa003517f>] ? pcc_cpufreq_do_osc+0x17f/0x17f [pcc_cpufreq]
  [<ffffffff8156da8e>] cpufreq_add_dev+0xe/0x10
  [<ffffffff814855d1>] subsys_interface_register+0xc1/0x120
  [<ffffffff8156bcf2>] cpufreq_register_driver+0x112/0x340
  [<ffffffff8121415a>] ? kfree+0xda/0x2b0
  [<ffffffffa003517f>] ? pcc_cpufreq_do_osc+0x17f/0x17f [pcc_cpufreq]
  [<ffffffffa003562e>] pcc_cpufreq_init+0x4af/0xe81 [pcc_cpufreq]
  [<ffffffffa003517f>] ? pcc_cpufreq_do_osc+0x17f/0x17f [pcc_cpufreq]
  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
  [<ffffffff811f7472>] ? __vunmap+0xd2/0x120
  [<ffffffff81127155>] load_module+0x1315/0x1b70
  [<ffffffff811222a0>] ? store_uevent+0x70/0x70
  [<ffffffff811229d9>] ? copy_module_from_fd.isra.44+0x129/0x180
  [<ffffffff81127b86>] SyS_finit_module+0xa6/0xd0
  [<ffffffff81725b69>] system_call_fastpath+0x16/0x1b
 cpufreq: __cpufreq_add_dev: ->get() failed
insmod: ERROR: could not insert module pcc-cpufreq.ko: No such device

The warning occurs in the __cpufreq_add_dev() code which does

        down_write(&policy->rwsem);
...
        if (cpufreq_driver->get && !cpufreq_driver->setpolicy) {
                policy->cur = cpufreq_driver->get(policy->cpu);
                if (!policy->cur) {
                        pr_err("%s: ->get() failed\n", __func__);
                        goto err_get_freq;
                }

If cpufreq_driver->get(policy->cpu) returns an error we execute the
code at err_get_freq, which does not up the policy->rwsem.  This causes
the lockdep warning.

Trivial patch to up the policy->rwsem in the error path.
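
In outline, the error label simply gains the missing unlock (sketch):

  err_get_freq:
          up_write(&policy->rwsem);    /* pair with the down_write() above */
          /* ... existing error cleanup continues ... */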

After the patch has been applied, and an error occurs in the
cpufreq_driver->get(policy->cpu) call we will now see

cpufreq: __cpufreq_add_dev: ->get() failed
cpufreq: __cpufreq_add_dev: ->get() failed
modprobe: ERROR: could not insert 'pcc_cpufreq': No such device

Fixes: 4e97b631f24c (cpufreq: Initialize governor for a new policy under policy->rwsem)
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agonl80211: clear skb cb before passing to netlink
Johannes Berg [Wed, 30 Jul 2014 12:55:26 +0000 (14:55 +0200)]
nl80211: clear skb cb before passing to netlink

commit bd8c78e78d5011d8111bc2533ee73b13a3bd6c42 upstream.

In testmode and vendor command reply/event SKBs we use the
skb cb data to store nl80211 parameters between allocation
and sending. This causes the code for CONFIG_NETLINK_MMAP
to get confused, because it takes ownership of the skb cb
data when the SKB is handed off to netlink, and it doesn't
explicitly clear it.

Clear the skb cb explicitly when we're done and before it
gets passed to netlink to avoid this issue.
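
The fix boils down to one line before the SKB is handed over (sketch):

  /* nl80211 is done with its private cb area; netlink owns it from here */
  memset(skb->cb, 0, sizeof(skb->cb));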

Reported-by: Assaf Azulay <assaf.azulay@intel.com>
Reported-by: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoFix nasty 32-bit overflow bug in buffer i/o code.
Anton Altaparmakov [Mon, 22 Sep 2014 00:53:03 +0000 (01:53 +0100)]
Fix nasty 32-bit overflow bug in buffer i/o code.

commit f2d5a94436cc7cc0221b9a81bba2276a25187dd3 upstream.

On 32-bit architectures, the legacy buffer_head functions are not always
handling the sector number with the proper 64-bit types, and will thus
fail on 4TB+ disks.

Any code that uses __getblk() (and thus bread(), breadahead(),
sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
block on a 32-bit arch (where "long" is 32-bit) causes an infinite loop
in __getblk_slow() with an infinite stream of errors logged to dmesg
like this:

  __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
  b_state=0x00000020, b_size=512
  device sda1 blocksize: 512

Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
top 32-bits are missing (in this case the 0x1 at the top).

This is because grow_dev_page() is broken and has a 32-bit overflow due
to shifting the page index value (a pgoff_t - which is just 32 bits on
32-bit architectures) left as the block number.  The top bits get lost
because the pgoff_t is not type cast to sector_t / 64-bit before the shift.

This patch fixes this issue by type casting "index" to sector_t before
doing the left shift.
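
In essence (simplified from grow_dev_page()):

  pgoff_t index;     /* 32-bit on 32-bit architectures */
  sector_t block;

  block = index << sizebits;              /* bug: 32-bit shift, top bits lost */
  block = (sector_t)index << sizebits;    /* fix: widen first, then shift */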

Note this is not a theoretical bug but has been seen in the field on a
4TiB hard drive with logical sector size 512 bytes.

This patch has been verified to fix the infinite loop problem on 3.17-rc5
kernel using a 4TB disk image mounted using "-o loop".  Without this patch
doing a "find /nt" where /nt is an NTFS volume causes the inifinite loop
100% reproducibly whilst with the patch it works fine as expected.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agodrm/radeon/px: fix module unload
Alex Deucher [Fri, 12 Sep 2014 22:00:53 +0000 (18:00 -0400)]
drm/radeon/px: fix module unload

commit 2e97140dd58cab8772bf77d73eabda213e45202d upstream.

Use the new vga_switcheroo_fini_domain_pm_ops function
to unregister the pm ops.

Based on a patch from:
Pali Rohár <pali.rohar@gmail.com>

bug:
https://bugzilla.kernel.org/show_bug.cgi?id=84431

Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agodrm/nouveau/runpm: fix module unload
Alex Deucher [Fri, 12 Sep 2014 22:06:56 +0000 (18:06 -0400)]
drm/nouveau/runpm: fix module unload

commit 53beaa01e0fe8e4202f43485a03b32fcf5dfea74 upstream.

Use the new vga_switcheroo_fini_domain_pm_ops function
to unregister the pm ops.

Based on a patch from:
Pali Rohár <pali.rohar@gmail.com>

bug:
https://bugzilla.kernel.org/show_bug.cgi?id=84431

Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agovgaswitcheroo: add vga_switcheroo_fini_domain_pm_ops
Alex Deucher [Fri, 12 Sep 2014 21:51:29 +0000 (17:51 -0400)]
vgaswitcheroo: add vga_switcheroo_fini_domain_pm_ops

commit 766a53d059d1500c9755c8af017bd411bd8f1b20 upstream.

Drivers should call this on unload to unregister pmops.

Bug:
https://bugzilla.kernel.org/show_bug.cgi?id=84431

Reviewed-by: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoperf: Fix a race condition in perf_remove_from_context()
Cong Wang [Tue, 2 Sep 2014 22:27:20 +0000 (15:27 -0700)]
perf: Fix a race condition in perf_remove_from_context()

commit 3577af70a2ce4853d58e57d832e687d739281479 upstream.

We saw a kernel soft lockup in perf_remove_from_context();
it looks like the `perf` process, when exiting, could not get
out of the retry loop. Meanwhile, the target process was forking
a child. So either the target process should execute the smp
function call to deactivate the event (if it was running) or it should
do a context switch which deactivates the event.

It seems we optimize out a context switch in perf_event_context_sched_out(),
and, more importantly, we still test an obsolete task pointer when
retrying, so no one would actually deactivate that event in this situation.
Fix it directly by reloading the task pointer in perf_remove_from_context().

This should cure the above soft lockup.

Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409696840-843-1-git-send-email-xiyou.wangcong@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoIB/core: When marshaling uverbs path, clear unused fields
Matan Barak [Tue, 2 Sep 2014 12:32:34 +0000 (15:32 +0300)]
IB/core: When marshaling uverbs path, clear unused fields

commit a59c5850f09b4c2d6ad2fc47e5e1be8d654529d6 upstream.

When marshalling a user path to the kernel struct ib_sa_path, we need
to zero smac and dmac and set the vlan id to the "no vlan" value.

Fixes: dd5f03beb4f7 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
Reported-by: Aleksey Senin <alekseys@mellanox.com>
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoIB/mlx4: Don't duplicate the default RoCE GID
Moni Shoua [Thu, 21 Aug 2014 11:28:38 +0000 (14:28 +0300)]
IB/mlx4: Don't duplicate the default RoCE GID

commit f5c4834d9328c4ed9fe5dcbec6128d6da16db69a upstream.

When reading the IPv6 addresses from the net-device, make sure to
avoid adding a duplicate entry to the GID table because of equality
between the default GID we generate and the default IPv6 link-local
address of the device.

Fixes: acc4fccf4eff ("IB/mlx4: Make sure GID index 0 is always occupied")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoIB/mlx4: Avoid null pointer dereference in mlx4_ib_scan_netdevs()
Moni Shoua [Thu, 21 Aug 2014 11:28:37 +0000 (14:28 +0300)]
IB/mlx4: Avoid null pointer dereference in mlx4_ib_scan_netdevs()

commit e381835cf1b8e3b2857277dbc3b77d8c5350f70a upstream.

When Ethernet netdev is not present for a port (e.g. when the link
layer type of the port is InfiniBand) it's possible to dereference a
null pointer when we do netdevice scanning.

To fix that, we move a section of code that needs to run only when
netdev is present to a proper if () statement.

Fixes: ad4885d279b6 ("IB/mlx4: Build the port IBoE GID table properly under bonding")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoIB/qib: Correct reference counting in debugfs qp_stats
Mike Marciniszyn [Fri, 19 Sep 2014 12:32:19 +0000 (08:32 -0400)]
IB/qib: Correct reference counting in debugfs qp_stats

commit 85cbb7c728bf39c45a9789b88c9471c0d7a58b0e upstream.

This particular reference count is not needed with the rcu protection,
and the current code leaks a reference count, causing a hang in
qib_qp_destroy().

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoGFS2: fix d_splice_alias() misuses
Al Viro [Fri, 12 Sep 2014 19:56:04 +0000 (20:56 +0100)]
GFS2: fix d_splice_alias() misuses

commit cfb2f9d5c921e38b0f12bb26fed10b877664444d upstream.

Callers of d_splice_alias(dentry, inode) don't need iput(), neither
on success nor on failure.  Either the reference to inode is stored
in a previously negative dentry, or it's dropped.  In either case
inode reference the caller used to hold is consumed.

__gfs2_lookup() does iput() in case when d_splice_alias() has failed.
Double iput() if we ever hit that.  And gfs2_create_inode() ends up
not only with double iput(), but with link count dropped to zero - on
an inode it has just found in directory.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoalarmtimer: Lock k_itimer during timer callback
Richard Larocque [Wed, 10 Sep 2014 01:31:05 +0000 (18:31 -0700)]
alarmtimer: Lock k_itimer during timer callback

commit 474e941bed9262f5fa2394f9a4a67e24499e5926 upstream.

Locks the k_itimer's it_lock member when handling the alarm timer's
expiry callback.

The regular posix timers defined in posix-timers.c have this lock held
during timeout processing because their callbacks are routed through
posix_timer_fn().  The alarm timers follow a different path, so they
ought to grab the lock somewhere else.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoalarmtimer: Do not signal SIGEV_NONE timers
Richard Larocque [Wed, 10 Sep 2014 01:31:04 +0000 (18:31 -0700)]
alarmtimer: Do not signal SIGEV_NONE timers

commit 265b81d23a46c39df0a735a3af4238954b41a4c2 upstream.

Avoids sending a signal to alarm timers created with sigev_notify set to
SIGEV_NONE by checking for that special case in the timeout callback.

The regular posix timers avoid sending signals to SIGEV_NONE timers by
not scheduling any callbacks for them in the first place.  Although it
would be possible to do something similar for alarm timers, it's simpler
to handle this as a special case in the timeout.

Prior to this patch, the alarm timer would ignore the sigev_notify value
and try to deliver signals to the process anyway.  Even worse, the
sanity check for the value of sigev_signo is skipped when SIGEV_NONE was
specified, so the signal number could be bogus.  If sigev_signo was an
uninitialized value (as it often would be if SIGEV_NONE is used), then
it's hard to predict which signal will be sent.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoalarmtimer: Return relative times in timer_gettime
Richard Larocque [Wed, 10 Sep 2014 01:31:03 +0000 (18:31 -0700)]
alarmtimer: Return relative times in timer_gettime

commit e86fea764991e00a03ff1e56409ec9cacdbda4c9 upstream.

Returns the time remaining for an alarm timer, rather than the time at
which it is scheduled to expire.  If the timer has already expired or it
is not currently scheduled, the it_value's members are set to zero.

This new behavior matches that of the other posix-timers and the POSIX
specifications.

This is a change in user-visible behavior, and may break existing
applications.  Hopefully, few users rely on the old incorrect behavior.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: Richard Larocque <rlarocque@google.com>
[jstultz: minor style tweak]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoparisc: Only use -mfast-indirect-calls option for 32-bit kernel builds
John David Anglin [Tue, 23 Sep 2014 00:54:50 +0000 (20:54 -0400)]
parisc: Only use -mfast-indirect-calls option for 32-bit kernel builds

commit d26a7730b5874a5fa6779c62f4ad7c5065a94723 upstream.

In spite of what the GCC manual says, the -mfast-indirect-calls option has
never been supported in the 64-bit parisc compiler. Indirect calls have
always been done using function descriptors irrespective of the
-mfast-indirect-calls option.

Recently, it was noticed that a function descriptor was always requested
when the -mfast-indirect-calls option was specified. This caused
problems when the option was used in application code and doesn't make
any sense because the whole point of the option is to avoid using a
function descriptor for indirect calls.

Fixing this broke 64-bit kernel builds.

I will fix GCC but for now we need the attached change. This results in
the same kernel code as before.

Signed-off-by: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoparisc: Implement new LWS CAS supporting 64 bit operations.
Guy Martin [Fri, 12 Sep 2014 16:02:34 +0000 (18:02 +0200)]
parisc: Implement new LWS CAS supporting 64 bit operations.

commit 89206491201cbd1571009b36292af781cef74c1b upstream.

The current LWS CAS only works correctly for 32-bit operations. The new LWS
allows for CAS operations of variable size.

Signed-off-by: Guy Martin <gmsoft@tuxicoman.be>
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agodon't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()
Al Viro [Sun, 14 Sep 2014 01:55:46 +0000 (21:55 -0400)]
don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()

commit 7bd88377d482e1eae3c5329b12e33cfd664fa6a9 upstream.

return the value instead, and have path_init() do the assignment.  Broken by
"vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
which was Cc-stable with 2.6.38+ as destination.  This one should go where
it went.

To avoid dummy value returned in case when root is already set (it would do
no harm, actually, since the only caller that doesn't ignore the return value
is guaranteed to have nd->root *not* set, but it's more obvious that way),
lift the check into callers.  And do the same to set_root(), to keep them
in sync.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agopowerpc: Add smp_mb()s to arch_spin_unlock_wait()
Michael Ellerman [Thu, 7 Aug 2014 05:36:18 +0000 (15:36 +1000)]
powerpc: Add smp_mb()s to arch_spin_unlock_wait()

commit 78e05b1421fa41ae8457701140933baa5e7d9479 upstream.

Similar to the previous commit which described why we need to add a
barrier to arch_spin_is_locked(), we have a similar problem with
spin_unlock_wait().

We need a barrier on entry to ensure any spinlock we have previously
taken is visibly locked prior to the load of lock->slock.

It's also not clear if spin_unlock_wait() is intended to have ACQUIRE
semantics. For now be conservative and add a barrier on exit to give it
ACQUIRE semantics.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agopowerpc: Add smp_mb() to arch_spin_is_locked()
Michael Ellerman [Thu, 7 Aug 2014 05:36:17 +0000 (15:36 +1000)]
powerpc: Add smp_mb() to arch_spin_is_locked()

commit 51d7d5205d3389a32859f9939f1093f267409929 upstream.

The kernel defines the function spin_is_locked(), which can be used to
check if a spinlock is currently locked.

Using spin_is_locked() on a lock you don't hold is obviously racy. That
is, even though you may observe that the lock is unlocked, it may become
locked at any time.

There is (at least) one exception to that, which is if two locks are
used as a pair, and the holder of each checks the status of the other
before doing any update.

Assuming *A and *B are two locks, and *COUNTER is a shared non-atomic
value:

The first CPU does:

spin_lock(*A)

if spin_is_locked(*B)
# nothing
else
smp_mb()
LOAD r = *COUNTER
r++
STORE *COUNTER = r

spin_unlock(*A)

And the second CPU does:

spin_lock(*B)

if spin_is_locked(*A)
# nothing
else
smp_mb()
LOAD r = *COUNTER
r++
STORE *COUNTER = r

spin_unlock(*B)

Although this is a strange locking construct, it should work.

It seems to be understood, but not documented, that spin_is_locked() is
not a memory barrier, so in the examples above and below the caller
inserts its own memory barrier before acting on the result of
spin_is_locked().

For now we assume spin_is_locked() is implemented as below, and we break
it out in our examples:

bool spin_is_locked(*LOCK) {
LOAD l = *LOCK
return l.locked
}

Our intuition is that there should be no problem even if the two code
sequences run simultaneously such as:

CPU 0                           CPU 1
==================================================
spin_lock(*A)                   spin_lock(*B)
LOAD b = *B                     LOAD a = *A
if b.locked # true              if a.locked # true
  # nothing                       # nothing
spin_unlock(*A)                 spin_unlock(*B)

If one CPU gets the lock before the other then it will do the update and
the other CPU will back off:

CPU 0                           CPU 1
==================================================
spin_lock(*A)
LOAD b = *B
                                spin_lock(*B)
if b.locked # false             LOAD a = *A
else                            if a.locked # true
  smp_mb()                        # nothing
  LOAD r1 = *COUNTER            spin_unlock(*B)
  r1++
  STORE *COUNTER = r1
spin_unlock(*A)

However in reality spin_lock() itself is not indivisible. On powerpc we
implement it as a load-and-reserve and store-conditional.

Ignoring the retry logic for the lost reservation case, it boils down to:
spin_lock(*LOCK) {
LOAD l = *LOCK
l.locked = true
STORE *LOCK = l
ACQUIRE_BARRIER
}

The ACQUIRE_BARRIER is required to give spin_lock() ACQUIRE semantics as
defined in memory-barriers.txt:

     This acts as a one-way permeable barrier.  It guarantees that all
     memory operations after the ACQUIRE operation will appear to happen
     after the ACQUIRE operation with respect to the other components of
     the system.

On modern powerpc systems we use lwsync for ACQUIRE_BARRIER. lwsync is
also know as "lightweight sync", or "sync 1".

As described in Power ISA v2.07 section B.2.1.1, in this scenario the
lwsync is not the barrier itself. It instead causes the LOAD of *LOCK to
act as the barrier, preventing any loads or stores in the locked region
from occurring prior to the load of *LOCK.

Whether this behaviour is in accordance with the definition of ACQUIRE
semantics in memory-barriers.txt is open to discussion, we may switch to
a different barrier in future.

What this means in practice is that the following can occur:

CPU 0                           CPU 1
==================================================
LOAD a = *A                     LOAD b = *B
a.locked = true                 b.locked = true
LOAD b = *B                     LOAD a = *A
STORE *A = a                    STORE *B = b
if b.locked # false             if a.locked # false
else                            else
  smp_mb()                        smp_mb()
  LOAD r1 = *COUNTER              LOAD r2 = *COUNTER
  r1++                            r2++
  STORE *COUNTER = r1
                                  STORE *COUNTER = r2   # Lost update
spin_unlock(*A)                 spin_unlock(*B)

That is, the load of *B can occur prior to the store that makes *A
visibly locked. And similarly for CPU 1. The result is both CPUs hold
their lock and believe the other lock is unlocked.

The easiest fix for this is to add a full memory barrier to the start of
spin_is_locked(), so adding to our previous definition would give us:

bool spin_is_locked(*LOCK) {
smp_mb()
LOAD l = *LOCK
return l.locked
}

The new barrier orders the store to the lock we are locking vs the load
of the other lock:

CPU 0                           CPU 1
==================================================
LOAD a = *A                     LOAD b = *B
a.locked = true                 b.locked = true
STORE *A = a                    STORE *B = b
smp_mb()                        smp_mb()
LOAD b = *B                     LOAD a = *A
if b.locked # true              if a.locked # true
  # nothing                       # nothing
spin_unlock(*A)                 spin_unlock(*B)

Although the above example is theoretical, there is code similar to this
example in sem_lock() in ipc/sem.c. This commit in addition to the next
commit appears to be a fix for crashes we are seeing in that code where
we believe this race happens in practice.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agopowerpc/perf: Fix ABIv2 kernel backtraces
Anton Blanchard [Tue, 26 Aug 2014 02:44:15 +0000 (12:44 +1000)]
powerpc/perf: Fix ABIv2 kernel backtraces

commit 85101af13bb854a6572fa540df7c7201958624b9 upstream.

ABIv2 kernels are failing to backtrace through the kernel. An example:

39.30%  readseek2_proce  [kernel.kallsyms]    [k] find_get_entry
            |
            --- find_get_entry
               __GI___libc_read

The problem is in valid_next_sp() where we check that the new stack
pointer is at least STACK_FRAME_OVERHEAD below the previous one.

ABIv1 has a minimum stack frame size of 112 bytes, consisting of a 48 byte
fixed frame and a 64 byte parameter save area. ABIv2 changes that to 32 bytes
with no parameter save area.

STACK_FRAME_OVERHEAD is in theory the minimum stack frame size,
but we have over 240 uses of it, some of which assume that it includes
space for the parameter area.

We need to work through all our stack defines and rationalise them
but let's fix perf now by creating STACK_FRAME_MIN_SIZE and using it
in valid_next_sp(). This fixes the issue:

30.64%  readseek2_proce  [kernel.kallsyms]    [k] find_get_entry
            |
            --- find_get_entry
               pagecache_get_page
               generic_file_read_iter
               new_sync_read
               vfs_read
               sys_read
               syscall_exit
               __GI___libc_read

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agobrcmfmac: handle IF event for P2P_DEVICE interface
Arend van Spriel [Fri, 12 Sep 2014 14:19:30 +0000 (16:19 +0200)]
brcmfmac: handle IF event for P2P_DEVICE interface

commit 87c4790330810fe5caf0172d9320cf24ef19cebe upstream.

The firmware notifies about interface changes through the IF event
which has a NO_IF flag that means host can ignore the event. This
behaviour was introduced in the driver by:

  commit 2ee8382fc6c763c76396a6aaff77a27089eed3aa
  Author: Arend van Spriel <arend@broadcom.com>
  Date:   Sat Aug 10 12:27:24 2013 +0200

      brcmfmac: ignore IF event if firmware indicates it

It turns out that the IF event for the P2P_DEVICE also has this
flag set, but the event should not be ignored in this scenario.
The mentioned commit caused a regression in the 3.12 kernel in the
creation of the P2P_DEVICE interface.

Reviewed-by: Hante Meuleman <meuleman@broadcom.com>
Reviewed-by: Franky (Zhenhui) Lin <frankyl@broadcom.com>
Reviewed-by: Daniel (Deognyoun) Kim <dekim@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com>
Signed-off-by: Arend van Spriel <arend@broadcom.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agosched: Fix unreleased llc_shared_mask bit during CPU hotplug
Wanpeng Li [Wed, 24 Sep 2014 08:38:05 +0000 (16:38 +0800)]
sched: Fix unreleased llc_shared_mask bit during CPU hotplug

commit 03bd4e1f7265548832a76e7919a81f3137c44fd1 upstream.

The following bug can be triggered by hot adding and removing a large number of
xen domain0's vcpus repeatedly:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000004 IP: [..] find_busiest_group
PGD 5a9d5067 PUD 13067 PMD 0
Oops: 0000 [#3] SMP
[...]
Call Trace:
load_balance
? _raw_spin_unlock_irqrestore
idle_balance
__schedule
schedule
schedule_timeout
? lock_timer_base
schedule_timeout_uninterruptible
msleep
lock_device_hotplug_sysfs
online_store
dev_attr_store
sysfs_write_file
vfs_write
SyS_write
system_call_fastpath

Last level cache shared mask is built during CPU up and the
build_sched_domain() routine takes advantage of it to setup
the sched domain CPU topology.

However, llc_shared_mask is not released during CPU disable,
which leads to an invalid sched domain CPU topology.

This patch fixes it by releasing the llc_shared_mask correctly
during CPU disable.

Yasuaki also reported that this can happen on real hardware:

  https://lkml.org/lkml/2014/7/22/1018

His case is here:

==
Here is an example on my system.
My system has 4 sockets and each socket has 15 cores and HT is
enabled. In this case, each core of sockes is numbered as
follows:

 | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-44, 90-104
Socket#3 | 45-59, 105-119

Then llc_shared_mask of CPU#30 has 0x3fff80000001fffc0000000.

It means that last level cache of Socket#2 is shared with
CPU#30-44 and 90-104.

When hot-removing socket#2 and #3, each core of sockets is
numbered as follows:

 | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89

But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30
remains having 0x3fff80000001fffc0000000.

After that, when hot-adding socket#2 and #3, each core of
sockets is numbered as follows:

 | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-59
Socket#3 | 90-119

Then llc_shared_mask of CPU#30 becomes
0x3fff8000fffffffc0000000. It means that last level cache of
Socket#2 is shared with CPU#30-59 and 90-104. So the mask has
the wrong value.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Tested-by: Linn Crosetto <linn@hp.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1411547885-48165-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm: softdirty: keep bit when zapping file pte
Peter Feiner [Thu, 25 Sep 2014 23:05:29 +0000 (16:05 -0700)]
mm: softdirty: keep bit when zapping file pte

commit dbab31aa2ceec2d201966fa0b552f151310ba5f4 upstream.

This fixes the same bug as b43790eedd31 ("mm: softdirty: don't forget to
save file map softdiry bit on unmap") and 9aed8614af5a ("mm/memory.c:
don't forget to set softdirty on file mapped fault") where the return
value of pte_*mksoft_dirty was being ignored.

To be sure that no other pte/pmd "mk" function return values were being
ignored, I annotated the functions in arch/x86/include/asm/pgtable.h
with __must_check and rebuilt.

The userspace effect of this bug is that the softdirty mark might be
lost if a file mapped pte gets zapped.
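
The bug pattern is easy to miss because the pte "mk" helpers are pure
functions: they return the updated value rather than modifying the pte in
place (sketch):

  pte_file_mksoft_dirty(ptfile);            /* wrong: result discarded */
  ptfile = pte_file_mksoft_dirty(ptfile);   /* right: keep the returned pte */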

Signed-off-by: Peter Feiner <pfeiner@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agomm, slab: initialize object alignment on cache creation
David Rientjes [Thu, 25 Sep 2014 23:05:20 +0000 (16:05 -0700)]
mm, slab: initialize object alignment on cache creation

commit d4a5fca592b9ab52b90bb261a90af3c8f53be011 upstream.

Since commit 4590685546a3 ("mm/sl[aou]b: Common alignment code"), the
"ralign" automatic variable in __kmem_cache_create() may be used as
uninitialized.

The proper alignment defaults to BYTES_PER_WORD and can be overridden by
SLAB_RED_ZONE or the alignment specified by the caller.

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrei Elovikov <a.elovikov@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoocfs2/dlm: do not get resource spinlock if lockres is new
Joseph Qi [Thu, 25 Sep 2014 23:05:16 +0000 (16:05 -0700)]
ocfs2/dlm: do not get resource spinlock if lockres is new

commit 5760a97c7143c208fa3a8f8cad0ed7dd672ebd28 upstream.

There is a deadlock case which was reported by Guozhonghua:
  https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html

This case is caused by &res->spinlock and &dlm->master_lock
misordering in different threads.

It was introduced by commit 8d400b81cc83 ("ocfs2/dlm: Clean up refmap
helpers").  Since lockres is new, it doesn't not require the
&res->spinlock.  So remove it.
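
The underlying problem is the classic AB-BA lock ordering deadlock. A
userspace analogue (pthread mutexes standing in for &dlm->master_lock and
&res->spinlock) shows the shape of it; running it may well hang, which is
the point:

  #include <pthread.h>

  /* stand-ins for &dlm->master_lock and &res->spinlock */
  static pthread_mutex_t master_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t res_lock    = PTHREAD_MUTEX_INITIALIZER;

  static void *path_a(void *arg)          /* master_lock, then res_lock */
  {
          (void)arg;
          pthread_mutex_lock(&master_lock);
          pthread_mutex_lock(&res_lock);
          pthread_mutex_unlock(&res_lock);
          pthread_mutex_unlock(&master_lock);
          return NULL;
  }

  static void *path_b(void *arg)          /* opposite order: AB-BA, can deadlock */
  {
          (void)arg;
          pthread_mutex_lock(&res_lock);
          pthread_mutex_lock(&master_lock);
          pthread_mutex_unlock(&master_lock);
          pthread_mutex_unlock(&res_lock);
          return NULL;
  }

  int main(void)
  {
          pthread_t a, b;

          pthread_create(&a, NULL, path_a, NULL);
          pthread_create(&b, NULL, path_b, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          return 0;
  }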

Fixes: 8d400b81cc83 ("ocfs2/dlm: Clean up refmap helpers")
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agonilfs2: fix data loss with mmap()
Andreas Rohner [Thu, 25 Sep 2014 23:05:14 +0000 (16:05 -0700)]
nilfs2: fix data loss with mmap()

commit 56d7acc792c0d98f38f22058671ee715ff197023 upstream.

This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system.  It is easily
reproducible with the following script:

  ----------------[BEGIN SCRIPT]--------------------
  mkfs.nilfs2 -f /dev/sdb
  mount /dev/sdb /mnt

  dd if=/dev/zero bs=1M count=30 of=/mnt/testfile

  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"

  /root/mmaptest/mmaptest /mnt/testfile 30 10 5

  sync
  CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
  umount /mnt

  echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
  echo "AFTER MMAP:\t$CHECKSUM_AFTER"
  echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
  ----------------[END SCRIPT]--------------------

The mmaptest tool looks something like this (very simplified, with
error checking removed):

  ----------------[BEGIN mmaptest]--------------------
  data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
              MAP_SHARED, fd, file_offset);

  for (i = 0; i < write_count; ++i) {
        memcpy(data + i * 4096, buf, sizeof(buf));
        msync(data, file_size - file_offset, MS_SYNC);
  }
  ----------------[END mmaptest]--------------------

The output of the script looks something like this:

  BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
  AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
  AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile

So it is clear that the changes done using mmap() do not survive a
remount.  This can be reproduced 100% of the time.  The problem was
introduced in commit 136e8770cd5d ("nilfs2: fix issue of
nilfs_set_page_dirty() for page at EOF boundary").

If the page was read with mpage_readpage() or mpage_readpages() for
example, then it has no buffers attached to it.  In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.

This patch fixes the problem by also calling nilfs_set_file_dirty() if the
page has no buffers attached to it.
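
A toy model of the two paths (not nilfs2 source; all names are stand-ins)
makes the fix easier to see: with the fix, a buffer-less dirty page is
registered with the file as well, instead of being silently dropped:

  #include <stdio.h>

  /* 1: page was read through the buffer path, 0: through mpage_readpage() */
  static int page_has_buffers_stub = 0;

  static void register_dirty_range(void)
  {
          puts("dirty range registered -> data reaches the disk at sync time");
  }

  static void set_page_dirty_model(void)
  {
          if (page_has_buffers_stub) {
                  /* pre-fix code only marked the file dirty on this path */
                  register_dirty_range();
          } else {
                  /* buffer-less pages took this path and, before the fix,
                     nothing was registered, so the data never hit the disk */
                  register_dirty_range();
          }
  }

  int main(void)
  {
          set_page_dirty_model();          /* models the mmap()-after-read case */
          return 0;
  }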

[akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agofs/notify: don't show f_handle if exportfs_encode_inode_fh failed
Andrey Vagin [Tue, 9 Sep 2014 21:51:06 +0000 (14:51 -0700)]
fs/notify: don't show f_handle if exportfs_encode_inode_fh failed

commit 7e8824816bda16bb11ff5ff1e1212d642e57b0b3 upstream.

Currently we handle only ENOSPC.  In the case of any other error the
file_handle variable isn't filled properly, and we end up showing part of
the stack.
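
The shape of the fix, sketched with a hypothetical helper in place of
exportfs_encode_inode_fh(), is to treat every negative return value as a
failure and skip the f_handle output entirely:

  #include <errno.h>
  #include <stdio.h>

  /* stand-in for exportfs_encode_inode_fh(): it can fail in more ways than
     just -ENOSPC, e.g. on a filesystem without export support */
  static int encode_fh_stub(char *buf, int *len)
  {
          (void)buf;
          (void)len;
          return -EOPNOTSUPP;
  }

  int main(void)
  {
          char handle[128] = "";
          int len = sizeof(handle);
          int ret = encode_fh_stub(handle, &len);

          /* the old check was effectively "if (ret == -ENOSPC)", so any other
             error left "handle" unfilled and it was printed anyway */
          if (ret < 0)
                  return 1;               /* don't show f_handle at all */

          printf("f_handle: %s\n", handle);
          return 0;
  }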

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agofsnotify/fdinfo: use named constants instead of hardcoded values
Andrey Vagin [Tue, 9 Sep 2014 21:51:04 +0000 (14:51 -0700)]
fsnotify/fdinfo: use named constants instead of hardcoded values

commit 1fc98d11cac6dd66342e5580cb2687e5b1e9a613 upstream.

MAX_HANDLE_SZ is equal to 128, but currently the size of pad is only 64
bytes, so exportfs_encode_inode_fh can return an error.
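
The size mismatch is easy to picture with an illustrative stand-in for
the on-stack buffer (the real layout lives in fs/notify/fdinfo.c; this is
only a sketch):

  #include <stdint.h>

  #define MAX_HANDLE_SZ 128               /* the named constant from exportfs.h */

  /* illustrative stand-in for the on-stack buffer used when formatting
     fdinfo; the header fields mirror struct file_handle */
  struct fdinfo_fh {
          uint32_t handle_bytes;
          int32_t  handle_type;
          /* was: uint8_t pad[64]; smaller than MAX_HANDLE_SZ, so a filesystem
             that needs a bigger handle makes exportfs_encode_inode_fh() fail */
          uint8_t  pad[MAX_HANDLE_SZ];
  };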

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agokcmp: fix standard comparison bug
Rasmus Villemoes [Tue, 9 Sep 2014 21:51:01 +0000 (14:51 -0700)]
kcmp: fix standard comparison bug

commit acbbe6fbb240a927ee1f5994f04d31267d422215 upstream.

The C operator <= defines a perfectly fine total ordering on the set of
values representable in a long.  However, unlike its namesake in the
integers, it is not translation invariant, meaning that we do not have
"b <= c" iff "a+b <= a+c" for all a,b,c.

This means that it is always wrong to try to boil down the relationship
between two longs to a question about the sign of their difference,
because the resulting relation [a LEQ b iff a-b <= 0] is neither
anti-symmetric nor transitive.  The former is due to -LONG_MIN==LONG_MIN
(take any two a,b with a-b = LONG_MIN; then a LEQ b and b LEQ a, but a !=
b).  The latter can be seen either by observing that x LEQ x+1 for all x,
implying x LEQ x+1 LEQ x+2 ...  LEQ x-1 LEQ x; or more directly with the
simple example a=LONG_MIN, b=0, c=1, for which a-b < 0, b-c < 0, but a-c >
0.

Note that it makes absolutely no difference that a transmogrifying bijection
has been applied before the comparison is done.  In fact, had the
obfuscation not been done, one could probably not observe the bug
(assuming all values being compared always lie in one half of the address
space, the mathematical value of a-b is always representable in a long).
As it stands, one can easily obtain three file descriptors exhibiting the
non-transitivity of kcmp().
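
For reference, the overflow-safe way to produce a three-way result is to
compare directly instead of looking at the sign of a difference; a
generic idiom (not necessarily the exact expression used in the fix) is:

  /* returns -1, 0 or 1; never derived from a-b, so it cannot overflow */
  static int cmp_long(long a, long b)
  {
          return (a > b) - (a < b);
  }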

Side note 1: I can't see that ensuring the MSB of the multiplier is
set serves any purpose other than obfuscating the obfuscating code.

Side note 2:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <sys/syscall.h>

enum kcmp_type {
        KCMP_FILE,
        KCMP_VM,
        KCMP_FILES,
        KCMP_FS,
        KCMP_SIGHAND,
        KCMP_IO,
        KCMP_SYSVSEM,
        KCMP_TYPES,
};
pid_t pid;

int kcmp(pid_t pid1, pid_t pid2, int type,
         unsigned long idx1, unsigned long idx2)
{
        return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}
int cmp_fd(int fd1, int fd2)
{
        int c = kcmp(pid, pid, KCMP_FILE, fd1, fd2);
        if (c < 0) {
                perror("kcmp");
                exit(1);
        }
        assert(0 <= c && c < 3);
        return c;
}
int cmp_fdp(const void *a, const void *b)
{
        static const int normalize[] = {0, -1, 1};
        return normalize[cmp_fd(*(int*)a, *(int*)b)];
}
#define MAX 100 /* This is plenty; I've seen it trigger for MAX==3 */
int main(int argc, char *argv[])
{
        int r, s, count = 0;
        int REL[3] = {0,0,0};
        int fd[MAX];
        pid = getpid();
        while (count < MAX) {
                r = open("/dev/null", O_RDONLY);
                if (r < 0)
                        break;
                fd[count++] = r;
        }
        printf("opened %d file descriptors\n", count);
        for (r = 0; r < count; ++r) {
                for (s = r+1; s < count; ++s) {
                        REL[cmp_fd(fd[r], fd[s])]++;
                }
        }
        printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
        qsort(fd, count, sizeof(fd[0]), cmp_fdp);
        memset(REL, 0, sizeof(REL));

        for (r = 0; r < count; ++r) {
                for (s = r+1; s < count; ++s) {
                        REL[cmp_fd(fd[r], fd[s])]++;
                }
        }
        printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
        return (REL[0] + REL[2] != 0);
}

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoeventpoll: fix uninitialized variable in epoll_ctl
Nicolas Iooss [Tue, 9 Sep 2014 21:50:51 +0000 (14:50 -0700)]
eventpoll: fix uninitialized variable in epoll_ctl

commit c680e41b3a2e944185c74bf60531e3d316d3ecc4 upstream.

When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
not initialized but ep_take_care_of_epollwakeup reads its event field.
When this uninitialized field has the EPOLLWAKEUP bit set, a capability check
is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup.  This
produces unexpected messages in the audit log, such as (on a system
running SELinux):

    type=AVC msg=audit(1408212798.866:410): avc:  denied
    { block_suspend } for  pid=7754 comm="dbus-daemon" capability=36
    scontext=unconfined_u:unconfined_r:unconfined_t
    tcontext=unconfined_u:unconfined_r:unconfined_t
    tclass=capability2 permissive=1

    type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
    success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
    pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
    fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
    exe="/usr/bin/dbus-daemon"
    subj=unconfined_u:unconfined_r:unconfined_t key=(null)

("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")

Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.
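
From userspace, the relevant property is that EPOLL_CTL_DEL does not
carry an event at all, so the kernel must not look at one; a minimal
demonstration (illustrative only, with minimal error handling):

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/epoll.h>

  int main(void)
  {
          int epfd = epoll_create1(0);
          struct epoll_event ev = { .events = EPOLLIN };

          epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);

          /* EPOLL_CTL_DEL ignores the event argument and accepts NULL, so
             there is nothing user-supplied to read for this operation */
          if (epoll_ctl(epfd, EPOLL_CTL_DEL, STDIN_FILENO, NULL) == 0)
                  puts("deleted without supplying an event structure");

          close(epfd);
          return 0;
  }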

Fixes: 4d7e30d98939 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9 years agoRevert "mac80211: disable uAPSD if all ACs are under ACM"
Johannes Berg [Mon, 25 Aug 2014 10:08:09 +0000 (12:08 +0200)]
Revert "mac80211: disable uAPSD if all ACs are under ACM"

commit bb512ad0732232f1d2693bb68f31a76bed8f22ae upstream.

This reverts commit 24aa11ab8ae03292d38ec0dbd9bc2ac49fe8a6dd.

That commit was wrong since it uses data that hasn't even been set
up yet, but might be a hold-over from a previous connection.

Additionally, it seems like a driver-specific workaround that
shouldn't have been in mac80211 to start with.

Fixes: 24aa11ab8ae0 ("mac80211: disable uAPSD if all ACs are under ACM")
Reviewed-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>