OpenVZ Source code

vzkernel (public repository)

Recent commits — hash, subject, author/committer, message excerpt, linked issues:
396a563a7b5 OpenVZ kernel rh7-3.10.0-693.11.6.vz7.42.6
  Author: Konstantin Khorenko
  Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>

530d5853ba4 target: call alua helper before reporting group states to initiator
  Author: Andrey Vagin; committed by: Konstantin Khorenko
  An alua helper is called with the same set of arguments as it is called
  when a group state is changed, but the fourth argument will be "Read".
  For example: default_tg_pt_gp 0 Active/Optimized Read implicit
  iqn.2014-06.com.vstorage:test-1
  Signed-off-by: Andrei Vagin <avagin@openvz.org>

d5714465b67 target: move alua user helper from group to device
  Author: Andrey Vagin; committed by: Konstantin Khorenko
  We added this helper to tune a device backing store (to set a correct
  delta for a ploop device). It is executed when a group state is changed.
  In this case, there is no difference where it is placed. But now we
  understand, that we need to run this helper before reporting group
  states to an initiator. It will be used to sync groups with other
  targets in a cluster. We have to guaranty that only ...

0ab8c242e62 OpenVZ kernel rh7-3.10.0-693.11.6.vz7.42.5
  Author: Konstantin Khorenko
  Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
87ebda94dd4 ms/mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
  Author: Andrey Ryabinin; committed by: Konstantin Khorenko; issue: PSBM-80732
  mem_cgroup_resize_[memsw]_limit() tries to free only 32 (SWAP_CLUSTER_MAX)
  pages on each iteration. This makes it practically impossible to decrease
  limit of memory cgroup. Tasks could easily allocate back 32 pages, so we
  can't reduce memory usage, and once retry_count reaches zero we return
  -EBUSY. Easy to reproduce the problem by running the following commands:
  mkdir /sys/fs/cgroup/memo...

546729e87a2 mm: try harder to decrease cache.limit_in_bytes
  Author: Andrey Ryabinin; committed by: Konstantin Khorenko; issue: PSBM-80732
  mem_cgroup_resize_cache_limit() tries to free only 32 (SWAP_CLUSTER_MAX)
  pages on each iteration. This makes it practically impossible to decrease
  limit of memory cgroup. Tasks could easily allocate back 32 pages, so we
  can't reduce memory usage, and once retry_count reaches zero we return
  -EBUSY. Easy to reproduce the problem by running the following commands:
  mkdir /sys/fs/cgroup/memory...
8b765718160 ms/memcg: refactor mem_cgroup_resize_limit()
  Author: Yu Zhao; committed by: Konstantin Khorenko; issue: PSBM-80732
  mem_cgroup_resize_limit() and mem_cgroup_resize_memsw_limit() have
  identical logics. Refactor code so we don't need to keep two pieces of
  code that does same thing.
  Link: http://lkml.kernel.org/r/20180108224238.14583-1-yuzhao@google.com
  Signed-off-by: Yu Zhao <yuzhao@google.com>
  Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
  Acked-by: Michal Hocko <mhocko@suse.com>
  Cc: Johannes Weiner <h...

d39d3862dde ms/mm: memcontrol: fix transparent huge page allocations under pressure
  Author: Johannes Weiner; committed by: Konstantin Khorenko; issue: PSBM-80732
  In a memcg with even just moderate cache pressure, success rates for
  transparent huge page allocations drop to zero, wasting a lot of effort
  that the allocator puts into assembling these pages. The reason for this
  is that the memcg reclaim code was never designed for higher-order
  charges. It reclaims in small batches until there is room for at least
  one page. Huge page charges only succeed w...

980cb5adaaa ms/mm: memcontrol: simplify detecting when the memory+swap limit is hit
  Author: Johannes Weiner; committed by: Konstantin Khorenko; issue: PSBM-80732
  When attempting to charge pages, we first charge the memory counter and
  then the memory+swap counter. If one of the counters is at its limit, we
  enter reclaim, but if it's the memory+swap counter, reclaim shouldn't
  swap because that wouldn't change the situation. However, if the counters
  have the same limits, we never get to the memory+swap limit. To know
  whether reclaim should swap or not, ...
a6807364a77 ms/mm: memcontrol: factor out reclaim iterator loading and updating
  Author: Johannes Weiner; committed by: Konstantin Khorenko; issue: PSBM-75892
  mem_cgroup_iter() is too hard to follow. Factor out the lockless reclaim
  iterator loading and updating so it's easier to follow the big picture.
  Also document the iterator invalidation mechanism a bit more extensively.
  Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
  Reported-by: Tejun Heo <tj@kernel.org>
  Reviewed-by: Tejun Heo <tj@kernel.org>
  Acked-by: Michal Hocko <mhocko@suse.cz>
  Cc: K...
2895afbad16 mm/vmscan: call wait_iff_congested() only if we have troubles in recaliming
  Author: Andrey Ryabinin; committed by: Konstantin Khorenko; issue: PSBM-61409
  Even if zone congested it might be better to continue reclaim as we may
  allocate memory from another zone. So call in wait_iff_congested() only
  if we have troubles in reclaiming memory.
  https://jira.sw.ru/browse/PSBM-61409
  Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>

6254b06fdfb mm/vmscan: Use per-zone sum of reclaim_stat to change zone state.
  Author: Andrey Ryabinin; committed by: Konstantin Khorenko; issue: PSBM-61409
  Currently we collect reclaim stats per-lru list and set zone flags based
  on these stats. This seems wrong, as lrus a per-memcg thus one zone could
  have hundreds of them. Move all that zone-related logic from
  shrink_inactive_list() to shrink_zone, and make decisions based on
  per-zone sum of reclaim stat instead of just per-lru.
  https://jira.sw.ru/browse/PSBM-61409
  Signed-off-by: Andrey Ryabini...

3a090b2339f mm/vmscan: collect reclaim stats across zone
  Author: Andrey Ryabinin; committed by: Konstantin Khorenko; issue: PSBM-61409
  Currently we collect reclaim stats per-lru list and set zone flags based
  on these stats. This seems wrong, as lrus a per-memcg thus one zone could
  have hundreds of them. So add reclaim_stats pointer into shrink_control
  struct and sum counters we need while iterating lrus in zone. Don't use
  them yet, that's would be the next patch.
  https://jira.sw.ru/browse/PSBM-61409
  Signed-off-by: Andrey Rya...

00a46c2c9e5 ms/mm: throttle on IO only when there are too many dirty and writeback pages
  Author: Michal Hocko; committed by: Konstantin Khorenko; issue: PSBM-61409
  wait_iff_congested has been used to throttle allocator before it retried
  another round of direct reclaim to allow the writeback to make some
  progress and prevent reclaim from looping over dirty/writeback pages
  without making any progress. We used to do congestion_wait before commit
  0e093d99763e ("writeback: do not sleep on the congestion queue if there
  are no congested BDIs or if significant c...

494d99f07b1 ms/mm, vmscan: enhance mm_vmscan_lru_shrink_inactive tracepoint
  Author: Michal Hocko; committed by: Konstantin Khorenko; issue: PSBM-61409
  mm_vmscan_lru_shrink_inactive will currently report the number of scanned
  and reclaimed pages. This doesn't give us an idea how the reclaim went
  except for the overall effectiveness though. Export and show other
  counters which will tell us why we couldn't reclaim some pages.
  - nr_dirty, nr_writeback, nr_congested and nr_immediate tells us how
    many pages are blocked due to IO
  - nr_activa...

94d28860e65 ms/mm, vmscan: extract shrink_page_list reclaim counters into a struct
  Author: Michal Hocko; committed by: Konstantin Khorenko; issue: PSBM-61409
  shrink_page_list returns quite some counters back to its caller. Extract
  the existing 5 into struct reclaim_stat because this makes the code
  easier to follow and also allows further counters to be returned. While
  we are at it, make all of them unsigned rather than unsigned long as we
  do not really need full 64b for them (we never scan more than
  SWAP_CLUSTER_MAX pages at once). This should red...
4080c5a6e33 ms/mm, vmscan: add active list aging tracepoint
  Author: Michal Hocko; committed by: Konstantin Khorenko; issue: PSBM-61409
  Our reclaim process has several tracepoints to tell us more about how
  things are progressing. We are, however, missing a tracepoint to track
  active list aging. Introduce mm_vmscan_lru_shrink_active which reports
  the number of
  - nr_taken is number of isolated pages from the active list
  - nr_referenced pages which tells us that we are hitting referenced
    pages which are deactivated. If thi...
1baf5de93cb ms/mm: fix direct reclaim writeback regression
  Author: Hugh Dickins; committed by: Konstantin Khorenko
  Shortly before 3.16-rc1, Dave Jones reported:
  WARNING: CPU: 3 PID: 19721 at fs/xfs/xfs_aops.c:971
           xfs_vm_writepage+0x5ce/0x630 [xfs]()
  CPU: 3 PID: 19721 Comm: trinity-c61 Not tainted 3.15.0+ #3
  Call Trace:
    xfs_vm_writepage+0x5ce/0x630 [xfs]
    shrink_page_list+0x8f9/0xb90
    shrink_inactive_list+0x253/0x510
    shrink_lruvec+0x563/0x6c0
    shrink_zone+0x3b/0x100
    shr...

787c0992862 ms/mm/compaction: fix wrong order check in compact_finished()
  Author: Joonsoo Kim; committed by: Konstantin Khorenko
  What we want to check here is whether there is highorder freepage in
  buddy list of other migratetype in order to steal it without
  fragmentation. But, current code just checks cc->order which means
  allocation request order. So, this is wrong. Without this fix,
  non-movable synchronous compaction below pageblock order would not
  stopped until compaction is complete, because migratetype of most pa...

a9544eb3b36 ms/mm, compaction: properly signal and act upon lock and need_sched() contention
  Author: Vlastimil Babka; committed by: Konstantin Khorenko
  Compaction uses compact_checklock_irqsave() function to periodically
  check for lock contention and need_resched() to either abort async
  compaction, or to free the lock, schedule and retake the lock. When
  aborting, cc->contended is set to signal the contended state to the
  caller. Two problems have been identified in this mechanism. First,
  compaction also calls directly cond_resched() in both ...

2b6e65e79c5 ms/mm/compaction: avoid rescanning pageblocks in isolate_freepages
  Author: Vlastimil Babka; committed by: Konstantin Khorenko
  The compaction free scanner in isolate_freepages() currently remembers
  PFN of the highest pageblock where it successfully isolates, to be used
  as the starting pageblock for the next invocation. The rationale behind
  this is that page migration might return free pages to the allocator
  when migration fails and we don't want to skip them if the compaction
  continues. Since migration now returns fr...

8ed6fafcd47 ms/mm/compaction: do not count migratepages when unnecessary
  Author: Vlastimil Babka; committed by: Konstantin Khorenko
  During compaction, update_nr_listpages() has been used to count remaining
  non-migrated and free pages after a call to migrage_pages(). The
  freepages counting has become unneccessary, and it turns out that
  migratepages counting is also unnecessary in most cases. The only
  situation when it's needed to count cc->migratepages is when
  migrate_pages() returns with a negative error code. Otherwise,...

8339d32a007 ms/mm, compaction: terminate async compaction when rescheduling
  Author: David Rientjes; committed by: Konstantin Khorenko
  Async compaction terminates prematurely when need_resched(), see
  compact_checklock_irqsave(). This can never trigger, however, if the
  cond_resched() in isolate_migratepages_range() always takes care of the
  scheduling. If the cond_resched() actually triggers, then terminate this
  pageblock scan for async compaction as well.
  Signed-off-by: David Rientjes <rientjes@google.com>
  Acked-by: Mel Gorm...

75f291dceb6 ms/mm, compaction: embed migration mode in compact_control
  Author: David Rientjes; committed by: Konstantin Khorenko
  We're going to want to manipulate the migration mode for compaction in
  the page allocator, and currently compact_control's sync field is only a
  bool. Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT
  compaction depending on the value of this bool. Convert the bool to enum
  migrate_mode and pass the migration mode in directly. Later, we'll want
  to avoid MIGRATE_SYNC_LIGHT for thp alloc...

a931bf3e5a4 ms/mm, compaction: add per-zone migration pfn cache for async compaction
  Author: David Rientjes; committed by: Konstantin Khorenko
  Each zone has a cached migration scanner pfn for memory compaction so
  that subsequent calls to memory compaction can start where the previous
  call left off. Currently, the compaction migration scanner only updates
  the per-zone cached pfn when pageblocks were not skipped for async
  compaction. This creates a dependency on calling sync compaction to
  avoid having subsequent calls to async compact...

de2f34677c1 ms/mm, compaction: return failed migration target pages back to freelist
  Author: David Rientjes; committed by: Konstantin Khorenko
  Greg reported that he found isolated free pages were returned back to the
  VM rather than the compaction freelist. This will cause holes behind the
  free scanner and cause it to reallocate additional memory if necessary
  later. He detected the problem at runtime seeing that ext4 metadata pages
  (esp the ones read by "sbi->s_group_desc[i] = sb_bread(sb, block)") were
  constantly visited by compacti...

c4241eaf49f ms/mm, migration: add destination page freeing callback
  Author: David Rientjes; committed by: Konstantin Khorenko
  Memory migration uses a callback defined by the caller to determine how
  to allocate destination pages. When migration fails for a source page,
  however, it frees the destination page back to the system. This patch
  adds a memory migration callback defined by the caller to determine how
  to free destination pages. If a caller, such as memory compaction,
  builds its own freelist for migration targ...

6b51fca8991 ms/mm/compaction: cleanup isolate_freepages()
  Author: Vlastimil Babka; committed by: Konstantin Khorenko
  isolate_freepages() is currently somewhat hard to follow thanks to many
  looks like it is related to the 'low_pfn' variable, but in fact it is
  not. This patch renames the 'high_pfn' variable to a hopefully less
  confusing name, and slightly changes its handling without a functional
  change. A comment made obsolete by recent changes is also updated.
  [akpm@linux-foundation.org: comment fixes, per ...

a09c351e662 ms/mm/compaction: clean up unused code lines
  Author: Heesub Shin; committed by: Konstantin Khorenko
  Remove code lines currently not in use or never called.
  Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
  Acked-by: Vlastimil Babka <vbabka@suse.cz>
  Cc: Dongjun Shin <d.j.shin@samsung.com>
  Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
  Cc: Minchan Kim <minchan@kernel.org>
  Cc: Mel Gorman <mgorman@suse.de>
  Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
  Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsu...

57ddbd7a8d9 ms/mm, compaction: ignore pageblock skip when manually invoking compaction
  Author: David Rientjes; committed by: Konstantin Khorenko
  The cached pageblock hint should be ignored when triggering compaction
  through /proc/sys/vm/compact_memory so all eligible memory is isolated.
  Manually invoking compaction is known to be expensive, there's no need to
  skip pageblocks based on heuristics (mainly for debugging).
  Signed-off-by: David Rientjes <rientjes@google.com>
  Acked-by: Mel Gorman <mgorman@suse.de>
  Cc: Rik van Riel <riel@redha...

6179ca1f337 ms/mm, compaction: determine isolation mode only once
  Author: David Rientjes; committed by: Konstantin Khorenko
  The conditions that control the isolation mode in
  isolate_migratepages_range() do not change during the iteration, so
  extract them out and only define the value once. This actually does have
  an effect, gcc doesn't optimize it itself because of cc->sync.
  Signed-off-by: David Rientjes <rientjes@google.com>
  Cc: Mel Gorman <mgorman@suse.de>
  Acked-by: Rik van Riel <riel@redhat.com>
  Acked-by: Vlast...

9a81258bb25 ms/mm/compaction: clean-up code on success of ballon isolation
  Author: Joonsoo Kim; committed by: Konstantin Khorenko
  It is just for clean-up to reduce code size and improve readability.
  There is no functional change.
  Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
  Acked-by: Vlastimil Babka <vbabka@suse.cz>
  Cc: Mel Gorman <mgorman@suse.de>
  Cc: Rik van Riel <riel@redhat.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  (cherry picked...

69f42514558 ms/mm/compaction: check pageblock suitability once per pageblock
  Author: Joonsoo Kim; committed by: Konstantin Khorenko
  isolation_suitable() and migrate_async_suitable() is used to be sure that
  this pageblock range is fine to be migragted. It isn't needed to call it
  on every page. Current code do well if not suitable, but, don't do well
  when suitable.
  1) It re-checks isolation_suitable() on each page of a pageblock that was
     already estabilished as suitable.
  2) It re-checks migrate_async_suitable() on each ...

1ce03f8056f ms/mm/compaction: change the timing to check to drop the spinlock
  Author: Joonsoo Kim; committed by: Konstantin Khorenko
  It is odd to drop the spinlock when we scan (SWAP_CLUSTER_MAX - 1) th pfn
  page. This may results in below situation while isolating migratepage.
  1. try isolate 0x0 ~ 0x200 pfn pages.
  2. When low_pfn is 0x1ff, ((low_pfn+1) % SWAP_CLUSTER_MAX) == 0, so drop
     the spinlock.
  3. Then, to complete isolating, retry to aquire the lock.
  I think that it is better to use SWAP_CLUSTER_MAX th pfn for ch...

4ce80d5f804 ms/mm/compaction: do not call suitable_migration_target() on every page
  Author: Joonsoo Kim; committed by: Konstantin Khorenko
  suitable_migration_target() checks that pageblock is suitable for
  migration target. In isolate_freepages_block(), it is called on every
  page and this is inefficient. So make it called once per pageblock.
  suitable_migration_target() also checks if page is highorder or not, but
  it's criteria for highorder is pageblock order. So calling it once within
  pageblock range has no problem.
  Signed-of...
6fed33655c5 ms/mm/compaction: disallow high-order page for migration target
  Author: Joonsoo Kim; committed by: Konstantin Khorenko
  Purpose of compaction is to get a high order page. Currently, if we find
  high-order page while searching migration target page, we break it to
  order-0 pages and use them as migration target. It is contrary to purpose
  of compaction, so disallow high-order page to be used for migration
  target. Additionally, clean-up logic in suitable_migration_target() to
  simplify the code. There is no functi...

705c7b021d3 ms/mm, compaction: avoid isolating pinned pages
  Author: David Rientjes; committed by: Konstantin Khorenko
  Page migration will fail for memory that is pinned in memory with, for
  example, get_user_pages(). In this case, it is unnecessary to take
  zone->lru_lock or isolating the page and passing it to page migration
  which will ultimately fail. This is a racy check, the page can still
  change from under us, but in that case we'll just fail later when
  attempting to move the page. This avoids very expen...

e1d5051dedf ms/mm: compaction: reset scanner positions immediately when they meet
  Author: Vlastimil Babka; committed by: Konstantin Khorenko
  Compaction used to start its migrate and free page scaners at the zone's
  lowest and highest pfn, respectively. Later, caching was introduced to
  remember the scanners' progress across compaction attempts so that
  pageblocks are not re-scanned uselessly. Additionally, pageblocks where
  isolation failed are marked to be quickly skipped when encountered again
  in future compactions. Currently, both...

bab8a4ef408 ms/mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
  Author: Vlastimil Babka; committed by: Konstantin Khorenko
  Compaction temporarily marks pageblocks where it fails to isolate pages
  as to-be-skipped in further compactions, in order to improve efficiency.
  One of the reasons to fail isolating pages is that isolation is not
  attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.
  The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
  async compaction become skipped due to th...

93ef5641dd2 ms/mm: compaction: encapsulate defer reset logic
  Author: Vlastimil Babka; committed by: Konstantin Khorenko
  Currently there are several functions to manipulate the deferred
  compaction state variables. The remaining case where the variables are
  touched directly is when a successful allocation occurs in direct
  compaction, or is expected to be successful in the future by kswapd.
  Here, the lowest order that is expected to fail is updated, and in the
  case of successful allocation, the deferred status and...

14c6f7fbba4 ms/mm: compaction: trace compaction begin and end
  Author: Mel Gorman; committed by: Konstantin Khorenko
  The broad goal of the series is to improve allocation success rates for
  huge pages through memory compaction, while trying not to increase the
  compaction overhead. The original objective was to reintroduce capturing
  of high-order pages freed by the compaction, before they are split by
  concurrent activity. However, several bugs and opportunities for simple
  improvements were found in the curren...

9bdc45a8f82 ms/mm/compaction.c: periodically schedule when freeing pages
  Author: David Rientjes; committed by: Konstantin Khorenko; issue: PSBM-81070
  Patchset description: compaction related stable backports
  These are some compaction related -stable backports that we missing.
  David Rientjes (9):
    ms/mm/compaction.c: periodically schedule when freeing pages
    ms/mm, compaction: avoid isolating pinned pages
    ms/mm, compaction: determine isolation mode only once
    ms/mm, compaction: ignore pageblock skip when manually invoking compactio...
d8969f9b74b OpenVZ kernel rh7-3.10.0-693.11.6.vz7.42.4
  Author: Konstantin Khorenko
  Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>

f5139e5070b ms/kbuild: add -fno-PIE
  Author: Sebastian Andrzej Siewior; committed by: Konstantin Khorenko
  Debian started to build the gcc with -fPIE by default so the kernel build
  ends before it starts properly with:
  |kernel/bounds.c:1:0: error: code model kernel does not support PIC mode
  Also add to KBUILD_AFLAGS due to:
  |gcc -Wp,-MD,arch/x86/entry/vdso/vdso32/.note.o.d … -mfentry
   -DCC_USING_FENTRY … vdso/vdso32/note.S
  |arch/x86/entry/vdso/vdso32/note.S:1:0: sorry, unimplemented: -mfentry
   isn't ...
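The kind of Makefile change the entry above describes can be sketched as a kbuild fragment. This is an illustrative sketch, not the exact upstream hunk, using kbuild's `cc-option` macro to probe compiler support:

```make
# Force non-PIC code generation even when the host gcc defaults to -fPIE,
# for both the C compiler and the assembler (the vDSO note.S otherwise
# trips over -mfentry combined with PIE, as quoted above).
KBUILD_CFLAGS += $(call cc-option,-fno-PIE)
KBUILD_AFLAGS += $(call cc-option,-fno-PIE)
```

`cc-option` adds the flag only if the compiler accepts it, so older toolchains that predate PIE-by-default are unaffected.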
369f626be24 ms/scripts/has-stack-protector: add -fno-PIE
  Author: Sebastian Andrzej Siewior; committed by: Konstantin Khorenko
  Adding -no-PIE to the fstack protector check. -no-PIE was introduced
  before -fstack-protector so there is no need for a runtime check. Without
  it the build stops:
  |Cannot use CONFIG_CC_STACKPROTECTOR_STRONG: -fstack-protector-strong
   available but compiler is broken
  due to -mcmodel=kernel + -fPIE if -fPIE is enabled by default. Tagging it
  stable so it is possible to compile recent stable kern...

d1255da44c7 ms/loop: fix concurrent lo_open/lo_release
  Author: Linus Torvalds; committed by: Konstantin Khorenko; issues: 2 linked Jira issues
  范龙飞 reports that KASAN can report a use-after-free in __lock_acquire. The
  reason is due to insufficient serialization in lo_release(), which will
  continue to use the loop device even after it has decremented the
  lo_refcnt to zero. In the meantime, another process can come in, open the
  loop device again as it is being shut down. Confusion ensues.
  Reported-by: 范龙飞 <long7573@126.com>
  Signed-off-...
11bddd0543b fence-watchdog: print alive messages
  Author: Pavel Tikhomirov; committed by: Konstantin Khorenko; issue: TTASK-22056
  We have a situation when node worked for ~8 days and came to The Point
  where jiffies > fence_wdog_jiffies64 (what ever updated
  /sys/kernel/watchdog_timer stopped doing these) but node lived after The
  Point for 17 more hours, these means that nobody called
  fence_wdog_check_timer() for 17 hours, else crash or reboot would've
  happened earlier. When fence_wdog_check_timer() was called the first ti...

94b497499d2 ms/dccp: call inet_add_protocol after register_pernet_subsys in dccp_v6_init
  Author: Xin Long; committed by: Konstantin Khorenko; issue: PSBM-80708
  Backport of ms commit a0f9a4c2ffef
  Patch "call inet_add_protocol after register_pernet_subsys in
  dccp_v4_init" fixed a null pointer dereference issue for dccp_ipv4
  module. The same fix is needed for dccp_ipv6 module.
  Signed-off-by: Xin Long <lucien.xin@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>
  https://jira.sw.ru/browse/PSBM-80708
  Signed-off-by: Kirill Tkhai <ktkhai@vir...

0b2737515ce ms/dccp: call inet_add_protocol after register_pernet_subsys in dccp_v4_init
  Author: Xin Long; committed by: Konstantin Khorenko; issue: PSBM-80708
  Backport of ms commit d5494acb88aa
  Now dccp_ipv4 works as a kernel module. During loading this module, if
  one dccp packet is being recieved after inet_add_protocol but before
  register_pernet_subsys in which v4_ctl_sk is initialized, a null pointer
  dereference may be triggered because of init_net.dccp.v4_ctl_sk is 0x0.
  Jianlin found this issue when the following call trace occurred:
  [ 171.95...

9af2260ea0a net/dccp: fix use after free in tw_timer_handler()
  Author: Andrey Ryabinin; committed by: Konstantin Khorenko; issue: PSBM-80708
  DCCP doesn't purge timewait sockets on network namespace shutdown. So,
  after net namespace destroyed we could still have an active timer which
  will trigger use after free in tw_timer_handler():
  BUG: KASAN: use-after-free in tw_timer_handler+0x4a/0xa0 at addr
  ffff88010e0d1e10
  Read of size 8 by task swapper/1/0
  Call Trace:
    __asan_load8+0x54/0x90
    tw_timer_handler+0x4a/0xa0
    call_timer_fn+0x127/...