OpenVZ Source code

vzkernel

Public
Author | Commit | Message | Commit date | Issues
Konstantin Khorenko
9100cbc72cf OpenVZ kernel rh9-5.14.0-362.8.1.vz9.35.7
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Alexey Kuznetsov (committer: Konstantin Khorenko)
72dcce0c8d2 fs/fuse: enhanced splice support
Unfortunately, the existing splice support in fuse is completely useless: it has many flaws, each of them fatal even taken separately.
- it passes only a single splice, which requires user space to prepare one more splice to merge the header.
- ... and does not allow using splices coming from TCP, as they can be huge and do not fit into a single pipe buffer.
- it uses kvmalloc(!!!) for temp bu...
Issues: VSTOR-79527
Alexey Kuznetsov (committer: Konstantin Khorenko)
137e8807d5b net: zerocopy over unix sockets
The observation is that af_unix sockets have become slower and consume far more CPU than 100G Ethernet. So, implement MSG_ZEROCOPY over af_unix sockets to be able to talk to local services without a collapse in performance. Unexpectedly, this makes sense! For example, zerocopy cannot be done in TCP over loopback, because skbs change ownership when passing over loopback. But unix sockets traditionally impl...
Issues: VSTOR-79527
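As a rough userspace illustration of what the zerocopy commit above enables, the sketch below tries to switch on `SO_ZEROCOPY` for a socket and passes `MSG_ZEROCOPY` to `send()` only if that succeeds; otherwise it falls back to an ordinary copying send. The helper name `sketch_send_maybe_zerocopy` is hypothetical, and the fallback constants are the asm-generic Linux values, which is an assumption about the target architecture. The excerpt does not say how the af_unix implementation reports completions, so the sketch makes no claim about that.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers; asm-generic Linux values (assumption). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/*
 * Hypothetical helper: send len bytes from buf on fd, using MSG_ZEROCOPY
 * when the kernel accepts SO_ZEROCOPY for this socket type, and a plain
 * copying send() otherwise. *used_zerocopy reports which path was taken.
 */
static ssize_t sketch_send_maybe_zerocopy(int fd, const void *buf,
                                          size_t len, int *used_zerocopy)
{
    int one = 1;
    int flags = 0;

    /* Kernels without zerocopy support for this socket family reject this. */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0)
        flags |= MSG_ZEROCOPY;

    *used_zerocopy = (flags != 0);
    return send(fd, buf, len, flags);
}
```

With `MSG_ZEROCOPY` in general, the caller must not modify or free the buffer until the kernel signals that it is done with it; for TCP in mainline Linux that notification arrives on the socket error queue.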
Alexey Kuznetsov (committer: Konstantin Khorenko)
105a147a0c2 fs/fuse: fuse queue routing
Generic fuse multiqueue support. It improves the previously existing per-cpu routing and makes it extensible. At the moment three routing tactics are implemented and tested:
1. Old per-cpu routing. Deprecated, but left for performance comparisons. Also, it can still be good in some situations.
2. Size buckets to support large fuse writes. Userspace selects it as the default for fuse writes.
3. Ha...
Issues: VSTOR-79527
Konstantin Khorenko
3cb059b77cd OpenVZ kernel rh9-5.14.0-362.8.1.vz9.35.6
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Konstantin Khorenko
b6a336efca8 configs: Enable in-kernel accelerator for virtio-blk guests in configs dir
We store precompiled config files for convenience, so enable the VHOST_BLK module there as well.
https://virtuozzo.atlassian.net/browse/PSBM-139414
https://virtuozzo.atlassian.net/browse/PSBM-152375
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Feature: vhost-blk: in-kernel accelerator for virtio-blk guests
Issues: 2 Jira Issues
Konstantin Khorenko
04de95c1036 configs: Enable in-kernel accelerator for virtio-blk guests
https://virtuozzo.atlassian.net/browse/PSBM-139414
https://virtuozzo.atlassian.net/browse/PSBM-152375
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Feature: vhost-blk: in-kernel accelerator for virtio-blk guests
Issues: 2 Jira Issues
Andrey Zhadchenko (committer: Konstantin Khorenko)
e21e142cb12 drivers/vhost: vhost-blk accelerator for virtio-blk guests
Although QEMU virtio is quite fast, there is still some room for improvement. Disk latency can be reduced if we handle virtio-blk requests in the host kernel instead of passing them to QEMU. The patch adds a vhost-blk kernel module to do so.
Some test setups:
fio --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=128
QEMU drive options: cache=none
filesystem: xfs
SSD: | r...
Issues: 4 Jira Issues
Andrey Zhadchenko (committer: Konstantin Khorenko)
5f60cbb11d0 drivers/vhost: add ioctl to increase the number of workers
Finally, add an ioctl to allow userspace to create additional workers. For now, only allow increasing the number of workers.
https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
======
Patchset description: vhost-blk: in-kernel accelerator for virtio-blk guests
Although QEMU virtio-blk is quite fast, there is still some room for improvement. Dis...
Issues: 4 Jira Issues
Mike Christie (committer: Konstantin Khorenko)
5271bf51f1b ms/vhost: replace single worker pointer with xarray
The next patch allows userspace to create multiple workers per device, so this patch replaces the vhost_worker pointer with an xarray so we can store multiple workers and look them up.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-15-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
========
Also rework vhost_work_...
Issues: 2 Jira Issues
Mike Christie (committer: Konstantin Khorenko)
ab2f6961e8b ms/vhost: convert poll work to be vq based
This has the drivers pass in their poll to vq mapping and then converts the core poll code to use the vq based helpers. In the next patches we will allow vqs to be handled by different workers, so to allow drivers to execute operations like queue, stop, flush, etc on specific polls/vqs we need to know the mappings.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <2023062...
Issues: 2 Jira Issues
Andrey Zhadchenko (committer: Konstantin Khorenko)
1bcb1e6e6d1 drivers/vhost: attach cgrous to specififc worker
Update vhost_attach_cgroups() to operate on a specific worker rather than on global vhost device functions.
https://virtuozzo.atlassian.net/browse/PSBM-152375
https://virtuozzo.atlassian.net/browse/PSBM-139414
Signed-off-by: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
Feature: vhost-blk: in-kernel accelerator for virtio-blk guests
Issues: 2 Jira Issues
Mike Christie (committer: Konstantin Khorenko)
b33e9080a1e ms/vhost: take worker or vq for flushing
This patch has the core work flush function take a worker. When we support multiple workers we can then flush each worker during device removal, stoppage, etc. It also adds a helper to flush specific virtqueues, so vhost-scsi can flush IO vqs from its ctl vq.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-7-michael.christie@oracle.com>
Signed-off-...
Issues: 2 Jira Issues
Mike Christie (committer: Konstantin Khorenko)
db46389987f ms/vhost: take worker or vq instead of dev for queueing
This patch has the core work queueing function take a worker for when we support multiple workers. It also adds a helper that takes a vq during queueing so modules can control which vq/worker to queue work on. This temporarily leaves vhost_work_queue. It will be removed when the drivers are converted in the next patches.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <2023062...
Issues: 2 Jira Issues
Mike Christie (committer: Konstantin Khorenko)
312bc762cd9 ms/vhost, vhost_net: add helper to check if vq has work
In the next patches each vq might have different workers so one could have work but others do not. For net, we only want to check specific vqs, so this adds a helper to check if a vq has work pending and converts vhost-net to use it.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20230626232307.97930-5-michael.christie@oracle....
Issues: 2 Jira Issues
Mike Christie (committer: Konstantin Khorenko)
ee7a2282666 ms/vhost: add vhost_worker pointer to vhost_virtqueue
This patchset allows userspace to map vqs to different workers. This patch adds a worker pointer to the vq so in later patches in this set we can queue/flush specific vqs and their workers.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-4-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
=========
(cherry picked from...
Issues: 2 Jira Issues
Mike Christie (committer: Konstantin Khorenko)
a25b680efbd ms/vhost: dynamically allocate vhost_worker
This patchset allows us to allocate multiple workers, so this has us move from the vhost_worker that's embedded in the vhost_dev to dynamically allocating it.
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Message-Id: <20230626232307.97930-3-michael.christie@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
=========
Half of this commit is already present. Add the res...
Issues: 2 Jira Issues
Konstantin Khorenko
e10e2fafaa6 FD: vhost-blk: in-kernel accelerator for virtio-blk guests
https://jira.sw.ru/browse/PSBM-139414
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Feature: vhost-blk: in-kernel accelerator for virtio-blk guests
Issues: PSBM-139414
Liu Kui (committer: Konstantin Khorenko)
14faea33294 fs/fuse kio: destroy rdma_cm_id immediately in case cm fails during connection establishment
Previously, if cm failed after the rio had been created, the rdma_cm_id would not be destroyed immediately. However, the cm_id->context could still point to rc->id, which would no longer be valid. This delay creates a window during which cm_id->context holds an illegal pointer. If an RDMA cm event arrives during this window, an illegal pointer dereference will happen, thus crashing the system.
htt...
Issues: VSTOR-79838
Konstantin Khorenko
ff5a8ce86ad OpenVZ kernel rh9-5.14.0-362.8.1.vz9.35.5
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Alexey Kuznetsov (committer: Konstantin Khorenko)
c4e2490aa51 fs/fuse: multithread fuse write
Fuse user space creates a cloned channel device and binds it to a cpu. The kernel routes WRITE requests to these channels, which allows us to offload expensive reads from the fuse device to multiple threads. At the moment we see significant improvements, about 30% in some major ostor workloads.
Signed-off-by: Alexey Kuznetsov <kuznet@acronis.com>
Feature: fuse: multithread fuse write
Konstantin Khorenko
544856295cf mm: Drop swap_cache_info reporting in vzstat
Mainstream has dropped swap_cache_info statistics:
442701e7058b ("mm/swap: remove swap_cache_info statistics")
So we are dropping its reporting via the /proc/vz/stats interface. We could leave the format of the /proc/vz/stats file the same (it is an interface after all, and should be stable), but as vz9 will have so many changes, the vzstat utility should also be rewritten, so it's a good time to drop...
Issues: PSBM-152466
Konstantin Khorenko
77567b1b78a OpenVZ kernel rh9-5.14.0-362.8.1.vz9.35.4
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Konstantin Khorenko
cd020db2561 OpenVZ kernel rh9-5.14.0-362.8.1.vz9.35.3
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Yuriy (committer: Konstantin Khorenko)
a9f0b04bb37 fs/fuse kio: skip truncating dropped cslists
If try_cslist_get returns false, it indicates that the cslist has already been dropped and the map has been truncated. So, this cslist should not be handled.
https://pmc.acronis.work/browse/VSTOR-76384
Signed-off-by: Yuriy Vasilev <yuriy.vasilev@virtuozzo.com>
Acked-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Issues: VSTOR-76384
Yuriy (committer: Konstantin Khorenko)
063c30f17ff fs/fuse kio: skip handling dropped cslists in pcs_map_notify_addr_change
If try_cslist_get returns false, it indicates that the cslist has been dropped and should not be handled without holding cs->lock.
https://pmc.acronis.work/browse/VSTOR-76384
Signed-off-by: Yuriy Vasilev <yuriy.vasilev@virtuozzo.com>
Acked-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Issues: VSTOR-76384
Yuriy (committer: Konstantin Khorenko)
a184bf61849 fs/fuse kio: introduce try_cslist_get()
This function allows checking if the cslist has been dropped before usage.
https://pmc.acronis.work/browse/VSTOR-76384
Signed-off-by: Yuriy Vasilev <yuriy.vasilev@virtuozzo.com>
Acked-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Issues: VSTOR-76384
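The cslist commits in this series describe a common reference-counting pattern: a plain "get" must never see a zero count, while a "try get" reports failure when the object has already been dropped. The following is a minimal userspace sketch of that pattern using C11 atomics. All names (`sketch_cslist`, `sketch_cslist_get`, `sketch_cslist_try_get`, `sketch_cslist_put`) are hypothetical, and the real kernel code, its types, and its locking are not shown in the excerpt and may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted cslist. */
struct sketch_cslist {
    atomic_int refcnt;
};

static void sketch_cslist_init(struct sketch_cslist *l)
{
    atomic_init(&l->refcnt, 1);   /* creator holds the first reference */
}

/* Unconditional get: taking a reference on a dropped object is a bug. */
static void sketch_cslist_get(struct sketch_cslist *l)
{
    int old = atomic_fetch_add(&l->refcnt, 1);
    assert(old != 0);             /* userspace stand-in for BUG_ON() */
}

/* Try get: succeeds only while the count is still non-zero. */
static bool sketch_cslist_try_get(struct sketch_cslist *l)
{
    int old = atomic_load(&l->refcnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&l->refcnt, &old, old + 1))
            return true;
    }
    return false;                 /* already dropped, caller must skip it */
}

/* Put: returns true when the last reference was just dropped. */
static bool sketch_cslist_put(struct sketch_cslist *l)
{
    return atomic_fetch_sub(&l->refcnt, 1) == 1;
}
```

A caller that iterates over lists which may be torn down concurrently would use the try variant and skip any entry for which it returns false.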
Yuriy (committer: Konstantin Khorenko)
5c38a1637ee fs/fuse kio: do not allow getting cslist when refcnt is equal to 0
When the refcnt of a cslist is equal to 0, it indicates that the cslist has been dropped and is going to be freed. In such cases, let's trigger a BUG_ON to prevent use after free.
https://pmc.acronis.work/browse/VSTOR-76384
Signed-off-by: Yuriy Vasilev <yuriy.vasilev@virtuozzo.com>
Acked-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Issues: VSTOR-76384
Konstantin Khorenko
3b5521b69b3 OpenVZ kernel rh9-5.14.0-362.8.1.vz9.35.2
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Alexander Atanasov (committer: Konstantin Khorenko)
a86d1d9f0ea ext4/mfsync: do not BUG_ON on wrong set of files
mfsync(...) cannot sync files from different filesystems; if passed such a set of files, it BUG_ONs. Instead of BUG, return -EINVAL.
https://pmc.acronis.work/browse/VSTOR-78331
Signed-off-by: Alexander Atanasov <alexander.atanasov@virtuozzo.com>
Acked-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Issues: VSTOR-78331
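The change described in the mfsync commit, rejecting bad input with an error code instead of crashing, can be sketched in userspace as follows. The types and names (`sketch_file`, `fs_id`, `sketch_mfsync_check`) are hypothetical illustrations and do not reflect the actual ext4/mfsync code, which the excerpt does not show.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in: each file records which filesystem it lives on. */
struct sketch_file {
    int fs_id;
};

/*
 * Validate that every file in the set belongs to the same filesystem.
 * Returns 0 if the set is consistent and -EINVAL for a mixed set, rather
 * than aborting the way a BUG_ON() would.
 */
static int sketch_mfsync_check(const struct sketch_file *files, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (files[i].fs_id != files[0].fs_id)
            return -EINVAL;
    }
    return 0;
}
```

Returning an error lets the caller recover from a bad argument, whereas a BUG_ON() on user-controlled input takes the kernel down.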
Pavel Tikhomirov (committer: Konstantin Khorenko)
8cf1c11d447 mm/memcontrol: prohibit writing to memory.numa_migrate from container
We might want to put containers on designated numa nodes for optimal performance; it would all be ruined if a container could force its memory pages to move to any node it wants. This memory.numa_migrate file was originally made for vcmmd, which works from ve0, so we should be fine with this additional restriction.
Fixes: dfc0b63bfd50c ("mm: memcontrol: add memory.numa_migrate file")
https://virtu...
Issues: PSBM-152372
Konstantin Khorenko
4d08995e658 sched: Do not set LBF_NEED_BREAK flag if scanned all the tasks
After ms commit b0defa7ae03e ("sched/fair: Make sure to try to detach at least one movable task"), detach_tasks() does not stop on the condition (env->loop > env->loop_max) in case no movable task is found. Instead (if there are no movable tasks in the rq), exits always happen on the loop_break check, thus with the LBF_NEED_BREAK flag set. It's not a problem for mainstream because load_balanc...
Konstantin Khorenko
a486357cc95 OpenVZ kernel rh9-5.14.0-362.8.1.vz9.35.1
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Konstantin Khorenko
5c2b8d6367e OpenVZ kernel rh9-5.14.0-362.8.1.vz9.30.14
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Konstantin Khorenko
a970393ab4a configs: commit actual Virtuozzo 9 release and debug configs
Those configs are generated by the following command:
# cd redhat/configs/ && ARCH_MACH=x86_64 ./build_configs.sh kernel rhel
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Feature: internal
Alexey Kuznetsov (committer: Konstantin Khorenko)
fe71f02f6fa fuse: pcs: dangerous typo in commit_sync_info()
Unpleasant, shows the code never was in this place before.
Fixes: 3202fa19f30e ("fuse: a protocol to reenable optimizations after replication finished")
https://pmc.acronis.work/browse/VSTOR-77923
Signed-off-by: Alexey Kuznetsov <kuznet@acronis.com>
Issues: VSTOR-77923
Kui Liu (committer: Konstantin Khorenko)
60bf2c64fdd fs/fuse kio: always ack RIO_MSG_RDMA_READ_REQ received from csd
In our userspace RDMA implementation, it is required that every RIO_MSG_RDMA_READ_REQ msg be acked strictly in order. However, this rule can be broken due to a bug in kio, which is triggered only by very abnormal hardware behaviour where it can take a very long time (>10s) for a WR to complete. This happens in the read workload with large block size that the client needs to issue RDMA R...
Issues: 4 Jira Issues
Kui Liu (committer: Konstantin Khorenko)
9573bf29e31 fs/fuse: make size of qhash and limit of each bucket module parameters
Both the size of qhash and the limit of each bucket can significantly affect the performance of certain workloads. There is no single set of values that would be best for all workloads; we may need to choose values based on the workload, so it is better to make them configurable. Here we choose the default value to be 16 (qhash size) x 256 (bucket limit).
Signed-off-by: Liu Kui <Kui.Liu@acronis.com>
Acked-by: A...
Ilpo Järvinen (committer: Konstantin Khorenko)
da615639c35 ms/tty: Make ->set_termios() old ktermios const
There should be no reason to adjust old ktermios which is going to get discarded anyway.
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20220816115739.10928-9-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Getting rid of compilation warnings.
ht...
Issues: PSBM-148793
Ilpo Järvinen (committer: Konstantin Khorenko)
7fef9ee2ece ms/usb: serial: Make ->set_termios() old ktermios const
There should be no reason to adjust old ktermios which is going to get discarded anyway.
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20220816115739.10928-8-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Getting rid of compilation warnings.
ht...
Issues: PSBM-148793
Ilpo Järvinen (committer: Konstantin Khorenko)
5067e1dd0a4 ms/tty: Make ldisc ->set_termios() old ktermios const
There should be no reason to adjust old ktermios which is going to get discarded anyway.
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20220816115739.10928-6-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Getting rid of compilation warnings.
ht...
Issues: PSBM-148793
Ilpo Järvinen (committer: Konstantin Khorenko)
05e671ac027 ms/serial: dz: Assume previous baudrate is valid
Assume previously used termios has a valid baudrate and use it directly.
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://lore.kernel.org/r/20220816115739.10928-4-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Getting ...
Issues: PSBM-148793
Kees Cook (committer: Konstantin Khorenko)
e188502de22 ms/treewide: Replace open-coded flex arrays in unions
In support of enabling -Warray-bounds and -Wzero-length-bounds and correctly handling run-time memcpy() bounds checking, replace all open-coded flexible arrays (i.e. 0-element arrays) in unions with the DECLARE_FLEX_ARRAY() helper macro.
This fixes warnings such as:
fs/hpfs/anode.c: In function 'hpfs_add_sector_to_btree':
fs/hpfs/anode.c:209:27: warning: array subscript 0 is outside the bound...
Issues: PSBM-148793
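To make the flexible-array change above more concrete, here is a small sketch of a union whose members are flexible arrays wrapped by a helper macro. The macro `SKETCH_DECLARE_FLEX_ARRAY` is modeled on the idea behind the kernel's DECLARE_FLEX_ARRAY() but is written from scratch here; the exact kernel definition may differ. The struct and function names are hypothetical, and the code relies on GNU C extensions (an empty struct, and flexible array members inside a union), so it assumes GCC or Clang.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Sketch of a helper that lets a flexible array member live inside a
 * union: the array is wrapped in an anonymous struct together with an
 * empty named member (GNU C extension, assumption: GCC or Clang).
 */
#define SKETCH_DECLARE_FLEX_ARRAY(TYPE, NAME) \
    struct { \
        struct { } sketch_empty_##NAME; \
        TYPE NAME[]; \
    }

/* Hypothetical record whose payload can be viewed as 16- or 32-bit items. */
struct sketch_record {
    uint32_t count;
    union {
        SKETCH_DECLARE_FLEX_ARRAY(uint16_t, shorts);
        SKETCH_DECLARE_FLEX_ARRAY(uint32_t, words);
    };
};

/* Allocate a zeroed record with room for n 32-bit words; NULL on failure. */
static struct sketch_record *sketch_record_alloc_words(uint32_t n)
{
    struct sketch_record *r;

    r = calloc(1, sizeof(*r) + (size_t)n * sizeof(uint32_t));
    if (r != NULL)
        r->count = n;
    return r;
}
```

Unlike a `[0]` array, a true flexible array member has no declared bound, so the compiler's bounds warnings no longer fire on legitimate accesses past index 0.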
Alexander Atanasov (committer: Konstantin Khorenko)
46ad581166d ve/userns: remove all hashed entries before freeing user_namespace
548df8b4b57b (ve/userns: associate user_struct with the user_namespace, 2017-03-13) introduced a dynamically allocated per-userns uid hashtable instead of using a global static hash table. The problem with that allocated hashtable is that the life cycle of the two objects is different: both structures use reference counts, but they are counted separately. The contained objects (user_struct) are not refe...
Issues: PSBM-150648
Alexey Kuznetsov (committer: Konstantin Khorenko)
55c440bad3d fuse: scalable queue limiting
This is a missing element in the previous scalability patch. We removed any limits on direct io submitted to the cluster there, which is not the right thing to do. The problem is not trivial. Experiments show we cannot do _any_ shared spinlock in this path; even an empty lock-unlock added there cuts performance in half! So, we have to come up with a scalable solution that does not use locks. Several approaches were tried,...
Issues: VSTOR-54040
Alexey Kuznetsov (committer: Konstantin Khorenko)
34cba88f29f fuse: skip bg_queue for async direct io pcs requests
There is a capital problem in the fuse pcs implementation. While requests scale by cpu, we still have contention on bg_lock, and all the requests go through a single bottleneck at bg_queue. Of course we had inferior performance due to this, but we ignored the problem as the performance was still good. But recently it was found that under some realistic circumstances we get a collapse of performance, it ...
Issues: VSTOR-54040
Alexey Kuznetsov (committer: Konstantin Khorenko)
8620fd2a3a1 fuse: pcs: new rpc affinity mode - RSS
The mode aligns socket io jobs to RSS: receive/transmit jobs are scheduled on the cpus that RSS maps the rpc socket to. The precondition is a multiqueue device with RSS and XPS enabled. If RSS and XPS are enabled, sockets are entirely localized to one cpu; they are not accessed from other cpus, which minimizes lock contention and keeps perfect cache locality for socket data. Nevertheless, we have ...
Issues: VSTOR-54040
Alexey Kuznetsov (committer: Konstantin Khorenko)
153b63ea657 fuse: pcs: split trace_printk
trace_printk() is a function which is not desired in release kernels: if it is referenced from a module, even if it is not actually used, it allocates lots of memory and scares people with some messages. What can we do?
1. Surround it with ifdef turned off in release kernels
No. We need this in customers' environments to investigate actual problems; modules cannot be replaced with debuggin...
Issues: PSBM-146513
Alexey Kuznetsov (committer: Konstantin Khorenko)
33acf7985c5 fuse: pcs: rpc timeout was incoherent
The code from user space was ported incorrectly, without understanding how this actually works. This can result in a lockup of a failing connection. We have two timeouts - per-message timeout, when we cancel a timed-out request, but assume this is because of the semantics of the request, for example CS needs to talk to another CS or to MDS, and that communication fails, which obviously does not mean _this_ conn...
Issues: VSTOR-54040
Alexey Kuznetsov (committer: Konstantin Khorenko)
fbdab838e5b fuse: do not accelerate writes with unknown dirty state
If we have no sync seq numbers we cannot force dirty status. So, route writes via slow path. When CSes reply with dirty seqs, we will be able to make shortcut.
https://pmc.acronis.work/browse/VSTOR-54040
Signed-off-by: Alexey Kuznetsov <kuznet@acronis.com>
Feature: vStorage
Issues: VSTOR-54040