The way of kvm: memory management

Tuấn Anh Phạm
Coccoc Engineering Blog
6 min read · Oct 11, 2021


At Coccoc, we have many VMs running for different purposes: personal sandboxes, builders, backing services for the dev environment… But we only have limited resources for them, so we must pack as many VMs onto our physical servers as possible. As memory is the biggest concern, we apply three techniques: memory overcommitment, KSM, and memory ballooning.

Memory Overcommitment

Most VMs do not use 100% of their available memory all the time, so we can allocate more memory to VMs than the physical hosts actually have. While this technique is great, as it allows us to pack more VMs onto an already crowded host, it should be used with care. If the VMs really do use more RAM than the host can provide, things get ugly: if you have enough swap, some of the running processes will be moved to swap space on disk, which degrades performance greatly; if you don't have enough swap, the host OS may trigger the OOM killer, which destroys VMs or other system processes and may lead to file system errors and/or unbootable guests.

At Coccoc, to reduce the risks of memory overcommitment, we combine it with two other techniques: KSM and memory ballooning. Even though most of our hosts have a memory overcommit rate of 150% to 200%, they rarely have problems with it.
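
To get a rough feel for how overcommitted a host is, the memory allocated to running VMs can be compared against the host's physical RAM. Below is a minimal sketch, assuming libvirt's virsh is available on the host and that only running domains should be counted:

## sum the "actual" (allocated) memory of running domains, reported in KiB,
## and compare it with MemTotal from /proc/meminfo
for vm in $(virsh list --name); do
    virsh dommemstat "$vm" | awk '/^actual/ {print $2}'
done | awk -v host_kb="$(awk '/MemTotal/ {print $2}' /proc/meminfo)" \
    '{alloc += $1} END {printf "overcommit rate: %.0f%%\n", 100 * alloc / host_kb}'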

KSM: kernel samepage merging

Originally created for the KVM hypervisor, this feature allows KVM guests to share identical memory pages, allowing greater guest density when the guests run identical or similar operating systems. We only run Debian VMs on our cluster, so this feature benefits us a lot.
KSM was merged in Linux 2.6.32 and needs to be enabled via CONFIG_KSM=y:

# grep KSM /boot/config-$(uname -r)
CONFIG_KSM=y

KSM stores its status information in files under /sys/kernel/mm/ksm directory:

  • pages_shared: how many shared pages are being used.
  • pages_sharing: how many sites are currently sharing them.
  • full_scans: number of full scans run.
  • pages_unshared: pages that are unique but repeatedly checked for merging.
  • pages_volatile: pages changing too fast to be scanned.
  • merge_across_nodes: whether pages from different NUMA nodes can be merged.
  • run: whether the ksmd process is running.
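
A quick way to dump all of these counters at once (grep prints each file name next to its value):

# grep . /sys/kernel/mm/ksm/*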

To make ksm run:

echo 1 > /sys/kernel/mm/ksm/run
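
How aggressively ksmd scans can be tuned through the same directory. The values below are only illustrative, not our production settings:

echo 200 > /sys/kernel/mm/ksm/pages_to_scan    ## pages scanned per wake-up
echo 50 > /sys/kernel/mm/ksm/sleep_millisecs   ## pause between scan batches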

Let's see how much KSM helps us save memory, using a real host as an example:

## mem infos on the host
[root@virt5v.dev:~]# free -mh
              total        used        free      shared  buff/cache   available
Mem:           94Gi        89Gi       572Mi       937Mi       4.7Gi       3.4Gi
## sum up mem usage of VMs running on this host
[oneadmin@virt-frontend:~]$ onevm list -k -s UMEM=20 --filter HOST=virt5v.dev.itim.vn |\
awk 'BEGIN {mem=0} $5 == "runn" { mem+=$7 } END {print mem/(1024*1024)}'
97.2861
## number of VMs running
[oneadmin@virt-frontend:~]$ onevm list --filter HOST=virt5v.dev.itim.vn | awk '$5 == "runn"' | wc -l
18

Command explained: we use OpenNebula as our cloud platform, so the onevm command is run on the OpenNebula frontend to collect metrics, in this case memory usage on host virt5v; -k converts values to kilobytes, and the UMEM field is given a width of 20. awk filters only the VMs in the running state, sums up their memory usage, and finally prints the result in GiB.

Here it gets interesting: the total memory usage (real usage, not memory allocation) of the running VMs is greater than the total host memory. Let's dig deeper into the KSM stats:

[root@virt5v.dev:/sys/kernel/mm/ksm]# grep . pages_shared pages_sharing 
pages_shared:1700529
pages_sharing:3574747

KSM saved 3574747 - 1700529 = 1874218 pages on this host. With each page being 4 KB, that means 1874218 * 4 / (1024*1024) = 7.149 GiB. We can still improve that number by packing more identical OSes onto the same host; for example, host virt5v.dev only deploys VMs running Debian 10.
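
The same arithmetic can be done straight from the counters. A small sketch, using the formula above and assuming a 4 KB page size:

[root@virt5v.dev:/sys/kernel/mm/ksm]# awk 'NR==1{shared=$1} NR==2{sharing=$1} END{printf "%.3f GiB saved\n", (sharing-shared)*4/(1024*1024)}' pages_shared pages_sharing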

The trade-off: of course, there's no free lunch, especially when it comes to resources. Once you save one resource, you must spend another to compensate. In KSM's case, it's CPU power: the running ksmd thread may take up an entire core, as it needs to scan memory pages to find identical ones.

ksmd CPU consumption

In the picture above, you can see the ksmd process with a CPU usage of 73%. This is a price we're willing to pay: our VMs are not CPU-bound, so we can trade some CPU power for memory capacity.
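
If you want to spot-check this on a host without a monitoring dashboard, one simple way (assuming procps' top and pgrep are available) is:

## show the current cpu usage of the ksmd kernel thread
# top -b -n 1 -p "$(pgrep -x ksmd)" | tail -n 2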

Memory ballooning

Memory ballooning is a technique that allows the hypervisor to reclaim unused memory from VMs. This enables memory overcommitment, where the total RAM allocated to guest VMs is more than the physical host's RAM. When the physical host runs low on memory, it reclaims unused memory from the guests using the balloon device.

To check whether a guest VM has the ballooning device ready:

# lspci | grep balloon
00:06.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
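
The balloon driver must also be active inside the guest; on our Debian guests it is the virtio_balloon driver, and if it is built as a module a quick check is:

# lsmod | grep virtio_balloon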

I will demonstrate taking back unused memory from a VM to the physical host. First, let's view the memory stats on my sandbox:

[root@pta-sb:~]# free -mh
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       1.0Gi       1.7Gi        41Mi       1.1Gi       2.6Gi

The same sandbox from the host's point of view:

[root@virt2v.dev:~]# virsh dommemstat one-997
actual 4194304
unused 1804460
available 4031520
usable 2683184

Now, let's deflate the sandbox's memory.

## deflate sandbox memory
[root@virt2v.dev:~]# virsh setmem --live one-997 3000000
## check again on host
[root@virt2v.dev:~]# virsh dommemstat one-997
actual 3000000
unused 609868
available 2837216
usable 1489048
## check inside sandbox
[root@pta-sb:~]# free -mh
              total        used        free      shared  buff/cache   available
Mem:          2.7Gi       1.0Gi       596Mi        41Mi       1.1Gi       1.4Gi
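
When the host pressure is gone, the same command can hand the memory back, up to the VM's original allocation. A sketch, reusing the sandbox above:

## inflate the sandbox back to its original 4 GiB allocation
[root@virt2v.dev:~]# virsh setmem --live one-997 4194304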

Using the ballooning technique may impact performance, especially when the host is low on memory and multiple VMs are asking for their memory back at the same time. Excessive ballooning can impair application performance, or manifest as high disk I/O or latency. But the benefits, namely resource optimization, greater memory capacity, and lower cost, outweigh the risks.

oVirt's MOM

Those techniques are great, but applying them is not an easy task: when do we get KSM to run, when do we take back memory from the VMs, which VMs get deflated, and how much RAM is taken back? Luckily, we have MOM (Memory Overcommitment Manager) from oVirt to cover all of that.
MOM is a service that collects data from multiple sources, feeds it to an evaluation engine and, depending on pre-defined policies, takes actions to trigger KSM or memory ballooning, or both.
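A quick way to confirm MOM is alive on a host is to check its service and follow its log. This assumes a standalone installation where the daemon runs as the momd service and logs to /var/log/mom.log, as on our hosts:

# systemctl is-active momd
# tail -f /var/log/mom.log
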
Reading the MOM log reveals what it did to manage the host memory. For example, let's see how it works on one of our hosts:

2021-07-14 02:32:08,106 - mom.Controllers.Balloon - INFO - Ballooning guest:one-1217 from 14981844 to 14928458
2021-07-14 02:32:18,348 - mom.Controllers.Balloon - INFO - Ballooning guest:one-301 from 2038364 to 2031658
2021-07-14 02:32:28,571 - mom.Controllers.Balloon - INFO - Ballooning guest:one-301 from 2031660 to 2026053
2021-07-14 02:32:28,622 - mom.Controllers.Balloon - INFO - Ballooning guest:one-1217 from 14928460 to 14874355
2021-07-14 03:57:11,271 - mom.Controllers.Balloon - INFO - Ballooning guest:one-301 from 2026056 to 2097152
2021-07-14 03:57:11,298 - mom.Controllers.Balloon - INFO - Ballooning guest:one-1217 from 14874356 to 15768747
2021-07-14 03:59:03,987 - mom.Controllers.Balloon - INFO - Ballooning guest:one-1217 from 15768748 to 16557185
2021-07-14 03:59:14,047 - mom.Controllers.Balloon - INFO - Ballooning guest:one-1217 from 16557188 to 16777216
2021-07-14 04:00:56,636 - mom.Controllers.Balloon - INFO - Ballooning guest:one-1217 from 16777216 to 16717574
2021-07-14 04:01:17,097 - mom.Controllers.Balloon - INFO - Ballooning guest:one-1217 from 16717576 to 16650394

The raw numbers are hard for us humans to read, so let's add a simple filter to make them more readable:

# head /var/log/mom.log | awk '{print $9,$NF - $(NF-2)}'
guest:one-1217 -53386
guest:one-301 -6706
guest:one-301 -5607
guest:one-1217 -54105
guest:one-301 71096
guest:one-1217 894391
guest:one-1217 788437
guest:one-1217 220028
guest:one-1217 -59642
guest:one-1217 -67182

Command explained: I use awk to print the guest name and the amount of guest memory being changed (in KB); a negative value means the host took memory back from the guest (deflating), while a positive value indicates the host returned memory to the guest (inflating).
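
The same log can also be aggregated per guest to see the net effect over a longer window; a small sketch extending the filter above:

# awk '/Ballooning guest/ {delta[$9] += $NF - $(NF-2)} END {for (g in delta) printf "%s %+d KB\n", g, delta[g]}' /var/log/mom.log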
