I’ve been thinking about these problems for a while. Previously, I thought that the “put a VM on it” approach was the right one. In 2015, I wrote novm [1], which I think served as inspiration for some developments that followed. My thinking has changed over the years and I actually work on gVisor today (disclaimer!). I’d like to share some thoughts here.
Hypervisors never left. They are a fundamental building block for infrastructure and will continue to be.
The question is whether there will be a broad shift to start relying on hypervisors to isolate every individual application. In my opinion, just wrapping containers in VMs is not a solution. (Nor do I find it technologically interesting, but that’s me.) I agree that the approach addresses some of the challenges of isolation, but it’s one step forward, two steps back in other ways.
Virtualizing at the hardware boundary lets you do some things very well. For example, device state is simple, and hardware support lets you track dirty memory passively and efficiently, so you can implement live migration for virtual machines much better than you could for processes. It can divide big machines into fungible, commodity sizes (freeing applications from having to care about NUMA, etc.). It lets you pass through and assign hardware devices. It gives you a strong security baseline.
But abstractions work best when they are faithful. Virtual machines operate on virtual CPUs, memory and devices, and operating systems work best when those abstractions behave like the real thing. That is, CPUs and memory are mostly available, and hardware acts like hardware (works independently, interactions don’t stall).
Containers and applications operate on OS-level abstractions: threads, memory mappings, futexes, etc. These abstractions are the basis for container efficiency — not because startup time is fast, but because these abstractions allow for a lot of statistical multiplexing and over-subscription while still performing well. The abstractions provide a lot of visibility for the OS to make good choices with global information (e.g. informing the scheduler, reclaim policy, etc.).
A problem arises when you decide that you want to bind single applications to single VMs, and then run many VMs instead of many containers. Effectively, the abstractions that you expose are now CPUs and memory, and these just don’t work as well for over-subscription and overall infrastructure efficiency. There’s no shared scheduler or cooperative synchronization (e.g. in an OS, threads waking each other will be moved to the same core), there’s no shared page cache, etc.
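To make the over-subscription point concrete, here is a toy sketch. The numbers and function names are made up for illustration, and it ignores real mechanisms like ballooning or reclaim. It only shows why sizing each VM for its own peak costs more than letting one OS multiplex the same workloads out of a shared pool.

```python
# Hypothetical memory demand (GiB) for three applications, sampled at four times.
demand = {
    "app_a": [1, 4, 1, 1],
    "app_b": [1, 1, 4, 1],
    "app_c": [4, 1, 1, 1],
}


def per_vm_capacity(demand):
    """Each VM is sized for its own peak, so the host needs the sum of peaks."""
    return sum(max(series) for series in demand.values())


def shared_capacity(demand):
    """A shared pool only has to cover the peak of the combined demand."""
    return max(sum(step) for step in zip(*demand.values()))


print(per_vm_capacity(demand))  # 12
print(shared_capacity(demand))  # 6
```

Because the peaks don't line up in time, the shared pool needs half the capacity in this contrived case. Real workloads won't be this tidy, but the direction of the effect is the same.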
There are other problems too: virtualization gives you a very strong security baseline, but you have to start punching significant holes to get the container semantics you want. E.g. the cited virtfs is a great example: it’s easy to reason about the state of a block device, but an effective FUSE API (and shared memory for metadata) is a much larger system surface. The hardware interface itself is not a silver bullet. Devices are still complex (escapes happen), and the last few years have taught us that even the hardware mechanisms can have flaws. For example, AFAIK Kata Containers is still vulnerable to L1TF unless you’re using instance-exclusive cpusets or have disabled hyper-threading. (Whereas native processes and containers are not vulnerable to this particular bug.)
The “put a VM on it” approach may not have the standard image problems that plain hypervisors have, but you’ve still got portability challenges. It seems non-ideal that a container isolation solution can run in infrastructure X and Y, but not in standard public clouds or your on-prem VMware hosts, etc. (There might be specific technologies for each case, but that’s rather the point.)
That’s my 2c. I’m pretty optimistic that we can have strong isolation while still preserving the efficiency, portability and features of container-based infrastructure. I like a lot of these projects (especially the ones doing technologically interesting things, e.g. nabla, x-containers, virtfs, etc.) but I don’t think the straight-up “put a VM on it” approach is going to get us there.
Hi, completely agree with all of this. In fact, we've been focusing on the problem you mention about needing FS holes for VMs to regain container semantics (https://www.usenix.org/system/files/hotstorage19-paper-kolle...). Just in case, these are some of the container semantics we care about: FS crash consistency, file sharing (write+write), and efficient use of memory due to having a single page cache. The key question is: what's the smallest hole we can poke (smaller than allowing every single FS operation in the host)?