Session notes from the Virtualization microconf at the 2012 LPC

The Linux Plumbers Conf wiki seems to have made the discussion notes for the 2012 conf read-only as well as visible only to people who have logged in.  I suspect this is due to the spam problem, but I'll put those notes here so that they're available without needing a login.  The source is here.

These are the notes I took during the virtualization microconference at the 2012 Linux Plumbers Conference.

Virtualization Security Discussion - Paul Moore

Slides

  • threats to virt system
  • 3 things to worry about
    • attacks from host - has full access to guest
    • attacks from other guests on the host
      • break out from guest, attack host and other guests (esp. in multi-tenant situations)
    • attacks from the network
      • traditional mitigation: separate networks, physical devices, etc.
  • protecting guest against malicious hosts
  • host has full access to guest resources
  • host has ability to modify guest stuff at will; w/o guest knowing it
  • how to solve?
    • no real concrete solutions that are perfect
    • guest needs to be able to verify / attest host state
      • root of trust
    • guests need to be able to protect data when offline
      • (discussion) encrypt guests – internally as well as qcow2 encryption
  • decompose host
    • (discussion) don't run services as root
  • protect hosts against malicious guests
  • just assume all guests are going to be malicious
  • more than just qemu isolation
  • how?
    • multi-layer security
    • restrict guest access to guest-owned resources
    • h/w passthrough – make sure devices are tied to those guests
    • limit avl. kernel interfaces
      • system calls, netlink, /proc, /sys, etc.
    • if a guest doesn't need an access, don't give it!
  • libvirt+svirt
    • MAC in host to provide separation, etc.
    • addresses netlink, /proc, /sys
  • (discussion) aside: how to use libvirt w/o GUI?
    • there is 'virsh', documentation can be improved.
  • seccomp
    • allows syscalls to be selectively turned off; addresses the syscalls item in the list above (minimal sketch after these notes).
  • priv separation
    • libvirt handles n/w, file desc. passing, etc.
  • protecting guest against hostile networks
  • guests vulnerable directly and indirectly
  • direct: buggy apache
  • indirect: host attacked
  • qos issue on loaded systems
  • host and guest firewalls can solve a lot of problems
  • extend guest separation across network
    • network virt – for multi-tenant solutions
    • guest ipsec and vpn services on host
  • (discussion) blue pill vulnerability – how to mitigate?
    • lot of work being done by trusted computing group - TPM
    • maintain a solid root of trust
  • somebody pulling the rug out from under you; can happen even after boot
  • you'll need h/w support?
    • yes, TPM
    • UEFI, secure boot
  • what about post-boot security threats?
    • let's say booted securely. other mechanisms you can enable - IMA – extends root of trust higher. signed hashes, binaries.
    • unfortunately, details beyond scope for a 20-min talk
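
The seccomp item above is terse, so here is a minimal sketch of the mechanism it refers to: a seccomp-BPF filter that disallows a single syscall. This is only an illustration (QEMU's real filter whitelists a large set of syscalls, and a real filter should also check the architecture field); it assumes Linux 3.5+ headers and skips most error handling.

    /* minimal seccomp-BPF sketch: kill the process if it calls ptrace() */
    #include <stdio.h>
    #include <stddef.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>

    int main(void)
    {
        struct sock_filter filter[] = {
            /* load the syscall number from struct seccomp_data */
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
            /* if it is ptrace(), fall through to KILL; otherwise skip to ALLOW */
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(filter) / sizeof(filter[0]),
            .filter = filter,
        };

        /* lets an unprivileged process install a filter */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
            perror("prctl(SECCOMP_MODE_FILTER)");
        /* from here on, a ptrace() call terminates the process */
        return 0;
    }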

Storage Virtualization for KVM: Putting the pieces together - Bharata Rao

Slides

  • Different aspects of storage mgmt with kvm
    • mainly using glusterfs as storage backend
    • integrating with libvirt, vdsm
  • problems
    • multiple choices for fs and virt mgmt
      • libvirt, ovirt, etc.
  • not many fs'es are virt-ready
    • virt features like snapshots, thin-provisioning, cloning not present as part of fs
    • some things done in qemu: snapshot, block mig, img fmt handling are better handled outside
  • storage integration
    • storage device vendor doesn't have well-defined interfaces
  • gluster has potential: leverage its capabilities, and solve many of these problems.
  • intro on glusterfs
    • userspace distributed fs
    • aggregates storage resources from multiple nodes and presents a unified fs namespace
  • glusterfs features
    • replication, striping, distribution, geo-replication/sync, online volume extension
  • (discussion) why gluster vs ceph?
    • gluster is modular; pluggable, flexible.
    • keeps storage stack clean. only keep those things active which are needed
    • gluster doesn't have a separate metadata server.
      • unfortunately, gluster people not around to answer these questions.
  • by having backend in qemu, qemu can already leverage glusterfs features
    • (discussion) there is a rados spec in qemu already
      • yes, this is one more protocol that qemu will now support
  • glusterfs is modular: details
    • translators: convert requests from users into requests for storage (toy sketch of the idea after these notes)
    • open/read/write calls percolate down the translator stack
      • any plugin can be introduced in the stack
  • current status: enablement work to integrate gluster-qemu
    • start by writing a block driver in qemu to support gluster natively
    • add block device support in gluster itself via block device translator
  • (discussion) do all features of gluster work with these block devices?
    • not yet, early stages. Hope is all features will eventually work.
  • interesting use-case: replace qemu block dev with gluster translators
  • would you have to re-write qcow2?
    • don't need to, many of qcow2 features already exist in glusterfs common code
  • slide showing perf numbers
  • future
    • is it possible to export LUNs to gluster clients?
    • creating a VM image means creating a LUN
    • exploit per-vm storage offload – all this using a block device translator
    • export LUNs as files; also export files as LUNs.
  • (discussion) why not use raw files directly instead of adding all this overhead? This looks like a perf disaster (ip network, qemu block layer, etc.) – combination of stuff increasing latency, etc.
    • all of this is experimentation, to explore areas we haven't thought about yet – explore new opportunities. this is just the initial work; more interesting stuff can be built upon this platform later.
  • libvirt, ovirt, vdsm support for glusterfs added – details in slides
  • (discussion) storage array integration (slide) - question
    • way vendors could integrate san storage into virt stack.
    • we should have capability to use array-assisted features to create lun.
    • from ovirt/vdsm/libvirt/etc.
  • (discussion) we already have this in scsi. why add another layer? why in userspace?
    • difficult, as per current understanding: send commands directly to storage: fast copy from lun-to-lun, etc., not via scsi T10 extensions.
    • these are out-of-band mechanisms, in mgmt path, not data path.
  • why would someone want to do that via python etc.?
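
The translator stack mentioned above is easiest to see with a toy example. The sketch below is not GlusterFS's real xlator API – the types and names are invented – it just shows the idea of a request percolating down a stack of pluggable layers.

    /* toy illustration of a "translator stack"; not GlusterFS's actual API */
    #include <stdio.h>

    struct xlator {
        const char *name;
        int (*writev)(struct xlator *this, int fd, const void *buf, size_t len);
        struct xlator *next;            /* the translator below us */
    };

    /* bottom of the stack: talk to actual storage (stubbed out here) */
    static int posix_writev(struct xlator *this, int fd, const void *buf, size_t len)
    {
        (void)buf;                      /* a real translator would hand this to storage */
        printf("%s: writing %zu bytes to fd %d\n", this->name, len, fd);
        return (int)len;
    }

    /* a pass-through layer that could replicate, stripe, cache, ... */
    static int replicate_writev(struct xlator *this, int fd, const void *buf, size_t len)
    {
        printf("%s: fan out the write to replicas\n", this->name);
        return this->next->writev(this->next, fd, buf, len);   /* percolate down */
    }

    int main(void)
    {
        struct xlator posix = { "storage/posix", posix_writev, NULL };
        struct xlator afr   = { "cluster/replicate", replicate_writev, &posix };

        /* a client write enters at the top and percolates down the stack */
        afr.writev(&afr, 3, "data", 4);
        return 0;
    }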

Next generation interrupt virtualization for KVM - Joerg Roedel

Slides

  • presenting new h/w tech today that accelerates guests
  • current state
    • kvm emulates local apic and io-apic
    • all reads/writes intercepted
    • interrupts can be queued from user or kernel
    • ipi costs high
  • h/w support
    • limited
    • tpr is accelerated by using cr8 register
    • only used by 64 bit guests
  • shiny new feature: avic
  • avic is designed to accelerate the most common interrupt system features
    • ipi
    • tpr
    • interrupts from assigned devs
  • ideally none of those require intercept anymore
  • avic virtualizes apic for each vcpu
    • uses an apic backing page
    • guest physical apic id table, guest logical apic id table
    • no x2apic in first version
  • guest vapic backing page
    • store local apic contents for one vcpu
    • writes to accelerated registers don't cause intercepts
    • writes to non-accelerated registers do
  • accelerated:
    • tpr
    • EOI
    • ICR low
    • ICR high
  • physical apic id table
    • maps guest physical apic id to host vapic pages
    • (discussion) what if guest cpu is not running
      • will be covered later
  • table maintained by kvm
  • logical apic id table
    • maps guest logical apic ids to guest physical apic ids
      • indexed by guest logical apic id
  • doorbell mechanism (rough sketch after these notes)
    • used to signal avic interrupts between physical cpus
      • src pcpu figures out physical apic id of the dest.
      • when dest. vcpu is running, it sends doorbell interrupt to physical cpu
  • iommu can also send doorbell messages to pcpus
    • iommu checks if vcpu is running too
    • for not running vcpus, it sends an event log entry
  • imp. for assigned devices
  • msr can also be used to issue doorbell messages by hand – for emulated devices
  • running and not running vcpus
    • doorbell only when vcpu running
  • if the target vcpu is not running, sw is notified about a new interrupt for this vcpu
  • support in iommu
    • iommu necessary for avic-enabled device pass-through
    • (discussion) kvm has to maintain? enable/disable on sched-in/sched-out
  • support can mostly be implemented in the kvm-amd module
    • some outside support in apic emulation
    • some changes to lapic emulation
      • change layout
      • kvm x86 core code will allocate vapic pages
      • (discussion) instead of kvm_vcpu_kick(), just run doorbell
  • vapic page needs to be mapped in nested page table
    • likely requires changes to kvm softmmu code
  • open question wrt device passthrough
    • changes to vfio required
      • ideally fully transparent to userspace
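
A rough sketch of the doorbell path described above (referenced from the doorbell item). None of this is real KVM or hardware-interface code – the structure and function names are invented purely to restate the described flow in code form.

    /* invented types/names: conceptual restatement of the AVIC IPI flow */
    #include <stdio.h>

    struct avic_phys_id_entry {
        unsigned int host_apic_id;  /* pCPU the vCPU currently runs on         */
        int          is_running;    /* maintained by KVM on sched-in/sched-out */
    };

    static void send_doorbell(unsigned int host_apic_id)
    {
        /* stands in for the doorbell the hardware (or IOMMU) sends to the pCPU */
        printf("doorbell -> pCPU %u: interrupt injected without a VM exit\n",
               host_apic_id);
    }

    static void kick_vcpu(int guest_apic_id)
    {
        /* stands in for software notification, roughly today's kvm_vcpu_kick() */
        printf("vCPU %d not running: record pending interrupt, wake it up\n",
               guest_apic_id);
    }

    /* guest writes ICR; hardware walks the guest physical APIC ID table */
    static void deliver_ipi(struct avic_phys_id_entry *table, int dest_guest_apic_id)
    {
        struct avic_phys_id_entry *e = &table[dest_guest_apic_id];

        if (e->is_running)
            send_doorbell(e->host_apic_id);  /* fast path: no intercept       */
        else
            kick_vcpu(dest_guest_apic_id);   /* slow path: software takes over */
    }

    int main(void)
    {
        struct avic_phys_id_entry table[2] = { { 0, 1 }, { 1, 0 } };
        deliver_ipi(table, 0);   /* destination vCPU is running       */
        deliver_ipi(table, 1);   /* destination vCPU is scheduled out */
        return 0;
    }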

Reviewing Unused and New Features for Interrupt/APIC Virtualization - Jun Nakajima

Slides

  • Intel is going to talk about a similar hardware feature
  • intel: have a bitmap, and they decide whether to exit or not.
  • amd: hardcoded. apic timer counter, for example.
  • q to intel: do you have other things to talk about?
    • yes, coming up later.
  • paper on 'net showed perf improved from 5-6Gig/s to wire speed, using emulation of this tech.
  • intel have numbers on their slides.
  • they used sr-iov 10gbe; measured vmexit
  • interrupt window: when the hypervisor wants to inject an interrupt, the guest may not be running or ready for it; the hyp. has to enter the vm, and when the guest is ready to receive the interrupt it comes back with a vmexit. problem: the more interrupts you need to inject, the more vmexits, and the busier the guest gets. so: they wanted to eliminate these exits.
    • read case: if you have something in advance (apic page), hyp can just point to that instead of this exit dance
    • more than 50% exits are interrupt-related or apic related.
  • new features for interrupt/apic virt
    • reads are redirected to apic page
    • writes: vmexit after the write completes; not intercepted. no need for emulation.
  • virt-interrupt delivery
    • extend tpr virt to other apic registers
    • eoi - no need for vm exits (using new bitmap; conceptual sketch after these notes)
      • this looks different from amd
    • but for eoi behaviour, intel/amd can have common interface.
  • intel/amd comparing their approaches / features / etc.
    • most notably, intel have support for x2apic, not for iommu. amd have support for iommu, not for x2apic.
  • for apic page, approaches mostly similar.
  • virt api can have common infra, but data structures are totally different. intel spec will be avl. in a month or so (update: already available now). amd spec should be avl. in a month too.
  • they can remove interrupt window, meaning 10% optimization for 6 VM case
  • net result
    • eliminate 50% of vmexits
    • optimization of 10% vmexits.
  • intel also supports x2apic h/w.
  • VMFUNC
    • this can hide info from other vcpus
    • secure channel between guest and host; can do whatever hypervisor wants.
    • vcpu executes vmfunc instruction in special thread
  • usecases:
    • allow hvm guests to share pages/info with hypervisor in secure fashion
  • (discussion) why not just add to ept table
  • (discussion) does intel's int. virt. have an iommu component too?
    • doesn't want to commit.
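
Since both vendors were describing when an EOI still needs hypervisor help, here is a conceptual sketch of the bitmap decision mentioned above. It is illustrative only – this is not Intel's data layout and not KVM code.

    /* invented layout: one bit per interrupt vector deciding "exit or not" */
    #include <stdio.h>

    #define NR_VECTORS 256
    static unsigned char eoi_exit_bitmap[NR_VECTORS / 8];

    static void set_exit_on_eoi(int vec) { eoi_exit_bitmap[vec / 8] |= 1 << (vec % 8); }
    static int  exit_on_eoi(int vec)     { return (eoi_exit_bitmap[vec / 8] >> (vec % 8)) & 1; }

    /* what conceptually happens when the guest writes EOI for vector 'vec' */
    static void guest_eoi(int vec)
    {
        if (exit_on_eoi(vec))
            printf("vector %d: VM exit, hypervisor completes the EOI\n", vec);
        else
            printf("vector %d: EOI virtualized in place, no VM exit\n", vec);
    }

    int main(void)
    {
        set_exit_on_eoi(33);   /* e.g. a level-triggered vector that needs help */
        guest_eoi(33);
        guest_eoi(48);         /* handled without leaving the guest */
        return 0;
    }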

CoLo - Coarse-grained Lock-stepping VM for non-stop service - Will Auld

Slides

  • non-stop service with VM replication
    • client-server
    • Compare and contrast with Remus - Xen's solution
      • xen: remus
        • buffers responses until checkpoint to secondary server completes (once per epoch)
        • resumes secondary only on failover
        • failover at anytime
      • xen: colo
        • runs two VMs in parallel comparing their responses, checkpoints only on miscompare
        • resumes after every checkpoint
        • failover at anytime
  • CoLo runs VMs on primary and secondary at same time.
    • both machines respond to requests; they check for similarity. When they agree, one of the responses is sent to the client
  • diff. between two models:
    • remus: assumes machine states have to be the same. This is the reason to buffer responses until the checkpoint has completed.
    • in colo, no such requirement; the only requirement is that the request stream must be the same.
  • CoLo non-stop service focus on server response, not internal machine state (since multiprocessor environment is inherently nondeterministic)
  • there's heartbeat, checkpoint
  • colo managers on both machines compare responses (conceptual sketch after these notes).
    • when they're not the same, CoLo does a checkpoint.
  • (discussion) why will response be same?
    • int. machine state shouldn't matter for most responses.
    • some exceptions, like tcp/ip timestamps.
    • minor modifications to tcp/ip stacks
      • coarse grain time stamp
      • highly deterministic ack mechanism
    • even then, state of machine is dissimilar.
  • resume of machine on secondary node:
    • another stumbling block.
  • (slides) graph on optimizations
  • how do you handle disk access? network is easier – n/w stack resumes on failover. if you don't do failover in a state where you know disk is in a consistent state, you can get corruption.
    • Two solutions
      • For NAS, do same compares as with responses (this can also trigger checkpoints).
      • On local disks, buffer the original state of changed pages, revert to the original and then checkpoint with the primary node's disk writes included. This is equivalent to how the memory image is updated. (This was not described completely enough during the session).
  • that sounds dangerous. client may have acked data, etc.
    • will have to look closer at this. (More complete explanation above counters this)
  • how often do you get mismatch?
    • depends on workload. some runs saw something like 300-400 good packets, then a mismatch.
  • during that, are you vulnerable to failure?
    • no, can failover at any point. internal state doesn't matter. Both VMs provide consistent request streams from their initial state and match responses up to the moment of failover.
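
A conceptual sketch of the compare-and-checkpoint decision described above (referenced from the 'colo managers' item). Names and structures are invented; there is no real networking or checkpointing here, it just restates the decision the CoLo manager makes per outbound packet.

    /* invented types/names: the per-packet decision made by the CoLo manager */
    #include <stdio.h>
    #include <string.h>

    struct pkt { const char *data; size_t len; };

    /* the two outbound responses are compared byte for byte */
    static int responses_match(const struct pkt *p, const struct pkt *s)
    {
        return p->len == s->len && memcmp(p->data, s->data, p->len) == 0;
    }

    static void colo_handle_output(const struct pkt *primary, const struct pkt *secondary)
    {
        if (responses_match(primary, secondary)) {
            /* internal machine state may differ, but the client can't tell */
            printf("match: release the primary's response to the client\n");
        } else {
            /* outputs diverged: checkpoint primary state onto the secondary,
             * resume both VMs, then release the response */
            printf("mismatch: trigger a checkpoint, then release\n");
        }
    }

    int main(void)
    {
        struct pkt a = { "HTTP/1.1 200 OK", 15 };
        struct pkt b = { "HTTP/1.1 200 OK", 15 };
        struct pkt c = { "HTTP/1.1 500 ..", 15 };
        colo_handle_output(&a, &b);   /* identical responses */
        colo_handle_output(&a, &c);   /* divergence detected */
        return 0;
    }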

NUMA - Dario Faggioli, Andrea Arcangeli

NUMA and Virtualization, the case of Xen - Dario Faggioli

Slides

  • Intro to NUMA
    • access costs to memory differ, based on which processor accesses it
    • remote mem is slower
  • in context of virt, want to avoid accessing remote memory
  • what we used to have in xen
    • on creation of VM, memory was allocated on all nodes
  • to improve: automatic placement (toy sketch after these notes)
    • at vm1 creation time, pin vm1 to first node,
    • at vm2 create time, pin vm2 to second node since node1 already has a vm pinned to it
  • then they went a bit further, because pinning had drawbacks
    • inflexible
    • lots of idle cpus and memory
  • what they will have in xen 4.3
    • node affinity
      • instead of static cpu pinning, preference to run vms on specific cpus
  • perf evaluation
    • specjbb in 3 configs (details in slides)
    • they get 13-17% improvements with 2 vcpus in each vm
  • open problems
    • dynamic memory migration
    • io numa
      • take into consideration io devices
    • guest numa
      • if vm bigger than 1 node, should guest be aware?
    • ballooning and sharing
      • sharing could cause remote access
      • ballooning causes local pressures
    • inter-vm dependencies
    • how to actually benchmark and evaluate perf to tell whether they're improving
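
A toy version of the automatic placement idea above (referenced from the 'automatic placement' item): prefer a node that can hold the VM's memory and has the least placed on it already. This is not Xen's actual algorithm, just the shape of the heuristic.

    /* invented heuristic: pick a node that fits the VM and is least loaded */
    #include <stdio.h>

    struct node { unsigned long free_mem_mb; int placed_vcpus; };

    static int pick_node(const struct node *nodes, int nr_nodes, unsigned long vm_mem_mb)
    {
        int best = -1;
        for (int i = 0; i < nr_nodes; i++) {
            if (nodes[i].free_mem_mb < vm_mem_mb)
                continue;                          /* VM wouldn't fit locally */
            if (best < 0 || nodes[i].placed_vcpus < nodes[best].placed_vcpus)
                best = i;                          /* fewer vCPUs already here */
        }
        return best;  /* -1: no single node fits; see the "guest numa" problem above */
    }

    int main(void)
    {
        struct node nodes[2] = { { 8192, 4 }, { 16384, 0 } };
        printf("place a new 4G VM on node %d\n", pick_node(nodes, 2, 4096));
        return 0;
    }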

AutoNUMA - Andrea Arcangeli

Slides

  • solving similar problem for Linux kernel
  • implementation details avail in slides, will skip now
  • components of autonuma design
    • novel approach
      • mem and cpu migration tried earlier using diff. approaches, optimistic about this approach.
    • core design is based on two novel ideas (rough sketch after these notes)
      • introduce numa hinting pagefaults
        • works at thread-level, on thread locality
      • false sharing / relation detection
  • autonuma logic
    • cpu follows memory
    • memory in b/g slowly follows cpu
    • actual migration is done by knuma_migrated
      • all this is async and b/g, doesn't stall memory channels
  • benchmarking
    • developed a new benchmark tool, autonuma-benchmark
      • generic to measure alternative approaches too
    • comparing to gravity measurement
    • put all memory in single node
      • then drop pinning
      • then see how memory spreads by autonuma logic
  • see slides for graphics on convergence
  • perf numbers
    • also includes comparison with alternative approach, sched numa.
    • graphs show autonuma is better than schednuma, which is better than vanilla kernel
  • criticism
    • complex
    • takes more memory
      • takes 12 bytes per page, Andrea thinks it's reasonable.
      • it's done to try to decrease risk of hitting slowdowns (is faster than vanilla already)
    • stddev shows autonuma is pretty deterministic
  • why is autonuma so important?
    • even 2 sockets show differences and improvements.
    • but 8 nodes really shows autonuma shines
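
A very rough restatement in code of the two ideas above (referenced from the 'core design' item): NUMA hinting faults gather locality statistics, and misplaced pages are queued for the background knuma_migrated daemon rather than migrated synchronously. The types and names here are invented; the real kernel code is far more involved.

    /* invented types/names: the gist of a NUMA hinting fault */
    #include <stdio.h>

    struct page_info { int node; int last_access_node; };

    static void queue_for_knuma_migrated(struct page_info *page, int target_node)
    {
        /* the real code queues the page for background migration; no stall here */
        printf("queue page for migration: node %d -> node %d\n",
               page->node, target_node);
    }

    /* called from the artificial "NUMA hinting" page fault */
    static void numa_hinting_fault(struct page_info *page, int faulting_thread_node)
    {
        /* per-thread statistics gathered here drive "CPU follows memory" */
        page->last_access_node = faulting_thread_node;

        /* ... while "memory slowly follows CPU" happens asynchronously */
        if (page->node != faulting_thread_node)
            queue_for_knuma_migrated(page, faulting_thread_node);
    }

    int main(void)
    {
        struct page_info p = { .node = 0, .last_access_node = -1 };
        numa_hinting_fault(&p, 1);   /* a thread on node 1 touched a node-0 page */
        return 0;
    }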

Discussions

  • looks like andrea is focusing on 2 nodes / sockets, not more? looks like it will have a bad impact on bigger nodes
    • points to graph showing 8 nodes
    • on big nodes, distance is more.
    • agrees autonuma doesn't worry about distances
    • currently worries only about convergence
    • distance will be taken as future optimisation
    • same for Xen case
      • access to 2 node h/w is easier
    • as Andrea also mentioned, improvement on 2 node is lower bound; so improvements on bigger ones should be bigger too; don't expect to be worse
  • not all apps just compute; they do io. and they migrate to the right cpu to where the device is.
    • are we moving memory to cpu, or cpu to device, etc… what should the heuristic be?
      • 1st step should be to get cpu and mem right – they matter the most.
      • doing for kvm is great since it's in linux, and everyone gets the benefit.
      • later, we might want to take other tunables into account.
    • crazy things in enterprise world, like storage
    • for high-perf networking, use tight binding, and then autonuma will not interfere.
      • this already works.
    • xen case is similar
      • this is also something that's going to be workload-dependent, so custom configs/admin is needed.
  • did you have a chance to test on AMD Magny-Cours (many more nodes)
    • hasn't tried autonuma on specific h/w
    • more nodes, better perf, since upstream is that much worse.
    • xen
      • he did, and at least placement was better.
      • more benchmarking is planned.
  • suggestion: do you have a way to measure imbalance / number of accesses going to correct node
    • to see if it's moving towards convergence, or not moving towards convergence, maybe via tracepoints
    • essentially to analyse what the system is doing.
    • exposing this data so it can be analysed.
  • using printks right now for development; there's a lot of info – all the info you need to see why the algo is doing what it's doing.
  • good to have in production so that admins can see
    • what autonuma is doing
    • how much is it converging
      • to decide to make it more aggressive, etc.
  • overall, all such stats can be easily exported; it's already avl. via printk, but has to be moved to something more structured and standard.
  • xen case is same; trying to see how they can use perf counters, etc. for statistical review of what is going on, but not precise enough
    • tells how many remote memory accesses are happening, but not from where and to where
    • something more in s/w is needed to enable this information.

One balloon for all - towards unified balloon driver - Daniel Kiper

Slides

  • wants to integrate various balloon drivers avl. in Linux
  • currently 3 separate drivers
    • virtio
    • xen
    • vmware
  • despite impl. differences, their core is similar
    • feature difference in drivers (xen has selfballooning)
    • overall lots of duplicate code
  • do we have an example of a good solution?
    • yes, generic mem hotplug code
    • core functionality is h/w independent
    • arch-specific parts are minimal, most is generic
  • solution proposed
    • core should be hypervisor-independent
    • should co-operate on h/w independent level - e.g mem hotplug, tmem, movable pages to reduce fragmentation
    • selfballooning ready
    • support for hugepages
    • standard api and abi if possible
    • arch-specific parts should communicate with underlying hypervisor and h/w if needed
  • crazy idea
    • replace ballooning with mem hot-unplug support
    • however, ballooning operates on single pages whereas hotplug/unplug works on groups of pages that are arch-dependent.
      • not flexible at all
      • have to use userspace interfaces
        • can be done via udev scripts, which is a better way
  • discussion: does acpi hotplug work seamlessly?
    • on x86 baremetal, hotplug works like this:
      • hotplug mem
      • acpi signals to kernel
      • acpi adds to mem space
      • this is not visible to processes directly
      • has to be enabled via sysfs, by writing an 'online' command to every section that has to be hotplugged (minimal sketch after these notes)
  • is selfballooning desirable?
    • kvm isn't looking at it
    • guest wants to keep mem to itself, it has no interest in making host run faster
    • you paid for mem, but why not use all of it
    • if there's a tradeoff for the guest: you pay less, you get more mem later, etc., guests could be interested.
    • essentially, what is guest admin's incentive to give up precious RAM to host?
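
For the sysfs step in the ACPI hotplug discussion above (the 'minimal sketch' noted there), here is roughly what onlining a hotplugged memory section from userspace looks like, typically done from a udev rule. The section number is just an example and error handling is minimal.

    /* write "online" to a memory section's state file, as a udev rule would */
    #include <stdio.h>

    static int online_memory_section(int section)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/memory/memory%d/state", section);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fputs("online", f);   /* same effect as: echo online > .../state */
        return fclose(f) ? -1 : 0;
    }

    int main(void)
    {
        return online_memory_section(42) ? 1 : 0;   /* 42 is just an example */
    }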

ARM - Marc Zyngier, Stefano Stabellini

Some other notes are on the session etherpad

KVM ARM - Marc Zyngier

Slides

  • ARM architecture virtualization extensions
    • recent introduction in arch
    • new hypervisor mode PL2
    • traditionally secure state and non-secure state
    • Hyp mode is in non-secure side
  • higher privilege than kernel mode
  • adds second stage translation; adds extra level of indirection between guests and physical mem
    • tlbs are tagged by VMID (like EPT/NPT)
  • ability to trap accesses to most system registers
  • can handle irqs, fiqs, async aborts
    • e.g. guest doesn't see interrupts firing
  • hyp mode: not a superset of SVC
    • has its own pagetables
    • only stage 1, not 2
    • follows LPAE, the new large physical address extensions.
    • one translation table register
      • so difficult to run Linux directly in Hyp mode
      • therefore they use Hyp mode to switch between host and guest modes (unlike x86)
  • KVM/ARM
    • uses HYP mode to context switch from host to guest and back
    • exits guest on physical interrupt firing
    • access to a few privileged system registers
    • WFI (wait for interrupt)
      • (discussion) WFI is trapped and then we exit to host
    • etc.
    • on guest exit, control restored to host
    • no nesting; arch isn't ready for that.
  • MM
    • host in charge of all MM
    • has no stage2 translation itself (saves tlb entries)
    • guests are in total control of their own (stage 1) page tables
    • becomes easy to map a real device into the guest physical space
    • for emulated devices, accesses fault, generates exit, and then host takes over
    • 4k pages only
  • instruction emulation
    • trap on mmio
    • most instructions described in HSR
    • added complexity due to having to handle multiple ISAs (ARM, Thumb)
  • interrupt handling
    • redirect all interrupts to hyp mode only while running a guest. This only affects physical interrupts.
    • leave it pending and return to host
    • will the pending int kick in when it returns to guest mode?
      • No, it will be handled in host mode. Basically, we use the redirection to HYP mode to exit the guest, but keep the handling on the host.
  • inject two ways
    • manipulating arch. pins in the guest?
      • The architecture defines virtual interrupt pins that can be manipulated (VI→I, VF→F, VA→A). The host can manipulate these pins to inject interrupts or faults into the guest.
  • using virtual GIC extensions,
  • booting protocol
    • if you boot in HYP mode, and if you enter a non-kvm kernel, it gracefully goes back to SVC.
    • if kvm-enabled kernel is attempted to boot into, automatically goes into HYP mode
    • If a kvm-enabled kernel is booted in HYP mode, it installs a HYP stub and goes back to SVC. The only goal of this stub is to provide a hook for KVM (or another hypervisor) to install itself.
  • current status
    • pending: stable userspace ABI
    • pending: upstreaming
      • stuck on reviewing

Xen ARM - Stefano Stabellini

Slides

  • Why?
    • arm servers
    • smartphones
    • 288 cores in a 4U rack – causes a serious maintenance headache
  • challenges
    • traditional way: port xen, and port hypercall interface to arm
    • from Linux side, using PVOPS to modify setpte, etc., is difficult
  • then, the armv7 virtualization extensions came.
  • design goals
    • exploit h/w as much as possible
    • limit to one type of guest
      • (x86: pv, hvm)
      • no pvops, but pv interfaces for IO
    • no qemu
      • lots of code, complicated
    • no compat code
      • 32-bit, 64-bit, etc., complicated
    • no shadow pagetables
      • most difficult code to read ever
  • NO emulation at all!
  • one type of guest
    • like pv guests
      • boot from a user supplied kernel
      • no emulated devices
      • use PV interfaces for IO
    • like hvm guests
      • exploit nested paging
      • same entry point on native and xen
      • use device tree to discover xen presence
      • simple device emulation can be done in xen
        • no need for qemu
  • exploit h/w
    • running xen in hyp mode
    • no pv mmu
    • hypercall
    • generic timer
      • export timer int. to guest
  • GIC: generic interrupt controller
    • int. controller with virt support
    • use GIC to inject event notifications into any guest domains with Xen support
      • x86 taught us this provides a great perf boost (event notifications on multiple vcpus simultaneously)
      • on x86, we had a pci device to inject event notifications into the guest as a legacy interrupt
  • hypercall calling convention
    • hvc (hypercall)
    • pass params on registers
    • hvc takes an immediate argument: 0xEA1 – means it's a xen hypercall (sketch after these notes).
  • 64-bit ready abi (another lesson from x86)
    • no compat code in xen
      • 2600 fewer lines of code
  • had to write a 1500-line patch of mechanical substitutions so that a 32-bit host makes all guests work fine
  • status
    • xen and dom0 boot
    • vm creation and destruction work
    • pv console, disk, network work
    • xen hypervisor patches almost entirely upstream
    • linux side patches should go in next merge window
  • open issues
    • acpi
      • will have to add acpi parsers, etc. in device table
      • linux has 110,000 lines – should all be merged
  • uefi
    • grub2 on arm: multiboot2
    • need to virtualise runtime services
    • so only hypervisor can use them now
  • client devices
    • lack ref arch
    • difficult to support all tablets, etc. in market
    • uefi secure boot (is required by win8)
    • windows 8
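
To make the calling convention above concrete (referenced from the hypercall item), here is a sketch of a 32-bit guest-side hypercall stub: arguments go in registers, and the 0xEA1 immediate on the hvc instruction tags the call as a Xen hypercall. The exact register assignments (r0..r4 for arguments, r12 for the hypercall number) are my recollection of the proposed ABI, so treat them as assumptions and check the Xen headers.

    /* sketch of a 2-argument Xen/ARM hypercall stub (32-bit guest side);
     * register assignments are assumptions -- verify against the real ABI */
    static inline long xen_hypercall_2(unsigned long op,
                                       unsigned long a1, unsigned long a2)
    {
        register unsigned long r0  asm("r0")  = a1;   /* first argument   */
        register unsigned long r1  asm("r1")  = a2;   /* second argument  */
        register unsigned long r12 asm("r12") = op;   /* hypercall number */

        asm volatile("hvc #0xEA1"        /* 0xEA1 = "this is a Xen hypercall" */
                     : "+r" (r0)
                     : "r" (r1), "r" (r12)
                     : "memory");
        return r0;                       /* result comes back in r0 */
    }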

Discussion

  • who's responsible for device tree mgmt for xen?
    • xen takes the dt from hw, changes it for mem mgmt, then passes it to dom0
    • at present, have to build the dt binary
  • at the moment, linux kernel infrastructure doesn't support interrupt priorities.
    • needed to prevent a guest doing a DoS on host by just generating interrupts non-stop
    • xen does support int. priorities in GIC

VFIO - Are we there yet? - Alex Williamson

Slides

  • are we there yet? almost
  • what is vfio?
    • virtual function io
    • not sr-iov specific
    • userspace driver interface
      • kvm/qemu vm is a userspace driver
    • iommu required
      • visibility issue with devices in iommu, guaranteeing devices are isolated and safe to use – different from uio.
    • config space access is done from kernel
      • adds to safety requirement – can't have userspace doing bad things on host
  • what's different from last year?
    • 1st proposal shot down last year, and got revised at last LPC
    • allow IOMMU driver to define device visibility – not per-device, but the whole group exposed
    • more modular
  • what's different from pci device assignment
    • x86 only
    • kvm only
    • no iommu grouping
    • relies on pci-sysfs
    • turns kvm into a device driver
  • current status
    • core pci and iommu drivers in 3.6
    • qemu will be pushed for 1.3
  • what's next?
    • qemu integration
    • legacy pci interrupts
      • more of a qemu-kvm problem, since vfio already supports this, but these are unique since they're level-triggered; host has to mask interrupt so it doesn't cause a DoS till guest acks interrupt
        • like to bypass qemu directly – irqfd for edge-triggered. now exposing irqfd for level
  • (lots of discussion here)
  • libvirt support
    • iommu grps changed the way we do device assignment
    • sysfs entry point; move device to vfio driver
    • do you pass group by file descriptor?
    • lots of discussion on how to do this
    • existing method needs name for access to /sys
    • how can we pass file descriptors from libvirt for groups and containers to work in different security models?
      • The difficulty is in how qemu assembles the groups and containers. On the qemu command line, we specify an individual device, but that device lives in a group, which is the unit of ownership in vfio and may or may not be connectable to other containers. We need to figure out the details here (the sketch after these notes shows the basic group/container/device fd flow).
  • POWER support
    • already adding
  • PowerPC
    • freescale looking at it
    • one api for x86, ppc was strange
  • error reporting
    • better ability to inject AER etc to guest
    • maybe another ioctl interrupt
    • If we do get PCIe AER errors showing up at a device, what is the guest going to be able to do (for instance, can it reset links)?
      • We're going to have to figure this out and it will factor into how much of the AER registers on the device do we expose and allow the guest to control. Perhaps not all errors are guest serviceable and we'll need to figure out how to manage those.
  • better page pinning and mapping
    • gup issues with ksm running in b/g
  • PRI support
  • graphics support
    • issues with legacy io port space and mmio
    • can be handled better with vfio
  • NUMA hinting
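
Since much of the discussion above was about groups, containers and file descriptors, here is a sketch of the basic VFIO flow as described in the kernel's VFIO documentation, trimmed of all error handling. The group number and PCI address are examples; the device must already be bound to the vfio-pci driver.

    /* open a container and a group, wire them up, then get a device fd */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    int main(void)
    {
        int container = open("/dev/vfio/vfio", O_RDWR);
        int group     = open("/dev/vfio/26", O_RDWR);   /* the IOMMU group */

        /* the group, not the device, is the unit of ownership */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* only now can an individual device inside the group be opened */
        int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

        struct vfio_device_info info = { .argsz = sizeof(info) };
        ioctl(device, VFIO_DEVICE_GET_INFO, &info);
        /* info.num_regions / info.num_irqs describe what can be mapped, etc. */
        return 0;
    }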

Semiassignment: best of both worlds - Alex Graf

Slides

  • b/g on device assignment
  • best of both worlds
    • assigned device during normal operation
    • emulated during migration
  • xen solution – prototype
    • create a bridge in domU
    • guest sees a pv device and a real device
    • guest changes needed for bridge
    • migration is guest-visible, since real device goes away and comes back (hotplug)
      • security issue if VM doesn't ack hot-unplug
  • vmware way
    • writing a new driver for each n/w device they want to support
    • this new driver calls into vmxnet
    • binary blob is mapped into your address space
    • migration is guest exposed
      • new blob needed for destination n/w card
  • alex way
    • emulate real device in qemu
    • e.g. expose emulated igbvf if passing through igbvf
    • need to write migration code for each adapter as well
  • demo
    • doesn't quite work right now
  • is it a good idea?
  • how much effort really?
    • doesn't think it's much effort
    • current choices in datacenters are igbvf and <something else>
      • that's not true!
      • easily a dozen adapters avl. now
      • lots of examples given why this claim isn't true
        • no one wants a single-vendor/card dependency across an entire datacenter
  • non-deterministic network performance
  • more complicated network configuration
  • discussion
    • Another solution suggested by Benjamin Herrenschmidt: use s3; remove 'live' from 'live migration'.
    • AER approach
  • General consensus was to just do bonding+failover

KVM performance: vhost scalability - John Fastabend

Slides

  • current situation: one kernel thread per vhost
  • if we create a lot of VMs and a lot of virtio-net devices, perf doesn't scale
  • not numa aware
  • Main grouse is it doesn't scale.
  • instead of having a thread for every vhost device, create a vhost thread per cpu
  • add some numa-awareness scheduling – pick best cpu based on load
  • perf graphs
    • for 1 VM, as the number of netperf instances increases, per-cpu-vhost doesn't shine.
    • another tweak: use 2 threads per cpu: perf is better
  • for 4 VMs, results are good for 1-thread, much better than 2-thread (2-thread does worse than current); with 4 VMs, per-cpu-vhost was nearly equivalent.
  • on 12 VMs, 1-thread works better, and 2-thread works better than baseline. per-cpu vhosts shine here, outperforming baseline and the 1-thread/2-thread cases.
  • tried tcp, udp, inter-guest, all netperf tests, etc.
    • this result is pretty standard for all the tests they've done.
  • RFC
    • should they continue?
    • strong objections?
  • discussion
    • were you testing with raw qemu or libvirt?
      • as libvirt creates its own cgroups, and that may interfere.
    • pinning vs non-pinning
      • gives similar results
  • no objections!
  • in a cgroup - roundrobin the vhost threads – interesting case to check with pinning as well.
  • transmit and receive interfere with each other – so perf improvement was seen when they pinned transmit side.
  • try this on bare-metal.

Network overlays - Vivek Kashyap

  • want to migrate machines from one place to another in a data center
    • don't want physical limitations (programming switches, routers, mac addr, etc)
  • idea is to define a set of tunnels which are overlaid on top of networks
    • vms migrate within tunnels, completely isolated from physical networks
  • Scaling at layer 2 is limited by the need to support broadcast/multicast over the network
  • overlay networks
    • when migrating across domains (subnets), have to re-number IP addresses
      • when migrating need to migrate IP and MAC addresses
      • When migrating across subnets might need to re-number or find another mechanism
    • solution is to have a set of tunnels
    • every end-user can view their domain/tunnel as a single virtual network
      • they only see their own traffic, no one else can see their traffic.
  • standardization is required
    • being worked on at IETF
    • MTU seen by the VM is not the same as what is on the physical network (because of headers added by extra layers)
    • vxlan adds udp headers
    • one option is to have a larger physical MTU so it takes care of this; otherwise there will be fragmentation (see the arithmetic after these notes)
      • Proposal
        • If guest does pathMTU discovery let tunnel end point return the ICMP error to reduce the guest's view of the MTU.
        • Even if the guest has not set the DF (don't fragment) bit, return an ICMP error. The guest will handle the ICMP error and update its view of the MTU on the route.
          • have the hypervisor co-operate so guests do path MTU discovery and things work fine
          • no guest changes needed, only hypervisor needs small change
  • (discussion) Cannot assume much about guests; guests may not handle ICMP.
  • Some way to avoid flooding
    • extend to support an 'address resolution module'
    • Stephen Hemminger supported the proposal
  • Fragmentation
    • can't assume much about guests; they may not like packets getting fragmented if they set DF
    • fragmentation highly likely since new headers are added
      • The above comment is wrong, since if DF is set we do pathMTU discovery and the packet won't be fragmented. Also, any fragmentation happens on the tunnel; the VMs don't see the fragmentation, but it is not performant to fragment and reassemble at the end points.
      • Instead the proposal is to use pathMTU discovery to make the VMs send packets that won't need to be fragmented.
  • PXE, etc., can be broken
  • Distributed Overlay Ethernet Network
    • DOVE module for tunneling support
      • use 24-bit VNI
  • patches should be coming to netdev soon enough.
  • possibly using checksum offload infrastructure for tunneling
  • question: layer 2 vs layer 3
    • There is interest in the industry to support overlay solutions for layer 2 and layer 3.
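
The MTU point above, made concrete (referenced from the 'one option' item): a VXLAN tunnel adds an outer IP, UDP and VXLAN header plus the inner Ethernet header, so on a standard 1500-byte physical network the guest effectively has about 50 bytes less to play with unless the physical MTU is raised.

    /* back-of-the-envelope VXLAN overhead for an IPv4 underlay */
    #include <stdio.h>

    int main(void)
    {
        int phys_mtu  = 1500;
        int outer_ip  = 20, outer_udp = 8, vxlan_hdr = 8, inner_eth = 14;
        int overhead  = outer_ip + outer_udp + vxlan_hdr + inner_eth;   /* 50 */

        printf("guest-visible MTU: %d\n", phys_mtu - overhead);         /* 1450 */
        printf("or raise the physical MTU to %d to keep 1500 inside\n",
               phys_mtu + overhead);
        return 0;
    }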

Lightning talks

QEMU disaggregation - Stefano Stabellini

Slides

  • dom0 is a privileged VM
  • better model is to split dom0 into multiple service VMs
    • disk domain, n/w domain, everything else
      • no bottleneck, better security, simpler
  • hvm domain needs device model (qemu)
  • wouldn't it be nice if one qemu does only disk emulation
    • second does network emulation
    • etc.
  • to do this, they moved the pci decoder into xen
    • traps on all pci requests
    • hypervisor de-multiplexes to the 'right' qemu
  • open issues
    • need flexibility in qemu to start w/o devices
    • modern qemu better
      • but: always uses PCI host bridge, PIIX3, etc.
    • one qemu uses this all, others have functionality, but don't use it
  • multifunction devices
    • PIIX3
  • internal dependencies
    • one component pulls others
      • vnc pulls keyboard, which pulls usb, etc.
  • it is in demoable state

Xenner -- Alex Graf

  • intro
    • guest kernel module that allows a xen pv kernel to run on top of kvm – messages to xenbus go to qemu
  • is anyone else interested in this at all?
    • xen folks last year did show interest for a migration path to get rid of pv code.
    • xen is still interested, but not in the short term – a few years out.
    • do you guys want to work together and get it rolling?
      • no one commits to anything right now