
Session notes from the Virtualization microconf at the 2012 LPC

The Linux Plumbers Conf wiki seems to have made the discussion notes for the 2012 conf read-only as well as visible only to people who have logged in.  I suspect this is due to the spam problem, but I’ll put those notes here so that they’re available without needing a login.  The source is here.

These are the notes I took during the virtualization microconference at the 2012 Linux Plumbers Conference.

About Random Numbers and Virtual Machines

Several applications need random numbers for correct and secure operation.  When ssh-server gets installed on a system, public and private key pairs are generated, and random numbers are needed for this operation.  The same goes for creating a GPG key pair.  Initial TCP sequence numbers are randomized.  Process PIDs are randomized.  Without such randomization, we’d get a predictable set of TCP sequence numbers or PIDs, making it easy for attackers to break into servers or desktops.

 

On a system without any special hardware, Linux seeds its entropy pool from sources like keyboard and mouse input, disk IO, network IO, and any other sources whose kernel modules indicate they are capable of adding to the kernel’s entropy pool (i.e., the interrupts they receive are from sufficiently non-deterministic sources).  For servers, keyboard and mouse inputs are rare (most don’t even have a keyboard / mouse connected).  This makes getting true random numbers difficult: applications requesting random numbers from /dev/random (for example, to create ssh keys, typically during firstboot) may have to wait for indefinite periods to get the randomness they need.

 

Applications that need random numbers instantaneously, but can make do with slightly lower-quality random numbers, have the option of getting their randomness from /dev/urandom, which doesn’t block to serve random numbers; it’s just not guaranteed that the numbers one receives from /dev/urandom truly reflect pure randomness.  Indiscriminate reading of /dev/urandom will reduce the system’s entropy levels, and will starve applications that need true random numbers.  Random numbers in a system are a rare resource, so applications should only fetch them when they are needed, and only read as many bytes as needed.
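As a minimal sketch of “read only as many bytes as needed”, a Python program can ask the kernel’s non-blocking source for an exact byte count.  The helper name random_token and the 16-byte default are assumptions made for illustration only.

```python
import os


def random_token(num_bytes: int = 16) -> bytes:
    """Return exactly num_bytes of randomness from the kernel's non-blocking source."""
    # os.urandom() does not block the way reads from /dev/random can;
    # we request only the number of bytes we actually need.
    return os.urandom(num_bytes)


# Example: a 16-byte token, printed as hex.
print(random_token(16).hex())
```

Requesting a fixed, small size per call keeps the application from draining more randomness than it uses.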

 

There are a few random number generator devices that can be plugged into computers.  These can be PCI or USB devices, and are fairly popular add-ons on servers.  The Linux kernel has a hwrng (hardware random number generator) abstraction to select an active hwrng device among several that might be present, and to ask the device for random data when the kernel’s entropy pool falls below the low watermark.  The rng-tools package comes with rngd, a daemon that reads random numbers from hwrngs and feeds them into the kernel’s entropy pool.
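To see which hwrng devices the kernel knows about, and which one is currently selected, one can read the hwrng sysfs attributes.  This is only a sketch: it assumes the attributes live under /sys/class/misc/hw_random (rng_available and rng_current), and the function name hwrng_status is made up for this example.

```python
from pathlib import Path

# Assumed sysfs location of the kernel's hwrng abstraction.
HWRNG_SYSFS = Path("/sys/class/misc/hw_random")


def hwrng_status(sysfs_dir: Path = HWRNG_SYSFS) -> dict:
    """Return the available hwrng names and the currently selected one."""

    def read(name: str) -> str:
        try:
            return (sysfs_dir / name).read_text().strip()
        except OSError:
            # Attribute missing or unreadable: treat as "nothing there".
            return ""

    return {
        "available": read("rng_available").split(),
        "current": read("rng_current"),
    }


print(hwrng_status())
```

On a machine without any hwrng, this simply reports an empty list and an empty current device.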

 

Virtual machines are similar to server setups: there is very little going on in a VM’s environment for the guest kernel to source random data.  A server that hosts several VMs may still have a lot of disk and network IO happening as a result of all the VMs it hosts, but a single VM may not be doing enough by itself to generate sufficient entropy for its applications.  One solution, therefore, to sourcing random numbers in VMs is to ask the host for a portion of the randomness it has collected, and feed it into the guest’s entropy pool.  A paravirtualized hardware random number generator exists for KVM VMs.  The device is called virtio-rng, and as the name suggests, the device sits on top of the virtio PV framework.  The Linux kernel gained support for virtio-rng devices in kernel 2.6.26 (released in 2008).  The QEMU-side device was added in the recent 1.3 release.

 

On the host side, the virtio-rng device (by default) reads from the host’s /dev/random and feeds that into the guest.  The source of this data can be modified, of course.  If the host lacks any hwrng, /dev/random is the best source to use.  If the host itself has a hwrng, using input from that device is recommended.
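As an illustration of changing the source, here is a small helper that builds the QEMU arguments for a virtio-rng device backed by a chosen host file.  The option names (an rng-random backend object and a virtio-rng-pci device) are how I understand the QEMU command line, not something taken from this post, so check them against your QEMU version; the helper name and the rng0 id are hypothetical.

```python
def virtio_rng_args(source: str = "/dev/random", rng_id: str = "rng0") -> list:
    """Build QEMU arguments for a virtio-rng device fed from `source`.

    Assumption: QEMU accepts an 'rng-random' backend object and a
    'virtio-rng-pci' device that refers to it by id.
    """
    return [
        "-object", f"rng-random,filename={source},id={rng_id}",
        "-device", f"virtio-rng-pci,rng={rng_id}",
    ]


# Default source, and a host that has its own hwrng exposed as /dev/hwrng.
print(" ".join(virtio_rng_args()))
print(" ".join(virtio_rng_args("/dev/hwrng")))
```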

 

Newer Intel architectures have an instruction, RDRAND, that provides random numbers.  This instruction can be directly exposed to guests.  Guests probe for the presence of this instruction (using CPUID) and use it if available.  This doesn’t need any modification to the guest.  However, there’s one drawback to exposing this instruction to guests: live migration.  If not all hosts in a server farm have the same CPU, live-migrating a guest from a host that exposes this instruction to one that doesn’t will not work.  In this case, virtio-rng can be configured to use RDRAND as its source, and the guest can continue to work as in the previous example.
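Before deciding whether hosts in a farm are uniform, it can help to check each one for the instruction.  The sketch below does not execute CPUID itself; it assumes a Linux host that lists the rdrand flag in /proc/cpuinfo, and the function names are invented for this example.

```python
def cpu_has_flag(flag: str, cpuinfo_text: str) -> bool:
    """Return True if `flag` appears in the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return flag in value.split()
    return False


def host_has_rdrand(path: str = "/proc/cpuinfo") -> bool:
    """Check the running host; returns False if cpuinfo cannot be read."""
    try:
        with open(path) as f:
            return cpu_has_flag("rdrand", f.read())
    except OSError:
        return False


print("RDRAND available:", host_has_rdrand())
```

Running this on every host and comparing the answers gives a quick view of whether exposing RDRAND directly would get in the way of live migration.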

 

It looks like QEMU/KVM is the only hypervisor that has support for exposing a hardware random number generator to guests.  (One could pass through a real hwrng to a guest, but that doesn’t scale and isn’t practical for all situations, e.g. live migration.)  Fedora 19 will have QEMU 1.3, which has the virtio-rng device, and even older guests running on top of F19 will be able to use the device.

 

For more information on virtio-rng, see the QEMU feature page.  LWN.net has an excellent article on random numbers, based on H. Peter Anvin’s talk at LinuxCon EU 2012.

Avi Kivity Stepping Down from the KVM Project

Avi Kivity giving his keynote speech

Avi Kivity announced he is stepping down as (co-)maintainer of the KVM Project at the recently-concluded KVM Forum 2012 in Barcelona, Spain.  Avi wrote the initial implementation of the KVM code back at Qumranet, and has been maintaining the KVM-related kernel and qemu code for about 7 years now.

In his keynote speech, he mentioned he’s founding a startup with a friend, and hopes to create new technology as exciting as KVM.  He also mentioned they’re in stealth mode right now, so questions about the new venture didn’t get any answers.

He returned to the stage on the second day of the Forum to talk about the new memory API work he’s been doing in qemu, and in his typical dry humour, he mentioned he was supposed to vanish in a puff of smoke after his keynote, but the special effects machinery didn’t work, so he was back on stage.  Avi later rued the lack of laughter at this joke, and that made him very sad.  To offer him some consolation, it was pointed out that not everyone knew of his departure, as many had missed his keynote.  He quipped “that’s even worse than not getting laughs”.

His leadership, as well as his humour, will be missed.  Personally, he’s helped me grow during the last few years we’ve worked together.  But I’m sure whatever he’s working on will be something to look forward to, and we’re not really bidding him adieu from the tech world.

Virtualization at the Linux Plumbers Conference 2012

The 2012 edition of the Linux Plumbers Conference concluded recently.  I was there, running the virtualization microconference.  The format of LPC sessions is to have discussions around current as well as future projects.  The key words are ‘discussion’ (not talks — slides are optional!) and ‘current’ and ‘future’ projects — not discussing work that’s already done; rather discussing unsolved problems or new ideas.  LPC is a great platform for getting people involved in various subsystems across the entire OS stack in one place, so any sticky problems tend to get resolved by discussing issues face-to-face.

The virt microconf had A LOT of submissions: 17 topics to be discussed in a standard time slot of 2.5 hours for one microconf track.  I asked for a ‘double track’, making it 5 hours of time for 17 topics.  Still difficult, but by reducing a few topics to ‘lightning talks’, we could get a somewhat decent 20 minutes per topic.  I had to choose between rejecting topics, thus increasing the time each discussion would get, and keeping all the topics and asking people to wrap up in 20 minutes.  I went for the latter: getting more stuff discussed (and hence, more problems / issues ‘out there’) is a better use of time, IMO.  That would also ensure that people stay on-topic and focussed.

There was also a general change in the way microconfs were scheduled this time: the microconfs were not given a complete 2.5-hour slot.  Rather, they were given 3 slots of 45 minutes each.  This helped the schedule pages to show the topics of the microconfs being discussed at that time, so the attendees could pick and choose the discussion they wanted to attend, rather than seeing a generic ‘Virtualization Microconf’ slot.  I think this was a good idea.  Individual microconf owners could request modifications to this scheme, of course, and some microconfs just chose to run the entire session in one slot, or reserved one whole day in a room, etc.  For the virt microconf, I went with six separate slots, scheduled in a way to avoid conflicts with other virt-related topics in other sessions, giving a total of 4.5 hours for 17 topics.

I segregated the CFP submissions so I could schedule related discussions in one slot, to avoid jumping between subjects and to also help concentrate on specifics in an area.  Two submissions, one on security and one on storage, were by themselves, so I clubbed them into one ‘security and storage‘ session.  The others were nicely aligned, so we could have ‘x86‘, ‘MM‘, ‘ARM‘, ‘Networking‘ and ‘lightning talks’ topics in separate slots.  Since there were 4 network-related talks, I asked for a double slot (two 45-min slots back-to-back), and clubbed the lightning talks in the same session, which was scheduled to be the last session for the virt microconf.

Given this, I would say the microconf went quite well: the notes and slides are up at the LPC 2012 virt microconf wiki, and we could get good discussions going for most of the topics, given the time constraints.  Of course, a major benefit of going to conferences is to meet people outside of the sessions, in the hallways and at social events, and the discussions continued there as well.  I had factored this extra time into the ‘reject vs take all of them’ decision mentioned earlier.  From what I heard, the beer at the social events failed to stop technical discussions, so it all worked out for the best.

Each microconf owner (or a representative) had to do a short summary at the end of the LPC, for the benefit of the people not present for some sessions.  I did the virt summary in roughly these words:

We had quite a productive virtualization microconference.  We received a lot of submissions, and accepted them all, which meant we had to limit the time for each discussion in the slots, but we could divide the slots by a general topic, effectively increasing the discussion time for the larger topic.

 

We had a healthy representation from the KVM as well as Xen sides.  For example, in the MM topic, we discussed NUMA awareness for KVM as well as Xen.  Dario Faggioli presented the Xen side, and Andrea Arcangeli spoke on the Linux/KVM side.  Andrea spoke about AutoNUMA. It has been contentious on the mailing lists, and from the Kernel Summit discussions, it looked like some agreement will be reached soon.  Xen uses a similar approach to AutoNUMA, and they would end up pushing the patches soon as well.  Daniel Kiper spoke about integrating the various balloon drivers in the kernel to remove code duplication.

 

Both AMD and Intel publicly announced new hardware features for interrupt virtualization for the first time here, and it was interesting to see them compare notes and find out what the other is doing and how.  For example: do they support IOMMU?  x2apic?  Etc.

 

New ARM architecture support work was presented by Marc Zyngier for the KVM effort, and Stefano Stabellini for the Xen effort.  Much of the work seems to be done, and patches are in a shape to be applied for the next merge window.  There are a few open issues, and they were discussed as well.

 

We had quite a few talks for the networking session.  Alex Williamson spoke about VFIO, which seemed to get mentioned a lot throughout the conference in multiple sessions.  This is a new way of doing device assignment, and progress looks positive, with the kernel side already merged in 3.6, and qemu patches queued up for 1.3.  Alex Graf then talked about ‘semi-assignment’, a way to do device assignment (or pci passthrough) while also getting proper migration support.  The effort involved writing device emulation for each device supported, and the approach wasn’t too popular.  IBM and Intel guys have been doing virtio net scalability testing, and John Fastabend spoke about some optimisations, which were generally well-received.  We should expect patches and more benchmarks soon.  Vivek Kashyap spoke about network overlays, and how creating a tunnel for networks for VMs can help with VM migration across networks.

 

We also had a session on security, by Paul Moore, who gave an overview of the various methods to secure VMs, specifically the new seccomp work.

 

Lastly, we had Bharata Rao talk about introducing a glusterfs backend for qemu, replacing qemu’s block drivers, which gives more flexibility in handling disk storage for VMs.

 

The organisers are collecting feedback, so if you were there, be sure to let them know of your experience, and what we could do better in the coming years.

I’d like to thank the Linux Foundation and the Linux Plumbers Conf organisers for giving me the opportunity to be there and run the virt microconf.

FUDCon Pune: My talk on ‘Linux Virtualization’

My second talk at FUDCon Pune was on Virtualization (slides) on day 2.  While I had registered the talk well in advance, I wasn’t quite sure what really to talk about: should I talk about the basics of virtualization?  Should I talk about what’s latest (coming up in Fedora 16)?  Should I talk about how KVM works in detail?  My first talk on git had gone well, and as expected for this FUDCon, the majority of the participants were students.  Expecting a similar student-heavy audience for the 2nd talk as well, I decided on discussing the basics of the Linux Virt Stack.  Kashyap had a session lined up after me on libvirt, so I thought I could give an overview of virt-manager, libvirt, QEMU and Linux (KVM).

And since my registered talk title was ‘Latest in Linux Virtualization’, I did leave a few slides on upcoming enhancements in Fedora 16 (mostly concentrating on the QEMU side of things) at the end of the slide deck, to cover those things if I had time left.

As with the previous git talk, I didn’t get around to making the slides and deciding on the flow of the talk till the night before the day of the talk, and that left me with much less sleep than normal.  The video for the talk is available online; I haven’t seen it myself, but if you do, you’ll find I was almost sleep-talking through the session.

To make it interactive as well as keep me awake, I asked the audience to stop me and ask questions any time during the talk.  What was funny about that was the talk was also being live streamed, and the audio signal for the live streaming was carried via one mic and the audio stream for the audience as well as the recorded talk was on a different mic.  So even though the audience questions were taken on the audience mic, I had to repeat the questions for the people who were catching the talk live.

I got some feedback later from a few people: I forgot to introduce myself, and I should have put some performance graphs in the slides, as almost all users would be interested in KVM performance vs other hypervisors.  Both good points.  I hadn’t thought about performance slides earlier; I’ll try to incorporate some such graphs in future presentations.  Interestingly, I also hadn’t thought of introducing myself.  Previously, I was used to someone else introducing me and then me picking up from there.  At the FUDCon, we (the organisers) missed getting speaker bios, and didn’t have volunteers introduce each speaker before their sessions.  So no matter which way I look at it, I take the blame as speaker and organiser for not having done this.

There was some time before my session started and there were a few people in the auditorium (the room where the talk was to be held), so Kashyap thought of playing some Fedora / FOSS / Red Hat videos.  (People generally like the Truth Happens video, and that one was played as well.)  These, and many more, are available on the Red Hat Videos channel on YouTube.  There was also some time between my session and Kashyap’s (to allow for people to move around, take a break, etc.), so we played the F16 release video that Jared gave us.

Overall, I think the talk went quite well (though I may have just dreamed that).  I tried to stay awake for Kashyap’s session on libvirt to answer any questions directed my way; I know I did answer a couple of them, so I must have managed to stay up.

Communication between Guests and Hosts

Guest and Host communication should be a simple affair — the venerable TCP/IP sockets should be the first answer to any remote communication.  However, it’s not so simple once some special virtualisation-related constraints are added to the mix:

  • the guest and host are different machines, managed differently
  • the guest administrator and the host administrator may be different people
  • the guest administrator might inadvertently block IP-based communication channels to the host via firewall rules, rendering the TCP/IP-based communication channels unusable

The last point needs some elaboration: system administrators want to be really conservative in what they “open” to the outside world.  In this sense, the guest and host administrators are actively hostile to each other.  Also, rightly, neither should trust the other, given that a lot of data is now stored within clouds and any leak of that data could prove disastrous to the administrators and their employers.

So what’s really needed is a special communication channel between guests and hosts that is not susceptible to being blocked by either side, and that is a very special-purpose, low-bandwidth channel that doesn’t try to re-implement TCP/IP.  Some other requirements are mentioned on this page.

After several iterations, we settled on one particular implementation: virtio-serial.  The virtio-serial infrastructure rides on top of virtio, a generic para-virtual bus that enables exposing custom devices to guests.  virtio devices are abstracted enough so that guest drivers need not know what kind of bus they’re actually riding on: they are PCI devices on x86 and native devices on s390 under the hood.  What this means is the same guest driver can be used to communicate with a virtio-serial device under x86 as well as s390.  Behind the scenes, the virtio layer, depending on the guest architecture type, works with the host virtio-pci device or virtio-s390 device.

The host device is coded in qemu.  One host virtio-serial device is capable of hosting multiple channels or ports on the same device.  The number of ports that can ride on top of a virtio-serial device is currently arbitrarily limited to 31, but one device can very well support 2^31 ports.  The device is available since upstream qemu release 0.13 as well as in Fedora from release 13 onwards.

The guest driver is written for Linux and Windows guests.  The API exposed includes open, read, write, poll and close calls.  For the Linux guest, ports can be opened in blocking as well as non-blocking modes.  The driver is included upstream from Linux kernel version 2.6.35.  Kernel 2.6.37 will also have asynchronous IO support, i.e., SIGIO will be delivered to interested userspace apps whenever the host-side connection is established or closed, or when a port gets hot-unplugged.
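A rough sketch of how a guest application might use the open / poll / read calls in non-blocking mode follows.  The function names are made up for illustration, and the port path in the comment is a placeholder; nothing here is taken from the driver’s own sample code.

```python
import os
import select


def open_port(path: str) -> int:
    """Open a virtio-serial port node in non-blocking read/write mode."""
    return os.open(path, os.O_RDWR | os.O_NONBLOCK)


def read_when_ready(fd: int, timeout_ms: int = 1000, max_bytes: int = 4096) -> bytes:
    """Wait up to timeout_ms for input on fd; return b"" if nothing arrives."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    for _fd, events in poller.poll(timeout_ms):
        if events & select.POLLIN:
            return os.read(fd, max_bytes)
    return b""


# Hypothetical usage inside a guest (the path is a placeholder):
#   fd = open_port("/dev/virtio-ports/<port name>")
#   os.write(fd, b"hello host\n")
#   reply = read_when_ready(fd, timeout_ms=5000)
#   os.close(fd)
```

The same poll-based helper works on any file descriptor, which makes it easy to try out on a pipe before running it against a real port.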

Using the ports is simple: when using qemu from the command line directly, add:

-chardev socket,path=/tmp/port0,server,nowait,id=port0-char 
-device virtio-serial 
-device virtserialport,id=port1,name=org.fedoraproject.port.0,chardev=port0-char

This creates one device with one port and exposes to the guest the name ‘org.fedoraproject.port.0’.  Guest apps can then open /dev/virtio-ports/org.fedoraproject.port.0 and start communicating with the host.  Host apps can open the /tmp/port0 unix domain socket to communicate with the guest.  Of course, there are other qemu chardev backends that can be used other than unix domain sockets.  There also is an in-qemu API that can be used.

More invocation options and examples are given in the invocation and how to test sections.
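On the host side, talking to the guest through the unix domain socket can be sketched as below.  It assumes qemu was started with the chardev options shown above, so that /tmp/port0 exists; the helper names connect_port and send_message are hypothetical.

```python
import socket


def connect_port(sock_path: str) -> socket.socket:
    """Connect to the unix domain socket that backs a virtio-serial port."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(sock_path)
    return sock


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send the whole payload to the guest end of the port."""
    sock.sendall(payload)


# Hypothetical usage on the host, with a guest running:
#   port = connect_port("/tmp/port0")
#   send_message(port, b"hello guest\n")
#   print(port.recv(4096))
#   port.close()
```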

There is sample C code for the guest as well as sample python code from the test suites.  The original test suite, written to verify the functionality of the user-kernel interface, will in the near future be moved to autotest, enabling faster addition of more tests, including tests that check not just for correctness, but also for regressions and bugs.

virtio-serial is already in use by the Matahari, Spice, libguestfs and Anaconda projects.  I’ll briefly mention how Anaconda is going to use virtio-serial: starting with Fedora 14, guest installs of Fedora will automatically send Anaconda logs to the host if a virtio-serial port with the name of ‘org.fedoraproject.anaconda.log.0’ is found.  virt-install is modified to create such a virtio-serial port.  This means debugging early anaconda output will be easier with the logs available on the host (without worrying about guest file system corruption during install or network drivers not being available before a crash).

Further use: There are many more uses of virtio-serial, which should be pretty easy to code:

  • shutting down or suspending VMs when a host is shut down
  • clipboard copy/paste between hosts and guests (this is in progress by the Spice team)
  • lock a desktop session in the guest when a vnc/spice connection is closed
  • fetch cpu/memory/power usage rates at regular intervals for monitoring

Virtualisation (on Fedora)

A few volunteers from India associated with the Fedora Project wrote articles for Linux For You’s March 2010 Virtualisation Special. Those articles, and a few others, are put up on the Fedora wiki space at Magazine Articles on Virtualization. Thanks to LFY for letting us upload the pdfs!

We’re always looking for more content, in the form of how-tos, articles, experiences, tips, etc., so feel free to upload content to the wiki or blog about it.

We also have contact with some magazine publishers so if you’re interested in writing for online or print magazines, let the marketing folks know!

Comparison of File Systems And Speeding Up Applications

Update: I’ve done a newer article on this subject at http://log.amitshah.net/2009/04/re-comparing-file-systems.html that removes some of the deficiencies in the tests mentioned here and has newer, more accurate results along with some new file systems.

How should one allocate disk space for a file for later writing? ftruncate() (or lseek() followed by write()) creates a sparse file, which is not what is needed. A traditional way is to write zeroes to the file till it reaches the desired file size. Doing things this way has a few drawbacks:

  • Slow, as small chunks are written one at a time by the write() syscall
  • Lots of fragmentation

posix_fallocate() is a library call that handles the chunking of writes in one batch, so the application does not have to code its own block-by-block writes. But this is still done in userspace.

Linux 2.6.23 introduced the fallocate() system call. The allocation is then moved to kernel space and hence is faster. New file systems that support extents make this call very fast indeed: a single extent only has to be marked as allocated on disk (whereas traditionally, individual blocks were marked as ‘used’). Fragmentation too is reduced, as file systems will now keep track of extents instead of smaller blocks.

posix_fallocate() will internally use fallocate() if the syscall exists in the running kernel.
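For illustration, the same call is reachable from Python through os.posix_fallocate().  This is a minimal sketch with a made-up helper name, not the libvirt patch discussed below.

```python
import os


def preallocate(path: str, size: int) -> None:
    """Create `path` if needed and reserve `size` bytes of disk space for it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Reserve the byte range [0, size); the C library decides whether
        # the fallocate() syscall or a fallback is used underneath.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
```

After the call returns successfully, the file is at least `size` bytes long and later writes into that range should not fail for lack of disk space.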

So I thought it would be a good idea to make libvirt use posix_fallocate() so that systems with the newer file systems will directly benefit when allocating disk space for virtual machines. I wasn’t sure of what method libvirt already used to allocate the space. I found out that it allocated blocks in 4KiB sized chunks.

So I sent a patch to the libvir-list to convert to posix_fallocate() and danpb asked me about what the benefits of this approach were and also asked about using alternative approaches if not writing in 4K chunks. I didn’t have any data to back up my claims of “this approach will be fast and will result in less fragmentation, which is desirable”. So I set out to do some benchmarking. To do that, though, I first had to make some empty disk space to create a few file systems of sufficiently large sizes. Hunting for a test machine with spare disk space proved futile, so I went about resizing my ext3 partition and creating about 15 GB of free disk space. I intended to test ext3, ext4, xfs and btrfs. I could use my existing ext3 partition for the testing, but that would not give honest results about the fragmentation (an existing file system may already be fragmented, so big new files would surely be fragmented too, whereas on a fresh fs, I won’t run into that risk).

Though even creating separate partitions on rotating storage and testing file system performance won’t give perfectly honest results, I figured if the percentage difference in the results was quite high, that won’t matter. I grabbed the latest Linus tree and the latest dev trees for the userspace utilities for all the file systems and created about 5GB partitions for each fs.

I then wrote a program that created a file, allocated disk space and closed it, and calculated the time taken in doing so. This was done multiple times for different allocation methods: posix_fallocate(), mmap() + memset(), and writing zeroes in 4096-byte chunks and 8192-byte chunks.

So I had four methods of allocating files and 5GB partitions to test them on. I decided to check the performance by creating a 1GiB file with each allocation method.
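The shape of such a timing harness can be sketched as follows.  This is not the original program (that one is linked below); it is a Python approximation with invented names, and the mmap() variant copies zeroes into the mapping in 1MiB steps where the C version would call memset().

```python
import mmap
import os
import time


def alloc_fallocate(fd: int, size: int) -> None:
    os.posix_fallocate(fd, 0, size)


def alloc_chunks(chunk: int):
    """Return an allocator that writes zeroes `chunk` bytes at a time."""
    def run(fd: int, size: int) -> None:
        buf = bytes(chunk)
        written = 0
        while written < size:
            written += os.write(fd, buf[: min(chunk, size - written)])
    return run


def alloc_mmap(fd: int, size: int, step: int = 1 << 20) -> None:
    """Grow the file, map it, and fill the mapping with zeroes."""
    os.ftruncate(fd, size)
    zeroes = bytes(step)
    with mmap.mmap(fd, size) as m:
        for off in range(0, size, step):
            n = min(step, size - off)
            m[off:off + n] = zeroes[:n]
        m.flush()


METHODS = {
    "posix_fallocate": alloc_fallocate,
    "mmap": alloc_mmap,
    "write-4096": alloc_chunks(4096),
    "write-8192": alloc_chunks(8192),
}


def time_method(method, path: str, size: int) -> float:
    """Create the file, allocate `size` bytes, close it; return seconds taken."""
    start = time.perf_counter()
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        method(fd, size)
    finally:
        os.close(fd)
    return time.perf_counter() - start


def run_benchmark(directory: str, size: int) -> dict:
    """Time every method once, each on its own file in `directory`."""
    return {
        name: time_method(method, os.path.join(directory, name + ".img"), size)
        for name, method in METHODS.items()
    }


# Hypothetical usage: run_benchmark("/mnt/test-ext4", 1 << 30) for a 1GiB file.
```

Timings from a sketch like this are only indicative; a single run per method is easily skewed by caching, so repeating the runs, as was done for the numbers here, matters.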

The program is here. The results, here. The git tree is here.

I was quite surprised to see poor performance for posix_fallocate() on ext4. On digging a bit, I realised mkfs.ext4 hadn’t created the file system with extents enabled. I reformatted the partition, but that data was valuable to have as well: it shows how much better a file system is with extent support.

Graphically, it looks like this:

Notice that ext4, xfs and btrfs take only a few microseconds to complete posix_fallocate().

The number of fragments created:

btrfs doesn’t yet have the ioctl implemented for calculating fragments.

The results are very impressive and the patches to libvirt were finalised pretty quickly. They’re now in the development branch of libvirt. Coming soon to a virtual machine management application near you.

Use of posix_fallocate() will be beneficial to programs that know in advance the size of the file being created, like torrent clients, ftp clients, browsers, download managers, etc. It won’t help with speed, as data is only written when it’s downloaded, but it will help keep fragmentation as low as possible.