Communication between Guests and Hosts

Guest and Host communication should be a simple affair — the venerable TCP/IP sockets should be the first answer to any remote communication.  However, it’s not so simple once some special virtualisation-related constraints are added to the mix:

  • the guest and host are different machines, managed differently
  • the guest administrator and the host administrator may be different people
  • the guest administrator might inadvertently block IP-based communication channels to the host via firewall rules, rendering the TCP/IP-based communication channels unusable

The last point needs some elaboration: system administrators want to be really conservative in what they “open” to the outside world.  In this sense, the guest and host administrators are actively hostile to each other.  Also, rightly, neither should trust the other, given that a lot of the data stored in operating systems is now stored within clouds and any leak of the data could prove disastrous to the administrators and their employers.

So what’s really needed is a special communication channel between guests and hosts that is not susceptible to being blocked by either guests or hosts, and that is a very special-purpose, low-bandwidth channel that doesn’t try to re-implement TCP/IP.  Some other requirements are mentioned on this page.

After several iterations, we settled on one particular implementation: virtio-serial.  The virtio-serial infrastructure rides on top of virtio, a generic para-virtual bus that enables exposing custom devices to guests.  virtio devices are abstracted enough so that guest drivers need not know what kind of bus they’re actually riding on: they are PCI devices on x86 and native devices on s390 under the hood.  What this means is the same guest driver can be used to communicate with a virtio-serial device under x86 as well as s390.  Behind the scenes, the virtio layer, depending on the guest architecture type, works with the host virtio-pci device or virtio-s390 device.

The host device is coded in qemu.  One host virtio-serial device is capable of hosting multiple channels or ports on the same device.  The number of ports that can ride on top of a virtio-serial device is currently arbitrarily limited to 31, but one device can very well support 2^31 ports.  The device has been available since upstream qemu release 0.13 as well as in Fedora from release 13 onwards.

The guest driver is written for Linux and Windows guests.  The API exposed includes open, read, write, poll and close calls.  For the Linux guest, ports can be opened in blocking as well as non-blocking modes.  The driver is included upstream from Linux kernel version 2.6.35.  Kernel 2.6.37 will also have asynchronous IO support, i.e., SIGIO will be delivered to interested userspace apps whenever the host-side connection is established or closed, or when a port gets hot-unplugged.

Using the ports is simple: when using qemu from the command line directly, add:

-chardev socket,path=/tmp/port0,server,nowait,id=port0-char
-device virtio-serial
-device virtserialport,id=port1,name=org.fedoraproject.port.0,chardev=port0-char

This creates one device with one port and exposes the name ‘org.fedoraproject.port.0’ to the guest.  Guest apps can then open /dev/virtio-ports/org.fedoraproject.port.0 and start communicating with the host.  Host apps can open the /tmp/port0 unix domain socket to communicate with the guest.  Of course, there are other qemu chardev backends that can be used instead of unix domain sockets.  There is also an in-qemu API that can be used.
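A minimal sketch of the host side, assuming qemu was started with the chardev options above so that it is listening on the /tmp/port0 unix domain socket. The function name and the one-message, one-reply exchange are made up for illustration; a real host application would define its own protocol on top of the byte stream.

```python
import socket


def talk_to_guest(sock_path, message, reply_size=4096, timeout=5.0):
    """Send `message` to the guest over the chardev socket and return one reply chunk.

    sock_path is the unix domain socket qemu created for the port
    (/tmp/port0 in the command line above). The request/reply pattern
    is an assumption of this sketch, not something virtio-serial imposes.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)   # don't hang forever if the guest never answers
        sock.connect(sock_path)    # qemu is the server side ("server,nowait")
        sock.sendall(message)
        return sock.recv(reply_size)


# Example use on the host (needs a running guest that answers on the port):
#   reply = talk_to_guest("/tmp/port0", b"hello guest\n")
```

Since qemu is the server end of the socket here, the host application only ever connects; it never has to bind or listen itself.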

More invocation options and examples are given in the invocation and how to test sections. 

There is sample C code for the guest as well as sample python code from the test suites.  The original test suite, written to verify the functionality of the user-kernel interface, will in the near future be moved to autotest, enabling faster addition of more tests, including tests that check not just for correctness but also for regressions and bugs.
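In the same spirit as the sample code mentioned above (but not taken from it), here is a small guest-side sketch that uses the open, poll, read and close calls in non-blocking mode. The function name, buffer size and timeout are assumptions for illustration; the port path would be something like /dev/virtio-ports/org.fedoraproject.port.0.

```python
import os
import select


def read_port(port_path, max_bytes=4096, timeout_ms=1000):
    """Wait up to timeout_ms for data on a port; return the bytes, or None on timeout."""
    # Non-blocking open: read() will not stall if there is nothing to read.
    fd = os.open(port_path, os.O_RDWR | os.O_NONBLOCK)
    try:
        poller = select.poll()
        poller.register(fd, select.POLLIN)
        events = poller.poll(timeout_ms)
        # Only read when the port actually reports readable data.
        if not any(ev & select.POLLIN for _, ev in events):
            return None
        return os.read(fd, max_bytes)
    finally:
        os.close(fd)


# Example use inside the guest:
#   data = read_port("/dev/virtio-ports/org.fedoraproject.port.0")
```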

virtio-serial is already in use by the Matahari, Spice, libguestfs and Anaconda projects.  I’ll briefly mention how Anaconda is going to use virtio-serial: starting with Fedora 14, guest installs of Fedora will automatically send Anaconda logs to the host if a virtio-serial port with the name ‘org.fedoraproject.anaconda.log.0’ is found.  virt-install is modified to create such a virtio-serial port.  This means debugging early anaconda output will be easier with the logs available on the host (without worrying about guest file system corruption during install or network drivers not being available before a crash).
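The “send logs only if the port is found” idea can be sketched roughly as below. This is not Anaconda’s actual code; the function name is hypothetical, and the full device path is an assumption that combines the port name above with the /dev/virtio-ports/ directory used for named ports in the guest.

```python
import os

# Assumed guest-side path: /dev/virtio-ports/ plus the port name from the text.
ANACONDA_PORT = "/dev/virtio-ports/org.fedoraproject.anaconda.log.0"


def forward_log(lines, port_path=ANACONDA_PORT):
    """Write log lines to the port if it exists; return True if they were sent."""
    if not os.path.exists(port_path):
        return False  # no such port: the host did not set one up, so do nothing
    with open(port_path, "wb") as port:
        for line in lines:
            port.write(line.encode("utf-8") + b"\n")
    return True


# Example use inside the installer environment:
#   forward_log(["anaconda: starting install"])
```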

Further use: There are many more uses of virtio-serial, which should be pretty easy to code:

  • shutting down or suspending VMs when a host is shut down
  • clipboard copy/paste between hosts and guests (this is in progress by the Spice team)
  • lock a desktop session in the guest when a vnc/spice connection is closed
  • fetch cpu/memory/power usage rates at regular intervals for monitoring

Upgrading from Fedora 11 to Fedora 13

Having already installed (what would be) F13 on my work and personal laptops the traditional way — by installing a fresh copy (since I wanted to modify the partition layout), I tried an upgrade on my desktop.

My desktop was running Fedora 11 and I moved it to Fedora 13. I wanted to test how the upgrade functionality works: whether it runs into any errors (esp. since it’s from 11 -> 13, skipping 12 entirely), whether the experience is smooth, etc.

I started out by downloading the RC compose from http://alt.fedoraproject.org/. Since all my installs are for the x86-64 architecture, I downloaded the DVD.iso. I then loopback-mounted the DVD on my laptop:

# mount -o loop /home/amit/Downloads/Fedora-13-x86_64-DVD.iso /mnt/F13

I then exported the contents of the mount via NFS: edit /etc/exports and add the following line:

/mnt/F13 172.31.10.*

This ensures the mount is only available to users on my local network.

Then, ensure the nfs services are running:

# service nfs start
# service nfslock start

On my desktop which was to be upgraded, I mounted the NFS export:

# mount -t nfs 172.31.1.12:/mnt/F13 /mnt

And copied the kernel and initrd images to boot into:

# cp /mnt/isolinux/vmlinuz /boot
# cp /mnt/isolinux/initrd.img /boot

Then update the grub config with this new kernel that we’ll boot into for the upgrade. Edit /boot/grub/grub.conf and add:

title Fedora 13 install
    root (hd0,0)
    kernel /vmlinuz
    initrd /initrd.img

Once that’s done, reboot and select the entry we just put in the grub.conf file. The install process starts and asks where the files are located for the install. Select NFS and provide the details: Server 172.31.1.12 and directory /mnt/F13.

The first surprise for me was to see the updated graphics for the Anaconda installer. They got changed in the time since I installed F13 (beta) on my laptops. The new artwork certainly looks very good and smooth. More white, less blue is a departure from the usual Fedora artwork, but it does look nice.

I then proceeded to select ‘upgrade’; it found my old F11 install and everything after that ‘just worked’. I was skeptical about this while it was running: I had some rpmfusion.org repositories enabled and some packages installed from those repositories. I was wondering whether those packages would be upgraded as well, left in their current state (which could create dependency problems), or completely removed. I had to wait for the install to finish, which took a while. The post-install process took more than half an hour, and when it was done, I selected ‘Reboot’. Half-expecting something to have broken or to not work, I logged in, and voila, I was presented with the shiny new GNOME 2.30 desktop. The temporary install kernel that I had put in as the default boot kernel was also removed. A small thing in itself, but great for usability.

Everything looked and felt right, no sign of breakage, no error messages, no warnings, just some good seamless upgrade.

I can’t say I really expected this. Coming from a die-hard Debian fan: distribution upgrades were something that was the forte of just Debian. Until now. The Fedora developers have done a really good job of making this process extremely easy to use and extremely reliable. Kudos to them!

While the Fedora 13 release has been pushed back a week for an install-over-NFS bug, it needs a certain combination of misfortunes to trigger, and luckily, I didn’t hit that bug. However, when trying the F13 beta install on my laptop, I had hit a couple of Anaconda bugs. One of them is now resolved for F14 (a crash when upgrading without a bootloader configuration). The other one (no UI refresh if I switch between virtual consoles until a package finishes installing, which is really felt while installing over a slow network link) is a known problem with the design of Anaconda, and hopefully the devs get to it.

Overall, a really nice experience and I can now comfortably say Fedora has really rocketed ahead (all puns intended) since the old times when even installing packages used to be a nightmare. This is good progress indeed, and I’m glad to note that the future of the Linux desktop is in very good hands.

Cheers to the entire team!

Virtualisation (on Fedora)

A few volunteers from India associated with the Fedora Project wrote articles for Linux For You’s March 2010 Virtualisation Special. Those articles, and a few others, are put up on the Fedora wiki space at Magazine Articles on Virtualization. Thanks to LFY for letting us upload the pdfs!

We’re always looking for more content, in the form of how-tos, articles, experiences, tips, etc., so feel free to upload content to the wiki or blog about it.

We also have contact with some magazine publishers so if you’re interested in writing for online or print magazines, let the marketing folks know!

Re-comparing file systems

The previous attempt at comparing file systems based on the ability to allocate large files and zero them met with some interesting feedback. I was asked why I didn’t add reiserfs to the tests and also if I could test with larger files.

The test itself had a few problems, making the results unfair:

- I had different partitions for different file systems. So the hard drive geometry and seek times would play a part in the test results

- One can never be sure that the data that was requested to be written to the hard disk was actually written unless one unmounts the partition

- Other data that was in the cache before starting the test could be in the process of being written out to the disk and that could also interfere with the results

All these have been addressed in the newer results.

There are a few more goodies too:
- gnuplot script to ease the charting of data
- A script to automate testing on various file systems
- A big bug fixed that affected the results for the chunk-writing cases (4k and 8k): this existed right from the time I first wrote the test and was the result of using the wrong parameter for calculating chunk size. This was spotted by Mike Galbraith on lkml.

Browse the sources here

or git-clone them by

git clone git://git.fedorapeople.org/~amitshah/alloc-perf.git

So in addition to ext3, ext4, xfs and btrfs, I’ve added ext2 and reiserfs, and expanded the ext3 test to cover three journalling modes: ordered, writeback and guarded. guarded is the new mode that’s being proposed (it’s not yet in the Linux kernel). It’s meant to have the speed of writeback and the consistency of ordered.

I’ve also run these tests twice, once with a user logged in and a full desktop on. This is to measure the times that a user will see when actually working on the system and some app tries allocating files.

I also ran the tests in single-user mode so that there are no background services running and the effect of other processes on the tests is not seen. This is done to see the timing. The fragmentation will of course remain more or less the same; that’s not a property of system load.

It’s also important to note that I created this test suite to mainly find out how fragmented the files are when allocating them using different methods on different file systems. The comparison of performance is a side-effect. This test is also not useful for any kind of stress-testing file systems. There are other suites that do a good job of it.

That said, the results suggest that btrfs, xfs and ext4 are the best when it comes to keeping fragments at the lowest. Reiserfs really looks bad in these tests. Time-wise, the file systems that support the fallocate() syscall perform the best, using almost no time in allocating files of any size. ext4, xfs and btrfs support this syscall.

On to the tests. I created a 4GiB file for each test. The tests are: posix_fallocate(), mmap+memset, writing 4k-sized chunks and writing 8k-sized chunks. These tests are repeated inside the same partition sized 20GiB. The script reformats the partition for the appropriate fs before the run.

The results:

The first 4 columns show the times (in seconds) and the last four columns show the fragments resulting from the corresponding test.

The results, in text form, are:

# 4GiB file
# Desktop on
#               -------------- time (seconds) ---------------  ----------------- fragments -----------------
filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192  posix-fallocate  mmap  chunk-4096  chunk-8192
ext2                         73    96          77          80               34    39          39          36
ext3-writeback               89   104          89          93               34    36          37          37
ext3-ordered                 87    98          89          92               34    35          37          36
ext3-guarded                 89   102          90          93               34    35          36          36
ext4                          0    84          74          79                1    10           9           7
xfs                           0    81          75          81                1     2           2           2
reiserfs                     85    86          89          93              938    35         953         956
btrfs                         0    85          79          82                1     1           1           1

# 4GiB file
# Single
#               -------------- time (seconds) ---------------  ----------------- fragments -----------------
filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192  posix-fallocate  mmap  chunk-4096  chunk-8192
ext2                         71    85          73          77               33    37          35          36
ext3-writeback               84    91          86          90               34    35          37          36
ext3-ordered                 85    85          87          91               34    34          37          36
ext3-guarded                 84    85          86          90               34    34          38          37
ext4                          0    74          72          76                1    10           9           7
xfs                           0    72          73          77                1     2           2           2
reiserfs                     83    75          86          91              938    35         953         956
btrfs                         0    74          76          80                1     1           1           1

Fig. 1. Number of fragments. reiserfs performs really badly here.

Fig. 2. The same results, but without reiserfs.

Fig. 3. Time results, with desktop on.

Fig. 4. Time results, without desktop, in single-user mode.

So in conclusion, as noted above, btrfs, xfs and ext4 are the best when it comes to keeping fragments at the lowest. Reiserfs really looks bad in these tests. Time-wise, the file systems that support the fallocate() syscall perform the best, using almost no time in allocating files of any size. ext4, xfs and btrfs support this syscall.

Comparison of File Systems And Speeding Up Applications

Update: I’ve done a newer article on this subject at http://log.amitshah.net/2009/04/re-comparing-file-systems.html that removes some of the deficiencies in the tests mentioned here and has newer, more accurate results along with some new file systems.

How should one allocate disk space for a file for later writing? ftruncate() (or lseek() followed by write()) creates sparse files, which is not what is needed. A traditional way is to write zeroes to the file till it reaches the desired file size. Doing things this way has a few drawbacks:

  • Slow, as small chunks are written one at a time by the write() syscall
  • Lots of fragmentation

posix_fallocate() is a library call that handles the chunking of writes in one batch; the application need not code its own block-by-block writes. But this is still done in userspace.

Linux 2.6.23 introduced the fallocate() system call. The allocation is then moved to kernel space and hence is faster. New file systems that support extents make this call very fast indeed: only a single extent has to be marked as allocated on disk (whereas traditionally each block had to be marked as ‘used’). Fragmentation is reduced too, as file systems now keep track of extents instead of smaller blocks.

posix_fallocate() will internally use fallocate() if the syscall exists in the running kernel.
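To make the difference concrete, here is a small sketch using Python’s os module, which exposes posix_fallocate() and ftruncate() directly. The helper names are made up for this example, and the 512-byte unit for st_blocks is the Linux convention.

```python
import os


def preallocate(path, size):
    """Create `path` and reserve `size` bytes of real disk space for it."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)  # offset 0, length `size`
    finally:
        os.close(fd)


def make_sparse(path, size):
    """Create `path` with a length of `size` bytes but without reserving space."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.ftruncate(fd, size)  # only sets the length; blocks are allocated on first write
    finally:
        os.close(fd)


def allocated_bytes(path):
    """Space actually allocated on disk (st_blocks counts 512-byte units on Linux)."""
    return os.stat(path).st_blocks * 512


# Example use:
#   preallocate("/tmp/full.img", 64 * 1024 * 1024)
#   make_sparse("/tmp/sparse.img", 64 * 1024 * 1024)
#   print(allocated_bytes("/tmp/full.img"), allocated_bytes("/tmp/sparse.img"))
```

Both files report the same size, but only the preallocated one is expected to have its space reserved up front; the exact block counts depend on the file system.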

So I thought it would be a good idea to make libvirt use posix_fallocate() so that systems with the newer file systems will directly benefit when allocating disk space for virtual machines. I wasn’t sure of what method libvirt already used to allocate the space. I found out that it allocated blocks in 4KiB sized chunks.

So I sent a patch to the libvir-list to convert to posix_fallocate() and danpb asked me about what the benefits of this approach were and also asked about using alternative approaches if not writing in 4K chunks. I didn’t have any data to back up my claims of “this approach will be fast and will result in less fragmentation, which is desirable”. So I set out to do some benchmarking. To do that, though, I first had to make some empty disk space to create a few file systems of sufficiently large sizes. Hunting for a test machine with spare disk space proved futile, so I went about resizing my ext3 partition and creating about 15 GB of free disk space. I intended to test ext3, ext4, xfs and btrfs. I could use my existing ext3 partition for the testing, but that would not give honest results about the fragmentation (existing file systems may already be fragmented, causing big new files surely to be fragmented whereas on a fresh fs, I won’t run into that risk).

Though even creating separate partitions on rotating storage and testing file system performance won’t give perfectly honest results, I figured if the percentage difference in the results was quite high, that won’t matter. I grabbed the latest Linus tree and the latest dev trees for the userspace utilities for all the file systems and created about 5GB partitions for each fs.

I then wrote a program that created a file, allocated disk space and closed it, and calculated the time taken in doing so. This was done multiple times for different allocation methods: posix_fallocate(), mmap() + memset() and writing zeroes in 4096 byte chunks and 8192 byte chunks.
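The actual program is linked below; what follows is only a rough Python sketch of the same idea, not the original code. The function names are invented for this example, and the fsync() before closing is an added assumption so that the timing includes getting the data out of the page cache.

```python
import mmap
import os
import time


def alloc_chunks(fd, size, chunk_size=4096):
    """Allocate by writing zeroes in chunk_size pieces (the 4k/8k methods)."""
    zeros = bytes(chunk_size)
    remaining = size
    while remaining > 0:
        n = min(chunk_size, remaining)
        remaining -= os.write(fd, zeros[:n])  # os.write may write fewer than n bytes


def alloc_mmap(fd, size):
    """Allocate by extending the file, mapping it and zero-filling the mapping."""
    os.ftruncate(fd, size)
    step = 1 << 20  # zero 1 MiB at a time to keep memory use small
    with mmap.mmap(fd, size) as mapping:
        for off in range(0, size, step):
            end = min(off + step, size)
            mapping[off:end] = bytes(end - off)
        mapping.flush()


def alloc_fallocate(fd, size):
    """Allocate with posix_fallocate()."""
    os.posix_fallocate(fd, 0, size)


def timed_alloc(path, method, size, **kwargs):
    """Create path, allocate size bytes with `method`, close it; return seconds taken."""
    start = time.perf_counter()
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o644)
    try:
        method(fd, size, **kwargs)
        os.fsync(fd)  # assumption of this sketch: include write-out in the timing
    finally:
        os.close(fd)
    return time.perf_counter() - start


# Example use:
#   for name, method, kw in [("fallocate", alloc_fallocate, {}),
#                            ("mmap", alloc_mmap, {}),
#                            ("chunk-4096", alloc_chunks, {"chunk_size": 4096}),
#                            ("chunk-8192", alloc_chunks, {"chunk_size": 8192})]:
#       print(name, timed_alloc("/tmp/alloc-test", method, 1 << 30, **kw))
```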

So I had four methods of allocating files and 5GB partitions. I decided to check the performance by creating a 1GiB file with each allocation method.

The program is here. The results, here. The git tree is here.

I was quite surprised seeing poor performance for posix_fallocate() on ext4. On digging a bit, I realised mkfs.ext4 didn’t create it with extents enabled. I reformatted the partition, but that data was valuable to have as well. Shows how much a file system is better with extents support.

Graphically, it looks like this:
Notice that ext4, xfs and btrfs take only a few microseconds to complete posix_fallocate().

The number of fragments created:

btrfs doesn’t yet have the ioctl implemented for calculating fragments.

The results are very impressive and the final patches to libvirt were finalised pretty quickly. They’re now in the development branch of libvirt. Coming soon to a virtual machine management application near you.

Use of posix_fallocate() will be beneficial to programs that know in advance the size of the file being created, like torrent clients, ftp clients, browsers, download managers, etc. It won’t be beneficial in the speed sense, as data is only written when it’s downloaded, but it’s beneficial in the as-less-fragmentation-as-possible sense.

Virtualisation: The KVM Way

I was invited to the Convergence 2008 conference on Virtualization at IIT Bombay. I thought it would be fun to have competitors looking at you while you tell everyone why your hypervisor is the best. It was a slight disappointment, though, as Citrix (Xen) and Microsoft were absent. VMware was present, however, as were quite a lot of people who have used Xen.

The problem with KVM is people know about it, but not much information is available by way of whitepapers or nice graphs showing why KVM is better than the others. So not many try it. That’s really sad, since KVM is so simple to learn, use and adapt that it should ideally be the first thing that comes to people’s minds when they think of a virtual machine monitor. And since KVM believes in the UNIX philosophy of ‘do one thing and do it right’, we automatically inherit several features from the environment that already exists, for which the competition has to spend years and a lot of dollars to get support. For example, guest memory swapping, NUMA support, hibernate/suspend/resume of the host machine with VMs on and so on.

The slide deck on my KVM presentation is available for those interested. People searching for more KVM information: see the KVM Forums page. As of now, there is quite a lot of information present in the KVM Forum 2007 page. The 2008 KVM Forum is happening in June, watch the kvm-devel mailing list for information on what to expect and how to get there.

Foss.in

Foss.in/2007 is over and I’m back home. The slide deck on my kvm talk is now available.

This was the first time I went to foss.in and I really liked the experience. More than the talks, it’s the corridor discussions and meeting up with people that’s really the most interesting part. The place was full of people who have contributed immensely to the software I use every day, and I couldn’t let go of such an opportunity to go and thank them personally. I definitely missed thanking everyone, so I think I’ll go there next year to make up for that. Danese Cooper gets my vote for the best talk: Trekking with White Elephants. It’s a great way to learn how to go about contributing to open source, drawing on years of experience in getting the management knowledgeable about free software. I’ve learnt these lessons myself through all these years and I’m sure young people out there will benefit a lot from these tips. (I will update the link once I get access to the final slides)

My talk on KVM turned into a demo session for KVM and explaining merits of the approach as opposed to Xen, as a few people in the audience had already used Xen and they wanted to know why KVM is different or better. Too bad, since I was hoping there would be contributors who would have liked to know how KVM actually works.

I also wasn’t too happy with the scheduling of the talks: there was a gcc talk in parallel with a kernel talk and a filesystem / distributed computing talk in parallel with another kernel talk. To make matters worse, Thomas Gleixner’s talk on the RT patches was added later in the same time slot as I was to speak.

Rusty has to be the most entertaining kernel hacker; in his inimitable style, he provided a grand finale to the event meant to encourage contributors to the FOSS community. He got me up on stage along with James Morris to speak about how we got involved with FOSS and the kernel.

The Linux kernel folks at IBM LTC Bangalore swore they wouldn’t let me go away easily and asked me to visit their office where people would ask me all sorts of questions on KVM. That was a very nice session that I had; they’re mostly interested in the power management and migration issues on kvm, and that got me pretty kicked, as I’m extremely interested in the power management and Green issues of late. Though I couldn’t answer most of the questions related to power management, I’m sure the kvm-devel list can help.

Moreover, quite a few people came up to me and asked about my work on the kernel and kvm and that was quite encouraging.

I’m sure I also caused inconvenience to people at the sponsor stalls asking them in what ways their company contributed to foss software. Most of them were there just to attract talent. I’m hoping the FOSS enthusiasts don’t stop contributing once they’re in those big companies.

Making Wireless Work from the Command Line

There’s an amazingly horrible piece of software that manages network connections on KDE: knetworkmanager. And with my recent upgrade to Kubuntu Gutsy from Feisty, it has been behaving as wildly as only a chimp could.

So I decided to simply not use it. There are a few workarounds, like deleting all entries from /etc/network/interfaces (on any Debian-based system), except the one for the localhost and then restarting knetworkmanager. It worked for me till I suspended and resumed.

So what I do instead is this (make sure you have entries in /etc/network/interfaces for eth0, eth1, etc. from the backup file in case you tried the workaround).

[If eth1 is your wireless interface]

$ sudo iwlist eth1 scan
< shows a listing of all available wireless networks found >

$ sudo iwconfig eth1 essid "name"

[where name is the name of the wireless network you want to connect to]

Another step to get a DHCP IP assigned might be needed.

$ sudo /etc/init.d/networking restart

This should also be possible by restarting just the DHCP client, but I’m not sure which one yet.

Done!

Make Your Computer / Laptop Use Less Power

Making your computer use less power is necessary but difficult to do… unless you know where to look and what to do. You want to use less power to conserve the battery life of your laptop; you want to use less power on your servers to save on electricity and cooling costs. And you want to use less power on your desktops to save electricity, all resulting in less pollution, more savings and, in the case of laptops, more productive time on battery.

http://www.lesswatts.org/ is the website dedicated to help you with configuring your system to use less power. The developers are also examining and fixing the code of the worst-offending applications, so configuring your system for using less power (when it really should by default) becomes less and less necessary.

For now, you can check instructions at the websites available as to what to do with your kernel, CPU, hard disk and GPU (yeah, tune that down as well; we don’t do graphics-heavy work to warrant running the GPU at full clock always), among other things. Also, one of the biggest consumers of power, the LCD backlight, can be tuned without any configuration at all. You just have to adjust it to the lowest level that is still comfortably visible.

However, what’s more interesting to me is the small things that we can do, like stopping cursors from blinking, since X needs to wake up each time a cursor has to blink. There’s a really nice compilation for the popular applications on one page. Go do it, it’ll help the CPU remain in an idle state (and hence consume less electricity) longer.

Quote by Nat Friedman

I recently got to meet Nat Friedman of Gnome fame. Thanks Moshe for the opportunity.

While Moshe and Nat did most of the talking (about entrepreneurs and entrepreneurship), I just kept listening and learning. I had an interesting story to mention, though. A couple of ex-colleagues from Codito had witnessed a live coding session on-stage by Miguel and Nat, and had come away impressed by their style.

I got to hear the story behind such sessions. Nat had this to say (not the exact words, I’ve sadly forgotten them):

“What I’ve come to know is that in India, you can actually get a degree in computers without actually doing a lot of coding”

I just kept laughing as this was the reality! Back in college, there were a few who really liked the subjects and who liked to code. The others coded just to get through the exams (and in cases, managed to get through without even coding).

Nat says he’s never had as many people coming to him after a talk as in India. They don’t ask questions during the talk, but after the talk. And the kind of questions they ask prompted him to ask “So how many lines of code have you written?” The reply is usually in the range of 3000. I’m not sure if his reaction was as animated in front of the crowd, but he said “that’s the kind of number you should be coding every day if you’ve got to be decent coders!”

So true. It’s a pity, since we have a lot of people entering this industry… a lot of youngsters being churned out by colleges. What’s painful is that everyone is misguided. Some take up the course just because “there’s more money”. Some are here because there are colleges mushrooming everywhere, which can accommodate many such people. It’s not for passion that many join the course and the industry.

Those few who, in spite of the extremely tolerant and greedy industry that we’ve managed to create here, can’t make it to the industry post-college (for obvious reasons), become lecturers at these colleges. Doesn’t help students at all. Of the talented lot, a few get disillusioned, a few don’t get proper guidance… and that marks the sad start to an already finished career.

I particularly remember the nice anecdotes we used to have during lectures and practicals. A few gems:

- While doing Kirchhoff’s Voltage and Current Laws: two currents flow through a resistor in the same direction. They’re supposed to add up. The lecturer says it’s I1 – I2. We’re of course in the mood to have fun, so one guy points out the mistake and another one says there’s no mistake. So we pass time debating this. At the end of the hour, the lecturer says “According to my logic, it’s right. You go home and check with your books.”

- The same guy, in a lab session. My unfortunate friend’s allotted power supply doesn’t work. He asks for a replacement. This lecturer says “why do you need a different power supply? Use this multi-meter, set it to 5V DC and use it.” Atul couldn’t control his laughter. The lecturer took offence and that might’ve reflected on our performance. (Hope you don’t get into such trouble at CMU! ;-) )

Back to Nat: I was very impressed by him. Though I’m not a Gnome-fan, I still like all the work they’re doing and from my interaction, I’m sure he’s taking the Linux desktop to the masses.