Comparison of File Systems And Speeding Up Applications

Update: I’ve done a newer article on this subject that removes some of the deficiencies in the tests mentioned here and has newer, more accurate results along with some new file systems.

How should one allocate disk space for a file for later writing? ftruncate() (or lseek() followed by write()) creates a sparse file, which is not what is needed. The traditional way is to write zeroes to the file until it reaches the desired size. Doing things this way has a few drawbacks:

  • Slow, as small chunks are written one at a time by the write() syscall
  • Lots of fragmentation
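To make the traditional approach concrete, here is a minimal sketch of it in C. The function name and parameters are my own for illustration; this is not the code libvirt used, just the general shape of a zero-filling loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill a new file with `size` bytes of zeroes, at most `chunk` bytes per
 * write() call. Returns 0 on success, -1 on failure.
 * The name and signature are hypothetical, chosen for this sketch. */
int allocate_by_writing_zeroes(const char *path, off_t size, size_t chunk)
{
    char buf[8192];

    if (chunk == 0 || chunk > sizeof(buf))
        return -1;
    memset(buf, 0, sizeof(buf));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    off_t remaining = size;
    while (remaining > 0) {
        size_t want = remaining < (off_t)chunk ? (size_t)remaining : chunk;
        ssize_t n = write(fd, buf, want);   /* one syscall per chunk */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        remaining -= n;
    }
    return close(fd);
}
```

For a 1 GiB file and 4096-byte chunks this loop issues 262144 write() calls, which is where the slowness comes from.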

posix_fallocate() is a library call that handles the chunked writes in one batch, so the application need not code its own block-by-block writes. But this is still done in userspace.
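A minimal sketch of calling posix_fallocate() could look like the following. The wrapper name is made up for this example; the one detail worth noting is that posix_fallocate() returns the error number directly instead of setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or open) `path` and make sure `size` bytes are allocated for it.
 * Returns 0 on success, or an error number on failure.
 * The wrapper name is hypothetical. */
int allocate_with_posix_fallocate(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return errno;

    /* Returns 0 or an error number; errno is not set by this call. */
    int err = posix_fallocate(fd, 0, size);

    close(fd);
    return err;
}
```

After a successful call the file is at least `size` bytes long and later writes into that range should not fail for lack of disk space.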

Linux 2.6.23 introduced the fallocate() system call. The allocation is then moved to kernel space and hence is faster. Newer file systems that support extents make this call very fast indeed: a single extent is marked as allocated on disk (traditionally, each block had to be marked as ‘used’). Fragmentation is reduced too, as file systems now keep track of extents instead of smaller blocks.
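For comparison, a sketch of calling the Linux-specific fallocate() directly is below. It assumes a glibc that ships the fallocate() wrapper (declared in fcntl.h when _GNU_SOURCE is defined); the function name is my own. Unlike posix_fallocate(), this call fails with EOPNOTSUPP when the file system has no native support, so the caller has to decide on a fallback.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel to allocate `size` bytes for `path` in one call.
 * Returns 0 on success, otherwise the errno value of the failing call.
 * The function name is hypothetical. */
int allocate_with_fallocate(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return errno;

    int err = 0;
    /* mode 0: allocate the range [0, size) and extend the file if needed. */
    if (fallocate(fd, 0, 0, size) != 0)
        err = errno;    /* EOPNOTSUPP: no fallocate support on this fs */

    close(fd);
    return err;
}
```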

posix_fallocate() will internally use fallocate() if the syscall exists in the running kernel.

So I thought it would be a good idea to make libvirt use posix_fallocate() so that systems with the newer file systems will directly benefit when allocating disk space for virtual machines. I wasn’t sure of what method libvirt already used to allocate the space. I found out that it allocated blocks in 4KiB sized chunks.

So I sent a patch to the libvir-list to convert to posix_fallocate(), and danpb asked what the benefits of this approach were, and also about alternative approaches to writing in 4K chunks. I didn’t have any data to back up my claims of “this approach will be fast and will result in less fragmentation, which is desirable”. So I set out to do some benchmarking. To do that, though, I first had to free up some disk space to create a few file systems of sufficiently large sizes. Hunting for a test machine with spare disk space proved futile, so I went about resizing my ext3 partition to free up about 15 GB of disk space. I intended to test ext3, ext4, xfs and btrfs. I could have used my existing ext3 partition for the testing, but that would not give honest results about fragmentation (an existing file system may already be fragmented, so big new files would surely be fragmented, whereas on a fresh fs I wouldn’t run into that risk).

Though even creating separate partitions on rotating storage and testing file system performance won’t give perfectly honest results, I figured that if the percentage difference in the results was quite high, that wouldn’t matter. I grabbed the latest Linus tree and the latest dev trees of the userspace utilities for all the file systems, and created a partition of about 5 GB for each fs.

I then wrote a program that created a file, allocated disk space for it and closed it, and calculated the time taken in doing so. This was done multiple times for different allocation methods: posix_fallocate(), mmap() + memset(), and writing zeroes in 4096-byte chunks and 8192-byte chunks.
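The benchmark program itself is linked further down; what follows is only a rough sketch of the same idea, not that program. It times create + allocate + close for a pluggable allocation method, with posix_fallocate() and mmap() + memset() shown as two such methods. All function and type names here are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* An allocation method takes an open fd and a size; returns 0 on success. */
typedef int (*alloc_fn)(int fd, off_t size);

int alloc_posix_fallocate(int fd, off_t size)
{
    return posix_fallocate(fd, 0, size);
}

int alloc_mmap_memset(int fd, off_t size)
{
    /* Grow the file first, then dirty every page through a shared mapping. */
    if (ftruncate(fd, size) != 0)
        return -1;
    void *p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    memset(p, 0, (size_t)size);
    return munmap(p, (size_t)size);
}

/* Time create + allocate + close. Returns elapsed seconds, or a negative
 * value if any step failed. */
double time_allocation(const char *path, off_t size, alloc_fn method)
{
    struct timespec start, end;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    /* O_RDWR so that the mmap()-based method can map the file writable. */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1.0;
    int rc = method(fd, size);
    int close_rc = close(fd);

    clock_gettime(CLOCK_MONOTONIC, &end);
    if (rc != 0 || close_rc != 0)
        return -1.0;

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Note that this measures only the time until close() returns. Methods that go through the page cache may still have dirty data waiting to be written back at that point, so a more careful harness would probably call fsync() before stopping the clock.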

With four allocation methods and partitions of about 5 GB each, I decided to check the performance by creating a 1 GiB file with each allocation method.

The program is here. The results, here. The git tree is here.

I was quite surprised to see poor performance for posix_fallocate() on ext4. On digging a bit, I realised mkfs.ext4 hadn’t created the file system with extents enabled. I reformatted the partition, but that data was valuable to have as well: it shows how much better a file system does with extent support.

Graphically, it looks like this:
Notice that ext4, xfs and btrfs take only a few microseconds to complete posix_fallocate().

The number of fragments created:

btrfs doesn’t yet have the ioctl implemented for calculating fragments.
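The post doesn’t say which ioctl was used for counting fragments. As an assumption on my part, one way to count a file’s extents on Linux is the FIEMAP ioctl: with fm_extent_count set to 0, the kernel only reports how many extents the file has. The function name below is hypothetical, and the call simply fails on file systems that don’t implement FIEMAP.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Return the number of extents backing `path`, or -1 on any failure
 * (including file systems without FIEMAP support). */
long count_extents(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct fiemap fm;
    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;     /* flush dirty data before mapping */
    fm.fm_extent_count = 0;             /* 0: only count, return no extents */

    long result = -1;
    if (ioctl(fd, FS_IOC_FIEMAP, &fm) == 0)
        result = (long)fm.fm_mapped_extents;

    close(fd);
    return result;
}
```

A count of 1 for a large preallocated file would mean the file system found one contiguous run of blocks for it; higher counts indicate more fragmentation.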

The results are very impressive and the patches to libvirt were finalised pretty quickly. They’re now in the development branch of libvirt. Coming soon to a virtual machine management application near you.

Use of posix_fallocate() will be beneficial to programs that know in advance the size of the file being created, like torrent clients, ftp clients, browsers, download managers, etc. It won’t help with speed, as data is only written when it’s downloaded, but it will help keep fragmentation as low as possible.

12 Replies to “Comparison of File Systems And Speeding Up Applications”

  1. >That’s most interesting, I have been looking at FS performance recently and haven’t looked at this fallocate() bottleneck to performance. Your numbers sure look promising indeed and of course upstream. Do you have the hardware to do this at filesystems that span 1TB+ speaking of the enterprise space? At this level (PS: I’m more of a novice in FS and Virt space) considering usage of 8GB of DDR2 @533 or @667, I am also concerned about the kernel’s true ability to use all of this space. Yes, I should be on a 64-bit system for this. I have reason to believe that VFS performance itself could be hit by bottlenecks that truly aren’t throughput or I/O scheduling problems alone.

    I also see OpenSolaris or rather SUN/Solaris sending a barrage of attacks using ZFS on linux filesystem performance. It would be interesting to launch a counter-attack that does not involve porting ZFS to linux (lame I’d say,) coz ZFS seems to be combining volume management with root filesystem concepts, The rest of what ZFS can do is just that the code is written, the design though IMHO is not too smart besides what I have mentioned above.

    I am keen on doing work that should help linux enterprise filesystems scale up to performance and compete. Maybe I haven’t searched on google for enough, but I find this space sparse. Your comments would be valuable.

  2. >Hey Beta –

    Answering a few points, one by one
    – Sorry, no hardware to test for 1TB+.
    – I don’t quite understand your concern of the kernel not being able to “utilise all of this space”.

    As for posix_fallocate(), the performance is dependent on the underlying file system. (I don’t claim to know everything about file systems and extents and their handling of blocks, but this is my guess of how things work based on the results I got. The answer is obviously in the sources, which are some free time away): If the fs supports extents, it has to allocate as few extents as possible with contiguous blocks making up each extent. When an extent is allocated with the fallocate() syscall, the blocks are marked as being ‘zeroed’ so that any reads will result in 0s being read. This way of initialising a new file to 0s is the fastest and most efficient.

    – On ZFS: there surely are enterprises deploying Linux and using huge file systems. If Linux doesn’t already have an fs that suits their needs, we’ll surely have one (I suspect many) in the future. btrfs instantly comes to mind. It’s not as if enterprises choose to deploy Sun only because they have ZFS. If that were the single biggest factor, Linux companies would be out of business (though it now looks like Sun is on shaky ground. Rumours of IBM buying Sun can only be seen as good news in this space).

    – Let’s work together whenever possible; I’d be interested to see your results as well and depending on my free time, I should be able to look at a couple of file systems.

  3. >Hi Amit,

    [ utilizing all space ]
    On the kernel using all addressable Memory space, I just referred to the specialized need for HugeMem (64GB) support (I had to do this with Ubuntu release kernels, haven’t quite checked out the same on the git repository yet.)

    While increasing memory on low cost server systems is possible the Processor(s)’s caching ability has a direct impact on efficiently using High speed DDR3. I didn’t quite mean it was not addressable or accessible. I do believe that VM optimizations would also have to accompany for the enterprise grade.

    [ On SUN::ZFS ]
    I am aware of Btrfs and I’m hoping to see it stable for usage. Sun’s ZFS does restrict itself to 64-bit server space, SPARC and x86-64 or x64 (no IA64.) It’s slim, but they do have strong momentum while RHEL will have to at least take this head on. While no threat, Solaris vs Linux ends up with {+ZFS, +DTrace, +Multi-Compiler Support(kernel), -Drivers, -Architecture Support, -Embedded Scalability}

    [ Working together ]
    I think I can put together at least one 1TB filesystem with enough space to benchmark performance at the very minimal. I am trying to fathom VFS and various bottle-necks on fs performance. That should take some time.

  4. >It doesn’t appear that ext3 or xfs implement the fallocate() system-call, so essentially you are comparing your userspace implementation with glibc’s (which is a bit faster perhaps b/c it just pwrite()s a byte per block).

  5. >FWIW, if you want a call that gives you the fast allocation behavior if the fs supports it, and an error if not, see fallocate(2) – but it’s only supported in very bleeding-edge glibc. F11 does have it, though.

  6. >Hey Eric,

    Thanks for dropping by!

    I have support for the fallocate() syscall in the test program but I don’t run the glibc from rawhide yet — I’m on F10. I’ve not yet added fallocate() support to libvirt too. I think I’ll do that some time after F11 releases.

  7. >By the way, mkfs.ext4 really should have created a file system with extents created. By any chance did you have an out-of-date /etc/e2fsck.conf file that you didn't replace when you installed a new version of mkfs.ext4?

  8. >Hey Ted,

    I was using a system based on F10 at the time; I didn't edit e2fsck.conf myself but if it did exist, it might have contained some override info.

    Checking on my current F11 system, there's no /etc/e2fsck.conf.

  9. >Running Ubuntu Jaunty here /etc/e2fsck.conf does exist. I have only ext4 and ext3 filesystems having migrated from all reiserfs on my 2 TB box.

  10. >On Ubuntu Karmic on my 1 TB box /etc/e2fsck.conf does not exist. ext4 is the only active filesystem on that box.
