Choosing a Linux filesystem

The choice for a filesystem depends on why you need it. We take a deep look at the differences among some popular Linux filesystems so you can make an informed choice.

By Ben Martin

Choice is a great thing, and a modern Linux kernel certainly offers you a wide selection of filesystems for your precious data. In this article, I examine some of the issues of filesystem design and show you how some popular filesystems handle each of these design points.

This view from a designer's perspective should give you some insights on which filesystem will work best for your environment.

Problems, Problems ...

Filesystems are deceptively simple beasts at first glance - all you want is to shove some data into files on disk and have it available again later, right?

Going with this definition, all you need the filesystem to do is track the metadata of file and directory names and their relationships with each other (the nesting inside directories) and possibly to store a continuous range of bytes for each file as its "content." Or, maybe the filesystem could just store the metadata in a table, listing the filename, file size, modification time, and so forth, as well as where to find the byte content of the files.
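
In C terms, one row of such a naive metadata table might look something like the struct below. It is only a sketch, and the layout and field names are invented for illustration.

/* A naive sketch of one row in the metadata table described above;
 * the struct and its field names are invented for illustration only. */
#include <stdint.h>
#include <time.h>

struct metadata_entry {
    char     name[256];       /* file or directory name                 */
    uint64_t parent;          /* row number of the containing directory */
    uint64_t size;            /* length of the file content in bytes    */
    time_t   mtime;           /* last modification time                 */
    uint64_t content_start;   /* where the file's bytes begin on disk   */
};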

At this stage, the filesystem still seems moderately simple. There will have to be some locking primitives on the metadata table to ensure that metadata remains consistent, even with multiple processes doing things at the same time, but the need for some form of locking is not a show stopper. Also, it would be nice to have some sort of index on the metadata table so you can read a directory quickly.

But wait ... if the power goes out on the PC, the filesystem had better still be usable and contain all the data when you turn the machine back on again. And everything should be fine even if you happen to have been writing data to the disk when the power dropped out. Maybe you don't get all the stuff that was being written at the time of the power failure, but everything else should be just fine. Also, you would probably rather not wait hours and hours for an fsck filesystem check to complete on a shiny new 2TB disk. Surely a handful of seconds is enough for the poor filesystem to determine that it was partway through doing 50 things when the power went away and to recover itself to some sane state.

Now things are starting to get more complex. If the filesystem tries to update the metadata in the same place on disk, how can it determine whether a block in the metadata table is valid or was half written when the power was lost? What about the index on the metadata table? Part of that index might have been written to disk while the other half was still in RAM waiting to be written out when the power was lost. You really, really don't want to lose any of this metadata, no matter what happens. The index must be valid as well; otherwise, the filesystem can't rely on it and might as well not have it. But even if the filesystem can detect that it was not shut down properly, rebuilding the index could take a very long time, and that leads straight into the marathon fsck scenario you wanted to avoid.

So, maybe the filesystem can maintain a journal of what metadata changes have happened along the way, and at some "checkpoint" time, build a new metadata table and index with the updates and switch from the old data structures to the new one. The system will always have at least one valid metadata table and index on disk this way, and if the machine falls over, it can use the current version and the journal to move forward to as close as possible to when the failure occurred. Although this plan might be made to work, rebuilding the index each time you want to sync filesystem changes to disk is likely to be very slow.

In addition to data consistency, you also expect a filesystem to be as quick as possible. When you untar a 50MB archive or run a distributed compile, the quicker the things you are interested in doing actually happen, the better. Luckily, you don't have to try to implement your own fast, safe filesystem; several filesystems in the Linux kernel already address these problems in various ways.

Storing the Metadata

So far, you have determined it is a good idea to store file and directory metadata in some sort of abstract table and have an index on that table in order to find entries quickly. When you consider how a directory might be handled, the index becomes a vital part of making the filesystem perform well.

Normally, each file receives a unique number: its inode. A directory will likely list the inode numbers for all the files it contains. So to find the owner, group, and access permissions for the files in a directory, the filesystem performs a series of lookups into the metadata table using those inode numbers. File metadata lookup by inode number is a fairly frequent operation, from checking access permissions of files when you read them to simply performing an ls -l on a directory.

If the filesystem design limits the number of inodes available and allocates a contiguous table for all of the inodes at filesystem creation time, the inode number can be used to find the entry in the metadata table directly. For example, if 128 inodes fit into a page, then inode 600 will be in the fifth page of the metadata table. The downside of this scheme is that the table has to be contiguous and preallocated; if you want to avoid those restrictions, using a b-tree to store both the metadata table and its index becomes an attractive solution.
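
The appeal of the direct scheme is that finding an inode is simple arithmetic. The little sketch below just repeats the example figures from the text (128 inodes per page, inode numbers starting at 1).

/* A sketch of direct lookup into a preallocated, contiguous inode
 * table, using the example figures from the text. */
#include <stdio.h>

int main(void)
{
    const long inodes_per_page = 128;   /* example figure */
    const long inode = 600;

    /* Assuming inode numbers start at 1, inode 600 lands in page
     * index 4, i.e. the fifth page of the metadata table. */
    long page = (inode - 1) / inodes_per_page;
    printf("inode %ld lives in page %ld of the table\n", inode, page);
    return 0;
}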

A b-tree can be useful for implementing an index that allows lookup of file metadata by inode. For those unfamiliar, a b-tree is an indexing method allowing you to look up a key or a range of keys quickly. The b-tree data structure dates back to the 1970s and works well on disks because you can look up a key with relatively few disk seeks, and as you add data to the b-tree, it remains balanced. (For brevity, I won't differentiate between b-trees and b+trees here.)

A b-tree is a paged data structure, meaning that many lookup keys are grouped into a single disk page (typically 4KB), and a key in a page references another page. So, if page numbers and keys are 8 bytes each, the root 4KB page of the tree can refer to 256 other pages, each of which can refer to 256 pages. As you can see, it doesn't take many pages to be able to contain thousands or millions of keys. This huge "fan out" at each page in the tree means you might be able to find a single key in a b-tree of a million keys by only needing to look at three pages.
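
A quick back-of-the-envelope calculation shows how fast that fan out grows; the sketch below simply uses the example figures from the text (4KB pages, 8-byte keys and page numbers).

/* How many keys can a b-tree of a given depth reach with 4KB pages
 * and 8-byte keys and page numbers? (Example figures from the text.) */
#include <stdio.h>

int main(void)
{
    const long long page_size = 4096;
    const long long entry     = 8 + 8;               /* key plus page number  */
    const long long fanout    = page_size / entry;   /* 256 children per page */

    long long reachable = fanout;
    for (int depth = 1; depth <= 4; depth++) {
        printf("depth %d: roughly %lld keys\n", depth, reachable);
        reachable *= fanout;
    }
    return 0;
}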

Because each page might require a disk head seek, which is an extremely slow operation, minimizing the number of pages you need to access is the name of the game. Of course, this "seeks are death" rule is changing as solid state drives penetrate the market, but the cost per gigabyte of spinning storage will likely make it the primary choice for some years to come.

The bottom-level pages in a b-tree store the "data" for a key, which can either be a pointer off into some other data structure or the data itself if it is small enough. For a filesystem, the data is the modification time, owner, permissions, and so on, which is small and can be stored directly in the b-tree. The XFS filesystem allocates and stores inodes in groups of 64; each entry in its inode b-tree references one of these inode groups.

B-trees are also ordered, which the filesystem can exploit to provide good performance. For example, when you create 50 files in a directory, if the filesystem uses sequential inode numbers for those 50 files, they will end up sorted close together in the b-tree, and a lookup for all 50 might hit the same leaf page. Because the numbers are clustered together, the pages on the path down the tree will also be the same, and you might be able to look up all 50 files with just three or four disk seeks.

Although all of this indexing is wonderful, the problem of what happens to this b-tree when the power is interrupted still exists. Remember that you don't want to have to rebuild the b-tree index, and you always want to know that it is fully valid.

Journal and COW

Two major strategies exist to maintain the consistency of filesystem metadata: the use of a journal to maintain a list of changes and the use of a Copy On Write (COW) b-tree. When using a journal, changes are first recorded in the journal and then periodically written into the primary data structures at a checkpoint, after which the journal entries can be discarded. A similar checkpoint process is used in a COW b-tree design, in which you periodically migrate to a new root node.

The journaling approach to maintaining metadata is taken by XFS, ext3, and ext4 (Figures 1 and 2). COW b-trees are used by Btrfs (Figure 3). The ext filesystem series uses a preallocated table of inodes, the XFS filesystem uses a b-tree of inodes, and Btrfs uses COW b-trees for its inodes.

Figure 1: The extended filesystem (ext) has undergone several generations of development.

Figure 2: Originally developed for Silicon Graphics, the XFS journaling filesystem is now maintained as an independent project.

Figure 3: The B-tree filesystem (Btrfs - pronounced "Butter FS") has created excitement in the open source world with advanced features for easy snapshots and multi-device spanning. As the wiki states, Btrfs is relatively new and still under heavy development.

COW B-Trees

Normally when you update a b-tree, the page containing a key is read into RAM, updated there, and then flushed back out to disk again. If adding a key makes the page too large to fit back where it goes, it is split into two pages and a new key is added to the old page's parent. In a COW b-tree, inserting a key leaves the old tree fully intact and creates a copy of the page to be updated. To link this page to a tree, a copy of all the parent pages is also made. Remember that there might only be three to five parent pages for any page in the b-tree because the data structure fans out quickly, so copying all the parents is not that expensive an operation. At this point, you will have a b-tree that uses four or five new pages and shares all the other pages with the original b-tree. Because the insertion copies all the parent pages, you will get a new tree root node each time, effectively giving you a filesystem snapshot at the time of each modification.
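
The idea of path copying is simple enough to sketch in a few lines of C. This is not Btrfs code: the node layout is invented, error handling is omitted, and the sketch assumes the key being updated already exists in the tree. It only illustrates that an update copies the root-to-leaf path and leaves the old tree intact as a snapshot.

/* A minimal sketch of path copying in a COW tree. Real filesystems
 * work on 4KB disk pages rather than heap structs; this just shows
 * that an update copies only the pages on the root-to-leaf path and
 * shares everything else with the old tree. */
#include <stdlib.h>
#include <string.h>

#define FANOUT 4

struct node {
    int is_leaf;
    int nkeys;
    long keys[FANOUT];            /* in interior nodes: max key per child */
    long vals[FANOUT];            /* used only in leaves                  */
    struct node *child[FANOUT];   /* used only in interior nodes          */
};

/* Update an existing 'key' to 'val' and return the root of the new
 * tree; 'old' is never modified and remains a valid snapshot. */
static struct node *cow_set(const struct node *old, long key, long val)
{
    struct node *copy = malloc(sizeof(*copy));   /* error handling omitted */
    memcpy(copy, old, sizeof(*copy));            /* children still shared  */

    if (old->is_leaf) {
        for (int i = 0; i < copy->nkeys; i++)
            if (copy->keys[i] == key)
                copy->vals[i] = val;     /* change only the copied page */
        return copy;
    }

    for (int i = 0; i < copy->nkeys; i++) {
        if (key <= copy->keys[i]) {
            /* copy this child too, so the new path links only new pages */
            copy->child[i] = cow_set(old->child[i], key, val);
            break;
        }
    }
    return copy;   /* the new root; syncing it out creates a snapshot */
}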

For performance reasons, a filesystem that uses COW b-trees will try to perform many operations on the b-tree in RAM and only periodically sync out the copied pages, thus creating a new persistent tree root node. If changes are held back, many changes can be performed on the copied nodes in RAM without needing to copy them again, further reducing disk accesses.

These snapshots allow you to mount the same filesystem read-only as it was at a point in time (e.g., to see your home directory as it was last month). A writable snapshot is called a clone and allows you to do things like experiment with different system changes before deciding which clone(s) you want to keep.

Although all of these copies seem to be wasting disk space, you're only talking about duplicating 4KB pages, so even if you duplicate millions of pages, you are still talking about fairly modest amounts of disk space. Once the old pages are no longer required, they can be reused so the tree doesn't keep growing to eat up the whole disk.

Btrfs uses COW trees for everything, so you can quickly take a snapshot of the entire filesystem at any point in time.

Flushing

At some point, a filesystem is going to want to know that a specific piece of data is safely written to disk - for example, that the journal or every page in a new COW b-tree is all on disk instead of partially cached in the hard disk's little on-disk 16MB memory cache.

A SATA command can be used to make sure data is on the platter instead of in a cache along the way. Unfortunately, this SATA command is an all-or-nothing flush. You can't say, "make sure the data from SATA command 12, 45, and 99 is on disk." At least you can't say that reliably on all hardware. Normally you have to say, "flush everything you've got." Given that an on-disk cache is currently only 16 or 32MB, which might not all be data that needs to be written anyway, and that you can write 100MB/sec to disk, this might not seem like a huge problem. But when you consider that that cache might not be the only one containing data to be written - for example, disk controllers might have a cache too - and that the cached data might have to go to scattered locations on the disk itself, you might be in for a nice delay when you issue such a flush SATA command.

This SATA flushing is referred to as using disk barriers [1]. The ext4, Btrfs, and XFS filesystems use barriers by default. Although metadata-heavy operations such as creating or deleting many files will be slower with barriers, the filesystem will also be much more resilient to problems caused by power failures. You can turn off barriers for these filesystems if you like, and you should be aware that barriers might not work over the device mapper framework prior to around kernel version 2.6.31, depending on which patches your Linux distribution incorporates. Hopefully, you are now thinking that if you are running a machine without a battery backup and still care about your data, then you want barriers. I have found that enabling barriers can slow down metadata operations and fsync by a factor of 10. This is a direct trade-off between speed and resilience to power failure.
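
From an application's point of view, the usual way to ask for this kind of durability is the fsync(2) system call. The sketch below (the file name is just an example) writes some data and waits until the kernel reports that it has been pushed to the device; on a barrier-enabled filesystem, this is the point at which the expensive flush described above is issued.

/* A minimal sketch: asking the kernel to push a file's data and
 * metadata to stable storage. On a barrier-enabled filesystem this
 * ends with a cache-flush command being sent to the drive. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("important.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "data that must survive a power failure\n";
    if (write(fd, buf, sizeof(buf) - 1) < 0) { perror("write"); return 1; }

    /* Returns only after the kernel has handed the data to the device
     * and, with barriers, asked it to empty its volatile write cache. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}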

Storing the Content

There are many ways to store the bytes of a file on disk, and when a filesystem stores these bytes, it obviously needs to know where they are so it can give them back to you later. With block mapping, you divide up the disk into fixed-size blocks, and each file stores a list of block numbers, so you can find all the pieces of the file again.

For small files, the list of blocks can be stored with the inode information. As a file gets larger, the list of block numbers needed to find all the bytes for the file grows too large to fit there, so the inode instead stores the number of an indirect block - a block that itself contains nothing but the block numbers locating the file content. As the file gets larger still, that intermediate block fills up, and then you get blocks with references to blocks with references to the data, and so on.
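
To get a feel for how quickly the indirection levels kick in, the calculation below uses illustrative ext2/ext3-style figures (4KB blocks, 4-byte block numbers, 12 direct pointers in the inode); treat the numbers as a rough sketch rather than a specification.

/* How much file content can be reached at each level of indirection,
 * using illustrative ext2/ext3-style figures. */
#include <stdio.h>

int main(void)
{
    const long long block  = 4096;        /* block size in bytes             */
    const long long ptrs   = block / 4;   /* 4-byte block numbers per block  */
    const long long direct = 12;          /* block numbers kept in the inode */

    printf("direct blocks only:   %lld KB\n", direct * block / 1024);
    printf("+ single indirect:    %lld MB\n",
           (direct + ptrs) * block / (1024 * 1024));
    printf("+ double indirect:    %lld GB\n",
           (direct + ptrs + ptrs * ptrs) * block / (1024LL * 1024 * 1024));
    return 0;
}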

Aside from the cost of storing the block list, another downside to block allocation is that large files will have many blocks, and these blocks might not be physically very close to each other on disk. Of course, the filesystem can try to grab blocks that are close to each other when writing a file.

On the other hand, an extent-based solution focuses on contiguous ranges of bytes [2]. In an ideal world, a single 10GB home movie would be stored in one contiguous range of bytes on disk (a single extent), and this would only require a small amount of metadata to track where the content is located; this data also could be read start-to-end without a disk seek.
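
Part of the attraction is how little metadata an extent needs: a record along the lines of the sketch below (field names invented for illustration) describes an entire contiguous run of blocks.

/* A sketch of an extent record: one small entry describes an entire
 * contiguous run of blocks. Field names are invented for illustration. */
#include <stdint.h>

struct extent {
    uint64_t file_offset;   /* where in the file this run begins      */
    uint64_t disk_block;    /* first block of the run on disk         */
    uint32_t length;        /* number of contiguous blocks in the run */
};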

As you can imagine, if you are using extents and have minimal fragmentation, you should be able to delete a large file much quicker than if using block allocation. An SGI Open Source Project paper by Christoph Hellwig [3] reveals that the time required to delete a 60GB file is about 50 seconds for ext3, about two seconds for ext4, and 0.04 seconds for XFS (his figure 1). Because the latter two filesystems employ extents, you might get a considerable advantage from an extent-based filesystem if you often create and delete large files (for instance, if you do video production work).

The Achilles' heel of extent-based allocation is file fragmentation. A pathological case is downloading a Linux distribution over a peer-to-peer network: as the file is downloaded, little pieces arrive at different offsets and are handed to the filesystem, so you might end up with hundreds of small extents containing these little pieces.

To avoid file fragmentation, the filesystem can avoid writing data to disk right away; such a strategy is called delayed allocation. For example, a program might write a file by opening it, then looping and writing 64KB chunks of data many times before finally closing the file. If the filesystem caches the written data in memory until the program closes the file, it can know exactly how large the data is before writing anything to disk. For an extent-based filesystem, it can then seek out a single extent that is closest to the desired size and write the data there in one go. Not only is fragmentation avoided, but only a single disk seek is used to write all the data out. A block allocation scheme might try to find a group of blocks that are in the same area on disk and write the data to those blocks in a manner that reduces the disk seek load when sequentially reading the data again.

Functions such as fallocate(2) allow a program to inform the filesystem up front exactly how large a file will become, but not all programs take advantage of such functions to help the filesystem out.
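
For programs that do know the final size, a call along these lines is all it takes; the sketch below uses the Linux-specific fallocate(2) call with an invented file name and an arbitrary 1GB size.

/* A minimal sketch of reserving space up front with fallocate(2) so
 * the filesystem can allocate one large extent. Example values only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big-output.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the filesystem the file will be 1GB before writing a byte;
     * not every filesystem or kernel supports this call. */
    if (fallocate(fd, 0, 0, 1024LL * 1024 * 1024) != 0) {
        perror("fallocate");
        return 1;
    }

    close(fd);
    return 0;
}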

Even with all the help of delayed allocation and fallocate, you are still going to end up with fragmentation of the contents of files. For example, if you get a filesystem near 100 percent full and more than one extent is needed for a large file, fragmentation will have to occur. This is one of the reasons many filesystems have online defragmentation tools.

For XFS, you can use the xfs_fsr filesystem reorganizer tool to optimize the extent allocation of files. One of the nice things about xfs_fsr is that you can tell it how long to run, and it will remember where it was the next time you run it. In this way, you can run a few hours in the middle of the night each day, and each time it will pick up where it left off the last time.

The ext4 filesystem is getting support for online defragmentation of extents. The e4defrag utility allows you to explicitly defragment a file, although as of early 2010, this program was not packaged for Fedora 11, 12, or Rawhide.

Choosing a Filesystem

With all these cows, journals, and trees, you might be wondering which filesystem is the best for you. The simple answer is: it depends. And perhaps the best solution is to use more than one on your systems, taking advantage of the strong points of each for particular workloads.

The ext family stores inodes in a table, XFS uses b-trees, and Btrfs uses COW b-trees for everything. Ext3, ext4, and XFS use a journal to maintain integrity in the face of power failure. The ext filesystems like to run a shortish fsck operation from time to time during boot. Ext4, XFS, and Btrfs all enable disk barriers by default, leaving ext3 as the odd one out in this respect.

A common guideline for good performance is that you should also try not to fill disks up too much. Once a filesystem gets about 90-95 percent full, the filesystem will have to start making choices that scatter files and possibly metadata around on disk.

On a related note, to reduce locking contention, many filesystems divide themselves into semi-independent subparts. For example, an XFS filesystem is divided into many "allocation groups" which, in many ways, operate like small autonomous filesystems. This allows the filesystem to update many data structures in parallel because, largely, one allocation group does not affect the others. One thing to be aware of here is that the extents of a file might have to be split and stored across many allocation groups. In a way, the pressure of low disk space is slightly amplified by using these many smaller parts.

Many filesystems let you decide how large to make an inode when you are creating the filesystem. If you are planning on using extended attributes and you want good performance at the expense of a slight loss in disk space, you might want to make your inodes slightly larger than normal. The fixed space inside an inode might have to be shared between the extent list and any extended attributes.

Ironically, when using extended attributes extensively, explicitly doubling the inode size might actually cause the filesystem to use less space overall for its metadata and give better performance at the same time. This paradox occurs because the extent list and extended attributes might then all fit into the inode, which saves the filesystem from having to move the extended attributes out to a separate 4KB block when they no longer fit.

As a concrete case, for XFS, an extent list of up to 19 entries can fit into a default-size inode. If you start using one or more extended attributes, they share the same space that those 19 entries would use, and when both no longer fit, the extended attributes are moved into a b-tree in a page of their own.

Given how Btrfs uses COW b-trees, it is easy to understand the excitement about Btrfs - a filesystem that uses extent allocation and allows extremely quick filesystem snapshots and clones [4]. The main downside to Btrfs is its relative newness. Conversely, a major advantage to the ext range of filesystems is that they have been around for a long, long time. When you are talking about your home directory, you might value the fact that many, many folks have been using ext3 for a long time as a degree of certainty that your data will continue to be available.

On the other hand, XFS has been in the Linux kernel since the 2.4 days, and, with its extent-based file allocation and online defragmentation, it can handle large files very well. A well-known downside to XFS is that it is slow to handle many file create and delete operations. So if you plan to expand tarballs of relatively small files like source code, XFS might not be the optimal choice.

In summary, if you want large files that you can defragment, XFS is a great choice. If dynamic inode allocation and extents don't excite you, then ext3 is a time-tested solution. Ext4 brings extent allocation to an ext3 filesystem design, and ext4 should have online defragmentation of extents at some point (e4defrag). If you want to have quick snapshots and clones at specific points in time, then Btrfs is the way to go. To defragment Btrfs extents, use btrfsctl -d.

If you want speed and reliability, you might consider storing your filesystem journal on a good quality SLC SSD, or invest in a battery backup and turn off filesystem barriers.

INFO
[1] Benchmarking barriers: http://ldn.linuxfoundation.org/article/filesystems-data-preservation-fsync-and-benchmarks-pt-3
[2] ext2 design: http://e2fsprogs.sourceforge.net/ext2intro.html
[3] "XFS: The Big Storage File System for Linux" by Christoph Hellwig, login:, October 2009: http://oss.sgi.com/projects/xfs/papers/hellwig.pdf
[4] Btrfs design: http://btrfs.wiki.kernel.org/index.php/Btrfs_design
THE AUTHOR

Ben Martin has been working on filesystems for more than 10 years. He completed his doctorate and now offers consulting services focused on the libferris suite, filesystems in general, and Qt/C++ development. Ben enjoys both writing open source and writing about open source projects.