7.4.2. Client data caching
In the previous section, we looked at the async thread's management of
an NFS client's buffer cache. The async threads perform
read-ahead and write-behind for the NFS client processes. We also saw
how NFS moves data in NFS buffers, rather than in page- or buffer
cache-sized chunks. The use of NFS buffers allows NFS operations to
utilize some of the sequential disk I/O optimizations of Unix disk
device drivers.
Reading in buffers that are multiples of the local filesystem block size allows
NFS to reduce the cost of getting file blocks from a server. The
overhead of performing an RPC call to read just a few bytes from a
file is significant compared to the cost of reading that data from
the server's disk, so it is to the client's and
server's advantage to spread the RPC cost over as many data
bytes as possible. If an application sequentially reads data from a
file in 128-byte buffers, the first read operation brings over a full
(8 kilobytes for NFS Version 2, usually more for NFS Version 3)
buffer from the filesystem. If the file is smaller than the buffer size,
the entire file is read from the NFS server. The next
read( ) picks up data that is in the buffer (or
page) cache, and following reads walk through the entire buffer. When
the application reads data that is not cached, another full NFS
buffer is read from the server. If there are async threads performing
read-ahead on the client, the next buffer may already be present on
the NFS client by the time the process needs data from it. Performing
reads in NFS buffer-sized operations improves NFS performance
significantly by decoupling the client application's system
call buffer size and the VFS implementation's buffer size.
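To make that pattern concrete, the fragment below is a minimal sketch
of a client application reading sequentially in 128-byte requests; the
mount point and file name are purely illustrative. The first
read( ) causes the NFS client to fetch a full buffer from the server,
and the following reads are satisfied from the page or buffer cache:

    /* Sequential reads in 128-byte requests. The first read( )
     * brings over a full NFS buffer (8 KB for NFS Version 2);
     * subsequent reads are filled from the client's cache until
     * the buffer is exhausted. File name is illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[128];
        ssize_t n;
        int fd = open("/mnt/nfs/datafile", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Reads 2 through 64 of an 8 KB buffer hit the client cache;
         * the next uncached read triggers another full buffer fetch
         * (or finds it already present, if read-ahead ran). */
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
            /* process n bytes */
        }

        close(fd);
        return 0;
    }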
Going the other way, small write operations to the same file are
buffered until they fill a complete page or buffer. When a full
buffer is written, the operating system gives it to an async thread,
and async threads try to cluster write buffers together so they can
be sent in NFS buffer-sized requests. The eventual
write RPC call is performed synchronously with respect to the
async thread; that is, the async thread does not continue execution
(and start another write or read operation) until the RPC call
completes. What happens on the server depends on what version of NFS
is being used.
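The sketch below shows the kind of small-write workload that benefits
from this clustering; the file name and record size are illustrative.
Sixteen 512-byte write( ) calls fill one 8 kilobyte NFS Version 2
buffer without any of them waiting for a write RPC to complete:

    /* Small sequential writes. Each write( ) returns as soon as the
     * data is copied into the client's cache; the async threads later
     * cluster the dirty buffers into NFS buffer-sized WRITE RPCs.
     * File name and record size are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char record[512];
        int i;
        int fd = open("/mnt/nfs/logfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(record, 'x', sizeof(record));

        /* Sixteen 512-byte writes fill one 8 KB NFS Version 2 buffer;
         * none of them waits for the server. */
        for (i = 0; i < 16; i++)
            write(fd, record, sizeof(record));

        /* close( ) triggers the flush-on-close policy discussed
         * later in this section. */
        close(fd);
        return 0;
    }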
There are elements of a write-back cache in the async threads.
Queueing small write operations until they can be done in
buffer-sized RPC calls leaves the client with data that is not
present on a disk, and a client failure before the data is written to
the server would leave the server with an old copy of the file. This
behavior is similar to that of the Unix buffer cache or the page
cache in memory-mapped systems. If a client is writing to a local
file, blocks of the file are cached in memory and are not flushed to
disk until the
operating system schedules them.
If the machine crashes between the time the data is updated in a file
cache page and the time that page is flushed to disk, the file on
disk is not changed by the write. This is also expected of systems
with local disks -- applications running at the time of the
crash may not leave disk files in well-known states.
Having file blocks cached on the server during writes poses a problem
if the server crashes. The client cannot determine which RPC write
operations completed before the crash, violating the stateless nature
of NFS. Writes cannot be cached on the server side, as this would
allow the client to think that the data was properly written when the
server is still exposed to losing the cached request during a reboot.
Ensuring that writes are
completed before they are acknowledged
introduces a major bottleneck for NFS write operations, especially
for NFS Version 2. A single Version 2 file write operation may
require up to three disk writes on the server to update the
file's inode, an indirect block pointer, and the data block
being written. Each of these server write operations must complete
before the NFS write RPC returns to the client.
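As a rough illustration of the cost, consider a client writing a 1
megabyte file over NFS Version 2 in 8 kilobyte buffers: it issues 128
write RPC calls, and each one may force up to three synchronous disk
writes on the server before the reply can be sent, for a total of up
to 384 disk operations that the client waits on, one RPC at a time.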
Some vendors eliminate most of this bottleneck by committing the data
to nonvolatile, nondisk storage at memory speeds, and then moving
data from the NFS write buffer memory to disk in large (64 kilobyte)
buffers. Even when using NFS Version 3, the introduction of
nonvolatile, nondisk storage can improve performance, though much
less dramatically than with NFS Version 2.
Using the buffer cache and allowing async threads to cluster multiple
buffers introduces some problems when several machines are reading
from and writing to the same file. To prevent file inconsistency with
multiple readers and writers of the same file, NFS institutes a
flush-on-close policy:
- All partially filled NFS buffers are written to the NFS server when a
file is closed.
- For NFS Version 3 clients, any writes that were done with the stable
flag set to off are forced onto the server's stable storage via
the commit operation.
This ensures that a process on another NFS client sees all changes to
a file that it is opening for reading:
    Client A                           Client B
    --------                           --------
    open( )
    write( )
    NFS Version 3 only: commit
    close( )
                                       open( )
                                       read( )
The read( ) system call on Client B will see all
of the data in a file just written by Client A, because Client A
flushed out all of its buffers for that file when the
close( ) system call was made. Note that file
consistency is less certain if Client B opens the file before Client
A has closed it. If overlapping read and write operations will be
performed on a single file, file locking must be used to prevent
cache consistency problems. When a file has been locked, the use of
the buffer cache is disabled for that file, making it more of a
write-through than a write-back cache. Instead of bundling small NFS
requests together, each NFS write request for a locked file is sent
to the NFS server immediately.
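As a rough sketch of that locking approach, the fragment below takes
an advisory fcntl( ) write lock on the whole file before updating it;
the file name is illustrative and error handling is minimal. While
the lock is held, the client sends each write to the server
immediately instead of holding it in its cache:

    /* Lock a shared NFS file before writing to it, so that other
     * clients see a consistent view. File name is illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        struct flock lock;
        char update[] = "new record\n";
        int fd = open("/mnt/nfs/shared", O_RDWR);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Request a write lock on the entire file and wait until
         * it is granted. */
        memset(&lock, 0, sizeof(lock));
        lock.l_type = F_WRLCK;
        lock.l_whence = SEEK_SET;
        lock.l_start = 0;
        lock.l_len = 0;                 /* 0 means "to end of file" */
        if (fcntl(fd, F_SETLKW, &lock) < 0) {
            perror("fcntl");
            return 1;
        }

        /* With the file locked, this write is sent to the server
         * immediately rather than buffered on the client. */
        write(fd, update, strlen(update));

        /* Release the lock and close the file. */
        lock.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lock);
        close(fd);
        return 0;
    }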