Disconnected Operation in a Distributed File System
James J. Kistler
Carnegie Mellon University, PA, USA
May 1993
Disconnected operation refers to the ability of a distributed system
client to operate despite server inaccessibility by emulating services locally.
The capability to operate disconnected is already valuable in
many systems, and its importance is growing with two major trends:
the increasing scale of distributed systems, and the
proliferation of powerful mobile computers. The former makes clients vulnerable
to more frequent and less controllable system failures, and the
latter introduces an important class of clients which are disconnected
frequently and for long durations -- often as a matter of choice.
This dissertation shows that it is practical to support
disconnected operation for a fundamental system service:
general purpose file management. It describes the architecture, implementation,
and evaluation of disconnected file service in the Coda file system. The architecture
is centered on the idea that the disconnected service agent should be one
and the same with the client cache manager. The Coda cache manager prepares for
disconnection by pre-fetching and hoarding copies of critical files;
while disconnected it logs all update activity and otherwise emulates server behavior;
upon reconnection it reintegrates by sending its log to the server for replay.
This design achieves the goal of high data availability -- users can access
many of their files while disconnected -- but it does not sacrifice
the other positive properties of contemporary distributed file systems:
scalability, performance, security, and transparency.
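
The hoard, emulate, and reintegrate cycle described above can be sketched as
a small state machine. This is a minimal illustration under stated assumptions,
not Coda's actual implementation: the names Server, CacheManager, hoard_list,
and replay are hypothetical, the server is an in-memory dictionary, and the
conflict detection needed for partitioned data sharing is omitted.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    HOARDING = auto()       # connected: keep critical files cached
    EMULATION = auto()      # disconnected: serve from cache, log updates
    REINTEGRATION = auto()  # reconnected: replay the log at the server


@dataclass
class Server:
    """Hypothetical in-memory stand-in for a file server."""
    files: dict = field(default_factory=dict)

    def replay(self, log):
        # Apply each logged update in the order it was made.
        for path, data in log:
            self.files[path] = data


@dataclass
class CacheManager:
    """Hypothetical client cache manager that doubles as the disconnected agent."""
    server: Server
    hoard_list: list                      # paths the user considers critical
    state: State = State.HOARDING
    cache: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def hoard(self):
        # Pre-fetch critical files while the server is still reachable.
        for path in self.hoard_list:
            if path in self.server.files:
                self.cache[path] = self.server.files[path]

    def disconnect(self):
        self.state = State.EMULATION

    def read(self, path):
        if self.state is State.EMULATION:
            # A KeyError here models a disconnected cache miss.
            return self.cache[path]
        data = self.server.files[path]
        self.cache[path] = data
        return data

    def write(self, path, data):
        self.cache[path] = data
        if self.state is State.EMULATION:
            self.log.append((path, data))   # defer the update until reconnection
        else:
            self.server.files[path] = data  # connected: write through

    def reconnect(self):
        # Send the log to the server for replay, then resume hoarding.
        self.state = State.REINTEGRATION
        self.server.replay(self.log)
        self.log.clear()
        self.state = State.HOARDING
```

In this sketch a client would call hoard() while connected, then disconnect(),
read and write hoarded files locally, and finally call reconnect() to push the
logged updates. Placing all three roles in one object mirrors the idea that the
disconnected service agent and the cache manager are one and the same.
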
The system has been in serious use by more than twenty people
over the course of two years. Both stationary and mobile workstations
have been employed as clients, and disconnections have
ranged up to about ten days in length.
Usage experience has been extremely positive. The hoarding strategy
has sufficed to avoid most disconnected cache misses, and
partitioned data sharing has been rare enough to cause
very few reintegration failures. Measurements and simulation results
indicate that disconnected operation in Coda should be equally transparent
and successful at much larger scale.
The main contributions of the thesis work and this dissertation are the following:
a new, client-based approach to data availability that exploits existing
system structure and has special significance for mobile computers;
an implementation of the approach that is robust enough to have
been put to real use; and an analysis that sheds further light on the scope
and applicability of the approach.