ZFS–Zetta file systems.
The breakthrough file system in Solaris 10 delivers virtually unlimited capacity, provable data integrity, and near-zero administration. ZFS, the dynamic new file system in Sun’s Solaris 10 Operating System , will make you forget everything you thought you knew about file systems.
ZFS is supported on both SPARC and x86 platforms.You can easily move disks from a SPARC server to an x86 server. ZFS is the only self-healing, self-managing general-purpose file system. It is very easy for administration , Provides data integrity and never required fsck for inconsistency of file systems.
World’s first 128-bit file system, ZFS offers 16 billion billion times the capacity of 32- or 64-bit systems.The limitations of ZFS are designed to be so large that they will not be encountered in practice for some time.
ZFS offers relief for your company’s bottom line. Because ZFS is built on top of virtual storage pools (unlike traditional file systems that require a separate volume manager), creating and deleting file systems is much less complex. Not only does this eliminate the need to pay for volume manager licenses and allow for single support contracts, it lowers administration costs and increases storage utilization.
Creating a storage pool, growing a pool, or adding or removing a file system can all be done with a single, simple command–instead of the multistep process (format, newfs, edit/etc/vfstab, and so on) found in traditional file systems and volume managers.
Consider this case: To create a pool, to create three file systems, and then to grow the pool–5 logical steps–5 simple ZFS commands are required, as opposed to 28 steps with a traditional file system and volume manager.
Moreover, these commands are all constant-time and complete in just a few seconds. Traditional file systems and volumes often take hours to configure. In the case above, ZFS reduces the time required to complete the tasks from 40 minutes to under 10 seconds.
ZFS can self-heal data in a mirrored or RAID configuration. When one copy is damaged, ZFS detects it via the checksum and uses another copy to repair it.
A volume manager is a layer of software that groups a set of block devices in order to implement some form of data protection and/or aggregation of devices exporting the collection as a storage volumes that behaves as a simple block device.
A filesystem is a layer that will manage such a block device using a subset of system memory in order to provide Filesystem operations (including Posix semantics) to applications and provide a hierarchical namespace for storage – files. Applications issue reads and writes to the Filesystem and the Filesystem issues Input and Output (I/O) operations to the storage/block device.
ZFS implements those 2 functions at once. It thus typically manages sets of block devices (leaf vdev), possibly grouping them into protected devices (RAID-Z or N-way mirror) and aggregating those top level vdevs into pool. Top level vdevs can be added to a pool at any time. Objects that are stored onto a pool will be dynamically striped onto the available vdevs.
Associated with pools, ZFS manages a number of very lightweight filesystem objects. A ZFS filesystem is basically just a set of properties associated with a given mount point. Properties of a filesystem includes the quota (maximum size) and reservation (guaranteed size) as well as, for example, whether or not to compress file data when storing blocks. The filesystem is characterized as lightweight because it does not statically associate with any physical disk blocks and any of its settable properties can be simply changed dynamically.
The recordsize is one of those properties of a given ZFS filesystem instance. ZFS files smaller than the recordsize are stored using a single filesystem block (FSB) of variable length in multiple of a disk sector (512 Bytes). Larger files are stored using multiple FSB, each of recordsize bytes, with default value of 128K.
The FSB is the basic file unit managed by ZFS and to which a checksum is applied. After a file grows to be larger than the recordsize (and gets to be stored with multiple FSB) changing the Filesystem’s recordsize property will not impact the file in question. A copy of the file will inherit the tuned recordsize value. A FSB can be mirrored onto a vdev or spread to a RAID-Z device.
The recordsize is currently the only performance tunable of ZFS. The default recordsize may lead to early storage saturation: For many small updates (much smaller than 128K) to large files (bigger than 128K) the default value can cause an extra strain on the physical storage or on the data channel (such as a fiber channel) linking it to the host. For those loads, If one notices a saturated I/O channel then tuning the recordsize to smaller values should be investigated.
The basic mode of operation for writes operations that do not require synchronous semantics (no O_DSYNC, fsync(), etc), is that ZFS will absorb the operation in a per host system cache called Adaptive Replacement Cache (ARC). Since there is only one host system memory but potentially multiple ZFS pools, cached data from all pools is handled by a unique ARC.
Each file modification (e.g. a write) is associated with a certain transaction group (TXG). At regular interval (default of txg_time = 5 seconds) each TXG will shut down and the pool will issue a sync operation for that group. A TXG may also be shut down when the ARC indicates that there is too much dirty memory currently being cached. As a TXG closes, a new one immediately opens and file modifications then associate with the new active TXG.
If the active TXG shuts down while a previous one is still in the process of syncing data to the storage, then applications will be throttled until the running sync completes. In this situation where are sinking a TXG, while TXG + 1 is closed due to memory limitations or the 5 second clock and is waiting to sync itself; applications are throttled waiting to write to TXG + 2. We need sustained saturation of the storage or a memory constraint in order to throttle applications.
A sync of the Storage Pool will involve sending all level 0 data blocks to disk, when done, all level 1 indirect blocks, etc. until eventually all blocks representing the new state of the filesystem have been committed. At that point we update the ueberblock to point to the new consistent state of the storage pool.
ZFS Intent Log (ZIL)
For file modification that come with some immediate data integrity constraint (O_DSYNC, fsync etc.) ZFS manages a per-filesystem intent log or ZIL. The ZIL marks each FS operation (say a write) with a log sequence number. When a synchronous command is requested for the operation (such as an fsync), the ZIL will output blocks up to the sequence number. When the ZIL is in process of committing data, further commit operations will wait for the previous ones to complete. This allows the ZIL to aggregate multiple small transactions into larger ones thus performing commits using fewer larger I/Os.
The ZIL works by issuing all the required I/Os and then flushing the write caches if those are enabled. This use of disk write cache does not artificially improve a disk’s commit latency because ZFS insures that data is physically committed to storage before returning. However the write cache allows a disk to hold multiple concurrent I/O transactions and this acts as a good substitute for drives that do not implement tag queues.
I/O Scheduler and Priorities
ZFS keeps track of pending I/Os but only issues to disk controllers a certain number (35 by default). This allows the controllers to operate efficiently while never overflowing their queues. By limiting the I/O queue size, service times of individual disks are kept to reasonable values. When one I/O completes, the I/O scheduler then decides the next most important one to issue. The priority scheme is timed based; so for instance an Input I/O to service a read calls will be prioritize over any regular Output I/O issued in the last ~ 0.5 seconds.
The fact that ZFS will limit each leaf devices I/O queue to 35, is one of the reasons that suggests that zpool should be built using vdevs that are individual disks or at least volumes that map to small number of disks. Otherwise this self imposed limits could become an artificial performance throttle.
If a read cannot be serviced from the ARC cache, ZFS will issue a ‘prioritized’ I/O for the data. So even if the storage is handling a heavy output load, there are only 35 I/Os outstanding, all with reasonable service times. As soon as one of the 35 I/Os completes the I/O scheduler will issue the read I/O to the controller. This insures good service times for read operations in general.
However to avoid starvation, when there is a long-standing backlog of Output I/Os then eventually those regain priority over the Input I/O. ZIL synchronous I/Os are of the same priority to synchronous reads.
Soft Track Buffer
An input I/O is serious business. While a Filesystem can decide where to write stuff out on disk, the Inputs are requested by applications. This means a necessary head seek to the location of the data. The time to issue a small read will be totally dominated by this seek. So ZFS takes the stance that it might as well amortize those operations and so, for uncached reads, ZFS normally will issue a fairly large Input I/O (64K by default). This will help loads that input data using similar access pattern to the output phase. The data goes into a per device cache holding 20MB.
This cache can be invaluable in reducing the I/Os necessary to read-in data. But just like the recordsize, if the inflated I/O cause a storage channel saturation the Soft Track Buffer can act as a performance throttle.
The ARC Cache
The most interesting caching occurs at the ARC layer. The ARC manages the memory used by blocks from all pools (each pool servicing many filesystems). ARC stands for Adaptive Replacement Cache.
That ARC manages it’s data keeping a notion of Most Frequently Used (MFU) and Most Recently Use (MRU) balancing intelligently between the two. One of it’s very interesting properties is that a large scan of a file will not destroy most of the cached data.
On a system with Free Memory, the ARC will grow as it starts to cache data. Under memory pressure the ARC will return some of it’s memory to the kernel until low memory conditions are relieved.