Clusters built out of commodity components are becoming more and more prevalent in the supercomputing sector as a cost-effective solution for building high-performance, costeffective parallel platforms. Symmetric Multiprocessors, or SMPs for short, are commonly used as building blocks for scalable clustered systems, when interconnected over a high bandwidth, low latency communications infrastructure, such as Myrinet or Infiniband.

To meet the I/O needs of modern HPC applications, a distributed, cluster filesystem needs to be deployed, allowing processes running on cluster nodes to access a common filesystem namespace and perform I/O from and to shared data concurrently. Today, there are various cluster filesystems implementations which focus on high performance, i.e. high aggregate I/O bandwidth, low I/O latency and high number of sustainable I/O operations per second, as multiple clients perform concurrent access to shared data. At the core of their design is a shared-disk approach, in which all participating cluster nodes are assumed to have equal access to a shared storage pool.

In such environments, the requirement that all nodes have access to a shared storage pool is usually fulfilled by utilizing a high-end Storage Area Network (SAN), traditionally based on Fibre-Channel. An SAN is a networking infrastructure providing high-speed connections between multiple nodes and a number of hard disk enclosures. The disks are treated by the nodes as Direct-attached Storage, i.e. the protocols used are similar to those employed for accessing locally attached disks, such as SCSI over FC.

However, this storage architecture entails maintaining two separate networks, one for access to shared storage and a distinct one for Inter-Process Communication (IPC) between peer processes. This increases the cost per node, since the SAN needs to scale to a large number of nodes and each new member of the cluster needs to be equipped with an appropriate interface to access it (e.g. an FC Host Bus Adapter). Moreover, while the number of nodes increases, the aggregate bandwidth to the storage pool remains constant, since it is determined by the number of physical links to the storage enclosures. To address these problems, a shared-disk filesystem is more commonly deployed in such way that only a small fraction of the cluster nodes is physically connected to the SAN (“storage” nodes), exporting the shared disks for block-level access by the remaining nodes, over the cluster interconnection network. In this approach, all nodes can access the parallel filesystem by issuing block read and write requests over the interconnect. The storage nodes receive the requests, pass them to the storage subsystem and eventually return the results of the operations back to the client node.

Taken to extreme, this design approach allows shared-disk filesystems to be deployed over shared-nothing architectures, by having each node contribute part or all of its locally available storage (e.g. a number of directly attached Serial ATA or SCSI disks) to a virtual, shared, block-level storage pool. This model has a number of distinct advantages:

  • aggregate bandwidth to storage increases as more nodes are added to the system; since more I/O links to disks are added with each node, the performance of the I/O subsystem scales along with the computational capacity of the cluster
  • the total installation cost is drastically reduced, since a dedicated SAN remains small, or is eliminated altogether, allowing resources to be diverted to acquiring more cluster nodes. These nodes have a dual role, both as compute and as storage nodes.

The cornerstone of this design is the network disk sharing layer, usually implemented in a client/server approach. It runs as a server on the storage nodes, receiving requests and passing them transparently to a directly-attached storage medium. It also runs as a client on cluster nodes, exposing a block device interface to the Operating System and the locally executing instance of the parallel filesystem. There are various implementations of such systems, facilitating blocklevel sharing of storage devices over the interconnect. GPFS includes the NSD (Network Shared Disks) layer, which takes care of forwarding block access requests to storage nodes over TCP/IP. Traditionally, the Linux kernel has included the NBD (Network Block Device) driver and Redhat’s GFS can also be deployed over an improved version called GNBD.

However, all of these implementations are based on TCP/IP. Thus, they treat all modern cluster interconnects uniformly, without any regard to their advanced communication features, such as support for zero-copy message exchange using DMA. Employing a complex protocol stack residing in the kernel results in very good code portability but imposes significant protocol overhead; using TCP/IP-related system calls results in frequent data copying between userspace and kernelspace, increased CPU utilization and high latency. Moreover, this means that less CPU time is made available to the actual computational workload executing on top of the cluster, as its I/O load increases.

On the other hand, cluster interconnects such as Myrinet and Infiniband are able to remove the OS from the critical path (OS bypass) by offloading communication protocol processing to embedded microprocessors onboard the NIC and employing DMA engines for direct message exchange from and to userspace buffers. This leads to significant improvements in bandwidth and latency and reduces host CPU utilization dramatically. There are research efforts focusing on implementing block device sharing over such interconnects and exploiting their Remote DMA (RDMA) capabilities. However, the problem still remains that the data follow an unoptimized path. Whenever block data need to be exchanged with a remote node they first need to be transferred from the storage pool to main memory, then from main memory to the interconnect NIC. These unnecessary data transfers impact the computational capacity of a storage node significantly, by aggravating contention on shared resources as is the shared bus to main memory and the peripheral (e.g. PCI) bus. Even with no processor sharing involved, compute-intensive workloads may suffer significant slowdowns since the memory access cost for each processor becomes significantly higher due to contention with I/O on the shared path to memory.

We present an efficient method for implementing network shared block-level storage over processor and DMA-enabled cluster interconnects, using Myrinet as a case study. Our approach, called gmblock, is based on direct transfers of block data from disk to network and viceversa and does not require any intermediate copies in RAM.

It builds on existing OS and user level networking abstractions, enhancing them to support the necessary functionality, instead of relying on low-level architecture-specific code changes. By employing OS abstractions, such as the Linux VM mechanism and the Linux I/O subsystem to construct the proposed data path, we maintain UNIX-like process semantics (I/O system calls, process isolation and memory protection, enforcement of access rights, etc). Moreover, application code remains portable across different architectures and only minimal code changes are necessary. Thus, gmblock can be a viable solution for implementing shared-disk parallel filesystems on top of distributed storage, and can be integrated in existing parallel I/O infrastructures as an efficient substrate for network block device sharing. The proposed extensions to the Linux VM mechanism and the GM message-passing framework for Myrinet are generic enough to be useful for a variety of other uses, providing an efficient framework for direct data movement between I/O and communication devices, abstracting away architecture- and device-specific details.

We believe this approach is generic enough to be applicable to different types of devices. An example would be a system for direct streaming of data coming from a video grabber to structured, filesystem based storage. Moreover our approach lends itself to resource virtualization: each user may make use of the optimized data path while sharing access to the NIC or other I/O device with other users transparently.

We plan to evaluate the performance of gmblock on higher-end systems, equipped with PCI-X/PCI Express buses and more capable storage devices. We also plan to integrate it as a storage substrate in an existing GPFS/NSD installation, in order to examine its interaction with the parallel filesystem layer.

Edit | Attach | Watch | Print version | History: r5 < r4 < r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r2 - 2008-03-08 - AnastasiosNanos

No permission to view TWiki.WebTopBar

This site is powered by the TWiki collaboration platform Powered by Perl

No permission to view TWiki.WebBottomBar