CSLab Web>ActivitiesProjects>Gmblock (2008-03-21, VangelisKoukis)

gmblock

Clusters built out of commodity components are becoming increasingly prevalent in the supercomputing sector as a cost-effective way of building high-performance parallel platforms. Symmetric Multiprocessors (SMPs) are commonly used as building blocks for scalable clustered systems, interconnected over a high-bandwidth, low-latency communications infrastructure such as Myrinet or Infiniband.

To meet the I/O needs of HPC applications running on such clusters, cluster filesystems are deployed, enabling access to a common filesystem namespace and concurrent I/O operations on shared data. Most high-performance cluster filesystems are shared-disk filesystems [IBM GPFS, Red Hat GFS, Oracle OCFS2], meaning that all participating access nodes need block-level access to a shared storage pool with Direct-Attached Storage semantics (e.g., as SCSI or SAS devices). Traditionally, FibreChannel-based Storage Area Networks (SANs) have been used to meet this requirement in enterprise environments. However, considerations of cost-effectiveness, redundancy and reliability have shifted the focus from deploying dedicated SANs to providing block-level access to shared storage over the same interconnect used for IPC. This is made possible by a Network Block Device (nbd) layer, which allows cluster nodes to contribute part of their local storage in order to form virtual, shared, block-level storage pools.
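To make the idea of a virtual pool built from per-node storage concrete, here is a minimal sketch of one possible block layout. It assumes simple round-robin striping across the contributing nodes; the page does not say how the pool is actually laid out, and the names `BlockLocation`, `locate_block` and `stripe_blocks` are hypothetical.

```python
from typing import NamedTuple


class BlockLocation(NamedTuple):
    node: int         # index of the node that exports the block
    local_block: int  # block number within that node's exported device


def locate_block(virtual_block: int, num_nodes: int,
                 stripe_blocks: int = 1) -> BlockLocation:
    """Map a block of the virtual pool to (node, local block).

    Assumption: round-robin striping, `stripe_blocks` consecutive
    blocks per node before moving on to the next node.
    """
    if virtual_block < 0 or num_nodes < 1 or stripe_blocks < 1:
        raise ValueError("invalid block number or pool geometry")
    # Which stripe the block falls in, and its position inside that stripe.
    stripe, within = divmod(virtual_block, stripe_blocks)
    # Stripes are dealt out to the nodes in turn, row after row.
    row, node = divmod(stripe, num_nodes)
    return BlockLocation(node, row * stripe_blocks + within)
```

With such a layout, consecutive stripes of a large sequential request land on different nodes, so the request is served over several links at once.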

The gmblock project encompasses our work on designing and implementing scalable block-level storage sharing over Myrinet, so that shared-disk filesystems may be deployed over a shared-nothing architecture. In this case every cluster node assumes a dual role: it is both a compute and a storage node. This has several distinct advantages:

Cost-effectiveness: No need to equip every cluster node with both a NIC and a FibreChannel HBA. The SAN can be eliminated altogether and resources redirected to acquiring more compute nodes. Instead of having to maintain two distinct networks, the cluster interconnect carries storage traffic.
Scalability: The number of links to storage increases with the number of nodes. Adding a new compute node to the system increases the aggregate I/O bandwidth.
Redundancy: Instead of having only a limited number of SAN I/O controllers and links to storage enclosures, data are distributed and possibly replicated across a large number of disks and fetched over multiple links.

Previous work has highlighted the impact of high network I/O load on the total execution time of compute-intensive applications, due to memory contention. The gmblock nbd system aims to minimize the impact of remote block I/O operations on the server side, where memory and peripheral bus bandwidth are limited, by constructing a direct disk-to-NIC data path: when servicing an I/O request, data are moved directly from the Myrinet NIC to the local disk pool (in the case of a write) or from the local disk to the Myrinet NIC (in the case of a read).
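The benefit of the direct path can be illustrated with a rough back-of-the-envelope model of server-side bus traffic per request. This is only a sketch under stated assumptions: the conventional server is taken to stage data in RAM with two DMA transfers and no extra CPU copies, and the disk controller and NIC are taken to share one peripheral bus. The function name `server_bus_traffic` is hypothetical.

```python
def server_bus_traffic(request_bytes: int, direct_path: bool) -> dict:
    """Estimate bytes crossing each server bus to service one request.

    Simplified model (assumptions, not measurements):
      - staged path: disk -> RAM by DMA, then RAM -> NIC by DMA
      - direct path: disk -> NIC memory in a single peer-to-peer transfer
    """
    if request_bytes < 0:
        raise ValueError("request size must be non-negative")
    if direct_path:
        # One transfer over the peripheral bus, main memory untouched.
        return {"peripheral_bus": request_bytes, "memory_bus": 0}
    # Two peripheral-bus transfers, plus one memory write and one memory read.
    return {"peripheral_bus": 2 * request_bytes,
            "memory_bus": 2 * request_bytes}
```

Under this model, memory-bus traffic caused by storage requests drops to zero on the direct path, which is the bandwidth that concurrently running compute jobs would otherwise have to share.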

To build this data path, gmblock combines the OS-bypass, zero-copy networking features provided by Myrinet/GM with Linux's direct I/O layer and custom extensions to its VM mechanism. Thus, gmblock builds on existing OS and user-level networking abstractions and requires only minimal low-level, architecture-specific code changes. Its server component is implemented in userspace.

The GM message-passing infrastructure is extended to support the creation and mapping of large message buffers that reside not in RAM but in the SRAM onboard the Myrinet NIC. The GM firmware is enhanced to allow these buffers to be used transparently in message send and receive operations. The buffers are mapped to the server's VM space and Linux's VM subsystem is extended to support “direct I/O” from and to these areas. The net result is that, when the userspace server issues a read() or write() system call, the DMA engines on the storage medium are programmed by the Linux kernel to exchange data with the Myrinet NIC directly over the peripheral bus, without any CPU or main memory involvement, in a completely transparent way [IPDPS 2007]. The concept is applicable to any interconnect that features a programmable NIC and exports on-board memory to the PCI physical address space.
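From the server's point of view, the pattern is an ordinary read() whose destination happens to be a mapped buffer. The sketch below shows only that pattern and is not gmblock code: an anonymous `mmap` region stands in for the mapped NIC SRAM, a regular file stands in for the block device, and the file is opened without `O_DIRECT`, so here the data still pass through the page cache. The name `read_into_mapped_buffer` is hypothetical.

```python
import mmap
import os
import tempfile


def read_into_mapped_buffer(path: str, offset: int, length: int, buf) -> int:
    """Read `length` bytes at `offset` of `path` straight into `buf`.

    `buf` is any writable mapped buffer (here an anonymous mmap that
    plays the role of the NIC SRAM mapping). Returns the bytes read.
    """
    if length > len(buf):
        raise ValueError("request does not fit in the mapped buffer")
    total = 0
    # buffering=0: each readinto() below is a single read() system call.
    with open(path, "rb", buffering=0) as dev, memoryview(buf) as view:
        dev.seek(offset)
        while total < length:
            n = dev.readinto(view[total:length])
            if not n:          # end of file reached early
                break
            total += n
    return total


if __name__ == "__main__":
    # Hypothetical usage: serve 512 bytes starting at offset 256.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(range(256)) * 4)
    sram_stand_in = mmap.mmap(-1, 4096)
    print(read_into_mapped_buffer(f.name, 256, 512, sram_stand_in))
    os.unlink(f.name)
```

The server code does not change when the destination buffer is backed by NIC memory instead of RAM; that is what makes the approach transparent to the userspace component.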

To compensate for the limited amount of memory available on the Myrinet NIC and to allow larger requests to make progress without any CPU involvement, gmblock was extended [CAC 2008] to support synchronized send operations. Their semantics allow the storage medium and the Myrinet NIC to coordinate the flow of data in a peer-to-peer manner, pipelining data from the storage medium to the Lanai SRAM and on to the fiber link. Performance evaluation of gmblock showed a significant increase in throughput, reduced pressure on the memory and peripheral buses, and much improved execution times for concurrently executing compute-intensive applications, compared to a TCP/IP-based and a GM-based nbd implementation.
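The pipelining idea can be illustrated in software with a small pool of staging buffers shared by two stages. This is an analogy only, not the synchronized send mechanism itself: in gmblock the coordination happens between the storage controller and the NIC, whereas here two threads and two queues play those roles. The names `pipelined_transfer`, `chunk_size` and `slots` are hypothetical.

```python
import queue
import threading


def pipelined_transfer(source: bytes, chunk_size: int, slots: int) -> bytes:
    """Stream `source` through `slots` staging buffers of `chunk_size` bytes.

    The "disk" stage fills free buffers, the "NIC" stage drains filled
    ones, so a request far larger than slots * chunk_size still completes.
    """
    if chunk_size < 1 or slots < 1:
        raise ValueError("chunk_size and slots must be positive")
    staging = [bytearray(chunk_size) for _ in range(slots)]
    free = queue.Queue()    # indices of buffers ready to be filled
    filled = queue.Queue()  # (index, length) of buffers ready to be sent
    for i in range(slots):
        free.put(i)
    sent = bytearray()

    def disk_stage():
        for off in range(0, len(source), chunk_size):
            i = free.get()                 # block until a buffer is free
            chunk = source[off:off + chunk_size]
            staging[i][:len(chunk)] = chunk
            filled.put((i, len(chunk)))
        filled.put(None)                   # end-of-request marker

    def nic_stage():
        while True:
            item = filled.get()            # block until data are ready
            if item is None:
                break
            i, n = item
            sent.extend(staging[i][:n])    # "send" the chunk
            free.put(i)                    # hand the buffer back

    stages = [threading.Thread(target=disk_stage),
              threading.Thread(target=nic_stage)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return bytes(sent)
```

Because a buffer is reused as soon as its contents have been sent, the amount of staging memory bounds only the number of chunks in flight, not the size of the request.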

Publications

E. Koukis, A. Nanos and N. Koziris, “Synchronized Send Operations for Efficient Streaming Block I/O over Myrinet,” Proceedings of the Workshop on Communication Architecture for Clusters (CAC 2008), held in conjunction with the 22nd International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, USA, 14-18 April, 2008, to appear
E. Koukis and N. Koziris, “Efficient Block Device Sharing over Myrinet with Memory Bypass,” Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), p. 29, Long Beach, CA, USA, 26-30 March, 2007