Difference: Gmblock (2 vs. 3)

Revision 3 2008-03-11 - VangelisKoukis

Line: 1 to 1
Changed:
<
<
META TOPICPARENT name="MemBUS"

GMBlock

>
>
META TOPICPARENT name="ActivitiesProjects"

gmblock

 
Changed:
<
<
Clusters built out of commodity components are becoming more and more prevalent in the supercomputing sector as a cost-effective solution for building high-performance, costeffective parallel platforms. Symmetric Multiprocessors, or SMPs for short, are commonly used as building blocks for scalable clustered systems, when interconnected over a high bandwidth, low latency communications infrastructure, such as Myrinet or Infiniband.
>
>
Clusters built out of commodity components are becoming more and more prevalent in the supercomputing sector as a cost-effective solution for building high-performance, cost-effective parallel platforms. Symmetric Multiprocessors, or SMPs for short, are commonly used as building blocks for scalable clustered systems, when interconnected over a high bandwidth, low latency communications infrastructure, such as Myrinet or Infiniband.
 
Changed:
<
<
To meet the I/O needs of modern HPC applications, a distributed, cluster filesystem needs to be deployed, allowing processes running on cluster nodes to access a common filesystem namespace and perform I/O from and to shared data concurrently. Today, there are various cluster filesystems implementations which focus on high performance, i.e. high aggregate I/O bandwidth, low I/O latency and high number of sustainable I/O operations per second, as multiple clients perform concurrent access to shared data. At the core of their design is a shared-disk approach, in which all participating cluster nodes are assumed to have equal access to a shared storage pool.
>
>
To meet the I/O needs of HPC applications running on top of them, cluster filesystems are deployed, enabling access to a common filesystem namespace and concurrent I/O operations on shared data. Most high-performance cluster filesystems are shared-disk filesystems [IBM GPFS, Redhat GFS, Oracle OCFS2], meaning that all participating nodes need block-level access to a shared storage pool with Direct-Attached Storage semantics (e.g., as SCSI or SAS devices). Traditionally, FibreChannel-based Storage Area Networks (SANs) have been used to meet this requirement in enterprise environments. However, reasons of cost-effectiveness, redundancy and reliability have shifted the focus from deploying dedicated SANs to providing block-level access to shared storage over the same interconnect used for IPC. This is made possible with the use of a Network Block Device, or nbd, layer, which allows cluster nodes to contribute part of their local storage in order to form virtual, shared, block-level storage pools.
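To give a rough idea of what such an nbd layer does, the sketch below implements a toy block server over TCP: it accepts fixed-size requests carrying an operation, an offset and a length, and services them against a local disk or image file. The wire format, port number and backing file path are invented for the example; this is not the actual NBD or GNBD protocol, and error handling is minimal.

/* Toy nbd-style server: exports a local disk/image over TCP.
 * The request format below is invented for illustration only;
 * it is NOT the real NBD/GNBD wire protocol. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

struct req { uint32_t op; uint64_t offset; uint32_t len; }; /* op: 0 = read, 1 = write */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/tmp/disk.img";  /* hypothetical backing store */
    int disk = open(path, O_RDWR);
    if (disk < 0) { perror("open"); return 1; }

    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { .sin_family = AF_INET,
                              .sin_port = htons(9000),        /* arbitrary port */
                              .sin_addr.s_addr = INADDR_ANY };
    bind(srv, (struct sockaddr *)&sa, sizeof(sa));
    listen(srv, 1);
    int cli = accept(srv, NULL, NULL);

    struct req r;
    char *buf = malloc(1 << 20);                              /* staging buffer in RAM */
    while (read(cli, &r, sizeof(r)) == sizeof(r)) {
        if (r.op == 0) {                                      /* read: disk -> RAM -> network */
            ssize_t n = pread(disk, buf, r.len, (off_t)r.offset);
            write(cli, buf, n > 0 ? (size_t)n : 0);
        } else {                                              /* write: network -> RAM -> disk */
            ssize_t n = read(cli, buf, r.len);
            pwrite(disk, buf, n > 0 ? (size_t)n : 0, (off_t)r.offset);
        }
    }
    free(buf);
    close(cli); close(srv); close(disk);
    return 0;
}

Note the staging buffer in RAM: on reads the data path is disk to RAM to NIC, and the reverse on writes, so every block crosses the memory bus twice on the server. This is exactly the overhead that gmblock's direct disk-to-NIC data path, described below, is designed to remove.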
 
Changed:
<
<
In such environments, the requirement that all nodes have access to a shared storage pool is usually fulfilled by utilizing a high-end Storage Area Network (SAN), traditionally based on Fibre-Channel. An SAN is a networking infrastructure providing high-speed connections between multiple nodes and a number of hard disk enclosures. The disks are treated by the nodes as Direct-attached Storage, i.e. the protocols used are similar to those employed for accessing locally attached disks, such as SCSI over FC.
>
>
The gmblock project encompasses our work on designing and implementing scalable block-level storage sharing over Myrinet, so that shared-disk filesystems may be deployed over a shared-nothing architecture. In this case every cluster node assumes a dual role: it is both a compute and a storage node. This has several distinct advantages:
 
Changed:
<
<
However, this storage architecture entails maintaining two separate networks, one for access to shared storage and a distinct one for Inter-Process Communication (IPC) between peer processes. This increases the cost per node, since the SAN needs to scale to a large number of nodes and each new member of the cluster needs to be equipped with an appropriate interface to access it (e.g. an FC Host Bus Adapter). Moreover, while the number of nodes increases, the aggregate bandwidth to the storage pool remains constant, since it is determined by the number of physical links to the storage enclosures. To address these problems, a shared-disk filesystem is more commonly deployed in such way that only a small fraction of the cluster nodes is physically connected to the SAN (“storage” nodes), exporting the shared disks for block-level access by the remaining nodes, over the cluster interconnection network. In this approach, all nodes can access the parallel filesystem by issuing block read and write requests over the interconnect. The storage nodes receive the requests, pass them to the storage subsystem and eventually return the results of the operations back to the client node.
>
>
  • Cost-effectiveness: No need to equip every cluster node with both a NIC and a FibreChannel HBA. The SAN can be eliminated altogether and resources redirected to acquiring more compute nodes. Instead of having to maintain two distinct networks, the cluster interconnect carries the storage traffic.
  • Scalability: The number of links to storage increases with the number of nodes. Adding a new compute node to the system increases the aggregate I/O bandwidth.
  • Redundancy: Instead of having only a limited number of SAN I/O controllers and links to storage enclosures, data are distributed and possibly replicated across a large number of disks and fetched over multiple links.
 
Changed:
<
<
Taken to extreme, this design approach allows shared-disk filesystems to be deployed over shared-nothing architectures, by having each node contribute part or all of its locally available storage (e.g. a number of directly attached Serial ATA or SCSI disks) to a virtual, shared, block-level storage pool. This model has a number of distinct advantages:
>
>
Previous work has highlighted the impact of high network I/O load on the total execution time of compute-intensive applications, due to memory contention. The gmblock nbd system aims to minimize the impact of remote block I/O operations caused by memory and peripheral bus bandwidth limitations on the server side, by constructing a direct disk-to-NIC data path: when servicing an I/O request, data are moved directly from the Myrinet NIC to the local disk pool, in the case of a write, or from the local disk to the Myrinet NIC, in the case of a read.
 
Changed:
<
<
  • aggregate bandwidth to storage increases as more nodes are added to the system; since more I/O links to disks are added with each node, the performance of the I/O subsystem scales along with the computational capacity of the cluster
  • the total installation cost is drastically reduced, since a dedicated SAN remains small, or is eliminated altogether, allowing resources to be diverted to acquiring more cluster nodes. These nodes have a dual role, both as compute and as storage nodes.
>
>
To build this data path, gmblock combines the OS-bypass, zero-copy networking features provided by Myrinet/GM with Linux's direct I/O layer and custom extensions to its VM mechanism. Thus, gmblock builds on existing OS and user-level networking abstractions, requiring only minimal low-level, architecture-specific code changes. Its server component is implemented in userspace.
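The Linux direct I/O side of this combination is ordinary O_DIRECT usage from userspace; a minimal, self-contained example is sketched below. The device path and request size are placeholders. O_DIRECT requires the buffer, file offset and transfer size to be suitably aligned, which is why the buffer is allocated with posix_memalign().

/* Minimal O_DIRECT read from userspace: the kernel programs the disk
 * to DMA the data straight into the supplied buffer, bypassing the
 * page cache. In gmblock the buffer would be a mapping of Lanai SRAM
 * instead of ordinary RAM; plain RAM is used here for illustration. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdb";        /* placeholder block device */
    size_t len = 64 * 1024;              /* request size, multiple of the block size */
    void *buf;

    if (posix_memalign(&buf, 4096, len)) /* O_DIRECT needs an aligned buffer */
        return 1;

    int fd = open(dev, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = pread(fd, buf, len, 0);  /* data is DMAed directly into buf */
    printf("read %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}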
 
Changed:
<
<
The cornerstone of this design is the network disk sharing layer, usually implemented in a client/server approach. It runs as a server on the storage nodes, receiving requests and passing them transparently to a directly-attached storage medium. It also runs as a client on cluster nodes, exposing a block device interface to the Operating System and the locally executing instance of the parallel filesystem. There are various implementations of such systems, facilitating blocklevel sharing of storage devices over the interconnect. GPFS includes the NSD (Network Shared Disks) layer, which takes care of forwarding block access requests to storage nodes over TCP/IP. Traditionally, the Linux kernel has included the NBD (Network Block Device) driver and Redhat’s GFS can also be deployed over an improved version called GNBD.
>
>
The GM message-passing infrastructure is extended to support the creation and mapping of large message buffers that reside not in RAM but in the SRAM onboard the Myrinet NIC. The GM firmware is enhanced to allow them to be used transparently in message send and receive operations. The buffers are mapped into the server's VM space and Linux's VM subsystem is extended to support “direct I/O” from and to these areas. The net result is that, when the userspace server issues a read() or write() system call, the DMA engines on the storage medium are programmed by the Linux kernel to exchange data with the Myrinet NIC directly over the peripheral bus, without any CPU or main memory involvement, in a completely transparent way [IPDPS 2007]. The concept is applicable to any interconnect that features a programmable NIC and exports on-board memory to the PCI physical address space.
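In terms of the userspace server code, the resulting read path looks roughly like the sketch below. The call that maps a send buffer living in Lanai SRAM into the process address space is the gmblock-specific extension; gmblock_map_sram() and gmblock_send_from_sram() are hypothetical names used here for illustration, not functions of the stock GM API, so the sketch will not link against an unmodified GM installation.

/* Sketch of the gmblock read path: disk -> Lanai SRAM -> wire,
 * with no copy through main memory. gmblock_map_sram() and
 * gmblock_send_from_sram() are hypothetical stand-ins for the
 * gmblock extensions to GM. */
#define _GNU_SOURCE
#include <unistd.h>
#include <stddef.h>
#include <sys/types.h>

void *gmblock_map_sram(size_t len);             /* map a buffer in Lanai SRAM (hypothetical) */
int   gmblock_send_from_sram(void *sram_buf,    /* send that buffer to a client (hypothetical) */
                             size_t len, int node);

int serve_read(int disk_fd, off_t offset, size_t len, int client_node)
{
    /* The mapping points at NIC SRAM, i.e. at PCI memory space,
     * not at RAM pages. */
    void *sram_buf = gmblock_map_sram(len);

    /* disk_fd is assumed to be open with O_DIRECT: the kernel programs
     * the disk controller's DMA engine with the bus address of the SRAM
     * buffer, so data flows disk -> NIC over the peripheral bus. */
    ssize_t n = pread(disk_fd, sram_buf, len, offset);
    if (n < 0)
        return -1;

    /* The data is already on the NIC; the send needs no host DMA. */
    return gmblock_send_from_sram(sram_buf, (size_t)n, client_node);
}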
 
Changed:
<
<
However, all of these implementations are based on TCP/IP. Thus, they treat all modern cluster interconnects uniformly, without any regard to their advanced communication features, such as support for zero-copy message exchange using DMA. Employing a complex protocol stack residing in the kernel results in very good code portability but imposes significant protocol overhead; using TCP/IP-related system calls results in frequent data copying between userspace and kernelspace, increased CPU utilization and high latency. Moreover, this means that less CPU time is made available to the actual computational workload executing on top of the cluster, as its I/O load increases.
>
>
To compensate for the limited amount of memory available on the Myrinet NIC and allow larger requests to make progress without any CPU involvement, gmblock was extended [CAC 2008] to support synchronized send operations. Their semantics allow the storage medium and the Myrinet NIC to coordinate the flow of data in a peer-to-peer manner, pipelining data from the storage medium to the Lanai SRAM and on to the fiber link. A performance evaluation of gmblock showed a significant increase in throughput, reduced pressure on the memory and peripheral buses, and much improved execution times for concurrently executing compute-intensive applications, compared to a TCP/IP-based and a GM-based nbd implementation.
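The effect of synchronized sends can be approximated at the host level by splitting a large request into chunks and overlapping the disk read of one chunk with the network send of the previous one, as in the double-buffered sketch below. This only illustrates the pipelining idea under stated assumptions: gmblock_isend_from_sram() and gmblock_wait_send() are hypothetical, and in the real system the coordination happens between the NIC firmware and the storage device, without the host driving each step.

/* Pipelined variant of the read path: while chunk i is being sent
 * from one Lanai SRAM buffer, chunk i+1 is read from disk into the
 * other. Host-level illustration only; gmblock's synchronized sends
 * perform this coordination in the NIC firmware. */
#include <unistd.h>
#include <stddef.h>
#include <sys/types.h>

#define CHUNK (256 * 1024)  /* chunk size, bounded by available Lanai SRAM */

int  gmblock_isend_from_sram(void *sram_buf, size_t len, int node); /* hypothetical, non-blocking */
void gmblock_wait_send(void);                                       /* hypothetical completion wait */

int serve_read_pipelined(int disk_fd, off_t offset, size_t len,
                         int client_node, void *bufA, void *bufB)
{
    size_t done = 0;

    /* Prime the pipeline: read the first chunk into bufA. */
    ssize_t n = pread(disk_fd, bufA, len < CHUNK ? len : CHUNK, offset);
    if (n <= 0)
        return -1;

    while (n > 0) {
        /* Start sending the chunk just read from bufA... */
        gmblock_isend_from_sram(bufA, (size_t)n, client_node);
        done += (size_t)n;

        /* ...and, while it is in flight, read the next chunk into bufB. */
        size_t next = (len - done) < CHUNK ? (len - done) : CHUNK;
        ssize_t m = next ? pread(disk_fd, bufB, next, offset + (off_t)done) : 0;

        gmblock_wait_send();          /* bufA may now be reused */
        if (m < 0)
            return -1;

        void *t = bufA; bufA = bufB; bufB = t;
        n = m;
    }
    return 0;
}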
 
Changed:
<
<
On the other hand, cluster interconnects such as Myrinet and Infiniband are able to remove the OS from the critical path (OS bypass) by offloading communication protocol processing to embedded microprocessors onboard the NIC and employing DMA engines for direct message exchange from and to userspace buffers. This leads to significant improvements in bandwidth and latency and reduces host CPU utilization dramatically. There are research efforts focusing on implementing block device sharing over such interconnects and exploiting their Remote DMA (RDMA) capabilities. However, the problem still remains that the data follow an unoptimized path. Whenever block data need to be exchanged with a remote node they first need to be transferred from the storage pool to main memory, then from main memory to the interconnect NIC. These unnecessary data transfers impact the computational capacity of a storage node significantly, by aggravating contention on shared resources as is the shared bus to main memory and the peripheral (e.g. PCI) bus. Even with no processor sharing involved, compute-intensive workloads may suffer significant slowdowns since the memory access cost for each processor becomes significantly higher due to contention with I/O on the shared path to memory.
>
>

Publications

 
Changed:
<
<
We present an efficient method for implementing network shared block-level storage over processor and DMA-enabled cluster interconnects, using Myrinet as a case study. Our approach, called gmblock, is based on direct transfers of block data from disk to network and viceversa and does not require any intermediate copies in RAM.

It builds on existing OS and user level networking abstractions, enhancing them to support the necessary functionality, instead of relying on low-level architecture-specific code changes. By employing OS abstractions, such as the Linux VM mechanism and the Linux I/O subsystem to construct the proposed data path, we maintain UNIX-like process semantics (I/O system calls, process isolation and memory protection, enforcement of access rights, etc). Moreover, application code remains portable across different architectures and only minimal code changes are necessary. Thus, gmblock can be a viable solution for implementing shared-disk parallel filesystems on top of distributed storage, and can be integrated in existing parallel I/O infrastructures as an efficient substrate for network block device sharing. The proposed extensions to the Linux VM mechanism and the GM message-passing framework for Myrinet are generic enough to be useful for a variety of other uses, providing an efficient framework for direct data movement between I/O and communication devices, abstracting away architecture- and device-specific details.

We believe this approach is generic enough to be applicable to different types of devices. An example would be a system for direct streaming of data coming from a video grabber to structured, filesystem based storage. Moreover our approach lends itself to resource virtualization: each user may make use of the optimized data path while sharing access to the NIC or other I/O device with other users transparently.

We plan to evaluate the performance of gmblock on higher-end systems, equipped with PCI-X/PCI Express buses and more capable storage devices. We also plan to integrate it as a storage substrate in an existing GPFS/NSD installation, in order to examine its interaction with the parallel filesystem layer.

>
>
 