(2008-03-12, NikosAnastopoulos)

SMT processors

Simultaneous Multithreading (SMT) is a hardware technique that allows a conventional superscalar processor to issue instructions from multiple hardware contexts in a single cycle. It aims to maximize the utilization of processor resources by processing independent operations simultaneously. The key motivation behind this technique is the under-utilization of processor resources observed in many applications, due either to insufficient inherent instruction-level parallelism (ILP) or to long-latency events such as cache misses and branch mispredictions.

Applications exhibiting either kind of behavior can benefit from SMT when they are parallelized into multiple threads, since the idle issue slots of a low-ILP thread or the idle cycles of a stalled thread can be filled with useful instructions from other threads. On the other hand, the large degree of resource sharing in SMT processors (caches, instruction queues, functional units, fetch/decode/retirement units, etc.) may cause significant performance degradation when threads contend for shared resources, for example through cross-thread cache line evictions or competition for the same functional units at the same time.

Our research efforts on SMT have focused on exploring the potential and the limits of performance improvement for single applications executing on Hyper-Threading enabled processors. Hyper-Threading (HT) technology is Intel's two-threaded, low-end approach to SMT. Our published works have examined SMT performance using representative applications from different areas, ranging from highly tuned, compute-bound scientific kernels [ICPP 2006, HPCC 2006] to pointer-intensive and memory-bound applications [SCJ 2007].

In these works, we have investigated two main alternatives for utilizing the multiple hardware contexts of the processor: Thread-Level Parallelization (TLP) and Speculative Precomputation (SPR). In the TLP scheme, sequential codes are parallelized so that the total amount of work is distributed evenly among threads, as in traditional shared-memory multiprocessors. In the SPR scheme, the execution of an application is assisted by additional helper threads, which share the same cache as the computation threads and speculatively prefetch data that the computation threads will use in the near future, thus hiding memory latency and reducing cache misses. SPR targets applications that are not easily parallelizable or that exhibit hard-to-predict access patterns.
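As a rough illustration of the SPR idea, the following C sketch pairs a computation thread with a helper thread that prefetches slightly ahead of it. It is not the implementation used in the published works: the names `spr_sum`, `spr_helper`, `spr_job` and the `LOOKAHEAD` window are hypothetical, the access pattern is a simple indirect gather, and the code assumes GCC or Clang builtins (`__builtin_prefetch`, `__builtin_ia32_pause`) together with POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOOKAHEAD 64  /* hypothetical run-ahead window, in elements */

/* Spin-loop hint; compiles to nothing on non-x86 targets. */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

struct spr_job {
    const double *data;      /* values gathered through idx[] */
    const size_t *idx;       /* irregular access pattern */
    size_t n;
    atomic_size_t progress;  /* elements consumed by the computation thread */
};

/* Helper thread: prefetches data[idx[i]] a bounded distance ahead. */
static void *spr_helper(void *arg) {
    struct spr_job *job = arg;
    size_t i = 0;
    while (i < job->n) {
        size_t done = atomic_load_explicit(&job->progress, memory_order_acquire);
        if (i < done) { i = done; continue; }                  /* fell behind: skip ahead */
        if (i >= done + LOOKAHEAD) { cpu_relax(); continue; }  /* too far ahead: throttle */
        __builtin_prefetch(&job->data[job->idx[i]], 0, 3);     /* read, keep in cache */
        i++;
    }
    return NULL;
}

/* Computation thread: performs the real work and publishes its progress. */
double spr_sum(const double *data, const size_t *idx, size_t n) {
    assert(data != NULL && idx != NULL);
    struct spr_job job = { .data = data, .idx = idx, .n = n };
    atomic_init(&job.progress, 0);

    pthread_t helper;
    int have_helper = (pthread_create(&helper, NULL, spr_helper, &job) == 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += data[idx[i]];
        atomic_store_explicit(&job.progress, i + 1, memory_order_release);
    }
    if (have_helper)
        pthread_join(helper, NULL);
    return sum;
}
```

The helper never changes the result; it only touches cache lines early. The throttling on `progress` is one simple way to keep the helper from running so far ahead that its prefetched lines are evicted before use, and it is exactly this kind of coupling that makes synchronization cost matter for SPR. For the scheme to pay off, both threads would also have to be placed on the two hardware contexts of the same core, which this sketch does not attempt.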

A subtle issue for the implementation and effectiveness of SPR is the synchronization between computation and prefetcher threads. In general, in an "all-shared" execution environment such as that of SMT, inter-thread synchronization is a key factor for multithreaded performance. While simulated SMT models in the literature have proposed hardware extensions to support low-latency, resource-conserving synchronization, HT-enabled processors do not provide similar explicit mechanisms that user-level applications can use directly. As a result, multithreaded applications executing on HT-enabled processors rely either on low-latency but resource-hungry synchronization primitives based on spin loops, or on resource-friendly but high-latency OS-based primitives.
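The two conventional options can be contrasted with a minimal C sketch, assuming POSIX threads and C11 atomics. The `spin_event` and `os_event` types and the `run_demo` driver are hypothetical names introduced only for illustration. The spinning waiter reacts quickly but keeps issuing instructions on the shared core, whereas the blocking waiter releases the hardware context at the cost of a trip through the OS.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Spin-loop hint; compiles to nothing on non-x86 targets. */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

/* Variant 1: spin-based wait. Low latency, but occupies shared resources. */
struct spin_event { atomic_bool set; };

static void spin_event_wait(struct spin_event *e) {
    while (!atomic_load_explicit(&e->set, memory_order_acquire))
        cpu_relax();
}

static void spin_event_signal(struct spin_event *e) {
    atomic_store_explicit(&e->set, true, memory_order_release);
}

/* Variant 2: OS-based wait. The waiter sleeps, but wake-up is slower. */
struct os_event { pthread_mutex_t lock; pthread_cond_t cond; bool set; };

static void os_event_wait(struct os_event *e) {
    pthread_mutex_lock(&e->lock);
    while (!e->set)
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
}

static void os_event_signal(struct os_event *e) {
    pthread_mutex_lock(&e->lock);
    e->set = true;
    pthread_cond_signal(&e->cond);
    pthread_mutex_unlock(&e->lock);
}

struct demo {
    struct spin_event spin;
    struct os_event os;
    int a, b;               /* payloads published before each signal */
    int seen_a, seen_b;     /* what the waiter observed after each wait */
};

static void *waiter(void *arg) {
    struct demo *d = arg;
    spin_event_wait(&d->spin);
    d->seen_a = d->a;
    os_event_wait(&d->os);
    d->seen_b = d->b;
    return NULL;
}

/* Returns 1 if the waiter saw both payloads after the matching signals. */
int run_demo(void) {
    struct demo d = { .a = 0, .b = 0, .seen_a = -1, .seen_b = -1 };
    atomic_init(&d.spin.set, false);
    pthread_mutex_init(&d.os.lock, NULL);
    pthread_cond_init(&d.os.cond, NULL);
    d.os.set = false;

    pthread_t t;
    int rc = pthread_create(&t, NULL, waiter, &d);
    assert(rc == 0);
    (void)rc;

    d.a = 1;
    spin_event_signal(&d.spin);
    d.b = 2;
    os_event_signal(&d.os);
    pthread_join(t, NULL);

    pthread_cond_destroy(&d.os.cond);
    pthread_mutex_destroy(&d.os.lock);
    return d.seen_a == 1 && d.seen_b == 2;
}
```

Both variants deliver the same guarantee (the waiter sees data written before the signal); they differ only in what the waiting thread does to the shared core in the meantime, which is the trade-off described above.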

To balance the conflicting requirements of high responsiveness and low resource consumption, we have proposed the use of the MONITOR/MWAIT instructions for synchronizing threads that execute on an HT-enabled processor [MTAAP 2008]. These instructions implement a condition-wait as close as possible to the hardware level, avoiding excessive resource waste while enabling fast notification and resumption of threads that wait on synchronization events. Since these instructions are privileged, we have presented a framework through which one can use them to build condition-wait and notification primitives with minimal kernel involvement. Using this framework, we have also demonstrated the implementation of synchronization barriers, which we evaluated both with artificial micro-benchmarks and in the context of SPR.
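To show how a barrier can be layered on a pair of condition-wait and notification primitives, here is a sense-reversing barrier sketch in C. It is an assumption-laden illustration, not the framework from [MTAAP 2008]: because MONITOR/MWAIT are privileged, the hypothetical `wait_until_equals` below simply spins in user space as a stand-in for whatever wait primitive is available, and `notify`, `barrier_wait` and `barrier_demo` are likewise names invented for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 8  /* arbitrary limit for the demo driver */

/* Spin-loop hint; compiles to nothing on non-x86 targets. */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

/* Stand-in wait primitive (ASSUMPTION: plain user-level spinning). */
static void wait_until_equals(atomic_int *addr, int value) {
    while (atomic_load_explicit(addr, memory_order_acquire) != value)
        cpu_relax();
}

/* Stand-in notification primitive: a store to the watched location. */
static void notify(atomic_int *addr, int value) {
    atomic_store_explicit(addr, value, memory_order_release);
}

struct barrier {
    int n_threads;
    atomic_int arrived;  /* threads that reached the current episode */
    atomic_int sense;    /* flips once per barrier episode */
};

void barrier_init(struct barrier *b, int n_threads) {
    assert(n_threads > 0);
    b->n_threads = n_threads;
    atomic_init(&b->arrived, 0);
    atomic_init(&b->sense, 0);
}

/* Each thread keeps its own local_sense, initialized to 0. */
void barrier_wait(struct barrier *b, int *local_sense) {
    *local_sense = !*local_sense;
    int prev = atomic_fetch_add_explicit(&b->arrived, 1, memory_order_acq_rel);
    if (prev == b->n_threads - 1) {
        /* Last arrival: reset the count, then release everyone. */
        atomic_store_explicit(&b->arrived, 0, memory_order_relaxed);
        notify(&b->sense, *local_sense);
    } else {
        wait_until_equals(&b->sense, *local_sense);
    }
}

struct worker_ctx {
    struct barrier *b;
    atomic_int *counter;
    int n_threads, rounds, failures;
};

static void *worker(void *arg) {
    struct worker_ctx *c = arg;
    int sense = 0;
    for (int r = 0; r < c->rounds; r++) {
        atomic_fetch_add(c->counter, 1);
        barrier_wait(c->b, &sense);   /* all increments of round r are done */
        if (atomic_load(c->counter) != c->n_threads * (r + 1))
            c->failures++;
        barrier_wait(c->b, &sense);   /* all checks done before the next round */
    }
    return NULL;
}

/* Runs the workers and returns how many consistency checks failed. */
int barrier_demo(int n_threads, int rounds) {
    assert(n_threads > 0 && n_threads <= MAX_THREADS);
    struct barrier b;
    atomic_int counter;
    barrier_init(&b, n_threads);
    atomic_init(&counter, 0);

    pthread_t tid[MAX_THREADS];
    struct worker_ctx ctx[MAX_THREADS];
    for (int i = 0; i < n_threads; i++) {
        ctx[i] = (struct worker_ctx){ &b, &counter, n_threads, rounds, 0 };
        int rc = pthread_create(&tid[i], NULL, worker, &ctx[i]);
        assert(rc == 0);
        (void)rc;
    }
    int failures = 0;
    for (int i = 0; i < n_threads; i++) {
        pthread_join(tid[i], NULL);
        failures += ctx[i].failures;
    }
    return failures;
}
```

The barrier logic only ever calls `wait_until_equals` and `notify`, so swapping the spinning stand-in for a lower-footprint wait would not change `barrier_wait` itself. That separation is the design point the sketch is meant to convey.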

Publications

Nikos Anastopoulos and Nectarios Koziris, "Facilitating Efficient Synchronization of Asymmetric Threads on Hyper-Threaded Processors," In Proceedings of the 2nd Workshop on Multithreaded Architectures and Applications (MTAAP 2008).

Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis and Nectarios Koziris, "Exploring the Capacity of a Modern SMT Architecture to Deliver High Scientific Application Performance," In Proceedings of the 2006 International Conference on High Performance Computing and Communications (HPCC 2006).

Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis and Nectarios Koziris, "Exploring the Performance Limits of Simultaneous Multithreading for Scientific Codes," In Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006).

Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis and Nectarios Koziris, "Exploring the Performance Limits of Simultaneous Multithreading for Memory Intensive Applications," In The Journal of Supercomputing, Volume 44, Issue 1 (April 2008).

Evangelia Athanasaki, Kornilios Kourtis, Nikos Anastopoulos and Nectarios Koziris, "Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT," In Proceedings of the 10th Panhellenic Conference on Informatics (PCI 2005).