SMT processors

Simultaneous Multithreading (SMT) is a hardware technique that allows a conventional superscalar processor to issue instructions from multiple hardware contexts in a single cycle. It aims to maximize the utilization of processor resources by processing independent operations simultaneously. The key motivation behind this technique is the under-utilization of processor resources observed in many applications, due either to insufficient inherent instruction-level parallelism (ILP) or to long-latency operations, such as cache misses and branch mispredictions.

Applications exhibiting either kind of behavior can benefit from SMT when they are parallelized into multiple threads, since the idle issue slots of a low-ILP thread or the idle cycles of a stalled thread can be filled with useful instructions from other threads. On the other hand, the large degree of resource sharing in SMT processors (caches, instruction queues, functional units, fetch/decode/retirement units, etc.) may cause significant slowdowns when threads contend for shared resources, e.g. through cross-thread cache line evictions or competition for the same functional units at the same time.

Our research efforts on SMT have focused on exploring the potential and the limits of performance improvement for single applications executing on Hyper-Threading-enabled processors. Hyper-Threading (HT) technology is Intel's two-threaded, low-end approach to SMT. Our published works have examined SMT performance using representative applications from different areas, ranging from highly tuned, compute-bound scientific kernels [ICPP 2006, HPCC 2006] to pointer-intensive and memory-bound applications [SCJ 2007].

In these works, we have investigated two main alternatives for utilizing the multiple hardware contexts of the processor: Thread-Level Parallelization (TLP) and Speculative Precomputation (SPR). In the TLP scheme, sequential codes are parallelized so that the total amount of work is distributed evenly among threads for execution, as in traditional shared-memory multiprocessors. In the SPR scheme, the execution of an application is assisted by additional helper threads, which share the same cache with the computation threads and speculatively prefetch data that the computation threads are going to use in the near future, thus hiding memory latency and reducing cache misses. SPR targets performance improvement of applications that are not easily parallelizable or that exhibit hard-to-predict access patterns.
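As a rough illustration of the difference between the two schemes, the following Rust sketch contrasts them on a simple summation. It is not taken from the published works: the function names `tlp_sum` and `spr_sum`, the `window` parameter that limits how far the helper may run ahead, and the use of a plain load as a stand-in for a prefetch instruction are all assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// TLP sketch: split the work into roughly equal chunks, one per thread,
/// and combine the partial results.
fn tlp_sum(data: &[u64], threads: usize) -> u64 {
    let threads = threads.max(1);
    let chunk = ((data.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}

/// SPR sketch: the computation thread performs an indirect traversal
/// (data[index[i]]), while a helper thread runs at most `window` elements
/// ahead and touches the same elements so that they are likely to be in
/// the shared cache when the computation thread reaches them.
/// Assumes every entry of `index` is a valid position in `data`.
fn spr_sum(data: &[u64], index: &[usize], window: usize) -> u64 {
    // Position of the computation thread, published for the helper.
    let progress = AtomicUsize::new(0);
    thread::scope(|s| {
        // Helper thread: the loaded values are discarded, only the loads matter.
        s.spawn(|| {
            let mut i = 0;
            while i < index.len() {
                let p = progress.load(Ordering::Relaxed);
                if p >= index.len() {
                    break; // computation thread has finished
                }
                if i < p {
                    i = p; // fell behind: skip to the current position
                }
                if i < p + window {
                    // A plain load stands in for a prefetch instruction.
                    std::hint::black_box(data[index[i]]);
                    i += 1;
                } else {
                    std::hint::spin_loop(); // far enough ahead: wait
                }
            }
        });
        // Computation thread: does all of the real work.
        let mut sum = 0u64;
        for (i, &j) in index.iter().enumerate() {
            sum = sum.wrapping_add(data[j]);
            progress.store(i + 1, Ordering::Relaxed);
        }
        sum
    })
}
```

Both functions return the same result for the same input; for instance, `tlp_sum(&data, 2)` and `spr_sum(&data, &index, 16)` agree whenever `index` is a permutation of the positions of `data`. Note that in `spr_sum` the helper contributes nothing to the result, so its only possible benefit is a warmer cache, and publishing `progress` on every iteration is itself a source of cross-thread traffic that a tuned version would presumably reduce.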

A subtle issue for the implementation and effectiveness of SPR is the synchronization between computation and prefetcher threads. In general, in an "all-shared" execution environment such as that of SMT, inter-thread synchronization is a key factor for multithreaded performance. While simulated SMT models in the literature have proposed hardware extensions to support low-latency, resource-conserving synchronization, HT-enabled processors do not provide similar explicit mechanisms that user-level applications can use directly. As a result, multithreaded applications executing on HT-enabled processors rely either on low-latency but resource-hungry synchronization primitives based on spin loops, or on resource-friendly but high-latency OS-based primitives.
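The two families of primitives can be contrasted with a minimal Rust sketch in which one thread hands a value to another. The function names `spin_handoff` and `blocking_handoff` are hypothetical, and the code only illustrates the general trade-off under these assumptions; it is not the synchronization code evaluated in the cited works.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;

/// Spin-loop variant: the waiter reacts quickly, but it keeps issuing
/// instructions that compete with the sibling hardware thread.
fn spin_handoff(value: u32) -> u32 {
    let payload = AtomicU32::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        let waiter = s.spawn(|| {
            while !ready.load(Ordering::Acquire) {
                // Hint to the processor that this is a busy-wait loop
                // (the PAUSE instruction on x86).
                std::hint::spin_loop();
            }
            payload.load(Ordering::Relaxed)
        });
        payload.store(value, Ordering::Relaxed);
        ready.store(true, Ordering::Release); // publish the payload
        waiter.join().unwrap()
    })
}

/// OS-based variant: the waiter sleeps and frees processor resources,
/// but waking it up goes through the operating system.
fn blocking_handoff(value: u32) -> u32 {
    let slot: Mutex<Option<u32>> = Mutex::new(None);
    let cv = Condvar::new();
    thread::scope(|s| {
        let waiter = s.spawn(|| {
            let mut guard = slot.lock().unwrap();
            loop {
                if let Some(v) = *guard {
                    return v;
                }
                // Releases the lock while sleeping, reacquires on wake-up.
                guard = cv.wait(guard).unwrap();
            }
        });
        *slot.lock().unwrap() = Some(value);
        cv.notify_one();
        waiter.join().unwrap()
    })
}
```

Both `spin_handoff(7)` and `blocking_handoff(7)` return the value passed in; what differs is what the waiting thread does in the meantime, which is exactly the cost that matters when the two threads share one physical core.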

In order to best balance the conflicting requirements of high responsiveness and low resource consumption, we have proposed the use of the MONITOR/MWAIT instructions for the synchronization of threads executing on an HT-enabled processor [MTAAP 2008]. These instructions implement a condition-wait as close as possible to the hardware level, avoiding excessive resource waste and enabling fast notification and resumption of threads that wait on synchronization events. Since these instructions are privileged, we have presented a framework through which one can use them to build condition-wait and notification primitives with minimal kernel involvement. Using this framework, we have also demonstrated the implementation of synchronization barriers, which we evaluated in the context of artificial micro-benchmarks as well as SPR.
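To make the idea of a barrier built on condition-wait and notification primitives more concrete, here is a minimal Rust sketch. It is an assumption-laden illustration, not the framework from [MTAAP 2008]: because MONITOR/MWAIT are privileged and cannot be issued from user code, the hypothetical helper `wait_while_equal` simply spins where a kernel-assisted primitive could instead arm MONITOR on the watched word and sleep in MWAIT until another thread writes to it. The names `FlagBarrier` and `run_phases` are likewise invented for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for a condition-wait on a memory word:
/// return once `word` no longer holds `old`. This version only spins.
fn wait_while_equal(word: &AtomicUsize, old: usize) {
    while word.load(Ordering::Acquire) == old {
        std::hint::spin_loop();
    }
}

/// Reusable barrier for `n` threads. Notification is a single store to
/// `generation`, the word that all waiting threads watch.
struct FlagBarrier {
    n: usize,
    arrived: AtomicUsize,
    generation: AtomicUsize,
}

impl FlagBarrier {
    fn new(n: usize) -> Self {
        FlagBarrier {
            n,
            arrived: AtomicUsize::new(0),
            generation: AtomicUsize::new(0),
        }
    }

    fn wait(&self) {
        let epoch = self.generation.load(Ordering::Acquire);
        if self.arrived.fetch_add(1, Ordering::AcqRel) + 1 == self.n {
            // Last arrival: reset the counter, then release the waiters.
            self.arrived.store(0, Ordering::Relaxed);
            self.generation.store(epoch.wrapping_add(1), Ordering::Release);
        } else {
            wait_while_equal(&self.generation, epoch);
        }
    }
}

/// Runs `threads` threads through `phases` barrier episodes and returns
/// how many times a thread left the barrier before all threads arrived.
fn run_phases(threads: usize, phases: usize) -> usize {
    let barrier = FlagBarrier::new(threads);
    let counter = AtomicUsize::new(0);
    let failures = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for phase in 0..phases {
                    counter.fetch_add(1, Ordering::Relaxed);
                    barrier.wait();
                    // After the barrier, every thread has counted this phase.
                    if counter.load(Ordering::Relaxed) < (phase + 1) * threads {
                        failures.fetch_add(1, Ordering::Relaxed);
                    }
                }
            });
        }
    });
    failures.load(Ordering::Relaxed)
}
```

A call such as `run_phases(2, 20)` should return 0, meaning no thread ever passed the barrier early. Keeping the wait behind one small function is a deliberate choice in this sketch: the barrier logic stays the same whether the wait is implemented by spinning, by an OS primitive, or by a hardware-level condition-wait.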


Topic revision: r1 - 2008-03-12 - NikosAnastopoulos
