SMT processors
Simultaneous Multithreading (SMT) is a hardware technique that allows a
conventional superscalar processor to issue instructions from multiple hardware
contexts in a single cycle. It aims to maximize the utilization of processor
resources by simultaneously processing independent operations. The key
motivation behind this technique is the under-utilization of processor resources
observed in many applications, either due to insufficient inherent
instruction-level parallelism (ILP) or due to long-latency operations, such as
cache misses and branch mispredictions.
Applications exhibiting either kind of behavior can benefit from SMT when they
are parallelized into multiple threads, since idle issue slots of a low-ILP
thread or multiple idle cycles of a stalled thread can be overlapped with useful
instructions from other threads. On the other hand, the large degree of resource
sharing in SMT processors (caches, instruction queues, functional units,
fetch/decode/retirement units, etc.) may lead to significant performance
penalties when threads contend for shared resources, e.g. through cross-thread
cache line evictions or competition for the same functional units at the same time.
Our research efforts on SMT have focused on exploring the potential and limits
for performance improvement of single applications when they execute on
Hyper-Threading enabled processors. Hyper-Threading (HT) technology is Intel's
two-threaded, low-end approach to SMT. Our published work has examined SMT
performance using representative applications from different areas, ranging from
highly-tuned, compute-bound scientific kernels [ICPP 2006, HPCC 2006] to
pointer-intensive and memory-bound applications [SCJ 2007].
In these publications, we have investigated two main alternatives for utilizing
the multiple hardware contexts of the processor: Thread-Level Parallelization (TLP)
and Speculative Precomputation (SPR). In the TLP scheme, sequential codes are
parallelized so that the total amount of work is distributed evenly among threads
for execution, as in traditional shared-memory multiprocessors. In the SPR scheme,
the execution of an application is facilitated by additional helper threads, which
share the cache with the computation threads and speculatively prefetch data that
the computation threads are going to use in the near future, thus hiding memory
latency and reducing cache misses. SPR targets performance improvement of
applications that are not easily parallelizable or exhibit hard-to-predict
access patterns.
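The SPR scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the cited publications: the names `spr_job`, `prefetcher`, `spr_indirect_sum` and the window size `LOOKAHEAD` are hypothetical, the indirect-sum kernel is only a stand-in for a memory-bound computation, and `__builtin_prefetch` is a GCC/Clang builtin that is assumed to be available. The helper thread runs ahead of the computation thread by a bounded distance and touches the data that will be needed next.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define LOOKAHEAD 64  /* hypothetical: how far the helper may run ahead */

typedef struct {
    const double *data;      /* values accessed indirectly */
    const size_t *idx;       /* irregular access pattern */
    size_t n;
    atomic_size_t progress;  /* next element the computation thread will use */
} spr_job;

/* Helper thread: prefetch data[idx[i]] shortly before it is needed. */
static void *prefetcher(void *arg) {
    spr_job *job = arg;
    size_t i = 0;
    while (i < job->n) {
        size_t done = atomic_load_explicit(&job->progress, memory_order_relaxed);
        if (i < done) { i = done; continue; }   /* fell behind: catch up */
        if (i > done + LOOKAHEAD) continue;     /* too far ahead: spin */
        __builtin_prefetch(&job->data[job->idx[i]], 0, 1);
        i++;
    }
    return NULL;
}

/* Computation thread: sums data[idx[i]] while the helper warms the cache. */
double spr_indirect_sum(const double *data, const size_t *idx, size_t n) {
    spr_job job = { data, idx, n };
    atomic_init(&job.progress, 0);

    pthread_t helper;
    int have_helper = pthread_create(&helper, NULL, prefetcher, &job) == 0;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += data[idx[i]];
        atomic_store_explicit(&job.progress, i + 1, memory_order_relaxed);
    }
    if (have_helper)
        pthread_join(helper, NULL);
    return sum;
}
```

The helper only affects timing, never the result, so the computation stays correct even if the helper thread cannot be created. For the prefetches to land in a cache the computation thread actually uses, both threads would have to be pinned to the two hardware contexts of the same physical core, which this sketch leaves out.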
A subtle issue for the implementation and effectiveness of SPR is the
synchronization between computation and prefetcher threads. In general, in an
"all-shared" execution environment such as that of SMT, inter-thread
synchronization is a key factor for multithreaded performance. While simulated
SMT models in the literature have proposed hardware extensions to support
low-latency, resource-conserving synchronization, HT-enabled processors do not
provide similar explicit mechanisms to be used directly by user-level
applications. As a result, multithreaded applications executing on HT-enabled
processors rely either on low-latency, resource-hungry spin-loop-based
synchronization primitives, or on high-latency, resource-friendly OS-based
primitives.
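To make the two options concrete, the following sketch contrasts a spin-loop wait with an OS-based wait on a one-shot event. The type and function names (`spin_event`, `os_event`, `cpu_relax`, and so on) are hypothetical, and `__builtin_ia32_pause` is a GCC/Clang builtin assumed to be available on x86 targets; on other targets the spin loop simply runs without a pause hint.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hint to the core that this is a spin-wait loop (PAUSE on x86). */
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#endif
}

/* Option 1: spin loop. Reacts quickly, but the waiting context keeps
 * issuing instructions and competes with its sibling for shared resources. */
typedef struct { atomic_int ready; } spin_event;

void spin_event_wait(spin_event *e) {
    while (!atomic_load_explicit(&e->ready, memory_order_acquire))
        cpu_relax();
}

void spin_event_signal(spin_event *e) {
    atomic_store_explicit(&e->ready, 1, memory_order_release);
}

/* Option 2: OS-based wait. The waiter is descheduled and frees the
 * shared resources, but sleeping and waking go through the kernel. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int ready;
} os_event;

void os_event_init(os_event *e) {
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->cond, NULL);
    e->ready = 0;
}

void os_event_wait(os_event *e) {
    pthread_mutex_lock(&e->lock);
    while (!e->ready)                 /* guards against spurious wakeups */
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
}

void os_event_signal(os_event *e) {
    pthread_mutex_lock(&e->lock);
    e->ready = 1;
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->lock);
}
```

Both variants expose the same wait/signal interface, so a benchmark could swap one for the other and measure the difference in wake-up latency and in the slowdown of the sibling thread.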
In order to best balance the conflicting requirements for high responsiveness
and low resource consumption, we have proposed the use of the MONITOR/MWAIT
instructions for synchronization of threads executing on an HT-enabled processor
[MTAAP 2008].
These instructions implement a condition-wait as close as possible
to the hardware level, preventing excessive resource waste and enabling
fast notification and resumption of threads that wait on synchronization events.
Since these instructions are privileged, we have presented a framework through
which one can use them to build condition-wait and notification primitives with
minimal kernel involvement. Using this framework, we have also demonstrated the
implementation of synchronization barriers, which we evaluated in the context of
artificial micro-benchmarks as well as SPR.
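As an illustration of where such a condition-wait primitive fits into a barrier, here is a minimal sense-reversing barrier in C11. It is not the framework from the cited paper: the names `sense_barrier` and `wait_for_change` are hypothetical, and because MONITOR/MWAIT cannot be executed from user level, the wait step is written as a plain spin loop that merely marks the place where a MONITOR/MWAIT-backed wait on the `sense` word would be substituted.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;  /* threads still expected in the current episode */
    atomic_int sense;  /* flips every time the barrier opens */
    int nthreads;
} sense_barrier;

void sense_barrier_init(sense_barrier *b, int nthreads) {
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, 0);
    b->nthreads = nthreads;
}

/* Stand-in for the condition-wait: block until *word differs from old.
 * Assumption: a MONITOR/MWAIT-based primitive would arm a monitor on
 * `word` and sleep here instead of spinning. */
static void wait_for_change(atomic_int *word, int old) {
    while (atomic_load_explicit(word, memory_order_acquire) == old)
        ;
}

void sense_barrier_wait(sense_barrier *b) {
    int old = atomic_load_explicit(&b->sense, memory_order_relaxed);
    if (atomic_fetch_sub_explicit(&b->count, 1, memory_order_acq_rel) == 1) {
        /* Last arrival: reset the counter, then release the waiters
         * with a single store to the monitored word. */
        atomic_store_explicit(&b->count, b->nthreads, memory_order_relaxed);
        atomic_store_explicit(&b->sense, !old, memory_order_release);
    } else {
        wait_for_change(&b->sense, old);
    }
}
```

The notification is a single store to one memory word, which is the kind of event a hardware monitor on that address could observe; this is why a sense-reversing design is a natural fit for the sketch.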
Publications