Past Events

7th Computing Systems Research Day - 9 January 2024

Schedule

  • 11:45-12:00 | Welcome

  • Abstract

    Cloud systems are experiencing significant shifts both in their hardware, with an increased adoption of heterogeneity, and their software, with the prevalence of microservices and serverless frameworks. These trends require fundamentally rethinking how the cloud system stack should be designed. In this talk, I will briefly describe the challenges these hardware and software trends introduce, and discuss under what conditions hardware acceleration can be beneficial to these new application classes, as well as how applying machine learning (ML) to systems problems can improve the cloud’s performance, efficiency, and ease of use. I will first present Sage, a performance debugging system that leverages ML to identify and resolve the root causes of performance issues in cloud microservices. I will then discuss Ursa, an analytically-driven cluster manager for microservices that addresses some of the shortcomings of applying ML to large-scale systems problems.

    Bio

    Christina Delimitrou is an Associate Professor at MIT, where she works on computer architecture and computer systems. She focuses on improving the performance, predictability, and resource efficiency of large-scale cloud infrastructures by revisiting the way they are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Fellowship, two Google Faculty Research Awards, and a Facebook Faculty Research Award. Her work has also received 5 IEEE Micro Top Picks awards and several best paper awards. Before joining MIT, Christina was an Assistant Professor at Cornell University, and received her PhD from Stanford University. She had previously earned an MS, also from Stanford, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information can be found at: http://people.csail.mit.edu/delimitrou/

  • 13:00-13:45 | Lunch Break

  • Abstract

    Datacenters have witnessed a staggering evolution in networking technologies, driven by insatiable application demands for larger datasets and inter-server data transfers. Modern NICs can already handle 100s of Gbps of traffic, a bandwidth capability equivalent to several memory channels. Direct Cache Access mechanisms like DDIO that contain network traffic inside the CPU’s caches are therefore essential to effectively handle growing network traffic rates. However, at high rates, a large fraction of network traffic “leaks” from the CPU’s caches to memory, a problem often referred to as “leaky DMA”, significantly capping the network bandwidth a server can effectively utilize. This talk will present an analysis of network data leaks in the era of high-speed networking and our insights around the interactions between network buffers and the cache and memory hierarchy. We will present Sweeper, our proposed hardware extension and API that allows applications to efficiently manage the coherence state of network buffers in the cache-memory hierarchy, drastically reducing memory bandwidth consumption and boosting a server’s peak sustainable network bandwidth by up to 2.6×.

    Bio

    Marina is a 5th year PhD student in the School of Computer Science at Georgia Tech, advised by Assistant Professor Alexandros Daglis. Her research focuses on designing new interfaces between hardware, network stacks, and applications to unlock the performance potential of emerging datacenter technologies. Her work on hardware-software co-design for emerging network and memory technologies has been published at MICRO 2021 and MICRO 2022.
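
    The abstract above centers on an API that lets applications manage the coherence state of their network buffers. The Python sketch below is only a conceptual model of how such an interface might be used (the BufferPool class and the sweep hint are invented names, not Sweeper's actual API): buffers DMA'd into the cache are explicitly released once consumed, so their contents never need to be written back to memory.

      # Conceptual sketch (not Sweeper's real API): an application-managed pool of
      # network buffers whose cache residency is released as soon as the payload
      # has been consumed, so DMA'd data never has to leak back to DRAM.

      class BufferPool:
          """Models a fixed set of RX buffers that DDIO-like DMA places in the LLC."""

          def __init__(self, num_buffers: int):
              self.in_llc = [False] * num_buffers   # which buffers currently occupy cache

          def dma_receive(self, buf_id: int, payload: bytes) -> bytes:
              self.in_llc[buf_id] = True            # NIC DMA installs the data in the LLC
              return payload

          def sweep(self, buf_id: int) -> None:
              # Hypothetical "sweep" hint: the buffer's contents are dead, so its
              # cache lines can be dropped without a write-back to memory.
              self.in_llc[buf_id] = False

          def llc_footprint(self) -> int:
              return sum(self.in_llc)

      if __name__ == "__main__":
          pool = BufferPool(num_buffers=4)
          for i in range(4):
              data = pool.dma_receive(i, payload=b"packet")
              _ = len(data)                         # application consumes the payload
              pool.sweep(i)                         # ...then releases its cache footprint
          print("buffers still occupying the LLC:", pool.llc_footprint())  # -> 0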

  • Abstract

    Distributed transaction processing is a fundamental building block for large-scale data management in the cloud. Given the threats of security violations in untrusted cloud environments, our work asks: how can we design a distributed transactional KV store that achieves high-performance serializable transactions while providing strong security properties? We introduce TREATY, a secure distributed transactional KV storage system that supports serializable ACID transactions while guaranteeing strong security properties: confidentiality, integrity, and freshness. TREATY leverages trusted execution environments (TEEs) to bootstrap its security properties, but it extends the trust provided by the limited enclave (volatile) memory region within a single node to build a secure (stateful) distributed transactional KV store over the untrusted storage, network and machines. To achieve this, TREATY embodies a secure two-phase commit protocol co-designed with a high-performance network library for TEEs. Further, TREATY ensures secure and crash-consistent persistency of committed transactions using a stabilization protocol. Our evaluation on a real hardware testbed based on the YCSB and TPC-C benchmarks shows that TREATY incurs reasonable overheads, while achieving strong security properties.

    Bio

    Dimitra Giantsidi is a final-year PhD student at the University of Edinburgh (UoE), a member of the Institute for Computing Systems Architecture (ICSA) and the Chair of Distributed and Operating Systems, advised by Prof. Pramod Bhatotia. Her research lies in the field of dependability in distributed systems, with a focus on fault tolerance and security. Exploring the applications of modern hardware, such as Trusted Execution Environments and Direct I/O for networking and storage, her work aims to increase the security and performance of widely adopted distributed systems. Before joining ICSA, Dimitra graduated from the School of Electrical and Computer Engineering, NTUA, Greece.
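
    TREATY's core mechanism is a two-phase commit protocol run between TEE-protected nodes. The Python sketch below shows only a plain, unsecured 2PC skeleton (the class and function names are made up for illustration); the TEE attestation, message protection, and the stabilization protocol described in the abstract are omitted.

      # Plain two-phase commit skeleton (illustration only): TREATY additionally runs
      # each participant inside a TEE and protects every message for confidentiality,
      # integrity, and freshness, which is not modeled here.

      class Participant:
          def __init__(self, name: str):
              self.name = name
              self.store = {}
              self.staged = None

          def prepare(self, writes: dict) -> bool:
              # Phase 1: validate and stage the writes, then vote commit/abort.
              self.staged = dict(writes)
              return True

          def commit(self) -> None:
              # Phase 2: make the staged writes durable and visible.
              self.store.update(self.staged or {})
              self.staged = None

          def abort(self) -> None:
              self.staged = None

      def two_phase_commit(participants: list, writes: dict) -> bool:
          votes = [p.prepare(writes) for p in participants]    # phase 1: prepare
          if all(votes):
              for p in participants:                           # phase 2: commit
                  p.commit()
              return True
          for p in participants:                               # any "no" vote aborts
              p.abort()
          return False

      if __name__ == "__main__":
          nodes = [Participant("kv-node-1"), Participant("kv-node-2")]
          ok = two_phase_commit(nodes, {"key": "value"})
          print("committed:", ok, "| node 1 sees:", nodes[0].store)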

  • Abstract

    GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN) applications. However, DNN applications often underutilize GPUs, even when using large batch sizes and eliminating input data processing or communication stalls. DNN workloads consist of data-dependent operators, with different compute and memory requirements. While an operator may saturate GPU compute units or memory bandwidth, it often leaves other GPU resources idle. Despite the prevalence of GPU sharing techniques, current approaches are not sufficiently fine-grained or interference-aware to maximize GPU utilization while minimizing interference at the granularity of 10s of 𝜇s. We present Orion, a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU. Orion schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator’s compute and memory requirements. We integrate Orion in PyTorch and demonstrate its benefits in various DNN workload collocation use cases.

    Bio

    Foteini Strati is a 3rd year PhD student at the Systems Group of ETH Zurich, working on systems for Machine Learning. She is interested in increasing resource utilization and fault tolerance of Machine Learning workloads. She has obtained a MSc degree in Computer Science from ETH Zurich, and a Diploma in Electrical and Computer Engineering from NTUA.
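
    Orion's scheduling decision hinges on knowing whether each intercepted operator is compute- or memory-bound and co-locating kernels only when their bottlenecks differ. The rough Python sketch below illustrates that decision rule; the operator profiles and the simple comparison are assumptions for illustration, not Orion's actual policy.

      # Toy interference-aware co-scheduling rule: dispatch a best-effort kernel next
      # to a high-priority kernel only if the two stress different GPU resources.
      from dataclasses import dataclass

      @dataclass
      class OpProfile:
          name: str
          compute_util: float   # fraction of SM throughput the operator needs (0..1)
          mem_bw_util: float    # fraction of memory bandwidth it needs (0..1)

      def bottleneck(op: OpProfile) -> str:
          return "compute" if op.compute_util >= op.mem_bw_util else "memory"

      def can_colocate(high_priority: OpProfile, best_effort: OpProfile) -> bool:
          # Different bottlenecks -> the best-effort kernel mostly uses idle resources.
          return bottleneck(high_priority) != bottleneck(best_effort)

      if __name__ == "__main__":
          gemm = OpProfile("gemm", compute_util=0.9, mem_bw_util=0.3)    # compute-bound
          embedding = OpProfile("embedding_lookup", 0.1, 0.8)            # memory-bound
          softmax = OpProfile("softmax", 0.2, 0.7)                       # memory-bound
          print(can_colocate(gemm, embedding))    # True: complementary bottlenecks
          print(can_colocate(embedding, softmax)) # False: both stress memory bandwidth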

  • Abstract

    Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. We demonstrate that the current multi-GPU BLAS libraries target very specific problems and data characteristics, resulting in serious performance degradation for any slightly deviating workload, and do not take energy efficiency into account at all. To address these issues we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into the PARALiA framework coupled with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA in an HPC testbed with 8 NVIDIA V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state-of-the-art on a large and diverse dataset and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.

    Bio

    I am a final-year PhD candidate at CSLab, NTUA, having graduated from its Electrical and Computer Engineering (ECE) department with an integrated master's degree specializing in computer engineering. My PhD explores the optimization of linear algebra routines on multi-GPU clusters with model-based autotuning, and my research interests include accelerators, parallel processing, HPC, and performance engineering.
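
    The model-based autotuning described above boils down to predicting, for each candidate configuration, the execution time and energy of a routine and selecting the best one at runtime. The toy Python sketch below shows such a selection step; the roofline-style cost model and all constants are made up for illustration and are not PARALiA's actual model.

      # Toy model-based autotuning: estimate time and energy for each candidate
      # GPU count and pick the configuration optimizing a chosen objective.
      # The simplistic cost model below is illustrative, not PARALiA's model.

      def predict_gemm_time(n: int, num_gpus: int,
                            flops_per_gpu=7e12, link_bw=12e9) -> float:
          compute = 2 * n**3 / (num_gpus * flops_per_gpu)      # seconds of compute
          transfer = 3 * (n**2) * 8 / link_bw                  # move A, B, C once (FP64)
          return compute + transfer

      def predict_energy(time_s: float, num_gpus: int, watts_per_gpu=250) -> float:
          return time_s * num_gpus * watts_per_gpu             # joules

      def autotune(n: int, candidates=(1, 2, 4, 8), objective="time"):
          def score(g):
              t = predict_gemm_time(n, g)
              return t if objective == "time" else predict_energy(t, g)
          return min(candidates, key=score)

      if __name__ == "__main__":
          print("fastest GPU count:", autotune(16384, objective="time"))
          print("most energy-efficient:", autotune(16384, objective="energy"))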

  • 15:15-15:45 | Coffee Break

  • Abstract

    Fully Homomorphic Encryption (FHE) enables computing directly on encrypted data, letting clients securely offload computation to untrusted servers. While enticing, FHE suffers from two key challenges. First, it incurs very high overheads: it is about 10,000x slower than native, unencrypted computation on a CPU. Second, FHE is extremely hard to program: translating even simple applications like neural networks takes months of tedious work by FHE experts. In this talk, I will describe a hardware and software stack that tackles these challenges and enables the widespread adoption of FHE. First, I will give a systems-level introduction to FHE, describing its programming interface, key characteristics, and performance tradeoffs while abstracting away its complex, cryptography-heavy implementation details. Then, I will introduce a programmable hardware architecture that accelerates FHE programs by 5,000x vs. a CPU with similar area and power, erasing most of the overheads of FHE. Finally, I will introduce a new compiler that abstracts away the details of FHE. This compiler exposes a simple, numpy-like tensor programming interface, and produces FHE programs that match or outperform painstakingly optimized manual versions. Together, these techniques make FHE fast and easy to use across many domains, including deep learning, tensor algebra, and other learning and analytic tasks.

    Bio

    Daniel Sanchez is a Professor of Electrical Engineering and Computer Science at MIT. His research interests include scalable memory hierarchies, architectural support for parallelization, and accelerators for sparse computations and secure computing. He earned a Ph.D. in Electrical Engineering from Stanford University in 2012 and received the NSF CAREER award in 2015.
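
    The compiler described above exposes a simple, numpy-like tensor programming interface. The snippet below only gestures at what a program written against such a front end might look like, using plain NumPy arrays to stand in for encrypted tensors; the encryption step, the compiler invocation, and any FHE-specific API are deliberately left out because their actual names and signatures are not given here.

      # What a numpy-like FHE front end might accept (hypothetical sketch): the
      # programmer writes ordinary tensor code, and the compiler is responsible for
      # mapping it to homomorphic rotations, multiplications, and additions over
      # ciphertexts. Plain NumPy stands in for encrypted tensors here.
      import numpy as np

      def dense_layer(x, weights, bias):
          # Under FHE, this matmul and add would be compiled to ciphertext ops;
          # the ReLU itself would need an FHE-friendly polynomial approximation.
          return np.maximum(x @ weights + bias, 0.0)

      if __name__ == "__main__":
          rng = np.random.default_rng(0)
          x = rng.standard_normal((1, 784))       # would be encrypted client-side
          w = rng.standard_normal((784, 128))
          b = rng.standard_normal(128)
          y = dense_layer(x, w, b)                 # server computes without seeing x
          print(y.shape)                           # (1, 128)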

  • 16:45-17:00 | Closing Remarks



6th Computing Systems Research Day - 11 January 2023

Schedule

  • Abstract

    We will examine the RowHammer problem in DRAM, which is the first example of how a circuit-level failure mechanism in Dynamic Random Access Memory (DRAM) can cause a practical and widespread system security vulnerability. RowHammer is the phenomenon that repeatedly accessing a row in a modern DRAM chip predictably causes errors in physically-adjacent rows. It is caused by a hardware failure mechanism called read disturb errors, a manifestation of circuit-level cell-to-cell interference in a scaled memory technology. Building on our initial fundamental work that appeared at ISCA 2014, Google Project Zero demonstrated that this hardware phenomenon can be exploited by user-level programs to gain kernel privileges. Many other works demonstrated other attacks exploiting RowHammer, including remote takeover of a server vulnerable to RowHammer, takeover of a mobile device by a malicious user-level application, and destruction of predictive capabilities of commonly-used deep neural networks.

    Unfortunately, the RowHammer problem still plagues cutting-edge DRAM chips, DDR4 and beyond. Based on our recent characterization studies of more than 1500 DRAM chips from six technology generations that appeared at ISCA 2020 and MICRO 2021, we will show that RowHammer at the circuit level is getting much worse, newer DRAM chips are much more vulnerable to RowHammer than older ones, and existing mitigation techniques do not work well. We will also show that existing proprietary mitigation techniques employed in DDR4 DRAM chips, which are advertised to be RowHammer-free, can be bypassed via many-sided hammering (also known as TRRespass & Uncovering TRR).

    Throughout the talk, we will analyze the properties of the RowHammer problem, examine circuit/device scaling characteristics, and discuss solution ideas. We will also discuss what other problems may be lurking in DRAM and other types of memory, e.g., NAND flash memory, Phase Change Memory and other emerging memory technologies, which can potentially threaten the foundations of reliable and secure systems, as the memory technologies scale to higher densities. We will conclude by describing and advocating a principled approach to memory reliability and security research that can enable us to better anticipate and prevent such vulnerabilities.

    Bio

    Onur Mutlu is a Professor of Computer Science at ETH Zurich. He is also a faculty member at Carnegie Mellon University, where he previously held the Strecker Early Career Professorship. His current broader research interests are in computer architecture, systems, hardware security, and bioinformatics. Various techniques he, along with his group and collaborators, has invented over the years have influenced industry and have been employed in commercial microprocessors and memory/storage systems. He obtained his PhD and MS in ECE from the University of Texas at Austin and BS degrees in Computer Engineering and Psychology from the University of Michigan, Ann Arbor. He started the Computer Architecture Group at Microsoft Research (2006-2009), and held various product and research positions at Intel Corporation, Advanced Micro Devices, VMware, and Google. He received the Google Security and Privacy Research Award, Intel Outstanding Researcher Award, NVMW Persistent Impact Prize, IEEE High Performance Computer Architecture Test of Time Award, the IEEE Computer Society Edward J. McCluskey Technical Achievement Award, ACM SIGARCH Maurice Wilkes Award, the inaugural IEEE Computer Society Young Computer Architect Award, the inaugural Intel Early Career Faculty Award, US National Science Foundation CAREER Award, Carnegie Mellon University Ladd Research Award, faculty partnership awards from various companies, and a healthy number of best paper or “Top Pick” paper recognitions at various computer systems, architecture, and security venues. He is an ACM Fellow, IEEE Fellow, and an elected member of the Academy of Europe (Academia Europaea). His computer architecture and digital logic design course lectures and materials are freely available on YouTube (https://www.youtube.com/OnurMutluLectures), and his research group makes a wide variety of software and hardware artifacts freely available online (https://safari.ethz.ch/). For more information, please see his webpage at https://people.inf.ethz.ch/omutlu/.

  • 11:30-12:00 | Break

  • Abstract

    Current hardware and operating system abstractions were conceived at a time when we had minimal security threats, homogeneous compute, scarce memory resources, and limited numbers of users. These assumptions are not true today. On one hand, software and hardware vulnerabilities have escalated the need for confidential computing primitives. On the other hand, emerging datacenter paradigms like microservices and serverless computing have led to the sharing of computing resources among hundreds of users at a time through lightweight virtualization primitives. In this new era of computing, we can no longer afford to build each layer separately. Instead, we have to rethink the synergy between the operating system and hardware from the ground up. In this talk, I will focus on datacenter challenges and recent results focused on virtual memory, memory management, lightweight virtualization, and confidential computing.

    Bio

    Dimitrios is an assistant professor in the Computer Science Department at Carnegie Mellon University. His research bridges computer architecture and operating systems with a focus on performance, security, and scalability. He has received several awards for his cross-cutting research including several Meta Faculty Awards, the joint 2021 ACM SIGARCH & IEEE CS TCCA Outstanding Dissertation award for “contributions to redesigning the abstractions and interfaces that connect hardware and operating systems”, the David J. Kuck Outstanding Ph.D. Thesis Award for the best PhD thesis in the computer science department at the University of Illinois at Urbana-Champaign, two IEEE MICRO Top Picks in Computer Architecture, and two ASPLOS Best Paper Awards.

  • 13:00-13:30 | Break

  • Abstract

    We introduce the first open-source FPGA-based infrastructure, MetaSys, with a prototype in a RISC-V core, to enable the rapid implementation and evaluation of a wide range of cross-layer techniques in real hardware. Hardware-software cooperative techniques are powerful approaches to improve the performance, quality of service, and security of general-purpose processors. They are however typically challenging to rapidly implement and evaluate in real hardware as they require full-stack changes to the hardware, OS, system software, and instruction-set architecture (ISA).

    MetaSys implements a rich hardware-software interface and lightweight metadata support that can be used as a common basis to rapidly implement and evaluate new cross-layer techniques. We demonstrate MetaSys's versatility and ease-of-use by implementing and evaluating three cross-layer techniques for: (i) prefetching for graph analytics; (ii) bounds checking in memory-unsafe languages; and (iii) return address protection in stack frames, with each technique requiring only ~100 lines of Chisel code over MetaSys.

    Using MetaSys, we perform the first detailed experimental study to quantify the performance overheads of using a single metadata management system to enable multiple cross-layer optimizations in CPUs. We identify the key sources of bottlenecks and system inefficiency of a general metadata management system. We design MetaSys to minimize these inefficiencies and provide increased versatility compared to previously-proposed metadata systems. Using three use cases and a detailed characterization, we demonstrate that a common metadata management system can be used to efficiently support diverse cross-layer techniques in CPUs.

    Bio

    Konstantinos Kanellopoulos is currently pursuing his PhD at ETH Zurich in the SAFARI research group (https://safari.ethz.ch/). He completed his MEng and BSc at NTUA. His research interests lie at the intersection of software and hardware with a focus on OS/Hardware co-design.
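
    The bounds-checking use case above gives a feel for what the metadata support is for: allocations tag an address range, and every access is validated against that tag. The following Python model is only a software analogy of the idea (MetaSys implements metadata lookups in hardware, with the techniques themselves written in Chisel); the class and method names are invented for illustration.

      # Software analogy of metadata-based bounds checking: every allocation tags its
      # address range with a metadata ID, and each access is validated against the
      # tag before it is allowed. MetaSys provides such metadata support in hardware.

      class TaggedMemory:
          def __init__(self, size: int):
              self.data = bytearray(size)
              self.tags = [0] * size          # metadata ID per byte (0 = untagged)
              self.bounds = {}                # tag -> (base, limit)
              self.next_tag = 1

          def alloc(self, base: int, length: int) -> int:
              tag = self.next_tag
              self.next_tag += 1
              self.bounds[tag] = (base, base + length)
              for addr in range(base, base + length):
                  self.tags[addr] = tag
              return tag

          def load(self, tag: int, addr: int) -> int:
              base, limit = self.bounds[tag]
              if not (base <= addr < limit) or self.tags[addr] != tag:
                  raise MemoryError(f"out-of-bounds access at {addr:#x}")
              return self.data[addr]

      if __name__ == "__main__":
          mem = TaggedMemory(1024)
          buf = mem.alloc(base=0x100, length=16)
          print(mem.load(buf, 0x10F))         # last valid byte: allowed
          try:
              mem.load(buf, 0x110)            # one byte past the end
          except MemoryError as e:
              print("caught:", e)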

  • Abstract

    Despite groundbreaking technological innovations and revolutions, the Memory Wall is still a major performance obstacle for modern systems. Hardware prefetching is a widely deployed latency-tolerance technique that has proven successful at shrinking the processor-memory performance gap. However, state-of-the-art designs of hardware cache prefetchers are far from approaching the performance of an ideal hardware cache prefetcher. Virtual memory is a memory management technique that has been vital for the success of computing due to its unique programmability and security benefits. However, virtual memory does not come for free due to the page walk memory references introduced for fetching address translation entries. Virtual memory makes the Memory Wall ‘taller’ due to the requirement of traversing the memory hierarchy potentially multiple times upon TLB misses. To make matters worse, the advent of emerging workloads with massive data and code footprints that experience high TLB miss rates places tremendous pressure on the memory hierarchy due to the need for frequently performing page walks, threatening the performance of computing. Our work demonstrates that hardware prefetching has the potential to attenuate the Memory Wall bottleneck in virtual memory systems. In this direction, we (i) propose fully legacy-preserving hardware prefetching schemes for the last-level TLBs that aim at reducing the TLB miss rates of both data and instruction references, and (ii) exploit address translation metadata available at the microarchitecture to improve the effectiveness of hardware cache prefetchers operating in the physical address space without opening new security vulnerabilities.

    Bio

    Georgios Vavouliotis is a 4th year Ph.D. student at Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, supervised by Marc Casas and Lluc Alvarez while closely collaborating with professors Daniel A. Jimenez, Boris Grot, and Paul Gratz. Georgios also holds an Electrical and Computer Engineering diploma from the National Technical University of Athens, supervised by professors Vasileios Karakostas and Georgios Goumas. Georgios’s research interests include all shades of computer architecture. Currently, he is conducting microarchitectural research aimed at reducing the address translation overheads via hardware prefetching, enhancing the efficacy of cache prefetchers, and improving cache management for graph processing applications. His research work has received accolades and has been featured in top-tier computer architecture conferences.
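
    The first direction above, TLB prefetching, can be pictured with a very small model: track the stride between successive demand-missed virtual pages and, once the stride repeats, prefetch the translation for the next expected page. The Python sketch below is a generic stride prefetcher for illustration, not one of the designs proposed in this work.

      # Minimal stride-based TLB prefetcher model: on each TLB miss, detect a stable
      # stride between successive missed virtual page numbers and prefetch the
      # translation for the next expected page. Illustrative only.

      class StrideTLBPrefetcher:
          def __init__(self):
              self.last_vpn = None
              self.last_stride = None

          def on_tlb_miss(self, vpn: int):
              prefetch_vpn = None
              if self.last_vpn is not None:
                  stride = vpn - self.last_vpn
                  if stride == self.last_stride and stride != 0:
                      prefetch_vpn = vpn + stride     # stride confirmed: issue a prefetch
                  self.last_stride = stride
              self.last_vpn = vpn
              return prefetch_vpn

      if __name__ == "__main__":
          pf = StrideTLBPrefetcher()
          for vpn in [100, 102, 104, 106]:            # page-strided access pattern
              print(vpn, "->", pf.on_tlb_miss(vpn))   # prefetches 106, 108 once trained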

  • Abstract

    Reliability evaluation in early stages of microprocessor designs varies in the level of hardware modelling accuracy, the speed of the evaluation, and the granularity of the assessment report. In this talk, we revisit the system vulnerability stack for transient faults and present a new microarchitecture-driven methodology for fast and accurate reliability evaluation. We reveal severe pitfalls in widely used vulnerability measurement approaches that operate at software or architecture abstraction layers, aiming to assess the effects of hardware faults. These approaches separate the hardware and the software layers for the assessment, under the assumption that they can reasonably model the effect of hardware faults on the software layer. However, thanks to their speed advantage, they have eventually become common practice for evaluating the overall system resilience. We show that estimations based on higher abstraction layers can deliver results that contradict those of cross-layer methodologies. To this end, we present AVGI, a new cross-layer evaluation methodology, which delivers orders of magnitude faster assessment of the Architectural Vulnerability Factor (AVF) of a microprocessor chip, while retaining the high accuracy of cross-layer reliability evaluation (including microarchitecture, architecture, and software layers).

    Bio

    George Papadimitriou (http://users.uoa.gr/~georgepap) is a postdoctoral researcher in the Dept. of Informatics & Telecommunications at the University of Athens. He earned a PhD in Computer Science from the same department in 2019, where he worked with Prof. Dimitris Gizopoulos. His research focuses on dependable and energy-efficient computer architectures, microprocessor reliability, functional correctness of hardware designs and design validation of microprocessors. Recently, he co-organized and participated in several tutorials around these research areas at MICRO and ISCA and has published more than 35 papers in international conferences and journals.
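
    For readers unfamiliar with the metric, the Architectural Vulnerability Factor mentioned above is conventionally defined as the fraction of bit-cycles in a structure that hold state required for architecturally correct execution (ACE). The formula below states that standard definition (with B the number of bits in the structure and C the number of cycles), not anything specific to AVGI:

      \mathrm{AVF} = \frac{\sum_{c=1}^{C} \left(\text{ACE bits in the structure during cycle } c\right)}{B \times C}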

  • 15:00-15:30 | Break

  • Abstract

    Recent shell-script parallelization systems enjoy mostly automated parallel speedups by compiling scripts ahead-of-time. Unfortunately, such static parallelization is hampered by the dynamic behaviors pervasive in shell scripts—e.g., variable expansion and command substitution—which often require reasoning about the current state of the shell and filesystem. We present a just-in-time (JIT) shell-script compiler, PaSh-JIT, that intermixes evaluation and parallelization during a script’s run-time execution. JIT parallelization collects run-time information about the system’s state, but must not alter the behavior of the original script and must maintain minimal overhead. PaSh-JIT addresses these challenges by (1) using a dynamic interposition framework, guided by a static preprocessing pass; (2) developing runtime support for transparently pausing and resuming shell execution; and (3) operating as a stateful server, communicating with the current shell by passing messages—all without requiring modifications to the system’s underlying shell interpreter. When run on a wide variety of benchmarks, including the POSIX shell test suite, PaSh-JIT (1) does not break scripts, even in cases that are likely to break shells in widespread use; and (2) offers significant speedups whenever parallelization is possible. These results show that PaSh-JIT can be used as a drop-in replacement for any non-interactive shell use, providing significant speedups without any risk of breakage.

    Bio

    Konstantinos Kallas is a 5th year PhD student at the University of Pennsylvania working with Rajeev Alur. His area of interest is the intersection of programming languages and computer systems. Recently, together with several other amazing people, Konstantinos has been working on invigorating research on the shell. The first paper in this line of work, introducing PaSh, a system for parallelizing shell scripts, received the best paper award at EuroSys 2021. He has also published papers improving the shell at ICFP 2021, HotOS 2021, and OSDI 2022. Konstantinos is also interested in programming models and runtimes for stateful serverless computing. He has worked on Azure Durable Functions and has published papers on this topic at OOPSLA 2021, VLDB 2022, and POPL 2023. You can find more information about him here: https://angelhof.github.io/.
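
    Shell specifics aside, the payoff the JIT is after (splitting a pure, streaming command over its input and merging the per-chunk outputs in order) can be pictured with the small Python sketch below; the grep-like filter and the chunking scheme are illustrative stand-ins, not how PaSh-JIT actually executes shell commands.

      # Toy illustration of data-parallel execution of a pure, streaming command:
      # split the input, process chunks in parallel, concatenate outputs in order.
      # PaSh-JIT does this for real shell pipelines at run time; this only mimics
      # the idea for a grep-like filter implemented in Python.
      from concurrent.futures import ProcessPoolExecutor

      def grep_like(lines, pattern="error"):
          return [ln for ln in lines if pattern in ln]

      def parallel_run(lines, num_workers=4):
          chunk = max(1, len(lines) // num_workers)
          parts = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
          with ProcessPoolExecutor(max_workers=num_workers) as pool:
              results = list(pool.map(grep_like, parts))   # order-preserving
          return [ln for part in results for ln in part]

      if __name__ == "__main__":
          log = [f"line {i}: {'error' if i % 7 == 0 else 'ok'}" for i in range(10_000)]
          assert parallel_run(log) == grep_like(log)       # same output as sequential run
          print(len(parallel_run(log)), "matching lines")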

  • Abstract

    When processing join queries over big data, a DBMS can become unresponsive due to the sheer size of the output that has to be returned to the user. However, oftentimes, users have preferences over the answers and only some of these answers are required. To exploit this, and guarantee that intermediate results are relatively small, new query processing algorithms are necessary. We develop “any-k” algorithms that return the most important answers as quickly as possible, followed by the rest in quick succession. For a large class of queries, the top results are returned in linear time in input size, even if the entire set of answers is much larger. Our prototype implementation of any-k outperforms by orders of magnitude the traditional DBMS approach that first performs the join and then sorts the output in order to find the top answers. Project website: https://northeastern-datalab.github.io/anyk/

    Bio

    Nikolaos (Nikos) Tziavelis is a 5th year PhD candidate at Northeastern University, advised by Mirek Riedewald and Wolfgang Gatterbauer, and before that, he received a Diploma in Electrical and Computer Engineering from the National Technical University of Athens. His research interests lie in novel algorithms for query processing, efficient data representations, and distributed computing, for which he is generously supported by a Google PhD fellowship.
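
    For the simplest case, a binary join ranked by the sum of tuple weights, the "return the best answers first" behavior can be sketched with a priority queue over per-key sorted weight lists. The Python sketch below illustrates ranked enumeration only; it is far simpler than the any-k algorithms covered in the talk, which handle general join queries.

      # Toy ranked enumeration for a binary join R(key, w1) JOIN S(key, w2), ordered
      # by w1 + w2: lazily expand (i, j) index pairs per join key with a priority
      # queue, so the best answers stream out without materializing the full join.
      import heapq
      from collections import defaultdict

      def ranked_join(R, S):
          r_by_key, s_by_key = defaultdict(list), defaultdict(list)
          for key, w in R:
              r_by_key[key].append(w)
          for key, w in S:
              s_by_key[key].append(w)
          heap, seen = [], set()
          for key in r_by_key.keys() & s_by_key.keys():
              r_by_key[key].sort()
              s_by_key[key].sort()
              heapq.heappush(heap, (r_by_key[key][0] + s_by_key[key][0], key, 0, 0))
              seen.add((key, 0, 0))
          while heap:
              cost, key, i, j = heapq.heappop(heap)
              yield cost, key
              rs, ss = r_by_key[key], s_by_key[key]
              for ni, nj in ((i + 1, j), (i, j + 1)):
                  if ni < len(rs) and nj < len(ss) and (key, ni, nj) not in seen:
                      seen.add((key, ni, nj))
                      heapq.heappush(heap, (rs[ni] + ss[nj], key, ni, nj))

      if __name__ == "__main__":
          R = [("a", 1), ("a", 5), ("b", 2)]
          S = [("a", 3), ("b", 1), ("b", 4)]
          for rank, (cost, key) in enumerate(ranked_join(R, S), 1):
              print(rank, key, cost)        # answers appear in non-decreasing cost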

  • Abstract

    The development of quantum computers has been advancing rapidly in recent years. In addition to researchers and companies building bigger and bigger machines, these computers are already being actively connected to the internet and offered as cloud-based quantum computer services. As quantum computers become more widely accessible, potentially malicious users could try to execute their code on the machines to leak information from other users, to interfere with or manipulate results of other users, or to reverse engineer the underlying quantum computer architecture and its intellectual property, for example. To analyze such new security threats to cloud-based quantum computers, this presentation will cover recent research and evaluation of different types of quantum computer viruses. The presentation will also introduce a first-of-its-kind quantum computer antivirus as a new means of protecting the expensive and fragile quantum computer hardware from quantum computer viruses. The novel antivirus can analyze quantum computer programs, also called circuits, and detect possibly malicious ones before they execute on quantum computer hardware. As a compile-time technique, it does not introduce any new overhead at run-time of the quantum computer. The presented research is joint work with different members of CASLAB at Yale University.

    Bio

    Theodoros is a 2nd year PhD student at Yale University under the supervision of Prof. Jakub Szefer and a member of the Computer Architecture and Security Laboratory (CASLAB). Before joining Yale, he received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens in 2021, where he completed his thesis in CSLab under the supervision of Prof. Dionisios Pnevmatikatos and Prof. Vasileios Karakostas. His research interests broadly encompass computer architecture and hardware security of computing systems, including security of quantum computers. In this line of work, along with other members of CASLAB, he has published papers presenting a novel hardware architecture for securing quantum computers at HOST 2022 and HOST 2023.
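
    The antivirus described above amounts to static pattern matching over a circuit before it reaches the hardware. The Python sketch below shows what one such check could look like: flagging unusually long runs of CNOT gates on the same qubit pair, a pattern loosely inspired by crosstalk-style attacks. Both the pattern and the threshold are assumptions for illustration, not the detector published in this line of work.

      # Hypothetical compile-time "antivirus" check: scan a gate list and flag
      # suspiciously long consecutive runs of CNOTs on the same qubit pair, which
      # could indicate an attempt to induce crosstalk on co-located programs.
      # The pattern and threshold are illustrative, not the published detector.

      def flag_suspicious_runs(circuit, threshold=50):
          alerts, run_len, last_pair = [], 0, None
          for gate, qubits in circuit:
              pair = tuple(qubits) if gate == "cx" else None
              if pair is not None and pair == last_pair:
                  run_len += 1
              else:
                  run_len, last_pair = (1, pair) if pair is not None else (0, None)
              if run_len == threshold:
                  alerts.append(f"{threshold}+ consecutive CNOTs on qubits {pair}")
          return alerts

      if __name__ == "__main__":
          benign = [("h", [0]), ("cx", [0, 1]), ("measure", [0])]
          malicious = [("cx", [2, 3])] * 200
          print(flag_suspicious_runs(benign))      # []
          print(flag_suspicious_runs(malicious))   # one alert for qubits (2, 3)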