Past Events
8th Computing Systems Research Day - 7 January 2025
Schedule
-
12:00-12:15 | Welcome
-
Abstract
Modern cloud applications need microsecond-level responsiveness, yet current virtualization approaches often cause millisecond-scale delays. This talk presents two complementary solutions that bring virtualized environments closer to bare-metal performance. First, Rorke is a microsecond-scale VM scheduler for oversubscribed cloud environments. By approximating processor sharing at the host and dynamically adapting time slices, Rorke cuts tail latency by over 10x for popular low-latency workloads without harming throughput in non-oversubscribed scenarios. Second, Machnet is a userspace network stack designed for public clouds. Rather than relying on specialized NIC features unavailable in virtual NICs, Machnet uses a “Least Common Denominator” approach and a microkernel design to support flexible execution models. It achieves substantial latency and CPU efficiency gains, demonstrating 80% lower latency and 75% lower CPU utilization for a key-value store compared to today’s best solutions. Together, Rorke and Machnet bring virtualized infrastructure closer than ever to bare-metal levels of performance, setting a new standard for cloud computing efficiency.
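The scheduling half of this is easy to see in miniature. The toy simulation below is not Rorke's algorithm (whose details are not given here), only an illustration of the intuition behind it: shrinking the time slice toward the microsecond scale makes round-robin approximate processor sharing, so short requests stop waiting out the full slices of long-running vCPUs. All workload numbers are invented.

```python
from collections import deque

def simulate(jobs_us, slice_us):
    """Round-robin over runnable requests with a fixed time slice;
    returns each request's completion time in microseconds."""
    queue = deque(range(len(jobs_us)))
    remaining = list(jobs_us)
    t, done = 0.0, [0.0] * len(jobs_us)
    while queue:
        j = queue.popleft()
        run = min(slice_us, remaining[j])
        t += run
        remaining[j] -= run
        if remaining[j] > 0:
            queue.append(j)      # unfinished: back of the run queue
        else:
            done[j] = t
    return done

# 10 long (10 ms) requests arrive just ahead of 90 short (10 us) ones.
jobs = [10_000.0] * 10 + [10.0] * 90
for s in (1_000.0, 10.0):        # a classic ~1 ms slice vs. a 10 us slice
    shorts = sorted(simulate(jobs, s)[10:])
    print(f"slice={s:>6.0f}us  median short-request completion={shorts[45]:.0f}us")
```

With millisecond slices the short requests sit behind the long ones for ~10 ms; with 10 us slices they finish roughly 20x sooner, at the cost of more frequent context switches, which is exactly the overhead a microsecond-scale scheduler must keep cheap.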
Bio
Kostis Kaffes joined the Department of Computer Science at Columbia University as an assistant professor in June 2023. Kostis obtained an MSc and PhD from Stanford University in 2018 and 2022, respectively, and an undergraduate degree from the National Technical University of Athens in Greece in 2015. He is broadly interested in computer systems, cloud computing, and scheduling. He has worked on end-host, rack-scale, and cluster-scale scheduling for microsecond-scale tail latency. He has also been seeking ways to accelerate machine learning systems and use machine learning to improve operating systems management. Prior to Columbia, he spent a year at Google's Systems Research Group (SRG).
-
13:15-14:00 | Lunch Break
-
Abstract
The shared and distributed memory capabilities of the emerging Compute Express Link (CXL) interconnect call for a rethink of traditional system software interfaces. In this talk, we will discuss the challenges of CXL-connected distributed systems and explore one such interface: remote fork over CXL fabrics for cluster-wide process cloning. We will introduce CXLfork, which realizes zero-serialization, zero-copy process cloning across nodes. CXLfork utilizes globally-shared CXL memory for cluster-wide deduplication of process state and enables fine-grained control of state tiering between local and CXL memory. We will show how it can be integrated into serverless runtimes to achieve fearless concurrency, introducing CXLporter, an efficient horizontal autoscaler for serverless functions deployed over CXL fabrics. Overall, CXLfork attains a remote fork latency close to that of a local fork, outperforming the state of practice by 2.26x on average and reducing local memory consumption by 87% on average. Integrated with CXLporter, it achieves high throughput with 3x less local memory.
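As a rough single-host analogy of the zero-copy idea (not CXLfork's kernel mechanism, which clones page tables over hardware-shared CXL memory), the sketch below "clones" process state by attaching to a shared pool by name rather than serializing and copying it:

```python
# Loose analogy only: a child process maps the parent's state in place
# instead of receiving a serialized copy of it.
import numpy as np
from multiprocessing import Process, shared_memory

def child(name, shape):
    shm = shared_memory.SharedMemory(name=name)       # attach: no copy, no serialization
    state = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
    print("child sees sum =", state.sum())            # reads parent's state directly
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8 * 1_000_000)
    state = np.ndarray((1_000_000,), dtype=np.float64, buffer=shm.buf)
    state[:] = 1.0                                    # "process state" in the shared pool
    p = Process(target=child, args=(shm.name, state.shape))
    p.start(); p.join()
    shm.close(); shm.unlink()
```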
Bio
Chloe Alverti is a postdoctoral researcher at the University of Illinois Urbana-Champaign (UIUC), hosted by Professor Josep Torrellas. Her ongoing research is part of the ACE Center for Evolvable Computing. She received her PhD in 2022 from the School of Electrical and Computer Engineering at the National Technical University of Athens (NTUA), where she was a member of the Computing Systems Laboratory (CSLAB) supervised by Professor Georgios Goumas. During her studies she spent 3 months as a visiting scholar at the University of Wisconsin-Madison working with Professor Michael Swift. Before her PhD she worked for two years as a research assistant at Chalmers University of Technology advised by Professor Per Stenstrom. Her research interests are focused on system software and hardware co-design for efficient memory access and virtualization, recently focusing on distributed systems.
-
Abstract
We introduce Adrias, an interference-aware memory orchestration framework that enables effective data placement decisions on memory-disaggregated cloud infrastructures. The key features of Adrias are: i) its ability to forecast the trend of system-wide metrics, driving proactive memory orchestration decisions; ii) its accurate performance predictions for deployed applications with respect to memory heterogeneity (local/fast vs. remote/slow DRAM) and interference; and iii) its ability to leverage disaggregated memory with minimal impact on the performance of deployed applications, without employing dynamic memory management mechanisms. Adrias exploits system-level performance monitoring information and leverages deep learning approaches to place incoming applications on the pool of available memory resources.
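At its simplest, such a model-driven placement decision looks like the toy rule below. The threshold, field names, and inputs are illustrative stand-ins for Adrias's learned predictors, not its actual interface:

```python
def place(app, predicted_slowdown, local_free_gb, threshold=1.10):
    """Toy placement rule in the spirit of the framework (all names and the
    threshold are invented): admit an app to disaggregated memory only if a
    model predicts it stays within `threshold`x of its local-DRAM performance,
    or if local DRAM cannot fit it anyway."""
    if local_free_gb < app["footprint_gb"]:
        return "remote"                    # no room locally; take the hit
    return "remote" if predicted_slowdown <= threshold else "local"

apps = [
    {"name": "kv-store",  "footprint_gb": 24},
    {"name": "analytics", "footprint_gb": 64},
]
print(place(apps[0], predicted_slowdown=1.35, local_free_gb=128))  # -> local
print(place(apps[1], predicted_slowdown=1.05, local_free_gb=128))  # -> remote
```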
Bio
Dr. Dimosthenis Masouros received his Diploma and Ph.D. degrees from the Department of Electrical and Computer Engineering at the National Technical University of Athens, Greece, in 2016 and 2023, respectively. His research focuses on systems optimization, with an emphasis on leveraging machine learning techniques to address challenges in resource allocation, application scheduling and systems performance prediction. His current research interests include optimizing performance and energy efficiency in emerging paradigms such as serverless computing, Large Language Models, Federated Learning, and other related technologies. He has been actively involved in five European research projects and has authored over 40 peer-reviewed papers in leading international conferences and journals.
-
Abstract
Large pages have been the de facto mitigation technique for the translation overheads of virtual memory, with prior work mostly focusing on the large page sizes supported by the x86 architecture, i.e., 2MiB and 1GiB. ARMv8-A and RISC-V support additional intermediate translation sizes, i.e., 64KiB and 32MiB, via OS-assisted TLB coalescing, but their performance potential has largely flown under the radar due to limited system software support. In this work, we propose Elastic Translations (ET), a holistic memory management solution that fully explores and exploits the aforementioned translation sizes for both native and virtualized execution. ET implements mechanisms that make the OS memory manager coalescing-aware, enabling the transparent and efficient use of intermediate-sized translations. ET also employs policies to guide translation size selection at runtime using lightweight HW-assisted TLB miss sampling. We design and implement ET for ARMv8-A in Linux and KVM. Our real-system evaluation shows that ET improves the performance of memory-intensive workloads by up to 39% in native execution and by 30% on average in virtualized execution.
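The coalescing constraint ET must respect is mechanical: a 64KiB translation on ARMv8-A requires a naturally aligned block of 16 entries mapping physically contiguous, equally aligned 4KiB frames. A minimal eligibility check over a toy page table (a plain dict here; the real mechanism lives in the OS memory manager):

```python
PAGE = 4096
CONTIG = 16                      # 16 contiguous 4 KiB entries -> one 64 KiB translation

def coalescable(vpn, mapping):
    """True if the naturally aligned 16-page block containing `vpn` maps to
    physically contiguous, identically aligned frames (the ARMv8-A
    contiguous-bit requirement). `mapping`: virtual page -> physical frame."""
    base = vpn - (vpn % CONTIG)               # start of the 64 KiB block
    first = mapping.get(base)
    if first is None or first % CONTIG != 0:  # physical side must be aligned too
        return False
    return all(mapping.get(base + i) == first + i for i in range(CONTIG))

m = {v: 512 + v for v in range(32)}           # 32 pages mapped contiguously from frame 512
print(coalescable(5, m))                      # True: frames 512..527 back pages 0..15
m[20] = 9999                                  # break contiguity in the second block
print(coalescable(20, m))                     # False
```

ET's contribution is making the allocator produce such blocks in the first place and deciding, via TLB miss sampling, when coalescing pays off.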
Bio
Stratos Psomadakis is a final-year PhD student at the National Technical University of Athens under the supervision of Prof. Georgios Goumas. His research interests lie at the intersection of Operating Systems and Hardware, with a focus on virtual memory and emerging ISAs.
-
15:30-16:00 | Coffee Break
-
Abstract
Language-agnostic composition environments such as OSes, shells, microservices, and serverless have always held the promise of significant benefits, including reduced developer effort, lower financial costs, and component specialization. Unfortunately, these environments hinder the performance optimizations and the strong correctness and security guarantees that are typical of language-aware, semantics-first environments. In this talk, I will discuss how recent developments across fields allow overcoming these challenges, offer several benefits, and enable new opportunities for exciting research that has the potential for widespread impact.
Bio
Nikos Vasilakis is on the faculty of Computer Science at Brown University. His research encompasses software systems, programming languages, and security, with a current focus on automatically transforming systems to add new capabilities such as parallelism, distribution, isolation, and correctness. Prof. Vasilakis is also the chair of the Technical Steering Committee behind PaSh, a shell-script optimization system hosted by the Linux Foundation. More: https://nikos.vasilak.is and https://atlas.cs.brown.edu
-
17:00-17:15 | Closing Remarks
7th Computing Systems Research Day - 9 January 2024
Schedule
-
11:45-12:00 | Welcome
-
Abstract
Cloud systems are experiencing significant shifts both in their hardware, with an increased adoption of heterogeneity, and their software, with the prevalence of microservices and serverless frameworks. These trends require fundamentally rethinking how the cloud system stack should be designed. In this talk, I will briefly describe the challenges these hardware and software trends introduce, and discuss under what conditions hardware acceleration can be beneficial to these new application classes, as well as how applying machine learning (ML) to systems problems can improve the cloud’s performance, efficiency, and ease of use. I will first present Sage, a performance debugging system that leverages ML to identify and resolve the root causes of performance issues in cloud microservices. I will then discuss Ursa, an analytically-driven cluster manager for microservices that addresses some of the shortcomings of applying ML to large-scale systems problems.
Bio
Christina Delimitrou is an Associate Professor at MIT, where she works on computer architecture and computer systems. She focuses on improving the performance, predictability, and resource efficiency of large-scale cloud infrastructures by revisiting the way they are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Fellowship, two Google Faculty Research Awards, and a Facebook Faculty Research Award. Her work has also received 5 IEEE Micro Top Picks awards and several best paper awards. Before joining MIT, Christina was an Assistant Professor at Cornell University. She received her PhD from Stanford University, an MS also from Stanford, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information at http://people.csail.mit.edu/delimitrou/
-
13:00-13:45 | Lunch Break
-
Abstract
Datacenters have witnessed a staggering evolution in networking technologies, driven by insatiable application demands for larger datasets and inter-server data transfers. Modern NICs can already handle 100s of Gbps of traffic, a bandwidth capability equivalent to several memory channels. Direct Cache Access mechanisms like DDIO that contain network traffic inside the CPU’s caches are therefore essential to effectively handle growing network traffic rates. However, at high rates, a large fraction of network traffic leaks from the CPU’s caches to memory, a problem often referred to as “leaky DMA”, significantly capping the network bandwidth a server can effectively utilize. This talk will present an analysis of network data leaks in the era of high-speed networking and our insights around the interactions between network buffers and the cache and memory hierarchy. We will present Sweeper, our proposed hardware extension and API that allows applications to efficiently manage the coherence state of network buffers in the cache-memory hierarchy, drastically reducing memory bandwidth consumption and boosting a server’s peak sustainable network bandwidth by up to 2.6x.
Bio
Marina Vemmou is a 5th year PhD student in the School of Computer Science at Georgia Tech, advised by assistant professor Alexandros Daglis. Her research focuses on designing new interfaces between hardware, network stacks and applications to unlock the performance potential of emerging datacenter technologies. Her work on hardware-software co-design for emerging network and memory technologies has been published at MICRO 2021 and MICRO 2022.
-
Abstract
Distributed transaction processing is a fundamental building block for large-scale data management in the cloud. Given the threats of security violations in untrusted cloud environments, our work addresses the question: how can we design a distributed transactional KV store that achieves high-performance serializable transactions while providing strong security properties? We introduce TREATY, a secure distributed transactional KV storage system that supports serializable ACID transactions while guaranteeing strong security properties: confidentiality, integrity, and freshness. TREATY leverages trusted execution environments (TEEs) to bootstrap its security properties, but it extends the trust provided by the limited enclave (volatile) memory region within a single node to build a secure (stateful) distributed transactional KV store over untrusted storage, network, and machines. To achieve this, TREATY embodies a secure two-phase commit protocol co-designed with a high-performance network library for TEEs. Further, TREATY ensures secure and crash-consistent persistency of committed transactions using a stabilization protocol. Our evaluation on a real hardware testbed based on the YCSB and TPC-C benchmarks shows that TREATY incurs reasonable overheads, while achieving strong security properties.
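The backbone of the protocol is two-phase commit. The skeleton below shows plain 2PC only; TREATY's contribution layers attestation, encrypted TEE-to-TEE messaging, and the stabilization protocol on top, all of which are omitted here:

```python
class Participant:
    def __init__(self):
        self.staged = None
    def prepare(self, txn):
        self.staged = txn          # a real system stages writes durably here
        return True                # vote yes (no conflicts in this toy)
    def commit(self):
        print("applied", self.staged)
    def abort(self):
        self.staged = None

def two_phase_commit(txn, participants):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare(txn) for p in participants]
    # Phase 2: unanimous yes -> commit everywhere; otherwise abort everywhere.
    if all(votes):
        for p in participants: p.commit()
        return "committed"
    for p in participants: p.abort()
    return "aborted"

print(two_phase_commit({"k": "v"}, [Participant(), Participant()]))
```

In TREATY, every message of this exchange must additionally be authenticated and fresh, since the network and the machines outside the enclaves are untrusted.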
Bio
Dimitra Giantsidi is a final-year PhD student at the University of Edinburgh (UoE), a member of the Institute for Computing Systems Architecture (ICSA) and the Chair of Distributed and Operating Systems, advised by Prof. Pramod Bhatotia. Her research lies in the field of dependability in distributed systems, with a focus on fault tolerance and security. Her work aims to increase the security and performance of widely adopted distributed systems by exploring the applications of modern hardware, such as Trusted Execution Environments and direct I/O for networking and storage. Before joining ICSA, Dimitra graduated from the School of Electrical and Computer Engineering, NTUA, Greece.
-
Abstract
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN) applications. However, DNN applications often underutilize GPUs, even when using large batch sizes and eliminating input data processing or communication stalls. DNN workloads consist of data-dependent operators, with different compute and memory requirements. While an operator may saturate GPU compute units or memory bandwidth, it often leaves other GPU resources idle. Despite the prevalence of GPU sharing techniques, current approaches are not sufficiently fine-grained or interference-aware to maximize GPU utilization while minimizing interference at the granularity of 10s of microseconds. We present Orion, a system that transparently intercepts GPU kernel launches from multiple clients sharing a GPU. Orion schedules work on the GPU at the granularity of individual operators and minimizes interference by taking into account each operator’s compute and memory requirements. We integrate Orion in PyTorch and demonstrate its benefits in various DNN workload collocation use cases.
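A minimal sketch of interference-aware operator selection appears below. It reduces each operator's profile to a single "bottleneck" label, a simplification of the compute and memory requirements the system actually tracks per kernel:

```python
import collections

def pick_next(running_op, queues):
    """Toy interference-aware selection: prefer a queued operator whose
    bottleneck (compute vs. memory bandwidth) differs from the operator
    currently occupying the GPU, so they do not contend for the same resource.
    The single-label profile is a simplification for illustration."""
    for client, q in queues.items():
        if q and q[0]["bottleneck"] != running_op["bottleneck"]:
            return client, q.popleft()
    for client, q in queues.items():          # fall back to any ready work
        if q:
            return client, q.popleft()
    return None, None

queues = {
    "training":  collections.deque([{"name": "conv2d",    "bottleneck": "compute"}]),
    "inference": collections.deque([{"name": "embedding", "bottleneck": "memory"}]),
}
running = {"name": "matmul", "bottleneck": "compute"}
print(pick_next(running, queues))   # schedules the memory-bound embedding op
```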
Bio
Foteini Strati is a 3rd year PhD student at the Systems Group of ETH Zurich, working on systems for Machine Learning. She is interested in increasing resource utilization and fault tolerance of Machine Learning workloads. She obtained an MSc degree in Computer Science from ETH Zurich and a Diploma in Electrical and Computer Engineering from NTUA.
-
Abstract
Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial for achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. We demonstrate that current multi-GPU BLAS libraries target very specific problems and data characteristics, resulting in serious performance degradation for any slightly deviating workload, and do not take energy efficiency into account at all. To address these issues, we propose a model-based approach: using performance estimation to provide problem-specific autotuning during runtime. We integrate this autotuning into the PARALiA framework coupled with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA on an HPC testbed with 8 NVIDIA V100 GPUs, improving the average performance of GEMM by 1.7x and energy efficiency by 2.5x over the state-of-the-art on a large and diverse dataset, and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
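A stripped-down example of what "performance estimation for autotuning" can mean: pick the work split whose predicted makespan is lowest under a crude roofline-style model. The model and hardware numbers below are invented for illustration and are much simpler than PARALiA's:

```python
def predicted_time(frac, flops, bytes_moved, peak_gflops, bw_gbs):
    """Crude roofline-style estimate: a GPU's share of the work overlaps
    compute with its host-to-device transfers, so the slower of the two
    dominates. Illustrative only."""
    compute = frac * flops / (peak_gflops * 1e9)
    transfer = frac * bytes_moved / (bw_gbs * 1e9)
    return max(compute, transfer)

def best_split(flops, bytes_moved, gpus, steps=100):
    """Enumerate two-GPU work fractions and keep the lowest predicted makespan."""
    best = None
    for i in range(steps + 1):
        f = i / steps
        t = max(predicted_time(f, flops, bytes_moved, *gpus[0]),
                predicted_time(1 - f, flops, bytes_moved, *gpus[1]))
        if best is None or t < best[1]:
            best = (f, t)
    return best

# A GEMM-like workload split across one fast link and one slow link,
# with each GPU given as (peak GFLOP/s, link GB/s):
print(best_split(2e12, 4.8e9, gpus=[(7000, 12), (7000, 6)]))
```

Under these made-up numbers the tuner gives roughly two thirds of the work to the GPU with the faster link, because both shares are transfer-bound; an even split would look optimal only if data movement were ignored, which is exactly the trap a model-based tuner avoids.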
Bio
A final-year PhD candidate at CSLab, NTUA, who graduated from its Electrical and Computer Engineering (ECE) department, specializing in computer engineering through an integrated bachelor's and master's degree. His PhD explores the optimization of linear algebra routines on multi-GPU clusters with model-based autotuning, and his research interests include accelerators, parallel processing, HPC, and performance engineering.
-
15:15-15:45 | Coffee Break
-
Abstract
Fully Homomorphic Encryption (FHE) enables computing directly on encrypted data, letting clients securely offload computation to untrusted servers. While enticing, FHE suffers from two key challenges. First, it incurs very high overheads: it is about 10,000x slower than native, unencrypted computation on a CPU. Second, FHE is extremely hard to program: translating even simple applications like neural networks takes months of tedious work by FHE experts. In this talk, I will describe a hardware and software stack that tackles these challenges and enables the widespread adoption of FHE. First, I will give a systems-level introduction to FHE, describing its programming interface, key characteristics, and performance tradeoffs while abstracting away its complex, cryptography-heavy implementation details. Then, I will introduce a programmable hardware architecture that accelerates FHE programs by 5,000x vs. a CPU with similar area and power, erasing most of the overheads of FHE. Finally, I will introduce a new compiler that abstracts away the details of FHE. This compiler exposes a simple, numpy-like tensor programming interface, and produces FHE programs that match or outperform painstakingly optimized manual versions. Together, these techniques make FHE fast and easy to use across many domains, including deep learning, tensor algebra, and other learning and analytic tasks.
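For readers new to the area, the homomorphic property itself can be shown with a toy. Textbook RSA is multiplicatively homomorphic, so multiplying ciphertexts yields an encryption of the product. This is not FHE (FHE schemes support both additions and multiplications on lattice-based ciphertexts) and the parameters below are deliberately tiny and insecure:

```python
# Toy illustration of the homomorphic property only. NOT FHE and NOT secure:
# tiny primes, no padding. Real FHE uses lattice-based ciphertexts and
# supports arbitrary circuits of additions and multiplications.
p, q, e = 61, 53, 17
n = p * q
d = pow(e, -1, (p - 1) * (q - 1))       # private exponent

enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

a, b = 7, 12
c = (enc(a) * enc(b)) % n               # multiply the ciphertexts...
print(dec(c), a * b)                    # ...and decryption yields the product: 84 84
```

The server in this toy never sees 7, 12, or 84 in the clear; FHE generalizes exactly this ability to whole programs, which is what makes its 10,000x overhead worth attacking with hardware and compilers.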
Bio
Daniel Sanchez is a Professor of Electrical Engineering and Computer Science at MIT. His research interests include scalable memory hierarchies, architectural support for parallelization, and accelerators for sparse computations and secure computing. He earned a Ph.D. in Electrical Engineering from Stanford University in 2012 and received the NSF CAREER award in 2015.
-
16:45-17:00 | Closing Remarks
6th Computing Systems Research Day - 11 January 2023
Schedule
-
Abstract
We will examine the RowHammer problem in DRAM, which is the first example of how a circuit-level failure mechanism in Dynamic Random Access Memory (DRAM) can cause a practical and widespread system security vulnerability. RowHammer is the phenomenon that repeatedly accessing a row in a modern DRAM chip predictably causes errors in physically-adjacent rows. It is caused by a hardware failure mechanism called read disturb errors, a manifestation of circuit-level cell-to-cell interference in a scaled memory technology. Building on our initial fundamental work that appeared at ISCA 2014, Google Project Zero demonstrated that this hardware phenomenon can be exploited by user-level programs to gain kernel privileges. Many other works demonstrated other attacks exploiting RowHammer, including remote takeover of a server vulnerable to RowHammer, takeover of a mobile device by a malicious user-level application, and destruction of predictive capabilities of commonly-used deep neural networks. Unfortunately, the RowHammer problem still plagues cutting-edge DRAM chips, DDR4 and beyond. Based on our recent characterization studies of more than 1500 DRAM chips from six technology generations that appeared at ISCA 2020 and MICRO 2021, we will show that RowHammer at the circuit level is getting much worse, newer DRAM chips are much more vulnerable to RowHammer than older ones, and existing mitigation techniques do not work well. We will also show that existing proprietary mitigation techniques employed in DDR4 DRAM chips, which are advertised to be RowHammer-free, can be bypassed via many-sided hammering (also known as TRRespass and Uncovering TRR). Throughout the talk, we will analyze the properties of the RowHammer problem, examine circuit/device scaling characteristics, and discuss solution ideas. We will also discuss what other problems may be lurking in DRAM and other types of memory, e.g., NAND flash memory, Phase Change Memory and other emerging memory technologies, which can potentially threaten the foundations of reliable and secure systems, as the memory technologies scale to higher densities. We will conclude by describing and advocating a principled approach to memory reliability and security research that can enable us to better anticipate and prevent such vulnerabilities.
Bio
Onur Mutlu is a Professor of Computer Science at ETH Zurich. He is also a faculty member at Carnegie Mellon University, where he previously held the Strecker Early Career Professorship. His current broader research interests are in computer architecture, systems, hardware security, and bioinformatics. Various techniques he, along with his group and collaborators, has invented over the years have influenced industry and have been employed in commercial microprocessors and memory/storage systems. He obtained his PhD and MS in ECE from the University of Texas at Austin and BS degrees in Computer Engineering and Psychology from the University of Michigan, Ann Arbor. He started the Computer Architecture Group at Microsoft Research (2006-2009), and held various product and research positions at Intel Corporation, Advanced Micro Devices, VMware, and Google. He is an ACM Fellow, IEEE Fellow, and an elected member of the Academy of Europe (Academia Europaea). His course lectures and materials are freely available on YouTube at https://www.youtube.com/OnurMutluLectures
-
11:30-12:00 | Break
-
Abstract
Current hardware and operating system abstractions were conceived at a time when we had minimal security threats, homogeneous compute, scarce memory resources, and limited numbers of users. These assumptions are not true today. On one hand, software and hardware vulnerabilities have escalated the need for confidential computing primitives. On the other hand, emerging datacenter paradigms like microservices and serverless computing have led to the sharing of computing resources among hundreds of users at a time through lightweight virtualization primitives. In this new era of computing, we can no longer afford to build each layer separately. Instead, we have to rethink the synergy between the operating system and hardware from the ground up. In this talk, I will focus on datacenter challenges and recent results focused on virtual memory, memory management, lightweight virtualization, and confidential computing.
Bio
Dimitrios Skarlatos is an assistant professor in the Computer Science Department at Carnegie Mellon University. His research bridges computer architecture and operating systems with a focus on performance, security, and scalability. He has received several awards for his cross-cutting research, including multiple Meta Faculty Awards, the joint 2021 ACM SIGARCH and IEEE CS TCCA Outstanding Dissertation Award, the David J. Kuck Outstanding Ph.D. Thesis Award for the best PhD thesis in the computer science department at the University of Illinois at Urbana-Champaign, two IEEE MICRO Top Picks in Computer Architecture, and two ASPLOS Best Paper Awards.
-
13:00-13:30 | Break
-
Abstract
We introduce the first open-source FPGA-based infrastructure, MetaSys, with a prototype in a RISC-V core, to enable the rapid implementation and evaluation of a wide range of cross-layer techniques in real hardware. Hardware-software cooperative techniques are powerful approaches to improve the performance, quality of service, and security of general-purpose processors. They are however typically challenging to rapidly implement and evaluate in real hardware as they require full-stack changes to the hardware, OS, system software, and instruction-set architecture (ISA). MetaSys implements a rich hardware-software interface and lightweight metadata support that can be used as a common basis to rapidly implement and evaluate new cross-layer techniques. We demonstrate MetaSys’s versatility and ease-of-use by implementing and evaluating three cross-layer techniques for: (i) prefetching for graph analytics; (ii) bounds checking in memory unsafe languages, and (iii) return address protection in stack frames; each technique only requiring ~100 lines of Chisel code over MetaSys. Using MetaSys, we perform the first detailed experimental study to quantify the performance overheads of using a single metadata management system to enable multiple cross-layer optimizations in CPUs. We identify the key sources of bottlenecks and system inefficiency of a general metadata management system. We design MetaSys to minimize these inefficiencies and provide increased versatility compared to previously-proposed metadata systems. Using three use cases and a detailed characterization, we demonstrate that a common metadata management system can be used to efficiently support diverse cross-layer techniques in CPUs.
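The common basis MetaSys provides can be pictured as a tag-to-metadata lookup consulted at memory access time. The toy model below shows bounds checking, use case (ii), in that style; tagging granularities and the hardware interface are heavily simplified here:

```python
class MetadataTable:
    """Toy software model of the idea: memory ranges carry a tag ID, and a
    lookup at access time returns the metadata a cross-layer technique needs
    (here, base/size for bounds checking). The real hardware tags at a fixed
    block granularity through a dedicated interface; this is per-byte for
    simplicity."""
    def __init__(self):
        self.meta = {}            # tag id -> (base, size)
        self.tags = {}            # address -> tag id

    def map_alloc(self, tag, base, size):
        self.meta[tag] = (base, size)
        for a in range(base, base + size):
            self.tags[a] = tag

    def check(self, addr):
        tag = self.tags.get(addr)
        if tag is None:
            raise MemoryError(f"untagged access at {addr:#x}")
        base, size = self.meta[tag]
        assert base <= addr < base + size

mt = MetadataTable()
mt.map_alloc(tag=1, base=0x1000, size=64)
mt.check(0x1000 + 10)             # in bounds: passes silently
try:
    mt.check(0x1000 + 80)         # past the tagged allocation
except MemoryError as err:
    print("caught:", err)
```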
Bio
Konstantinos Kanellopoulos is currently pursuing his PhD at ETH Zurich in the SAFARI research group (https://safari.ethz.ch/). He completed his MEng and BSc at NTUA. His research interests lie at the intersection of software and hardware with a focus on OS/Hardware co-design.
-
Abstract
Despite groundbreaking technological innovations and revolutions, the Memory Wall is still a major performance obstacle for modern systems. Hardware prefetching is a widely deployed latency-tolerance technique that has proven successful at shrinking the processor-memory performance gap. However, state-of-the-art hardware cache prefetchers are far from approaching the performance of an ideal prefetcher. Virtual memory is a memory management technique that has been vital to the success of computing due to its unique programmability and security benefits. However, virtual memory does not come for free, due to the page walk memory references introduced for fetching address translation entries. Virtual memory makes the Memory Wall taller by requiring the memory hierarchy to be traversed potentially multiple times upon TLB misses. To make matters worse, the advent of emerging workloads with massive data and code footprints that experience high TLB miss rates places tremendous pressure on the memory hierarchy due to frequent page walks, threatening the performance of computing. Our work demonstrates that hardware prefetching has the potential to attenuate the Memory Wall bottleneck in virtual memory systems. In this direction, we (i) propose fully legacy-preserving hardware prefetching schemes for the last-level TLBs that aim at reducing the TLB miss rates of both data and instruction references, and (ii) exploit address translation metadata available at the microarchitecture level to improve the effectiveness of hardware cache prefetchers operating in the physical address space, without opening new security vulnerabilities.
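To get a feel for direction (i): the toy TLB model below adds the simplest possible sequential next-page prefetcher (illustrative only, not the proposed schemes) and halves the demand misses of a streaming scan:

```python
def tlb_misses(pages, size=64, next_page_prefetch=False):
    """Tiny fully-associative LRU TLB model: count demand misses over a trace
    of virtual page numbers, optionally prefetching page+1 on each miss."""
    tlb, misses = [], 0
    def touch(p, demand):
        nonlocal misses
        if p in tlb:
            tlb.remove(p); tlb.append(p)     # refresh LRU position
            return
        if demand:
            misses += 1
        tlb.append(p)
        if len(tlb) > size:
            tlb.pop(0)                        # evict least recently used
    for p in pages:
        hit = p in tlb
        touch(p, demand=True)
        if next_page_prefetch and not hit:
            touch(p + 1, demand=False)        # prefetches are not demand misses
    return misses

trace = [i // 8 for i in range(4096)]         # sequential scan, 8 accesses per page
print(tlb_misses(trace), tlb_misses(trace, next_page_prefetch=True))  # 512 256
```

Real TLB prefetchers must of course handle far less regular patterns; that is where the proposed schemes and the translation-metadata hints of direction (ii) come in.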
Bio
Georgios Vavouliotis is a 4th year Ph.D. student at the Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, supervised by Marc Casas and Lluc Alvarez while closely collaborating with professors Daniel A. Jimenez, Boris Grot, and Paul Gratz. Georgios also holds an Electrical and Computer Engineering diploma from the National Technical University of Athens, where he was supervised by professors Vasileios Karakostas and Georgios Goumas. His work has been published at top-tier computer architecture conferences and has received several distinctions.
-
Abstract
Reliability evaluation in the early stages of microprocessor design varies in the level of hardware modelling accuracy, the speed of the evaluation, and the granularity of the assessment report. In this talk, we revisit the system vulnerability stack for transient faults and present a new microarchitecture-driven methodology for fast and accurate reliability evaluation. We reveal severe pitfalls in widely used vulnerability measurement approaches that operate at the software or architecture abstraction layers, aiming to assess the effects of hardware faults. These approaches separate the hardware and software layers for the assessment, under the assumption that they can reasonably model the effect of hardware faults on the software layer. However, thanks to their speed advantage, they have eventually become common practice for evaluating overall system resilience. We show that estimations based on higher abstraction layers can deliver contradicting results compared to cross-layer methodologies. To address this, we present AVGI, a new cross-layer evaluation methodology, which delivers orders-of-magnitude faster assessment of the Architectural Vulnerability Factor (AVF) of a microprocessor chip, while retaining the high accuracy of cross-layer reliability evaluation (including the microarchitecture, architecture, and software layers).
Bio
George Papadimitriou is a postdoctoral researcher in the Dept. of Informatics and Telecommunications at the University of Athens. He earned a PhD in Computer Science from the same department in 2019, where he worked with Prof. Dimitris Gizopoulos. His research focuses on dependable and energy-efficient computer architectures, microprocessor reliability, functional correctness of hardware designs and design validation of microprocessors. He has published more than 35 papers in international conferences and journals.
-
15:00-15:30 | Break
-
Abstract
Recent shell-script parallelization systems enjoy mostly automated parallel speedups by compiling scripts ahead of time. Unfortunately, such static parallelization is hampered by the dynamic behaviors pervasive in shell scripts, such as variable expansion and command substitution, which often require reasoning about the current state of the shell and filesystem. We present a just-in-time (JIT) shell-script compiler, PaSh-JIT, that intermixes evaluation and parallelization during a script's run-time execution. JIT parallelization collects run-time information about the system's state, but must not alter the behavior of the original script and must maintain minimal overhead. PaSh-JIT addresses these challenges by (1) using a dynamic interposition framework, guided by a static preprocessing pass, (2) developing runtime support for transparently pausing and resuming shell execution, and (3) operating as a stateful server, communicating with the current shell by passing messages, all without requiring modifications to the system's underlying shell interpreter. When run on a wide variety of benchmarks, including the POSIX shell test suite, PaSh-JIT does not break scripts, even in cases that are likely to break shells in widespread use, and offers significant speedups whenever parallelization is possible.
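The payoff being chased can be seen in miniature: a pure, line-streaming command can run over input chunks in parallel with its outputs concatenated in order. The sketch below parallelizes a grep-like filter this way; the real system does this for whole pipelines, at run time, without breaking shell semantics:

```python
# Simplified illustration of data-parallel shell semantics, not PaSh-JIT's
# machinery: split, filter in parallel, concatenate in order.
import re
from multiprocessing import Pool

def grep_chunk(args):
    pattern, lines = args
    return [l for l in lines if re.search(pattern, l)]

def parallel_grep(pattern, lines, workers=4):
    n = max(1, len(lines) // workers)
    chunks = [lines[i:i + n] for i in range(0, len(lines), n)]
    with Pool(len(chunks)) as pool:
        parts = pool.map(grep_chunk, [(pattern, c) for c in chunks])
    return [l for part in parts for l in part]   # order-preserving merge

if __name__ == "__main__":
    lines = [f"request {i} status={'ok' if i % 7 else 'error'}" for i in range(100_000)]
    assert parallel_grep("error", lines) == [l for l in lines if "error" in l]
    print("parallel grep matches sequential grep")
```

The hard part, and the subject of the talk, is knowing at run time that a command is pure enough for this transformation to be safe.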
Bio
Konstantinos Kallas is a 5th year PhD student at the University of Pennsylvania working with Rajeev Alur. His area of interest is the intersection of programming languages and computer systems. The first paper in this line of work, introducing PaSh, a system for parallelizing shell scripts, received the best paper award at EuroSys 2021. He has also published papers improving the shell at ICFP 2021, HotOS 2021, and OSDI 2022. More information at https://angelhof.github.io/
-
Abstract
When processing join queries over big data, a DBMS can become unresponsive due to the sheer size of the output that has to be returned to the user. However, oftentimes, users have preferences over the answers and only some of these answers are required. To exploit this, and guarantee that intermediate results are relatively small, new query processing algorithms are necessary. We develop “any-k” algorithms that return the most important answers as quickly as possible, followed by the rest in quick succession. For a large class of queries, the top results are returned in linear time in input size, even if the entire set of answers is much larger. Our prototype implementation of any-k outperforms by orders of magnitude the traditional DBMS approach that first performs the join and then sorts the output in order to find the top answers. Project website: https://northeastern-datalab.github.io/anyk/
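The flavor of ranked enumeration is captured by the classic k-smallest-pairs pattern below: a toy two-relation join whose answers stream out in increasing combined weight without materializing the full cross product. The paper's any-k algorithms generalize this far beyond the simple case sketched here:

```python
import heapq

def anyk_join(r, s):
    """Ranked enumeration over a toy join: r and s are lists of
    (weight, value) sorted by weight, and every r-tuple joins with every
    s-tuple. Answers are yielded in increasing combined weight using a
    frontier heap, never building all |r| x |s| results up front."""
    if not r or not s:
        return
    heap = [(r[0][0] + s[0][0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        w, i, j = heapq.heappop(heap)
        yield w, r[i][1], s[j][1]
        for ni, nj in ((i + 1, j), (i, j + 1)):   # expand the frontier
            if ni < len(r) and nj < len(s) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (r[ni][0] + s[nj][0], ni, nj))

r = [(1, "a"), (3, "b"), (9, "c")]
s = [(2, "x"), (4, "y")]
for answer in anyk_join(r, s):
    print(answer)   # (3,'a','x'), (5,'a','y'), (5,'b','x'), (7,'b','y'), ...
```

A user who only wants the top handful of answers pays nothing for the millions that are never requested, which is exactly the contrast with join-then-sort.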
Bio
Nikolaos (Nikos) Tziavelis is a 5th year PhD candidate at Northeastern University, advised by Mirek Riedewald and Wolfgang Gatterbauer. He received a Diploma in Electrical and Computer Engineering from the National Technical University of Athens. His research interests lie in novel algorithms for query processing, efficient data representations, and distributed computing, for which he is generously supported by a Google PhD fellowship.
-
Abstract
The development of quantum computers has been advancing rapidly in recent years. In addition to researchers and companies building bigger and bigger machines, these computers are already being actively connected to the internet and offered as cloud-based quantum computer services. As quantum computers become more widely accessible, potentially malicious users could try to execute their code on the machines to leak information from other users, to interfere with or manipulate the results of other users, or to reverse engineer the underlying quantum computer architecture and its intellectual property. To analyze such new security threats to cloud-based quantum computers, this presentation will cover recent research on and evaluation of different types of quantum computer viruses. It will also introduce a first-of-its-kind quantum computer antivirus as a new means of protecting the expensive and fragile quantum computer hardware from such viruses. The novel antivirus can analyze quantum computer programs, also called circuits, and detect possibly malicious ones before they execute on quantum computer hardware. As a compile-time technique, it introduces no new overhead at the run time of the quantum computer.
Bio
Theodoros Trochatos is a 2nd year PhD student at Yale University under the supervision of Prof. Jakub Szefer and member of the Computer Architecture and Security Laboratory (CASLAB). Before joining Yale, he received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens in 2021, where he completed his thesis in CSLab under the supervision of Prof. Dionisios Pnevmatikatos and Prof. Vasileios Karakostas. His research interests broadly encompass computer architecture and hardware security of computing systems, including security of quantum computers.
5th Computing Systems Research Day - 7 January 2020
Schedule
-
Abstract
Modern datacenters are growing at phenomenal speeds and sizes that would have been considered impractical just ten years ago. The latest mega sites are now 250 MW and growing. Memory in recent years has also emerged as the most precious silicon in datacenters, because online services host data in memory for tight latency constraints and containerised third-party workloads often run in memory for faster turnaround. Unfortunately, memory capacity scaling has slowed down with Moore's Law. Modern software stacks and services also heavily fragment memory, exacerbating the pressure on it. In this talk, I will make the case that today's server blades are derived from the desktop PC and OS of the '80s, with the CPU dominating access to memory and the OS orchestrating movement to/from memory through legacy abstractions and interfaces. I will then present promising avenues for a clean-slate server design with novel abstractions for a tighter integration of memory with not just an accelerator ecosystem but also network and storage.
Bio
Babak Falsafi is Professor in the School of Computer and Communication Sciences and the founding director of the EcoCloud research center at EPFL. He has worked on server architecture for quite some time and has had contributions to a few industrial platforms including the WildFire/WildCat family of multiprocessors by Sun Microsystems (now Oracle), memory system technologies for IBM BlueGene/P and Q and ARM cores, and server evaluation methodologies in use by AMD, HP and Google (PerfKit). His recent work on scale-out server processor design laid the foundation for the first server-grade ARM CPU, Cavium ThunderX. He is a fellow of ACM and IEEE.
-
11:00-11:30 | Break
-
Abstract
The increasing demand for main memory capacity from big data analytics in datacenters and exascale computing environments is driving the integration of heterogeneous memory technologies. The new technologies exhibit vastly greater differences in access latencies, bandwidth, and capacity compared to traditional NUMA systems. Leveraging this heterogeneity while also delivering application performance enhancements requires intelligent data placement. We present Kleio, a page scheduler with machine intelligence for applications that execute across hybrid memory components. Kleio is a hybrid page scheduler that combines existing, lightweight, history-based data tiering methods for hybrid memory with novel intelligent placement decisions based on deep neural networks. We contribute new understanding of the scope of benefits that can be achieved by using intelligent page scheduling in comparison to existing history-based approaches, and of the choice of deep learning algorithms and parameters that are effective for this problem space. Kleio incorporates a new method for prioritizing the pages that yield the highest performance boost, while limiting the resulting system resource overheads. Our performance evaluation indicates that Kleio closes on average 80% of the performance gap between the existing solutions and an oracle with knowledge of future access patterns. Kleio provides hybrid memory systems with fast and effective neural network training and prediction accuracy levels, which bring significant application performance improvements with limited resource overheads, laying the grounds for its practical integration in future systems. Kleio was a best paper award finalist at HPDC 2019 (28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, AZ, USA, June 22-29, 2019).
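The prioritization step can be caricatured in a few lines: spend the expensive per-page neural network only on pages whose access patterns a history-based policy is likely to get wrong. The variance metric and budget below are invented stand-ins for the paper's actual method:

```python
import statistics

def pick_ml_pages(page_history, budget):
    """Toy prioritization: rank pages by how erratic their per-interval access
    counts are (pure history mispredicts bursty pages most) and give only the
    top `budget` pages to the learned predictor. Metric and budget are
    illustrative, not Kleio's."""
    scored = sorted(page_history,
                    key=lambda p: statistics.pvariance(page_history[p]),
                    reverse=True)
    return set(scored[:budget])

history = {
    0xA: [9, 0, 8, 1, 9, 0],   # bursty: a good candidate for the DNN
    0xB: [5, 5, 5, 5, 5, 5],   # steady: history-based tiering is already right
    0xC: [0, 0, 1, 0, 0, 0],   # cold: not worth a model
}
print(pick_ml_pages(history, budget=1))   # {10}, i.e. page 0xA
```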
Bio
Thaleia Doudali is a PhD student in Computer Science at Georgia Tech advised by Ada Gavrilovska. Her current research focuses on building system-level solutions that optimize application performance on systems with heterogeneous memory components, such as DRAM and Non Volatile Memory. Thaleia has industry experience and patents from internships at AMD, VMware and Dell EMC. Prior to Georgia Tech, she received an undergraduate diploma in Electrical and Computer Engineering from the National Technical University of Athens where she was advised by Nectarios Koziris and Ioannis Konstantinou.
-
Abstract
With explosive growth in dataset sizes and increasing machine memory capacities, per-application memory footprints are commonly reaching into hundreds of GBs. Such huge datasets pressure the TLB, resulting in frequent misses that must be resolved through a page walk, a long-latency pointer chase through multiple levels of the in-memory radix tree-based page table. To accelerate page walks, we introduce Address Translation with Prefetching (ASAP), a light-weight technique for directly indexing individual levels of the page table radix tree. Direct indexing enables ASAP to fetch nodes from deeper levels of the page table without first accessing the preceding levels, thus lowering the page walk latency. ASAP is non-speculative and is fully legacy-preserving, requiring no modifications to the existing radix tree-based page table, TLBs and other software and hardware mechanisms for address translation.
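The enabling arithmetic is simple: if each level of the page table is allocated from a contiguous per-level region (the key assumption behind direct indexing; the base addresses below are hypothetical), every level's entry address can be computed straight from the virtual address and fetched in parallel, instead of being discovered one pointer at a time:

```python
def pte_addresses(va, level_bases):
    """Direct indexing of a 4-level x86-64-style page table, simplified.
    level_bases[k] is the (hypothetical) physical base of the contiguous
    region holding all level-k table pages, root first."""
    addrs = []
    for level, base in enumerate(level_bases):   # 0 = root (PML4) ... 3 = leaf PTEs
        shift = 39 - 9 * level                   # 9 VA bits select an entry per level
        index = (va >> shift) & ((1 << (9 * (level + 1))) - 1)
        addrs.append(base + 8 * index)           # 8-byte entries
    return addrs

bases = [0x100000, 0x200000, 0x400000, 0x800000]  # invented per-level pools
for a in pte_addresses(0x7F1234567000, bases):
    print(hex(a))        # all four fetches can be issued at once
```

A serial walk needs the contents of each level to find the next; here every address depends only on the VA, which is what lets the prefetches overlap.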
Bio
Dmitrii Ustiugov is a senior PhD student at the University of Edinburgh (UoE), co-advised by Prof. Boris Grot (UoE) and Prof. Edouard Bugnion (EPFL). His research interests span Computer Systems and Architecture, with a focus on software and hardware support for memory systems.
-
Abstract
We propose synergistic software and hardware mechanisms that alleviate the address translation overhead in virtualized systems. On the software side, we propose contiguity-aware (CA) paging, a novel physical memory allocation technique that creates larger-than-a-page contiguous mappings while preserving the flexibility of demand paging. CA paging is applicable to the hypervisor and guest OS memory manager independently, as well as in native systems. On the hardware side, we propose SpOT, a simple micro-architectural mechanism to hide TLB miss latency by exploiting the regularity of large contiguous mappings to predict address translations. We implement and emulate the proposed techniques for the x86-64 architecture in Linux and KVM, and evaluate them across a variety of memory-intensive workloads. Our results show that: (i) CA paging is highly effective at creating vast contiguous mappings in both native and virtualized scenarios, even when memory is fragmented, and (ii) SpOT exploits the provided contiguity and reduces address translation overhead of nested paging from ~18% to ~1.2%.
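The prediction side rests on simple arithmetic: within a contiguous mapping, the physical address tracks the virtual address at a constant offset. A minimal sketch (anchor addresses invented; the real mechanism verifies each guess against the actual page walk before committing):

```python
def predict_translation(va, anchor_va, anchor_pa):
    """Core of contiguity-based translation prediction, simplified: inside a
    large contiguous mapping, PA = anchor_pa + (va - anchor_va), so a TLB miss
    can be served speculatively while the page walk checks the guess."""
    return anchor_pa + (va - anchor_va)

# CA paging mapped a large region contiguously starting at these addresses:
anchor_va, anchor_pa = 0x7F40_0000_0000, 0x1_2000_0000
va = anchor_va + 123 * 4096 + 42
print(hex(predict_translation(va, anchor_va, anchor_pa)))
```

The synergy in the paper is that CA paging makes such contiguous regions common, which is precisely what makes this one-line predictor accurate.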
Bio
Chloe Alverti is a 2nd year PhD student at the Computing Systems Laboratory (CSLab) of the Electrical and Computer Engineering School (ECE, NTUA) under the supervision of professor Georgios Goumas. Her research focuses on efficient virtual memory mechanisms, spanning both operating system and architectural optimizations. Preliminary results of her current work were presented in the PACT 2018 ACM Student Research Competition poster session, where they won 1st place. Prior to her PhD studies, she was employed for two years as a research assistant at Chalmers University of Technology, Sweden, working on the FP7 EU project EuroServer under the supervision of professor Per Stenstrom.
-
13:00-14:00 | Lunch Break
-
Abstract
Pushing functionality to the hardware layer offers numerous advantages, such as increased performance, transparency of functionality, software simplification, and energy efficiency. In this presentation, I will talk about two hardware-based solutions for increasing security as well as performance. The first introduces microarchitectural changes inside the CPU to implement Instruction Set Randomization and Control Flow Integrity. The second leverages heterogeneous hardware architectures by utilizing hybrid GPU-CPU systems to increase computational performance. I will close the presentation with an outlook on future research directions.
Bio
Dr. Sotiris Ioannidis received a BSc degree in Mathematics and an MSc degree in Computer Science from the University of Crete in 1994 and 1996 respectively. In 1998 he received an MSc degree in Computer Science from the University of Rochester and in 2005 he received his PhD from the University of Pennsylvania. Ioannidis held a Research Scholar position at the Stevens Institute of Technology until 2007 and is a Research Director at the Institute of Computer Science (ICS) of the Foundation for Research and Technology Hellas (FORTH). In 2019 he was elected Associate Professor at the School of Electrical and Computer Engineering of the Technical University of Crete (TUC). He has been a Member of the ENISA Permanent Stakeholders Group (PSG) since 2017. His research interests are in the area of systems and network security, security policy, privacy and high-speed networks. Ioannidis has authored more than 150 publications in international conferences and journals, as well as book chapters, and has both chaired and served in numerous program committees in prestigious conferences such as ACM CCS and IEEE S&P. Ioannidis is a Marie-Curie Fellow and has participated in numerous international and European projects.
-
15:00-15:30 | Break
-
Abstract
The promise of automatic parallelization, freeing programmers from the error-prone and time-consuming process of making efficient use of parallel processing resources, remains unrealized. For decades, the imprecision of memory analysis limited the applicability of non-speculative automatic parallelization. The introduction of speculative automatic parallelization overcame these applicability limitations, but, even in the case of no misspeculation, these speculative techniques exhibit high communication and bookkeeping costs for validation and commit. This work presents Perspective, a speculative-DOALL parallelization framework that maintains the applicability of speculative techniques while approaching the efficiency of non-speculative ones. Unlike current approaches, which subsequently apply speculative techniques to overcome the imprecision of memory analysis, Perspective combines the first speculation-aware memory analyzer, new efficient speculative privatization methods, and a parallelization planner to find the best-performing set of parallelization techniques. By reducing speculative parallelization overheads in ways not possible with prior parallelization systems, Perspective obtains higher overall program speedup (23.0x for 12 general-purpose C/C++ programs running on a 28-core shared-memory machine) than Privateer (11.5x), the most applicable prior automatic speculative-DOALL system.
Bio
Sotiris Apostolakis is a PhD student in the Liberty Research group at Princeton University, under the supervision of Prof. David I. August. He has also collaborated with Prof. Simone Campanoni from Northwestern University. His research focus is on compilers and automatic parallelization. During his internships at Facebook (summer 2018) and Intel (summer 2017), he worked on binary analysis. Before joining Princeton, he earned his diploma in Electrical and Computer Engineering at the National Technical University of Athens, where he worked with Georgios Goumas, Nikela Papadopoulou, and Nectarios Koziris on performance prediction of large-scale systems.
-
Abstract
Extreme heterogeneity in high-performance computing has led to a plethora of programming models for intra-node programming. The increasing complexity of those approaches and the lack of a unifying model has rendered the task of developing performance-portable applications intractable. To address these challenges, we present the Data-centric Parallel Programming (DAPP) concept, which decouples program definition from its optimized implementation. The latter is realized through Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that combines fine-grained data dependencies with high-level control-flow and is amenable to program transformations. We demonstrate the potential of the data-centric viewpoint with OMEN, a state-of-the-art quantum transport (QT) solver. We reduce the original C++ code of OMEN from 15k lines to 3k lines of Python code and 2k SDFG nodes. We subsequently tune the generated code for two of the fastest supercomputers in the world (June 2019), and achieve up to two orders of magnitude higher performance; sustained 85.45 Pflop/s on 4,560 nodes of Summit (42.55% of the peak) in double precision, and 90.89 Pflop/s in mixed precision.
Bio
Alex Ziogas is a PhD student at the Scalable Parallel Computing Laboratory at ETH Zurich, under the supervision of Prof. Torsten Hoefler. He received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens, under the supervision of Prof. Georgios Goumas. His research interests lie in performance optimization and modeling for parallel and distributed computing systems. Recently, he has been working on data-centric representations and optimizations for High-Performance Computing applications. He was awarded the 2019 Gordon Bell prize for his work on optimizing Quantum Transport Simulations.
-
Abstract
Remote Procedure Calls are widely used to connect datacenter applications with strict tail-latency service level objectives on the scale of microseconds. Existing solutions utilize streaming or datagram-based transport protocols for RPCs that impose overheads and limit design flexibility. Our work exposes the RPC abstraction to the endpoints and the network, making RPCs first-class datacenter citizens and allowing for in-network RPC scheduling. We propose R2P2, a UDP-based transport protocol specifically designed for RPCs inside a datacenter. R2P2 exposes pairs of requests and responses and allows efficient and scalable RPC routing by separating RPC target selection from request and reply streaming. Leveraging R2P2, we implement a novel join-bounded-shortest-queue (JBSQ) RPC load balancing policy, which lowers tail latency by centralizing pending RPCs in the router and ensures that requests are only routed to servers with a bounded number of outstanding requests. The R2P2 router logic can be implemented either in a software middlebox or within a P4 switch ASIC pipeline. Our evaluation shows that the protocol is suitable for microsecond-scale RPCs and that its tail latency outperforms both random selection and classic HTTP reverse proxies. The P4-based implementation of R2P2 on a Tofino ASIC adds less than 1 microsecond of latency, whereas the software middlebox implementation adds 5 microseconds of latency and requires only two CPU cores to route RPCs at 10 Gbps line rate. R2P2 improves the tail latency of web index searching on a cluster of 16 workers operating at 50% of capacity by 5.7x over NGINX.
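The effect of centralizing pending RPCs is easy to reproduce in a toy queueing model. The sketch below uses a join-shortest-queue stand-in for JBSQ (real JBSQ additionally bounds per-server outstanding requests to hide router-to-server latency); all rates are invented:

```python
import random

def p99_latency(policy, n_req=20_000, n_srv=16):
    """Toy 16-server model: Poisson arrivals at ~67% load, heavy-tailed
    service times. 'random' mimics a random proxy; 'central' dispatches each
    request to the least-loaded server, approximating centralized JBSQ."""
    random.seed(0)
    t, free_at, lat = 0.0, [0.0] * n_srv, []
    for _ in range(n_req):
        t += random.expovariate(n_srv / 3.0)     # mean service time is 2 us
        svc = random.paretovariate(2.0)          # heavy-tailed service times
        if policy == "random":
            s = random.randrange(n_srv)
        else:
            s = min(range(n_srv), key=lambda i: free_at[i])
        start = max(t, free_at[s])
        free_at[s] = start + svc
        lat.append(free_at[s] - t)
    lat.sort()
    return lat[int(0.99 * n_req)]

for policy in ("random", "central"):
    print(policy, round(p99_latency(policy), 2), "us")
```

Even this crude model reproduces the qualitative result: centralizing dispatch collapses the queueing tail that random placement creates, which is what R2P2 achieves in-network at line rate.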
Bio
Marios Kogias is a 5th year PhD student at EPFL working with Edouard Bugnion. His research focuses on datacenter systems, and specifically on microsecond-scale Remote Procedure Calls. He is interested in improving the tail-latency of networked systems by rethinking both operating systems mechanisms and networking while leveraging new emerging datacenter hardware for in-network compute. Marios has interned at Microsoft Research, Google, and CERN, and he is an IBM PhD Fellow.
-
Abstract
Over the past decade, a plethora of systems have emerged to support data analytics in various domains such as SQL and machine learning, among others. In each of the data analysis domains, there are now many different specialized systems that leverage domain-specific optimizations to efficiently execute their workloads. An alternative approach is to build a general-purpose data analytics system that uses a common execution engine and programming model to support workloads in different domains. In this work, we choose representative systems of each class (Spark, TensorFlow, Presto and Hive) and benchmark their performance on a wide variety of machine learning and SQL workloads. We perform an extensive comparative analysis on the strengths and limitations of each system and highlight major areas for improvement for all systems.
Bio
Evdokia Kassela is a PhD student and researcher at the Computing Systems Laboratory of the National Technical University of Athens (NTUA). She received her diploma in Electrical and Computer Engineering from NTUA in 2013. Her research interests lie in the field of distributed systems, cloud computing, and big-data technologies. Nikodimos Provatas was born in 1993 in Athens, Greece. He graduated from the School of Electrical and Computer Engineering of NTUA in 2016 with a diploma thesis in the field of distributed systems. In 2016 he started his PhD at the Computing Systems Laboratory. His research interests focus on machine learning on big data.
-
17:30-17:45 | Break
-
Abstract
Preserving computational performance increases after the end of traditional CMOS scaling relies on making the most of emerging technologies such as devices, memories, photonics, specialized architectures, and others. However, if we are wildly successful in accelerating computation, the bottleneck will quickly shift to communication or the system management of heterogeneous resources. In this talk, I will discuss my current and future approach to changing the computational model and architectures to better fit emerging devices, as well as providing the capability to evaluate emerging devices rapidly at the system scale to better guide future device and architecture research. Then, I will discuss communication, where photonics can provide the means to efficiently perform resource disaggregation and also increase the performance and power efficiency of hierarchical networks by better matching network connectivity to application demands.
Bio
George Michelogiannakis is a research scientist at Lawrence Berkeley National Laboratory and an adjunct professor at Stanford University. He has extensive work on networking (both off- and on-chip) and computer architecture. His latest work focuses on the post-Moore's-law era, looking into specialization, emerging devices (transistors), memories, photonics, and 3D integration. He is also currently working on optics and architecture for HPC and datacenter networks.
4th Computing Systems Research Day - 7 January 2019
Schedule
-
Abstract
To meet ever-increasing computational needs within a fixed power budget, computing systems are forced to adopt more efficient computational engines. With the end of Moore and Dennard scaling, technology alone cannot satisfy these needs, hence systems incorporate heterogeneous accelerators, that is, units optimized for specific sets of functions. At the Microprocessor and Hardware Laboratory of the Technical University of Crete we have a long track record of research in reconfigurable accelerators. I will give a brief overview of our recent work on accelerators for data-intensive big-data applications (classification and frequent subgraph mining), streaming applications (stream join and ECM exponential sketch generation), and bioinformatics. These works have been designed and prototyped for high-performance reconfigurable platforms such as Convey and Maxeler.
Bio
Dionisis Pnevmatikatos is a Professor and Director of the Microprocessor and Hardware Laboratory at the School of Electrical and Computer Engineering of the Technical University of Crete. He received his PhD in Computer Science from the University of Wisconsin-Madison in 1995. His research interests include Computer Architecture with a focus on the use of Reconfigurable Logic for the creation of efficient accelerators in heterogeneous parallel systems. He has also worked on the design of dependable systems, architectures for hardware or reconfigurable logic application acceleration, network packet processors, and related areas. He has served as coordinator of the European research project FASTER (FP7) and as Principal Investigator in the European research projects DeSyRe (FP7), AXIOM, dRedBox, and EXTRA (H2020), as well as several national projects. He is a regular member of Program Committees at key conferences in his area such as FPL and DATE.
-
13:45-14:30 | Break
-
Abstract
Single core processing power has stagnated, forcing us to use increasingly complex processing systems in order to extract performance: multicores, GPUs, asymmetric multiprocessors, distributed computing, computation offloading. Writing fast and correct programs for them is tough even for experts. For most programmers it is almost impossible. Existing development tools do not help enough. Most analysis and optimization is left to the programmer, while the decisions that our tools can make are often suboptimal or wrong. With hardware becoming more complicated, the gap between what we need our tools to do and what they can achieve will only grow. In this talk, I will present a new method for bridging this gap. The central idea is substituting expert understanding of how code is structured and works with automatically trained deep neural networks. Such learned models can give us all the information we need to analyze the code and drive optimization decisions. This approach allows us to build new powerful tools with little human input, even less expertise, and in a mostly language agnostic way, dramatically reducing the difficulty and cost of creating such tools.
Bio
Pavlos Petoumenos is a Senior Researcher at the University of Edinburgh and a Research Fellow of the Royal Academy of Engineering. His work focuses on code optimization techniques for performance, energy, and size. Much of his recent output explores ways of automating optimization decisions through machine and deep learning. He was awarded a PhD from the University of Patras for his work on cache sharing and cache replacement techniques.
-
Abstract
The continuous growth of computer systems has introduced a new era for computing. The performance and power gains that came through advancements in transistor technology driven by Moore’s law have begun to diminish as Dennard scaling hits its physical limits. The increasing demand for performance, along with resource constraints, has brought energy and power efficiency to the forefront of the research agenda: power efficiency is imposed by thermal problems in modern chips, while energy efficiency is needed for long-lasting batteries and low electricity costs. The inability of multi-core processors to meet these requirements has shifted research towards heterogeneous architectures. This work explores scheduling techniques on single-ISA heterogeneous architectures, and more specifically on ARM big.LITTLE systems. The state-of-the-art schedulers for big.LITTLE systems are based on the default Time Preemptive Scheduling mechanism of the Linux kernel, which can miss rapid phase changes in the workload. This work proposes a novel scheduling mechanism, called Context Preemptive Scheduling, that exploits features of the ARM architecture to closely track phase changes in running programs and invoke the scheduler’s migration process in a timely manner.
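As a rough illustration of phase tracking (not the talk’s ARM-specific mechanism), the Linux sketch below samples hardware counters, computes IPC per interval, and flags a phase change when IPC shifts sharply; the sampling interval, the 30% threshold, and the reaction are our assumptions for illustration.

```cpp
// Illustrative sketch only: track program phases by sampling hardware
// counters and computing IPC per interval; a sharp IPC shift is treated
// as a phase change. A big.LITTLE scheduler would react by migrating the
// task (e.g., via sched_setaffinity); constants here are assumptions.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cmath>
#include <cstdint>
#include <cstdio>

static int open_counter(uint64_t config) {
    perf_event_attr attr{};
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;
    // Count events of the calling process, on any CPU.
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main() {
    int fd_ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (fd_ins < 0 || fd_cyc < 0) { perror("perf_event_open"); return 1; }

    uint64_t prev_ins = 0, prev_cyc = 0;
    double prev_ipc = -1.0;
    for (;;) {
        usleep(1000);  // sampling interval (assumption: 1 ms)
        uint64_t ins = 0, cyc = 0;
        if (read(fd_ins, &ins, sizeof ins) != sizeof ins) break;
        if (read(fd_cyc, &cyc, sizeof cyc) != sizeof cyc) break;
        uint64_t di = ins - prev_ins, dc = cyc - prev_cyc;
        prev_ins = ins; prev_cyc = cyc;
        if (dc == 0) continue;
        double ipc = (double)di / (double)dc;
        if (prev_ipc > 0.0 && std::fabs(ipc - prev_ipc) / prev_ipc > 0.3) {
            std::printf("phase change: IPC %.2f -> %.2f\n", prev_ipc, ipc);
            // A real big.LITTLE scheduler would invoke migration here.
        }
        prev_ipc = ipc;
    }
    return 0;
}
```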
Bio
Ioanna Alifieraki is a software engineer at Canonical Ltd. Prior to this, she was a PhD student at the University of Manchester. She received her Diploma in Electrical and Computer Engineering from NTUA in 2014.
-
Abstract
Over the past few years, a large body of research has been devoted to optimizing sparse matrix-vector multiplication (SpMV) on General Purpose Graphics Processing Units (GPGPUs). Numerous sparse matrix formats and associated algorithms have been proposed, with different strengths and weaknesses. However, while previous works focus primarily on parallelization strategies that tackle load imbalance, we emphasize that other SpMV bottlenecks have not been thoroughly addressed on GPGPUs. To this end, we present a bottleneck-aware SpMV auto-tuner (BASMAT), a holistic approach to optimizing SpMV on GPGPUs that addresses all encountered bottlenecks, focusing both on fast execution and low preprocessing cost.
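As background, the kernel being tuned is short; a reference CSR implementation is shown below (sequential CPU code for clarity, since GPGPU variants typically parallelize across rows or non-zeros). It makes the bottlenecks concrete: indexed loads from x and variable-length rows.

```cpp
// Reference CSR sparse matrix-vector multiply, y = A*x. The two classic
// bottleneck sources are visible directly: rows have variable numbers of
// non-zeros (load imbalance when parallelized) and x is read through an
// index (irregular memory traffic).
#include <vector>

struct CSR {
    int rows = 0;
    std::vector<int>    row_ptr;  // size rows+1: row start offsets
    std::vector<int>    col_idx;  // column of each non-zero
    std::vector<double> values;   // value of each non-zero
};

void spmv(const CSR& A, const std::vector<double>& x,
          std::vector<double>& y) {
    for (int r = 0; r < A.rows; ++r) {
        double sum = 0.0;
        for (int k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];  // indexed load of x
        y[r] = sum;
    }
}
```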
Bio
Athena Elafrou is a graduate of the Electrical and Computer Engineering (ECE) School of NTUA. She is currently a PhD candidate with the parallel systems research group of CSLab at ECE/NTUA. Her current research interests focus on high-performance sparse linear algebra and deep learning on parallel systems.
-
16:00-16:30 | Break
-
Abstract
Recently proposed dataplanes for microsecond-scale applications, such as IX and ZygOS, use non-preemptive policies to schedule requests to cores. For the many real-world scenarios where request service times follow distributions with high dispersion or a heavy tail, they allow short requests to be blocked behind long requests, which leads to poor tail latency. Shinjuku is a single-address-space operating system that uses hardware support for virtualization to make preemption practical at the microsecond scale. This allows Shinjuku to implement centralized scheduling policies that preempt requests as often as every 5 microseconds and work well for both light- and heavy-tailed request service time distributions. We demonstrate that Shinjuku provides significant tail latency and throughput improvements over IX and ZygOS for a wide range of workload scenarios. For the case of a RocksDB server processing both point and range queries, Shinjuku achieves up to 6.6x higher throughput and 88% lower tail latency.
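Shinjuku’s actual preemption relies on hardware virtualization support; the sketch below is only a software approximation of the policy, where a timer thread arms a 5 us slice by setting a flag that the worker polls. It conveys why requeuing long requests protects short ones, not the real mechanism; all names and constants are illustrative.

```cpp
// Software-only approximation of microsecond time-slicing (not the real
// mechanism, which delivers true preemption interrupts). Long requests
// whose slice expires are requeued so short requests behind them are not
// blocked; this is the policy the abstract credits for good tail latency.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>

struct Req { int id; int work_left; };   // remaining work, in small units

std::atomic<bool> preempt{false};
std::mutex q_mtx;
std::deque<Req> run_queue;

void timer() {
    for (;;) {
        std::this_thread::sleep_for(std::chrono::microseconds(5));
        preempt.store(true, std::memory_order_release);  // slice expired
    }
}

void do_unit_of_work() {                  // stand-in for request code
    for (volatile int i = 0; i < 100; ++i) {}
}

void worker() {
    for (;;) {
        Req r;
        {
            std::lock_guard<std::mutex> g(q_mtx);
            if (run_queue.empty()) continue;
            r = run_queue.front();
            run_queue.pop_front();
        }
        preempt.store(false, std::memory_order_release);  // fresh slice
        while (r.work_left > 0) {
            do_unit_of_work();
            --r.work_left;
            if (r.work_left > 0 &&
                preempt.load(std::memory_order_acquire)) {
                std::lock_guard<std::mutex> g(q_mtx);  // requeue the rest
                run_queue.push_back(r);
                break;
            }
        }
        if (r.work_left == 0) std::printf("request %d done\n", r.id);
    }
}

int main() {
    {
        std::lock_guard<std::mutex> g(q_mtx);
        run_queue.push_back({1, 100000});  // long request
        run_queue.push_back({2, 10});      // short request behind it
    }
    std::thread t(timer), w(worker);
    t.join(); w.join();                    // demo runs until killed
}
```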
Bio
Kostis Kaffes is a PhD student in Electrical Engineering at Stanford University, advised by Christos Kozyrakis. His research interests lie in the areas of computer systems, cloud computing, and scheduling. Recently, he has been working on end-host preemptive scheduling for microsecond-scale tail latency. Previously, he completed his Diploma in Electrical and Computer Engineering at the National Technical University of Athens, where he worked with Nectarios Koziris and Georgios Goumas on interference-aware VM scheduling.
-
Abstract
Modern large-scale computer clusters benefit significantly from elasticity, which allows a cluster to dynamically allocate resources based on the user’s fluctuating workload demands. Many cloud providers use threshold-based approaches, which have proven difficult to configure and optimise, while others use reinforcement learning and decision-tree approaches, which struggle with large multidimensional cluster states. In this work we use Deep Reinforcement Learning techniques to achieve automatic elasticity. We present three variants of a Deep Reinforcement Learning agent, called DERP (Deep Elastic Resource Provisioning), that takes the current multi-dimensional state of a cluster as input and converges to the optimal elasticity behaviour after a finite number of training steps. The system automatically decides when to request or release VM resources from the provider and orchestrates them inside a NoSQL cluster according to user-defined policies/rewards. We compare our agent to state-of-the-art Reinforcement Learning and decision-tree based approaches in demanding simulation environments and show that it earns up to 1.6x higher lifetime rewards. We then test our approach in a real-life cluster environment and show that the system resizes clusters in real time and adapts its performance across a variety of demanding optimisation strategies, input loads, and training loads.
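The abstract leaves the learning rule unspecified; as background, a standard deep-Q-learning objective that agents of this family minimize is sketched below. This is an assumption about the general method family, not necessarily DERP’s exact formulation.

```latex
% Generic deep Q-learning objective (illustrative; not necessarily
% DERP's exact loss). s: multi-dimensional cluster state, a: elasticity
% action (request/release VMs), r: the user-defined reward,
% \gamma: discount factor, \theta^-: target-network weights.
L(\theta) = \mathbb{E}_{(s,a,r,s')}\!\left[
  \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_{\theta}(s, a) \right)^{2}
\right]
```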
Bio
Constantinos Bitsakos is a graduate of the Electrical and Computer Engineering (ECE) School of NTUA. He worked in industry for 5 years as a full-stack web developer. He is currently a PhD candidate with the distributed systems research group of CSLab at ECE/NTUA. His current research interests focus on deep reinforcement learning and game-theoretic approaches applied to elasticity in cloud computing.
-
Abstract
In recent years we have observed rapid growth of large-scale analytics applications in a wide range of domains, from healthcare infrastructures to traffic management. The high volume of data that needs to be processed has stimulated the development of special-purpose frameworks that handle the data deluge by parallelizing data processing across multiple computing nodes. These frameworks differ significantly in the policies they follow to decompose their workloads into tasks and in the way they exploit the available computing resources. As a result, depending on the framework in which an application is implemented, we observe significant variations in resource utilization and execution time, so determining the appropriate framework for executing a big-data application is not trivial. In this work we propose Orion, a novel resource negotiator for cloud infrastructures that supports multiple big-data frameworks such as Apache Spark, Apache Flink and TensorFlow. Given an application, Orion determines the most appropriate framework to assign it to and reserves the resources required for the application to meet its performance requirements. Our negotiator exploits state-of-the-art prediction techniques to estimate the application’s execution time when it is assigned to a specific framework with varying configuration parameters and processing resources.
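To make the negotiator’s role concrete, a minimal sketch of such a framework-selection decision follows; the structure, names, and tie-breaking rule are our assumptions, not Orion’s actual algorithm.

```cpp
// Minimal skeleton of a framework-selection decision (names and scoring
// are assumptions; the real negotiator's models and search are richer).
#include <string>
#include <vector>

struct Candidate {
    std::string framework;   // e.g. "Spark", "Flink", "TensorFlow"
    int cores;               // resources that would be reserved
    double predicted_time;   // from a learned performance model
};

// Keep the cheapest candidate that meets the deadline, breaking ties by
// predicted execution time; nullptr means no configuration qualifies.
const Candidate* negotiate(const std::vector<Candidate>& cands,
                           double deadline) {
    const Candidate* best = nullptr;
    for (const auto& c : cands) {
        if (c.predicted_time > deadline) continue;  // misses requirement
        if (!best || c.cores < best->cores ||
            (c.cores == best->cores &&
             c.predicted_time < best->predicted_time))
            best = &c;
    }
    return best;
}
```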
Bio
Nikolaos Chalvantzis is a graduate of the Electrical and Computer Engineering (ECE) School of NTUA. He is a PhD candidate with the distributed systems research group of CSLab at ECE/NTUA. His current research interests and publications focus on distributed systems, cloud elasticity and resource provisioning. Nikolaos also holds a degree in Music (String Performance).
3rd Computing Systems Research Day - 8 January 2018
Schedule
-
Abstract
Cloud computing promises flexibility, high performance, and low cost. Despite its prevalence, most datacenters hosting cloud computing services still operate at very low utilization, posing serious scalability concerns. There are several reasons behind low cloud utilization, dominated by overly conservative users trying to avoid the unpredictable performance of multi-tenancy. A crucial system that can improve the efficiency of cloud infrastructures, while guaranteeing high performance for each submitted application, is the cluster manager: the system that orchestrates where applications are placed and how many resources they receive. In this talk, I will first describe Quasar, a cluster management system that leverages practical ML techniques to quickly determine the type and amount of resources a new cloud application needs to satisfy its quality-of-service constraints. Quasar also introduces a new declarative interface to cluster management, where users express their applications’ performance requirements, not resource requirements, to the system. We have built and deployed Quasar in local clusters as well as production systems, including Twitter and AT&T, and showed that it guarantees high application performance while improving system utilization by 2-3x. Second, I will talk about the security vulnerabilities cloud multi-tenancy creates, and show how ML techniques similar to those used in Quasar can enable an adversary to extract confidential information about an application and negatively impact its performance. Finally, I will briefly discuss the direction in which cloud applications and systems are evolving, and how big data can help us improve the way we design and manage these complex, large-scale systems.
Bio
Christina Delimitrou is an assistant professor of Electrical and Computer Engineering, and Computer Science at Cornell, working in computer architecture, systems, and applied data mining. She is a member of the Computer Systems Lab and directs the SAIL group at Cornell. Christina has received a PhD in Electrical Engineering from Stanford University. She previously earned an MS in Electrical Engineering, also from Stanford, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. She is the recipient of a John and Norma Balen Sesquicentennial Faculty Fellowship, a Facebook Research Fellowship, and a Stanford Graduate Fellowship.
-
14:00-14:30 | Break
-
Abstract
The vast amount of new data being generated is outpacing the development of infrastructures and continues to grow at much higher rates than Moore’s law, a problem commonly referred to as the “data deluge”. It leaves current machines struggling to reach exascale processing power by 2020, while energy sets a second, bottom-side limit: a reasonable power envelope for future supercomputers has been projected at 20 MW, yet the world’s current No. 2 supercomputer, Sunway TaihuLight, delivers 93 PFlops while already consuming 15.37 MW. In other words, we have so far reached less than 10% of the exascale target while already consuming more than 75% of the targeted energy limit. The current escape path follows the paradigm of disaggregating and disintegrating resources, while massively introducing optical technologies for interconnects. Disaggregating computing from memory and storage modules allows flexible, modular settings where hardware can be tailored to the energy and performance targets of each application. At the same time, optical interconnect and photonic integration technologies are rapidly replacing electrical interconnects, penetrating ever deeper levels of the hierarchy. In this work, we will discuss the main performance and energy challenges currently faced by the computing industry and present our recent research on photonic technologies towards realizing resource disaggregation at all hierarchy levels, spanning from rack- through board- down to disintegrated chip-scale computing.
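The two percentages follow directly from the quoted figures:

```latex
% Percentages implied by the quoted numbers (1 EFlops = 1000 PFlops):
\frac{93~\mathrm{PFlops}}{1000~\mathrm{PFlops}} \approx 9.3\%
\quad \text{of the exascale target},
\qquad
\frac{15.37~\mathrm{MW}}{20~\mathrm{MW}} \approx 77\%
\quad \text{of the power envelope}.
```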
Bio
Dr. Nikos Pleros joined the faculty of the Department of Informatics, Aristotle University of Thessaloniki, Greece, in September 2007, where he is currently serving as an Assistant Professor. He obtained the Diploma and the PhD Degree in Electrical & Computer Engineering from the National Technical University of Athens (NTUA) in 2000 and 2004, respectively. His research interests include optical interconnect technologies and architectures, photonic integrated circuit technologies, optical technologies for disaggregated data center architectures and high-performance computing, optical RAM memories and optical caches, silicon photonics and plasmonics, optical signal processing, optical switching and fiber-wireless technologies and protocols for 5G mobile networks.
-
Abstract
FPGA- and GPU-based accelerators have recently become first-class citizens in datacenters. Despite their high cost, however, accelerators remain underutilized for long periods of time, as vendors prefer to dedicate them to specific workloads for guaranteed QoS. At the same time, accelerator sharing is difficult due to vendor-locked communication paths between software applications and the hardware. In this work in progress, we modified the agents of Apache Mesos with Vinetalk, an accelerator middleware that abstracts the entire communication path between OS processes and accelerator hardware while adding no more than 10% performance overhead. We demonstrate the ease of integrating software applications with GPUs and, in collaboration with ICCS, with FPGA logic. Finally, we show that Vinetalk-enhanced Mesos allows analytics pipelines, such as Apache Spark, to use, for the first time, executors with heterogeneous characteristics.
Bio
Christos Kozanitis is a research collaborator at FORTH-ICS. He received his M.S. and Ph.D. in Computer Science and Engineering from the University of California, San Diego in 2009 and 2013, respectively. Parts of his PhD work influenced products from companies such as Cisco and Illumina. He also held a two-year postdoctoral appointment at the AMP Lab of the University of California, Berkeley, where he used and adapted state-of-the-art big-data technologies, such as Apache Spark SQL, Apache Parquet and Apache Avro, to process large amounts of DNA sequencing data. His current research interests involve improving the software, storage, and hardware levels of modern datacenters to speed up the processing of big-data workloads.
-
Abstract
In this work we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to proceed without any synchronization and without being affected by concurrent modifications. The novelty of RCU-HTM lies in leveraging HTM to permit multiple updating threads to execute concurrently. After appropriately modifying the private copy, we execute an HTM transaction, which atomically validates that all the affected parts of the tree have remained unchanged since they were read and, only if this validation is successful, installs the copy in the tree structure. We apply RCU-HTM to AVL and Red-Black balanced BSTs and compare their performance to state-of-the-art lock-based, non-blocking, RCU- and HTM-based BSTs. Our experimental evaluation reveals that BSTs implemented with RCU-HTM achieve high performance, not only for read-only operations but also for update operations. More specifically, our evaluation covers a diverse range of tree sizes and operation workloads and reveals that BSTs based on RCU-HTM outperform other alternatives by more than 18%, on average, on a multi-core server with 44 hardware threads.
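For readers unfamiliar with HTM, the sketch below illustrates the validate-and-install step described above, using Intel RTM intrinsics as one widely available HTM. The surrounding tree algorithm, retry policy, and memory reclamation are omitted or simplified, and the helper names are ours, not the paper’s.

```cpp
// Sketch of an RCU-HTM-style install step for a BST: the updater has
// built a private copy of the affected nodes; one short transaction
// checks that the copied region is unchanged and swings the parent
// pointer atomically. Readers traverse with no synchronization.
// Compile with -mrtm on a TSX-capable x86 CPU.
#include <immintrin.h>
#include <mutex>

struct Node {
    int key;
    Node* left;
    Node* right;
};

std::mutex fallback_lock;  // simplified fallback path

// Replace `victim` (here: the left child of `parent`) with the privately
// prepared `copy`, provided the link still points to the node we copied.
bool install_copy(Node* parent, Node* victim, Node* copy) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (parent->left != victim)  // tree changed under us: abort so
            _xabort(0xff);           // the caller rebuilds its copy
        parent->left = copy;         // becomes visible atomically at _xend
        _xend();
        return true;                 // old subtree must be reclaimed via
    }                                // a grace period (omitted here)
    // Aborted (conflict, capacity, ...). A real implementation retries a
    // few times; we fall back to a lock, revalidating before installing.
    std::lock_guard<std::mutex> g(fallback_lock);
    if (parent->left != victim) return false;  // caller retries
    parent->left = copy;
    return true;
}
```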
Bio
Dimitrios Siakavaras is a Ph.D. candidate at the Computing Systems Laboratory of the National Technical University of Athens (NTUA). His research interests include concurrent programming, concurrent data structures and transactional memory. He received his Diploma in Electrical and Computer Engineering from NTUA in 2012.
-
16:00-16:30 | Break
-
Abstract
Review-based recommender systems have become dominant in recent years. In these systems, the traditional user-item ratings matrix is augmented with textual evaluations of the items by the users. In this talk, we are going to explore how this extra information source can be incorporated into matrix factorization algorithms, which constitute the state of the art in recommender systems. More specifically, we will examine a special category of machine learning techniques for text analysis known as neural language models. The talk will conclude with the presentation of some preliminary results of the discussed techniques on reference datasets.
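As background, the matrix factorization baseline fits latent user vectors p_u and item vectors q_i to the observed ratings; the talk’s question is how review text should enter this formulation (the coupling is model-specific and not specified in the abstract):

```latex
% Regularized matrix factorization baseline; \mathcal{K} is the set of
% observed (user, item) ratings. How review-text embeddings enter the
% objective varies by model and is not specified in the abstract.
\min_{P,\,Q} \sum_{(u,i) \in \mathcal{K}}
  \left( r_{ui} - p_u^{\top} q_i \right)^{2}
  + \lambda \left( \lVert p_u \rVert^{2} + \lVert q_i \rVert^{2} \right)
```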
Bio
Georgios Alexandridis is an electrical and computer engineer and a post-doc affiliate of the Intelligent Systems, Content and Interaction Laboratory of the National Technical University of Athens (NTUA). He graduated from the Department of Electrical and Computer Engineering of the University of Patras and also holds a doctoral degree from the School of Electrical and Computer Engineering of NTUA. His research interests are in the areas of Machine Learning, Artificial Intelligence and Big Data analysis.
-
Abstract
Task-based dataflow programming models and runtimes are promising candidates for programming multicore and manycore architectures. These programming models dynamically analyze task dependencies at runtime and concurrently schedule independent tasks onto the processing elements. In such models, cache locality and efficient utilization of the on-chip cache resources are critical for performance and energy efficiency. In this talk we will describe a number of combined hardware-software approaches that improve data movement and locality in the cache hierarchy and better utilize the on-chip cache resources. We will also present our recent research activities on interconnects and communication primitives for exascale systems.
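For concreteness, the sketch below shows this style using OpenMP tasks, one representative of such models (the talk does not name a specific runtime): the programmer declares each task’s inputs and outputs, and the runtime derives the dependence graph and runs independent tasks concurrently.

```cpp
// Representative task-based dataflow style (OpenMP tasks chosen as one
// widely available example). The runtime infers that the two producer
// tasks are independent and may run concurrently, while the consumer
// waits for both. Compile with -fopenmp.
#include <cstdio>

int main() {
    double a = 0.0, b = 0.0, c = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)              // producer 1
        a = 1.0;
        #pragma omp task depend(out: b)              // producer 2
        b = 2.0;
        #pragma omp task depend(in: a, b) depend(out: c)  // consumer
        c = a + b;
        #pragma omp taskwait                         // wait for all tasks
        std::printf("c = %f\n", c);
    }
}
```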
Bio
Vassilis Papaefstathiou received his Ph.D. in Computer Science (2013) from the University of Crete. From 2001 to 2003 he worked on IC design and verification at ISD S.A. and collaborated closely with STMicroelectronics on industrial SoC designs. From 2005 to 2013 he was a Research Engineer in the Computer Architecture and VLSI Systems Laboratory at the Institute of Computer Science, FORTH, Greece. From 2014 to 2016 he was a Postdoctoral Researcher in the Computer Science and Engineering Department at Chalmers University of Technology, Sweden. Since September 2016 he has been with FORTH. His research interests are in Parallel Computer Architecture, High-Performance Computing, High-Speed Interconnects, Low-Power Datacenter Servers, and Storage Systems.
-
Abstract
The advent of the Big Data era has given birth to a variety of new architectures aiming at increased scalability, robustness and fault tolerance. At the same time, though, these architectures have complicated application structure, leading to exponential growth of the applications’ configuration space and increased difficulty in predicting their performance. In this work, we describe a novel, automated profiling methodology that makes no assumptions about application structure. Our approach utilizes oblique Decision Trees to recursively partition an application’s configuration space into disjoint regions, choose a set of representative samples from each subregion according to a defined policy, and return a model for the entire space as a composition of linear models over the subregions.
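The resulting model has a compact closed form: oblique splits are hyperplanes over several configuration parameters, rather than single-parameter thresholds, and prediction composes the per-region linear models (notation below is ours):

```latex
% Final model: disjoint regions R_r produced by recursive oblique splits
% of the form w^{\top} x \le b (hyperplanes over several features, unlike
% axis-aligned trees), each fitted with its own linear model.
f(x) = \sum_{r} \mathbb{1}\!\left[ x \in R_r \right]
       \left( \alpha_r^{\top} x + \beta_r \right)
```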
Bio
Giannis Giannakopoulos is a Ph.D. candidate at the Computing Systems Laboratory of the National Technical University of Athens (NTUA). His research interests include Large Scale Data Management, Distributed Systems and Cloud Computing. He received his Diploma in Electrical and Computer Engineering from NTUA in 2012.