MONDAY, MARCH 13
11:50 am – 1:00 pm | Price Center Ballroom West
Lunch
1:00 pm – 1:10 pm | Price Center Ballroom East
Opening Remarks
Chair: Hung-Wei Tseng, UCR
1:10 pm – 2:20 pm | Price Center Ballroom East
Keynote I:
Chair: Hung-Wei Tseng, UCR
Breaking Through Memory Walls: Unleashing the Potential of Advanced Memory Technologies
Yang Seok Ki (Samsung)
Speaker: Dr. Yang Seok Ki, Samsung
Abstract: Recent advancements in memory technology, including the adoption of technologies such as CXL and HBM, changes in the industry, such as Intel's decision to drop its Optane business, and the emergence of massive machine learning models such as GPT-3, have highlighted the need to revisit classic memory-wall problems. These changes have brought about both new challenges and opportunities, and it is important to understand how they affect computer system performance and design, particularly in relation to the memory hierarchy and its capabilities. The memory wall can be viewed from four perspectives: bandwidth, latency, capacity, and power. In this presentation, Dr. Ki will provide an overview of the range of products developed by the memory industry across DRAM and NAND to address memory-wall problems. Specifically, Dr. Ki will discuss data-centric computing technologies and their implications for memory and storage in the context of the memory wall, as well as the key technical challenges involved in making such technologies viable solutions with real impact in the industry.
Speaker bio: Dr. Yang Seok Ki is the Vice President of the Memory Solutions Lab (MSL) at Samsung Semiconductor Inc. in San Jose, California. He has been with Samsung since 2011, during which time he has overseen numerous advanced development projects, including SmartSSD, Key-Value SSD, the CXL Memory Expander, and the Memory Semantic SSD. He has also spearheaded the creation of the NVMe Key-Value standard, the SNIA Key-Value API, and the SNIA Computational Storage Architecture and API. Dr. Ki is a member of the Open Compute Project (OCP) Future Technology Initiative (FTI), where he chairs the computational storage workstream. Prior to joining Samsung, he worked in Oracle's Server Technology Group and conducted research in High-Performance Computing (HPC), grid computing, and cloud computing at the Information Sciences Institute of the University of Southern California and the Center for Networked Systems at the University of California, San Diego. He received his Ph.D. in Electrical Engineering and Computer Engineering, as well as his Bachelor's and Master's degrees in Computer Engineering, from Seoul National University, and completed the Engineering Leadership Professional Program (ELPP) at the University of California, Berkeley. Dr. Ki has been granted over 70 patents and has authored more than 50 papers.
2:40 pm – 4:00 pm | Price Center Ballroom East
Session 1: Memorable Paper Award Finalists I (Systems)
Chair: Steven Swanson, UCSD
XRP: In-Kernel Storage Functions with eBPF
Yuhong Zhong (Columbia University); Haoyu Li (Columbia University); Yu Jian Wu (Columbia University); Ioannis Zarkadas (Columbia University); Jeffrey Tao (Columbia University); Evan Mesterhazy (Columbia University); Michael Makris (Columbia University); Junfeng Yang (Columbia University); Amy Tai (Google); Ryan Stutsman (University of Utah); Asaf Cidon (Columbia University)
Speaker: Yuhong Zhong, Columbia University
Abstract: With the emergence of microsecond-scale NVMe storage devices, the Linux kernel storage stack overhead has become significant, almost doubling access times. We present XRP, a framework that allows applications to execute user-defined storage functions, such as index lookups or aggregations, from an eBPF hook in the NVMe driver, safely bypassing most of the kernel’s storage stack. To preserve file system semantics, XRP propagates a small amount of kernel state to its NVMe driver hook where the user-registered eBPF functions are called. We show how two key-value stores, BPF-KV, a simple B+-tree key-value store, and WiredTiger, a popular log-structured merge tree storage engine, can leverage XRP to significantly improve throughput and latency.
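The benefit described above comes from avoiding repeated kernel stack traversals during pointer chasing. The toy cost model below illustrates the idea only; the per-operation costs and function names are illustrative assumptions, not measurements from the paper.

```python
# Sketch: why an in-kernel (eBPF) lookup helps. A B+-tree lookup of height h
# issues h dependent reads. A user-space caller pays the kernel storage stack
# on every read, while an XRP-style hook pays it once and resubmits the
# remaining reads directly from the NVMe driver. Numbers are illustrative.

STACK_OVERHEAD_US = 3.0   # assumed per-crossing kernel storage stack cost
DEVICE_READ_US = 5.0      # assumed NVMe device read latency

def user_space_lookup_us(height):
    # every tree level pays device latency plus a full stack traversal
    return height * (DEVICE_READ_US + STACK_OVERHEAD_US)

def xrp_lookup_us(height):
    # one stack traversal; the eBPF hook resubmits follow-up reads in-kernel
    return STACK_OVERHEAD_US + height * DEVICE_READ_US

print(user_space_lookup_us(3))  # 24.0
print(xrp_lookup_us(3))         # 18.0
```

The gap widens with tree height, which is why index lookups and aggregations are natural candidates for this kind of offload.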
Speaker bio: Yuhong Zhong is a first-year Ph.D. student in Computer Science at Columbia University advised by Asaf Cidon. Yuhong is broadly interested in computer systems, especially storage systems and memory technologies. Before starting his Ph.D., Yuhong was a software engineer at VMware in the vSAN group working on vSAN Express Storage Architecture. He received his master’s degree at Columbia University and his bachelor’s degree at Harbin Institute of Technology.
BlockFlex: Enabling Storage Harvesting with Software-Defined Storage
Benjamin Reidys (UIUC); Jinghan Sun (UIUC); Anirudh Badam (Microsoft Research); Shadi Noghabi (Microsoft Research); Jian Huang (UIUC)
Speaker: Benjamin Reidys, University of Illinois Urbana-Champaign
Abstract: Cloud platforms today make efficient use of storage resources by slicing them among multi-tenant applications on demand. However, our study discloses that cloud storage is still seriously underutilized for both allocated and unallocated storage. Although cloud providers have developed harvesting techniques to allow evictable virtual machines (VMs) to use unallocated resources, these techniques cannot be directly applied to storage resources, due to the lack of systematic support for the isolation of space, bandwidth, and data security in storage devices. In this paper, we present BlockFlex, a learning-based storage harvesting framework, which can harvest available flash-based storage resources at a fine-grained granularity in modern cloud platforms. We rethink the abstractions of storage virtualization and enable transparent harvesting of both allocated and unallocated storage for evictable VMs. BlockFlex explores both heuristics and learning-based approaches to maximize the storage utilization, while ensuring the performance and security isolation between regular and evictable VMs at the storage device level. We develop BlockFlex with programmable solid-state drives (SSDs) and demonstrate its efficiency with various datacenter workloads.
Speaker bio: Benjamin Reidys is a third-year Ph.D. student at the University of Illinois Urbana-Champaign. His research incorporates ideas from operating systems, networking, and computer architecture to optimize memory and storage systems, especially with network/storage co-design.
NVLeak: Off-Chip Side-Channel Attacks via Non-Volatile Memory Systems
Zixuan Wang (UC San Diego); Mohammadkazem Taram (UC San Diego and Purdue University); Daniel Moghimi (Google); Steven Swanson (UC San Diego); Dean Tullsen (UC San Diego); Jishen Zhao (UC San Diego)
Speaker: Zixuan Wang, UC San Diego
Abstract: We study microarchitectural side-channel attacks and defenses on non-volatile RAM (NVRAM) DIMMs. In this study, we first reverse-engineer NVRAMs as implemented by the Intel Optane DIMM and reveal several previously undocumented microarchitectural details: on-DIMM cache structures (NVCache) and wear-leveling policies. Based on these findings, we first develop cross-core and cross-VM covert channels to establish the channel capacity of these shared hardware resources. Then, we devise NVCache-based side channels under the umbrella of NVLeak. We apply NVLeak to a series of attack case studies, including compromising the privacy of databases and key-value storage backed by NVRAM and spying on the execution path of code pages when NVRAM is used as volatile runtime memory. Our results show that side-channel attacks exploiting NVRAM are practical and defeat previously proposed defenses that focus only on on-chip hardware resources. To fill this gap, we develop system-level mitigations based on cache partitioning to prevent side-channel leakage from NVCache. This paper is one of the first to study architectural side-channel attacks in commercial NVRAM products, and its techniques and ideas can be applied to investigate future memory hardware designs.
Speaker bio: Zixuan Wang is currently a 5th-year Ph.D. candidate at UC San Diego, working with Prof. Jishen Zhao and Prof. Steven Swanson. His research spans computer architecture, operating systems, and security. His current research focuses on building memory systems that improve system scalability, programmability, and security.
Leveraging Data Compression for Performance-Efficient and Long-Lasting NVM-based Last-Level Caches
Carlos Escuin (Universidad de Zaragoza); Asif Ali Khan (TU Dresden); Pablo Ibáñez (Universidad de Zaragoza); Teresa Monreal (Universitat Politècnica de Catalunya); Denis Navarro (Universidad de Zaragoza); José M. Llabería (Universitat Politècnica de Catalunya); Jeronimo Castrillon (Center for Advancing Electronics Dresden, TU Dresden); Víctor Viñals (Universidad de Zaragoza)
Speaker: Carlos Escuin, Universidad de Zaragoza
Abstract: Non-volatile memory (NVM) technologies are interesting alternatives for building on-chip last-level caches (LLCs). Compared to SRAM, they offer higher density and lower static power, but each write operation slightly wears out a bitcell, to the point of losing its storage capacity. In this context, this paper summarizes three contributions to the state of the art in NVM-based LLCs. Data compression reduces the size of blocks and, together with wear-leveling mechanisms, can defer the wear-out of NVMs. Moreover, as capacity is reduced by write wear, data compression enables degraded cache frames to allocate blocks whose compressed size still fits. Our first contribution is a microarchitecture design that leverages data compression and intra-frame wear-leveling to gracefully handle NVM-LLC capacity degradation. The second contribution builds on this design to propose new insertion policies for hybrid LLCs that use Set Dueling and take into account the compressibility of blocks. From a methodological point of view, although different approaches have been used in the literature to analyze the degradation of an NVM-LLC, none of them allows studying its temporal evolution in detail. The third contribution is therefore a forecasting procedure that combines detailed simulation and prediction, enabling an accurate analysis of the effect of different cache content mechanisms (replacement, wear leveling, compression, etc.) on the temporal evolution of the performance of multiprocessor systems employing such NVM-LLCs. Using this forecasting procedure, we show that the proposed NVM-LLC organizations and the insertion policies for hybrid LLCs significantly outperform the state of the art in both performance and lifetime metrics.
Speaker bio: Carlos Escuin is a final year Ph.D. candidate in Computer Architecture at the Universidad de Zaragoza, Spain. He received his B.Sc. degree in Computer Science from the Universidad de Zaragoza, Spain, and the M.Sc. degree in High-Performance Computing from the Universitat Politècnica de Catalunya, Spain, in 2016 and 2018, respectively. His current research interests include computer architecture, memory hierarchy, cache memories, non-volatile memories, and computing in memory.
4:20 pm – 5:40 pm | Price Center Ballroom East
4:20 pm – 5:40 pm | Price Center Ballroom West
Session 2A: Persistent Memory
Chair: Mai Zheng, ISU
Snapshot: Fast, Userspace Crash Consistency Using msync
Suyash Mahar (University of California, San Diego); Mingyao Shen (University of California, San Diego); Terence Kelly; Steven Swanson (University of California, San Diego)
Speaker: Suyash Mahar, UC San Diego
Abstract: Crash consistency using persistent memory programming libraries requires programmers to use complex transactions and manual annotations. In contrast, the failure-atomic msync() (FAMS) interface is much simpler, as it transparently tracks updates and guarantees that modified data is atomically durable on a call to the failure-atomic variant of msync(). However, FAMS suffers from several drawbacks, like the overhead of the msync() system call and the write amplification from page-level dirty-data tracking. To address these drawbacks while preserving the advantages of FAMS, we propose Snapshot, an efficient userspace implementation of FAMS. Snapshot uses novel compiler-based annotation to transparently track updates in userspace and syncs them with the backing persistent memory copy on a call to msync(). By keeping a copy in DRAM, Snapshot improves access latency. Moreover, by automatically tracking updates and syncing changes only on a call to msync(), Snapshot provides crash-consistency guarantees, unlike the POSIX msync() system call. For a KV-store backed by Intel Optane running the YCSB benchmark, Snapshot achieves at least 1.2× speedup over PMDK while significantly outperforming non-crash-consistent msync(). On an emulated CXL memory semantic SSD, Snapshot outperforms PMDK by up to 10.9× on all but one YCSB workload, where PMDK is 1.2× faster than Snapshot. Further, Kyoto Cabinet commits perform up to 8.0× faster with Snapshot than with its built-in, msync()-based crash-consistency mechanism.
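The failure-atomic msync() semantics described above can be sketched in a few lines. This is a simplified illustration of the general mechanism (dirty tracking plus an undo log), not Snapshot's actual implementation; all class and method names are invented for this sketch.

```python
# Minimal sketch of failure-atomic msync() semantics: updates go to a DRAM
# working copy; msync() logs the about-to-change persistent bytes before
# updating them in place, so a crash mid-sync can be rolled back.

class FailureAtomicRegion:
    def __init__(self, size):
        self.dram = bytearray(size)   # fast working copy
        self.pmem = bytearray(size)   # simulated persistent copy
        self.dirty = set()            # byte-granularity dirty tracking
        self.undo_log = {}            # offset -> old persistent byte

    def store(self, off, value):
        self.dram[off] = value
        self.dirty.add(off)

    def msync(self, crash_mid_sync=False):
        # phase 1: log old values (made durable before any in-place update)
        for off in self.dirty:
            self.undo_log[off] = self.pmem[off]
        # phase 2: apply; a crash here is recoverable from the undo log
        for i, off in enumerate(sorted(self.dirty)):
            if crash_mid_sync and i == 1:
                return  # simulate a crash after a partial apply
            self.pmem[off] = self.dram[off]
        self.dirty.clear()
        self.undo_log.clear()

    def recover(self):
        for off, old in self.undo_log.items():
            self.pmem[off] = old  # roll back the partial sync
        self.undo_log.clear()

r = FailureAtomicRegion(8)
r.store(0, 1); r.store(1, 2)
r.msync(crash_mid_sync=True)   # crash leaves pmem partially updated
r.recover()
print(bytes(r.pmem))           # old state restored: all zeros
```

A plain POSIX msync() lacks the logging phase, which is why a crash mid-write can leave a mix of old and new pages on the backing store.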
Speaker bio: Suyash Mahar is a third year PhD student at UC San Diego with interest in Memory Systems and their applications. He is advised by Prof. Steven Swanson. Over the years, he has worked on different areas of memory systems with Meta Platforms, Intel Labs, University of Virginia, Technion, and CMU.
Capri: Compiler and Architecture Support for Whole-System Persistence
Jungi Jeong (Purdue University); Jianping Zeng (Purdue University); Changhee Jung (Purdue University)
Speaker: Jianping Zeng, Purdue University
Abstract: This paper investigates whole-system persistence (WSP), which ensures hassle-free crash consistency for all programs while simultaneously leveraging both advantages of non-volatile memory technologies: high density and in-memory persistence. Despite these promising characteristics, two challenges must be addressed to make WSP a reality. First, programs must be able to resume execution from where they failed. Second, failure recovery must be offered to any program, including the OS, in a transparent manner while minimizing persistence overheads. To this end, this paper presents Capri, a compiler and architecture co-designed scheme for region-level whole-system persistence. First, the Capri compiler partitions the program into a series of regions whose boundaries serve as recovery points. Then, the Capri architecture provides the regions with crash consistency through hardware-based atomic updates. Finally, through the novel interplay between the architecture and the compiler, Capri provides failure atomicity on the cheap, i.e., with 0%, 12.4%, and 9.1% performance overheads for the SPEC CPU2017, STAMP, and Splash3 benchmarks, respectively.
Speaker bio: Jianping Zeng is a fifth-year Ph.D. student at Purdue University working with Prof. Changhee Jung. His research interests focus on leveraging compiler and architecture co-design to improve system reliability, such as soft-error resilience and NVM crash consistency. His work has appeared in top-tier venues, e.g., PLDI, MICRO, HPDC, and ISCA.
Odinfs: Scaling PM Performance with Opportunistic Delegation
Diyu Zhou (EPFL); Yuchen Qian (EPFL); Vishal Gupta (EPFL); Zhifei Yang (EPFL); Changwoo Min (Virginia Tech); Sanidhya Kashyap (EPFL)
Speaker: Diyu Zhou, EPFL
Abstract: Existing file systems for persistent memory (PM) exploit its byte-addressable non-volatile access with low latency and high bandwidth. However, they do not utilize two unique PM properties effectively. The first one is contention awareness, i.e., a small number of threads cannot thoroughly saturate the PM bandwidth, while many concurrent accesses lead to significant PM performance degradation. The second one is NUMA awareness, i.e., exploiting the remote PM efficiently, as accessing remote PM naively leads to significant performance degradation. We present Odinfs, a NUMA-aware scalable datapath PM file system that addresses these two challenges using a novel opportunistic delegation scheme. Under this scheme, Odinfs decouples the PM accesses from application threads with the help of background threads that access PM on behalf of the application. Because of PM access decoupling, Odinfs automatically parallelizes the access to PM across NUMA nodes in a controlled and localized manner. Our evaluation shows that Odinfs outperforms existing PM file systems up to 24.7×.
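The delegation idea above, application threads handing PM accesses to a bounded set of node-local background threads, can be sketched with ordinary queues. This is an illustrative user-space analogy (Odinfs is a kernel file system); the names and the two-node layout are assumptions for the sketch.

```python
# Sketch of opportunistic delegation: application threads never touch PM
# directly -- they enqueue requests to per-NUMA-node delegation threads, which
# caps the number of concurrent PM writers and keeps each access node-local.

import queue
import threading

NUM_NODES = 2
pm = [dict() for _ in range(NUM_NODES)]       # per-node "persistent memory"
queues = [queue.Queue() for _ in range(NUM_NODES)]

def delegate_worker(node):
    while True:
        item = queues[node].get()
        if item is None:
            return                            # shutdown sentinel
        off, data, done = item
        pm[node][off] = data                  # only this thread writes node-local PM
        done.set()

workers = [threading.Thread(target=delegate_worker, args=(n,))
           for n in range(NUM_NODES)]
for w in workers:
    w.start()

def pm_write(node, off, data):
    done = threading.Event()
    queues[node].put((off, data, done))
    done.wait()                               # caller blocks until applied

pm_write(0, 64, b'hello')
pm_write(1, 128, b'world')
for q in queues:
    q.put(None)
for w in workers:
    w.join()
print(pm[0][64], pm[1][128])
```

Bounding the worker pool is what addresses the contention-awareness problem: no matter how many application threads exist, at most one thread per node touches that node's PM here.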
Speaker bio: Diyu Zhou is a postdoctoral researcher at EPFL. He completed his Ph.D. at UCLA, advised by Yuval Tamir. His research focuses on building high-performance, scalable, and reliable computer systems.
Temporal Exposure Reduction Protection for Persistent Memory
Yuanchao Xu (North Carolina State University); Chencheng Ye (Huazhong University of Science and Technology); Xipeng Shen (North Carolina State University); Yan Solihin (University of Central Florida)
Speaker: Yuanchao Xu, North Carolina State University
Abstract: The long-lived nature and byte-addressability of persistent memory (PM) amplify the importance of strong memory protections. This paper develops temporal exposure reduction protection (TERP) as a framework for enforcing memory safety. By aiming to minimize the time during which a PM region is accessible, TERP offers a complementary dimension of memory protection. The paper gives a formal definition of TERP, explores the semantic space of TERP constructs, and examines their relations to security and composability in both sequential and parallel executions. It proposes programming-system and architecture solutions to the key challenges in adopting TERP, drawing on novel support in both compilers and hardware to efficiently meet the exposure-time target. Experiments validate the efficacy of the proposed support for TERP, in both efficiency and exposure-time minimization.
Speaker bio: Yuanchao Xu is a fifth-year PhD candidate at North Carolina State University and he has been a student researcher at SystemGroup@Google for two years. His research interest lies in the areas of computer architecture and computer security, with a focus on improving memory security (ASPLOS 2020, ISCA 2020, HPCA 2022), reliability (ISCA 2021, ISCA 2022), and performance (HPCA 2021, MICRO 2021, HPCA 2023, ASPLOS 2023) through computer architecture and system software (compilers, runtime, etc.). His research has influenced the recent development of memory support in industry, with some techniques being actively pursued at Google. He received the Computer Science Outstanding Research Award at NCSU in 2021.
Session 2B: Flash-based Storage
Chair: Richard Wesel, UCLA
LeaFTL: A Learning-based Flash Translation Layer for Solid-State Drives
Jinghan Sun (UIUC); Shaobo Li (UIUC); Yunxin Sun (ETH Zurich); Chao Sun (Western Digital Research); Dejan Vucinic (Western Digital Research); Jian Huang (UIUC)
Speaker: Jinghan Sun, University of Illinois Urbana-Champaign
Abstract: In modern solid-state drives (SSDs), the indexing of flash pages is a critical component in their storage controllers. It not only affects the data access performance, but also determines the efficiency of the precious in-device DRAM resource. A variety of address mapping schemes and optimizations have been proposed. However, most of them were developed with human-driven heuristics. In this paper, we present a learning-based flash translation layer (FTL), named LeaFTL, which learns the address mapping to tolerate dynamic data access patterns via linear regression at runtime. By grouping a large set of mapping entries into a learned segment, it significantly reduces the memory footprint of the address mapping table, which further benefits the data caching in SSD controllers. LeaFTL also employs various optimization techniques, including out-of-band metadata verification to tolerate mispredictions, optimized flash allocation, and dynamic compaction of learned index segments. We implement LeaFTL with both a validated SSD simulator and a real open-channel SSD board. Our evaluation with various storage workloads demonstrates that LeaFTL saves the memory consumption of the mapping table by 2.9× and improves the storage performance by 1.4× on average, in comparison with state-of-the-art FTL schemes.
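The learned-segment idea above can be sketched directly: a run of logical-to-physical mappings that is approximately linear collapses into one (slope, intercept) pair. This is a simplified illustration with invented parameter names, not LeaFTL's actual fitting algorithm.

```python
# Sketch of a learned FTL segment: fit ppn ~= slope*lpn + intercept over a run
# of mappings, accepting the segment only if every prediction is within a
# bounded error gamma. A misprediction at read time would be caught by
# out-of-band metadata on the flash page and corrected.

def fit_segment(lpns, ppns, gamma=0):
    """Fit one linear segment over increasing LPNs; None if error exceeds gamma."""
    slope = (ppns[-1] - ppns[0]) / (lpns[-1] - lpns[0])
    intercept = ppns[0] - slope * lpns[0]
    for l, p in zip(lpns, ppns):
        if abs(round(slope * l + intercept) - p) > gamma:
            return None  # fall back to exact per-page entries
    return (slope, intercept)

# 100 sequentially written pages -> one (slope, intercept) pair instead of
# 100 mapping-table entries.
lpns = list(range(1000, 1100))
ppns = list(range(5000, 5100))
print(fit_segment(lpns, ppns))  # (1.0, 4000.0)
```

Sequential and near-sequential writes, common in real workloads, are exactly the patterns that compress well under this representation.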
Speaker bio: Jinghan is a CS Ph.D. candidate at the University of Illinois Urbana-Champaign. His main research interests include storage systems and ML for systems. More specifically, he is working towards building learning-based storage systems, from application-level indexes to storage devices like SSDs.
FusionFS: Fusing I/O Operations using CISCOps in Firmware File Systems
Jian Zhang (Rutgers University); Yujie Ren (Rutgers University); Sudarsun Kannan (Rutgers University)
Speaker: Jian Zhang, Rutgers University
Abstract: We present FusionFS, a direct-access in-storage filesystem that exploits near-storage computational capability for fast I/O and data processing, consequently reducing I/O bottlenecks. In FusionFS, we introduce a new abstraction, CISCOps, which combines multiple I/O and data processing operations into one fused operation and offloads them for near-storage processing. By offloading, CISCOps significantly reduces dominant I/O overheads such as system calls, data movement, communication, and other software overheads. Further, to enhance the use of CISCOps, we introduce MicroTx, a fine-grained crash consistency and fast (automatic) recovery mechanism for both I/O and data processing operations. Finally, we explore efficient and fair use of in-storage compute resources by proposing a novel Completely Fair Scheduler (CFS) for in-storage compute and memory resources across tenants. Evaluation of FusionFS against state-of-the-art user-level, kernel-level, and firmware-level file systems using microbenchmarks, macrobenchmarks, and real-world applications shows up to 6.12×, 5.09×, and 2.07× performance gains, and 2.65× faster recovery.
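The fused-operation idea can be sketched as one descriptor executed end to end near the data. The descriptor format and names below are invented for illustration; CISCOps' real interface is defined by the paper, not reproduced here.

```python
# Sketch of the CISCOps idea: instead of issuing open/read/process as separate
# host-side calls, the host packs them into one fused descriptor that the
# in-storage runtime executes, returning only the final result across the
# host/device boundary.

FILES = {"/log": b"error:1 ok:2 error:3"}   # simulated in-storage state

def in_storage_execute(descriptor):
    """Runs entirely 'inside the device': one crossing instead of several."""
    data = FILES[descriptor["path"]]                   # fused open + read
    if descriptor["op"] == "count":
        return data.count(descriptor["arg"])           # fused processing step
    raise ValueError("unknown op")

# One host<->device crossing returns the answer, not the raw data.
print(in_storage_execute({"path": "/log", "op": "count", "arg": b"error"}))  # 2
```

The saving is twofold: fewer syscall/submission crossings, and no bulk data movement when only a small derived result is needed.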
Speaker bio: Jian Zhang is a third-year Ph.D. candidate in the computer science department at Rutgers University. His research interests are in file systems, memory management, and computational storage.
Optimizing Write Voltages to Achieve Equal Reliability for All Pages in Flash Memory
Semira Galijasevic (UCLA); Richard Wesel (UCLA)
Speaker: Semira Galijasevic, UCLA
Abstract: This paper uses a mutual-information maximization paradigm to optimize the voltage levels written to cells in a Flash memory. To enable low latency, each page of Flash memory stores only one coded bit in each Flash memory cell. For example, three-level-cell (TLC) Flash has three bit channels, one for each of three pages, that together determine which of eight voltage levels is written to each cell. Each Flash page is required to store the same number of data bits, but the various bits stored in the cell typically do not have to provide the same mutual information. A modified version of dynamic-assignment Blahut-Arimoto (DAB) moves the constellation points and adjusts the probability mass function for each bit channel to increase the mutual information of the worst bit channel, with the goal of each bit channel providing the same mutual information. The resulting constellation provides essentially the same mutual information to each page while negligibly reducing the mutual information of the overall constellation. The optimized constellations feature points that are neither equally spaced nor equally likely. However, modern shaping techniques such as probabilistic amplitude shaping can provide coded modulations that support such constellations.
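The per-page asymmetry the abstract describes is easy to see numerically. The sketch below uses a deliberately simple adjacent-level error model, not the paper's channel model or its DAB algorithm; it only shows that, under a Gray labeling, the three TLC pages see bit channels with different mutual information.

```python
# Sketch: eight TLC levels with a reflected Gray labeling; a written level is
# misread as an adjacent level with probability eps (clamped at the edges).
# Each page's bit flips at a different number of level boundaries, so the
# three bit channels carry different mutual information -- the asymmetry that
# DAB-style optimization is designed to remove.

import math

GRAY = ['000', '001', '011', '010', '110', '111', '101', '100']

def bit_channel_mi(page, eps=0.05):
    # joint distribution over (bit on this page, read level), uniform levels
    joint = {}
    for x in range(8):
        for d, p in ((-1, eps), (0, 1 - 2 * eps), (1, eps)):
            y = min(7, max(0, x + d))
            b = GRAY[x][page]
            joint[(b, y)] = joint.get((b, y), 0.0) + p / 8
    pb, py = {}, {}
    for (b, y), p in joint.items():
        pb[b] = pb.get(b, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # I(B;Y) = sum p(b,y) log2( p(b,y) / (p(b) p(y)) )
    return sum(p * math.log2(p / (pb[b] * py[y]))
               for (b, y), p in joint.items())

mis = [bit_channel_mi(k) for k in range(3)]
print([round(m, 3) for m in mis])  # MSB page highest: 1, 2, then 4 boundaries
```

Here the page whose bit changes at only one level boundary is the most reliable, motivating constellations shaped so that every page ends up with the same mutual information.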
Speaker bio: Semira Galijasevic received an Associate degree in General Science from Santa Monica College in 2019 and a Bachelor of Science degree, summa cum laude, in Electrical Engineering from the University of California, Los Angeles (UCLA) in 2021. Currently, she is working on her Master of Science and Ph.D. degrees at UCLA. Semira is a member of the UCLA Communication Systems Laboratory led by Prof. Richard Wesel. Her current research interests include information theory and coding for storage.
Optimizations of Linux Software RAID System for Next-Generation Storage
Shushu Yi (Peking University); Yanning Yang (Beijing University of Posts and Telecommunications); Yunxiao Tang (Peking University); Zixuan Zhou (Peking University); Junzhe Li (Peking University); Yue Chen (Beijing University of Posts and Telecommunications); Myoungsoo Jung (KAIST); Jie Zhang (Peking University)
Speaker: Shushu Yi, Peking University
Abstract: Redundant Array of Inexpensive Disks (RAID) has been widely adopted to enhance the performance, capacity, and reliability of SSD arrays. However, as SSDs have experienced significant technology shifts, the RAID engine is becoming the performance bottleneck of future storage systems that employ next-generation SSDs. Linux software RAID, referred to as mdraid, can break this performance bound by employing multiple CPU threads to prepare the parity data simultaneously. However, this approach can impose significant software overheads, which in turn place a huge burden on the CPU. Consequently, the performance of mdraid, unfortunately, does not scale as the number of CPU threads and SSD devices increases. Our evaluation results reveal that the overheads of the lock mechanism account for 30.8% of the total delays in the mdraid storage stack. One may consider removing the lock mechanism from mdraid. However, the locks play a critical role in guaranteeing crash consistency and taking charge of data management. To address this, we propose ScalaRAID, which refines the role domain of locks and designs a new data structure to prevent different threads from preempting RAID resources. By doing so, ScalaRAID can maximize thread-level parallelism and reduce the time spent handling I/O requests. Our evaluation results reveal that ScalaRAID can improve throughput by 89.4% while decreasing 99.99th-percentile latency by 85.4% compared to mdraid.
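The contended step in software RAID-5 is the parity read-modify-write, which is why lock scope matters. The sketch below is an illustrative user-space analogy (mdraid's real locking is far more involved); the class and constants are invented, and "refining the role domain of locks" is represented simply as using a per-stripe lock instead of an array-wide one.

```python
# Sketch of RAID-5 parity maintenance under concurrency: every partial-stripe
# write must update the parity chunk as parity' = parity XOR old_data XOR
# new_data, so concurrent writers to one stripe must serialize. Narrowing the
# lock to the stripe (rather than the whole array) lets unrelated stripes
# proceed in parallel.

import threading

NUM_DATA = 3  # data chunks per stripe

class Stripe:
    def __init__(self):
        self.data = [0] * NUM_DATA
        self.parity = 0
        self.lock = threading.Lock()  # per-stripe, not array-wide

    def write_chunk(self, idx, value):
        with self.lock:
            # read-modify-write of the shared parity chunk
            self.parity ^= self.data[idx] ^ value
            self.data[idx] = value

stripe = Stripe()
threads = [threading.Thread(target=stripe.write_chunk, args=(i, 0x10 + i))
           for i in range(NUM_DATA)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# parity equals the XOR of all data chunks regardless of write interleaving
print(hex(stripe.parity))  # 0x13
```

Without the lock, two interleaved read-modify-writes could both read the same old parity and lose one update, which is exactly the crash-consistency/data-management role the abstract says the locks protect.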
Speaker bio: Shushu Yi is a Ph.D. student at Peking University under the supervision of Dr. Jie Zhang. He has a wide range of research interests, including storage systems, networks, distributed storage, and in-storage processing.
5:40 pm – 8:00 pm | BJ’s Restaurant & Brewhouse
Dinner: Outside Patio (Warm Clothes Recommended)
TUESDAY, MARCH 14
8:00 am – 8:50 am | Price Center Ballroom East
Breakfast
9:00 am – 10:20 am | Price Center Ballroom East
Keynote II:
Chair: Paul Siegel, UCSD
Codes for Random Access in Non-Volatile Storage
Yuval Cassuto (Technion - Israel Institute of Technology)
Speaker: Yuval Cassuto, Department of ECE, Technion - Israel Institute of Technology
Abstract: The advent of emerging data-intensive applications forces the computing world to shift from its classical hierarchy of processor-memory-storage to a continuum of processing and memory/storage units of different sizes and performance features. A major unknown on the way to this revolution is: how to guarantee high data reliability in this richer memory/storage hierarchy? The talk will address this question by presenting a new type of error-correcting codes that enable both powerful error correction and low-latency random access, whereas with existing codes one must choose between the two. The new codes have a two-level block structure: each block is divided into smaller sub-blocks, and code-design tools guarantee reliability performance in the sub-block and full-block access modes simultaneously. In addition, we propose an intermediate access mode in which a target sub-block is decoded by also allowing access to a predetermined number of sub-blocks around it, and this number is shown to offer a useful tradeoff between reliability and access speed.
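The two-level block structure can be illustrated with a toy code in which simple XOR parities stand in for the real sub-block and full-block codes (the actual constructions in the talk are far stronger; everything below is a simplified illustration).

```python
# Toy sketch of a two-level code: each sub-block carries a local parity that
# repairs one erased symbol on its own, enabling low-latency random access to
# that sub-block; a full-block parity over the local parities covers the case
# where a sub-block's own parity is lost.

def xor_all(vals):
    acc = 0
    for v in vals:
        acc ^= v
    return acc

def encode(subblocks):
    locals_ = [xor_all(sb) for sb in subblocks]   # per-sub-block parity
    global_ = xor_all(locals_)                    # full-block parity
    return locals_, global_

def read_subblock(sb, local_parity, erased_idx=None):
    """Random access: decode one sub-block without touching the others."""
    if erased_idx is None:
        return list(sb)
    repaired = xor_all(v for i, v in enumerate(sb) if i != erased_idx)
    out = list(sb)
    out[erased_idx] = repaired ^ local_parity
    return out

blocks = [[1, 2, 3], [4, 5, 6]]
locals_, global_ = encode(blocks)
# sub-block 0 with its middle symbol treated as erased is repaired locally
print(read_subblock(blocks[0], locals_[0], erased_idx=1))   # [1, 2, 3]
# a lost local parity is itself recoverable from the full-block level
print(global_ ^ locals_[1] == locals_[0])                   # True
```

The intermediate access mode mentioned in the abstract sits between these extremes: decode a target sub-block with help from a few neighbors, trading access latency for reliability.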
Speaker bio: Yuval Cassuto is an Associate Professor at the Viterbi Department of Electrical and Computer Engineering, Technion – Israel Institute of Technology. His research interests lie at the intersection of the theoretical information sciences and the engineering of practical computing and storage systems. He has served on the technical program committees of leading conferences in both theory and systems. During 2010–2011 he was a Scientist at EPFL, the Swiss Federal Institute of Technology in Lausanne. From 2008 to 2010 he was a Research Staff Member at Hitachi Global Storage Technologies, San Jose Research Center. In 2018–2019 he held a Visiting Professor position at Western Digital Research and a Visiting Scholar position at UC Berkeley. He received the B.Sc. degree in Electrical Engineering, summa cum laude, from the Technion in 2001, and the M.S. and Ph.D. degrees in Electrical Engineering from the California Institute of Technology in 2004 and 2008, respectively. From 2000 to 2002, he was with Qualcomm, Israel R&D Center, where he worked on modeling, design, and analysis in wireless communications. Dr. Cassuto won the Best Student Paper Award in data storage from the IEEE Communications Society in 2010 as a student, and in 2019 as an adviser. He has also won faculty awards from Qualcomm, Intel, and Western Digital. As an undergraduate student, he won the 2001 Texas Instruments DSP and Analog Challenge $100,000 prize.
10:50 am – 11:50 am | Price Center Ballroom East
Session 3: Memorable Paper Award Finalists II (Coding)
Chair: Lara Dolecek, UCLA
Trading Partially Stuck Cells with Errors
Haider Al Kim (Institute for Communications Engineering, Technical University of Munich (TUM), Germany);Sven Puchinger (Hensoldt Sensors GmbH, 89077 Ulm, Germany);Ludo Tolhuizen (Philips Research, High Tech Campus 34, Netherlands);Antonia Wachter-Zeh (Institute for Communications Engineering, Technical University of Munich (TUM), Germany);
Speaker: Haider Al Kim, Technical University of Munich (TUM)
Abstract: This work considers coding for \emph{partially stuck} memory cells. Such cells can only store partial information, as some of their levels cannot be used due to, e.g., wearout. First, we present a $2^{\mu}$-ary partially-stuck-cell code construction (over the finite field $\mathbb{F}_{2^{\mu}}$, where integer $\mu>1$) for \emph{masking} partially stuck cells while correcting substitution errors. "Masking" finds a word whose entries coincide with writable levels at the (partially) stuck cells. We then investigate a technique where the encoder, after a first masking step, introduces errors at some partially stuck positions of a codeword in order to satisfy the stuck-at constraints. It turns out that treating some of the partially stuck cells as erroneous cells can decrease the required redundancy for some parameters (see, e.g., Lemma 2).
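The masking idea can be illustrated with a toy sketch (a deliberate simplification over the integers mod $q$, not the paper's $\mathbb{F}_{2^{\mu}}$ construction): for partially-stuck-at-1 cells that cannot store level 0, adding a common shift to every cell can steer all stuck positions to writable levels.

```python
def mask_partially_stuck(word, stuck, q):
    """Find a common shift z so that every partially-stuck-at-1 position
    in `stuck` (a cell that cannot store level 0) ends up at a nonzero,
    i.e. writable, level.  Each stuck cell rules out at most one value
    of z, so a valid shift always exists when len(stuck) < q."""
    for z in range(q):
        if all((word[i] + z) % q != 0 for i in stuck):
            masked = [(v + z) % q for v in word]
            return masked, z  # the reader recovers the data by subtracting z
    raise ValueError("too many stuck cells: no valid shift exists")

# Mask a length-4 word over q = 4 levels where cells 0 and 3 are partially stuck.
masked, z = mask_partially_stuck([0, 1, 2, 0], [0, 3], 4)
```

The shift value must be stored as redundancy so the decoder can invert it; the paper's constructions achieve much better redundancy/error-correction tradeoffs than this naive scheme.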
Speaker bio: Haider Al Kim has been a doctoral researcher in Coding and Cryptography at the Technical University of Munich (TUM) since October 2018. He completed a Bachelor of Science in Information and Communication Engineering with honors in 2008 at the University of Baghdad, where he ranked second in a class of 27 students. Haider completed a Master of Engineering in Telecommunication Networks at the University of Technology, Sydney (UTS), Australia, in 2015, achieving a High Distinction (HD) and a GPA of 3.7 out of 4. His research interests include coding theory and its applications in fields such as wireless communication, data storage, and cryptography.
Polar Coded Merkle Tree: Improved Detection of Data Availability Attacks in Blockchain Systems
Debarnab Mitra (UCLA);Lev Tauz (UCLA);Lara Dolecek (UCLA);
Speaker: Debarnab Mitra, UCLA
Abstract: In blockchain systems, light nodes are known to be vulnerable to data availability (DA) attacks, where they accept an invalid block with unavailable portions. Previous works have used LDPC and 2-D Reed-Solomon (2D-RS) codes with Merkle trees to mitigate DA attacks. While these codes improve the DA detection probability, they are difficult to apply to blockchains with large blocks due to generally intractable code guarantees at large codelengths (LDPC), large decoding complexity (2D-RS), or large coding fraud proof sizes (2D-RS). We address these issues by proposing the novel Polar Coded Merkle Tree (PCMT), a Merkle tree built from the encoding graphs of polar codes, together with a specialized polar code construction called Sampling-Efficient Freezing (SEF). We demonstrate that the PCMT with SEF polar codes performs well in detecting DA attacks for large block sizes.
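As background for the construction above, a Merkle tree commits to a block so that a light node can verify that any single chunk belongs to it using only a logarithmic-size proof. The sketch below is a generic illustration (a plain SHA-256 tree, not the paper's polar-coded PCMT):

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    """Hash the leaves pairwise up to a single root commitment."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    """Collect the sibling hashes on the path from leaf `idx` to the root."""
    level = [h(x) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[idx ^ 1], idx % 2))  # (sibling hash, is-right-child)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(leaf, proof, root):
    """A light node recomputes the path and compares against the root."""
    node = h(leaf)
    for sib, is_right in proof:
        node = h(sib + node) if is_right else h(node + sib)
    return node == root
```

The paper's contribution replaces the plain hash tree with one built from polar-code encoding graphs, so that unavailable (hidden) portions of a block can also be detected by sampling.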
Speaker bio: Debarnab Mitra is a Ph.D. candidate in the Department of Electrical and Computer Engineering at UCLA. He earned his M.S. degree from the ECE Department at UCLA in 2020, for which he was awarded the Outstanding MS Thesis Award in Signals and Systems. Prior to that, he graduated from IIT Bombay with a B. Tech. (Hons.) in Electrical Engineering and a minor in Computer Science and Engineering in 2018. His research interests include information theory, channel coding, and its applications to blockchains and non-volatile memories.
Polar Codes with Local-Global Decoding
Ziyuan Zhu (UCSD);Wei Wu (UCSD);Paul Siegel (UCSD);
Speaker: Ziyuan Zhu, CMRR, UC San Diego
Abstract: In this paper, we investigate a coupled polar code architecture that supports both local and global decoding. This local-global construction is motivated by practical applications in data storage and transmission where reduced-latency recovery of sub-blocks of the coded information is required. Local decoding allows random access to sub-blocks of the full code block. When local decoding performance is insufficient, global decoding provides improved data reliability. The coupling scheme incorporates a systematic outer polar code and a partitioned mapping of the outer codeword to semipolarized bit-channels of the inner polar codes. Error rate simulation results are presented for 2 and 4 sub-blocks. Design issues affecting the trade-off between local and global decoding performance are also discussed.
Speaker bio: Ziyuan Zhu, a second-year graduate student researcher (GSR) in CMRR, has been actively involved in error-correction coding (ECC) research and programs. He will start the Ph.D. program in Fall 2023, with a focus on data storage and coding theory.
11:50 am – 1:00 pm | Price Center Ballroom West
Lunch
1:00 pm – 2:20 pm | Price Center Ballroom East
1:00 pm – 2:20 pm | Price Center Ballroom West
Session 4A: Heterogeneous Memory Management
Chair: Jishen Zhao, UCSD
Pond: A Case for CXL Memory Pooling in Cloud Datacenters
Huaicheng Li (Virginia Tech);Daniel S. Berger (Microsoft Azure and University of Washington);Lisa Hsu (Unaffiliated);Daniel Ernst (Microsoft Azure);Pantea Zardoshti (Microsoft Azure);Stanko Novakovic (Google);Monish Shah (Microsoft Azure);Samir Rajadnya (Microsoft Azure);Scott Lee (Microsoft);Ishwar Agarwal (Intel);Mark D. Hill (Microsoft Azure and University of Wisconsin-Madison);Marcus Fontoura ();Ricardo Bianchini (Microsoft Azure);
Abstract: Public cloud providers seek to meet stringent performance requirements at low hardware costs. A key driver of performance and cost is main memory. Memory pooling promises to improve DRAM utilization and thereby reduce costs. However, pooling is challenging under cloud performance requirements. We present Pond, the first memory pooling system that meets cloud performance goals and significantly reduces DRAM costs. Pond builds on the Compute Express Link (CXL) standard for load/store access to pool memory and two key insights. First, our analysis of cloud production traces shows that pooling across 8-16 sockets is enough to achieve most of the benefits. This enables a small-pool design with low access latency. Second, it is possible to create machine learning models that can accurately predict how much local and pool memory to allocate to a virtual machine (VM) to resemble same-NUMA-node memory performance. Our evaluation shows that Pond reduces DRAM costs by 7% with performance within 1-5% of same-NUMA-node VM allocations.
Heterogeneous Memory Architectures for Energy-efficient DNN Task Adaptation on Edge Devices
Zirui Fu (Tufts University);Aleksandre Avaliani (Tufts University);Marco Donato (Tufts University);
Speaker: Zirui Fu, Tufts University
Abstract: Executing machine learning inference tasks on resource-constrained edge devices requires careful hardware-software co-design optimizations. Recent examples have shown how transformer-based deep neural network (DNN) models such as ALBERT can be used to enable the execution of natural language processing (NLP) inference on mobile systems-on-chip (SoCs) housing custom hardware accelerators. However, while these existing solutions are effective in alleviating the latency, energy, and area costs of running single NLP tasks, achieving real-time multi-task inference (MTI) requires running computations over multiple variants of the model parameters, which are tailored to each of the targeted tasks. This approach leads to either prohibitive on-chip memory requirements or paying the cost of off-chip memory access. Additionally, the deployment of multiple tailored copies of the model parameters is not a scalable solution when the number of targeted tasks increases. We propose a memory-centric hardware/software co-design optimization, adapter-ALBERT, to solve these conflicts from multiple directions. The proposed DNN model's performance and robustness to data compression methods are evaluated across several language tasks from the GLUE benchmark. We demonstrate the advantage of mapping the model to an SRAM and RRAM heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator to extrapolate performance, power, and area improvements over the execution of a traditional ALBERT model on the same hardware platform.
Speaker bio: Zirui Fu is a Ph.D. student in the Tufts Emerging Circuits and Systems (TECS) Lab at Tufts University. Before joining Tufts, he received his B.S. in Electrical Engineering from the University of California, Irvine in 2017 and his M.S. in Computer Engineering from New York University in 2021. His research mainly focuses on energy-efficient DNN and embedded non-volatile memory co-design and applications.
Merchandiser: Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications with Load-Balance Awareness
Zhen Xie (Argonne National Laboratory);Dong Li (University of California, Merced);
Speaker: Dong Li, University of California, Merced
Abstract: The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. Deciding the placement of data objects on HM is critical for high performance. We reveal a performance problem related to data placement on HM. The problem is manifested as load imbalance among tasks in task-parallel HPC applications. The root of the problem comes from being unaware of parallel-task semantics and an incorrect assumption that bringing frequently accessed pages to fast memory always leads to better performance. To address this problem, we introduce a load balance-aware page management system, named Merchandiser. Merchandiser introduces task semantics during memory profiling, rather than being application-agnostic. Using the limited task semantics, Merchandiser effectively sets up coordination among tasks on the usage of HM to finish all tasks fast instead of only considering any individual task. Merchandiser is highly automated to enable high usability. Evaluating with memory-consuming HPC applications, we show that Merchandiser reduces load imbalance and leads to an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement, compared with a hardware-based solution and an industry-quality software-based solution.
Speaker bio: Dr. Dong Li is an associate professor at the University of California, Merced. He is the director of the Parallel Architecture, System, and Algorithm Lab (PASA) and a co-director of the High Performance Computing Systems and Architecture Group at UC Merced. Previously (2011–2014), he was a research scientist at Oak Ridge National Laboratory (ORNL); before that, he earned a PhD in computer science from Virginia Tech. He is an associate editor for IEEE Transactions on Parallel and Distributed Systems (TPDS). Dong's research focuses on high-performance computing (HPC) and maintains strong relevance to computer systems, especially systems for large-scale AI/ML.
A brief primer on Persistent Memory Objects
Derrick Greenspan (University of Central Florida College of Engineering and Computer Science);Naveed Ul Mustafa (University of Central Florida College of Engineering and Computer Science);Zoran Kolega (University of Central Florida College of Engineering and Computer Science);Mark Heinrich (University of Central Florida College of Engineering and Computer Science);Yan Solihin (University of Central Florida College of Engineering and Computer Science);
Speaker: Naveed Ul Mustafa, University of Central Florida
Abstract: Persistent memory (PM) is expected to augment or replace DRAM as main memory. PM combines byte-addressability with non-volatility, providing an opportunity to host byte-addressable data persistently. This paper identifies and addresses three problems with PM. First, we design a *persistent memory object* (PMO) abstraction that allows data to be retained in memory across process lifetimes and system power cycles. Second, we address the programmability of PM crash-consistency management with a primitive, *psync*, that decouples crash consistency from concurrency management; psync allows the programmer to specify *when* data are crash consistent but conceals *how* that is achieved. Third, we address the security of PMOs at rest against corruption and disclosure attacks, introducing mechanisms that protect against these attacks through encryption and integrity checking. Our PMO design outperforms NOVA-Fortis, a memory-mapped file-based approach providing crash consistency, by $3.61\times$ and $3.2\times$ for two sets of evaluated workloads. Adding protection for at-rest data incurs a modest overhead, between 3% and 46%, depending on the level of protection.
Session 4B: Coding and Devices
Chair: Eyal En Gad, Micron
Generalized Integrated Interleaved Codes for High-Density DRAMs
Yok Jye Tang (The Ohio State University);Xinmiao Zhang (The Ohio State University);
Speaker: Yok Jye Tang, Ohio State University
Abstract: As the density of dynamic random access memories (DRAMs) keeps increasing, which results in higher error rates, the conventional single error correction and double error detection codes are no longer sufficient. Generalized integrated interleaved (GII) codes based on Reed-Solomon (RS) codes are among the best error-correcting codes for high-density DRAMs due to their hyper-speed decoding and good correction capability. However, the very short codeword length required by DRAMs leads to miscorrections that substantially degrade the error-correcting performance of GII-RS codes if untreated. Previous miscorrection mitigation schemes for longer GII-BCH codes lead to additional code rate loss when applied to very short GII-RS codes and hence affect the cost of DRAMs. This paper presents new miscorrection mitigation schemes with improved code rates. A small number of parity bits are allocated in an optimized manner and decoding trials are carried out to close the performance gap. Moreover, low-latency hardware implementation architectures have been developed for the proposed GII-RS decoder. For the example code considered for DRAMs, the proposed decoder reduces the worst-case latency by 45% with small area overhead while having the same average latency and critical path compared to the alternative design.
Speaker bio: Yok Jye Tang received the B.S. degree in electrical and computer engineering from The Ohio State University, Columbus, OH, USA, in 2019, where he is currently pursuing the Ph.D. degree in electrical engineering. His current research interest includes high-performance very-large-scale-integration (VLSI) architectures for error-correcting codes.
Variable Coded Batch Matrix Multiplication
Lev Tauz (UCLA);Debarnab Mitra (UCLA);Lara Dolecek (UCLA);
Speaker: Debarnab Mitra, University of California, Los Angeles
Abstract: We introduce the novel Variable Coded Distributed Batch Matrix Multiplication (VCDBMM) problem which generalizes many previous coded distributed matrix multiplication problems by allowing for matrix products to re-use matrices, thus creating natural redundancy in the system. Inspired in part by Cross-Subspace Alignment codes, we develop Flexible Cross-Subspace Alignment (FCSA) codes that are flexible enough to utilize this natural redundancy and provide a full characterization of FCSA codes which allow for a wide variety of system complexities including good straggler resilience and fast decoding. We theoretically demonstrate that, under certain practical conditions, FCSA codes are within a factor of 2 of the optimal solution in terms of straggler resilience. Furthermore, simulations demonstrate that our codes can achieve even better optimality gaps in practice, even going as low as 1.7.
Speaker bio: Debarnab Mitra is a Ph.D. candidate in the Department of Electrical and Computer Engineering at UCLA. He earned his M.S. degree from the ECE Department at UCLA in 2020, for which he was awarded the Outstanding MS Thesis Award in Signals and Systems. Prior to that, he graduated from IIT Bombay with a B. Tech. (Hons.) in Electrical Engineering and a minor in Computer Science and Engineering in 2018. His research interests include information theory, channel coding, and its applications to blockchains and non-volatile memories.
Enabling Distance-based Addressing in Non-Volatile Memory systems
Ananth Krishna Prasad (University of Utah);Rajeev Balasubramonian (University of Utah);Mahdi Nazm Bojnordi (University of Utah);
Speaker: Ananth Krishna Prasad, University of Utah
Abstract: This paper proposes Distance Addressable Memory, a hardware/software co-designed memory system for accelerating distance-based applications. In-memory indexing of data is enabled through a novel associative-container-based algorithm for approximate similarity search applications. The proposed memory system exposes configuration parameters to the user, which allows for runtime index generation. Initial results show performance competitive with a state-of-the-art hardware accelerator while exposing a larger design space to the user.
Speaker bio: Ananth Krishna Prasad received his bachelor's degree from the Birla Institute of Technology and Science, Pilani. He is currently a graduate research assistant with the School of Computing, University of Utah. His research centers on energy-efficient memory system design, hardware acceleration of big-data applications, and designing non-von Neumann architectures by leveraging emerging NVM technologies.
Improving the Compatibility and Manufacturability of Digital Architectures for Processing Using Resistive Memory
Minh S. Q. Truong (Carnegie Mellon University);Liting Shen (Carnegie Mellon University);Alexander Glass (Carnegie Mellon University);Alison Hoffmann (Carnegie Mellon University);L. Richard Carley (Carnegie Mellon University);James A. Bain (Carnegie Mellon University);Saugata Ghose (University of Illinois Urbana-Champaign);
Speaker: Minh S. Q. Truong, Carnegie Mellon University
Abstract: To mitigate the cost of data movement in modern applications, recent works propose a new computing paradigm called processing-using-memory (PUM), which leverages electrical interactions between memory cells to realize useful computations within the memory arrays, without the need for additional CMOS logic. While PUM can be implemented using several types of memories, prior works build PUM architectures around a specific memory technology. Unfortunately, this architectural specificity, combined with uncertainty around which technologies will ultimately be commercialized at scale, disincentivizes developers from building a much-needed software stack for PUM. We aim to decouple PUM architectures as much as possible from (1) a specific memory technology, as well as (2) a specific logic family (i.e., the device-level specification of how to implement operations for PUM). Building upon the RACER architecture (which uses ReRAM to achieve a 107× speedup and 189× energy savings compared to a 16-core Xeon CPU), we design two key components to increase RACER's compatibility and manufacturability. First, we develop flexible interface circuits that allow RACER to be compatible with any PUM logic family. These interface circuits ensure that most of RACER's circuitry can stay unchanged as we adapt the architecture to new logic families and to other resistive memory technologies. Second, we design a new logic family for ReRAM, called OSCAR, that is widely compatible with contemporary device prototypes. State-of-the-art logic families (e.g., MAGIC, FELIX) require switching constraints that are difficult to achieve in practical devices. OSCAR has significantly relaxed constraints that work within the bounds of existing devices. We show that the modified RACER architecture, using OSCAR, can enable practical PUM on real resistive memory devices while improving performance and energy savings by 30% and 37%, respectively, over the original RACER work.
Both our interface circuitry and OSCAR can be adapted to a wide range of digital PUM architectures, and we believe that this abstraction of technology-specific details is a crucial step towards enabling software development.
Speaker bio: Minh is a Ph.D. student in the Electrical and Computer Engineering Department at Carnegie Mellon University. He received dual B.S. degrees in electrical engineering and computer engineering from the University of California, Davis in 2019. His Ph.D. research seeks to create new classes of computer systems based on the processing-in-memory paradigm that reduce the power consumption of data-intensive applications by orders of magnitude, and to enable efficient edge and cloud computing. His general research interest lies at the intersection of computer systems, microarchitecture, circuits, and holistic computer system design.
2:40 pm – 4:00 pm | Price Center Ballroom East
Session 5A: Future NVM Research (Panel Discussion)
Chair: Jian Huang
Panelists: Andy Rudoff (Intel), Fei Wen (Qualcomm), Steven Swanson (UCSD), and Lara Dolecek (UCLA)
4:20 pm – 5:40 pm | Price Center Ballroom East
4:20 pm – 5:40 pm | Price Center Ballroom West
Session 6A: Memory Resilience and Persistency
Chair: Hung-Wei Tseng, UCR
Security in Era of Persistent Memory
Naveed Ul Mustafa (University of Central Florida College of Engineering and Computer Science);Yuanchao Xu (North Carolina State University);Xipeng Shen (North Carolina State University and Facebook);Yan Solihin (University of Central Florida);
Speaker: Naveed Ul Mustafa, University of Central Florida, FL
Abstract: The Persistent Memory Object (PMO) is a general system abstraction for holding persistent data in persistent main memory, managed by an operating system. The PMO programming model weakens inter-process isolation, as it results in the sharing of persistent data between two processes that alternately access the same PMO. This uncoordinated data access opens a new avenue for cross-run and cross-process security attacks. In this paper, we discuss threat vulnerabilities that are either new or increased in intensity under the PMO programming model. We also discuss the security implications of using PMOs, highlighting sample PMO-based attacks and potential strategies to defend against them.
Speaker bio: Naveed Ul Mustafa is a postdoctoral researcher at ARPERS lab, UCF, FL. His current research interests include computer architecture and security. Contact him at unknown.naveedulmustafa@ucf.edu.
Asynchronous Persistence with ASAP
Ahmed Abulila (Microsoft Corporation);Izzat El Hajj (American University of Beirut);Myoungsoo Jung (KAIST);Nam Sung Kim (UIUC);
Speaker: Ahmed Abulila, Unaffiliated
Abstract: Supporting atomic durability of updates for persistent memories is typically achieved with Write-Ahead Logging (WAL). WAL flushes log entries to persistent memory before making the actual data persistent to ensure that a consistent state can be recovered if a crash occurs. Performing WAL in hardware is attractive because it makes most aspects of log management transparent to software, and it completes log persist operations (LPs) and data persist operations (DPs) in the background, overlapping them with the execution of other instructions. Prior hardware logging solutions commit atomic regions synchronously. That is, once the end of a region is reached, all outstanding persist operations required for the region to commit must complete before instruction execution may proceed. For undo logging, LPs and DPs are both performed synchronously to ensure that the region commits synchronously. For redo logging, DPs can be performed asynchronously, but LPs are performed synchronously to ensure that the region commits synchronously. In both cases, waiting for synchronous persist operations (LP or DP) at the end of an atomic region causes atomic regions to incur high latency. To tackle this limitation, we propose ASAP, a hardware logging solution that allows atomic regions to commit asynchronously. That is, once the end of an atomic region is reached, instruction execution may proceed without waiting for outstanding persist operations to complete. As such, both LPs and DPs can be performed asynchronously. The challenge with allowing atomic regions to commit asynchronously is that it can lead to control and data dependence violations in the commit order of the atomic regions, leaving data in an unrecoverable state in case of a crash. To address this issue, ASAP tracks and enforces control and data dependencies between atomic regions in hardware to ensure that the regions commit in the proper order. 
Our evaluation shows that ASAP outperforms the state-of-the-art hardware undo and redo logging techniques by 1.41x and 1.53x, respectively, while achieving 0.96x the ideal performance when no persistence is enforced, at a small hardware cost (< 3%). ASAP also reduces memory traffic to persistent memory by 38% and 48%, compared with the state-of-the-art hardware undo and redo logging techniques, respectively. ASAP is robust against increasing persistent memory latency, making it suitable for both fast and slow persistent memory technologies.
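The undo-logging discipline that this line of work accelerates can be sketched in a few lines (a software toy model for intuition only; ASAP itself implements logging and dependence tracking in hardware):

```python
class UndoLogPM:
    """Toy model of write-ahead undo logging for one atomic region.
    `data` stands in for persistent memory; `log` for the persistent log."""

    def __init__(self, data):
        self.data = dict(data)
        self.log = []          # (addr, old_value) entries, persisted before the data
        self.committed = True

    def begin(self):
        self.log.clear()
        self.committed = False

    def store(self, addr, value):
        self.log.append((addr, self.data.get(addr)))  # log persist operation (LP)
        self.data[addr] = value                       # data persist operation (DP)

    def commit(self):
        # Synchronous commit: all LPs/DPs must be complete before execution proceeds.
        self.committed = True
        self.log.clear()

    def recover(self):
        """Run at restart: roll back any region that did not commit."""
        if not self.committed:
            for addr, old in reversed(self.log):
                if old is None:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
            self.log.clear()
            self.committed = True
        return self.data

pm = UndoLogPM({"balance": 100})
pm.begin()
pm.store("balance", 50)   # crash before commit: recover() rolls back to 100
data = pm.recover()       # data == {"balance": 100}
```

ASAP's key departure from this model is that `commit` need not wait for outstanding persist operations; hardware tracks control and data dependences so regions still become durable in the proper order.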
Speaker bio: Ahmed Abulila is a Senior Hardware Engineer. He earned his Ph.D. in Computer Engineering from the University of Illinois at Urbana-Champaign and his M.Sc. degree from the University of Wisconsin–Madison. His primary research interests lie in the general area of computer architecture, with particular emphasis on memory systems and SoC design. After his graduate studies, he spent three years at Microsoft developing custom silicon for a wide range of systems and playing a crucial role in defining cutting-edge CPU and SoC designs.
CapOS: Capacitor Resilience for Energy Harvesting Systems
Jongouk Choi (University of Central Florida);Hyunwoo Joe (Electronics and Telecommunications Research Institute(ETRI));Changhee Jung (Purdue University);
Speaker: Jongouk Choi, University of Central Florida
Abstract: Energy harvesting systems have emerged as an alternative to battery-operated Internet of Things (IoT) devices. To deal with frequent power outages in the absence of a battery, energy harvesting systems rely on a capacitor-backed checkpoint mechanism, also known as just-in-time (JIT) checkpointing. It checkpoints volatile data in nonvolatile memory (NVM) just before a power outage occurs, using the energy buffered in the capacitor, and restores the checkpointed data from NVM in the wake of the outage. While JIT checkpointing gives the illusion that volatile data survive a power outage as if they were nonvolatile, it turns out that, due to capacitor degradation, energy harvesting systems can unexpectedly fail the JIT checkpoint, losing or corrupting data across the outage. To address this problem, this paper presents an OS-driven solution called CapOS. At a high level, CapOS diagnoses the capacitor in a reactive yet safe manner. When a JIT checkpoint failure occurs, CapOS detects the capacitor degradation without causing data corruption. To recover from such a capacitor error, CapOS electrically isolates the degraded capacitor, so that it restores its original capacitance by itself thanks to the capacitor's resilient nature, and disables JIT checkpointing. In case power outages occur during the capacitor isolation, CapOS leverages undo logging with interval-based checkpointing for their recovery. Once the capacitor is fully recovered, CapOS returns to capacitor-based JIT checkpointing. The experimental results demonstrate that CapOS can effectively address the capacitor error of energy harvesting systems at a low runtime cost, without compromising recovery from power outages.
Speaker bio: Jongouk Choi is an Assistant Professor at University of Central Florida (UCF), Department of Computer Science, and Cybersecurity and Privacy Cluster. He generally develops architecture/compiler co-design solutions to improve performance, reduce hardware complexity, and address reliability/security problems.
NearPM: A Near-memory Processing Prototype for Storage-class Workloads
Yasas Seneviratne (University of Virginia);Korakit Seemakhupt (University of Virginia);Sihang Liu (University of Virginia);Samira Khan (University of Virginia);
Abstract: Persistent Memory (PM) technologies enable program recovery to a consistent state in case of failure. To ensure this crash-consistent behavior, programs need to enforce persist ordering by employing mechanisms such as logging and checkpointing, which introduce additional data movement. Emerging near-data processing (NDP) architectures can effectively reduce this data movement overhead. In this work, we propose NearPM, a near-data processor that supports accelerable primitives in crash-consistent programs. Using these primitives, NearPM accelerates the commonly used crash consistency mechanisms: logging, checkpointing, and shadow paging. NearPM further reduces the synchronization overhead between the NDP and the CPU to guarantee persist ordering by moving ordering handling near memory. We ensure correct persist ordering between the CPU and NDP devices, as well as among multiple NDP devices, with Partitioned Persist Ordering (PPO). We prototype NearPM on an FPGA platform. NearPM executes data-intensive operations in crash consistency mechanisms with correct ordering guarantees, while the rest of the program runs on the CPU. We evaluate nine PM workloads, where each workload supports three crash consistency mechanisms: logging, checkpointing, and shadow paging. Overall, NearPM achieves 4.3–9.8× speedup in the NDP-offloaded operations and 1.22–1.35× speedup in end-to-end execution.
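The data movement overhead that the NearPM abstract targets can be illustrated with a toy model of undo logging (not the NearPM API; all names here are hypothetical). A CPU-side logged write must read the old value over the memory bus, write it to the log, and then write the new value; an NDP device can perform the log copy internally, so the CPU issues only the logical update.

```python
# Hypothetical model of CPU-side vs. NDP-offloaded undo logging,
# counting CPU<->memory bus transfers. Not the NearPM implementation.
class PMRegion:
    def __init__(self, size):
        self.data = [0] * size
        self.log = []           # undo log also lives in PM
        self.bus_transfers = 0  # CPU<->memory traffic counter

    def cpu_logged_write(self, addr, value):
        old = self.data[addr];        self.bus_transfers += 1  # read old value
        self.log.append((addr, old)); self.bus_transfers += 1  # persist log entry
        self.data[addr] = value;      self.bus_transfers += 1  # persist new value

    def ndp_logged_write(self, addr, value):
        # the old-value copy into the log happens inside the memory device
        self.log.append((addr, self.data[addr]))
        self.data[addr] = value
        self.bus_transfers += 1  # the CPU sends only the new value
```

Under this toy accounting, offloading cuts bus traffic per logged write from three transfers to one, which is the flavor of saving NDP architectures aim for.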
NVMM cache design: Logging vs. Paging
Rémi Dulong (University of Neuchatel, Switzerland);Quentin Acher (ENS Rennes, France);Baptiste Lepers (University of Neuchatel, Switzerland);Valerio Schiavoni (University of Neuchatel, Switzerland);Pascal Felber (University of Neuchatel, Switzerland);Gaël Thomas (Télécom SudParis - Institut Polytechnique de Paris);
Speaker: Rémi Dulong, University of Neuchâtel, Switzerland
Abstract: Modern NVMM is closing the gap between DRAM and persistent storage in terms of both performance and features. Having byte addressability and persistence on the same device gives NVMM an unprecedented set of features, leading to the following question: How should we design an NVMM-based caching system to fully exploit its potential? We build two caching mechanisms, NVPages and NVLog, based on two radically different design approaches. NVPages stores memory pages in NVMM, similar to the Linux Page Cache (LPC). NVLog uses NVMM to store a log of pending write operations to be submitted to the LPC, while serving reads from a small DRAM cache. Our study identifies and quantifies the advantages and flaws of both designs.
Speaker bio: Rémi Dulong is a PhD student at the University of Neuchâtel, Switzerland, co-supervised by Prof. Pascal Felber (University of Neuchâtel) and Prof. Gaël Thomas (Télécom SudParis, France). His research topics are non-volatile main memory (NVMM) and Remote Direct Memory Access (RDMA). His current research is about caching mechanisms exploiting NVMM, high-speed network protocols, and programmable network components.
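The two designs contrasted in the NVMM cache abstract can be sketched side by side (an illustrative toy, not the paper's implementation; class names are hypothetical): NVPages keeps whole pages in NVMM like the Linux Page Cache, while NVLog persists only a log of pending writes and answers reads from a volatile DRAM cache.

```python
# Hypothetical toy versions of the two designs in the abstract.
PAGE = 4096

class NVPagesCache:
    """Page-granular cache held entirely in NVMM (LPC-like)."""
    def __init__(self):
        self.nvmm_pages = {}  # page number -> bytearray, persistent

    def write(self, offset, data):
        pno, off = divmod(offset, PAGE)
        page = self.nvmm_pages.setdefault(pno, bytearray(PAGE))
        page[off:off + len(data)] = data

    def read(self, offset, n):
        pno, off = divmod(offset, PAGE)
        page = self.nvmm_pages.get(pno, bytearray(PAGE))
        return bytes(page[off:off + n])

class NVLogCache:
    """Persistent write log plus a small volatile DRAM read cache."""
    def __init__(self):
        self.nvmm_log = []    # persistent log of (offset, data) records
        self.dram_cache = {}  # offset -> byte, lost on power failure

    def write(self, offset, data):
        self.nvmm_log.append((offset, bytes(data)))  # durable immediately
        for i, b in enumerate(data):
            self.dram_cache[offset + i] = b

    def read(self, offset, n):
        return bytes(self.dram_cache.get(offset + i, 0) for i in range(n))
```

The trade-off the sketch hints at: NVLog persists only the bytes actually written (small, sequential log appends), while NVPages pays page-granular NVMM writes but needs no separate read path.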
Session 6B: Fault Tolerance
Chair: Dong Li, UC Merced
Compiler-Directed High-Performance Intermittent Computation with Power Failure Immunity
Jongouk Choi (University of Central Florida);Larry Kittinger (Block.one);Qingrui Liu (Annapurna Labs);Changhee Jung (Purdue University);
Speaker: Jongouk Choi, University of Central Florida
Abstract: This paper introduces power failure immunity (PFI), an essential program execution property for energy harvesting systems to achieve efficient intermittent computation. PFI ensures that program code regions never fail more than once, i.e., suffer at most a single in-region outage, during intermittent computation, as if they were immunized after the first power outage. To enforce PFI automatically for such batteryless systems, which use a tiny energy buffer instead, we present its compiler-directed enforcement. The compiler leverages a precise static analysis to partition the program into recoverable regions with the energy buffer size in mind, so that their execution can be completed—using the full energy buffered in a single charge cycle—regardless of program execution paths. In this way, no matter how unstable the energy harvesting source is, no region fails more than once. By virtue of PFI, this paper presents ROCKCLIMB, a high-performance, rollback-free intermittent computation scheme. It guarantees that PFI-enforced regions never fail, i.e., there is no in-region outage at all. To achieve this, ROCKCLIMB checks whether the fully buffered energy is secured at each region boundary. If it is not, ROCKCLIMB waits until the energy buffer is fully charged before executing the following region. In particular, the rollback-free nature of ROCKCLIMB obviates the need to log memory writes—required for rollback recovery—since no region is power-interrupted. As a result, PFI+ROCKCLIMB achieves rollback-free and memory-log-free intermittent computation, guaranteeing and maximizing forward execution progress even in the presence of frequent power outages. Our real-board experiments demonstrate that PFI+ROCKCLIMB outperforms the state-of-the-art work by 5%–550% on average under various energy harvesting conditions.
Speaker bio: Jongouk Choi is an Assistant Professor at the University of Central Florida (UCF) in the Department of Computer Science and the Cybersecurity and Privacy Cluster. He develops architecture/compiler co-design solutions to improve performance, reduce hardware complexity, and address reliability and security problems.
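The region-boundary check described in the ROCKCLIMB abstract can be sketched as a simple runtime loop (an illustrative sketch, not the compiler's generated code; `FULL_CHARGE` and the harvesting model are hypothetical): because each PFI-enforced region is sized to finish on one full energy buffer, waiting for a full charge at every boundary guarantees that no region is ever power-interrupted.

```python
# Hypothetical sketch of the boundary check in the abstract.
FULL_CHARGE = 100  # assumed buffer capacity (arbitrary units)

def run_regions(regions, harvest):
    """Run each region only once the buffer is fully charged.

    regions: list of callables, one per PFI-enforced region.
    harvest: generator yielding harvested energy per wait step.
    """
    energy = 0
    results = []
    for region in regions:
        while energy < FULL_CHARGE:   # wait at the boundary until fully charged
            energy += next(harvest)
        results.append(region())      # region completes on the buffered energy
        energy = 0                    # worst case: the whole buffer was consumed
    return results
```

Because no region starts without a full buffer, there is no rollback and hence no memory logging, which is the source of ROCKCLIMB's performance advantage.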
Zhuque: Failure Isn't an Option, It's an Exception
George Hodgkins (University of Colorado, Boulder);Yi Xu (University of California, San Diego);Steve Swanson (University of California - San Diego);Joseph Izraelevitz (University of Colorado, Boulder);
Speaker: George Hodgkins, University of Colorado, Boulder
Abstract: Persistent memory (PMEM) exposes fast storage devices as byte-addressable main memory, allowing the processor to access persistent data via load and store instructions. The durability of PMEM allows an application's in-memory state to survive across system reboots and unexpected power failures, but leveraging this capability is not simple. The contents of traditional CPU caches do not survive power loss, and, since caches may delay evicting a modified cache line, writes may not reach PMEM in program order. Programming systems that help address the challenges of persistent memory programming have proliferated over the last decade. Unfortunately, due to volatile caches, most current solutions impose significant performance overhead and are based on fundamentally limited programming models. However, the advent of PMEM devices supporting flush-on-fail semantics (such as eADR for NVDIMMs or GPF for CXL devices) means that caches are, in properly configured systems, effectively persistent [7]. Our work demonstrates the potential these systems offer for much simpler, faster PMEM programming models.
Speaker bio: George is a PhD student at the University of Colorado, working on programming models for heterogeneous memory systems. He holds an MS in Electrical Engineering from the University of Colorado, Boulder, and a BS in Computer Engineering from the University of Houston.
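The difference flush-on-fail makes, as described in the Zhuque abstract, can be shown with a toy cache model (not Zhuque code; the class and its methods are hypothetical). With a volatile cache, a store is durable only after an explicit cache-line writeback; with flush-on-fail (eADR/GPF), any store that reached the cache is drained to PMEM automatically on power failure.

```python
# Hypothetical model of a volatile cache over PMEM, with and without
# flush-on-fail semantics.
class CachedPM:
    def __init__(self, flush_on_fail):
        self.cache = {}               # volatile CPU cache
        self.pm = {}                  # persistent memory
        self.flush_on_fail = flush_on_fail

    def store(self, addr, value):
        self.cache[addr] = value      # lands in the cache, not yet durable

    def writeback(self, addr):
        self.pm[addr] = self.cache[addr]  # explicit flush (clwb-style)

    def power_fail(self):
        if self.flush_on_fail:
            self.pm.update(self.cache)    # platform drains caches on failure
        self.cache = {}                   # cache contents are otherwise lost
```

Under flush-on-fail, the explicit writeback (and the ordering fences it implies) disappears from the programming model, which is why the abstract argues such systems admit much simpler persistence abstractions.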
Analyzing Configuration Dependencies of DAX File Systems
Tabassum Mahmud (Iowa State University);Om Rameshwar Gatla (Iowa State University);Duo Zhang (Iowa State University);Carson Love (Iowa State University);Ryan Bumann (Iowa State University);Mai Zheng (Iowa State University);
Speaker: Tabassum Mahmud, Iowa State University
Abstract: File systems (FS) play an essential role in managing precious data. To meet diverse needs, they often support many configuration parameters. Such flexibility comes at the price of additional complexity, which can lead to subtle configuration-related issues. For example, in Dec. 2020, Windows users observed that the ChkDsk utility of NTFS may destroy NTFS on SSDs due to a configuration-related bug. With more and more heterogeneous devices (e.g., PM/CXL devices, SmartSSD) and advanced features being introduced (e.g., DAX), the combinatorial explosion of configuration states is expected to worsen, which calls for novel solutions. To address this challenge, we study 78 configuration-related issues of two major Linux file systems with DAX support (i.e., Ext4 and XFS), and identify a prevalent pattern called multilevel configuration dependencies. Based on the study, we build an extensible tool called CONFD to extract the dependencies automatically, and create six plugins to address different configuration issues. Our experiments on Ext4 and XFS show that CONFD can extract more than 150 dependencies with a low false positive rate (8.4%). Moreover, the dependency-guided plugins can identify various configuration issues, including inaccurate documentation, configuration-handling issues, and regression test failures induced by valid configurations.
Speaker bio: Tabassum is a fourth-year Ph.D. student at Iowa State University advised by Dr. Mai Zheng. She is interested in storage systems, system reliability, and distributed systems. Her current research focuses on configuration dependencies in file systems and relevant utility programs. She is actively looking for internship opportunities for summer 2023.
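A "multilevel configuration dependency" of the kind the CONFD abstract studies spans parameters set at different levels, such as mkfs time, mount time, and kernel or hardware state. A minimal sketch of such a rule (hypothetical check, not CONFD output; the specific rule, that an ext4 `-o dax` mount requires the filesystem block size to match the system page size and a DAX-capable backing device, is one commonly documented example):

```python
# Hypothetical checker for one multilevel dependency around '-o dax'.
# The rule encoded here is an assumption for illustration, not CONFD's.
def check_dax_mount(block_size, page_size, device_supports_dax):
    """Return the list of violated dependencies for a '-o dax' mount."""
    violations = []
    if block_size != page_size:          # mkfs-time vs. kernel-level parameter
        violations.append("dax requires block_size == page_size")
    if not device_supports_dax:          # mount-time vs. hardware capability
        violations.append("dax requires a DAX-capable backing device")
    return violations
```

CONFD's contribution is extracting many such cross-level rules automatically rather than hand-encoding them as above.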
An Analytical Model for Endurance, Write Speed, and Reliability Trade-offs in Embedded Non-volatile Memories
Zihan Zhang (Tufts University);Marco Donato (Tufts University);
Abstract: Current approaches for evaluating embedded nonvolatile memory (eNVM) implementations rely on a combination of circuit-level and array-level simulations, fault injection libraries, and characteristics from existing bitcell prototypes. Evaluation platforms such as NVMExplorer allow researchers to evaluate eNVM-based architectures in terms of read and write latency and energy, area, process node scalability, and reliability. However, the characterization of some eNVM device parameters, such as endurance, is limited to a few published results, providing a sparser set of design space points. A fine-grained characterization of these design parameters is critical, since it would allow designers to better identify optimal design points across a diverse set of application targets. This project aims to develop an endurance and reliability analytical tool that extends the NVMExplorer framework. This model will help researchers and industry professionals select the appropriate eNVM design point by examining the relationship between endurance and other operating parameters in various eNVM technologies.
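As a rough sense of the kind of trade-off arithmetic such an analytical model generalizes, a first-order lifetime estimate ties cell endurance, array capacity, and write traffic together (this sketch is not the proposed model; the formula and all parameter values are illustrative assumptions):

```python
# Hypothetical first-order eNVM lifetime estimate, for illustration only.
def lifetime_years(endurance_cycles, capacity_bytes,
                   write_bw_bytes_per_s, wear_leveling_eff=1.0):
    """Estimate array lifetime: total writable bytes / write bandwidth.

    endurance_cycles: write cycles each cell tolerates.
    wear_leveling_eff: fraction of ideal write spreading achieved (1.0 = perfect).
    """
    total_writable = endurance_cycles * capacity_bytes * wear_leveling_eff
    seconds = total_writable / write_bw_bytes_per_s
    return seconds / (365 * 24 * 3600)
```

A finer-grained model of the sort the abstract proposes would replace the constant `endurance_cycles` with a function of write speed, pulse energy, and target reliability, exposing the trade-off surface instead of a single point.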