SUNDAY, MARCH 7
12:00 pm – 3:00 pm
Tutorial
Chair: TBA
Metall: A Persistent Memory Allocator for Accelerating Data Analytics
Roger Pearce & Keita Iwabuchi
MONDAY, MARCH 8
8:00 am – 8:10 am
Opening Remarks
8:10 am – 9:00 am
Keynote I
Chair: Eitan Yaakobi
Accelerating Deep Neural Networks with Analog Memory Devices
Geoffrey Burr (IBM Research - Almaden)
Speaker: Geoffrey W. Burr, IBM Research - Almaden
Abstract Deep Neural Networks (DNNs) are very large artificial neural networks trained using very large datasets, typically using the supervised learning technique known as backpropagation. Currently, CPUs and GPUs are used for these computations. Over the next few years, we can expect special-purpose hardware accelerators based on conventional digital-design techniques to optimize the GPU framework for these DNN computations. Even after the improved computational performance and efficiency that is expected from these special-purpose digital accelerators, there would still be an opportunity for even higher performance and even better energy-efficiency for inference and training of DNNs, by using neuromorphic computation based on analog memory devices. In this presentation, I discuss the origin of this opportunity as well as the challenges inherent in delivering on it, including materials and devices for analog volatile and non-volatile memory, circuit and architecture choices and challenges, and the current status and prospects.
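The core operation behind this opportunity is easy to state: a crossbar of analog memory devices stores a weight matrix as conductances, and applying input voltages produces a layer's weighted sums as output currents in a single step via Ohm's and Kirchhoff's laws. The sketch below is a minimal numerical model of that idea (our illustration, not IBM's hardware; all values are made up):

    import numpy as np

    rng = np.random.default_rng(0)
    G = rng.uniform(0.0, 1.0, size=(4, 3))  # device conductances: one weight per crosspoint
    v = np.array([0.2, -0.1, 0.3])          # input voltages driven onto the columns

    # Each row wire sums its crosspoint currents: i[r] = sum_c G[r, c] * v[c],
    # i.e., a full matrix-vector multiply happens in one step in the analog domain.
    i = G @ v
    print(i)  # output currents = the weighted sums a DNN layer needs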
9:00 am – 9:20 am
Break / Poster Session | GatherTown
9:20 am – 10:10 am
Session 1: Memorable Paper Award Finalists I (ECC and Devices)
Chair: Paul Siegel
Cooperative Data Protection for Topology-Aware Decentralized Storage Networks
Siyi Yang (University of California, Los Angeles); Ahmed Hareedy (Duke University); Robert Calderbank (Duke University); Lara Dolecek (University of California, Los Angeles)
Speaker: Siyi Yang, UCLA
Abstract While codes with hierarchical locality have been intensively studied in the context of centralized cloud storage due to their effectiveness in reducing the average reading time, they have not yet been discussed in the context of decentralized storage networks (DSNs). In this paper, we propose a joint coding scheme where each node receives extra protection through cooperation with nodes in its neighborhood in a heterogeneous DSN with any given topology. Our proposed construction not only supports desirable properties such as scalability and flexibility, which are critical in dynamic networks, but also adapts to arbitrary topologies, a property that is essential in DSNs but has been overlooked in existing works.
Power Spectra of Finite-Length Constrained Codes with Level-Based Signaling
Jessica Centers (Duke University); Xinyu Tan (Duke University); Ahmed Hareedy (Duke University); Robert Calderbank (Duke University)
Speakers: Jessica Centers and Xinyu Tan, Electrical and Computer Engineering Department, Duke University
Abstract In various practical systems, certain data patterns are prone to errors if written or transmitted. Constrained codes are used to eliminate error-prone patterns, and they can also achieve other goals. Recently, we introduced efficient binary symmetric lexicographically-ordered constrained (LOCO) codes and asymmetric LOCO (A-LOCO) codes to increase density in magnetic recording systems and lifetime in Flash systems by eliminating the relevant detrimental patterns. Due to their applications, LOCO and A-LOCO codes are associated with level-based signaling. In this paper, we first modify a framework from the literature in order to introduce a method to derive the power spectrum of a sequence of constrained data associated with level-based signaling. We then provide a generalized method for developing the one-step state transition matrix (OSTM) for finite-length codes constrained by the separation of transitions. Via their OSTMs, we devise closed form solutions for the spectra of finite-length LOCO and A-LOCO codes.
Optimal Reconstruction Codes for Deletion Channel
Johan Chrisnata (Nanyang Technological University); Han Mao Kiah (Nanyang Technological University); Eitan Yaakobi (Technion - Israel Institute of Technology)
Speaker: Johan Chrisnata, Nanyang Technological University
Abstract The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. Motivated by modern storage devices, we introduced a variant of the problem where the number of noisy reads $N$ is fixed (Kiah et al. 2020). Of significance, for the single-deletion channel, using $\log_2\log_2 n + O(1)$ redundant bits, we designed a reconstruction code of length $n$ that reconstructs codewords from two distinct noisy reads. In this work, we show that $\log_2\log_2 n - O(1)$ redundant bits are necessary for such reconstruction codes, thereby demonstrating the optimality of our previous construction. Furthermore, we show that these reconstruction codes can be used in $t$-deletion channels (with $t\ge 2$) to uniquely reconstruct codewords from $n^{t-1}+O\left(n^{t-2}\right)$ distinct noisy reads.
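To make the two-read setting concrete, the brute-force sketch below (our illustration; the paper's codes replace this exhaustive search with efficient decoding) intersects the candidate sets implied by two distinct single-deletion reads. That the intersection is not always a singleton for unrestricted inputs is precisely why roughly $\log_2\log_2 n$ redundant bits are needed:

    def supersequences(y, alphabet="01"):
        """All strings of length len(y)+1 that can yield y via one deletion."""
        return {y[:pos] + a + y[pos:]
                for pos in range(len(y) + 1) for a in alphabet}

    def reconstruct(y1, y2):
        """Candidate codewords consistent with two distinct noisy reads."""
        return sorted(supersequences(y1) & supersequences(y2))

    # x = "10110"; deleting position 1 gives y1, deleting position 4 gives y2.
    print(reconstruct("1110", "1011"))  # ['10110'] -- unique here, not in general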
Partial MDS Codes with Regeneration
Lukas Holzbaur (Technical University of Munich); Sven Puchinger (Technical University of Denmark (DTU)); Eitan Yaakobi (Technion - Israel Institute of Technology); Antonia Wachter-Zeh (Technical University of Munich)
Speaker: Lukas Holzbaur, Technical University of Munich
Abstract Partial MDS (PMDS) and sector-disk (SD) codes are classes of erasure correcting codes that combine locality with strong erasure correction capabilities. We construct PMDS and SD codes where each local code is a bandwidth-optimal regenerating MDS code. In the event of a node failure, these codes reduce both the number of servers that have to be contacted and the amount of network traffic required for the repair process. The constructions require a significantly smaller field size than the only other construction known in the literature. Further, we present a PMDS code construction that allows for efficient repair for patterns of node failures that exceed the local erasure correction capability of the code and thereby invoke repair across different local groups.
Non-Uniform Windowed Decoding For Multi-Dimensional Spatially-Coupled LDPC Codes
Lev Tauz (University of California, Los Angeles); Lara Dolecek (University of California, Los Angeles); Homa Esfahanizadeh (Massachusetts Institute of Technology)
Speaker: Lev Tauz, Electrical and Computer Engineering, University of California, Los Angeles
Abstract In this work, we propose a non-uniform windowed decoder for multi-dimensional spatially-coupled LDPC (MD-SC-LDPC) codes over the binary erasure channel. An MD-SC-LDPC code is constructed by connecting several SC-LDPC codes into one larger code that provides major benefits across a variety of channel models. We propose and analyze a novel non-uniform decoder that allows for greater flexibility between latency and code reliability. Our theoretical derivations and empirical results show that our non-uniform decoder greatly improves upon the standard windowed decoder in terms of design flexibility, latency, and complexity.
10:10 am – 10:30 am
Break / Poster Session | GatherTown
10:30 am – 11:20 am
Session 2: Memorable Paper Award Finalists II (Systems and Architecture)
Chair: Samira Khan
Characterizing and Modeling Non-Volatile Memory Systems
Zixuan Wang (University of California, San Diego); Xiao Liu (University of California, San Diego); Jian Yang (University of California, San Diego); Theodore Michailidis (University of California, San Diego); Steven Swanson (University of California, San Diego); Jishen Zhao (University of California, San Diego)
Speaker: Zixuan Wang, University of California, San Diego
Abstract Scalable server-grade non-volatile RAM (NVRAM) DIMMs became commercially available with the release of Intel’s Optane DIMM. Recent studies on Optane DIMM systems unveil discrepant performance characteristics, compared to what many researchers assumed before the product release. Most of these studies focus on system software design and performance analysis. To thoroughly analyze the source of this discrepancy and facilitate real-NVRAM-aware architecture design, we propose a framework that characterizes and models Optane DIMM’s microarchitecture. Our framework consists of a Low-level profilEr for Non-volatile memory Systems (LENS) and a Validated cycle-Accurate NVRAM Simulator (VANS). LENS allows us to comprehensively analyze the performance attributes and reverse engineer NVRAM microarchitectures. Based on LENS characterization, we develop VANS, which models the sophisticated microarchitecture design of Optane DIMM, and is validated by comparing with the detailed performance characteristics of Optane-DIMM-attached Intel servers. VANS adopts a modular design that can be easily modified to extend to other NVRAM architecture designs; it can also be attached to full-system simulators, such as gem5.
Assise: Performance and Availability via Client-local NVM in a Distributed File System
Thomas Anderson (University of Washington); Marco Canini (KAUST); Jongyul Kim (KAIST); Dejan Kostić (KTH Royal Institute of Technology); Youngjin Kwon (KAIST); Simon Peter (University of Texas at Austin); Waleed Reda (KTH Royal Institute of Technology and Université catholique de Louvain); Henry N. Schuh (University of Washington); Emmett Witchel (UT Austin)
Speaker: Waleed Reda, KTH Royal Institute of Technology and Université catholique de Louvain
Abstract The adoption of low-latency non-volatile memory (NVM) at scale upends the existing client-server model for distributed file systems. Instead, by leveraging client-local NVM storage, we can provide applications with much higher IO performance, sub-second application failover, and strong consistency. To that end, we built the Assise distributed file system, which uses client-local NVM as a linearizable and crash-recoverable cache between applications and slow (and possibly remote) storage. Assise maximizes locality for all file IO by carrying out IO on process-local and client-local NVM whenever possible. By maintaining consistency at IO operation granularity, rather than at fixed block sizes, Assise minimizes coherence overheads and prevents block amplification. In doing so, Assise provides orders of magnitude lower tail latency, higher scalability, and higher availability than the state-of-the-art.
Clobber-NVM: Log Less, Re-execute More
Yi Xu (UC San Diego); Joseph Izraelevitz (University of Colorado, Boulder); Steven Swanson (UC San Diego)
Speaker: Yi Xu, University of California, San Diego
Abstract Non-volatile memory allows direct access to persistent storage via a load/store interface. However, because the cache is volatile, cached updates to persistent state will be dropped after a power loss. Failure-atomicity NVM libraries provide the means to apply sets of writes to persistent state atomically. Unfortunately, most of these libraries impose significant overhead. This work proposes Clobber-NVM, a failure-atomicity library that ensures data consistency by reexecution. Clobber-NVM’s novel logging strategy, clobber logging, records only those transaction inputs that are overwritten during transaction execution. Then, after a failure, it recovers to a consistent state by restoring overwritten inputs and reexecuting any interrupted transactions. Clobber-NVM utilizes a clobber logging compiler pass for identifying the minimal set of writes that need to be logged. Based on our experiments, classical undo logging logs up to 42.6X more bytes than Clobber-NVM, and requires 2.4X to 4.7X more expensive ordering instructions (e.g., clflush and sfence). Less logging leads to better performance: Relative to prior art, Clobber-NVM provides up to 2.5X performance improvement over Mnemosyne and 2.6X over Intel’s PMDK.
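A minimal executable model of the clobber-logging idea (our sketch with a hypothetical API, not the Clobber-NVM library): old values are logged only when a transaction overwrites one of its own inputs; pure outputs are never logged because re-execution can recompute them after a crash:

    class ClobberTx:
        def __init__(self, mem):
            self.mem = mem        # persistent state: addr -> value
            self.log = {}         # clobber log: only overwritten inputs
            self.read_set = set()

        def read(self, addr):
            self.read_set.add(addr)
            return self.mem[addr]

        def write(self, addr, value):
            # Log the old value only for a not-yet-logged transaction input.
            if addr in self.read_set and addr not in self.log:
                self.log[addr] = self.mem[addr]
            self.mem[addr] = value

        def recover(self):
            # After a crash: restore clobbered inputs, then re-execute the transaction.
            self.mem.update(self.log)
            self.log.clear()

    mem = {"x": 1, "y": 2}
    tx = ClobberTx(mem)
    tx.write("x", tx.read("x") + 10)  # input "x" is clobbered -> old value logged
    tx.write("z", 99)                 # pure output -> nothing logged
    print(tx.log)                     # {'x': 1}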
CoSpec: Compiler Directed Speculative Intermittent Computation
Jongouk Choi (Purdue University); Qingrui Liu (Annapurna Labs); Changhee Jung (Purdue University)
Speaker: Jongouk Choi, Purdue University
Abstract Energy harvesting systems have emerged as an alternative to battery-operated embedded devices. Due to the intermittent nature of energy harvesting, researchers equip the systems with nonvolatile memory (NVM) and crash consistency mechanisms. However, prior works require non-trivial hardware modifications, e.g., a voltage monitor, nonvolatile flip-flops/scratchpad, dependence tracking modules, etc., thereby causing significant area/power/manufacturing costs. For low-cost yet performant intermittent computation, this paper presents CoSpec, a new architecture/compiler co-design scheme that works for commodity in-order processors used in energy-harvesting systems. To achieve crash consistency without requiring unconventional architectural support, CoSpec leverages speculation, assuming that power failure is not going to occur, and thus holds all committed stores in a store buffer (SB)—as if they were speculative—in case of misspeculation. The CoSpec compiler first partitions a given program into a series of recoverable code regions with the SB size in mind, so that no region overflows the SB. When program control reaches the end of each region, the speculation turns out to be successful, thus releasing all the buffered stores of the region to NVM. If power failure occurs during the execution of a region, all its speculative stores disappear in the volatile SB, i.e., they never affect program states in NVM. Consequently, the interrupted region can be restarted with consistent program states in the wake of power failure. To hide the latency of the SB release—i.e., essentially NVM writes—at each region boundary, CoSpec overlaps the NVM writes of the current region with the speculative execution of the next region. Such instruction level parallelism gives an illusion of out-of-order execution on top of the in-order processor, achieving a speedup of more than 1.2X when there is no power outage. Our experiments on a set of real energy harvesting traces with frequent outages demonstrate that CoSpec outperforms the state-of-the-art scheme by 1.8∼3X on average.
CrossFS: A Cross-layered Direct-Access File System
Yujie Ren (Rutgers University); Changwoo Min (Virginia Tech); Sudarsun Kannan (Rutgers University)
Speaker: Yujie Ren, Rutgers University
Abstract We design CrossFS, a cross-layered direct-access file system disaggregated across user-level, firmware, and kernel layers for scaling I/O performance and improving concurrency. CrossFS is designed to exploit host- and device-level compute capabilities. CrossFS introduces a file descriptor-based concurrency control that maps each file descriptor to one hardware-level I/O queue for concurrency with or without data sharing across threads and processes. This design allows CrossFS's firmware component to process disjoint access across file descriptors concurrently. CrossFS delegates concurrency control to powerful host-CPUs, which convert the file descriptor synchronization problem into an I/O queue request ordering problem. CrossFS exploits byte-addressable nonvolatile memory for I/O queue persistence to guarantee crash consistency in the cross-layered design and designs a lightweight firmware-level journaling mechanism. Finally, CrossFS designs a firmware-level I/O scheduler for efficient dispatch of file descriptor requests. Evaluation of emulated CrossFS on storage-class memory shows up to 4.87x concurrent access gains for benchmarks and 2.32x gains for real-world applications over the state-of-the-art kernel, user-level, and firmware file systems.
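The descriptor-to-queue mapping at the heart of this design can be sketched in a few lines (a toy host-side model with hypothetical names, not CrossFS's firmware): requests on different descriptors land in different queues, so synchronization becomes per-queue ordering rather than file-wide locking:

    from collections import defaultdict, deque

    class FDQueues:
        def __init__(self):
            self.queues = defaultdict(deque)   # fd -> its own ordered I/O queue

        def submit(self, fd, op):
            self.queues[fd].append(op)         # disjoint fds never contend

        def dispatch(self):
            # Firmware-side scheduler: drain each descriptor's queue independently.
            for fd, q in self.queues.items():
                while q:
                    print(f"fd {fd}: {q.popleft()}")

    io = FDQueues()
    io.submit(3, ("write", 0, b"abc"))
    io.submit(4, ("read", 128, 64))
    io.dispatch()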
11:20 am – 12:00 pm
Networking Session
TUESDAY, MARCH 9
8:00 am – 8:10 am
Opening Remarks
8:10 am – 9:00 am
Keynote II
Chair: Jishen Zhao
Twizzler: Rethinking the Operating System Stack for Byte-Addressable NVM
Ethan Miller (University of California, Santa Cruz)
Speaker: Ethan Miller, University of California, Santa Cruz
Abstract Byte-addressable non-volatile memory (NVM) promises applications the ability to persist small units of data, enabling new programming paradigms and system designs. However, such gains will require significant help from the operating system: it needs to "get out of the way" while still providing strong guarantees for security and resource management. This talk will describe our approach to designing an operating system and programming environment that leverages the advantages of NVM to provide a single-level store for application data. Under this approach, NVM can be accessed, transparently, by any thread at any time, with pointers retaining their meanings across multiple invocations. Equally important, the operating system is minimally involved in program operation, limiting itself to managing virtualized devices, scheduling threads, and managing page tables to enforce user-managed access controls at page-level granularity. Structuring the system in this way provides both a simpler programming model and, in many cases, higher performance, allowing NVM-based systems to fully leverage the new ability to persist data with a single write while providing a stronger, more flexible security model than traditional operating systems.
9:00 am – 9:20 am
Break / Poster Session | GatherTown
9:20 am – 10:00 am
Session 3A: ECC
Chair: Homa Esfahanizadeh (MIT)
Flexible Partial MDS Codes
Weiqi Li (University of California, Irvine); Taiting Lu (University of California, Irvine); Zhiying Wang (University of California, Irvine); Hamid Jafarkhani (University of California, Irvine)
Speaker: Weiqi Li, Center for Pervasive Communications and Computing (CPCC), University of California, Irvine, USA
Abstract The partial MDS (PMDS) code was introduced by Blaum et al. for RAID systems. Given the redundancy level and the number of symbols in each node, PMDS codes can tolerate a mixed type of failures consisting of entire node failures and partial errors (symbol failures). Aiming to reduce the expected access latency, this paper presents flexible PMDS codes that can recover the information from a flexible number of nodes according to the number of available nodes, while the total number of symbols required remains the same. We analyze the reliability and latency of our flexible PMDS codes.
Codes for Cost-Efficient DNA Synthesis
Andreas Lenz (Technical University of Munich); Yi Liu (Center for Memory and Recording Research, UCSD); Cyrus Rashtchian (UCSD); Paul Siegel (UCSD); Andrew Tan (UCSD); Antonia Wachter-Zeh (Technical University of Munich); Eitan Yaakobi (Technion - Israel Institute of Technology)
Speaker: Andreas Lenz, Technical University of Munich
Abstract As a step toward more efficient DNA data storage systems, we study the design of codes that minimize the time and number of required materials needed to synthesize the DNA strands. We consider a popular synthesis process that builds many strands in parallel in a step-by-step fashion using a fixed supersequence $S$. The machine iterates through $S$ one nucleotide at a time, and in each cycle, it adds the next nucleotide to a subset of the strands. We show that by introducing redundancy to the synthesized strands, we can significantly decrease the number of synthesis cycles required to produce the strands. We derive the maximum amount of information per synthesis cycle assuming $S$ is an arbitrary periodic sequence. To prove our results, we exhibit new connections to cost-constrained codes.
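The cost model is simple to simulate. In the sketch below (our illustration, assuming the common periodic supersequence S = ACGTACGT...), one machine cycle offers a single nucleotide to all strands in parallel, and the cost of a batch is the number of cycles until every strand is complete; coding that avoids long waits for the next matching nucleotide directly reduces this count:

    def synthesis_cycles(strands, period="ACGT"):
        progress = [0] * len(strands)   # next position to synthesize, per strand
        cycles = 0
        while any(p < len(s) for p, s in zip(progress, strands)):
            nucleotide = period[cycles % len(period)]
            for i, s in enumerate(strands):
                if progress[i] < len(s) and s[progress[i]] == nucleotide:
                    progress[i] += 1    # this strand accepts the offered nucleotide
            cycles += 1
        return cycles

    print(synthesis_cycles(["ACGT"]))  # 4 cycles: every offer is accepted
    print(synthesis_cycles(["AAAA"]))  # 13 cycles: three idle cycles per A after the first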
Systematic Single-Deletion Multiple-Substitution Correcting Codes
Wentu Song (Singapore University of Technology and Design); Nikita Polyanskii (Technical University of Munich, Germany, and Skolkovo Institute of Science and Technology); Kui Cai (Singapore University of Technology and Design); Xuan He (Singapore University of Technology and Design)
Speaker: Wentu Song, Singapore University of Technology and Design, Singapore
Abstract Recent work by Smagloy et al. (ISIT 2020) shows that the redundancy of a single-deletion s-substitution correcting code is asymptotically at least (s+1)log(n) + o(log(n)), where n is the length of the codes. They also provide a construction of single-deletion and single-substitution codes with redundancy 6log(n) + 8. In this paper, we propose a family of systematic single-deletion s-substitution correcting codes of length n with asymptotic redundancy at most (3s+4)log(n) + o(log(n)) and polynomial encoding/decoding complexity, where s >= 2 is a constant. Specifically, the encoding and decoding complexities of the proposed codes are O(n^{s+3}) and O(n^{s+2}), respectively.
Single Indel/Edit Correcting Codes: Linear-Time Encoders and Order-Optimality
Kui Cai (Singapore University of Technology and Design); Yeow Meng Chee (National University of Singapore); Ryan Gabrys (Spawar Systems Center); Han Mao Kiah (Nanyang Technological University); Tuan Thanh Nguyen (Singapore University of Technology and Design)
Speaker: Tuan Thanh Nguyen, Singapore University of Technology and Design
Abstract An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this work, we investigate quaternary codes that correct a single indel or single edit and provide linear-time algorithms that encode binary messages into these codes of length n. Particularly, we provide two linear-time encoders: one corrects a single edit with ⌈log n⌉ + O(log log n) redundancy bits, while the other corrects a single indel with ⌈log n⌉ + 2 redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (that corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits.
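For intuition, the classical binary Varshamov-Tenengolts (VT) code is the standard single-deletion building block that such encoders extend to the quaternary indel/edit setting; the sketch below (our illustration, with brute-force re-insertion used for clarity rather than speed) shows the VT syndrome and how it pins down the deleted symbol:

    def vt_syndrome(x):
        # VT_a(n): words with sum(i * x_i) = a (mod n+1), for i = 1..n.
        return sum(i * b for i, b in enumerate(x, start=1)) % (len(x) + 1)

    def correct_single_deletion(y, n, a):
        """Recover the unique word in VT_a(n) that yields y after one deletion."""
        candidates = set()
        for pos in range(n):
            for bit in (0, 1):
                x = y[:pos] + [bit] + y[pos:]
                if vt_syndrome(x) == a:
                    candidates.add(tuple(x))
        assert len(candidates) == 1   # uniqueness is exactly the VT property
        return list(candidates.pop())

    x = [1, 0, 1, 1, 0, 1]
    a = vt_syndrome(x)
    y = x[:2] + x[3:]                                   # channel deletes position 2
    print(correct_single_deletion(y, len(x), a) == x)   # True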
Coding and Bounds for Partially Defect Memory Cells
Haider Al Kim (Technical University of Munich, TUM); Sven Puchinger (Technical University of Denmark (DTU)); Antonia Wachter-Zeh (Technical University of Munich, TUM)
Speaker: Haider Al Kim, PhD Candidate at Technical University of Munich / Department of Electrical and Computer Engineering / Coding and Cryptography (COD) Group
Abstract This paper considers coding for \emph{partially stuck} memory cells. Such memory cells can only store partial information as some of their levels cannot be used due to, e.g., wear out. First, we present a code construction for masking such partially stuck cells while additionally correcting errors. Second, we derive a sphere-packing and a Gilbert-Varshamov bound for codes that can mask a certain number of partially stuck cells and correct errors additionally. A numerical comparison between the new bounds and our constructions of partially-stuck-cell masking codes (PSMCs) shows that our construction matches the Gilbert-Varshamov-like bound for several code parameters, for any number $u\leq n$ of partially stuck cells.
Session 3B: Storage and Operating Systems
Chair: Sudarsun Kannan
Manycore-Based Scalable SSD Architecture Towards One and More Million IOPS
Jie Zhang (KAIST); Miryeong Kwon (KAIST); Michael Swift (University of Wisconsin-Madison); Myoungsoo Jung (KAIST)
Speaker: Jie Zhang, Korea Advanced Institute of Science and Technology (KAIST)
Abstract NVMe is designed to unshackle flash from a traditional storage bus by allowing hosts to employ many threads to achieve higher bandwidth. While NVMe enables users to fully exploit all levels of parallelism offered by modern SSDs, current firmware designs are not scalable and have difficulty in handling a large number of I/O requests in parallel due to their limited computation power and many hardware contentions. We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests per second (1M IOPS) while hiding long latencies imposed by its internal flash media. Inspired by a parallel data analysis system, we design the firmware based on a many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device. To show its extreme parallel scalability, we implement DeepFlash on a many-core prototype processor that employs dozens of lightweight cores, analyze new challenges from parallel I/O processing and address the challenges by applying concurrency-aware optimizations. Our comprehensive evaluation reveals that DeepFlash can serve around 4.5 GB/s, while minimizing the CPU demand on microbenchmarks and real server workloads.
The Storage Hierarchy is Not a Hierarchy: Optimizing Caching on Modern Storage Devices with Orthus
Kan Wu (University of Wisconsin-Madison); Zhihan Guo (University of Wisconsin-Madison); Guanzhou Hu (University of Wisconsin-Madison); Kaiwei Tu (University of Wisconsin-Madison); Ramnatthan Alagappan (VMware Research Group); Rathijit Sen (Microsoft); Kwanghyun Park (Microsoft); Andrea Arpaci-Dusseau (University of Wisconsin-Madison); Remzi Arpaci-Dusseau (University of Wisconsin-Madison)
Speaker: Kan Wu, University of Wisconsin-Madison
Abstract We introduce non-hierarchical caching (NHC), a novel approach to caching in modern storage hierarchies. NHC improves performance as compared to classic caching by redirecting excess load to devices lower in the hierarchy when it is advantageous to do so. NHC dynamically adjusts allocation and access decisions, thus maximizing performance (e.g., high throughput, low 99%-ile latency). We implement NHC in Orthus-CAS (a block-layer caching kernel module) and Orthus-KV (a user-level caching layer for a key-value store). We show the efficacy of NHC via a thorough empirical study: Orthus-KV and Orthus-CAS offer significantly better performance (by up to 2×) than classic caching on various modern hierarchies, under a range of realistic workloads.
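The scheduling insight reduces to a small routing rule. The toy model below (our simplification with made-up bandwidth numbers, not the Orthus code) shows why NHC helps only under high load: below the cache device's capacity it behaves like classic caching, and beyond it the lower tier's bandwidth is added instead of queueing the excess:

    def route(load_iops, cache_iops, lower_iops):
        to_cache = min(load_iops, cache_iops)            # classic caching stops here
        to_lower = min(load_iops - to_cache, lower_iops) # NHC redirects the excess
        return to_cache, to_lower, to_cache + to_lower   # last value = served load

    print(route(300_000, 500_000, 400_000))  # light load: all requests hit the cache
    print(route(800_000, 500_000, 400_000))  # heavy load: 300K redirected, all served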
Architecting Throughput Processors with New Flash
Jie Zhang (KAIST); Myoungsoo Jung (KAIST)
Speaker: Jie Zhang, Korea Advanced Institute of Science and Technology (KAIST)
Abstract We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in the GPU and address the performance penalty imposed by SSD. Specifically, ZnG replaces all GPU internal DRAM with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of SSD by replacing the flash channels with a high-throughput flash network and integrating the SSD firmware in the GPU MMU to reap the benefits of hardware acceleration. Although the NAND flash array within the SSD can deliver high accumulated bandwidth, only a small fraction of its bandwidth can be utilized by the memory requests, due to the mismatch of access granularity. To address this, ZnG employs a large L2 cache and flash registers to buffer the memory requests. Our evaluation results indicate that ZnG can achieve 7.5x higher performance than prior work.
Explaining SSD failures using Anomaly Detection
Chandranil Chakraborttii (University of California, Santa Cruz); Heiner Litz (University of California, Santa Cruz)
Speaker: Chandranil Chakraborttii, University of California Santa Cruz
Abstract NAND flash-based solid-state drives (SSDs) represent an important storage tier in data centers, holding most of today's warm and hot data. Even with advanced fault tolerance techniques and low failure rates, large hyperscale data centers utilizing 100,000s of SSDs suffer from multiple device failures daily. Data center operators are interested in predicting SSD device failures for two main reasons. First, even with RAID [2] and replication [5] techniques in place, device failures induce transient recovery and repair overheads, affecting the cost and tail latency of storage systems. Second, predicting near-term failure trends helps to inform the device acquisition process, thus avoiding capacity bottlenecks. Hence, it is important to predict both short-term individual device failures and near-term failure trends. Prior studies on predicting storage device failures [1, 6, 7, 9] suffer from the following main challenges. First, as they utilize black-box machine learning (ML) techniques, they are unaware of the underlying failure reasons, rendering it difficult to determine the failure types that these models can predict. Second, the models in prior work struggle with dynamic environments that suffer from previously unseen failures that have not been included in the training set. These two challenges are especially relevant for the SSD failure detection problem, which suffers from high class imbalance. In particular, the number of healthy drive observations is generally orders of magnitude larger than the number of failed drive observations, thus posing a problem for training most traditional supervised ML models. To address these challenges, we propose to utilize 1-class ML models that are trained only on the majority class. By ignoring the minority class for training, our 1-class models avoid overfitting to an incomplete set of failure types, thereby improving the overall prediction performance by up to 9.5% in terms of ROC AUC score. Furthermore, we introduce a new learning technique for SSD failure detection, the 1-class autoencoder, which enables interpretability of the trained models while providing high prediction accuracy. In particular, 1-class autoencoders provide insights into what features and their combinations are most relevant to flagging a particular type of device failure. This enables categorization of failed drives based on their failure type, thus informing about specific procedures (e.g., repair, swap, etc.) that need to be applied to resolve the failure. For analysis and evaluation of our proposed techniques, we leverage a cloud-scale dataset from Google that has already been used in prior work [1, 8]. This dataset contains 40 million observations from over 30,000 drives over a period of six years. For each observation, the dataset contains 21 different SSD telemetry parameters including SMART (Self-Monitoring, Analysis, and Reporting Technology) parameters, the amount of read and written data, error codes, as well as information about blocks that became nonoperational over time. Around 30% of the drives that failed during the data collection process were replaced while the rest were removed, and hence no longer appeared in the dataset. As a result, we obtained approximately 300 observations for each healthy drive (40 million observations in total) and 4 to 140 observations for each failed drive (15,000 total observations).
We treated each data point as an independent observation and normalized all the non-categorical data values to be between 0 and 1. One of our primary goals was to select the most distinguishing features that are highly correlated with failures for training. We used three different feature selection methods, Filter, Embedded, and Wrapper [4] techniques, to select the most important features contributing to failures in our dataset. The resulting set of top features was correctable error count, cumulative bad block count, cumulative p/e cycle, erase count, final read error, read count, factory bad block count, write count, and status read-only. The dataset containing only the top selected features is then used for training the different ML models. In a datacenter, we envision our SSD failure prediction technique to be implemented as shown in Figure 1. The telemetry traces are collected periodically from all SSDs in the datacenter and sent to the preprocessing pipeline, which transforms all input data into numeric values while filtering out incomplete and noisy values. Following data preprocessing, feature selection is performed to extract the most important features from the data set. The preprocessed data is then utilized for either training or inference. For inference, device anomalies are reported and classified according to our 1-class autoencoder approach. SSDs can then be manually analyzed by a technician or replaced directly. As an alternative, a scrubber can be leveraged to validate the model predictions by performing a low-level analysis of the SSD. To evaluate the five ML techniques, we first label all 40 million observations in the dataset to separate healthy from failed drive observations. We then perform a 90%/10% split of the dataset into a training set and an evaluation set, respectively. For training the 1-class models we remove all failed drive observations from the training set; however, the evaluation set is kept identical for our proposed 1-class techniques and the three baselines. We use the ROC AUC score as a metric for comparing the performance of our approaches with the chosen baselines, which is in line with prior work [1], and use 10-fold cross-validation for evaluating all approaches. The 1-class autoencoder model utilizes 4 hidden layers comprising 50, 25, 25, and 50 neurons, respectively. The neurons utilize a tanh activation function and the Adam optimizer, and the model is trained for 100 epochs. We use early stopping with a patience value of 5, ensuring that training stops when the loss does not decrease for 5 consecutive epochs. Increasing the number of hidden layers beyond 4 increases the training time significantly without providing performance benefits. Figure 2 illustrates the comparative performance of different ML techniques for predicting SSD failures one day ahead. Among the baselines, Random Forest performs best, providing a ROC AUC score of 0.85. Both our 1-class models outperform the best baseline. In particular, the 1-class isolation forest achieves a ROC AUC score of 0.91, representing a 7% improvement over the best baseline, while the 1-class autoencoder outperforms Random Forest by 9.5%. This work introduces 1-class autoencoders for interpreting failures. In particular, our technique exposes the reasons determined by our model to flag a particular device failure.
This is achieved by utilizing the reconstruction error generated by the model while reproducing the output using the trained representation of a healthy drive. Failed drives do not conform to this representation and hence generate outputs that differ significantly from the actual inputs, producing large reconstruction errors. We study the reconstruction error per feature to generate the failure reasons. A feature that contributes more than the average per-feature error to the reconstruction error is flagged as a significant reason. The results show that many failed drives exhibit a higher than normal number of correctable_errors and cumulative_bad_block events; however, these were selected as a reason for failure in only 35% and 30% of the cases, respectively. Hence, our analysis shows that there exist particularly relevant features that indicate device failures in many cases; however, only the combination of several features enables accurate failure prediction. To conclude, this paper provides a comprehensive analysis of machine learning techniques to predict SSD failures in the cloud. We observe that prior works on SSD failure prediction suffer from the inability to predict previously unseen failure types, motivating us to explore 1-class machine learning models such as the 1-class isolation forest and the 1-class autoencoder. We show that our approaches outperform prior work by 9.5% in ROC AUC score by improving the prediction accuracy for failed drives. Finally, we show that 1-class autoencoders enable interpretability of model predictions by exposing the reasons determined by the model for predicting a failure. A more comprehensive evaluation of our approach is discussed in [3], where we show the adaptability of 1-class models to dynamic environments with new types of failures emerging over time and the impact of predicting further ahead in time.
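A minimal sketch of the described 1-class autoencoder (our reconstruction from the details above, using synthetic stand-in data; the layer sizes, tanh activation, Adam optimizer, and early stopping with patience 5 follow the text):

    import numpy as np
    import tensorflow as tf

    n_features = 10                                               # the selected features
    healthy = np.random.rand(5000, n_features).astype("float32")  # placeholder data

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(50, activation="tanh"),
        tf.keras.layers.Dense(25, activation="tanh"),
        tf.keras.layers.Dense(25, activation="tanh"),
        tf.keras.layers.Dense(50, activation="tanh"),
        tf.keras.layers.Dense(n_features),                 # reconstruct the input
    ])
    model.compile(optimizer="adam", loss="mse")
    stop = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=5)
    model.fit(healthy, healthy, epochs=100, callbacks=[stop], verbose=0)

    # Inference: a large reconstruction error flags an anomaly, and the features
    # whose per-feature error exceeds the mean are reported as failure reasons.
    obs = np.random.rand(1, n_features).astype("float32")
    err = (model.predict(obs, verbose=0) - obs) ** 2
    print(float(err.sum()), np.where(err[0] > err.mean())[0])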
Deterministic I/O and Resource Isolation for OS-level Virtualization In Server Computing
Miryeong Kwon (KAIST); Donghyun Gouk (KAIST); Changrim Lee (KAIST); ByoungGeun Kim (Samsung); Jooyoung Hwang (Samsung); Myoungsoo Jung (KAIST)
Speaker: Miryeong Kwon, KAIST
Abstract We propose DC-Store, a storage framework that offers deterministic I/O performance for a multi-container execution environment. DC-Store's hardware-level design implements multiple NVM sets on a shared storage pool, each providing a deterministic SSD access time by removing internal resource conflicts. In parallel, the software support of DC-Store is aware of the NVM sets and enlightens the Linux kernel to isolate noisy neighbor containers, which perform page frame reclaiming, from peers. We prototype both hardware and software counterparts of DC-Store and evaluate them in a real system. The evaluation results demonstrate that containerized data-intensive applications on DC-Store exhibit, on average, 31% shorter execution time compared to those on a baseline system.
10:00 am – 10:30 am
Break / Poster Session | GatherTown
10:30 am – 11:10 am
Session 4A: Devices and ECC
Chair: Ahmed Hareedy (Duke)
Tunable Fine Learning Rate controlled by pulse width modulation in Charge Trap Flash (CTF) for Synaptic Application
Shalini Shrivastava (Indian Institute of Technology Bombay); Udayan Ganguly (Indian Institute of Technology Bombay)
Speaker: Shalini Shrivastava, Indian Institute of Technology Bombay, Mumbai, India
Abstract Brain-inspired neuromorphic computation is in high demand for next-generation computational systems due to its high performance, low power, and high energy efficiency. Flash memory, a highly mature technology today, has been a promising electronic synaptic device since 1989. A linear, gradual, and symmetric learning rate is a basic requirement for a high-performance synaptic device. In this paper, we demonstrate a finely controlled learning rate in Charge Trap Flash (CTF) by pulse width modulation of the input gate pulse. We further study the effects of cycle-to-cycle (C2C) and device-to-device (D2D) variability, and the limits of charge fluctuation with scaling, on the learning rate. We also compare the CTF synapse with other state-of-the-art devices. The learning rate with CTF can be tuned from 0.2% to 100%, which is remarkable for a single device. Further, C2C variability does not affect the conductance, and the learning rate is limited by D2D variability only for learning levels > 8000. We also show that the CTF synapse has a lower sensitivity to charge fluctuation, even with scaled devices. The tunable learning rate and lower sensitivity to variability and charge fluctuation of the CTF synapse are significant compared to the state-of-the-art, making CTF very promising and of great interest for brain-inspired computing systems.
Ferroelectric, Analog Resistive Switching in BEOL Compatible TiN/HfZrO4/TiOx Junctions
Laura Bégon-Lours (IBM Research GmbH); Mattia Halter (IBM Research GmbH, ETH Zurich); Youri Popoff (IBM Research GmbH, ETH Zurich); Bert Jan Offrein (IBM Research GmbH)
Speaker: Laura Bégon-Lours, IBM Research
Abstract Thanks to their compatibility with CMOS technologies, hafnium-based ferroelectric devices are receiving increasing interest for the fabrication of neuromorphic hardware. In this work, an analog resistive memory device is fabricated with a process developed for Back-End-Of-Line integration. A 4.5 nm thick HfZrO4 (HZO) layer is crystallized into the ferroelectric phase, a thickness thin enough to allow electrical conduction through the layer. A TiOx interlayer is used to create an asymmetric junction, as required for transferring a polarization state change into a modification of the conductivity. Memristive functionality is obtained, in the pristine state as well as after ferroelectric wake-up, involving redistribution of oxygen vacancies in the ferroelectric layer. The resistive switching is shown to originate directly from the ferroelectric properties of the HZO layer.
HD-RRAM: Improving Write Operations on MLC 3D Cross-point Resistive Memory
Chengning Wang (Huazhong University of Science and Technology); Dan Feng (Huazhong University of Science and Technology); Wei Tong (Huazhong University of Science and Technology); Yu Hua (Huazhong University of Science and Technology); Jingning Liu (Huazhong University of Science and Technology); Bing Wu (Huazhong University of Science and Technology); Wei Zhao (Huazhong University of Science and Technology); Linghao Song (Duke University); Yang Zhang (Huazhong University of Science and Technology); Jie Xu (Huazhong University of Science and Technology); Xueliang Wei (Huazhong University of Science and Technology); Yiran Chen (Duke University)
Speaker: Chengning Wang, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, China.
Abstract Multilevel cell (MLC), cross-point array structure, and three-dimensional (3D) array integration are three technologies to scale up the density of resistive memory. However, composing the three technologies together strengthens the interactions between array-level and cell-level nonidealities (IR drop, sneak current, and cycle-to-cycle variation) in resistive memory arrays and significantly degrades array write performance. We propose a nonidealities-tolerant high-density resistive memory (HD-RRAM) system based on multilayered MLC 3D cross-point arrays that can weaken the interactions between nonidealities and mitigate their degradation effects on write performance. HD-RRAM is equipped with a double-transistor array architecture with multiside asymmetric bias, proportional-control state tuning, and MLC parallel writing techniques. The evaluation shows that the HD-RRAM system can reduce access latency by 27.5% and energy consumption by 37.2% over an aggressive baseline.
Reconstruction Algorithms for DNA-Storage Systems
Omer Sabary (University of California, San Diego); Alexander Yucovich (Technion); Guy Shapira (Technion); Eitan Yaakobi (Technion)
Speaker: Omer Sabary, University of California, San Diego
Abstract In the \emph{trace reconstruction problem} a length-$n$ string $x$ yields a collection of noisy copies, called \emph{traces}, $y_1, \ldots, y_t$, where each $y_i$ is independently obtained from $x$ by passing through a \emph{deletion channel}, which deletes every symbol with some fixed probability. The main goal under this paradigm is to determine the required minimum number of i.i.d. traces in order to reconstruct $x$ with high probability. The trace reconstruction problem can be extended to the model where each trace is a result of $x$ passing through a \emph{deletion-insertion-substitution channel}, which also introduces insertions and substitutions. Motivated by the storage channel of DNA, this work focuses on another variation of the trace reconstruction problem, referred to as the \emph{DNA reconstruction problem}. A \emph{DNA reconstruction algorithm} is a mapping $R: (\Sigma_q^*)^t \to \Sigma_q^*$ which receives $t$ traces $y_1, \ldots, y_t$ as an input and produces $\widehat{x}$, an estimation of $x$. The goal in the DNA reconstruction problem is to minimize the edit distance $d_e(x,\widehat{x})$ between the original string and the algorithm's estimation. For the deletion channel case, the problem is referred to as the \emph{deletion DNA reconstruction problem} and the goal is to minimize the Levenshtein distance $d_L(x,\widehat{x})$. In this work, we present several new algorithms for these reconstruction problems. Our algorithms look globally at the entire sequence of traces and use dynamic programming algorithms, as used for the \emph{shortest common supersequence} and the \emph{longest common subsequence} problems, in order to decode the original sequence. Our algorithms do not require any limitations on the input and the number of traces, and, more than that, they perform well even for error probabilities as high as $0.27$. The algorithms have been tested on simulated data as well as on data from previous DNA experiments and are shown to outperform all previous algorithms.
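One of the dynamic-programming building blocks mentioned above is the shortest common supersequence (SCS) of two traces; since a deletion channel only removes symbols, the original strand is always a common supersequence of its traces, so a short one is a natural estimate. Below is a standard two-trace SCS routine for illustration (the paper's algorithms extend such alignments across all $t$ traces):

    def shortest_common_supersequence(a, b):
        m, n = len(a), len(b)
        # dp[i][j] = length of the SCS of the suffixes a[i:] and b[j:].
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m, -1, -1):
            for j in range(n, -1, -1):
                if i == m:
                    dp[i][j] = n - j
                elif j == n:
                    dp[i][j] = m - i
                elif a[i] == b[j]:
                    dp[i][j] = 1 + dp[i + 1][j + 1]
                else:
                    dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
        # Trace back through the table to emit one optimal supersequence.
        out, i, j = [], 0, 0
        while i < m and j < n:
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif dp[i + 1][j] <= dp[i][j + 1]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        out.extend(a[i:]); out.extend(b[j:])
        return "".join(out)

    # Two deletion-channel traces of the (unknown) strand "ACGTACGT":
    print(shortest_common_supersequence("ACTACGT", "AGTACG"))  # a length-8 estimate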
Session 4B: Accelerating Applications I
Chair: Baris Kasikci
Digital-based Processing In-Memory for Acceleration of Unsupervised Learning
Mohsen Imani (UC Irvine); Saransh Gupta (UC San Diego); Yeseong Kim (UC San Diego); Tajana Rosing (UC San Diego)
Speaker: Mohsen Imani, University of California Irvine
Abstract Today's applications generate a large amount of data that need to be processed by learning algorithms. In practice, the majority of the data are not associated with any labels. Unsupervised learning, i.e., clustering methods, are the most commonly used algorithms for data analysis. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing speed due to a large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration, which supports a wide range of popular algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way. DUAL supports a wide range of essential operations and enables in-place computations, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms for a wide range of large-scale datasets. Our evaluation shows that DUAL provides a comparable quality to existing clustering algorithms while using a binary representation and a simplified distance metric. DUAL also provides 58.8× speedup and 251.2× energy efficiency improvement compared to the state-of-the-art solution running on a GPU.
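The encoding step that makes this possible can be illustrated in a few lines (a toy software model, not the PIM hardware; the random projection stands in for DUAL's high-dimensional mapping): once points become binary hypervectors, the clustering distance is a Hamming distance, which a crossbar memory can compute with bitwise operations:

    import numpy as np

    rng = np.random.default_rng(1)
    D, d = 1024, 16                          # hypervector vs. input dimensionality
    proj = rng.standard_normal((D, d))       # fixed random projection

    def encode(x):
        return (proj @ x > 0).astype(np.uint8)     # binary hypervector

    def hamming(a, b):
        return int(np.count_nonzero(a != b))       # memory-friendly distance

    x1, x2 = rng.random(d), rng.random(d)
    print(hamming(encode(x1), encode(x1 + 0.01)))  # nearby points: small distance
    print(hamming(encode(x1), encode(x2)))         # unrelated points: large distance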
High Precision In-Memory Computing for Deep Neural Network Acceleration
Mohsen Imani (University of California Irvine); Saransh Gupta (University of California San Diego); Yeseong Kim (University of California San Diego); Minxuan Zhou (University of California San Diego); Tajana Rosing (University of California San Diego)
Speaker: TBA
Abstract Processing In-Memory (PIM) has shown great potential to accelerate inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high precision computation, e.g., in floating point precision, which is essential for training accurate CNN models. In addition, most of the existing PIM approaches require analog/mixed-signal circuits, which do not scale, and exploit insufficiently reliable multi-bit Non-Volatile Memory (NVM). In this paper, we propose FloatPIM, a fully-digital scalable PIM architecture that accelerates CNNs in both training and testing phases. FloatPIM natively supports floating-point representation, thus enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce internal data movement of the PIM architecture. We evaluate the efficiency of FloatPIM on the ImageNet dataset using popular large-scale neural networks.
DRAM-less Accelerator for Energy Efficient Data Processing
Jie Zhang (KAIST); Gyuyoung Park (KAIST); David Donofrio (Lawrence Berkeley National Laboratory); John Shalf (Lawrence Berkeley National Laboratory); Myoungsoo Jung (KAIST)
Speaker: Jie Zhang, Korea Advanced Institute of Science and Technology (KAIST)
Abstract General-purpose hardware accelerators have become major data processing resources in many computing domains. However, the processing capability of hardware accelerators is often limited by costly software interventions and memory copies to support compulsory data movement between different processors and solid-state drives (SSDs). This in turn also wastes a significant amount of energy in modern accelerated systems. In this work, we propose DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications. We implement a new memory controller that connects a real 3x nm multi-partition PRAM to 28 nm FPGA logic cells and integrate the design into a real PCIe accelerator emulation platform. The evaluation results reveal that DRAM-less achieves, on average, 47% better performance than advanced acceleration approaches that use peer-to-peer DMA.
Cross-Layer Design Space Exploration of NVM-based Caches for Deep Learning
Ahmet Inci (Carnegie Mellon University); Mehmet Meric Isgenc (Carnegie Mellon University); Diana Marculescu (The University of Texas at Austin)
Speaker: Ahmet Inci, Carnegie Mellon University
Abstract Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2x and 2.3x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
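As a quick reminder of the metric, the energy-delay product multiplies the energy of an access by its latency, so a technology can win on EDP even while being slower; the arithmetic below uses made-up numbers purely to illustrate (not the paper's measurements):

    # Hypothetical per-access figures for a last-level cache.
    sram = {"energy_nJ": 100.0, "delay_ns": 10.0}   # EDP = 1000
    mram = {"energy_nJ": 40.0,  "delay_ns": 13.0}   # EDP = 520: slower, yet better EDP

    def edp(m):
        return m["energy_nJ"] * m["delay_ns"]

    print(edp(sram) / edp(mram))  # ~1.9x EDP reduction despite higher latency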
Sentinel: Efficient Tensor Migration and Allocation on Persistent Memory-based Heterogeneous Memory Systems for Deep Learning
Jie Ren (University of California, Merced);
Jiaolin Luo (University of California, Merced);
Kai Wu (University of California, Merced);
Minjia Zhang (Microsoft Research);
Hyeran Jeon (University of California, Merced);
Dong Li (University of California, Merced);
Sentinel: Efficient Tensor Migration and Allocation on Persistent Memory-based Heterogeneous Memory Systems for Deep Learning
Jie Ren (University of California, Merced); Jiaolin Luo (University of California, Merced); Kai Wu (University of California, Merced); Minjia Zhang (Microsoft Research); Hyeran Jeon (University of California, Merced); Dong Li (University of California, Merced);
Speaker: Jie Ren, University of California, Merced
Abstract Memory capacity is a major bottleneck for training deep neural networks (DNN). Heterogeneous memory (HM) combining fast and slow memories provides a promising direction to increase memory capacity. However, HM imposes challenges on tensor migration and allocation for high-performance DNN training. Prior work heavily relies on DNN domain knowledge, causes unnecessary tensor migration due to page-level false sharing, and wastes fast memory space. We present Sentinel, a software runtime system that automatically optimizes tensor management on HM. Sentinel uses dynamic profiling, and coordinates operating system (OS) and runtime-level profiling to bridge the semantic gap between OS and applications, which enables tensor-level profiling. This profiling enables co-allocating tensors with similar lifetime and memory access frequency into the same pages. Such fine-grained profiling and tensor collocation avoids unnecessary data movement, improves tensor movement efficiency, and enables larger-batch training because of the savings in fast memory space. Sentinel reduces fast memory consumption by 80% while retaining performance comparable to a fast-memory-only system; it consistently outperforms a state-of-the-art solution on CPU by 37% and two state-of-the-art solutions on GPU by 2x and 21%, respectively, in training throughput.
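The co-allocation idea lends itself to a short illustration. The sketch below is ours, not Sentinel's code: the names are invented, persistence and migration details are omitted, and each tensor is assumed to fit in one 4 KB page.

```cpp
// Illustrative sketch of lifetime/frequency-aware tensor co-allocation.
// Assumption: each tensor fits in one 4 KB page (for brevity).
#include <algorithm>
#include <cstddef>
#include <vector>

struct Tensor {
    std::size_t bytes;     // profiled size
    int         lifetime;  // profiled: distance between first and last use
    long        accesses;  // profiled: accesses per training iteration
};

constexpr std::size_t kPageBytes = 4096;

// Pack tensors into pages, never mixing lifetimes within a page; mixing is
// the page-level false sharing that pins whole pages in fast memory.
std::size_t pagesNeeded(std::vector<Tensor> ts) {
    std::sort(ts.begin(), ts.end(), [](const Tensor& a, const Tensor& b) {
        return a.lifetime != b.lifetime ? a.lifetime < b.lifetime
                                        : a.accesses > b.accesses;
    });
    std::size_t pages = 0, used = kPageBytes;  // force a fresh first page
    int group = -1;
    for (const Tensor& t : ts) {
        if (t.lifetime != group || used + t.bytes > kPageBytes) {
            ++pages;
            used = 0;
            group = t.lifetime;
        }
        used += t.bytes;
    }
    return pages;
}

int main() {
    // Two short-lived activations share a page; the long-lived weight does not.
    std::vector<Tensor> ts = {{1024, 1, 900}, {1024, 1, 800}, {2048, 50, 10}};
    return pagesNeeded(ts) == 2 ? 0 : 1;
}
```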
11:10 am – 12:00 pm
Networking Session
WEDNESDAY, MARCH 10
8:00 am – 8:40 am
Session 5A: Hybrid Memory
Chair: Dong Li (UC Merced)
Dancing in the Dark: Profiling for Tiered Memory
Jinyoung Choi (University of California, Riverside);
Sergey Blagodurov (Advanced Micro Devices (AMD));
Hung-Wei Tseng (University of California, Riverside);
Dancing in the Dark: Profiling for Tiered Memory
Jinyoung Choi (University of California, Riverside); Sergey Blagodurov (Advanced Micro Devices (AMD)); Hung-Wei Tseng (University of California, Riverside);
Speaker: Jinyoung Choi, University of California, Riverside
Abstract With the DDR standard facing density challenges and the emergence of non-volatile memory technologies such as Cross-Point, phase change, and fast FLASH media, compute and memory vendors are contending with a paradigm shift in the datacenter space. The decades-long status quo of designing servers with DRAM technology as an exclusive memory solution is likely coming to an end. Future systems will increasingly employ tiered memory architectures (TMAs) in which multiple memory technologies work together to satisfy applications' ever-growing demands for more memory, less latency, and greater bandwidth. Exactly how to expose each memory type to software is an open question. Recent systems have focused on hardware caching to leverage faster DRAM memory while exposing slower non-volatile memory to OS-addressable space. The hardware approach that deals with the non-uniformity of TMA, however, requires complex changes to the processor and cannot use fast memory to increase the system's overall memory capacity. Mapping an entire TMA as OS-visible memory alleviates the challenges of the hardware approach but pushes the burden of managing data placement in the TMA to the software layers. The software, however, does not see the memory accesses by default; in order to make informed memory-scheduling decisions, software must rely on hardware methods to gain visibility into the load/store address stream. The OS then uses this information to place data in the most suitable memory location. In the original paper, we evaluate different methods of memory-access collection and propose a hybrid tiered-memory approach that offers comprehensive visibility into TMA.
Investigating Hardware Caches for Terabyte-scale NVDIMMs
Julian T. Angeles (University of California, Davis);
Mark Hildebrand (University of California, Davis);
Venkatesh Akella (University of California, Davis);
Jason Lowe-Power (University of California, Davis);
Investigating Hardware Caches for Terabyte-scale NVDIMMs
Julian T. Angeles (University of California, Davis); Mark Hildebrand (University of California, Davis); Venkatesh Akella (University of California, Davis); Jason Lowe-Power (University of California, Davis);
Speaker: Julian T. Angeles, PhD, Department of Computer Science, UC Davis
Abstract Non-volatile memory (NVRAM) based on phase-change memory (such as the Optane DC Persistent Memory Module) is making its way into Intel servers to address the needs of emerging applications that have a huge memory footprint. These systems have both DRAM and NVRAM on the same memory channel, with the smaller-capacity DRAM serving as a cache to the larger-capacity NVRAM in the so-called 2LM mode. In this work, we perform a preliminary study of the performance of applications known for having diverse workload characteristics and irregular memory access patterns, using DRAM caches on real hardware. To accomplish this, we evaluate a variety of graph processing algorithms on large real-world graph inputs using Galois, a high performance shared memory graph analytics framework. We identify a few key characteristics of these large-scale, bandwidth-bound applications that DRAM caches do not account for, which prevent them from taking full advantage of PMM read and write bandwidth. We argue that software-based techniques are necessary for orchestrating the data movement to take full advantage of these new heterogeneous memory systems.
HMMU: A Hardware-based Hybrid Memory Management Unit
Fei Wen (Texas A&M University);
Mian Qin (Texas A&M University);
Paul Gratz (Texas A&M University);
Narasimha Reddy (Texas A&M University);
HMMU: A Hardware-based Hybrid Memory Management Unit
Fei Wen (Texas A&M University); Mian Qin (Texas A&M University); Paul Gratz (Texas A&M University); Narasimha Reddy (Texas A&M University);
Speaker: Fei Wen, Texas A&M University
Abstract Current mobile applications have rapidly growing memory footprints, posing a great challenge for memory system design. Insufficient DRAM main memory will incur frequent data swaps between memory and storage, a process that hurts performance, consumes energy, and deteriorates the write endurance of typical flash storage devices. Alternatively, a larger DRAM has higher leakage power and drains the battery faster. Further, DRAM scaling trends make further growth of DRAM in the mobile space prohibitive due to cost. Emerging non-volatile memory (NVM) has the potential to alleviate these issues due to its higher capacity per cost than DRAM and minimal static power. Recently, a wide spectrum of NVM technologies, including phase-change memories (PCM), memristor, and 3D XPoint have emerged. Despite the mentioned advantages, NVM has longer access latency compared to DRAM, and NVM writes can incur higher latencies and wear costs. Therefore, integration of these new memory technologies in the memory hierarchy requires a fundamental rearchitecting of traditional system designs. In this work, we propose a hardware-accelerated memory manager (HMMU) that manages both memory types in a flat address space, with a small partition of the DRAM reserved for sub-page block-level management. We design a set of data placement and data migration policies within this memory manager, such that we may exploit the advantages of each memory technology. By augmenting the system with this HMMU, we reduce the overall memory latency while also reducing writes to the NVM. Experimental results show that our design achieves a 39% reduction in energy consumption with only a 12% performance degradation versus an all-DRAM baseline that is likely untenable in the future.
Unbounded Hardware Transactional Memory for a Hybrid DRAM/NVM Memory System
Jungi Jeong (Purdue University);
Jaewan Hong (KAIST);
Seungryoul Maeng (KAIST);
Changhee Jung (Purdue University);
Youngjin Kwon (KAIST);
Unbounded Hardware Transactional Memory for a Hybrid DRAM/NVM Memory System
Jungi Jeong (Purdue University); Jaewan Hong (KAIST); Seungryoul Maeng (KAIST); Changhee Jung (Purdue University); Youngjin Kwon (KAIST);
Speaker: Jungi Jeong, Purdue University
Abstract Persistent memory programming requires failure atomicity. To achieve this in an efficient manner, recent proposals use hardware-based logging for atomic-durable updates and hardware transactional memory (HTM) for isolation. Although unbounded HTMs are promising for both performance and programmability reasons, none of the previous studies satisfies the practical requirements. They either require unrealistic hardware overheads or do not allow transactions to exceed on-chip cache boundaries. Furthermore, it has never been possible to use both DRAM and NVM in HTM, even though hybrid DRAM/NVM is becoming a popular persistency model. To this end, this study proposes UHTM, unbounded hardware transactional memory for DRAM and NVM hybrid memory systems. UHTM combines the cache coherence protocol and address signatures to detect conflicts in the entire memory space. This approach improves concurrency by significantly reducing the false-positive rates of previous studies. More importantly, UHTM allows both DRAM and NVM data to interact with each other in transactions without compromising the consistency guarantee. This is rendered possible by UHTM's hybrid version management that provides an undo-based log for DRAM and a redo-based log for NVM. The experimental results show that UHTM outperforms the state-of-the-art durable HTM, which is LLC-bounded, by 56% on average and up to 818%.
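The hybrid version management can be sketched in a few lines: undo logging for DRAM (update in place, roll back on abort) and redo logging for NVM (buffer updates, apply at commit). This is our simplification, with persistence instructions and conflict detection elided.

```cpp
// Sketch of hybrid version management (ours, not UHTM's implementation).
#include <cstdint>
#include <unordered_map>
#include <vector>

struct UndoRec { uint64_t* addr; uint64_t old; };

class HybridTx {
    std::vector<UndoRec> undo_;                     // volatile undo log (DRAM)
    std::unordered_map<uint64_t*, uint64_t> redo_;  // redo log for NVM
public:
    void writeDram(uint64_t* a, uint64_t v) {
        undo_.push_back({a, *a});       // log old value, update in place
        *a = v;
    }
    void writeNvm(uint64_t* a, uint64_t v) { redo_[a] = v; }  // defer
    uint64_t readNvm(uint64_t* a) {     // transactions read their own writes
        auto it = redo_.find(a);
        return it != redo_.end() ? it->second : *a;
    }
    void commit() {
        // Real systems persist the redo log first; replaying it after a
        // crash makes the NVM updates failure-atomic.
        for (auto& [a, v] : redo_) *a = v;
        undo_.clear(); redo_.clear();
    }
    void abort() {                      // roll DRAM back; drop the redo log
        for (auto it = undo_.rbegin(); it != undo_.rend(); ++it)
            *it->addr = it->old;
        undo_.clear(); redo_.clear();
    }
};

int main() {
    uint64_t d = 1, n = 2;
    HybridTx tx;
    tx.writeDram(&d, 10);
    tx.writeNvm(&n, 20);
    tx.commit();
    return (d == 10 && n == 20) ? 0 : 1;
}
```

The asymmetry mirrors the cost model: DRAM tolerates in-place writes, while deferring NVM writes keeps them out of the critical path until commit.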
Sparta: High-Performance, Element-Wise Sparse Tensor Contraction on Persistent Memory-based Heterogeneous Memory
Jiawen Liu (University of California, Merced);
Jie Ren (University of California, Merced);
Roberto Gioiosa (Pacific Northwest National Laboratory);
Dong Li (University of California, Merced);
Jiajia Li (Pacific Northwest National Laboratory);
Sparta: High-Performance, Element-Wise Sparse Tensor Contraction on Persistent Memory-based Heterogeneous Memory
Jiawen Liu (University of California, Merced); Jie Ren (University of California, Merced); Roberto Gioiosa (Pacific Northwest National Laboratory); Dong Li (University of California, Merced); Jiajia Li (Pacific Northwest National Laboratory);
Speaker: Jiawen Liu, University of California, Merced
Abstract Sparse tensor contractions appear commonly in many applications. Efficiently computing the contraction of two sparse tensors is challenging: it not only inherits the challenges of common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory access and unknown output size before computation, but also raises new challenges because of the high dimensionality of tensors, expensive multi-dimensional index search, and massive intermediate and output data. To address the above challenges, we introduce three optimization techniques: a multi-dimensional, efficient hash table representation for the accumulator and the larger input tensor, and all-stage parallelization. Evaluating with 15 datasets, we show that Sparta brings 28–576× speedup over traditional sparse tensor contraction with SPA. With our proposed algorithm- and memory-heterogeneity-aware data management, Sparta brings extra performance improvement on heterogeneous memory with DRAM and Intel Optane DC Persistent Memory Module (PMM) over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 30.7% (up to 98.5%), 10.7% (up to 28.3%) and 17% (up to 65.1%), respectively.
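The accumulator idea is easy to illustrate: because the output size is unknown before computation, products are accumulated into an index-keyed table. The sketch below is our simplification (an ordered map stands in for Sparta's multi-dimensional hash tables, and there is no parallelism).

```cpp
// Illustrative element-wise sparse contraction with a table accumulator.
#include <cstdio>
#include <map>
#include <vector>

using Coord = std::vector<int>;
struct Entry { Coord idx; double val; };

// Contract the last mode of X with the first mode of Y:
// Z[i..., j...] += X[i..., k] * Y[k, j...]
std::map<Coord, double> contract(const std::vector<Entry>& X,
                                 const std::vector<Entry>& Y) {
    // Index Y by its contraction mode so each X nonzero finds partners fast.
    std::map<int, std::vector<const Entry*>> byK;
    for (const Entry& y : Y) byK[y.idx.front()].push_back(&y);

    std::map<Coord, double> acc;  // accumulator: output index -> value
    for (const Entry& x : X) {
        auto it = byK.find(x.idx.back());
        if (it == byK.end()) continue;
        Coord out(x.idx.begin(), x.idx.end() - 1);
        for (const Entry* y : it->second) {
            Coord z = out;
            z.insert(z.end(), y->idx.begin() + 1, y->idx.end());
            acc[z] += x.val * y->val;   // output size unknown up front
        }
    }
    return acc;
}

int main() {
    // Two 2x2 sparse tensors: the contraction reduces to SpGEMM.
    std::vector<Entry> X = {{{0, 1}, 2.0}, {{1, 0}, 3.0}};
    std::vector<Entry> Y = {{{1, 0}, 4.0}, {{0, 1}, 5.0}};
    for (auto& [idx, v] : contract(X, Y))
        std::printf("Z[%d,%d] = %g\n", idx[0], idx[1], v);
}
```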
8:00 am – 8:40 am
Session 5B: Crash Consistent Recovery
Chair: Hung-Wei Tseng
Cross-Failure Bug Detection in Persistent Memory Programs
Sihang Liu (University of Virginia);
Korakit Seemakhupt (University of Virginia);
Yizhou Wei (University of Virginia);
Thomas Wenisch (University of Michigan);
Aasheesh Kolli (Pennsylvania State University);
Samira Khan (University of Virginia);
Cross-Failure Bug Detection in Persistent Memory Programs
Sihang Liu (University of Virginia); Korakit Seemakhupt (University of Virginia); Yizhou Wei (University of Virginia); Thomas Wenisch (University of Michigan); Aasheesh Kolli (Pennsylvania State University); Samira Khan (University of Virginia);
Abstract Persistent memory (PM) technologies, such as Intel’s Optane memory, deliver high performance, byte-addressability, and persistence, allowing programs to directly manipulate persistent data in memory without any OS intermediaries. An important requirement of these programs is that persistent data must remain consistent across a failure, which we refer to as the crash consistency guarantee. However, maintaining crash consistency is not trivial. We identify that a consistent recovery critically depends not only on the execution before the failure, but also on the recovery and resumption after failure. We refer to these stages as the pre- and post-failure execution stages. In order to holistically detect crash consistency bugs, we categorize the underlying causes behind inconsistent recovery due to incorrect interactions between the pre- and post-failure execution. First, a program is not crash-consistent if the post-failure stage reads from locations that are not guaranteed to be persisted in all possible access interleavings during the pre-failure stage — a type of programming error that leads to a race that we refer to as a cross-failure race. Second, a program is not crash-consistent if the post-failure stage reads persistent data that has been left semantically inconsistent during the pre-failure stage, such as a stale log or uncommitted data. We refer to this type of bug as a cross-failure semantic bug. Together, they form the cross-failure bugs in PM programs. In this work, we provide XFDetector, a tool that detects cross-failure bugs by automatically injecting failures into the pre-failure execution, and checking for cross-failure races and semantic bugs in the post-failure continuation. XFDetector has detected four new bugs in three pieces of PM software: one of PMDK’s examples, a PM-optimized Redis database, and a PMDK library function.
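The cross-failure race is easy to reproduce in a toy example. The fragment below is our own illustration using x86 persistence intrinsics (compile with -mclwb), not code from the paper; the alignas keeps the flag on its own cache line so flushing it does not incidentally persist the data.

```cpp
// Toy reproduction of a cross-failure race (our example, not the paper's).
#include <immintrin.h>

struct Record {
    long data;
    alignas(64) bool valid;   // separate cache line from data
};

void prefailure_buggy(Record* r) {      // r points into persistent memory
    r->data = 42;
    r->valid = true;                    // BUG: data has no writeback+fence
    _mm_clwb(&r->valid);
    _mm_sfence();
    // A failure here can leave valid == true with stale data in NVM.
}

void prefailure_fixed(Record* r) {
    r->data = 42;
    _mm_clwb(&r->data);                 // persist the data first...
    _mm_sfence();                       // ...and order it before the flag
    r->valid = true;
    _mm_clwb(&r->valid);
    _mm_sfence();
}

long postfailure_recover(const Record* r) {
    // Post-failure code trusts the flag; with the buggy writer this read
    // races with the pre-failure persist order: a cross-failure race.
    return r->valid ? r->data : -1;
}

int main() {
    Record r{0, false};
    prefailure_fixed(&r);
    return postfailure_recover(&r) == 42 ? 0 : 1;
}
```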
Towards Bug-free Persistent Memory Applications
Ian Neal (University of Michigan);
Andrew Quinn (University of Michigan);
Baris Kasikci (University of Michigan);
Towards Bug-free Persistent Memory Applications
Ian Neal (University of Michigan); Andrew Quinn (University of Michigan); Baris Kasikci (University of Michigan);
Speaker: Ian Neal, University of Michigan
Abstract Persistent Memory (PM) aims to revolutionize the storage-memory hierarchy, but programming these systems is error-prone. Our work investigates how to help developers write better, bug-free PM applications by automatically debugging them. We first perform a study of bugs in persistent memory applications to identify the opportunities and pain points of debugging these systems. Then, we discuss our work on AGAMOTTO, a generic and extensible system for automatically detecting PM bugs. Unlike existing tools that rely on extensive test cases or annotations, AGAMOTTO automatically detects bugs in PM systems by extending symbolic execution to model persistent memory. AGAMOTTO has so far identified 84 new bugs in 5 different PM applications and frameworks while incurring no false positives. We then discuss HIPPOCRATES, a system that automatically fixes bugs in PM systems. HIPPOCRATES “does no harm”: its fixes are guaranteed to fix a PM bug without introducing new bugs. We show that HIPPOCRATES produces fixes that are functionally equivalent to developer fixes, and that HIPPOCRATES' fixes have performance that rivals manually-developed code.
Corundum: Statically-Enforced Persistent Memory Safety
Morteza Hoseinzadeh (UC San Diego);
Steven Swanson (UC San Diego);
Corundum: Statically-Enforced Persistent Memory Safety
Morteza Hoseinzadeh (UC San Diego); Steven Swanson (UC San Diego);
Speaker: Morteza Hoseinzadeh, University of California, San Diego
Abstract Fast, byte-addressable, persistent main memories (PM) make it possible to build complex data structures that can survive system failures. Programming for PM is challenging, not least because it combines well-known programming challenges like locking, memory management, and pointer safety with novel PM-specific bug types. It also requires logging updates to PM to facilitate recovery after a crash. A misstep in any of these areas can corrupt data, leak resources, or prevent successful recovery after a crash. Existing PM libraries in a variety of languages – C, C++, Python, Java – simplify some of these areas, but they still require the programmer to learn (and flawlessly apply) complex rules to ensure correctness. Opportunities for data-destroying bugs abound. This paper presents Corundum, a Rust-based library with an idiomatic PM programming interface that leverages Rust’s type system to statically avoid most common PM programming bugs. Corundum lets programmers develop persistent data structures using familiar Rust constructs and have confidence that they are free of many types of bugs. We have implemented Corundum and found its performance to be as good as or better than Intel’s widely-used PMDK library.
Fast, Flexible and Comprehensive Bug Detection for Persistent Memory Programs
Bang Di (Hunan University);
Jiawen Liu (University of California, Merced);
Hao Chen (Hunan University);
Dong Li (University of California, Merced);
Fast, Flexible and Comprehensive Bug Detection for Persistent Memory Programs
Bang Di (Hunan University); Jiawen Liu (University of California, Merced); Hao Chen (Hunan University); Dong Li (University of California, Merced);
Speaker: Bang Di, Hunan University
Abstract Debugging persistent memory (PM) programs faces a fundamental tradeoff between performance overhead and bug coverage (comprehensiveness). Large performance overhead or limited bug coverage makes debugging infeasible or ineffective for PM programs. In this paper, we propose PMDebugger, a debugger that detects crash consistency bugs. Unlike prior work, PMDebugger is fast, flexible, and comprehensive for bug detection. The design of PMDebugger is driven by a characterization of how three fundamental operations in PM programs (store, cache writeback, and fence) typically happen. PMDebugger uses a hierarchical design composed of PM debugging-specific data structures, operations, and bug-detection algorithms (rules). We generalize nine rules to detect crash-consistency bugs for various PM persistency models. Compared with a state-of-the-art detector (XFDetector) and an industry-quality detector (Pmemcheck), PMDebugger leads to 49.3x and 3.4x speedup on average. Compared with another state-of-the-art detector (PMTest) optimized for high performance, PMDebugger achieves comparable performance without heavily relying on programmer annotations, and detects 38 more bugs than PMTest on ten applications. PMDebugger also identifies more bugs than XFDetector, Pmemcheck, and PMTest, detecting 19 new bugs in a real application (memcached) and two new bugs in Intel PMDK.
Tracking in Order to Recover - Detectable Recovery of Lock-Free Data Structures
Hagit Attiya (Technion);
Ohad Ben-Baruch (Ben-Gurion University);
Panagiota Fatourou (FORTH ICS and University of Crete, Greece);
Danny Hendler (Ben-Gurion University);
Eleftherios Kosmas (University of Crete, Greece);
Tracking in Order to Recover - Detectable Recovery of Lock-Free Data Structures
Hagit Attiya (Technion); Ohad Ben-Baruch (Ben-Gurion University); Panagiota Fatourou (FORTH ICS and University of Crete, Greece); Danny Hendler (Ben-Gurion University); Eleftherios Kosmas (University of Crete, Greece);
Speaker: Ohad Ben-Baruch, Ben-Gurion University
Abstract This paper presents the "tracking approach" for deriving detectably recoverable (and thus also durable) implementations of many widely-used concurrent data structures. Such data structures, satisfying detectable recovery, are appealing for emerging systems featuring byte-addressable non-volatile main memory (NVRAM), whose persistence makes it possible to efficiently resurrect failed processes after crashes. Their implementation is important because they are building blocks for the construction of simple, well-structured, sound and error-resistant multiprocessor systems. For instance, in many big-data applications, shared in-memory tree-based data indices are created for fast data retrieval and useful data analytics.
Building Fast Recoverable Persistent Data Structures
Haosen Wen (University of Rochester);
Wentao Cai (University of Rochester);
Mingzhe Du (University of Rochester);
Louis Jenkins (University of Rochester);
Benjamin Valpey (University of Rochester);
Michael L. Scott (University of Rochester);
Building Fast Recoverable Persistent Data Structures
Haosen Wen (University of Rochester); Wentao Cai (University of Rochester); Mingzhe Du (University of Rochester); Louis Jenkins (University of Rochester); Benjamin Valpey (University of Rochester); Michael L. Scott (University of Rochester);
Speaker: Haosen Wen, University of Rochester
Abstract The recent emergence of fast, dense, nonvolatile main memory suggests that certain long-lived data might remain in its natural pointer-rich format across program runs and hardware reboots. Operations on such data must be instrumented with explicit write-back and fence instructions to ensure consistency in the wake of a crash. Techniques to minimize the cost of this instrumentation are an active topic of research. We present what we believe to be the first general-purpose approach to building buffered durably linearizable persistent data structures, and a system, Montage, to support that approach. Montage is built on top of the Ralloc nonblocking persistent allocator. It employs a slow-ticking epoch clock, and ensures that no operation appears to span an epoch boundary. It also arranges to persist only that data minimally required to reconstruct the structure after a crash. If a crash occurs in epoch e, all work performed in epochs e and e−1 is lost, but work from prior epochs is preserved.
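A drastically simplified sketch of the epoch mechanism follows; this is our illustration, not Montage's code, and the real system additionally persists buffered state per epoch and keeps operations from spanning boundaries.

```cpp
// Simplified epoch-clock sketch (ours, not Montage's implementation).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> global_epoch{2};

void epoch_ticker() {                       // slow-ticking background clock
    for (;;) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        // Before advancing from e to e+1, buffered work from epoch e-1
        // would be persisted, so a crash loses at most two epochs.
        global_epoch.fetch_add(1, std::memory_order_release);
    }
}

struct Payload { uint64_t epoch; long value; };

void update(Payload* p, long v) {
    // Stamp the operation with the epoch it runs in; the real system also
    // ensures no operation appears to span an epoch boundary.
    p->epoch = global_epoch.load(std::memory_order_acquire);
    p->value = v;
}

bool survives_crash(const Payload& p, uint64_t crash_epoch) {
    // A crash in epoch e loses all work from epochs e and e-1.
    return p.epoch + 1 < crash_epoch;
}

int main() {
    Payload p{0, 0};
    update(&p, 7);                           // stamped with epoch 2
    return survives_crash(p, 5) ? 0 : 1;     // 2 <= 5 - 2: preserved
}
```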
8:40 am – 9:00 am
Break / Poster Session | GatherTown
9:00 am – 9:40 am
Session 6A: Hardware for Crash Consistency
Chair: Changhee Jung (Purdue)
ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory
Kai Wu (University of California, Merced);
Jie Ren (University of California, Merced);
Ivy Peng (Lawrence Livermore National Laboratory);
Dong Li (University of California, Merced);
ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory
Kai Wu (University of California, Merced); Jie Ren (University of California, Merced); Ivy Peng (Lawrence Livermore National Laboratory); Dong Li (University of California, Merced);
Speaker: Kai Wu, University of California, Merced
Abstract Failure-atomic transactions are a critical mechanism for accessing and manipulating data on persistent memory (PM) with crash consistency. We identify that small random writes in metadata modifications and locality-oblivious memory allocation in traditional PM transaction systems mismatch PM architecture. We present ArchTM, a PM transaction system based on two design principles: avoiding small writes and encouraging sequential writes. ArchTM is a variant of a copy-on-write (CoW) system that reduces write traffic to PM. Unlike conventional CoW schemes, ArchTM reduces metadata modifications through a scalable lookup table on DRAM. ArchTM introduces an annotation mechanism to ensure crash consistency and a locality-aware data path in memory allocation to increase coalescable writes inside PM devices. We evaluate ArchTM against four state-of-the-art transaction systems (one in PMDK, Romulus, DudeTM, and one from Oracle). ArchTM outperforms the competitor systems by 58x, 5x, 3x and 7x on average, using micro-benchmarks and real-world workloads on real PM.
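The CoW-plus-DRAM-metadata pattern can be sketched as follows. This is our illustration of the general shape, not ArchTM's implementation; flush and fence instructions are elided, and the names are invented.

```cpp
// Shape of a CoW transaction with DRAM-resident metadata (sketch).
#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

using ObjectId = unsigned long;

struct PmHeap {                          // toy bump allocator: PM writes
    std::vector<char> space = std::vector<char>(1 << 20);  // stay sequential
    std::size_t bump = 0;
    void* alloc(std::size_t n) {
        void* p = &space[bump];
        bump += n;
        return p;
    }
};

class CowTx {
    PmHeap& pm_;
    std::unordered_map<ObjectId, void*>& table_;  // DRAM lookup table
    std::unordered_map<ObjectId, void*> shadow_;  // this tx's fresh copies
public:
    CowTx(PmHeap& pm, std::unordered_map<ObjectId, void*>& t)
        : pm_(pm), table_(t) {}
    void writeCopy(ObjectId id, const void* src, std::size_t n) {
        void* fresh = pm_.alloc(n);      // sequential PM write, no in-place
        std::memcpy(fresh, src, n);
        shadow_[id] = fresh;
    }
    void commit() {                      // publish: DRAM-only updates, so no
        for (auto& [id, p] : shadow_)    // small random metadata writes hit PM
            table_[id] = p;
        shadow_.clear();                 // old copies become garbage
    }
};

int main() {
    PmHeap pm;
    std::unordered_map<ObjectId, void*> table;
    CowTx tx(pm, table);
    long v = 7;
    tx.writeCopy(1, &v, sizeof v);
    tx.commit();
    return table.count(1) == 1 ? 0 : 1;
}
```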
PMEM-Spec: Persistent Memory Speculation (Strict Persistency Can Trump Relaxed Persistency)
Jungi Jeong (Purdue University);
Changhee Jung (Purdue University);
PMEM-Spec: Persistent Memory Speculation (Strict Persistency Can Trump Relaxed Persistency)
Jungi Jeong (Purdue University); Changhee Jung (Purdue University);
Speaker: Jungi Jeong, Purdue University
Abstract Persistency models define the persist order that controls the order in which stores update persistent memory (PM). As with memory consistency, relaxed persistency models provide better performance than strict ones by relaxing the ordering constraints. To support such relaxed persistency models, previous studies resort to APIs for annotating the persist order in programs and hardware implementations for enforcing the programmer-specified order. However, this approach to relaxed persistency support imposes costly burdens on both architects and programmers. The goal of this study is to demonstrate that the strict persistency model can outperform the relaxed models with far less hardware complexity and programming difficulty. To achieve that, this paper presents PMEM-Spec, which speculatively allows any PM accesses without stalling or buffering, detecting ordering violations (i.e., misspeculation) for PM loads and stores. PMEM-Spec treats misspeculation as power failure and thus leverages failure-atomic transactions to recover from misspeculation by aborting and restarting them purposely. Since ordering violations rarely occur, PMEM-Spec can accelerate persistent memory accesses without significant misspeculation penalty. Experimental results show that PMEM-Spec outperforms epoch-based persistency models with the Intel x86 ISA and state-of-the-art hardware support by 27.2% and 10.6%, respectively.
Efficient Hardware-Assisted Out-of-Place Update for Non-Volatile Memory
Miao Cai (Nanjing University);
Chance C. Coats (University of Illinois at Urbana-Champaign);
Jeonghyun Woo (University of Illinois at Urbana-Champaign);
Jian Huang (University of Illinois at Urbana-Champaign);
Efficient Hardware-Assisted Out-of-Place Update for Non-Volatile Memory
Miao Cai (Nanjing University); Chance C. Coats (University of Illinois at Urbana-Champaign); Jeonghyun Woo (University of Illinois at Urbana-Champaign); Jian Huang (University of Illinois at Urbana-Champaign);
Speaker: Jian Huang, University of Illinois at Urbana-Champaign
Abstract Byte-addressable non-volatile memory (NVM) is a promising technology that provides near-DRAM performance with scalable memory capacity. However, it requires atomic data durability to ensure memory persistency. Therefore, many techniques, including logging and shadow paging, have been proposed; most of them, however, either introduce extra write traffic to NVM or suffer from significant performance overhead on the critical path of program execution, or both. In this paper, we propose a transparent and efficient hardware-assisted out-of-place update (Hoop) mechanism that supports atomic data durability without incurring many extra writes or much performance overhead. The key idea is to write the updated data to a new place in NVM, while retaining the old data until the updated data becomes durable. To support this, we develop a lightweight indirection layer in the memory controller to enable efficient address translation and adaptive garbage collection for NVM. We evaluate Hoop with a variety of popular data structures and data-intensive applications, including key-value stores and databases. Our evaluation shows that Hoop achieves low critical-path latency with small write amplification, close to that of a native system without persistence support. Compared with state-of-the-art crash-consistency techniques, it improves application performance by up to 1.7X while reducing write amplification by up to 2.1X. Hoop also demonstrates scalable data-recovery capability on multi-core systems.
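A software model of the out-of-place mechanism (which Hoop realizes inside the memory controller, below the software's view) might look like the following sketch; names and structure are ours, for illustration only.

```cpp
// Out-of-place update via an indirection (remap) table, modeled in software.
#include <cstdint>
#include <unordered_map>
#include <vector>

class OutOfPlaceNvm {
    std::vector<uint64_t> media_ = std::vector<uint64_t>(1 << 16);
    std::unordered_map<uint64_t, uint64_t> remap_;  // logical -> physical
    std::vector<uint64_t> freeList_;                // blocks for GC to reuse
    uint64_t next_ = 0;
    uint64_t fresh() {
        if (!freeList_.empty()) {
            uint64_t p = freeList_.back();
            freeList_.pop_back();
            return p;
        }
        return next_++;
    }
public:
    void write(uint64_t logical, uint64_t value) {
        uint64_t phys = fresh();
        media_[phys] = value;                 // new version made durable first
        auto it = remap_.find(logical);
        if (it != remap_.end())
            freeList_.push_back(it->second);  // old version becomes garbage
        remap_[logical] = phys;               // atomic redirect = commit point
    }
    uint64_t read(uint64_t logical) {
        auto it = remap_.find(logical);
        return it == remap_.end() ? 0 : media_[it->second];
    }
};

int main() {
    OutOfPlaceNvm nvm;
    nvm.write(5, 42);
    nvm.write(5, 43);   // old block reclaimed; reads see the new version
    return nvm.read(5) == 43 ? 0 : 1;
}
```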
TSOPER: Efficient Coherence-Based Strict Persistency
Per Ekemark (Uppsala University, Sweden);
Yuan Yao (Uppsala University, Sweden);
Alberto Ros (Universidad de Murcia, Spain);
Konstantinos Sagonas (Uppsala University, Sweden and National Technical Univ. of Athens, Greece);
Stefanos Kaxiras (Uppsala University, Sweden);
TSOPER: Efficient Coherence-Based Strict Persistency
Per Ekemark (Uppsala University, Sweden); Yuan Yao (Uppsala University, Sweden); Alberto Ros (Universidad de Murcia, Spain); Konstantinos Sagonas (Uppsala University, Sweden and National Technical Univ. of Athens, Greece); Stefanos Kaxiras (Uppsala University, Sweden);
Speaker: Per Ekemark, Uppsala University, Sweden
Abstract We propose a novel approach for hardware-based strict TSO persistency, called TSOPER. We allow a TSO persistency model to freely coalesce values in the caches by forming atomic groups of cache lines to be persisted. A group persist is initiated for an atomic group if any of its newly written values are exposed to the outside world. A key difference with prior work is that our architecture is based on the concept of a TSO persist buffer that sits in parallel to the shared LLC and persists atomic groups directly from private caches to NVM, bypassing the coherence serialization of the LLC. To impose dependencies among atomic groups that are persisted from the private caches to the TSO persist buffer, we introduce a sharing-list coherence protocol that naturally captures the order of coherence operations in its sharing lists, and thus can reconstruct the dependencies among different atomic groups entirely at the private cache level without involving the shared LLC. The combination of the sharing-list coherence and the TSO persist buffer allows persist operations and writes to non-volatile memory to happen in the background and trail the coherence operations. Coherence runs ahead at full speed; persistency follows belatedly. Our evaluation shows that TSOPER provides the same level of reordering as a program-driven relaxed model and hence approximately the same level of performance, albeit without needing the programmer or compiler to be concerned about false sharing, data-race-free semantics, etc., and while guaranteeing that all software that can run on top of TSO automatically persists in TSO.
Characterizing non-volatile memory transactional systems
Pradeep Fernando (Georgia Tech);
Irina Calciu (VMware);
Jayneel Gandhi (VMware);
Aasheesh Kolli (Penn State);
Ada Gavrilovska (Georgia Tech);
Characterizing non-volatile memory transactional systems
Pradeep Fernando (Georgia Tech); Irina Calciu (VMware); Jayneel Gandhi (VMware); Aasheesh Kolli (Penn State); Ada Gavrilovska (Georgia Tech);
Speaker: Pradeep Fernando, Georgia Institute of Technology
Abstract Emerging non-volatile memory (NVM) technologies promise memory-speed, byte-addressable persistent storage with a load/store interface. However, programming applications to directly manipulate NVM data is complex and error-prone. Applications generally employ libraries that hide the low-level details of the hardware and provide a transactional programming model to achieve crash consistency. Furthermore, applications continue to expect correctness during concurrent executions, achieved through the use of synchronization. Together, these requirements amount to the well-known ACID guarantees. Realizing them, however, presents designers of transactional systems with a range of choices in how to combine several low-level techniques, given target hardware features and workload characteristics. This presentation will discuss the tradeoffs associated with these choices and present a detailed experimental analysis performed across a range of single- and multi-threaded workloads using a simulation environment and real PMEM hardware.
9:00 am – 9:40 am
Session 6B: Accelerating Applications II
Chair: Jishen Zhao
Building Scalable Dynamic Hash Tables on Persistent Memory
Baotong Lu (The Chinese University of Hong Kong);
Xiangpeng Hao (Simon Fraser University);
Tianzheng Wang (Simon Fraser University);
Eric Lo (The Chinese University of Hong Kong);
Building Scalable Dynamic Hash Tables on Persistent Memory
Baotong Lu (The Chinese University of Hong Kong); Xiangpeng Hao (Simon Fraser University); Tianzheng Wang (Simon Fraser University); Eric Lo (The Chinese University of Hong Kong);
Speaker: Baotong Lu, The Chinese University of Hong Kong
Abstract Byte-addressable persistent memory (PM) brings hash tables the potential of low latency, cheap persistence, and instant recovery. The recent advent of Intel Optane DC Persistent Memory Modules (DCPMM) further accelerates this trend. Many new hash table designs have been proposed, but most of them were based on emulation and perform sub-optimally on real PM. They were also piece-wise and partial solutions that side-step many important properties, in particular good scalability, high load factor, and instant recovery. We present Dash, a holistic approach to building dynamic and scalable hash tables on real PM hardware with all the aforementioned properties. Based on Dash, we adapted two popular dynamic hashing schemes (extendible hashing and linear hashing). On a 24-core machine with Intel Optane DCPMM, we show that compared to the state-of-the-art, Dash-enabled hash tables can achieve up to ~3.9× higher performance with up to over 90% load factor and an instant recovery time of 57 ms regardless of data size.
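One widely used ingredient of PM hash tables in this family is bucket-level fingerprinting, illustrated below; the layout and constants are our sketch, not Dash's actual bucket format.

```cpp
// Sketch of bucket fingerprinting for a PM-friendly hash table.
#include <cstdint>
#include <optional>
#include <string>

struct Bucket {
    static constexpr int kSlots = 14;
    uint8_t fp[kSlots] = {};   // 1-byte fingerprints: one cache-line scan
    bool used[kSlots] = {};
    std::string keys[kSlots];
    uint64_t vals[kSlots];
};

uint8_t fingerprint(uint64_t h) { return static_cast<uint8_t>(h >> 56); }

std::optional<uint64_t> lookup(const Bucket& b, const std::string& k,
                               uint64_t h) {
    uint8_t f = fingerprint(h);
    for (int i = 0; i < Bucket::kSlots; ++i)
        // The cheap fingerprint test filters out almost all slots before
        // any full key comparison has to touch PM.
        if (b.used[i] && b.fp[i] == f && b.keys[i] == k) return b.vals[i];
    return std::nullopt;
}

bool insert(Bucket& b, const std::string& k, uint64_t h, uint64_t v) {
    for (int i = 0; i < Bucket::kSlots; ++i)
        if (!b.used[i]) {
            b.keys[i] = k; b.vals[i] = v; b.fp[i] = fingerprint(h);
            b.used[i] = true;   // real PM code persists slot before flag
            return true;
        }
    return false;               // full: a real design splits or stashes
}

int main() {
    Bucket b;
    const uint64_t h = 0xABCDEF1234567890ULL;
    insert(b, "key", h, 7);
    return lookup(b, "key", h).value_or(0) == 7 ? 0 : 1;
}
```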
Performance Prediction of Graph Analytics on Persistent Memory
Diego Braga (UFBA);
Daniel Mosse (PITT);
Vinicius Petrucci (University of Pittsburgh & UFBA);
Performance Prediction of Graph Analytics on Persistent Memory
Diego Braga (UFBA); Daniel Mosse (PITT); Vinicius Petrucci (University of Pittsburgh & UFBA);
Speaker: Diego Moura, Federal University of Bahia
Abstract Considering a system with heterogeneous memory (DRAM and PMEM, in this case Intel Optane), the problem we address is deciding which application to allocate to each type of memory. We built a model that estimates the impact of running an application on Intel Optane using performance counters from previous runs on DRAM. From this model we derive an offline application placement scheme for the context of heterogeneous memories. Our results show that judicious allocation can yield average reductions of 22% and 120% in the makespan and degradation metrics, respectively.
Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores
Shin-Yeh Tsai (Facebook);
Yizhou Shan (University of California, San Diego);
Yiying Zhang (University of California, San Diego);
Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores
Shin-Yeh Tsai (Facebook); Yizhou Shan (University of California, San Diego); Yiying Zhang (University of California, San Diego);
Abstract Many datacenters and clouds manage storage systems separately from computing services for better manageability and resource utilization. These existing disaggregated storage systems use hard disks or SSDs as storage media. Recently, the technology of persistent memory (PM) has matured and seen initial adoption in several datacenters. Disaggregating PM could enjoy the same benefits of traditional disaggregated storage systems, but it requires new designs because of its memory-like performance and byte addressability. In this paper, we explore the design of disaggregating PM and managing them remotely from compute servers, a model we call passive disaggregated persistent memory, or pDPM. Compared to the alternative of managing PM at storage servers, pDPM significantly lowers monetary and energy costs and avoids scalability bottlenecks at storage servers. We built three key-value store systems using the pDPM model. The first one lets all compute nodes directly access and manage storage nodes. The second uses a central coordinator to orchestrate the communication between compute and storage nodes. These two systems have various performance and scalability limitations. To solve these problems, we built Clover, a pDPM system that separates the location, communication mechanism, and management strategy of the data plane and the metadata/control plane.
Making Volatile Index Structures Persistent using TIPS
R. Madhava Krishnan (Virginia Tech);
Wook-Hee Kim (Virginia Tech);
Hee Won Lee (Consultant);
Minsung Jang (Perspecta Labs);
Sumit Kumar Monga (Virginia Tech);
Ajit Mathew (Virginia Tech);
Changwoo Min (Virginia Tech);
Making Volatile Index Structures Persistent using TIPS
R. Madhava Krishnan (Virginia Tech); Wook-Hee Kim (Virginia Tech); Hee Won Lee (Consultant); Minsung Jang (Perspecta Labs); Sumit Kumar Monga (Virginia Tech); Ajit Mathew (Virginia Tech); Changwoo Min (Virginia Tech);
Speaker: R. Madhava Krishnan, Virginia Tech
Abstract We propose TIPS – a generic framework that systematically converts volatile indexes to their persistent counterparts. Any volatile index can be plugged into the TIPS framework and become persistent with only minimal source code changes. TIPS neither places restrictions on the concurrency model nor requires in-depth knowledge of the volatile index. TIPS supports a strong consistency guarantee, i.e., durable linearizability, and internally handles persistent memory leaks across crashes. TIPS relies on a novel DRAM-NVMM tiering to achieve high performance and good scalability, and uses a hybrid logging technique called UNO logging to minimize the crash consistency overhead. We converted seven different indexes and a real-world key-value store application using TIPS and evaluated them using YCSB workloads. TIPS-enabled indexes outperform the state-of-the-art persistent indexes significantly, in addition to offering many other benefits that existing persistent indexes do not provide.
HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Persistent Memory-based Heterogeneous Memory
Jie Ren (University of California, Merced);
Minjia Zhang (Microsoft Research);
Dong Li (University of California, Merced);
HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Persistent Memory-based Heterogeneous Memory
Jie Ren (University of California, Merced); Minjia Zhang (Microsoft Research); Dong Li (University of California, Merced);
Speaker: Jie Ren, University of California, Merced
Abstract The state-of-the-art approximate nearest neighbor search (ANNS) algorithms face a fundamental tradeoff between query latency and accuracy because of small main memory capacity: to store indices in main memory for fast query response, they have to limit the number of data points or store compressed vectors, which hurts search accuracy. The emergence of heterogeneous memory (HM) brings opportunities to largely increase memory capacity and break the above tradeoff: using HM, billions of data points can be placed in main memory on a single machine without using any data compression. However, HM consists of both fast (but small) memory and slow (but large) memory, and using HM inappropriately slows down query time significantly. In this work, we present a novel graph-based similarity search algorithm called HM-ANN, which takes both memory and data heterogeneity into consideration and enables billion-scale similarity search on a single node without using compression. On two billion-sized datasets BIGANN and DEEP1B, HM-ANN outperforms state-of-the-art compression-based solutions such as L&C and IMI+OPQ in recall-vs-latency by a large margin, obtaining 46% higher recall under the same search latency. We also extend existing graph-based methods such as HNSW and NSG with two strong baseline implementations on HM. At billion-point scale, HM-ANN is 2X and 5.8X faster than our HNSW and NSG baselines respectively to reach the same accuracy.
9:40 am – 10:00 am
Break / Poster Session | GatherTown
10:00 am – 10:50 am
Award Finalists NVMW 2020
Chair: Steven Swanson
PMTest: A Fast and Flexible Testing Framework for Persistent Memory Programs
Sihang Liu (University of Virginia);
Yizhou Wei (University of Virginia);
Jishen Zhao (UC San Diego);
Aasheesh Kolli (Penn State University & VMware Research);
Samira Khan (University of Virginia);
PMTest: A Fast and Flexible Testing Framework for Persistent Memory Programs
Sihang Liu (University of Virginia); Yizhou Wei (University of Virginia); Jishen Zhao (UC San Diego); Aasheesh Kolli (Penn State University & VMware Research); Samira Khan (University of Virginia);
Speaker: Sihang Liu, University of Virginia
Abstract Recent non-volatile memory technologies such as 3D XPoint and NVDIMMs have enabled persistent memory (PM) systems that can manipulate persistent data directly in memory. This advancement of memory technology has spurred the development of a new set of crash-consistent software (CCS) for PM - applications that can recover persistent data from memory in a consistent state in the event of a crash (e.g., power failure). CCS developed for persistent memory ranges from kernel modules to user-space libraries and custom applications. However, ensuring crash consistency in CCS is difficult and error-prone. Programmers typically employ low-level hardware primitives or transactional libraries to enforce ordering and durability guarantees that are required for ensuring crash consistency. Unfortunately, hardware can reorder instructions at runtime, making it difficult for the programmers to test whether the implementation enforces the correct ordering and durability guarantees. We believe that there is an urgent need for developing a testing framework that helps programmers identify crash consistency bugs in their CCS. We find that prior testing tools lack generality, i.e., they work only for one specific CCS or memory persistency model and/or introduce significant performance overhead. To overcome these drawbacks, we propose PMTest, a crash consistency testing framework that is both flexible and fast. PMTest provides flexibility by providing two basic assertion-like software checkers to test two fundamental characteristics of all CCS: the ordering and durability guarantee. These checkers can also serve as the building blocks of other application-specific, high-level checkers. PMTest enables fast testing by deducing the persist order without exhausting all possible orders. In the evaluation with eight programs, PMTest not only identified 45 synthetic crash consistency bugs, but also detected 3 new bugs in a file system (PMFS) and in applications developed using a transactional library (PMDK), while on average being 7.1× faster than the state-of-the-art tool.
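To make the two checkers concrete, here is a toy model in the spirit of the abstract; the names and the trace-based mechanism are our illustration, not PMTest's actual interface, which deduces the persist order without exhausting all possible orders.

```cpp
// Toy model of assertion-like durability and ordering checkers (ours).
#include <cassert>
#include <cstdint>
#include <map>

struct PersistTrace {
    std::map<const void*, uint64_t> persistedAt;  // addr -> logical time
    uint64_t now = 0;
    void recordPersist(const void* a) { persistedAt[a] = ++now; }

    // Durability checker: has `a` been persisted at all?
    bool isPersisted(const void* a) const {
        return persistedAt.count(a) != 0;
    }
    // Ordering checker: was `a` persisted strictly before `b`?
    bool isOrderedBefore(const void* a, const void* b) const {
        auto ia = persistedAt.find(a);
        auto ib = persistedAt.find(b);
        return ia != persistedAt.end() && ib != persistedAt.end() &&
               ia->second < ib->second;
    }
};

int main() {
    long data = 0;
    bool commit = false;
    PersistTrace t;
    t.recordPersist(&data);    // e.g., after clwb+sfence on the data
    t.recordPersist(&commit);  // then on the commit flag
    assert(t.isPersisted(&data));
    assert(t.isOrderedBefore(&data, &commit));  // log-before-commit rule
}
```

Application-specific, higher-level checkers can then be composed from these two primitives, which is the flexibility argument the abstract makes.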
Semi-Asymmetric Parallel Graph Algorithms for NVRAMs
Laxman Dhulipala (Carnegie Mellon University);
Charles McGuffey (Carnegie Mellon University);
Hongbo Kang (Carnegie Mellon University);
Yan Gu (UC Riverside);
Guy E. Blelloch (Carnegie Mellon University);
Phillip B. Gibbons (Carnegie Mellon University);
Julian Shun (Massachusetts Institute of Technology);
Semi-Asymmetric Parallel Graph Algorithms for NVRAMs
Laxman Dhulipala (Carnegie Mellon University); Charles McGuffey (Carnegie Mellon University); Hongbo Kang (Carnegie Mellon University); Yan Gu (UC Riverside); Guy E. Blelloch (Carnegie Mellon University); Phillip B. Gibbons (Carnegie Mellon University); Julian Shun (Massachusetts Institute of Technology);
Speaker: Laxman Dhulipala, Carnegie Mellon University
Abstract Emerging non-volatile main memory (NVRAM) technologies provide novel features for large-scale graph analytics, combining byte-addressability, low idle power, and improved memory-density. Systems are likely to have an order of magnitude more NVRAM than traditional memory (DRAM), allowing large graph problems to be solved efficiently at a modest cost on a single machine. However, a significant challenge in achieving high performance is in accounting for the fact that NVRAM writes can be significantly more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics in which the graph is stored as a read-only data structure (in NVRAM), and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM), but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. We present a model, the Parallel Semi-Asymmetric Model (PSAM), to analyze algorithms in this setting, and run experiments on a 48-core NVRAM system to validate the effectiveness of these algorithms. To this end, we study over a dozen graph problems. We develop parallel algorithms for each that are efficient, and often work-optimal, in this model. Experimentally, we run all of the algorithms on the largest publicly-available graph and show that our PSAM algorithms outperform the fastest prior algorithms designed for DRAM or NVRAM. We also show that our algorithms running on NVRAM nearly match the fastest prior algorithms running solely in DRAM, by effectively hiding the costs of repeatedly accessing NVRAM versus DRAM.
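The semi-asymmetric discipline is easy to see in code: keep the graph in a read-only CSR (in NVRAM) and confine all mutable state to O(|V|) arrays (in DRAM). The sequential BFS below is our minimal illustration of that discipline, not one of the paper's parallel, work-optimal algorithms.

```cpp
// Semi-asymmetric sketch: read-only CSR graph, O(|V|) mutable DRAM state.
// This traversal issues no writes to the edge arrays (the NVRAM side).
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct Csr {                        // read-only, resident in NVRAM
    std::vector<int64_t> offsets;   // |V| + 1 entries
    std::vector<int32_t> edges;     // |E| entries
};

std::vector<int32_t> bfs(const Csr& g, int32_t src) {
    const std::size_t n = g.offsets.size() - 1;
    std::vector<int32_t> parent(n, -1);   // mutable, proportional to |V|
    std::queue<int32_t> frontier;
    parent[src] = src;
    frontier.push(src);
    while (!frontier.empty()) {
        int32_t u = frontier.front();
        frontier.pop();
        for (int64_t e = g.offsets[u]; e < g.offsets[u + 1]; ++e) {
            int32_t v = g.edges[e];       // reads from NVRAM only
            if (parent[v] == -1) {
                parent[v] = u;            // writes stay in DRAM
                frontier.push(v);
            }
        }
    }
    return parent;
}

int main() {
    Csr g{{0, 1, 2, 2}, {1, 2}};          // chain 0 -> 1 -> 2
    return bfs(g, 0)[2] == 1 ? 0 : 1;
}
```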
Error-Correcting WOM Codes for Worst-Case and Random Errors
Amit Solomon (Technion);
Yuval Cassuto (Technion);
Error-Correcting WOM Codes for Worst-Case and Random Errors
Amit Solomon (Technion); Yuval Cassuto (Technion);
Speaker: Amit Solomon, Technion - Israel Institute of Technology
Abstract We construct error-correcting WOM (write-once memory) codes that guarantee correction of any specified number of errors in q-level memories. The constructions use suitably designed short q-ary WOM codes and concatenate them with outer error-correcting codes over different alphabets, using suitably designed mappings. In addition to constructions for guaranteed error correction, we develop an error-correcting WOM scheme for random errors using the concept of multi-level coding.
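As background, the classic Rivest-Shamir (3,2) WOM code is the archetype of the short inner WOM codes such constructions build on (here with q = 2). The sketch below is our textbook illustration, not the paper's construction: 2 bits are stored twice in 3 write-once cells, which may only flip 0 to 1.

```cpp
// The classic Rivest-Shamir (3,2) WOM code (our illustration).
#include <cassert>

const int first[4] = {0b000, 0b001, 0b010, 0b100};  // first write: weight <= 1

int write1(int msg) { return first[msg]; }

// Second write uses the complement patterns (weight >= 2); cells only ever
// flip 0 -> 1, which works whenever the message actually changes.
int write2(int msg, int cur) {
    int target = first[msg] ^ 0b111;
    assert((cur & target) == cur);      // no 1 -> 0 transition required
    return target;
}

int decode(int cells) {
    int w = __builtin_popcount(cells);          // GCC/Clang builtin
    int pat = (w <= 1) ? cells : cells ^ 0b111; // undo the complement
    for (int m = 0; m < 4; ++m)
        if (first[m] == pat) return m;
    return -1;
}

int main() {
    int c = write1(2);          // store 10: cells become 010
    assert(decode(c) == 2);
    c = write2(3, c);           // rewrite to 11: cells become 011
    assert(decode(c) == 3);
}
```

The paper's schemes concatenate short WOM codes of this flavor with outer error-correcting codes so that a specified number of errors can also be corrected.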
Efficient Architectures for Generalized Integrated Interleaved Decoder
Xinmiao Zhang (The Ohio State University);
Zhenshan Xie (The Ohio State University);
Efficient Architectures for Generalized Integrated Interleaved Decoder
Xinmiao Zhang (The Ohio State University); Zhenshan Xie (The Ohio State University);
Speaker: Zhenshan Xie, The Ohio State University
Abstract Generalized integrated interleaved (GII) codes nest short sub-codewords to generate parities shared by the sub-codewords. They allow hyper-speed decoding with excellent correction capability, and are essential to next-generation data storage. On the other hand, the hardware implementation of GII decoders faces many challenges, including low achievable clock frequency and large silicon area. This abstract presents novel algorithmic reformulations and architectural transformations to address each bottleneck. For an example GII code that has the same rate and length as eight un-nested (255, 223) Reed-Solomon (RS) codes, our GII decoder only has 30% area overhead compared to the RS decoder while achieving 7 orders of magnitude lower frame error rate. Its critical path only consists of 7 XOR gates, and can easily achieve more than 40GByte/s throughput.
SOLQC: Synthetic Oligo Library Quality Control Tool
Omer Sabary (Technion – Israel Institute of Technology);
Yoav Orlev (Interdisciplinary Center Herzliya);
Roy Shafir (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya);
Leon Anavy (Technion – Israel Institute of Technology);
Eitan Yaakobi (Technion – Israel Institute of Technology);
Zohar Yakhini (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya);
SOLQC: Synthetic Oligo Library Quality Control Tool
Omer Sabary (Technion – Israel Institute of Technology); Yoav Orlev (Interdisciplinary Center Herzliya); Roy Shafir (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya); Leon Anavy (Technion – Israel Institute of Technology); Eitan Yaakobi (Technion – Israel Institute of Technology); Zohar Yakhini (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya);
Speaker: Omer Sabary, University of California, San Diego
Abstract DNA-based storage has attracted significant attention due to recent demonstrations of the viability of storing information in macromolecules using synthetic oligo libraries. As DNA storage experiments, as well as other experiments involving synthetic oligo libraries, grow in number and complexity, analysis tools can facilitate quality control and help in assessment and inference. We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on next generation sequencing (NGS) analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates, and their dependence on sequence or library properties. SOLQC produces graphical descriptions of the analysis results, reported in a flexible report format. We demonstrate SOLQC by analyzing literature libraries. We also discuss the potential benefits and relevance of the different components of the analysis.
10:50 am – 11:00 am
Concluding Remarks
11:00 am – 11:30 am
Networking Session
LIST OF POSTERS
Thread-specific Database Buffer Management in Multi-core NVM Storage Environments
Tsuyoshi Ozawa (Institute of Industrial Science, The University of Tokyo);
Yuto Hayamizu (Institute of Industrial Science, The University of Tokyo);
Kazuo Goda (Institute of Industrial Science, The University of Tokyo);
Masaru Kitsuregawa (Institute of Industrial Science, The University of Tokyo);
Thread-specific Database Buffer Management in Multi-core NVM Storage Environments
Tsuyoshi Ozawa (Institute of Industrial Science, The University of Tokyo); Yuto Hayamizu (Institute of Industrial Science, The University of Tokyo); Kazuo Goda (Institute of Industrial Science, The University of Tokyo); Masaru Kitsuregawa (Institute of Industrial Science, The University of Tokyo);
Speaker: Tsuyoshi Ozawa, The University of Tokyo
Abstract Database buffer management is a cornerstone of modern database management systems (DBMS). So far, a *shared buffer* strategy has been widely employed to improve cache efficiency and reduce the IO workload. However, it involves a significant processing overhead induced by inter-thread synchronization, and thus fails to exploit the potential bandwidth that recent non-volatile memory (NVM) storage devices offer. This paper proposes to employ a *separated buffer* strategy. Under this strategy, the database buffer manager achieves significantly higher throughput, even though it may produce an extra amount of IO workload. In recent multi-core NVM storage environments, separated buffer performs faster in query processing than shared buffer. This paper presents our experimental study with the TPC-H dataset on two different NVM machines, demonstrating that separated buffer achieves up to 1.47 million IOPS and performs up to 637% faster in query processing than shared buffer.
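The contrast between the two strategies can be sketched as follows; the structures are our illustration, not the paper's implementation, and a real buffer manager would also pin pages and issue NVM reads on misses.

```cpp
// Shared vs. thread-specific (separated) buffer pools, in miniature.
#include <mutex>
#include <unordered_map>
#include <vector>

using PageId = long;
struct Page { std::vector<char> bytes; };

struct SharedBufferPool {                   // one pool, one lock
    std::mutex m;
    std::unordered_map<PageId, Page> cache;
    Page& fetch(PageId id) {
        std::lock_guard<std::mutex> g(m);   // the contention point at scale
        return cache[id];
    }
};

struct SeparatedBufferPool {                // one private pool per thread
    Page& fetch(PageId id) {
        thread_local std::unordered_map<PageId, Page> cache;  // no locking
        // The same page may be cached by several threads: extra IO that
        // multi-channel NVM bandwidth can absorb.
        return cache[id];
    }
};

int main() {
    SharedBufferPool s;
    SeparatedBufferPool t;
    s.fetch(1);
    t.fetch(1);
}
```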
Leveraging Intel Optane for HPC workflows
Ranjan Sarpangala Venkatesh (Georgia Institute of Technology);
Tony Mason (University of British Columbia);
Pradeep Fernando (Georgia Institute of Technology);
Greg Eisenhauer (Georgia Institute of Technology);
Ada Gavrilovska (Georgia Institute of Technology);
Leveraging Intel Optane for HPC workflows
Ranjan Sarpangala Venkatesh (Georgia Institute of Technology); Tony Mason (University of British Columbia); Pradeep Fernando (Georgia Institute of Technology); Greg Eisenhauer (Georgia Institute of Technology); Ada Gavrilovska (Georgia Institute of Technology);
Speaker: Ranjan Sarpangala Venkatesh, Georgia Institute of Technology
Abstract High Performance Computing (HPC) workloads demand ever-growing data volumes, which gives rise to data movement challenges. In-situ execution of HPC workflows coupling simulation and analytics applications is a common mechanism for reducing cross-node traffic. Further, data movement costs can be reduced by using large-capacity persistent memory, such as Intel's Optane PMEM. Recent work has described best practices for optimizing the use of Optane by tuning based on workload characteristics. However, optimizing one component of an HPC workload does not guarantee optimal performance in the end-to-end workflow. Instead, we propose and evaluate new strategies for optimizing the use of Intel Optane for such HPC workflows.
Durability Through NVM Checkpointing
David Aksun (EPFL);
James Larus (EPFL);
Durability Through NVM Checkpointing
David Aksun (EPFL); James Larus (EPFL);
Speaker: David Aksun, EPFL
Abstract Non-Volatile Memory (NVM) is an emerging type of memory that offers fast, byte-addressable persistent storage. One promising use for persistent memory is constructing robust, high-performance internet and cloud services, which often maintain very large, in-memory databases and need to quickly recover from faults or failures. The research community has focused on storing these large data structures in NVM, in essence using it as durable RAM. The focus of this paper is to take advantage of the existing DRAM to provide better runtime performance: heavily read or written data should reside in DRAM, where it can be accessed at a fraction of the cost, and only modified values should be persisted in NVM. This paper presents CpNvm, a runtime system that uses periodic checkpoints to maintain a recoverable copy of a program's data, with overhead low enough for widespread use. To use CpNvm, a program's developer must insert a library call at the first write to a location in a persistent data structure and make another call when the structure is in a consistent state. Our primary goal is high performance, even at the cost of relaxing the crash-recovery model. CpNvm offers durability for large data structures at low overhead cost. Running on Intel Optane NVM, we achieve overheads of 0–15% on the YCSB benchmarks running with minimally-modified Masstree and overheads of 6.5% or less for Memcached.
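The two-call usage pattern described above might look like the following sketch; the function names are invented for illustration and are not CpNvm's actual API.

```cpp
// Hypothetical usage of a two-call checkpointing interface (names ours).
#include <cstddef>

void cpnvm_first_write(void*, std::size_t) {}  // stub: record dirty range
void cpnvm_consistent(void*) {}                // stub: safe checkpoint state

struct Node { long key; Node* next; };

void insert_front(Node** head, Node* n) {
    cpnvm_first_write(n, sizeof *n);           // before first write to n
    n->next = *head;
    cpnvm_first_write(head, sizeof *head);     // before updating the head
    *head = n;
    cpnvm_consistent(head);   // structure consistent: checkpointable now
}

int main() {
    Node* head = nullptr;
    Node n{1, nullptr};
    insert_front(&head, &n);
    return head == &n ? 0 : 1;
}
```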
PMIdioBench: A Benchmark Suite for Understanding the Idiosyncrasies of Persistent Memory
Shashank Gugnani (The Ohio State University);
Arjun Kashyap (The Ohio State University);
Xiaoyi Lu (The Ohio State University);
PMIdioBench: A Benchmark Suite for Understanding the Idiosyncrasies of Persistent Memory
Shashank Gugnani (The Ohio State University); Arjun Kashyap (The Ohio State University); Xiaoyi Lu (The Ohio State University);
Abstract High capacity persistent memory (PMEM) is finally commercially available in the form of Intel's Optane DC Persistent Memory Module (DCPMM). Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are tied to the memory technology (3D-XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM, i.e., how the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there is a need for a study that can guide the design of systems and is agnostic to PMEM technology and internal architecture. In this work, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM while the second study guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.
SuperMem: Enabling Application-transparent Secure Persistent Memory with Low Overheads
Pengfei Zuo (Huazhong University of Science and Technology & Univ. of California Santa Barbara);
Yu Hua (Huazhong University of Science and Technology);
Yuan Xie (Univ. of California Santa Barbara);
SuperMem: Enabling Application-transparent Secure Persistent Memory with Low Overheads
Pengfei Zuo (Huazhong University of Science and Technology & Univ. of California Santa Barbara); Yu Hua (Huazhong University of Science and Technology); Yuan Xie (Univ. of California Santa Barbara);
Speaker: Pengfei Zuo, Huazhong University of Science and Technology & University of California Santa Barbara
Abstract Non-volatile memory (NVM) is vulnerable to physical-access-based attacks due to its non-volatility. To ensure data security in NVM, counter-mode encryption is often used, given its high security level and low decryption latency. However, counter-mode encryption incurs a new persistence problem for crash-consistency guarantees, since both the data and its counter must be persisted atomically. To address this problem, existing work requires either a large battery backup or complex modifications to both hardware and software layers, owing to the use of a write-back counter cache. The large battery backup is expensive, and software-layer modifications limit the portability of applications from unencrypted NVM to encrypted NVM. Our paper proposes SuperMem, an application-transparent secure persistent memory that leverages a write-through counter cache to guarantee the atomicity of data and counter writes without the need for a large battery backup or software-layer modifications. To reduce the performance overhead of a baseline write-through counter cache, SuperMem leverages a locality-aware counter write coalescing scheme that reduces the number of write requests by exploiting the spatial locality of counter storage and data writes. Moreover, SuperMem leverages a cross-bank counter storage scheme to efficiently distribute data and counter writes across different banks, thus speeding up writes by exploiting bank parallelism. Experimental results demonstrate that SuperMem improves performance by about $2\times$ compared with an encrypted NVM using a baseline write-through counter cache, and achieves performance comparable to an ideal secure NVM that exhibits the optimal performance of an encrypted NVM.
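For readers unfamiliar with the scheme the abstract builds on, the toy below shows counter-mode encryption at cache-line granularity: every write bumps a per-line counter and XORs the data with a pad derived from (address, counter), which is exactly why data and counter must persist atomically. splitmix64 stands in for AES purely so the example runs anywhere; a real design uses a keyed block cipher.

```cpp
#include <cstdint>
#include <cstdio>

// Toy keyed PRF (splitmix64 finalizer); a stand-in for AES in this sketch.
static std::uint64_t prf(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

struct Line { std::uint64_t data; std::uint64_t counter; };

// Encrypt-on-write: fresh counter -> fresh one-time pad. The new data and the
// new counter must reach NVM atomically, or recovery decrypts garbage -- the
// persistence problem SuperMem's write-through counter cache addresses.
void write_line(Line& l, std::uint64_t addr, std::uint64_t plaintext) {
    l.counter += 1;
    l.data = plaintext ^ prf(addr ^ l.counter);  // pad = PRF(address, counter)
}

std::uint64_t read_line(const Line& l, std::uint64_t addr) {
    return l.data ^ prf(addr ^ l.counter);       // same pad decrypts
}

int main() {
    Line l{0, 0};
    write_line(l, 0x1000, 42);
    std::printf("%llu\n", (unsigned long long)read_line(l, 0x1000));  // 42
}
```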
Coding for Resistive Random-Access Memory Channels
Guanghui Song (Singapore University of Technology and Design);
Kui Cai (Singapore University of Technology and Design);
Xingwei Zhong (Singapore University of Technology and Design);
Jiang Yu (Singapore University of Technology and Design);
Jun Cheng (Doshisha University);
Coding for Resistive Random-Access Memory Channels
Guanghui Song (Singapore University of Technology and Design); Kui Cai (Singapore University of Technology and Design); Xingwei Zhong (Singapore University of Technology and Design); Jiang Yu (Singapore University of Technology and Design); Jun Cheng (Doshisha University);
Speaker: Guanghui Song, Singapore University of Technology and Design
Abstract We propose channel coding techniques to mitigate both the sneak-path interference and the channel noise of resistive random-access memory (ReRAM) channels. The main challenge is that the sneak-path interference within one memory array is data-dependent. We propose an across-array coding scheme, which assigns a codeword to multiple independent memory arrays. Since the coded bits from different arrays experience independent channels, a "diversity" gain is obtained during decoding, and when the codeword is adequately distributed, the code performs as it would over an independent and identically distributed (i.i.d.) channel without data dependency. We also present a real-time channel estimation scheme and a data shaping technique to improve the decoding performance.
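A minimal sketch of the across-array idea, with a round-robin mapping as our illustrative assumption: each coded bit lands in a different array, so bits of the same codeword see independently drawn sneak-path events and the decoder can collect a diversity gain.

```cpp
#include <cstddef>
#include <vector>

// Spread one codeword over num_arrays independent memory arrays
// (round-robin mapping assumed for illustration): bit i -> array i mod M.
std::vector<std::vector<int>> across_array_map(const std::vector<int>& codeword,
                                               std::size_t num_arrays) {
    std::vector<std::vector<int>> arrays(num_arrays);
    for (std::size_t i = 0; i < codeword.size(); ++i)
        arrays[i % num_arrays].push_back(codeword[i]);
    return arrays;
}

int main() {
    std::vector<int> cw = {1, 0, 1, 1, 0, 0, 1, 0};
    auto arrays = across_array_map(cw, 4);  // 8 coded bits over 4 arrays
}
```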
Lifted Reed-Solomon Codes with Application to Batch Codes
Lukas Holzbaur (Technical University of Munich);
Rina Polyanskaya (Institute for Information Transmission Problems);
Nikita Polyanskii (Technical University of Munich);
Ilya Vorobyev (Skolkovo Institute of Science and Technology);
Lifted Reed-Solomon Codes with Application to Batch Codes
Lukas Holzbaur (Technical University of Munich); Rina Polyanskaya (Institute for Information Transmission Problems); Nikita Polyanskii (Technical University of Munich); Ilya Vorobyev (Skolkovo Institute of Science and Technology);
Abstract Guo, Kopparty, and Sudan initiated the study of error-correcting codes derived by lifting affine-invariant codes. Lifted Reed-Solomon (RS) codes are defined as evaluations of polynomials in a vector space over a field, obtained by requiring their restriction to every line in the space to be a codeword of the RS code. In this paper, we investigate lifted RS codes and discuss their application to batch codes, a notion introduced in the context of private information retrieval and load balancing in distributed storage systems. First, we improve the estimate of the code rate of lifted RS codes for lifting parameter $m\ge 4$ and large field size. Second, we propose a new explicit construction of batch codes utilizing lifted RS codes. For some parameter regimes, our codes have a better trade-off between parameters than previously known batch codes.
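For reference, the defining property of the lift (following Guo, Kopparty, and Sudan; notation ours): a function on $\mathbb{F}_q^m$ belongs to the lifted code iff its restriction to every affine line looks like a low-degree univariate evaluation.

```latex
% Lifted RS code: m-variate evaluations whose restriction to every affine
% line agrees with some univariate polynomial of degree < k (notation ours).
\[
  \mathrm{Lift}_m\bigl(\mathrm{RS}[q,k]\bigr)
  = \Bigl\{ \bigl(f(\mathbf{a})\bigr)_{\mathbf{a}\in\mathbb{F}_q^m} \;:\;
      f \in \mathbb{F}_q[X_1,\dots,X_m],\
      \forall\,\mathbf{u},\mathbf{v}\in\mathbb{F}_q^m,\
      t \mapsto f(\mathbf{u}t+\mathbf{v})
      \text{ agrees on } \mathbb{F}_q
      \text{ with a polynomial of degree} < k \Bigr\}
\]
```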
A Back-End, CMOS Compatible Ferroelectric Field Effect Transistor for Synaptic Weights
Mattia Halter (IBM Research GmbH, ETH Zurich);
Laura Bégon-Lours (IBM Research GmbH);
Valeria Bragaglia (IBM Research GmbH);
Marilyne Sousa (IBM Research GmbH);
Bert Jan Offrein (IBM Research GmbH);
Stefan Abel (Formerly IBM Research GmbH, currently at Lumiphase);
Mathieu Luisier (ETH Zurich);
Jean Fompeyrine (Formerly IBM Research GmbH, currently at Lumiphase);
A Back-End, CMOS Compatible Ferroelectric Field Effect Transistor for Synaptic Weights
Mattia Halter (IBM Research GmbH, ETH Zurich); Laura Bégon-Lours (IBM Research GmbH); Valeria Bragaglia (IBM Research GmbH); Marilyne Sousa (IBM Research GmbH); Bert Jan Offrein (IBM Research GmbH); Stefan Abel (Formerly IBM Research GmbH, currently at Lumiphase); Mathieu Luisier (ETH Zurich); Jean Fompeyrine (Formerly IBM Research GmbH, currently at Lumiphase);
Speaker: Mattia Halter, IBM Research GmbH - Zurich Research Laboratory, CH-8803 Rüschlikon, Switzerland Integrated Systems Laboratory, ETH Zurich, CH-8092 Zurich, Switzerland
Abstract Neuromorphic computing architectures enable the dense co-location of memory and processing elements within a single circuit. Their building blocks are non-volatile synaptic elements such as memristors. Key memristor properties include a suitable non-volatile resistance range, continuous linear resistance modulation, and symmetric switching. In this work, we demonstrate voltage-controlled, symmetric, and analog potentiation and depression of a ferroelectric Hf0.57Zr0.43O2 (HZO) field-effect transistor (FeFET) with good linearity. Our FeFET operates with a low writing energy (fJ) and a fast programming time (40 ns). Retention measurements were performed at 4-bit depth with low noise (1%) in the tungsten oxide (WOx) readout channel. By adjusting the channel thickness from 15 nm to 8 nm, the on/off ratio of the FeFET can be engineered from 1% to 200%, with an on-resistance ideally >100 kΩ, depending on the channel geometry. The device concept is compatible with back-end-of-line (BEOL) integration into CMOS processes. It therefore has great potential for the fabrication of high-density, large-scale integrated arrays of artificial analog synapses.
Separation and Equivalence results for the Crash-stop and Crash-recovery Shared Memory Models
Ohad Ben-Baruch (BGU);
Srivatsan Ravi (University of Southern California);
Separation and Equivalence results for the Crash-stop and Crash-recovery Shared Memory Models
Ohad Ben-Baruch (BGU); Srivatsan Ravi (University of Southern California);
Speaker: Ohad Ben Baruch, Ben Gurion University
Abstract Linearizability, the traditional correctness condition for concurrent objects, is considered insufficient for the non-volatile shared memory model, where processes recover following a crash. For this crash-recovery shared memory model, strict-linearizability is considered appropriate since, unlike linearizability, it ensures that operations that crash take effect prior to the crash or not at all. This work formalizes and answers the question of whether an implementation of a data type designed for the crash-stop shared memory model is also strict-linearizable in the crash-recovery model. We present a rigorous study showing that the helping mechanism, typically employed by non-blocking implementations, is the algorithmic abstraction that delineates linearizability from strict-linearizability. Our first contribution formalizes the crash-recovery model and how explicit process crashes and recoveries introduce further dimensionality over the standard crash-stop shared memory model. We make the following technical contributions: (i) we prove that strict-linearizability is independent of any known help definition; (ii) we present a natural definition of help-freedom and prove that any obstruction-free, linearizable, and help-free implementation of a total object type is also strict-linearizable; (iii) we prove that for a large class of object types, a non-blocking strict-linearizable implementation cannot have helping. Viewed holistically, this work provides the first precise characterization of the intricacies of applying a concurrent implementation designed for the crash-stop (resp. crash-recovery) model to the crash-recovery (resp. crash-stop) model.
Exploratory Data Analytics might just be the Killer App for Persistent Memory
Roger Pearce (Lawrence Livermore National Laboratory);
Keita Iwabuchi (Lawrence Livermore National Laboratory);
Maya Gokhale (Lawrence Livermore National Laboratory);
Exploratory Data Analytics might just be the Killer App for Persistent Memory
Roger Pearce (Lawrence Livermore National Laboratory); Keita Iwabuchi (Lawrence Livermore National Laboratory); Maya Gokhale (Lawrence Livermore National Laboratory);
Abstract The volume of data currently generated, both by science and security applications and by the modern internet-connected human experience, has surpassed our ability to process and understand it at adequate levels of fidelity. When deep historical or longitudinal analysis is required, the volume of data often demands heavy triage or filtering that can impede deep analysis. The promise of using High Performance Computing (HPC) for such analysis is that a unified picture of a large distributed dataset becomes possible; however, tools to tackle enterprise-scale datasets are still in the research stage. Exploratory Data Analytics (EDA) is often the first step data scientists take when faced with a new dataset or analytic task, and it specifically aids hypothesis generation and evaluation. The de facto standard among a large percentage of data scientists is Jupyter notebooks (i.e., interactive Python), in which relatively small datasets can be manipulated using popular tools such as NumPy, SciPy, Pandas, or NetworkX in a desktop or laptop environment, possibly connected to a small compute-cluster backend via Apache Spark. This paradigm is limited by poor performance and scalability, yet many scientific and security data-analysis tasks demand algorithms that require many unstructured analysis phases over significant data scales. This talk introduces a new-start internal research effort at LLNL investigating novel techniques to enable EDA at scale. This three-year effort co-designs across applications/algorithms, system software, and persistent memory (PM) hardware. As this new research effort is still in its early kick-off stage, the talk provides a timely opportunity to discuss collaborations with external researchers in academia and industry. Challenges and opportunities in realizing this goal of EDA on PM will be presented.
Toward Faster and More Efficient Training on CPUs Using STT-RAM-based Last Level Cache
Alexander Hankin (Tufts University);
Maziar Amiraski (Tufts University);
Karthik Sangaiah (Drexel University);
Mark Hempstead (Tufts University);
Toward Faster and More Efficient Training on CPUs Using STT-RAM-based Last Level Cache
Alexander Hankin (Tufts University); Maziar Amiraski (Tufts University); Karthik Sangaiah (Drexel University); Mark Hempstead (Tufts University);
Abstract Artificial intelligence (AI), especially neural network-based AI, has become ubiquitous in modern-day computing. However, the training phase required for these networks demands significant computational resources and is the primary bottleneck as the community scales its AI capabilities. While GPUs and AI accelerators have begun to be used to address this problem, many of the industry's AI models are still trained on CPUs and are limited in large part by the memory system. Breakthroughs in NVM research over the past couple of decades have unlocked the potential for replacing on-chip SRAM with an NVM-based alternative. Research into Spin-Torque Transfer RAM (STT-RAM) over the past decade has explored the impact of trading off volatility for improved write latency as part of the trend to bring STT-RAM on-chip. STT-RAM is an especially attractive replacement for SRAM in the last-level cache due to its density, low leakage, and, most notably, endurance.
GBTL+Metall – Adding Persistence to GraphBLAS
Kaushik Velusamy (University of Maryland, Baltimore County);
Scott McMillan (Carnegie Mellon University);
Keita Iwabuchi (Lawrence Livermore National Laboratory);
Roger Pearce (Lawrence Livermore National Laboratory);
GBTL+Metall – Adding Persistence to GraphBLAS
Kaushik Velusamy (University of Maryland, Baltimore County); Scott McMillan (Carnegie Mellon University); Keita Iwabuchi (Lawrence Livermore National Laboratory); Roger Pearce (Lawrence Livermore National Laboratory);
Speaker: Kaushik Velusamy, University of Maryland, Baltimore County
Abstract It is well known that software-hardware co-design is required for attaining high-performance implementations, and system software libraries help us achieve this goal. The Metall persistent memory allocator is one such library. Metall enables large-scale data analytics by leveraging emerging memory technologies: it is designed to provide developers with rich C++ interfaces to allocate custom C++ data structures in persistent memory, not just on block storage and byte-addressable persistent memories (NVMe, Optane) but also in DRAM-backed tmpfs. Having a large capacity of persistent memory changes the way we solve problems and leads to algorithmic innovation. In this work, we present GraphBLAS as a real application use case to demonstrate the benefits of the Metall persistent memory allocator. We show an example of how storing and re-attaching graph containers using Metall eliminates the need for graph reconstruction, at the one-time cost of re-attaching to the Metall datastore.
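The store/re-attach pattern the abstract refers to looks roughly like the following, using Metall's Boost.Interprocess-style interface; the datastore path and the container used here are illustrative, not taken from the paper.

```cpp
#include <metall/metall.hpp>
#include <vector>

// A vector whose elements live in the Metall datastore.
using vec_t = std::vector<int, metall::manager::allocator_type<int>>;

int main() {
    {   // First run: build the container in persistent memory.
        metall::manager manager(metall::create_only, "/tmp/graph_datastore");
        auto* v = manager.construct<vec_t>("edges")(manager.get_allocator<int>());
        v->push_back(42);
    }   // Manager destructor syncs the datastore.

    {   // Later run: re-attach instead of reconstructing the graph.
        metall::manager manager(metall::open_only, "/tmp/graph_datastore");
        auto* v = manager.find<vec_t>("edges").first;
        (void)v;  // usable immediately; no rebuild from the edge list
    }
}
```

A GraphBLAS container allocated the same way is re-attached rather than rebuilt, which is the one-time-cost trade-off the abstract highlights.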
POSEIDON: Safe, Fast and Scalable Persistent Memory Allocator
Wook-Hee Kim (Virginia Tech);
Anthony Demeri (Virginia Tech);
Madhava Krishnan Ramanathan (Virginia Tech);
Jaeho Kim (Gyeongsang National University);
Mohannad Ismail (Virginia Tech);
Changwoo Min (Virginia Tech);
POSEIDON: Safe, Fast and Scalable Persistent Memory Allocator
Wook-Hee Kim (Virginia Tech); Anthony Demeri (Virginia Tech); Madhava Krishnan Ramanathan (Virginia Tech); Jaeho Kim (Gyeongsang National University); Mohannad Ismail (Virginia Tech); Changwoo Min (Virginia Tech);
Speaker: Wook-Hee Kim, Virginia Tech
Abstract A persistent memory allocator is an essential component of any Non-Volatile Main Memory (NVMM) application. A slow memory allocator can bottleneck the entire application stack, while an insecure memory allocator can render applications inconsistent upon program bugs or system failure. Unlike DRAM-based memory allocators, it is indispensable for an NVMM allocator to guarantee the safety of its heap metadata against both internal and external errors. An effective NVMM memory allocator should be 1) safe, 2) scalable, and 3) high performing. Unfortunately, none of the existing persistent memory allocators achieve all three requisites. For example, we found that even Intel's de facto NVMM allocator, libpmemobj, is vulnerable to silent data corruption and persistent memory leaks resulting from a simple heap overflow. In this paper, we propose Poseidon, a safe, fast, and scalable persistent memory allocator. The premise of Poseidon revolves around providing a user application with per-CPU sub-heaps for scalability and high performance, while managing the heap metadata in a segregated fashion and efficiently protecting it using a scalable hardware-based protection scheme, Intel's Memory Protection Keys (MPK). We evaluate Poseidon with a wide array of microbenchmarks and real-world benchmarks. In our evaluation, Poseidon outperforms state-of-the-art allocators by a significant margin, showing improved scalability and performance while also guaranteeing heap metadata protection.
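The metadata-protection ingredient can be illustrated with the Linux MPK wrappers (glibc 2.27+; error checks omitted). This is only the general "metadata writable only inside the allocator" pattern, not Poseidon's actual code.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstdio>

int main() {
    // Place heap metadata on its own page and tag it with a protection key.
    const long page = sysconf(_SC_PAGESIZE);
    void* meta = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    const int pkey = pkey_alloc(0, 0);
    pkey_mprotect(meta, page, PROT_READ | PROT_WRITE, pkey);

    pkey_set(pkey, PKEY_DISABLE_WRITE);   // default: metadata is read-only,
                                          // so a stray heap-overflow store faults
    pkey_set(pkey, 0);                    // allocator entry: enable writes
    *static_cast<int*>(meta) = 1;         // legitimate metadata update
    pkey_set(pkey, PKEY_DISABLE_WRITE);   // allocator exit: re-protect
    std::printf("metadata updated under MPK protection\n");
}
```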
Ribbon: High-Performance Cache Line Flushing for Persistent Memory
Kai Wu (University of California, Merced);
Ivy Peng (Lawrence Livermore National Laboratory);
Jie Ren (University of California, Merced);
Dong Li (University of California, Merced);
Ribbon: High-Performance Cache Line Flushing for Persistent Memory
Kai Wu (University of California, Merced); Ivy Peng (Lawrence Livermore National Laboratory); Jie Ren (University of California, Merced); Dong Li (University of California, Merced);
Speaker: Kai Wu, University of California Merced
Abstract Cache line flushing (CLF) is a fundamental building block for programming persistent memory (PM). CLF is prevalent in PM-aware workloads to ensure crash consistency, but it also imposes high overhead. Extensive work has explored persistency semantics and CLF policies, but few efforts have looked into the CLF mechanism itself. This work aims to improve the performance of the CLF mechanism based on performance characterization of well-established workloads on real PM hardware. We reveal that CLF performance is highly sensitive to CLF concurrency and cache-line status. We introduce Ribbon, a runtime system that improves the performance of the CLF mechanism through concurrency control and proactive CLF. Ribbon detects CLF bottlenecks caused by oversupplied or insufficient concurrency and adapts accordingly. Ribbon also proactively transforms dirty or non-resident cache lines into clean, resident status to reduce the latency of CLF. Furthermore, we investigate the cause of low dirtiness in flushed cache lines in in-memory database workloads, and provide cache line coalescing as an application-specific solution that achieves up to 33.3% (13.8% on average) improvement. Our evaluation of a variety of workloads in four configurations on PM shows that Ribbon achieves up to 49.8% improvement (14.8% on average) in overall application performance.
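For context, the CLF mechanism being tuned is essentially the loop below (x86; compile with CLWB support): write back every 64-byte line covering a range, then fence. Ribbon's observation is that this loop's throughput depends on how many flushes are in flight and on whether the flushed lines are dirty and resident, hence its concurrency control and proactive flushing.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Baseline cache-line flushing: write back each 64-byte line covering
// [p, p + len), then fence so later persists are ordered after them.
void flush_range(const void* p, std::size_t len) {
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    const std::uintptr_t end = addr + len;
    for (addr &= ~std::uintptr_t{63}; addr < end; addr += 64)
        _mm_clwb(reinterpret_cast<void*>(addr));  // write back one line
    _mm_sfence();                                 // order before later persists
}

int main() {
    alignas(64) static std::uint64_t buf[16] = {};
    buf[0] = 1;
    flush_range(buf, sizeof buf);
}
```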
Generative Modeling of NAND Flash Memory Voltage Level
Ziwei Liu (Center of Memory and Recording Research, UC San Diego);
Yi Liu (Center of Memory and Recording Research, UC San Diego);
Paul H. Siegel (Center of Memory and Recording Research, UC San Diego);
Generative Modeling of NAND Flash Memory Voltage Level
Ziwei Liu (Center of Memory and Recording Research, UC San Diego); Yi Liu (Center of Memory and Recording Research, UC San Diego); Paul H. Siegel (Center of Memory and Recording Research, UC San Diego);
Speaker: Ziwei Liu, Center of Memory and Recording Research, UC San Diego
Abstract Program and erase cycling (P/E cycling) data is used to characterize flash memory channels and to support realistic performance simulation of error-correcting codes (ECCs). However, these applications require a massive amount of data, and collecting it is time-consuming. To generate a large amount of NAND flash memory read voltages from a relatively small amount of measured data, we propose a read voltage generator based on a Time-Dependent Generative Moments Matching Network (TD-GMMN). This model can generate authentic read voltage distributions over a range of possible P/E cycles for a specified program level, based on known read voltage distributions. Experimental results based on data generated by a mathematical MLC NAND flash memory read voltage generator demonstrate the model's effectiveness.
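As background, a GMMN-style generator is trained by minimizing the kernel maximum mean discrepancy (MMD) between generated and measured samples; the empirical objective below is our gloss of that standard loss, with TD-GMMN's time (P/E-cycle) conditioning omitted.

```latex
% Empirical squared MMD between generated samples x_i = G(z_i) and measured
% read voltages y_j, for a kernel k (e.g., Gaussian); the generator G is
% trained to minimize this loss.
\[
  \widehat{\mathrm{MMD}}^2(X,Y)
  = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{i'=1}^{n} k(x_i, x_{i'})
  - \frac{2}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} k(x_i, y_j)
  + \frac{1}{m^2}\sum_{j=1}^{m}\sum_{j'=1}^{m} k(y_j, y_{j'})
\]
```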