PROGRAM
Registered users: Check your email for Zoom and GatherTown links to attend.
Chair: Steven Swanson
Metall: A Persistent Memory Allocator for Accelerating Data Analytics
Roger Pearce & Keita Iwabuchi
Chair: Eitan Yaakobi
Accelerating Deep Neural Networks with Analog Memory Devices
Geoffrey Burr (IBM Research - Almaden)
Speaker: Geoffrey W. Burr, IBM Research - Almaden
Abstract: Deep Neural Networks (DNNs) are very large artificial neural networks trained using very large datasets, typically using the supervised learning technique known as backpropagation. Currently, CPUs and GPUs are used for these computations. Over the next few years, we can expect special-purpose hardware accelerators based on conventional digital-design techniques to optimize the GPU framework for these DNN computations. Even after the improved computational performance and efficiency that is expected from these special-purpose digital accelerators, there would still be an opportunity for even higher performance and even better energy efficiency for inference and training of DNNs, by using neuromorphic computation based on analog memory devices. In this presentation, I discuss the origin of this opportunity as well as the challenges inherent in delivering on it, including materials and devices for analog volatile and non-volatile memory, circuit and architecture choices and challenges, and the current status and prospects.
Speaker bio: Geoffrey W. Burr received his Ph.D. in Electrical Engineering from the California Institute of Technology in 1996. Since that time, Dr. Burr has worked at IBM Research - Almaden in San Jose, California, where he is currently a Distinguished Research Staff Member. He has worked in a number of diverse areas, including holographic data storage, photon echoes, computational electromagnetics, nanophotonics, computational lithography, phase-change memory, storage class memory, and novel access devices based on Mixed-Ionic-Electronic-Conduction (MIEC) materials. Dr. Burr's current research interests involve AI/ML acceleration using non-volatile memory. Geoff is an IEEE Fellow (2020), and is also a member of MRS, SPIE, OSA, Tau Beta Pi, Eta Kappa Nu, and the Institute of Physics (IOP).
Chair: Paul Siegel
Cooperative Data Protection for Topology-Aware Decentralized Storage Networks
Siyi Yang (University of California, Los Angeles); Ahmed Hareedy (Duke University); Robert Calderbank (Duke University); Lara Dolecek (University of California, Los Angeles)
Speaker: Siyi Yang, UCLA
Abstract: While codes with hierarchical locality have been intensively studied in the context of centralized cloud storage due to their effectiveness in reducing the average reading time, those in the context of decentralized storage networks (DSNs) have not yet been discussed. In this paper, we propose a joint coding scheme where each node receives extra protection through the cooperation with nodes in its neighborhood in a heterogeneous DSN with any given topology. Our proposed construction not only supports desirable properties such as scalability and flexibility, which are critical in dynamic networks, but also adapts to arbitrary topologies, a property that is essential in DSNs but has been overlooked in existing works.
Speaker bio: Siyi Yang is a Ph.D. candidate in the Electrical and Computer Engineering Department at the University of California, Los Angeles (UCLA). She received her B.S. degree in Electrical Engineering from Tsinghua University in 2016 and her M.S. degree in Electrical and Computer Engineering from UCLA in 2018. Her research interests include the design of error-correction codes for non-volatile memory and distributed storage.
Power Spectra of Finite-Length Constrained Codes with Level-Based Signaling
Jessica Centers (Duke University); Xinyu Tan (Duke University); Ahmed Hareedy (Duke University); Robert Calderbank (Duke University)
Speaker: Jessica Centers and Xinyu Tan, Electrical and Computer Engineering Department, Duke University
Abstract: In various practical systems, certain data patterns are prone to errors if written or transmitted. Constrained codes are used to eliminate error-prone patterns, and they can also achieve other goals. Recently, we introduced efficient binary symmetric lexicographically-ordered constrained (LOCO) codes and asymmetric LOCO (A-LOCO) codes to increase density in magnetic recording systems and lifetime in Flash systems by eliminating the relevant detrimental patterns. Due to their applications, LOCO and A-LOCO codes are associated with level-based signaling. In this paper, we first modify a framework from the literature in order to introduce a method to derive the power spectrum of a sequence of constrained data associated with level-based signaling. We then provide a generalized method for developing the one-step state transition matrix (OSTM) for finite-length codes constrained by the separation of transitions. Via their OSTMs, we devise closed-form solutions for the spectra of finite-length LOCO and A-LOCO codes.
Speaker bio: Jessica Centers is a 3rd-year Ph.D. student in the Electrical and Computer Engineering Department at Duke University. Her research focuses primarily on developing signal processing techniques to utilize recently affordable sensors, such as millimeter-wave radars, in non-traditional applications. Xinyu Tan is a junior undergraduate student pursuing a degree in Mathematics and Computer Science at Duke University. She has recently been pursuing an interest in quantum computing. Jessica and Xinyu became interested in the analysis and development of novel constrained codes used to improve performance in data storage and other computer systems through the Coding Theory course offered at Duke University, which allowed them to explore constrained codes in depth and resulted in the paper presented and summarized at this conference.
Optimal Reconstruction Codes for Deletion Channel
Johan Chrisnata (Nanyang Technological University); Han Mao Kiah (Nanyang Technological University); Eitan Yaakobi (Technion - Israel Institute of Technology)
Speaker: Johan Chrisnata, Nanyang Technological University
Abstract: The sequence reconstruction problem, introduced by Levenshtein in 2001, considers a communication scenario where the sender transmits a codeword from some codebook and the receiver obtains multiple noisy reads of the codeword. Motivated by modern storage devices, we introduced a variant of the problem where the number of noisy reads $N$ is fixed (Kiah et al. 2020). Of significance, for the single-deletion channel, using $\log_2\log_2 n + O(1)$ redundant bits, we designed a reconstruction code of length $n$ that reconstructs codewords from two distinct noisy reads. In this work, we show that $\log_2\log_2 n - O(1)$ redundant bits are necessary for such reconstruction codes, thereby demonstrating the optimality of our previous construction. Furthermore, we show that these reconstruction codes can be used in $t$-deletion channels (with $t \ge 2$) to uniquely reconstruct codewords from $n^{t-1}+O\left(n^{t-2}\right)$ distinct noisy reads.
Speaker bio: Johan Chrisnata received his Bachelor's degree in mathematics from Nanyang Technological University (NTU), Singapore in 2015. From August 2015 until August 2018, he was a research officer at NTU. Currently he is pursuing a joint Ph.D. degree in mathematics from the School of Physical and Mathematical Sciences at Nanyang Technological University, Singapore, and in computer science from the Department of Computer Science at the Technion - Israel Institute of Technology. His research interests include enumerative combinatorics and coding theory.
Partial MDS Codes with Regeneration
Lukas Holzbaur (Technical University of Munich); Sven Puchinger (Technical University of Denmark (DTU)); Eitan Yaakobi (Technion - Israel Institute of Technology); Antonia Wachter-Zeh (Technical University of Munich)
Speaker: Lukas Holzbaur, Technical University of Munich
Abstract: Partial MDS (PMDS) and sector-disk (SD) codes are classes of erasure correcting codes that combine locality with strong erasure correction capabilities. We construct PMDS and SD codes where each local code is a bandwidth-optimal regenerating MDS code. In the event of a node failure, these codes reduce both the number of servers that have to be contacted and the amount of network traffic required for the repair process. The constructions require a significantly smaller field size than the only other construction known in the literature. Further, we present a PMDS code construction that allows for efficient repair for patterns of node failures that exceed the local erasure correction capability of the code and thereby invoke repair across different local groups.
Speaker bio: Lukas Holzbaur received his B.Sc. and M.Sc. degrees in electrical engineering from the Technical University of Munich (TUM), Germany, in 2014 and 2017, respectively. Since 2017 he has been working towards a Ph.D. at the Institute for Communications Engineering, TUM, Germany, in the group of Prof. Wachter-Zeh. His research interests are coding theory and its applications, in particular to distributed data storage and privacy.
Non-Uniform Windowed Decoding For Multi-Dimensional Spatially-Coupled LDPC Codes
Lev Tauz (University of California, Los Angeles); Lara Dolecek (University of California, Los Angeles); Homa Esfahanizadeh (Massachusetts Institute of Technology)
Speaker: Lev Tauz, Electrical and Computer Engineering, University of California, Los Angeles
Abstract: In this work, we propose a non-uniform windowed decoder for multi-dimensional spatially-coupled LDPC (MD-SC-LDPC) codes over the binary erasure channel. An MD-SC-LDPC code is constructed by connecting together several SC-LDPC codes into one larger code that provides major benefits across a variety of channel models. We propose and analyze a novel non-uniform decoder that allows for greater flexibility between latency and code reliability. Our theoretical derivations and empirical results show that our non-uniform decoder greatly improves upon the standard windowed decoder in terms of design flexibility, latency, and complexity.
Speaker bio: Lev Tauz received his B.S. degree (Hons.) in electrical engineering and computer science from the University of California, Berkeley, in 2016 and his M.S. degree in electrical and computer engineering from the University of California, Los Angeles (UCLA) in 2020. He is currently pursuing a Ph.D. degree in the Electrical and Computer Engineering Department at UCLA. He works in the Laboratory for Robust Information Systems (LORIS), focusing on coding techniques for distributed storage and computation. His research interests include distributed systems, error-correcting codes, machine learning, and graph theory. He was a recipient of the Best Preliminary Exam in Signals and Systems Award in the Electrical and Computer Engineering Department, UCLA, in 2019.
Chair: Samira Khan
Characterizing and Modeling Non-Volatile Memory Systems
Zixuan Wang (University of California, San Diego); Xiao Liu (University of California, San Diego); Jian Yang (University of California, San Diego); Theodore Michailidis (University of California, San Diego); Steven Swanson (University of California, San Diego); Jishen Zhao (University of California, San Diego)
Speaker: Zixuan Wang, University of California, San Diego
Abstract: Scalable server-grade non-volatile RAM (NVRAM) DIMMs became commercially available with the release of Intel’s Optane DIMM. Recent studies on Optane DIMM systems unveil discrepant performance characteristics, compared to what many researchers assumed before the product release. Most of these studies focus on system software design and performance analysis. To thoroughly analyze the source of this discrepancy and facilitate real-NVRAM-aware architecture design, we propose a framework that characterizes and models Optane DIMM’s microarchitecture. Our framework consists of a Low-level profilEr for Non-volatile memory Systems (LENS) and a Validated cycle-Accurate NVRAM Simulator (VANS). LENS allows us to comprehensively analyze the performance attributes and reverse engineer NVRAM microarchitectures. Based on LENS characterization, we develop VANS, which models the sophisticated microarchitecture design of Optane DIMM, and is validated by comparing with the detailed performance characteristics of Optane-DIMM-attached Intel servers. VANS adopts a modular design that can be easily modified to extend to other NVRAM architecture designs; it can also be attached to full-system simulators, such as gem5.
Speaker bio: I am Zixuan Wang, a 3rd-year Ph.D. student at the University of California, San Diego, working with Prof. Jishen Zhao and Prof. Steven Swanson. My research interest is mainly in memory systems.
Assise: Performance and Availability via Client-local NVM in a Distributed File System
Thomas Anderson (University of Washington); Marco Canini (KAUST); Jongyul Kim (KAIST); Dejan Kostić (KTH Royal Institute of Technology); Youngjin Kwon (KAIST); Simon Peter (University of Texas at Austin); Waleed Reda (KTH Royal Institute of Technology and Université catholique de Louvain); Henry N. Schuh (University of Washington); Emmett Witchel (University of Texas at Austin)
Speaker: Waleed Reda, KTH Royal Institute of Technology and Université catholique de Louvain
Abstract: The adoption of low-latency non-volatile memory (NVM) at scale upends the existing client-server model for distributed file systems. Instead, by leveraging client-local NVM storage, we can provide applications with much higher IO performance, sub-second application failover, and strong consistency. To that end, we built the Assise distributed file system, which uses client-local NVM as a linearizable and crash-recoverable cache between applications and slower storage. Assise maximizes locality for all file IO by carrying out IO on process-local and client-local NVM whenever possible. By maintaining consistency at IO operation granularity, rather than at fixed block sizes, Assise minimizes coherence overheads and prevents block amplification. In doing so, Assise provides orders of magnitude lower tail latency, higher scalability, and higher availability than the state-of-the-art.
Speaker bio: Waleed is a final-year PhD student at the Université catholique de Louvain (UCL) and the Royal Institute of Technology (KTH). His work focuses on accelerating distributed storage systems by rearchitecting them to maximize the benefits of state-of-the-art networking and storage technologies. More recently, he has been working on speeding up distributed file systems by exploiting client-local NVM and carefully balancing their use of network and storage resources.
Clobber-NVM: Log Less, Re-execute More
Yi Xu (UC San Diego); Joseph Izraelevitz (University of Colorado, Boulder); Steven Swanson (UC San Diego)
Speaker: Yi Xu, University of California, San Diego
Abstract: Non-volatile memory allows direct access to persistent storage via a load/store interface. However, because the cache is volatile, cached updates to persistent state will be dropped after a power loss. Failure-atomicity NVM libraries provide the means to apply sets of writes to persistent state atomically. Unfortunately, most of these libraries impose significant overhead. This work proposes Clobber-NVM, a failure-atomicity library that ensures data consistency by reexecution. Clobber-NVM's novel logging strategy, clobber logging, records only those transaction inputs that are overwritten during transaction execution. Then, after a failure, it recovers to a consistent state by restoring overwritten inputs and reexecuting any interrupted transactions. Clobber-NVM utilizes a clobber logging compiler pass to identify the minimal set of writes that need to be logged. Based on our experiments, classical undo logging logs up to 42.6X more bytes than Clobber-NVM, and requires 2.4X to 4.7X more expensive ordering instructions (e.g., clflush and sfence). Less logging leads to better performance: relative to prior art, Clobber-NVM provides up to 2.5X performance improvement over Mnemosyne and 2.6X over Intel's PMDK.
Speaker bio: Yi is a third-year PhD student at UC San Diego, advised by Prof. Steven Swanson. She is interested in memory and storage systems. Her current research focuses on compiler and library support for persistent memory programming.
CoSpec: Compiler Directed Speculative Intermittent Computation
Jongouk Choi (Purdue University); Qingrui Liu (Annapurna Labs); Changhee Jung (Purdue University)
Speaker: Jongouk Choi, Purdue University
Abstract: Energy harvesting systems have emerged as an alternative to battery-operated embedded devices. Due to the intermittent nature of energy harvesting, researchers equip the systems with nonvolatile memory (NVM) and crash consistency mechanisms. However, prior works require non-trivial hardware modifications, e.g., a voltage monitor, nonvolatile flip-flops/scratchpad, dependence tracking modules, etc., thereby causing significant area/power/manufacturing costs. For low-cost yet performant intermittent computation, this paper presents CoSpec, a new architecture/compiler co-design scheme that works for commodity in-order processors used in energy-harvesting systems. To achieve crash consistency without requiring unconventional architectural support, CoSpec leverages speculation, assuming that power failure is not going to occur, and thus holds all committed stores in a store buffer (SB), as if they were speculative, in case of mispeculation. The CoSpec compiler first partitions a given program into a series of recoverable code regions with the SB size in mind, so that no region overflows the SB. When program control reaches the end of each region, the speculation turns out to be successful, thus releasing all the buffered stores of the region to NVM. If power failure occurs during the execution of a region, all its speculative stores disappear in the volatile SB, i.e., they never affect program states in NVM. Consequently, the interrupted region can be restarted with consistent program states in the wake of power failure. To hide the latency of the SB release, i.e., essentially NVM writes, at each region boundary, CoSpec overlaps the NVM writes of the current region with the speculative execution of the next region. Such instruction level parallelism gives an illusion of out-of-order execution on top of the in-order processor, achieving a speedup of more than 1.2X when there is no power outage. Our experiments on a set of real energy harvesting traces with frequent outages demonstrate that CoSpec outperforms the state-of-the-art scheme by 1.8∼3X on average.
Speaker bio: Jongouk Choi is a Ph.D. student at Purdue University. His current research interests are in computer architecture, compilers, systems, and hardware security. He obtained his MS and BS in CS from Kentucky State University. He has held various research positions at LG Electronics, NASA EPSCoR, and ARM Research. For more information, please see his webpage at https://www.cs.purdue.edu/homes/choi658/.
CrossFS: A Cross-layered Direct-Access File System
Yujie Ren (Rutgers University); Changwoo Min (Virginia Tech); Sudarsun Kannan (Rutgers University)
Speaker: Yujie Ren, Rutgers University
Abstract: We design CrossFS, a cross-layered direct-access file system disaggregated across user-level, firmware, and kernel layers for scaling I/O performance and improving concurrency. CrossFS is designed to exploit host- and device-level compute capabilities. CrossFS introduces a file descriptor-based concurrency control that maps each file descriptor to one hardware-level I/O queue for concurrency with or without data sharing across threads and processes. This design allows CrossFS's firmware component to process disjoint accesses across file descriptors concurrently. CrossFS delegates concurrency control to powerful host-CPUs, which convert the file descriptor synchronization problem into an I/O queue request ordering problem. CrossFS exploits byte-addressable nonvolatile memory for I/O queue persistence to guarantee crash consistency in the cross-layered design, and designs a lightweight firmware-level journaling mechanism. Finally, CrossFS designs a firmware-level I/O scheduler for efficient dispatch of file descriptor requests. Evaluation of emulated CrossFS on storage-class memory shows up to 4.87x concurrent access gains for benchmarks and 2.32x gains for real-world applications over state-of-the-art kernel, user-level, and firmware file systems.
Speaker bio: Yujie Ren is a 4th-year Ph.D. candidate in the Computer Science Department at Rutgers University. His research interests are in file systems, memory management, and computational storage. His research projects focus on reducing IO software overheads by disaggregating file system components across software/hardware layers and utilizing storage compute resources.
Chair: Jishen Zhao
Twizzler: Rethinking the Operating System Stack for Byte-Addressable NVM
Ethan Miller (University of California, Santa Cruz)
Speaker: Ethan Miller, University of California, Santa Cruz
Abstract: Byte-addressable non-volatile memory (NVM) promises applications the ability to persist small units of data, enabling new programming paradigms and system designs. However, such gains will require significant help from the operating system: it needs to "get out of the way" while still providing strong guarantees for security and resource management. This talk will describe our approach to designing an operating system and programming environment that leverages the advantages of NVM to provide a single-level store for application data. Under this approach, NVM can be accessed, transparently, by any thread at any time, with pointers retaining their meanings across multiple invocations. Equally important, the operating system is minimally involved in program operation, limiting itself to managing virtualized devices, scheduling threads, and managing page tables to enforce user-managed access controls at page-level granularity. Structuring the system in this way provides both a simpler programming model and, in many cases, higher performance, allowing NVM-based systems to fully leverage the new ability to persist data with a single write while providing a stronger, more flexible security model than traditional operating systems.
Speaker bio: Ethan L. Miller is a Professor in the Computer Science and Engineering Department at the University of California, Santa Cruz. He is a Fellow of the IEEE and an ACM Distinguished Scientist, and his publications have received multiple Best Paper awards. Prof. Miller received an Sc.B. from Brown University in 1987 and a Ph.D. from UC Berkeley in 1995, and has been on the UC Santa Cruz faculty since 2000. He has co-authored over 160 papers in a range of topics in file and storage systems, operating systems, parallel and distributed systems, information retrieval, and computer security; his research has received over 15,000 citations. He was a member of the team that developed Ceph, a scalable high-performance distributed file system for scientific computing that is now being adopted by several high-end computing organizations. His current research projects, which are funded by the National Science Foundation and industry support for the CRSS and SSRC, include system support for byte-addressable non-volatile memory, archival storage systems, reliable and secure storage systems, and issues in ultra-scale storage systems. Prof. Miller has worked with Pure Storage since its founding in 2009, helping to design and refine its storage architecture, resulting in over 120 awarded patents. He has also worked with other companies, including Samsung, Veritas, and Seagate, to help move research results into commercial use. Additional information is available at https://www.crss.ucsc.edu/person/elm.html.
Chair: Homa Esfahanizadeh (MIT)
Flexible Partial MDS Codes
Weiqi Li (University of California, Irvine); Taiting Lu (University of California, Irvine); Zhiying Wang (University of California, Irvine); Hamid Jafarkhani (University of California, Irvine)
Speaker: Weiqi Li, Center for Pervasive Communications and Computing (CPCC), University of California, Irvine, USA
Abstract: The partial MDS (PMDS) code was introduced by Blaum et al. for RAID systems. Given the redundancy level and the number of symbols in each node, PMDS codes can tolerate a mixed type of failures consisting of entire node failures and partial errors (symbol failures). Aiming at reducing the expected access latency, this paper presents flexible PMDS codes that can recover the information from a flexible number of nodes according to the number of available nodes, while the total number of symbols required remains the same. We analyze the reliability and latency of our flexible PMDS codes.
Speaker bio: Weiqi Li received his B.S. and M.S. degrees from Xi'an Jiaotong University, China, in 2013 and 2016, respectively. Currently, he is pursuing a Ph.D. in the Department of Electrical Engineering and Computer Science, University of California, Irvine. His research interests include coding theory, signal processing, and communication networks.
Codes for Cost-Efficient DNA Synthesis
Andreas Lenz (Technical University of Munich); Yi Liu (Center for Memory and Recording Research, UCSD); Cyrus Rashtchian (UCSD); Paul Siegel (UCSD); Andrew Tan (UCSD); Antonia Wachter-Zeh (Technical University of Munich); Eitan Yaakobi (Technion - Israel Institute of Technology)
Speaker: Andreas Lenz, Technical University of Munich
Abstract: As a step toward more efficient DNA data storage systems, we study the design of codes that minimize the time and number of required materials needed to synthesize the DNA strands. We consider a popular synthesis process that builds many strands in parallel in a step-by-step fashion using a fixed supersequence $S$. The machine iterates through $S$ one nucleotide at a time, and in each cycle, it adds the next nucleotide to a subset of the strands. We show that by introducing redundancy to the synthesized strands, we can significantly decrease the number of synthesis cycles required to produce the strands. We derive the maximum amount of information per synthesis cycle assuming $S$ is an arbitrary periodic sequence. To prove our results, we exhibit new connections to cost-constrained codes.
Speaker bio: Andreas Lenz received the B.Sc. and M.Sc. degrees (both with high distinction) in electrical engineering and information technology from Technische Universität München (TUM), Germany in 2013 and 2016, respectively. During his studies, his research interests included parameter estimation, communications, and circuit theory. Since 2016 he has been working as a doctoral candidate in the Coding for Communications and Data Storage (COD) group at TUM, where he is involved in research on coding theory for insertion and deletion errors and modern data storage systems.
Systematic Single-Deletion Multiple-Substitution Correcting Codes
Wentu Song (Singapore University of Technology and Design); Nikita Polyanskii (Technical University of Munich, Germany, and Skolkovo Institute of Science and Technology); Kui Cai (Singapore University of Technology and Design); Xuan He (Singapore University of Technology and Design)
Speaker: Wentu Song, Singapore University of Technology and Design, Singapore
Abstract: Recent work by Smagloy et al. (ISIT 2020) shows that the redundancy of a single-deletion s-substitution correcting code is asymptotically at least (s+1)log(n)+o(log(n)), where n is the length of the codes. They also provide a construction of single-deletion and single-substitution codes with redundancy 6log(n)+8. In this paper, we propose a family of systematic single-deletion s-substitution correcting codes of length n with asymptotic redundancy at most (3s+4)log(n)+o(log(n)) and polynomial encoding/decoding complexity, where s >= 2 is a constant. Specifically, the encoding and decoding complexities of the proposed codes are O(n^{s+3}) and O(n^{s+2}), respectively.
Speaker bio: Wentu Song received the B.S. and M.S. degrees in Mathematics from Jilin University, China, in 1998 and 2006, respectively, and the Ph.D. degree in Mathematics from Peking University, China, in 2012. He is currently a postdoctoral research fellow in the Advanced Coding and Signal Processing Lab, Singapore University of Technology and Design, where he works on coding theory, distributed storage, and coding for DNA-based data storage.
Single Indel/Edit Correcting Codes: Linear-Time Encoders and Order-Optimality
Kui Cai (Singapore University of Technology and Design); Yeow Meng Chee (National University of Singapore); Ryan Gabrys (Spawar Systems Center); Han Mao Kiah (Nanyang Technological University); Tuan Thanh Nguyen (Singapore University of Technology and Design)
Speaker: Tuan Thanh Nguyen, Singapore University of Technology and Design
Abstract: An indel refers to a single insertion or deletion, while an edit refers to a single insertion, deletion or substitution. In this work, we investigate quaternary codes that correct a single indel or single edit and provide linear-time algorithms that encode binary messages into these codes of length n. Particularly, we provide two linear-time encoders: one corrects a single edit with ⌈log n⌉ + O(log log n) redundancy bits, while the other corrects a single indel with ⌈log n⌉ + 2 redundant bits. These two encoders are order-optimal. The former encoder is the first known order-optimal encoder that corrects a single edit, while the latter encoder (that corrects a single indel) reduces the redundancy of the best known encoder of Tenengolts (1984) by at least four bits.
Speaker bio: Tuan Thanh Nguyen received the B.Sc. degree and the Ph.D. degree in mathematics from Nanyang Technological University, Singapore, in 2014 and 2018, respectively. He is currently a Research Fellow at the Singapore University of Technology and Design (SUTD), with the Advanced Coding and Signal Processing (ACSP) Lab of SUTD. He was a research fellow in the School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, from Aug 2018 to Sep 2019. His research interest lies in the interplay between combinatorics and computer science/engineering, particularly coding theory. His research projects concentrate on error correction codes and constrained codes for communication systems and data storage systems, especially codes for DNA-based data storage.
Coding and Bounds for Partially Defect Memory Cells
Haider Al Kim (Technical University of Munich, TUM); Sven Puchinger (Technical University of Denmark (DTU)); Antonia Wachter-Zeh (Technical University of Munich, TUM)
Speaker: Haider Al Kim, PhD Candidate at the Technical University of Munich, Department of Electrical and Computer Engineering, Coding and Cryptography (COD) Group
Abstract: This paper considers coding for \emph{partially stuck} memory cells. Such memory cells can only store partial information as some of their levels cannot be used due to, e.g., wear out. First, we present a code construction for masking such partially stuck cells while additionally correcting errors. Second, we derive a sphere-packing and a Gilbert-Varshamov bound for codes that can mask a certain number of partially stuck cells and correct errors additionally. A numerical comparison between the new bounds and our constructions of partially-stuck-at-masking codes (PSMCs) for any $u\leq n$ shows that our construction matches the Gilbert-Varshamov-like bound for several code parameters.
Speaker bio: Haider Al Kim is currently a PhD candidate at TUM, Germany, focusing on error correction codes for telecommunication networks and storage. From 2008 to 2018, he was a lecturer, researcher, and assistant chief network engineer at the University of Kufa (UOK), in both the Faculty of Engineering (Electronic and Communication Engineering Department, ECE) and the ITRDC. He received his Master's degree in Telecommunication Networks from the University of Technology Sydney (UTS), Sydney, Australia, in 2014 under the supervision of A/Prof. Kumbesan Sandrasegaran, and a B.Sc. in Information and Communication Engineering from Al-Khwarizmi Engineering College at the University of Baghdad, Baghdad, Iraq, in 2008, where he ranked 2nd among 27 students. His working and research areas are wireless telecommunication, mobile networks, network management, network design and implementation, and data analysis and monitoring; more recently, he has been working on coding for memories with partial defects and algebraic coding.
Chair: Sudarsun Kannan
Manycore-Based Scalable SSD Architecture Towards One and More Million IOPS
Jie Zhang (KAIST); Miryeong Kwon (KAIST); Michael Swift (University of Wisconsin-Madison); Myoungsoo Jung (KAIST)
Speaker: Jie Zhang, Korea Advanced Institute of Science and Technology (KAIST)
Abstract: NVMe is designed to unshackle flash from a traditional storage bus by allowing hosts to employ many threads to achieve higher bandwidth. While NVMe enables users to fully exploit all levels of parallelism offered by modern SSDs, current firmware designs are not scalable and have difficulty in handling a large number of I/O requests in parallel due to their limited computation power and many hardware contentions. We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests in a second (1MIOPS) while hiding long latencies imposed by its internal flash media. Inspired by parallel data analysis systems, we design the firmware based on a many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device. To show its extreme parallel scalability, we implement DeepFlash on a many-core prototype processor that employs dozens of lightweight cores, analyze new challenges from parallel I/O processing and address the challenges by applying concurrency-aware optimizations. Our comprehensive evaluation reveals that DeepFlash can serve around 4.5 GB/s, while minimizing the CPU demand on microbenchmarks and real server workloads.
Speaker bio: Dr. Jie Zhang is a postdoctoral researcher at KAIST. He is engaged in the research and design of computer architecture and systems including storage systems, non-volatile memory, and specialized processors. His research addresses the requirements for high-performance storage systems in the era of big data and artificial intelligence from the perspective of computer architecture. He is dedicated to breaking through the bottlenecks of data migration and the limitations of memory walls in the Von Neumann architecture.
The Storage Hierarchy is Not a Hierarchy: Optimizing Caching on Modern Storage Devices with Orthus
Kan Wu (University of Wisconsin-Madison); Zhihan Guo (University of Wisconsin-Madison); Guanzhou Hu (University of Wisconsin-Madison); Kaiwei Tu (University of Wisconsin-Madison); Ramnatthan Alagappan (VMware Research Group); Rathijit Sen (Microsoft); Kwanghyun Park (Microsoft); Andrea Arpaci-Dusseau (University of Wisconsin-Madison); Remzi Arpaci-Dusseau (University of Wisconsin-Madison)
Speaker: Kan Wu, University of Wisconsin-Madison
Abstract: We introduce non-hierarchical caching (NHC), a novel approach to caching in modern storage hierarchies. NHC improves performance as compared to classic caching by redirecting excess load to devices lower in the hierarchy when it is advantageous to do so. NHC dynamically adjusts allocation and access decisions, thus maximizing performance (e.g., high throughput, low 99%-ile latency). We implement NHC in Orthus-CAS (a block-layer caching kernel module) and Orthus-KV (a user-level caching layer for a key-value store). We show the efficacy of NHC via a thorough empirical study: Orthus-KV and Orthus-CAS offer significantly better performance (by up to 2×) than classic caching on various modern hierarchies, under a range of realistic workloads.
Speaker bio: Kan Wu is a PhD candidate in computer science at the University of Wisconsin-Madison, where he works with Professors Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau. His research interests include storage systems, databases, and distributed systems, with a primary focus on emerging storage technologies such as persistent memory. He received his Bachelor of Science (B.S.) from the University of Science and Technology of China.
Architecting Throughput Processors with New Flash
Jie Zhang (KAIST); Myoungsoo Jung (KAIST)
Speaker: Jie Zhang, Korea Advanced Institute of Science and Technology (KAIST)
Abstract: We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in the GPU and address the performance penalty imposed by SSD. Specifically, ZnG replaces all GPU internal DRAM with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of SSD by replacing the flash channels with a high-throughput flash network and integrating the SSD firmware in the GPU MMU to reap the benefits of hardware acceleration. Although the NAND flash array within the SSD can deliver high accumulated bandwidth, only a small fraction of its bandwidth can be utilized by the memory requests, due to the mismatch of access granularity. To address this, ZnG employs a large L2 cache and flash registers to buffer the memory requests. Our evaluation results indicate that ZnG can achieve 7.5x higher performance than prior work.
Speaker bio: Dr. Jie Zhang is a postdoctoral researcher at KAIST. He is engaged in the research and design of computer architecture and systems including storage systems, non-volatile memory, and specialized processors. His research addresses the requirements for high-performance storage systems in the era of big data and artificial intelligence from the perspective of computer architecture. He is dedicated to breaking through the bottlenecks of data migration and the limitations of memory walls in the Von Neumann architecture.
Explaining SSD failures using Anomaly Detection
Chandranil Chakraborttii (University of California, Santa Cruz); Heiner Litz (University of California, Santa Cruz)
Speaker: Chandranil Chakraborttii, University of California, Santa Cruz
Abstract: NAND flash-based solid-state drives (SSDs) represent an important storage tier in data centers, holding most of today’s warm and hot data. Even with advanced fault tolerance techniques and low failure rates, large hyperscale data centers utilizing hundreds of thousands of SSDs suffer from multiple device failures daily. Data center operators are interested in predicting SSD device failures for two main reasons. First, even with RAID [2] and replication [5] techniques in place, device failures induce transient recovery and repair overheads, affecting the cost and tail latency of storage systems. Second, predicting near-term failure trends helps to inform the device acquisition process, thus avoiding capacity bottlenecks. Hence, it is important to predict both short-term individual device failures and near-term failure trends. Prior studies on predicting storage device failures [1, 6, 7, 9] suffer from the following main challenges. First, as they utilize black-box machine learning (ML) techniques, they are unaware of the underlying failure reasons, rendering it difficult to determine the failure types that these models can predict. Second, the models in prior work struggle with dynamic environments that suffer from previously unseen failures that have not been included in the training set. These two challenges are especially relevant for the SSD failure detection problem, which suffers from high class imbalance. In particular, the number of healthy drive observations is generally orders of magnitude larger than the number of failed drive observations, thus posing a problem for training most traditional supervised ML models. To address these challenges, we propose to utilize 1-class ML models that are trained only on the majority class. By ignoring the minority class for training, our 1-class models avoid overfitting to an incomplete set of failure types, thereby improving the overall prediction performance by up to 9.5% in terms of ROC AUC score. Furthermore, we introduce a new learning technique for SSD failure detection, the 1-class autoencoder, which enables interpretability of the trained models while providing high prediction accuracy. In particular, 1-class autoencoders provide insights into which features and combinations of features are most relevant to flagging a particular type of device failure. This enables categorization of failed drives based on their failure type, thus informing the specific procedures (e.g., repair, swap, etc.) that need to be applied to resolve the failure. For analysis and evaluation of our proposed techniques, we leverage a cloud-scale dataset from Google that has already been used in prior work [1, 8]. This dataset contains 40 million observations from over 30,000 drives over a period of six years. For each observation, the dataset contains 21 different SSD telemetry parameters including SMART (Self-Monitoring, Analysis, and Reporting Technology) parameters, the amount of read and written data, error codes, as well as information about blocks that became non-operational over time. Around 30% of the drives that failed during the data collection process were replaced while the rest were removed, and hence no longer appeared in the dataset. As a result, we obtained approximately 300 observations for each healthy drive (40 million observations in total) and 4 to 140 observations for each failed drive (15,000 total observations). We treated each data point as an independent observation and normalized all the non-categorical data values to be between 0 and 1.
One of our primary goals was to select the most distinguishing features that are highly correlated with failures for training. We used three different feature selection methods, Filter, Embedded, and Wrapper [4] techniques, to select the most important features contributing to failures in our dataset. The resulting set of top features was: correctable error count, cumulative bad block count, cumulative p/e cycle, erase count, final read error, read count, factory bad block count, write count, and status read-only. The dataset containing only the top selected features is then used for training the different ML models. In a datacenter, we envision our SSD failure prediction technique to be implemented as shown in Figure 1. The telemetry traces are collected periodically from all SSDs in the datacenter and sent to the preprocessing pipeline, which transforms all input data into numeric values while filtering out incomplete and noisy values. Following data preprocessing, feature selection is performed to extract the most important features from the data set. The preprocessed data is then either utilized for training or inference. For inference, device anomalies are reported and classified according to our 1-class autoencoder approach. SSDs can then be manually analyzed by a technician or replaced directly. As an alternative, a scrubber can be leveraged to validate the model predictions by performing a low-level analysis of the SSD. To evaluate the five ML techniques, we first label all 40 million observations in the dataset to separate healthy from failed drive observations. We then perform a 90%-10% split of the dataset into a training set and an evaluation set, respectively. For training the 1-class models we remove all failed drive observations from the training set; however, the evaluation set is kept identical for our proposed 1-class techniques and the three baselines. We use the ROC AUC score as a metric for comparing the performance of our approaches with the chosen baselines, which is in line with prior work [1], and use 10-fold cross-validation for evaluating all approaches. The 1-class autoencoder model utilizes 4 hidden layers comprising 50, 25, 25, and 50 neurons, respectively. The neurons utilize a tanh activation function and the Adam optimizer, and the model is trained for 100 epochs. We use early stopping with a patience value of 5, ensuring that training stops when the loss does not decrease after 5 consecutive epochs. Increasing the number of hidden layers beyond 4 increases the training time significantly without providing performance benefits. Figure 2 illustrates the comparative performance of different ML techniques for predicting SSD failures one day ahead. Among the baselines, Random Forest performs best, providing a ROC AUC score of 0.85. Both our 1-class models outperform the best baseline. In particular, the 1-class isolation forest achieves a ROC AUC score of 0.91, representing a 7% improvement over the best baseline, while the 1-class autoencoder outperforms Random Forest by 9.5%. This work introduces 1-class autoencoders for interpreting failures. In particular, our technique exposes the reasons determined by our model to flag a particular device failure. This is achieved by utilizing the reconstruction error generated by the model while reproducing the output using the trained representation of a healthy drive.
The failed drives do not conform to this representation and hence generate an output that differs significantly from the actual input, producing a large reconstruction error. We study the reconstruction error per feature to generate the failure reasons. A feature that contributes more than the average per-feature error to the reconstruction error is flagged as a significant reason. The results show that many failed drives exhibit a higher than normal number of correctable_errors and cumulative_bad_block; however, these were selected as a reason for failure in only 35% and 30% of the cases, respectively. Hence, our analysis shows that there exist particularly relevant features that indicate device failures in many cases; however, only the combination of several features enables accurate failure prediction. To conclude, this paper provides a comprehensive analysis of machine learning techniques to predict SSD failures in the cloud. We observe that prior works on SSD failure prediction suffer from the inability to predict previously unseen failure types, motivating us to explore 1-class machine learning models such as the 1-class isolation forest and the 1-class autoencoder. We show that our approaches outperform prior work by 9.5% in ROC AUC score by improving the prediction accuracy for failed drives. Finally, we show that 1-class autoencoders enable interpretability of model predictions by exposing the reasons determined by the model for predicting a failure. A more comprehensive evaluation of our approach is discussed in [3], where we show the adaptability of 1-class models to dynamic environments with new types of failures emerging over time, as well as the impact of predicting further ahead in time.
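The 1-class reconstruction-error idea is straightforward to prototype. Below is a minimal, hedged sketch (not the authors' pipeline): it trains an autoencoder-shaped MLPRegressor from scikit-learn on synthetic "healthy" telemetry only, then flags an observation via its reconstruction error and reports the features contributing more than the average per-feature error, mirroring the interpretability mechanism described above. The feature count, data, and threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_features = 10                                       # stand-in for the selected features
healthy = rng.uniform(0.0, 0.3, (5000, n_features))   # synthetic "healthy" telemetry
failed = rng.uniform(0.0, 1.0, (20, n_features))      # synthetic anomalies

# 1-class training: the autoencoder only ever sees healthy observations.
ae = MLPRegressor(hidden_layer_sizes=(50, 25, 25, 50), activation="tanh",
                  solver="adam", max_iter=100,
                  early_stopping=True, n_iter_no_change=5)
ae.fit(healthy, healthy)                  # learn to reproduce healthy profiles

def explain(model, x):
    # Per-feature squared reconstruction error; features above the mean
    # per-feature error are reported as the reasons for flagging x.
    err = (model.predict(x[None, :])[0] - x) ** 2
    return err.sum(), np.where(err > err.mean())[0]

score, reasons = explain(ae, failed[0])
print(f"anomaly score {score:.3f}, significant features: {reasons}")
```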
Speaker bio: Chandranil Chakraborttii is a doctoral candidate in the Department of Computer Science and Engineering at the University of California, Santa Cruz. His main research interests are in machine learning, data science, and storage systems, with a focus on data centers. His research involves the use of machine learning techniques for improving the performance of flash-based storage systems, in two major directions: the reliability and the response time of flash-based storage devices. Before starting his Ph.D. program, Chandranil spent three years as a software engineer in the software industry and has also taught for the Stanford Summer Institutes for four years.
Deterministic I/O and Resource Isolation for OS-level Virtualization In Server Computing
Miryeong Kwon (KAIST); Donghyun Gouk (KAIST); Changrim Lee (KAIST); ByoungGeun Kim (Samsung); Jooyoung Hwang (Samsung); Myoungsoo Jung (KAIST)
Speaker: Miryeong Kwon, KAIST
Abstract: We propose DC-Store, a storage framework that offers deterministic I/O performance for a multi-container execution environment. DC-Store's hardware-level design implements multiple NVM sets on a shared storage pool, each providing a deterministic SSD access time by removing internal resource conflicts. In parallel, the software support of DC-Store is aware of the NVM sets and enlightens the Linux kernel to isolate noisy-neighbor containers, which perform page frame reclaiming, from their peers. We prototype both the hardware and software counterparts of DC-Store and evaluate them in a real system. The evaluation results demonstrate that containerized data-intensive applications on DC-Store exhibit a 31% shorter execution time, on average, compared to those on a baseline system.
Speaker bio: Miryeong Kwon is a Ph.D. candidate at the Korea Advanced Institute of Science and Technology (KAIST), advised by Myoungsoo Jung, who leads research spanning computer architecture, non-volatile memory, and operating systems. Her main research interests are OS-level virtualization environments and the management of non-volatile memory and storage devices in such systems.
Chair: Ahmed Hareedy (Duke)
Tunable Fine Learning Rate controlled by pulse width modulation in Charge Trap Flash (CTF) for Synaptic Application
Shalini Shrivastava (Indian Institute of Technology Bombay); Udayan Ganguly (Indian Institute of Technology Bombay)
Speaker: Shalini Shrivastava, Indian Institute of Technology Bombay, Mumbai, India
Abstract: Brain-inspired neuromorphic computation is in high demand for next-generation computational systems due to its high performance, low power, and high energy efficiency. Flash memory, a highly mature technology today, has been a promising electronic synaptic device since 1989. A linear, gradual, and symmetric learning rate is a basic requirement for a high-performance synaptic device. In this paper, we demonstrate a fine-controlled learning rate in Charge Trap Flash (CTF) by pulse width modulation of the input gate pulse. We further study the effect of cycle-to-cycle (C2C) and device-to-device (D2D) variability, and the limits of charge fluctuation with scaling, on the learning rate. A comparison of the CTF synapse with other state-of-the-art devices is carried out. The learning rate with CTF can be tuned from 0.2% to 100%, which is remarkable for a single device. Further, C2C variability does not affect the conductance; it is limited by D2D variability only for learning levels > 8000. We also show that the CTF synapse has a lower sensitivity to charge fluctuation even with scaled devices. The tunable learning rate and lower sensitivity to variability and charge fluctuation of the CTF synapse are significant compared to the state-of-the-art, and of great interest for brain-inspired computing systems.
Speaker bio: Shalini Shrivastava received her M.Tech in Electrical Engineering from IIT Bombay in 2013. She is a Ph.D. student in the Department of Electrical Engineering, IIT Bombay. Her research interests include both experimental and theoretical semiconductor device physics. She is currently doing research on energy-efficient electronic devices for neuromorphic computing at IITB with Prof. Udayan Ganguly.
Ferroelectric, Analog Resistive Switching in BEOL Compatible TiN/HfZrO4/TiOx Junctions
Laura Bégon-Lours (IBM Research GmbH); Mattia Halter (IBM Research GmbH, ETH Zurich); Youri Popoff (IBM Research GmbH, ETH Zurich); Bert Jan Offrein (IBM Research GmbH)
Speaker: Laura Bégon-Lours, IBM Research
Abstract: Thanks to their compatibility with CMOS technologies, hafnium-based ferroelectric devices are receiving increasing interest for the fabrication of neuromorphic hardware. In this work, an analog resistive memory device is fabricated with a process developed for Back-End-Of-Line integration. A 4.5 nm thick HfZrO4 (HZO) layer is crystallized into the ferroelectric phase, a thickness thin enough to allow electrical conduction through the layer. A TiOx interlayer is used to create an asymmetric junction, as required for transferring a polarization state change into a modification of the conductivity. Memristive functionality is obtained, in the pristine state as well as after ferroelectric wake-up, involving redistribution of oxygen vacancies in the ferroelectric layer. The resistive switching is shown to originate directly from the ferroelectric properties of the HZO layer.
Speaker bio: Laura is a post-doctoral researcher at IBM Research in Zurich. After studying physics at ESPCI (Paris), she joined the Unité Mixte de Physique CNRS-Thales for her Ph.D. research on ferroelectric field-effects in high-Tc superconductors (YBCO cuprates). To develop her skills in materials science and epitaxial growth of complex oxides, she joined the MESA+ institute for two years, where she demonstrated epitaxial growth of HfZrO4 on a GaN template. She is now a Marie Curie fellow at IBM Research, where she develops ferroelectric devices for synaptic weights in artificial neural network accelerators.
HD-RRAM: Improving Write Operations on MLC 3D Cross-point Resistive Memory
Chengning Wang (Huazhong University of Science and Technology); Dan Feng (Huazhong University of Science and Technology); Wei Tong (Huazhong University of Science and Technology); Yu Hua (Huazhong University of Science and Technology); Jingning Liu (Huazhong University of Science and Technology); Bing Wu (Huazhong University of Science and Technology); Wei Zhao (Huazhong University of Science and Technology); Linghao Song (Duke University); Yang Zhang (Huazhong University of Science and Technology); Jie Xu (Huazhong University of Science and Technology); Xueliang Wei (Huazhong University of Science and Technology); Yiran Chen (Duke University)
Speaker: Chengning Wang, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, China
Abstract: Multilevel cell (MLC) storage, cross-point array structure, and three-dimensional (3D) array integration are three technologies for scaling up the density of resistive memory. However, combining the three technologies strengthens the interactions between array-level and cell-level nonidealities (IR drop, sneak current, and cycle-to-cycle variation) in resistive memory arrays and significantly degrades array write performance. We propose a nonideality-tolerant high-density resistive memory (HD-RRAM) system based on multilayered MLC 3D cross-point arrays that weakens the interactions between nonidealities and mitigates their degradation effects on write performance. HD-RRAM is equipped with a double-transistor array architecture with multiside asymmetric bias, proportional-control state tuning, and MLC parallel writing techniques. The evaluation shows that the HD-RRAM system can reduce access latency by 27.5% and energy consumption by 37.2% over an aggressive baseline.
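As a loose illustration of proportional-control state tuning, the sketch below runs a write-verify loop in which the programming pulse is scaled by the remaining error; the gain, tolerance, and linear pulse response are assumptions for illustration, not the paper's device model.

```cpp
#include <cmath>
#include <cstdio>

// Toy write-verify loop with proportional control: each iteration reads
// the cell and applies a programming pulse whose effect is proportional
// to the remaining error between target and read conductance.
int main() {
    double g = 10e-6;              // current cell conductance (S)
    const double target = 25e-6;   // desired MLC level (S)
    const double kp = 0.5;         // proportional gain (assumed)
    const double tol = 0.02;       // stop within 2% of the target
    for (int pulse = 1; pulse <= 20; ++pulse) {
        double err = target - g;                  // "verify": read back
        if (std::fabs(err) / target < tol) break; // close enough
        g += kp * err;                            // "write": error-scaled pulse
        std::printf("pulse %2d: g = %5.2f uS\n", pulse, g * 1e6);
    }
}
```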
Speaker bio: Chengning Wang is a fifth-year Ph.D. student with Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, China. His research interests include modeling, design, analysis, and co-optimization of high-density memristive nanodevices and arrays, and analog parallel computation-in-memory for novel applications. He has served as an invited reviewer for several SCI-indexed journals and conferences, including IEEE ISCAS, IEEE TVLSI, IEEE TC, and Front. Inform. Technol. Electron. Eng.
Reconstruction Algorithms for DNA-Storage Systems
Omer Sabary (University of California, San Diego); Alexander Yucovich (Technion); Guy Shapira (Technion); Eitan Yaakobi (Technion)
Speaker: Omer Sabary, University of California, San Diego
Abstract: In the trace reconstruction problem, a length-$n$ string $x$ yields a collection of noisy copies, called traces, $y_1, \ldots, y_t$, where each $y_i$ is independently obtained from $x$ by passing through a deletion channel, which deletes every symbol with some fixed probability. The main goal under this paradigm is to determine the minimum number of i.i.d. traces required to reconstruct $x$ with high probability. The trace reconstruction problem can be extended to the model where each trace results from $x$ passing through a deletion-insertion-substitution channel, which also introduces insertions and substitutions. Motivated by the DNA storage channel, this work focuses on another variation, referred to as the DNA reconstruction problem. A DNA reconstruction algorithm is a mapping $R: (\Sigma_q^*)^t \to \Sigma_q^*$ which receives $t$ traces $y_1, \ldots, y_t$ as input and produces $\widehat{x}$, an estimate of $x$. The goal in the DNA reconstruction problem is to minimize the edit distance $d_e(x,\widehat{x})$ between the original string and the algorithm's estimate; for the deletion-channel case, referred to as the deletion DNA reconstruction problem, the goal is to minimize the Levenshtein distance $d_L(x,\widehat{x})$. In this work, we present several new algorithms for these reconstruction problems. Our algorithms look globally at the entire set of traces and use dynamic programming algorithms for the shortest common supersequence and longest common subsequence problems to decode the original sequence. They place no limitations on the input or the number of traces, and they perform well even for error probabilities as high as $0.27$. The algorithms have been tested on simulated data as well as data from previous DNA experiments and are shown to outperform all previous algorithms.
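As a minimal, self-contained illustration of one dynamic-programming building block the abstract mentions (not the authors' full reconstruction pipeline), the following sketch computes the longest common subsequence of two traces.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Classical O(nm) dynamic program for the longest common subsequence of
// two traces, followed by a traceback to recover one witness string.
std::string lcs(const std::string& a, const std::string& b) {
    const size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> dp(n + 1, std::vector<int>(m + 1, 0));
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j)
            dp[i][j] = (a[i - 1] == b[j - 1])
                           ? dp[i - 1][j - 1] + 1
                           : std::max(dp[i - 1][j], dp[i][j - 1]);
    std::string out;                       // trace back one optimal path
    for (size_t i = n, j = m; i > 0 && j > 0;) {
        if (a[i - 1] == b[j - 1]) { out += a[i - 1]; --i; --j; }
        else if (dp[i - 1][j] >= dp[i][j - 1]) --i;
        else --j;
    }
    return {out.rbegin(), out.rend()};
}

int main() {
    // Two hypothetical traces of the same strand after deletions.
    std::cout << lcs("ACGTTACG", "ACTTACG") << '\n';  // prints ACTTACG
}
```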
Speaker bio: Omer Sabary is a PhD student at the Center for Memory and Recording Research, Electrical and Computer Engineering Department, UC San Diego. His advisor is Prof. Paul H. Siegel, and his research interests include coding and algorithms for DNA storage systems. He recently received his M.Sc. from the Technion, where his advisor was Prof. Eitan Yaakobi.
Chair: Baris Kasikci
Digital-based Processing In-Memory for Acceleration of Unsupervised Learning
Mohsen Imani (UC Irvine); Saransh Gupta (UC San Diego); Yeseong Kim (UC San Diego); Tajana Rosing (UC San Diego)
Speaker: Mohsen Imani, University of California Irvine
Abstract: Today's applications generate a large amount of data that needs to be processed by learning algorithms. In practice, the majority of the data is not associated with any labels. Unsupervised learning methods, e.g., clustering, are the most commonly used algorithms for data analysis. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing speed due to the large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration, which supports a wide range of popular algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way and enables in-place computation, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms over a wide range of large-scale datasets. Our evaluation shows that DUAL provides quality comparable to existing clustering algorithms while using a binary representation and a simplified distance metric, and delivers 58.8× speedup and 251.2× energy-efficiency improvement compared to the state-of-the-art solution running on a GPU.
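A minimal sketch of the general idea of mapping points into high-dimensional binary space so that clustering reduces to memory-friendly Hamming-distance comparisons; the random-hyperplane encoder and all sizes here are assumptions, not DUAL's exact design.

```cpp
#include <array>
#include <bitset>
#include <cstdio>
#include <random>
#include <vector>

// Points are encoded with shared random hyperplanes; similarity is then
// measured with Hamming distance, which maps naturally onto digital
// in-memory logic (XOR + popcount) instead of floating-point arithmetic.
constexpr int D = 1024;   // hypervector width (toy value)
constexpr int F = 4;      // input features
using HV = std::bitset<D>;

HV encode(const std::array<double, F>& x,
          const std::vector<std::array<double, F>>& planes) {
    HV h;
    for (int d = 0; d < D; ++d) {
        double dot = 0;
        for (int f = 0; f < F; ++f) dot += planes[d][f] * x[f];
        h[d] = dot > 0;   // bit d = sign of a random projection
    }
    return h;
}

int main() {
    std::mt19937 gen(42);
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<std::array<double, F>> planes(D);
    for (auto& p : planes) for (auto& w : p) w = gauss(gen);

    HV a = encode({1.0, 2.0, 3.0, 4.0}, planes);
    HV b = encode({1.1, 2.0, 3.0, 4.0}, planes);    // near a
    HV c = encode({-5.0, 0.0, 7.0, -1.0}, planes);  // far from a
    // Nearby points disagree on few bits; distant ones on roughly half.
    std::printf("d(a,b) = %zu, d(a,c) = %zu\n",
                (a ^ b).count(), (a ^ c).count());
}
```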
Speaker bio: Mohsen Imani is a tenure-track assistant professor in the Department of Computer Science at the University of California, Irvine, where he directs the Bio-Inspired Architecture and Systems Laboratory (BIASLab) in the Donald Bren School of Information and Computer Sciences (ICS). Dr. Imani received his Ph.D. from the Department of Computer Science and Engineering at UC San Diego in 2020. He has published over 100 papers in top IEEE/ACM conferences and journals, with over 2,000 citations and an h-index of 25. His contributions have led to a new direction in brain-inspired hyperdimensional computing that enables ultra-efficient and real-time learning and cognitive support, and his research helped initiate multiple industrial and governmental research programs at the Semiconductor Research Corporation (SRC) and DARPA. Dr. Imani's research has been recognized with several awards, including the Bernard and Sophia Gordon Engineering Leadership Award, the Outstanding Researcher Award, and the Powell Fellowship Award. He also received the Best Doctoral Research award from UC San Diego in 2018 and several best paper nominations at top conferences.
High Precision In-Memory Computing for Deep Neural Network Acceleration
Mohsen Imani (University of California Irvine); Saransh Gupta (University of California San Diego); Yeseong Kim (University of California San Diego); Minxuan Zhou (University of California San Diego); Tajana Rosing (University of California San Diego)
Speaker:
Abstract: Processing In-Memory (PIM) has shown great potential to accelerate the inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high-precision computation, e.g., floating-point precision, which is essential for training accurate CNN models. In addition, most existing PIM approaches require analog/mixed-signal circuits that do not scale and rely on multi-bit Non-Volatile Memory (NVM) that is insufficiently reliable. In this paper, we propose FloatPIM, a fully digital, scalable PIM architecture that accelerates CNNs in both the training and testing phases. FloatPIM natively supports floating-point representation, thus enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce internal data movement of the PIM architecture. We evaluate the efficiency of FloatPIM on the ImageNet dataset using popular large-scale neural networks.
DRAM-less Accelerator for Energy Efficient Data Processing
Jie Zhang (KAIST); Gyuyoung Park (KAIST); David Donofrio (Lawrence Berkeley National Laboratory); John Shalf (Lawrence Berkeley National Laboratory); Myoungsoo Jung (KAIST)
Speaker: Jie Zhang, Korea Advanced Institute of Science and Technology (KAIST)
Abstract: General-purpose hardware accelerators have become major data processing resources in many computing domains. However, the processing capability of hardware accelerators is often limited by costly software interventions and memory copies needed to support compulsory data movement between different processors and solid-state drives (SSDs), which in turn wastes a significant amount of energy in modern accelerated systems. In this work, we propose DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase-change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with minimal software modifications. We implement a new memory controller that connects a real 3x-nm multi-partition PRAM to 28-nm FPGA logic cells and integrate the design into a real PCIe accelerator emulation platform. The evaluation results reveal that DRAM-less achieves, on average, 47% better performance than advanced acceleration approaches that use peer-to-peer DMA.
Speaker bio: Dr. Jie Zhang is a postdoctoral researcher at KAIST. He is engaged in the research and design of computer architecture and systems, including storage systems, non-volatile memory, and specialized processors. His research addresses the requirements of high-performance storage systems in the era of big data and artificial intelligence from the perspective of computer architecture. He is dedicated to breaking through the bottlenecks of data migration and the limitations of the memory wall in the Von Neumann architecture.
Cross-Layer Design Space Exploration of NVM-based Caches for Deep Learning
Ahmet Inci (Carnegie Mellon University); Mehmet Meric Isgenc (Carnegie Mellon University); Diana Marculescu (The University of Texas at Austin)
Speaker: Ahmet Inci, Carnegie Mellon University
Abstract: Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2x and 2.3x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
Speaker bio: Ahmet Inci received his B.Sc. degree in Electronics Engineering at Sabanci University, Istanbul, Turkey in 2017. He is currently a Ph.D. candidate at Carnegie Mellon University, co-advised by Prof. Diana Marculescu and Prof. Gauri Joshi. His research interests include systems for ML, computer architecture, hardware-efficient deep learning, and HW/ML model co-design.
Sentinel: Efficient Tensor Migration and Allocation on Persistent Memory-based Heterogeneous Memory Systems for Deep Learning
Jie Ren (University of California, Merced); Jiaolin Luo (University of California, Merced); Kai Wu (University of California, Merced); Minjia Zhang (Microsoft Research); Hyeran Jeon (University of California, Merced); Dong Li (University of California, Merced)
Speaker: Jie Ren, University of California, Merced
Abstract: Memory capacity is a major bottleneck for training deep neural networks (DNNs). Heterogeneous memory (HM) combining fast and slow memories provides a promising direction for increasing memory capacity, but HM imposes challenges on tensor migration and allocation for high-performance DNN training. Prior work heavily relies on DNN domain knowledge, causes unnecessary tensor migration due to page-level false sharing, and wastes fast memory space. We present Sentinel, a software runtime system that automatically optimizes tensor management on HM. Sentinel uses dynamic profiling and coordinates operating system (OS) and runtime-level profiling to bridge the semantic gap between the OS and applications, which enables tensor-level profiling. This profiling enables co-allocating tensors with similar lifetimes and memory access frequencies into the same pages. Such fine-grained profiling and tensor collocation avoids unnecessary data movement, improves tensor movement efficiency, and enables larger-batch training because of savings in fast memory space. Sentinel reduces fast memory consumption by 80% while retaining performance comparable to a fast-memory-only system; it consistently outperforms a state-of-the-art solution on CPU by 37% and two state-of-the-art solutions on GPU by 2x and 21%, respectively, in training throughput.
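The co-allocation idea might be sketched as follows; the Tensor fields, page size, and packing policy are hypothetical stand-ins for Sentinel's profiling-driven decisions.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// Sketch of lifetime- and hotness-aware co-allocation: tensors with
// similar lifetime and access frequency are packed into the same page,
// so page-granularity migration never mixes unlike tensors.
struct Tensor { int id; int lifetime_ops; int accesses; int bytes; };

int main() {
    std::vector<Tensor> ts = {
        {0, 10, 900, 1024}, {1, 200, 30, 2048}, {2, 12, 850, 1024},
        {3, 190, 25, 2048}, {4, 11, 870, 2048},
    };
    // Cluster by lifetime first, then by hotness within a lifetime class.
    std::sort(ts.begin(), ts.end(), [](const Tensor& a, const Tensor& b) {
        return std::make_pair(a.lifetime_ops, -a.accesses)
             < std::make_pair(b.lifetime_ops, -b.accesses);
    });
    const int kPage = 4096;
    int page = 0, used = 0;
    for (const Tensor& t : ts) {   // pack sorted neighbors into pages
        if (used + t.bytes > kPage) { ++page; used = 0; }
        used += t.bytes;
        std::printf("tensor %d (lifetime %3d) -> page %d\n",
                    t.id, t.lifetime_ops, page);
    }
}
```

In this toy run the three short-lived, hot tensors land on one page and the two long-lived, cold tensors on another, so migrating a page never drags cold data into fast memory.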
Speaker bio: Jie Ren is a PhD student at the University of California, Merced. Her research focuses on high-performance computing, especially memory management on persistent memory-based heterogeneous memory.
Chair: Dong Li (UC Merced)
Dancing in the Dark: Profiling for Tiered Memory
Jinyoung Choi (University of California, Riverside); Sergey Blagodurov (Advanced Micro Devices (AMD)); Hung-Wei Tseng (University of California, Riverside)
Speaker: Jinyoung Choi, University of California, Riverside
Abstract: With the DDR standard facing density challenges and the emergence of non-volatile memory technologies such as cross-point, phase-change, and fast flash media, compute and memory vendors are contending with a paradigm shift in the datacenter space. The decades-long status quo of designing servers with DRAM as the exclusive memory solution is likely coming to an end. Future systems will increasingly employ tiered memory architectures (TMAs), in which multiple memory technologies work together to satisfy applications' ever-growing demands for more memory, less latency, and greater bandwidth. Exactly how to expose each memory type to software is an open question. Recent systems have focused on hardware caching to leverage faster DRAM while exposing slower non-volatile memory to OS-addressable space. The hardware approach to dealing with the non-uniformity of TMAs, however, requires complex changes to the processor and cannot use fast memory to increase the system's overall memory capacity. Mapping an entire TMA as OS-visible memory alleviates the challenges of the hardware approach but pushes the burden of managing data placement in the TMA to the software layers. The software, however, does not see the memory accesses by default; to make informed memory-scheduling decisions, it must rely on hardware methods to gain visibility into the load/store address stream. The OS then uses this information to place data in the most suitable memory location. In the original paper, we evaluate different methods of memory-access collection and propose a hybrid tiered-memory approach that offers comprehensive visibility into TMAs.
Speaker bio: Jinyoung Choi is a third-year Ph.D. student in the Department of Computer Science and Engineering at the University of California, Riverside. He is a member of the Extreme Storage & Computer Architecture Lab (ESCAL) under the guidance of Dr. Tseng. His main research interests include tiered-memory architectures, emerging memory technologies, heterogeneous architectures, and emerging interconnects. He collaborated with AMD Research as a co-op student in the summers of 2019 and 2020. Before pursuing his Ph.D., he worked at Telechips, Inc., Korea, for four years as an embedded software engineer, mainly developing Linux kernel drivers.
Investigating Hardware Caches for Terabyte-scale NVDIMMs
Julian T. Angeles (University of California, Davis); Mark Hildebrand (University of California, Davis); Venkatesh Akella (University of California, Davis); Jason Lowe-Power (University of California, Davis)
Speaker: Julian T. Angeles, Department of Computer Science, UC Davis
Abstract: Non-volatile memory (NVRAM) based on phase-change memory (such as the Optane DC Persistent Memory Module) is making its way into Intel servers to address the needs of emerging applications with huge memory footprints. These systems place both DRAM and NVRAM on the same memory channel, with the smaller-capacity DRAM serving as a cache to the larger-capacity NVRAM in the so-called 2LM mode. In this work, we perform a preliminary study of the performance of applications known for diverse workload characteristics and irregular memory access patterns, using DRAM caches on real hardware. To accomplish this, we evaluate a variety of graph processing algorithms on large real-world graph inputs using Galois, a high-performance shared-memory graph analytics framework. We identify a few key characteristics of these large-scale, bandwidth-bound applications that DRAM caches do not account for, preventing them from taking full advantage of PMM read and write bandwidth. We argue that software-based techniques are necessary for orchestrating data movement to take full advantage of these new heterogeneous memory systems.
Speaker bio: Julian is a second-year PhD student in the Computer Science Department at UC Davis. His current primary research interest lies in software-hardware co-design for heterogeneous architectures. He received his bachelor's degree from Chico State. He is advised by Professor Jason Lowe-Power.
HMMU: A Hardware-based Hybrid Memory Management Unit
Fei Wen (Texas A&M University); Mian Qin (Texas A&M University); Paul Gratz (Texas A&M University); Narasimha Reddy (Texas A&M University)
Speaker: Fei Wen, Texas A&M University
Abstract: Mobile applications today have rapidly growing memory footprints, posing a great challenge for memory system design. Insufficient DRAM main memory incurs frequent data swaps between memory and storage, a process that hurts performance, consumes energy, and deteriorates the write endurance of typical flash storage devices. Alternatively, a larger DRAM has higher leakage power and drains the battery faster, and DRAM scaling trends make further growth of DRAM in the mobile space cost-prohibitive. Emerging non-volatile memory (NVM) has the potential to alleviate these issues due to its higher capacity per cost than DRAM and minimal static power. Recently, a wide spectrum of NVM technologies has emerged, including phase-change memory (PCM), memristors, and 3D XPoint. Despite these advantages, NVM has longer access latency than DRAM, and NVM writes can incur higher latency and wear costs. Therefore, integrating these new memory technologies into the memory hierarchy requires fundamentally rearchitecting traditional system designs. In this work, we propose a hardware-accelerated memory manager (HMMU) that addresses both DRAM and NVM in a flat address space, with a small partition of the DRAM reserved for sub-page block-level management. We design a set of data placement and data migration policies within this memory manager to exploit the advantages of each memory technology. By augmenting the system with this HMMU, we reduce overall memory latency while also reducing writes to the NVM. Experimental results show that our design achieves a 39% reduction in energy consumption with only a 12% performance degradation versus an all-DRAM baseline that is likely untenable in the future.
Speaker bio: Fei Wen received his Ph.D. in Computer Engineering from Texas A&M University in 2020. He conducted research on interconnect network design and modeling for exascale systems as a research associate at HP Labs. His current research interests include computer architecture, memory systems, and FPGA accelerators. He has expertise across the hardware/software stack, in RTL design, FPGA development, kernel programming, and architecture performance modeling.
Unbounded Hardware Transactional Memory for a Hybrid DRAM/NVM Memory System
Jungi Jeong (Purdue University); Jaewan Hong (KAIST); Seungryoul Maeng (KAIST); Changhee Jung (Purdue University); Youngjin Kwon (KAIST)
Speaker: Jungi Jeong, Purdue University
Abstract: Persistent memory programming requires failure atomicity. To achieve this efficiently, recent proposals use hardware-based logging for atomic-durable updates and hardware transactional memory (HTM) for isolation. Although unbounded HTMs are promising for both performance and programmability reasons, none of the previous studies satisfies the practical requirements: they either require unrealistic hardware overheads or do not allow transactions to exceed on-chip cache boundaries. Furthermore, it has never been possible to use both DRAM and NVM in HTM, though hybrid memory is becoming a popular persistency model. To this end, this study proposes UHTM, an unbounded hardware transactional memory for DRAM/NVM hybrid memory systems. UHTM combines the cache coherence protocol and address signatures to detect conflicts across the entire memory space. This approach improves concurrency by significantly reducing the false-positive rates of previous studies. More importantly, UHTM allows DRAM and NVM data to interact with each other in transactions without compromising the consistency guarantee. This is rendered possible by UHTM's hybrid version management, which provides an undo-based log for DRAM and a redo-based log for NVM. Experimental results show that UHTM outperforms the state-of-the-art durable HTM, which is LLC-bounded, by 56% on average and up to 818%.
Speaker bio: Postdoctoral Research Associate at Purdue University.
Sparta: High-Performance, Element-Wise Sparse Tensor Contraction on Persistent Memory-based Heterogeneous Memory
Jiawen Liu (University of California, Merced); Jie Ren (University of California, Merced); Roberto Gioiosa (Pacific Northwest National Laboratory); Dong Li (University of California, Merced); Jiajia Li (Pacific Northwest National Laboratory)
Speaker: Jiawen Liu, University of California, Merced
Abstract: Sparse tensor contractions appear commonly in many applications. Efficiently computing the contraction of two sparse tensors is challenging: it not only inherits the challenges of common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory access and an output size unknown before computation, but also raises new challenges because of the high dimensionality of tensors, expensive multi-dimensional index search, and massive intermediate and output data. To address these challenges, we introduce three optimization techniques, using a multi-dimensional, efficient hash table representation for the accumulator and the larger input tensor, and all-stage parallelization. Evaluating on 15 datasets, we show that Sparta brings 28-576x speedup over traditional sparse tensor contraction with SPA. With our proposed algorithm- and memory-heterogeneity-aware data management, Sparta brings additional performance improvement on heterogeneous memory with DRAM and the Intel Optane DC Persistent Memory Module (PMM) over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 30.7% (up to 98.5%), 10.7% (up to 28.3%), and 17% (up to 65.1%), respectively.
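A hash-table accumulator of the kind the abstract describes can be sketched in a few lines; the packed 64-bit output key and two-operand matrix layout are simplifications for illustration, not Sparta's actual data structures.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Sparse contraction C(i,k) += A(i,j) * B(j,k): the accumulator lives in
// a hash table keyed by the packed output index, so the (unknown) output
// size never has to be preallocated.
struct NZ { uint32_t r, c; double v; };   // one nonzero: (row, col, value)

int main() {
    std::vector<NZ> A = {{0, 1, 2.0}, {1, 1, 3.0}, {1, 2, 1.0}};
    std::vector<NZ> B = {{1, 0, 4.0}, {2, 0, 5.0}, {2, 3, 6.0}};

    // Index B by its contraction coordinate j for O(1) probing.
    std::unordered_multimap<uint32_t, NZ> bByJ;
    for (const auto& b : B) bByJ.insert({b.r, b});

    std::unordered_map<uint64_t, double> acc;   // packed (i,k) -> value
    for (const auto& a : A) {
        auto range = bByJ.equal_range(a.c);     // all B nonzeros with j == a.c
        for (auto it = range.first; it != range.second; ++it) {
            uint64_t key = (uint64_t(a.r) << 32) | it->second.c;
            acc[key] += a.v * it->second.v;     // accumulate partial product
        }
    }
    for (const auto& [key, v] : acc)
        std::printf("C(%u,%u) = %g\n", unsigned(key >> 32),
                    unsigned(key & 0xffffffff), v);
}
```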
Speaker bio: Jiawen Liu is a fourth-year PhD candidate at the University of California, Merced, supervised by Prof. Dong Li. His research interests lie at the intersection of systems, machine learning, and high-performance computing.
Chair: Hung-Wei Tseng
Cross-Failure Bug Detection in Persistent Memory Programs
Sihang Liu (University of Virginia); Korakit Seemakhupt (University of Virginia); Yizhou Wei (University of Virginia); Thomas Wenisch (University of Michigan); Aasheesh Kolli (Pennsylvania State University); Samira Khan (University of Virginia)
Speaker: Sihang Liu, University of Virginia
Abstract: Persistent memory (PM) technologies, such as Intel’s Optane memory, deliver high performance, byte-addressability, and persistence, allowing programs to directly manipulate persistent data in memory without any OS intermediaries. An important requirement of these programs is that persistent data must remain consistent across a failure, which we refer to as the crash consistency guarantee. However, maintaining crash consistency is not trivial. We identify that a consistent recovery critically depends not only on the execution before the failure, but also on the recovery and resumption after failure. We refer to these stages as the pre- and post-failure execution stages. In order to holistically detect crash consistency bugs, we categorize the underlying causes behind inconsistent recovery due to incorrect interactions between the pre- and post-failure execution. First, a program is not crash-consistent if the post-failure stage reads from locations that are not guaranteed to be persisted in all possible access interleavings during the pre-failure stage — a type of programming error that leads to a race that we refer to as a cross-failure race. Second, a program is not crash-consistent if the post-failure stage reads persistent data that has been left semantically inconsistent during the pre-failure stage, such as a stale log or uncommitted data. We refer to this type of bug as a cross-failure semantic bug. Together, they form the cross-failure bugs in PM programs. In this work, we provide XFDetector, a tool that detects cross-failure bugs by automatically injecting failures into the pre-failure execution, and checking for cross-failure races and semantic bugs in the post-failure continuation. XFDetector has detected four new bugs in three pieces of PM software: one of PMDK’s examples, a PM-optimized Redis database, and a PMDK library function.
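A distilled example of a cross-failure race may help. The sketch below uses PMDK's libpmem primitives (pmem_map_file, pmem_persist); the Record layout and file path are hypothetical.

```cpp
#include <cstring>
#include <libpmem.h>  // PMDK's libpmem: pmem_map_file, pmem_persist, pmem_unmap

// Post-failure recovery trusts `valid`, so `data` must become durable
// strictly before the flag does. Dropping the first pmem_persist lets a
// crash expose a valid-looking flag over unpersisted data -- exactly the
// cross-failure race described in the abstract.
struct Record {
    char data[64];
    bool valid;
};

void durable_update(Record* r, const char* src, size_t n) {
    std::memcpy(r->data, src, n);
    pmem_persist(r->data, n);                  // data reaches PM first...
    r->valid = true;
    pmem_persist(&r->valid, sizeof r->valid);  // ...then the commit flag
}

int main() {
    size_t len; int is_pmem;
    // Maps a file as persistent memory; a real run wants a DAX mount.
    auto* r = static_cast<Record*>(pmem_map_file(
        "/tmp/rec.pm", sizeof(Record), PMEM_FILE_CREATE, 0666, &len, &is_pmem));
    if (r != nullptr && is_pmem) durable_update(r, "hello", 6);
    if (r != nullptr) pmem_unmap(r, len);
}
```

A crash between the two persists leaves valid == false, so recovery safely ignores the half-written record; omitting the first persist reintroduces the race.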
Speaker bio: Sihang Liu is a 5th-year Ph.D. student at the University of Virginia, advised by Professor Samira Khan. Before pursuing his doctoral degree, he obtained Bachelor's degrees from both the University of Michigan and Shanghai Jiao Tong University. His primary research interest lies in the software and hardware co-design of persistent memory systems. On the hardware side, his research aims to optimize performance and guarantee crash consistency for practical persistent memory systems integrated with both storage and memory support. On the software side, he works on testing the crash consistency guarantees of PM-based programs. His work has produced several open-source tools and detected real-world bugs in well-known persistent memory software systems. He has published at top conferences in computer architecture, including ISCA, ASPLOS, and HPCA, and has served as a reviewer for ToS, TCAD, ASPLOS AE, OSDI AE, and EuroSys AE.
Towards Bug-free Persistent Memory Applications
Ian Neal (University of Michigan); Andrew Quinn (University of Michigan); Baris Kasikci (University of Michigan)
Speaker: Ian Neal, University of Michigan
Abstract: Persistent Memory (PM) aims to revolutionize the storage-memory hierarchy, but programming these systems is error-prone. Our work investigates how to help developers write better, bug-free PM applications by automatically debugging them. We first perform a study of bugs in persistent memory applications to identify the opportunities and pain points of debugging these systems. Then, we discuss our work on AGAMOTTO, a generic and extensible system for automatically detecting PM bugs. Unlike existing tools that rely on extensive test cases or annotations, AGAMOTTO automatically detects bugs in PM systems by extending symbolic execution to model persistent memory. AGAMOTTO has so far identified 84 new bugs in 5 different PM applications and frameworks while incurring no false positives. We then discuss HIPPOCRATES, a system that automatically fixes bugs in PM systems. HIPPOCRATES “does no harm”: its fixes are guaranteed to fix a PM bug without introducing new bugs. We show that HIPPOCRATES produces fixes that are functionally equivalent to developer fixes and that its fixes have performance that rivals manually developed code.
Speaker bio: Ian Neal is a PhD candidate at the University of Michigan, advised by Baris Kasikci. His current research focus is the development of efficient and reliable systems using emerging persistent main memory technologies. He is also interested in developing verifiably secure hardware systems and tools that allow for easier development of secure systems.
Corundum: Statically-Enforced Persistent Memory Safety
Morteza Hoseinzadeh (UC San Diego); Steven Swanson (UC San Diego)
Speaker: Morteza Hoseinzadeh, University of California, San Diego
Abstract: Fast, byte-addressable, persistent main memories (PM) make it possible to build complex data structures that can survive system failures. Programming for PM is challenging, not least because it combines well-known programming challenges like locking, memory management, and pointer safety with novel PM-specific bug types. It also requires logging updates to PM to facilitate recovery after a crash. A misstep in any of these areas can corrupt data, leak resources, or prevent successful recovery after a crash. Existing PM libraries in a variety of languages (C, C++, Python, Java) simplify some of these areas, but they still require the programmer to learn (and flawlessly apply) complex rules to ensure correctness. Opportunities for data-destroying bugs abound. This paper presents Corundum, a Rust-based library with an idiomatic PM programming interface that leverages Rust's type system to statically avoid the most common PM programming bugs. Corundum lets programmers develop persistent data structures using familiar Rust constructs and have confidence that they are free of many types of bugs. We have implemented Corundum and found its performance to be as good as or better than Intel's widely used PMDK library.
Speaker bio: Morteza Hoseinzadeh is a Ph.D. candidate in the NVSL Lab at the CSE Department, UC San Diego. He has been working on building software toolchains that aim to ease persistent memory programming with confidence.
Fast, Flexible and Comprehensive Bug Detection for Persistent Memory Programs
Bang Di (Hunan University); Jiawen Liu (University of California, Merced); Hao Chen (Hunan University); Dong Li (University of California, Merced)
Speaker: Bang Di, Hunan University
Abstract: Debugging PM programs faces a fundamental tradeoff between performance overhead and bug coverage (comprehensiveness): large performance overhead or limited bug coverage makes debugging infeasible or ineffective. In this paper, we propose PMDebugger, a debugger that detects crash consistency bugs. Unlike prior work, PMDebugger is fast, flexible, and comprehensive. Its design is driven by a characterization of how three fundamental operations (store, cache writeback, and fence) typically happen in PM programs, and it uses a hierarchical design composed of PM debugging-specific data structures, operations, and bug-detection algorithms (rules). We generalize nine rules to detect crash-consistency bugs for various PM persistency models. Compared with a state-of-the-art detector (XFDetector) and an industry-quality detector (Pmemcheck), PMDebugger achieves 49.3x and 3.4x speedup on average. Compared with another state-of-the-art detector (PMTest) optimized for high performance, PMDebugger achieves comparable performance without relying heavily on programmer annotations, and detects 38 more bugs than PMTest on ten applications. PMDebugger also identifies more bugs than XFDetector, Pmemcheck, and PMTest, including 19 new bugs in a real application (memcached) and two new bugs in Intel PMDK.
Speaker bio: I am a 4th-year Ph.D. student in the College of Computer Science and Electronic Engineering at Hunan University. My research focuses on computer architecture and operating systems, specifically debugging for persistent memory (PM) and GPUs.
Tracking in Order to Recover - Detectable Recovery of Lock-Free Data Structures
Hagit Attiya (Technion); Ohad Ben-Baruch (Ben-Gurion University); Panagiota Fatourou (FORTH ICS and University of Crete, Greece); Danny Hendler (Ben-Gurion University); Eleftherios Kosmas (University of Crete, Greece)
Speaker: Ohad Ben-Baruch, Ben-Gurion University
Abstract: This paper presents the tracking approach for deriving detectably recoverable (and thus also durable) implementations of many widely used concurrent data structures. Data structures satisfying detectable recovery are appealing for emerging systems featuring byte-addressable non-volatile main memory (NVRAM), whose persistence allows failed processes to be resurrected efficiently after crashes. Their implementation is important because they are building blocks for the construction of simple, well-structured, sound, and error-resistant multiprocessor systems. For instance, in many big-data applications, shared in-memory tree-based data indices are created for fast data retrieval and useful data analytics.
Speaker bio: Ohad Ben-Baruch completed his PhD in computer science at Ben-Gurion University under the supervision of Prof. Danny Hendler and Prof. Hagit Attiya. His research focuses on shared memory and concurrent computation, and more specifically on complexity bounds for concurrent objects in the crash-stop and crash-recovery system models. His PhD dissertation proposed Nesting-Safe Recoverable Linearizability (NRL), a novel model and correctness condition for the crash-recovery shared memory model that allows for nesting of recoverable objects, together with lower and upper bounds for object implementations satisfying the condition. Ohad is currently working at BeyondMinds as a researcher and algorithm developer.
Building Fast Recoverable Persistent Data Structures
Haosen Wen (University of Rochester); Wentao Cai (University of Rochester); Mingzhe Du (University of Rochester); Louis Jenkins (University of Rochester); Benjamin Valpey (University of Rochester); Michael L. Scott (University of Rochester)
Speaker: Haosen Wen, University of Rochester
Abstract: The recent emergence of fast, dense, nonvolatile main memory suggests that certain long-lived data might remain in its natural pointer-rich format across program runs and hardware reboots. Operations on such data must be instrumented with explicit write-back and fence instructions to ensure consistency in the wake of a crash, and techniques to minimize the cost of this instrumentation are an active topic of research. We present what we believe to be the first general-purpose approach to building buffered durably linearizable persistent data structures, and a system, Montage, to support that approach. Montage is built on top of the Ralloc nonblocking persistent allocator. It employs a slow-ticking epoch clock and ensures that no operation appears to span an epoch boundary. It also arranges to persist only the data minimally required to reconstruct the structure after a crash. If a crash occurs in epoch e, all work performed in epochs e and e-1 is lost, but work from prior epochs is preserved.
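A toy sketch of the epoch discipline described above (the buffering structure and tick policy are assumptions, not Montage's implementation): operations are tagged with the epoch in which they complete, and their persists trail the clock by two epochs.

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Operations record the epoch in which they complete; writebacks happen
// lazily once the slow clock has moved two epochs past them, so a crash
// in epoch e loses only epochs e and e-1. The persist step is stubbed,
// and a real system would also drop entries after persisting them.
std::atomic<uint64_t> epoch{2};
std::vector<std::pair<uint64_t, int>> buffered;   // (epoch tag, payload)

void apply(int payload) {            // an operation completes "in" an epoch
    buffered.push_back({epoch.load(), payload});
}

void tick() {                        // slow-ticking background clock
    uint64_t e = epoch.fetch_add(1); // epoch e ends, e+1 begins
    for (auto& [tag, payload] : buffered)
        if (tag < e)                 // i.e., at least two epochs behind now
            std::printf("persist op %d from epoch %lu\n",
                        payload, (unsigned long)tag);
}

int main() {
    apply(7);   // tagged with epoch 2
    tick();     // nothing old enough to persist yet
    apply(8);   // tagged with epoch 3
    tick();     // now the epoch-2 op (7) is persisted
}
```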
Speaker bio: Haosen Wen is a senior Ph.D. candidate at the University of Rochester. His research interests include storage models and applications for non-volatile byte-addressable memories, nonblocking data structures and their memory management, and in-memory database systems.
Chair: Changhee Jung (Purdue)
ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory
Kai Wu (University of California, Merced); Jie Ren (University of California, Merced); Ivy Peng (Lawrence Livermore National Laboratory); Dong Li (University of California, Merced)
Speaker: Kai Wu, University of California Merced
Abstract: Failure-atomic transactions are a critical mechanism for accessing and manipulating data on persistent memory (PM) with crash consistency. We identify that small random writes in metadata modifications and locality-oblivious memory allocation in traditional PM transaction systems mismatch PM architecture. We present ArchTM, a PM transaction system based on two design principles: avoiding small writes and encouraging sequential writes. ArchTM is a variant of a copy-on-write (CoW) system that reduces write traffic to PM. Unlike conventional CoW schemes, ArchTM reduces metadata modifications through a scalable lookup table on DRAM. ArchTM also introduces an annotation mechanism to ensure crash consistency and a locality-aware data path in memory allocation to increase coalescable writes inside PM devices. We evaluate ArchTM against four state-of-the-art transaction systems (one in PMDK, Romulus, DudeTM, and one from Oracle). ArchTM outperforms the competitor systems by 58x, 5x, 3x, and 7x on average, using micro-benchmarks and real-world workloads on real PM.
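The out-of-place flavor of this design can be sketched as follows; the bump allocator, object table, and persist() stub are simplified assumptions, not ArchTM's actual structures.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>
#include <vector>

// Copy-on-write update in this general spirit: writes go to a freshly,
// sequentially allocated PM block, and a DRAM-resident lookup table maps
// object id -> current block, so commit is a pointer swap in DRAM and PM
// only ever sees large sequential writes and no small metadata updates.
constexpr size_t kBlock = 256;

struct PMHeap {                  // bump allocator over a mapped PM region
    char* base;
    std::atomic<size_t> next{0};
    char* alloc() { return base + next.fetch_add(kBlock); }
};

void persist(const void*, size_t) { /* clwb + sfence on real PM hardware */ }

struct ObjectTable {             // DRAM-side indirection, rebuilt on recovery
    std::vector<std::atomic<char*>> slot;
    explicit ObjectTable(size_t n) : slot(n) {}
};

void cow_update(PMHeap& heap, ObjectTable& tbl, size_t id,
                const void* data, size_t n) {
    char* fresh = heap.alloc();  // out-of-place: new block, sequential
    std::memcpy(fresh, data, n);
    persist(fresh, n);           // durable before it becomes visible
    tbl.slot[id].store(fresh, std::memory_order_release);  // DRAM-only swap
}

int main() {
    static char region[16 * kBlock];   // stand-in for a mapped PM region
    PMHeap heap{region};
    ObjectTable tbl(8);
    const char msg[] = "new version";
    cow_update(heap, tbl, 3, msg, sizeof msg);
    return tbl.slot[3].load() == nullptr;  // 0 on success
}
```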
Speaker bio: Kai is a Ph.D. candidate at the University of California, Merced. Before coming to UC Merced, he earned his master's degree in Computer Science and Engineering from Michigan State University in 2016. His research areas are computer systems, heterogeneous computing, and high-performance computing (HPC). He designs high-performance, large-scale computer systems with hardware heterogeneity; his recent work focuses on system support for persistent memory-based big memory platforms.
PMEM-Spec: Persistent Memory Speculation (Strict Persistency Can Trump Relaxed Persistency)
Jungi Jeong (Purdue University); Changhee Jung (Purdue University)
Speaker: Jungi Jeong, Purdue University
Abstract: Persistency models define the persist order, which controls the order in which stores update persistent memory (PM). As with memory consistency, relaxed persistency models provide better performance than strict ones by relaxing the ordering constraints. To support such relaxed persistency models, previous studies resort to APIs for annotating the persist order in programs and hardware implementations for enforcing the programmer-specified order. However, this approach to relaxed persistency support imposes costly burdens on both architects and programmers. The goal of this study is to demonstrate that the strict persistency model can outperform relaxed models with far less hardware complexity and programming difficulty. To achieve this, this paper presents PMEM-Spec, which speculatively allows any PM accesses without stalling or buffering, detecting ordering violations (i.e., misspeculation) for PM loads and stores. PMEM-Spec treats misspeculation as a power failure and thus leverages failure-atomic transactions to recover from misspeculation by aborting and restarting them purposely. Since ordering violations rarely occur, PMEM-Spec can accelerate persistent memory accesses without significant misspeculation penalty. Experimental results show that PMEM-Spec outperforms epoch-based persistency models with Intel x86 ISAs and the state-of-the-art hardware support by 27.2% and 10.6%, respectively.
Speaker bio: Postdoctoral Research Associate at Purdue University.
Efficient Hardware-Assisted Out-of-Place Update for Non-Volatile Memory
Miao Cai (Nanjing University); Chance C. Coats (University of Illinois at Urbana-Champaign); Jeonghyun Woo (University of Illinois at Urbana-Champaign); Jian Huang (University of Illinois at Urbana-Champaign)
Speaker: Jian Huang, University of Illinois at Urbana-Champaign
Abstract: Byte-addressable non-volatile memory (NVM) is a promising technology that provides near-DRAM performance with scalable memory capacity, but it requires atomic data durability to ensure memory persistency. Many techniques, including logging and shadow paging, have been proposed for this purpose; however, most of them either introduce extra write traffic to NVM or suffer from significant performance overhead on the critical path of program execution, or both. In this paper, we propose a transparent and efficient hardware-assisted out-of-place update (Hoop) mechanism that supports atomic data durability without incurring many extra writes or much performance overhead. The key idea is to write updated data to a new place in NVM while retaining the old data until the updated data becomes durable. To support this, we develop a lightweight indirection layer in the memory controller that enables efficient address translation and adaptive garbage collection for NVM. We evaluate Hoop with a variety of popular data structures and data-intensive applications, including key-value stores and databases. Our evaluation shows that Hoop achieves low critical-path latency with small write amplification, close to that of a native system without persistence support. Compared with state-of-the-art crash-consistency techniques, it improves application performance by up to 1.7x while reducing write amplification by up to 2.1x. Hoop also demonstrates scalable data recovery capability on multi-core systems.
Speaker bio: Jian Huang is an Assistant Professor in the ECE department at the University of Illinois at Urbana-Champaign. He received his Ph.D. in Computer Science from Georgia Tech in 2017. His research interests include computer systems, systems architecture, systems security, distributed systems, and especially their intersections. He enjoys building systems. His research contributions have been published at top-tier architecture, systems, and security conferences. His work received a Best Paper Award at USENIX ATC and an IEEE Micro Top Picks selection, and he has received the NetApp Faculty Fellowship Award and a Google Faculty Research Award.
TSOPER: Efficient Coherence-Based Strict Persistency
Per Ekemark (Uppsala University, Sweden); Yuan Yao (Uppsala University, Sweden); Alberto Ros (Universidad de Murcia, Spain); Konstantinos Sagonas (Uppsala University, Sweden and National Technical Univ. of Athens, Greece); Stefanos Kaxiras (Uppsala University, Sweden)
Speaker: Per Ekemark, Uppsala University, Sweden
Abstract: We propose a novel approach to hardware-based strict TSO persistency, called TSOPER. We allow a TSO persistency model to freely coalesce values in the caches by forming atomic groups of cachelines to be persisted. A group persist is initiated for an atomic group if any of its newly written values are exposed to the outside world. A key difference from prior work is that our architecture is based on the concept of a TSO persist buffer that sits in parallel to the shared LLC and persists atomic groups directly from private caches to NVM, bypassing the coherence serialization of the LLC. To impose dependencies among atomic groups that are persisted from the private caches to the TSO persist buffer, we introduce a sharing-list coherence protocol that naturally captures the order of coherence operations in its sharing lists and can thus reconstruct the dependencies among different atomic groups entirely at the private cache level, without involving the shared LLC. The combination of the sharing-list coherence and the TSO persist buffer allows persist operations and writes to non-volatile memory to happen in the background and trail the coherence operations: coherence runs ahead at full speed; persistency follows belatedly. Our evaluation shows that TSOPER provides the same level of reordering as a program-driven relaxed model, hence approximately the same level of performance, albeit without needing the programmer or compiler to be concerned about false sharing, data-race-free semantics, etc., and while guaranteeing that all software that can run on top of TSO automatically persists in TSO.
Speaker bio: Currently working toward efficient and productive use of persistent memory in both single and distributed systems. Previously explored compiler optimisations involving software prefetching and data-race-freedom. I prefer my bows with either strings or arrows.
Characterizing non-volatile memory transactional systems
Pradeep Fernando (Georgia Tech); Irina Calciu (VMware); Jayneel Gandhi (VMware); Aasheesh Kolli (Penn State); Ada Gavrilovska (Georgia Tech)
Speaker: Pradeep Fernando, Georgia Institute of Technology
Abstract: Emerging non-volatile memory (NVM) technologies promise memory-speed, byte-addressable persistent storage with a load/store interface. However, programming applications to directly manipulate NVM data is complex and error-prone. Applications generally employ libraries that hide the low-level details of the hardware and provide a transactional programming model to achieve crash consistency. Furthermore, applications continue to expect correctness during concurrent executions, achieved through the use of synchronization. To this end, applications seek well-known ACID guarantees. However, realizing this presents designers of transactional systems with a range of choices in how to combine several low-level techniques, given target hardware features and workload characteristics. This presentation will discuss the tradeoffs associated with these choices and present detailed experimental analysis performed across a range of single- and multi-threaded workloads using a simulation environment and real PMEM hardware.
Chair: Jishen Zhao
Building Scalable Dynamic Hash Tables on Persistent Memory
Baotong Lu (The Chinese University of Hong Kong); Xiangpeng Hao (Simon Fraser University); Tianzheng Wang (Simon Fraser University); Eric Lo (The Chinese University of Hong Kong)
Speaker: Baotong Lu, The Chinese University of Hong Kong
Abstract: Byte-addressable persistent memory (PM) brings hash tables the potential of low latency, cheap persistence, and instant recovery. The recent advent of Intel Optane DC Persistent Memory Modules (DCPMM) further accelerates this trend. Many new hash table designs have been proposed, but most of them were based on emulation and perform sub-optimally on real PM. They were also piece-wise, partial solutions that side-step many important properties, in particular good scalability, high load factor, and instant recovery. We present Dash, a holistic approach to building dynamic and scalable hash tables on real PM hardware with all of the aforementioned properties. Based on Dash, we adapted two popular dynamic hashing schemes (extendible hashing and linear hashing). On a 24-core machine with Intel Optane DCPMM, we show that, compared to the state-of-the-art, Dash-enabled hash tables achieve up to ~3.9x higher performance with up to over 90% load factor and an instant recovery time of 57 ms regardless of data size.
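A condensed, volatile-only sketch of extendible hashing with in-bucket fingerprints, one ingredient of Dash-style designs (the bucket geometry is illustrative, and all persistence and flushing is omitted):

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// A directory of 2^depth buckets; each bucket keeps 1-byte fingerprints
// so a lookup scans a compact fingerprint array before any full key
// compare, cutting reads to the (slow) medium holding keys and values.
struct Bucket {
    static constexpr int kSlots = 4;
    uint8_t fp[kSlots] = {};
    uint64_t key[kSlots] = {};
    uint64_t val[kSlots] = {};
    int used = 0;
};

struct Table {
    int depth;
    std::vector<Bucket> dir;
    explicit Table(int d = 2) : depth(d), dir(size_t(1) << d) {}

    static uint64_t h(uint64_t k) { return std::hash<uint64_t>{}(k); }

    bool get(uint64_t k, uint64_t& out) {
        uint64_t hv = h(k);
        Bucket& b = dir[hv & ((uint64_t(1) << depth) - 1)];  // low bits index
        uint8_t f = uint8_t(hv >> 56);                       // high byte = fp
        for (int i = 0; i < b.used; ++i)
            if (b.fp[i] == f && b.key[i] == k) { out = b.val[i]; return true; }
        return false;
    }

    bool put(uint64_t k, uint64_t v) {
        uint64_t hv = h(k);
        Bucket& b = dir[hv & ((uint64_t(1) << depth) - 1)];
        if (b.used == Bucket::kSlots) return false;  // a real table would split
        b.fp[b.used] = uint8_t(hv >> 56);
        b.key[b.used] = k;
        b.val[b.used] = v;
        ++b.used;
        return true;
    }
};

int main() {
    Table t;
    t.put(42, 4200);
    uint64_t v = 0;
    if (t.get(42, v)) std::printf("42 -> %llu\n", (unsigned long long)v);
}
```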
Speaker bio: Baotong Lu is a Ph.D. candidate in the Department of Computer Science and Engineering, The Chinese University of Hong Kong (advisor: Prof. Eric Lo). He is also a visiting Ph.D. student in the Data Science Research Group, Simon Fraser University (host advisor: Prof. Tianzheng Wang). His research interest lies in data management systems, specifically next-generation database systems on persistent memory and multicore processors. He is a recipient of the 2021 ACM SIGMOD Research Highlight Award.
Performance Prediction of Graph Analytics on Persistent Memory
Diego Braga (UFBA); Daniel Mosse (PITT); Vinicius Petrucci (University of Pittsburgh & UFBA)
Speaker: Diego Moura, Federal University of Bahia
Abstract: Considering a system with heterogeneous memory (DRAM and PMEM, in this case Intel Optane), the problem we address is deciding which application to allocate to each type of memory. We built a model that estimates the impact of running an application on Intel Optane using performance counters from previous runs on DRAM. From this model we derive an offline application placement scheme for heterogeneous memories. Our results show that judicious allocation can yield average reductions of 22% and 120% in the makespan and degradation metrics, respectively.
Speaker bio: PhD student Diego Braga started by doing research on heterogeneous processors using Arm big.LITTLE boards. In 2020, after receiving a grant for access to an Intel server with Optane technology, he "migrated" his research to heterogeneous memory. Currently, he works on memory allocation for entire applications as well as investigating which characteristics have the largest impact on performance at the object level.
Disaggregating Persistent Memory and Controlling Them Remotely: An Exploration of Passive Disaggregated Key-Value Stores
Shin-Yeh Tsai (Facebook); Yizhou Shan (University of California, San Diego); Yiying Zhang (University of California, San Diego)
Speaker: Yizhou Shan, UCSD
Abstract: Many datacenters and clouds manage storage systems separately from computing services for better manageability and resource utilization. These existing disaggregated storage systems use hard disks or SSDs as storage media. Recently, the technology of persistent memory (PM) has matured and seen initial adoption in several datacenters. Disaggregating PM could deliver the same benefits as traditional disaggregated storage systems, but it requires new designs because of PM's memory-like performance and byte addressability. In this paper, we explore the design of disaggregating PM and managing it remotely from compute servers, a model we call passive disaggregated persistent memory, or pDPM. Compared to the alternative of managing PM at storage servers, pDPM significantly lowers monetary and energy costs and avoids scalability bottlenecks at storage servers. We built three key-value store systems using the pDPM model. The first lets all compute nodes directly access and manage storage nodes. The second uses a central coordinator to orchestrate the communication between compute and storage nodes. These two systems have various performance and scalability limitations. To solve these problems, we built Clover, a pDPM system that separates the location, communication mechanism, and management strategy of the data plane and the metadata/control plane.
Speaker bio: Yizhou Shan is a Ph.D. student at UCSD, advised by Prof. Yiying Zhang. His research focuses on distributed systems, persistent memory, and operating systems.
Making Volatile Index Structures Persistent using TIPS
R. Madhava Krishnan (Virginia Tech); Wook-Hee Kim (Virginia Tech); Hee Won Lee (Consultant); Minsung Jang (Perspecta Labs); Sumit Kumar Monga (Virginia Tech); Ajit Mathew (Virginia Tech); Changwoo Min (Virginia Tech)
Speaker: R. Madhava Krishnan, Virginia Tech
Abstract: We propose TIPS, a generic framework that systematically converts volatile indexes to their persistent counterparts. Any volatile index can be plugged into the TIPS framework and become persistent with only minimal source code changes. TIPS neither places restrictions on the concurrency model nor requires in-depth knowledge of the volatile index. TIPS supports a strong consistency guarantee, i.e., durable linearizability, and internally handles persistent memory leaks across crashes. TIPS relies on a novel DRAM-NVMM tiering to achieve high performance and good scalability. It uses a hybrid logging technique called UNO logging to minimize the crash consistency overhead. We converted seven different indexes and a real-world key-value store application using TIPS and evaluated them using YCSB workloads. TIPS-enabled indexes significantly outperform the state-of-the-art persistent indexes, in addition to offering many other benefits that existing persistent indexes do not provide.
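A minimal sketch of the plug-in idea under stated assumptions, not the TIPS code: the unmodified volatile index stays in DRAM, and every mutating operation first records durable intent in a persistent operation log. The logging call is a hypothetical name; TIPS's actual API, UNO logging, and background replay are more involved.

```cpp
// Facade over an arbitrary volatile index: reads hit DRAM only; writes are
// made durably linearizable by logging to NVMM before the DRAM update.
#include <map>
#include <string>

enum class Op { Insert, Remove };

// Hypothetical stub: persists one log entry to NVMM and flushes it.
void tips_log_append(Op op, const std::string& k, const std::string& v);

template <typename VolatileIndex>  // e.g., std::map<std::string, std::string>
class TipsFacade {
    VolatileIndex dram_index_;     // the unmodified volatile index
public:
    bool insert(const std::string& k, const std::string& v) {
        tips_log_append(Op::Insert, k, v);        // durable intent first
        return dram_index_.emplace(k, v).second;  // then the plain DRAM update
    }
    const std::string* get(const std::string& k) const {
        auto it = dram_index_.find(k);            // served entirely from DRAM
        return it == dram_index_.end() ? nullptr : &it->second;
    }
};
```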
Speaker bio: Madhava Krishnan is a third-year Ph.D. student at Virginia Tech working under Dr. Changwoo Min. His research interests include operating systems, storage systems, and concurrency and scalability. He currently works on developing system software for emerging persistent memory, focusing on persistent transactions and index structures.
HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Persistent Memory-based Heterogeneous Memory
Jie Ren (University of California, Merced); Minjia Zhang (Microsoft Research); Dong Li (University of California, Merced)
Speaker: Jie Ren, University of California, Merced
Abstract: State-of-the-art approximate nearest neighbor search (ANNS) algorithms face a fundamental tradeoff between query latency and accuracy because of limited main memory capacity: to store indices in main memory for fast query response, they have to limit the number of data points or store compressed vectors, which hurts search accuracy. The emergence of heterogeneous memory (HM) brings opportunities to largely increase memory capacity and break this tradeoff: using HM, billions of data points can be placed in main memory on a single machine without any data compression. However, HM consists of both fast (but small) memory and slow (but large) memory, and using HM inappropriately slows down query time significantly. In this work, we present a novel graph-based similarity search algorithm called HM-ANN, which takes both memory and data heterogeneity into consideration and enables billion-scale similarity search on a single node without using compression.
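The heterogeneity-aware idea can be sketched as a layered greedy search: small upper layers of a hierarchical proximity graph live in fast memory and only route the query, while the large bottom layer lives in slow memory (PMEM) and is touched once for refinement. This is an illustration of the principle, not the HM-ANN source; the callbacks are placeholders.

```cpp
#include <cstdint>
#include <vector>

struct Layer {
    // Placeholder callbacks: neighbors(v) lists adjacent vertex ids;
    // distance(q, v) compares the query against v's stored vector.
    std::vector<uint32_t> (*neighbors)(uint32_t v);
    float (*distance)(const float* q, uint32_t v);
};

// Greedy walk: move to the closest neighbor until no neighbor improves.
uint32_t greedy_descend(const Layer& layer, const float* q, uint32_t entry) {
    uint32_t cur = entry;
    for (bool improved = true; improved;) {
        improved = false;
        for (uint32_t nb : layer.neighbors(cur))
            if (layer.distance(q, nb) < layer.distance(q, cur)) {
                cur = nb;
                improved = true;
            }
    }
    return cur;
}

uint32_t search(const std::vector<Layer>& dram_layers, const Layer& pmem_layer,
                const float* q, uint32_t top_entry) {
    uint32_t entry = top_entry;
    for (const Layer& l : dram_layers)            // cheap routing in DRAM
        entry = greedy_descend(l, q, entry);
    return greedy_descend(pmem_layer, q, entry);  // one pass over PMEM data
}
```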
Speaker bio: Jie Ren is a Ph.D. student at the University of California, Merced. Her research focuses on high-performance computing, especially memory management on persistent memory-based heterogeneous memory.
Chair: Steven Swanson
PMTest: A Fast and Flexible Testing Framework for Persistent Memory Programs
Sihang Liu (University of Virginia); Yizhou Wei (University of Virginia); Jishen Zhao (UC San Diego); Aasheesh Kolli (Penn State University & VMware Research); Samira Khan (University of Virginia)
Speaker: Sihang Liu, University of Virginia
Abstract: Recent non-volatile memory technologies such as 3D XPoint and NVDIMMs have enabled persistent memory (PM) systems that can manipulate persistent data directly in memory. This advancement in memory technology has spurred the development of a new set of crash-consistent software (CCS) for PM: applications that can recover persistent data from memory in a consistent state in the event of a crash (e.g., power failure). CCS developed for persistent memory ranges from kernel modules to user-space libraries and custom applications. However, ensuring crash consistency in CCS is difficult and error-prone. Programmers typically employ low-level hardware primitives or transactional libraries to enforce the ordering and durability guarantees required for crash consistency. Unfortunately, hardware can reorder instructions at runtime, making it difficult for programmers to test whether the implementation enforces the correct ordering and durability guarantees. We believe that there is an urgent need for a testing framework that helps programmers identify crash consistency bugs in their CCS. We find that prior testing tools lack generality, i.e., they work only for one specific CCS or memory persistency model and/or introduce significant performance overhead. To overcome these drawbacks, we propose PMTest, a crash consistency testing framework that is both flexible and fast. PMTest provides flexibility through two basic assertion-like software checkers that test the two fundamental guarantees of all CCS: ordering and durability. These checkers can also serve as the building blocks of other application-specific, high-level checkers. PMTest enables fast testing by deducing the persist order without exhausting all possible orders. In an evaluation with eight programs, PMTest not only identified 45 synthetic crash consistency bugs, but also detected 3 new bugs in a file system (PMFS) and in applications developed using a transactional library (PMDK), while on average being 7.1× faster than the state-of-the-art tool.
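A usage sketch of assertion-like checkers around a persistent update: the two checker names below mirror the durability and ordering primitives described in the abstract, but their exact signatures are our assumption, not PMTest's published API.

```cpp
#include <immintrin.h>  // _mm_clwb (requires CLWB support), _mm_sfence

// Hypothetical checker declarations; the real tool interposes on the program.
void pmtest_is_persisted(const void* addr, size_t size);
void pmtest_is_ordered_before(const void* a, size_t sa,
                              const void* b, size_t sb);

struct Node { int value; Node* next; };

// Durable list prepend: the node must persist before the head points to it.
void persist_append(Node** head, Node* n) {
    n->next = *head;
    _mm_clwb(n);        // write back the node's cache line
    _mm_sfence();       // order node persist before the head update
    pmtest_is_persisted(n, sizeof(*n));  // assert: node is durable here
    *head = n;
    _mm_clwb(head);
    _mm_sfence();
    pmtest_is_ordered_before(n, sizeof(*n), head, sizeof(*head));
}
```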
Speaker bio: Sihang Liu is a 4th-year Ph.D. student at the University of Virginia, advised by Professor Samira Khan. Before pursuing the doctoral degree, he obtained Bachelor's degrees from both the University of Michigan and Shanghai Jiao Tong University. Sihang Liu's primary research interest lies in the software and hardware co-design of persistent memory systems. On the hardware side, his research aims to optimize performance and guarantee crash consistency for practical persistent memory systems that integrate both storage and memory support. On the software side, he works on testing the crash consistency guarantees of PM-based programs. His work has produced several open-source tools and detected real-world bugs in well-known persistent memory software systems. He has published at top conferences in computer architecture, including ISCA, ASPLOS, and HPCA. He has also served as a reviewer for ToS and an artifact reviewer for ASPLOS.
Semi-Asymmetric Parallel Graph Algorithms for NVRAMs
Laxman Dhulipala (Carnegie Mellon University); Charles McGuffey (Carnegie Mellon University); Hongbo Kang (Carnegie Mellon University); Yan Gu (UC Riverside); Guy E. Blelloch (Carnegie Mellon University); Phillip B. Gibbons (Carnegie Mellon University); Julian Shun (Massachusetts Institute of Technology)
Speaker: Laxman Dhulipala, Carnegie Mellon University
Abstract: Emerging non-volatile main memory (NVRAM) technologies provide novel features for large-scale graph analytics, combining byte-addressability, low idle power, and improved memory density. Systems are likely to have an order of magnitude more NVRAM than traditional memory (DRAM), allowing large graph problems to be solved efficiently at modest cost on a single machine. However, a significant challenge in achieving high performance is accounting for the fact that NVRAM writes can be significantly more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics in which the graph is stored as a read-only data structure (in NVRAM) and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM) but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. We present a model, the Parallel Semi-Asymmetric Model (PSAM), to analyze algorithms in this setting, and run experiments on a 48-core NVRAM system to validate the effectiveness of these algorithms. To this end, we study over a dozen graph problems and develop parallel algorithms for each that are efficient, often work-optimal, in the model. Experimentally, we run all of the algorithms on the largest publicly available graph and show that our PSAM algorithms outperform the fastest prior algorithms designed for DRAM or NVRAM. We also show that our algorithms running on NVRAM nearly match the fastest prior algorithms running solely in DRAM, by effectively hiding the cost of repeatedly accessing NVRAM versus DRAM.
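The write-avoidance discipline can be illustrated with a sequential BFS sketch in which the CSR graph is a read-only NVRAM view and all mutable state is O(#vertices) in DRAM. This illustrates the model's constraint, not the authors' parallel implementation.

```cpp
#include <cstdint>
#include <vector>

struct CsrGraph {             // read-only view into NVRAM-resident arrays
    const uint64_t* offsets;  // size n+1
    const uint32_t* edges;    // size m
    uint32_t n;
};

std::vector<uint32_t> bfs(const CsrGraph& g, uint32_t src) {
    const uint32_t NONE = UINT32_MAX;
    std::vector<uint32_t> parent(g.n, NONE);    // DRAM, O(n)
    std::vector<uint32_t> frontier{src}, next;  // DRAM, O(n)
    parent[src] = src;
    while (!frontier.empty()) {
        next.clear();
        for (uint32_t u : frontier)
            for (uint64_t e = g.offsets[u]; e < g.offsets[u + 1]; ++e) {
                uint32_t v = g.edges[e];        // NVRAM is only ever read
                if (parent[v] == NONE) { parent[v] = u; next.push_back(v); }
            }
        frontier.swap(next);                    // all writes stay in DRAM
    }
    return parent;
}
```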
Speaker bio: Laxman is a final-year Ph.D. student at CMU, where he is very fortunate to be advised by Guy Blelloch. He has worked broadly on designing and implementing provably-efficient parallel graph algorithms in different parallel models of computation.
Error-Correcting WOM Codes for Worst-Case and Random Errors
Amit Solomon (Technion); Yuval Cassuto (Technion)
Speaker: Amit Solomon, Technion - Israel Institute of Technology
Abstract: We construct error-correcting WOM (write-once memory) codes that guarantee correction of any specified number of errors in q-level memories. The constructions use suitably designed short q-ary WOM codes and concatenate them with outer error-correcting codes over different alphabets, using suitably designed mappings. In addition to constructions for guaranteed error correction, we develop an error-correcting WOM scheme for random errors using the concept of multi-level coding.
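For intuition, the classic binary WOM code of Rivest and Shamir (1982) shows how rewriting works in cells that can only change 0 to 1; the talk's constructions are q-ary and concatenated with outer error-correcting codes, so this table is only background.

```latex
% Two bits written twice into three write-once cells:
\begin{array}{c|c|c}
\text{data} & \text{1st write} & \text{2nd write (if the state must change)}\\ \hline
00 & 000 & 111\\
01 & 100 & 011\\
10 & 010 & 101\\
11 & 001 & 110
\end{array}
% Every second-column word dominates every weight-\(\le 1\) state, so the
% rewrite never needs a 1 -> 0 transition. Decoding: read a weight-\(\le 1\)
% state with the first column, a weight-\(\ge 2\) state via its complement.
% Rate: 4 bits over two writes in 3 cells, i.e. 4/3 > 1 bit per cell.
```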
Speaker bio: Amit Solomon is a Ph.D. candidate in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, working with Prof. Muriel Médard in the Network Coding and Reliable Communications Group at the Research Laboratory of Electronics. His research interests include coding theory, information theory, and communications. He received the B.Sc. degree (cum laude) and the M.Sc. degree in Electrical Engineering from the Technion - Israel Institute of Technology, in 2015 and 2018, respectively. He has received the Irwin Mark Jacobs and Joan Klein Jacobs Presidential Fellowship.
Efficient Architectures for Generalized Integrated Interleaved Decoder
Xinmiao Zhang (The Ohio State University); Zhenshan Xie (The Ohio State University)
Speaker: Zhenshan Xie, The Ohio State University
Abstract: Generalized integrated interleaved (GII) codes nest short sub-codewords to generate parities shared among the sub-codewords. They allow hyper-speed decoding with excellent correction capability and are essential to next-generation data storage. On the other hand, the hardware implementation of GII decoders faces many challenges, including low achievable clock frequency and large silicon area. This abstract presents novel algorithmic reformulations and architectural transformations to address each bottleneck. For an example GII code that has the same rate and length as eight un-nested (255, 223) Reed-Solomon (RS) codes, our GII decoder has only 30% area overhead compared to the RS decoder, while achieving a frame error rate 7 orders of magnitude lower. Its critical path consists of only 7 XOR gates, and it can easily achieve more than 40 GByte/s throughput.
Speaker bio: Zhenshan Xie received the B.S. degree in information engineering from East China University of Science and Technology, Shanghai, China, in 2014, and the M.S. degree in communications and information systems from the University of Chinese Academy of Sciences, Beijing, China, in 2017. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH.
SOLQC: Synthetic Oligo Library Quality Control Tool
Omer Sabary (Technion – Israel Institute of Technology); Yoav Orlev (Interdisciplinary Center Herzliya); Roy Shafir (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya); Leon Anavy (Technion – Israel Institute of Technology); Eitan Yaakobi (Technion – Israel Institute of Technology); Zohar Yakhini (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya)
Speaker: Omer Sabary, University of California, San Diego
Abstract: DNA-based storage has attracted significant attention due to recent demonstrations of the viability of storing information in macromolecules using synthetic oligo libraries. As DNA storage experiments, as well as other experiments on synthetic oligo libraries, grow in number and complexity, analysis tools can facilitate quality control and help in assessment and inference. We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries based on next-generation sequencing (NGS) analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates, and their dependence on sequence or library properties. SOLQC produces graphical descriptions of the analysis results, reported in a flexible format. We demonstrate SOLQC by analyzing libraries from the literature. We also discuss the potential benefits and relevance of the different components of the analysis.
Speaker bio: Omer Sabary received his B.Sc. in Computer Science from the Technion in 2018. He is currently an M.Sc. student in the Department of Computer Science at the Technion under the supervision of Associate Professor Eitan Yaakobi. His dissertation spans reconstruction algorithms for DNA storage systems and error characterization of the DNA storage channel.
Thread-specific Database Buffer Management in Multi-core NVM Storage Environments
Tsuyoshi Ozawa (Institute of Industrial Science, The University of Tokyo); Yuto Hayamizu (Institute of Industrial Science, The University of Tokyo); Kazuo Goda (Institute of Industrial Science, The University of Tokyo); Masaru Kitsuregawa (Institute of Industrial Science, The University of Tokyo)
Speaker: Tsuyoshi Ozawa, The University of Tokyo
Abstract: Database buffer management is a cornerstone of modern database management systems (DBMS). So far, a *shared buffer* strategy has been widely employed to improve cache efficiency and reduce the IO workload. However, it incurs significant processing overhead from inter-thread synchronization, and thus fails to exploit the potential bandwidth that recent non-volatile memory (NVM) storage devices offer. This paper proposes a *separated buffer* strategy, under which the database buffer manager can achieve significantly higher throughput, even though it may produce an extra amount of IO workload. In recent multi-core NVM storage environments, the separated buffer performs query processing faster than the shared buffer. This paper presents our experimental study with the TPC-H dataset on two different NVM machines, demonstrating that the separated buffer achieves up to 1.47 million IOPS and performs query processing up to 637% faster than the shared buffer.
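The contrast between the two strategies can be sketched as follows; this is a toy illustration with hypothetical types (replacement policy, page lookup, and IO are elided), not the paper's buffer manager.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

struct Frame { char page[4096]; };

// Shared buffer: one pool, every worker serializes on the same latch.
class SharedBufferPool {
    std::vector<Frame> frames_;
    std::mutex latch_;  // contended by all worker threads
public:
    explicit SharedBufferPool(size_t n) : frames_(n) {}
    Frame* fix_page() {
        std::lock_guard<std::mutex> g(latch_);  // inter-thread synchronization
        return &frames_.back();                 // replacement policy elided
    }
};

// Separated buffer: each thread owns a private pool, so fixing a page
// needs no latch; the cost is duplicated cached pages and extra IO.
class SeparatedBufferPool {
public:
    Frame* fix_page() {
        thread_local std::vector<Frame> frames(1024);  // private per thread
        return &frames.back();  // NVM bandwidth, not the latch, is the limit
    }
};
```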
Speaker bio: Tsuyoshi Ozawa is a researcher at the University of Tokyo working on database management systems and storage systems. He is a member of ACM, USENIX, IPSJ (Information Processing Society of Japan), and DBSJ (The Database Society of Japan).
Leveraging Intel Optane for HPC workflows
Ranjan Sarpangala Venkatesh (Georgia Institute of Technology); Tony Mason (University of British Columbia); Pradeep Fernando (Georgia Institute of Technology); Greg Eisenhauer (Georgia Institute of Technology); Ada Gavrilovska (Georgia Institute of Technology)
Speaker: Ranjan Sarpangala Venkatesh, Georgia Institute of Technology
Abstract: High Performance Computing (HPC) workloads demand ever-growing data volumes, which gives rise to data movement challenges. In-situ execution of HPC workflows that couple simulation and analytics applications is a common mechanism for reducing cross-node traffic. Data movement costs can be further reduced by using large-capacity persistent memory, such as Intel's Optane PMEM. Recent work has described best practices for optimizing the use of Optane by tuning based on workload characteristics. However, optimizing one component of an HPC workload does not guarantee optimal performance of the end-to-end workflow. Instead, we propose and evaluate new strategies for optimizing the use of Intel Optane for such HPC workflows.
Speaker bio: Ranjan Sarpangala Venkatesh is a Ph.D. student in the School of Computer Science at the Georgia Institute of Technology, advised by Prof. Ada Gavrilovska. His current research focuses on system support for leveraging persistent memory to make HPC workflows faster. His previous research made Docker container snapshot/restore 10X faster for large-memory applications. Prior to joining Georgia Tech, he worked on storage device drivers at Hewlett Packard Enterprise. He earned an M.S. in Computer Science from the University of California, Santa Cruz.
Durability Through NVM Checkpointing
David Aksun (EPFL); James Larus (EPFL)
Speaker: David Aksun, EPFL
Abstract: Non-Volatile Memory (NVM) is an emerging type of memory that offers fast, byte-addressable persistent storage. One promising use for persistent memory is constructing robust, high-performance internet and cloud services, which often maintain very large in-memory databases and need to recover quickly from faults or failures. The research community has focused on storing these large data structures in NVM, in essence using it as durable RAM. The focus of this paper is to take advantage of existing DRAM to provide better runtime performance: heavily read or written data should reside in DRAM, where it can be accessed at a fraction of the cost, and only modified values should be persisted to NVM. This paper presents CpNvm, a runtime system that uses periodic checkpoints to maintain a recoverable copy of a program's data, with overhead low enough for widespread use. To use CpNvm, a program's developer must insert a library call at the first write to a location in a persistent data structure and make another call when the structure is in a consistent state. Our primary goal is high performance, even at the cost of relaxing the crash-recovery model. CpNvm offers durability for large data structures at low overhead cost. Running on Intel Optane NVM, we achieve overheads of 0-15% on the YCSB benchmarks running with a minimally-modified Masstree, and overheads of 6.5% or less for Memcached.
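The usage model described above suggests code of the following shape. The call names are hypothetical, since the abstract does not give the actual API; this is only a sketch of where the two calls would sit.

```cpp
#include <cstddef>

// Hypothetical CpNvm-style calls (not the published API):
// mark a persistent region as about to be mutated, and declare it
// consistent again so the checkpointer may persist it.
void cpnvm_on_first_write(void* addr, size_t size);
void cpnvm_consistent();

struct Tree { int root_key; /* ... */ };

void update(Tree* t, int k) {
    cpnvm_on_first_write(t, sizeof(*t));  // before the first mutation
    t->root_key = k;                      // DRAM-speed updates in between
    cpnvm_consistent();                   // checkpoint may persist the region
}
```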
Speaker bio: David Aksun is a Ph.D. candidate at EPFL in Lausanne. His research focuses on programming language challenges and performance optimization issues for byte-addressable non-volatile memory. His current research investigates potential optimizations for building durable data structures. He has a B.S. (summa cum laude) in Computer Engineering from Istanbul Technical University.
PMIdioBench: A Benchmark Suite for Understanding the Idiosyncrasies of Persistent Memory
Shashank Gugnani (The Ohio State University); Arjun Kashyap (The Ohio State University); Xiaoyi Lu (The Ohio State University)
Speaker:
Abstract: High-capacity persistent memory (PMEM) is finally commercially available in the form of Intel's Optane DC Persistent Memory Module (DCPMM). Early evaluations of DCPMM show that its behavior is more nuanced and idiosyncratic than previously thought. Several assumptions made about its performance that guided the design of PMEM-enabled systems have been shown to be incorrect. Unfortunately, several peculiar performance characteristics of DCPMM are tied to the memory technology (3D XPoint) used and its internal architecture. It is expected that other technologies (such as STT-RAM, ReRAM, NVDIMM), with highly variable characteristics, will be commercially shipped as PMEM in the near future. Current evaluation studies fail to understand and categorize the idiosyncratic behavior of PMEM, i.e., how the peculiarities of DCPMM relate to other classes of PMEM. Clearly, there is a need for a study that can guide the design of systems and is agnostic to PMEM technology and internal architecture. In this work, we first list and categorize the idiosyncratic behavior of PMEM by performing targeted experiments with our proposed PMIdioBench benchmark suite on a real DCPMM platform. Next, we conduct detailed studies to guide the design of storage systems, considering generic PMEM characteristics. The first study guides data placement on NUMA systems with PMEM, while the second guides the design of lock-free data structures, for both eADR- and ADR-enabled PMEM systems. Our results are often counter-intuitive and highlight the challenges of system design with PMEM.
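One generic distinction the abstract alludes to, eADR versus ADR, determines whether durable stores need explicit cache-line write-back: under ADR the CPU caches sit outside the persistence domain, while under eADR they are flushed on power loss. A minimal sketch of the resulting persist primitive (the compile-time flag is our own convention):

```cpp
#include <immintrin.h>  // _mm_clwb (requires CLWB support), _mm_sfence

inline void persist(void* addr) {
#ifndef PMEM_EADR
    _mm_clwb(addr);  // ADR platform: push the line toward the controller
#endif
    _mm_sfence();    // both platforms: order the store before what follows
}
```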
SuperMem: Enabling Application-transparent Secure Persistent Memory with Low Overheads
Pengfei Zuo (Huazhong University of Science and Technology & University of California, Santa Barbara); Yu Hua (Huazhong University of Science and Technology); Yuan Xie (University of California, Santa Barbara)
Speaker: Pengfei Zuo, Huazhong University of Science and Technology & University of California, Santa Barbara
Abstract: Non-volatile memory (NVM) is vulnerable to physical-access-based attacks due to its non-volatility. To ensure data security in NVM, counter mode encryption is often used, given its high security level and low decryption latency. However, counter mode encryption introduces a new persistence problem for crash consistency guarantees, since both the data and its counter must be persisted atomically. To address this problem, existing work requires a large battery backup or complex modifications to both hardware and software layers due to employing a write-back counter cache. A large battery backup is expensive, and software-layer modifications limit the portability of applications from unencrypted to encrypted NVM. Our paper proposes SuperMem, an application-transparent secure persistent memory that leverages a write-through counter cache to guarantee the atomicity of data and counter writes without the need for a large battery backup or software-layer modifications. To reduce the performance overhead of a baseline write-through counter cache, SuperMem leverages a locality-aware counter write coalescing scheme to reduce the number of write requests by exploiting the spatial locality of counter storage and data writes. Moreover, SuperMem leverages a cross-bank counter storage scheme to efficiently distribute data and counter writes to different banks, thus speeding up writes by exploiting bank parallelism. Experimental results demonstrate that SuperMem improves performance by about 2× compared with an encrypted NVM with a baseline write-through counter cache, and achieves performance comparable to an ideal secure NVM that exhibits the optimal performance of an encrypted NVM.
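For background, the standard counter-mode scheme assumed above works as follows; this is the generic construction, not SuperMem-specific detail.

```latex
% The pad depends only on the line address and a per-line write counter, so
% it can be precomputed and decryption stays off the read critical path:
C \;=\; P \,\oplus\, \mathrm{AES}_K(\mathit{addr} \,\|\, \mathit{ctr}),
\qquad \mathit{ctr} \gets \mathit{ctr} + 1 \ \text{on each write-back}.
% Crash consistency requires C and ctr to persist atomically: with a stale
% counter the pad cannot be regenerated and the line is undecryptable,
% which is the atomicity problem the write-through counter cache targets.
```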
Speaker bio: Pengfei Zuo is a research scientist at Huawei. He received his Ph.D. degree in Computer Science from Huazhong University of Science and Technology (HUST) in 2019, and was a visiting Ph.D. student at the University of California, Santa Barbara (UCSB) during 2018-2019. He received his B.E. degree in Computer Science from HUST in 2014. He has published 30+ refereed papers in major conferences and journals such as OSDI, MICRO, ASPLOS, USENIX ATC, SoCC, and DAC, in the areas of computer systems and architecture, with a focus on non-volatile memory systems, storage systems and techniques, and security. He has served as a PC member for ICDCS'21, ICPADS'20, CloudCom'20, and EuroSys'19 (shadow).
Coding for Resistive Random-Access Memory Channels
Guanghui Song (Singapore University of Technology and Design); Kui Cai (Singapore University of Technology and Design); Xingwei Zhong (Singapore University of Technology and Design); Jiang Yu (Singapore University of Technology and Design); Jun Cheng (Doshisha University)
Speaker: Guanghui Song, Singapore University of Technology and Design
Abstract: We propose channel coding techniques to mitigate both sneak-path interference and channel noise in resistive random-access memory (ReRAM) channels. The main challenge is that the sneak-path interference within one memory array is data-dependent. We propose an across-array coding scheme, which assigns a codeword to multiple independent memory arrays. Since the coded bits from different arrays experience independent channels, a "diversity" gain is obtained during decoding, and when the codeword is adequately distributed, the code performs as it would over an independent and identically distributed (i.i.d.) channel without data dependency. We also present a real-time channel estimation scheme and a data shaping technique to improve the decoding performance.
Speaker bio: Guanghui Song received his Ph.D. degree from the Department of Intelligent Information Engineering and Sciences, Doshisha University, Kyoto, Japan, in 2012. He has worked as a researcher at Doshisha University and at the University of Western Ontario, London, Canada. Currently, he is a postdoctoral researcher at the Singapore University of Technology and Design. His research interests are in the areas of channel coding theory, multi-user coding, and coding for data storage systems.
Lifted Reed-Solomon Codes with Application to Batch Codes
Lukas Holzbaur (Technical University of Munich); Rina Polyanskaya (Institute for Information Transmission Problems); Nikita Polyanskii (Technical University of Munich); Ilya Vorobyev (Skolkovo Institute of Science and Technology)
Speaker:
Abstract: Guo, Kopparty, and Sudan initiated the study of error-correcting codes derived by lifting affine-invariant codes. Lifted Reed-Solomon (RS) codes are defined as the evaluations of polynomials in a vector space over a field whose restriction to every line in the space is a codeword of the RS code. In this paper, we investigate lifted RS codes and discuss their application to batch codes, a notion introduced in the context of private information retrieval and load balancing in distributed storage systems. First, we improve the estimate of the code rate of lifted RS codes for lifting parameter m ≥ 4 and large field size. Second, we propose a new explicit construction of batch codes utilizing lifted RS codes. For some parameter regimes, our codes have a better trade-off between parameters than previously known batch codes.
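In symbols, the definition restated from the abstract reads as follows.

```latex
% The m-dimensional lift of RS[q,k] consists of the m-variate functions whose
% restriction to every affine line is a codeword of the base RS code:
\mathrm{Lift}_m\!\left(\mathrm{RS}[q,k]\right)
  = \bigl\{\, f:\mathbb{F}_q^{m}\to\mathbb{F}_q \;\bigm|\;
    f|_{L}\in \mathrm{RS}[q,k]\ \text{for every line }
    L=\{\,\mathbf{b}+t\,\mathbf{a} : t\in\mathbb{F}_q\,\},\ \mathbf{a}\neq\mathbf{0} \,\bigr\}.
% Each line then serves as a local recovery group, which is what the
% batch-code construction exploits.
```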
A Back-End, CMOS Compatible Ferroelectric Field Effect Transistor for Synaptic Weights
Mattia Halter (IBM Research GmbH, ETH Zurich); Laura Bégon-Lours (IBM Research GmbH); Valeria Bragaglia (IBM Research GmbH); Marilyne Sousa (IBM Research GmbH); Bert Jan Offrein (IBM Research GmbH); Stefan Abel (formerly IBM Research GmbH, currently at Lumiphase); Mathieu Luisier (ETH Zurich); Jean Fompeyrine (formerly IBM Research GmbH, currently at Lumiphase)
Speaker: Mattia Halter, IBM Research GmbH - Zurich Research Laboratory, CH-8803 Rüschlikon, Switzerland; Integrated Systems Laboratory, ETH Zurich, CH-8092 Zurich, Switzerland
Abstract: Neuromorphic computing architectures enable the dense co-location of memory and processing elements within a single circuit. Their building blocks are non-volatile synaptic elements such as memristors. Key memristor properties include a suitable non-volatile resistance range, continuous linear resistance modulation, and symmetric switching. In this work, we demonstrate voltage-controlled, symmetric, and analog potentiation and depression of a ferroelectric Hf0.57Zr0.43O2 (HZO) field-effect transistor (FeFET) with good linearity. Our FeFET operates with a low writing energy (fJ) and a fast programming time (40 ns). Retention measurements have been performed at 4-bit depth with low noise (1%) in the tungsten oxide (WOx) readout channel. By adjusting the channel thickness from 15 nm to 8 nm, the on/off ratio of the FeFET can be engineered from 1% to 200%, with an on-resistance ideally >100 kΩ, depending on the channel geometry. The device concept is compatible with back-end-of-line (BEOL) integration into CMOS processes. It therefore has great potential for the fabrication of high-density, large-scale integrated arrays of artificial analog synapses.
Speaker bio: Mattia Halter is a Ph.D. student at ETH Zurich doing his thesis full-time at the IBM Research Laboratory in Zurich. His work focuses on the design and fabrication of ferroelectric memristors for neuromorphic applications, starting from the development of novel materials and finishing by implementing and characterizing crossbar arrays in the back end of line. This means he spends most of his time in the cleanroom. He obtained his background in electrical engineering and nanotechnology from ETH Zurich. When he is not doing research, you will probably find him at home, waiting for the pandemic to be over. Why he likes California: he spent an exchange year close to Big Sur in 2008-2009 as a high school student, and fondly remembers the great host family, classmates, and fine Mexican barbecues.
Separation and Equivalence Results for the Crash-stop and Crash-recovery Shared Memory Models
Ohad Ben-Baruch (BGU); Srivatsan Ravi (University of Southern California)
Speaker: Ohad Ben Baruch, Ben Gurion University
Abstract: Linearizability, the traditional correctness condition for concurrent objects, is considered insufficient for the non-volatile shared memory model, where processes recover following a crash. For this crash-recovery shared memory model, strict-linearizability is considered appropriate since, unlike linearizability, it ensures that operations that crash take effect prior to the crash or not at all. This work formalizes and answers the question of whether an implementation of a data type derived for the crash-stop shared memory model is also strict-linearizable in the crash-recovery model. We present a rigorous study proving that the helping mechanism, typically employed by non-blocking implementations, is the algorithmic abstraction that delineates linearizability from strict-linearizability. Our first contribution formalizes the crash-recovery model and how explicit process crashes and recovery introduce further dimensionalities over the standard crash-stop shared memory model. We make the following technical contributions: (i) we prove that strict-linearizability is independent of any known help definition; (ii) we present a natural definition of help-freedom to prove that any obstruction-free, linearizable, and help-free implementation of a total object type is also strict-linearizable; (iii) we prove that for a large class of object types, a non-blocking strict-linearizable implementation cannot have helping. Viewed holistically, this work provides the first precise characterization of the intricacies in applying a concurrent implementation designed for the crash-stop (and resp. crash-recovery) model to the crash-recovery (and resp. crash-stop) model.
Speaker bio: Ohad Ben Baruch completed his Ph.D. in computer science at Ben-Gurion University under the supervision of Prof. Danny Hendler and Prof. Hagit Attiya. His research focuses on shared memory and concurrent computation, more specifically on complexity bounds for concurrent objects in the crash-stop and crash-recovery system models. His Ph.D. dissertation proposed Nesting-Safe Recoverable Linearizability (NRL), a novel model and correctness condition for the crash-recovery shared memory model that allows nesting of recoverable objects, together with lower and upper bounds for object implementations satisfying the condition. Ohad is currently working at BeyondMinds as a researcher and algorithms developer.
Toward Faster and More Efficient Training on CPUs Using STT-RAM-based Last Level Cache
Alexander Hankin (Tufts University); Maziar Amiraski (Tufts University); Karthik Sangaiah (Drexel University); Mark Hempstead (Tufts University)
Speaker: Alexander Hankin, Tufts University
Abstract: Artificial intelligence (AI), especially neural network-based AI, has become ubiquitous in modern-day computing. However, the training phase required for these networks demands significant computational resources and is the primary bottleneck as the community scales its AI capabilities. While GPUs and AI accelerators have begun to be used to address this problem, many of the industry's AI models are still trained on CPUs and are limited in large part by the memory system. Breakthroughs in NVM research over the past couple of decades have unlocked the potential for replacing on-chip SRAM with an NVM-based alternative. Research into Spin-Torque Transfer RAM (STT-RAM) over the past decade has explored the impact of trading off volatility for improved write latency as part of the trend to bring STT-RAM on-chip. STT-RAM is an especially attractive replacement for SRAM in the last-level cache due to its density, low leakage, and, most notably, endurance.
Speaker bio: Alexander Hankin is a fourth-year Ph.D. candidate in the ECE department at Tufts University, advised by Prof. Mark Hempstead. His research interests center on building architecture-level modeling and simulation tools for emerging non-volatile memories and thermal hotspots.
GBTL+Metall – Adding Persistence to GraphBLAS
Kaushik Velusamy (University of Maryland, Baltimore County); Scott McMillan (Carnegie Mellon University); Keita Iwabuchi (Lawrence Livermore National Laboratory); Roger Pearce (Lawrence Livermore National Laboratory)
Speaker: Kaushik Velusamy, University of Maryland, Baltimore County
Abstract: It is well known that software-hardware co-design is required to attain high-performance implementations, and system software libraries help us achieve this goal. The Metall persistent memory allocator is one such library. Metall enables large-scale data analytics by leveraging emerging memory technologies. It is designed to provide developers with rich C++ interfaces for allocating custom C++ data structures in persistent memory, not just on block storage and byte-addressable persistent memories (NVMe, Optane) but also in DRAM tmpfs. Having a large capacity of persistent memory changes the way we solve problems and leads to algorithmic innovation. In this work, we present GraphBLAS as a real application use case to demonstrate the benefits of the Metall persistent memory allocator. We show an example of how storing and re-attaching graph containers using Metall eliminates the need for graph reconstruction, at a one-time cost of re-attaching to the Metall datastore.
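A minimal sketch of the store-and-reattach pattern, following Metall's Boost.Interprocess-style C++ interface as published in its documentation; the datastore path and container are placeholders, and a real GraphBLAS container would be substituted for the vector.

```cpp
#include <metall/metall.hpp>
#include <vector>

// A container whose allocator draws from the Metall datastore.
using vec_t = std::vector<int, metall::manager::allocator_type<int>>;

void build() {
    metall::manager manager(metall::create_only, "/mnt/pmem/datastore");
    auto* v = manager.construct<vec_t>("graph")(manager.get_allocator<int>());
    v->push_back(42);  // build the container once
}   // on close, the container persists in the datastore

void reattach() {
    metall::manager manager(metall::open_only, "/mnt/pmem/datastore");
    auto* v = manager.find<vec_t>("graph").first;  // no reconstruction needed
    v->push_back(7);
}
```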
Speaker bio: Kaushik Velusamy is a Ph.D. candidate at the University of Maryland, Baltimore County. His doctoral research focuses on optimizing large-scale graph analytics on high-performance computing systems.
POSEIDON: Safe, Fast and Scalable Persistent Memory Allocator
Wook-Hee Kim (Virginia Tech); Anthony Demeri (Virginia Tech); Madhava Krishnan Ramanathan (Virginia Tech); Jaeho Kim (Gyeongsang National University); Mohannad Ismail (Virginia Tech); Changwoo Min (Virginia Tech)
Speaker: Wook-Hee Kim, Virginia Tech
Abstract: A persistent memory allocator is an essential component of any Non-Volatile Main Memory (NVMM) application. A slow memory allocator can bottleneck the entire application stack, while an insecure memory allocator can render applications inconsistent upon program bugs or system failure. Unlike DRAM-based memory allocators, it is indispensable for an NVMM allocator to guarantee the safety of its heap metadata against both internal and external errors. An effective NVMM memory allocator should be 1) safe, 2) scalable, and 3) high-performing. Unfortunately, none of the existing persistent memory allocators achieve all three requisites. For example, we found that even Intel's de facto NVMM allocator, libpmemobj, is vulnerable to silent data corruption and persistent memory leaks resulting from a simple heap overflow. In this paper, we propose Poseidon, a safe, fast, and scalable persistent memory allocator. The premise of Poseidon revolves around providing a user application with per-CPU sub-heaps for scalability and high performance, while managing the heap metadata in a segregated fashion and efficiently protecting it using a scalable hardware-based protection scheme, Intel's Memory Protection Keys (MPK). We evaluate Poseidon with a wide array of microbenchmarks and real-world benchmarks, finding that Poseidon outperforms state-of-the-art allocators by a significant margin, showing improved scalability and performance while also guaranteeing heap metadata protection.
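The MPK-based protection idea can be sketched with the Linux pkey API (glibc 2.27+): metadata pages carry a protection key that normally disables writes, and the allocator opens a short per-thread write window only around its own metadata updates, so a stray heap-overflowing store from the application faults instead of silently corrupting metadata. This is our illustration of the mechanism, not Poseidon's code; error handling is elided.

```cpp
#define _GNU_SOURCE
#include <sys/mman.h>  // pkey_alloc, pkey_mprotect, pkey_set

static int meta_pkey;

void metadata_protect(void* meta, size_t len) {
    meta_pkey = pkey_alloc(0, PKEY_DISABLE_WRITE);  // default: read-only
    pkey_mprotect(meta, len, PROT_READ | PROT_WRITE, meta_pkey);
}

void metadata_update(long* slot, long value) {
    pkey_set(meta_pkey, 0);                   // open write window (this thread)
    *slot = value;                            // the allocator's own update
    pkey_set(meta_pkey, PKEY_DISABLE_WRITE);  // close it again
}
```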
Speaker bio: Wook-Hee Kim is a postdoctoral associate at Virginia Tech. His research interests lie in system software for emerging hardware technologies, including persistent memory and Remote Direct Memory Access (RDMA). He is actively working in these areas and has published on system software for persistent memory. He obtained his B.S. and Ph.D. from Ulsan National Institute of Science and Technology (UNIST) in 2013 and 2019, respectively.
Ribbon: High-Performance Cache Line Flushing for Persistent Memory
Kai Wu (University of California, Merced); Ivy Peng (Lawrence Livermore National Laboratory); Jie Ren (University of California, Merced); Dong Li (University of California, Merced)
Speaker: Kai Wu, University of California, Merced
Abstract: Cache line flushing (CLF) is a fundamental building block for programming persistent memory (PM). CLF is prevalent in PM-aware workloads to ensure crash consistency, and it imposes high overhead. Extensive work has explored persistency semantics and CLF policies, but few have looked into the CLF mechanism itself. This work aims to improve the performance of the CLF mechanism based on performance characterization of well-established workloads on real PM hardware. We reveal that CLF performance is highly sensitive to the concurrency of CLF and to cache line status. We introduce Ribbon, a runtime system that improves the performance of the CLF mechanism through concurrency control and proactive CLF. Ribbon detects CLF bottlenecks caused by oversupplied or insufficient concurrency and adapts accordingly. Ribbon also proactively transforms dirty or non-resident cache lines into clean resident status to reduce CLF latency. Furthermore, we investigate the cause of low dirtiness in flushed cache lines in in-memory database workloads, and provide cache line coalescing as an application-specific solution that achieves up to 33.3% (13.8% on average) improvement. Our evaluation of a variety of workloads in four configurations on PM shows that Ribbon achieves up to 49.8% improvement (14.8% on average) in overall application performance.
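For background, the baseline CLF mechanism that Ribbon tunes looks like the sketch below: write back every 64-byte cache line covering a range, then fence. Ribbon's contribution sits above this loop (how many threads issue the flush stream, and proactively cleaning lines), which is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // _mm_clwb (requires CLWB support), _mm_sfence

constexpr size_t kCacheLine = 64;

void flush_range(const void* addr, size_t len) {
    auto p = reinterpret_cast<uintptr_t>(addr) & ~(kCacheLine - 1);
    auto end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += kCacheLine)
        _mm_clwb(reinterpret_cast<void*>(p));  // write back, keep line resident
    _mm_sfence();  // order the flushes before any later stores
}
```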
Speaker bio: Kai Wu is a Ph.D. candidate at the University of California, Merced. Before coming to UC Merced, he earned his master's degree in Computer Science and Engineering from Michigan State University in 2016. His research areas are computer systems, heterogeneous computing, and high-performance computing (HPC). He designs high-performance, large-scale computer systems with hardware heterogeneity. His recent work focuses on designing system support for persistent memory-based big memory platforms.
Generative Modeling of NAND Flash Memory Voltage Level
Ziwei Liu (Center of Memory and Recording Research, UC San Diego); Yi Liu (Center of Memory and Recording Research, UC San Diego); Paul H. Siegel (Center of Memory and Recording Research, UC San Diego)
Speaker: Ziwei Liu, Center of Memory and Recording Research, UC San Diego
Abstract: Program and erase cycling (P/E cycling) data is used to characterize flash memory channels and support realistic performance simulation of error-correcting codes (ECCs). However, these applications require a massive amount of data, and collecting it is time-consuming. To generate a large amount of NAND flash memory read voltages from a relatively small amount of measured data, we propose a read voltage generator based on a Time-Dependent Generative Moment Matching Network (TD-GMMN). This model can generate authentic read voltage distributions over a range of possible P/E cycles for a specified program level, based on known read voltage distributions. Experimental results based on data generated by a mathematical MLC NAND flash memory read voltage generator demonstrate the model's effectiveness.
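For orientation, a GMMN is trained by minimizing the kernel maximum mean discrepancy (MMD) between generated and measured samples, with no discriminator network; the standard empirical estimator is shown below. The time-dependent variant described in the abstract additionally conditions the generator on the P/E-cycle count.

```latex
% MMD between measured voltages x_1..x_n and generated samples y_j = G(z_j),
% under a kernel k (e.g., a Gaussian kernel):
\widehat{\mathrm{MMD}}^2
 = \frac{1}{n^2}\sum_{i,i'} k(x_i, x_{i'})
 - \frac{2}{nm}\sum_{i,j} k(x_i, y_j)
 + \frac{1}{m^2}\sum_{j,j'} k(y_j, y_{j'}).
```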
Speaker bio: Ziwei Liu is an exchange student researcher at the Center of Memory and Recording Research. Her research interests lie at the intersection of flash memory, information theory, and machine learning.