SUNDAY, MARCH 8
2:00 pm – 5:30 pm | Price Center East Ballroom
5:30 pm – 8:00 pm | The Loft at Price Center East, 4th Floor
Reception
MONDAY, MARCH 9
7:45 am – 8:45 am | The Conrad Prebys Performing Arts Center – The Wu Tsai QRT.yrd
Continental Breakfast
8:45 am – 9:00 am | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
Opening Remarks
9:00 am – 10:00 am | The Conrad Prebys Performing Arts Center
Keynote I
Accelerating Deep Neural Networks with Analog Memory Devices
Geoffrey Burr (IBM Research - Almaden)
Speaker: Geoffrey W. Burr, IBM Research - Almaden
Abstract: Deep Neural Networks (DNNs) are very large artificial neural networks trained using very large datasets, typically using the supervised learning technique known as backpropagation. Currently, CPUs and GPUs are used for these computations. Over the next few years, we can expect special-purpose hardware accelerators based on conventional digital-design techniques to optimize the GPU framework for these DNN computations. Even after the improved computational performance and efficiency expected from these special-purpose digital accelerators, there remains an opportunity for even higher performance and even better energy efficiency for DNN inference and training, by using neuromorphic computation based on analog memory devices. In this presentation, I discuss the origin of this opportunity as well as the challenges inherent in delivering on it, including materials and devices for analog volatile and non-volatile memory, circuit and architecture choices and challenges, and the current status and prospects.
Speaker bio: Geoffrey W. Burr received his Ph.D. in Electrical Engineering from the California Institute of Technology in 1996. Since then, Dr. Burr has worked at IBM Research - Almaden in San Jose, California, where he is currently a Distinguished Research Staff Member. He has worked in a number of diverse areas, including holographic data storage, photon echoes, computational electromagnetics, nanophotonics, computational lithography, phase-change memory, storage class memory, and novel access devices based on Mixed-Ionic-Electronic-Conduction (MIEC) materials. Dr. Burr's current research interests involve AI/ML acceleration using non-volatile memory. Geoff is an IEEE Fellow (2020) and a member of MRS, SPIE, OSA, Tau Beta Pi, Eta Kappa Nu, and the Institute of Physics (IOP).
10:00 am – 10:30 am
Break
10:30 am – 11:50 am | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
Session 1: Memorable Paper Award Finalists I (Systems and Architectures)
Chair: TBA
PMTest: A Fast and Flexible Testing Framework for Persistent Memory Programs
Sihang Liu (University of Virginia); Yizhou Wei (University of Virginia); Jishen Zhao (UC San Diego); Aasheesh Kolli (Penn State University & VMware Research); Samira Khan (University of Virginia)
Speaker: Sihang Liu, University of Virginia
Abstract: Recent non-volatile memory technologies such as 3D XPoint and NVDIMMs have enabled persistent memory (PM) systems that can manipulate persistent data directly in memory. This advancement of memory technology has spurred the development of a new set of crash-consistent software (CCS) for PM - applications that can recover persistent data from memory in a consistent state in the event of a crash (e.g., power failure). CCS developed for persistent memory ranges from kernel modules to user-space libraries and custom applications. However, ensuring crash consistency in CCS is difficult and error-prone. Programmers typically employ low-level hardware primitives or transactional libraries to enforce the ordering and durability guarantees that are required for ensuring crash consistency. Unfortunately, hardware can reorder instructions at runtime, making it difficult for programmers to test whether the implementation enforces the correct ordering and durability guarantees. We believe that there is an urgent need for a testing framework that helps programmers identify crash consistency bugs in their CCS. We find that prior testing tools lack generality, i.e., they work only for one specific CCS or memory persistency model and/or introduce significant performance overhead. To overcome these drawbacks, we propose PMTest, a crash consistency testing framework that is both flexible and fast. PMTest provides flexibility through two basic assertion-like software checkers that test the two fundamental characteristics of all CCS: the ordering and durability guarantees. These checkers can also serve as the building blocks of other application-specific, high-level checkers. PMTest enables fast testing by deducing the persist order without exhausting all possible orders. In an evaluation with eight programs, PMTest not only identified 45 synthetic crash consistency bugs, but also detected 3 new bugs in a file system (PMFS) and in applications developed using a transactional library (PMDK), while on average being 7.1× faster than the state-of-the-art tool.
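The checkers the abstract describes can be pictured with a toy model. The Python sketch below is illustrative only: PMTest's real interface, trace capture, and persistency-model support are far richer, and every name here is hypothetical. It shows the flavor of an assertion-like check that one store is guaranteed durable before another, deduced from a trace of stores, cache-line flushes, and fences.

```python
# Toy model of an assertion-like persistency checker in the spirit of PMTest.
# Trace events: ("store", addr), ("flush", addr), ("fence",). A store is
# guaranteed durable once its line has been flushed AND a later fence has
# drained that flush. (Hypothetical names; not PMTest's actual API.)

def persist_points(trace):
    """Map addr -> trace index at which its last store is guaranteed durable."""
    last_store, pending_flush, durable = {}, {}, {}
    for i, ev in enumerate(trace):
        if ev[0] == "store":
            last_store[ev[1]] = i
            pending_flush.pop(ev[1], None)   # a new store re-dirties the line
            durable.pop(ev[1], None)
        elif ev[0] == "flush":
            if ev[1] in last_store:
                pending_flush[ev[1]] = i
        elif ev[0] == "fence":
            for addr in list(pending_flush): # flushes ordered before the fence
                durable[addr] = i
                del pending_flush[addr]
    return last_store, durable

def is_persisted_before(trace, a, b):
    """Checker: is the store to `a` guaranteed durable before the store to `b`?"""
    last_store, durable = persist_points(trace)
    return a in durable and b in last_store and durable[a] < last_store[b]

# A log entry must be durable before the data it guards is overwritten:
good = [("store", "log"), ("flush", "log"), ("fence",), ("store", "data")]
bad  = [("store", "log"), ("store", "data"), ("flush", "log"), ("fence",)]
assert is_persisted_before(good, "log", "data")
assert not is_persisted_before(bad, "log", "data")   # reordering risk caught
```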
Speaker bio: Sihang Liu is a fourth-year Ph.D. student at the University of Virginia, advised by Professor Samira Khan. Before pursuing his doctoral degree, he obtained Bachelor's degrees from both the University of Michigan and Shanghai Jiao Tong University. His primary research interest lies in the software and hardware co-design of persistent memory systems. On the hardware side, his research aims to optimize performance and guarantee crash consistency for practical persistent memory systems that integrate both storage and memory support. On the software side, he works on testing the crash-consistency guarantees of PM-based programs. His work has produced several open-source tools and detected real-world bugs in well-known persistent memory software systems. He has published these works at top conferences in computer architecture, including ISCA, ASPLOS, and HPCA. He has also served as a reviewer for ToS and an artifact reviewer for ASPLOS.
Digital-based Processing In-Memory for High Precision Deep Neural Network Acceleration
Mohsen Imani (UC San Diego); Saransh Gupta (UC San Diego); Yeseong Kim (UC San Diego); Minxuan Zhou (UC San Diego); Tajana Rosing (UC San Diego)
Speaker: Mohsen Imani, UC San Diego
Abstract: Processing In-Memory (PIM) has shown great potential to accelerate the inference tasks of Convolutional Neural Networks (CNNs). However, existing PIM architectures do not support high-precision computation, e.g., in floating-point precision, which is essential for training accurate CNN models. In addition, most existing PIM approaches require analog/mixed-signal circuits, which do not scale and rely on insufficiently reliable multi-bit Non-Volatile Memory (NVM). In this paper, we propose FloatPIM, a fully-digital scalable PIM architecture that accelerates CNNs in both training and testing phases. FloatPIM natively supports floating-point representation, thus enabling accurate CNN training. FloatPIM also enables fast communication between neighboring memory blocks to reduce the internal data movement of the PIM architecture. We evaluate the efficiency of FloatPIM on the ImageNet dataset using popular large-scale neural networks. Our evaluation shows that FloatPIM, supporting floating-point precision, can achieve up to 5.1% higher classification accuracy compared to existing PIM architectures with limited fixed-point precision. FloatPIM training is on average 303.2× and 48.6× (4.3× and 15.8×) faster and more energy-efficient than a GTX 1080 GPU (the PipeLayer [1] PIM accelerator). For testing, FloatPIM provides 324.8× and 297.9× (6.3× and 21.6×) speedup and energy efficiency compared to the GPU (the ISAAC [2] PIM accelerator), respectively.
Speaker bio: Mohsen Imani is a Ph.D. candidate in the Department of Computer Science and Engineering at UC San Diego. His research interests are in brain-inspired computing, computer architecture, and embedded systems. His contributions have resulted in grants funded by several government agencies (four NSF, three SRC) and companies including IBM, Intel, Micron, and Qualcomm. In addition, his Ph.D. research was a main initiative behind a new DARPA program focusing on brain-inspired computing. Mohsen has received multiple prestigious awards from the UCSD School of Engineering, including the Gordon Engineering Leadership Award and the Outstanding Graduate Research Award, as well as the Best Doctorate Research Award from the Computer Science Department. He has also received several best paper award nominations at conferences including the 2019 Design Automation Conference.
Semi-Asymmetric Parallel Graph Algorithms for NVRAMs
Laxman Dhulipala (Carnegie Mellon University); Charles McGuffey (Carnegie Mellon University); Hongbo Kang (Carnegie Mellon University); Yan Gu (UC Riverside); Guy E. Blelloch (Carnegie Mellon University); Phillip B. Gibbons (Carnegie Mellon University); Julian Shun (Massachusetts Institute of Technology)
Speaker: Laxman Dhulipala, Carnegie Mellon University
Abstract: Emerging non-volatile main memory (NVRAM) technologies provide novel features for large-scale graph analytics, combining byte-addressability, low idle power, and improved memory density. Systems are likely to have an order of magnitude more NVRAM than traditional memory (DRAM), allowing large graph problems to be solved efficiently at a modest cost on a single machine. However, a significant challenge in achieving high performance is accounting for the fact that NVRAM writes can be significantly more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics in which the graph is stored as a read-only data structure (in NVRAM) and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM) but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. We present a model, the Parallel Semi-Asymmetric Model (PSAM), to analyze algorithms in this setting, and run experiments on a 48-core NVRAM system to validate the effectiveness of these algorithms. To this end, we study over a dozen graph problems and develop parallel algorithms for each that are efficient, and often work-optimal, in the model. Experimentally, we run all of the algorithms on the largest publicly available graph and show that our PSAM algorithms outperform the fastest prior algorithms designed for DRAM or NVRAM. We also show that our algorithms running on NVRAM nearly match the fastest prior algorithms running solely in DRAM, by effectively hiding the costs of repeatedly accessing NVRAM versus DRAM.
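To make the semi-asymmetric setting concrete, here is a minimal illustrative Python sketch (not the authors' code) of a BFS in the PSAM spirit: the CSR edge arrays are only ever read, standing in for NVRAM, while the sole mutable array is proportional to the number of vertices, standing in for DRAM.

```python
import numpy as np

# The graph (CSR offsets/edges) is treated as read-only, as if resident in
# NVRAM; the only mutable state is O(vertices), as if in DRAM.

def bfs(offsets, edges, src):
    n = len(offsets) - 1
    parent = np.full(n, -1, dtype=np.int64)         # O(n) mutable state only
    parent[src] = src
    frontier = [src]
    while frontier:
        nxt = []
        for u in frontier:
            for v in edges[offsets[u]:offsets[u + 1]]:  # NVRAM reads only
                if parent[v] == -1:
                    parent[v] = u
                    nxt.append(v)
        frontier = nxt
    return parent

# Path 0-1-2 plus an isolated vertex 3, in (read-only) CSR form.
offsets = np.array([0, 1, 3, 4, 4])
edges = np.array([1, 0, 2, 1])
print(bfs(offsets, edges, 0))   # parent array: [ 0  0  1 -1]
```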
Speaker bio: Laxman is a final-year Ph.D. student at CMU, where he is very fortunate to be advised by Guy Blelloch. He has worked broadly on designing and implementing provably efficient parallel graph algorithms in different parallel models of computation.
SplitFS: Reducing Software Overhead in File Systems for Persistent Memory
Rohan Kadekodi (University of Texas at Austin); Se Kwon Lee (University of Texas at Austin); Sanidhya Kashyap (Georgia Institute of Technology); Taesoo Kim (Georgia Institute of Technology); Aasheesh Kolli (Penn State University and VMware Research); Vijay Chidambaram (University of Texas at Austin and VMware Research)
Speaker: Rohan Kadekodi, Ph.D. student at The University of Texas at Austin
Abstract: We present SplitFS, a file system for persistent memory (PM) that reduces software overhead significantly compared to state-of-the-art PM file systems. SplitFS presents a novel split of responsibilities between a user-space library file system and an existing kernel PM file system. The user-space library file system handles data operations by intercepting POSIX calls, memory-mapping the underlying file, and serving reads and overwrites using processor loads and stores. Metadata operations are handled by the kernel PM file system (ext4 DAX). SplitFS introduces a new primitive termed relink to efficiently support file appends and atomic data operations. SplitFS provides three consistency modes, which different applications can choose from without interfering with each other. SplitFS reduces software overhead by up to 17x compared to ext4 DAX and increases application performance by 2x compared to ext4 DAX and NOVA while providing similar consistency guarantees.
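The data-path half of this split can be sketched with the standard mmap module: overwrites become plain loads and stores into a mapped file, while metadata operations (create, resize, rename) still go through the kernel file system. This is only a schematic of the idea, not SplitFS itself, which interposes on POSIX calls transparently and adds the relink primitive.

```python
import mmap
import os

# Schematic of the SplitFS data path: serve an overwrite with memory
# stores via mmap; leave metadata operations to the kernel file system.

path = "demo.dat"
with open(path, "wb") as f:           # metadata op: handled by kernel FS
    f.write(b"hello persistent world")

fd = os.open(path, os.O_RDWR)
size = os.fstat(fd).st_size
with mmap.mmap(fd, size) as m:
    m[0:5] = b"HELLO"                 # data op: a store, no read/write syscall
    m.flush()                         # msync writeback; on PM this would be
                                      # cache-line flushes instead
os.close(fd)
print(open(path, "rb").read())        # b'HELLO persistent world'
```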
Speaker bio: Rohan Kadekodi is a member of the UT Systems and Storage Lab at the University of Texas at Austin, advised by Prof. Vijay Chidambaram. He finished his undergraduate studies at Pune Institute of Computer Technology in India and joined the PhD program at UT Austin in the Fall of 2017. His research focuses on building file systems and key-value stores for persistent memory, with an emphasis on performance.
11:50 am – 2:00 pm | The Conrad Prebys Performing Arts Center – The Wu Tsai QRT.yrd & The Belanich Terrace
Lunch and Poster Session
2:00 pm – 3:20 pm | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
Session 2: Memorable Paper Award Finalists II (Media channels and ECC)
Chair: TBA
Who’s Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy
Amy Tai (VMware Research); Andrew Kryczka (Facebook); Shobhit O. Kanaujia (Facebook); Kyle Jamieson (Princeton University); Michael J. Freedman (Princeton University); Asaf Cidon (Columbia University)
Speaker: Amy Tai, VMware Research
Abstract: Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper, we present an approach for addressing the flash lifetime problem by allowing devices to operate at much higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems. We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store in production at Facebook that is backed by and supports transactions on top of RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 100× and recovery time by more than 10,000×. DIRECT also allows HDFS to tolerate a 10,000×–100,000× higher bit error rate without experiencing application-visible errors. By significantly increasing the availability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.
Speaker bio: Amy is a post-doc at VMware Research. Her research focuses on hardware-software codesign for the storage stack.
Error-Correcting WOM Codes for Worst-Case and Random Errors
Amit Solomon (Technion); Yuval Cassuto (Technion)
Speaker: Amit Solomon, Technion - Israel Institute of Technology
Abstract: We construct error-correcting WOM (write-once memory) codes that guarantee correction of any specified number of errors in q-level memories. The constructions use suitably designed short q-ary WOM codes and concatenate them with outer error-correcting codes over different alphabets, using suitably designed mappings. In addition to constructions for guaranteed error correction, we develop an error-correcting WOM scheme for random errors using the concept of multi-level coding.
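For readers unfamiliar with WOM codes, the classic two-write binary code of Rivest and Shamir is the kind of short inner code that such constructions build on (shown here without the error-correction layer the talk adds): two bits are written twice into three cells whose values may only change from 0 to 1.

```python
# Rivest-Shamir two-write WOM code: 2 bits stored twice in 3 write-once
# cells. First-generation codewords have weight <= 1; second-generation
# codewords are their bitwise complements.

FIRST = {(0, 0): (0, 0, 0), (0, 1): (1, 0, 0),
         (1, 0): (0, 1, 0), (1, 1): (0, 0, 1)}

def decode(cells):
    if sum(cells) <= 1:                        # first generation
        return {v: k for k, v in FIRST.items()}[cells]
    comp = tuple(1 - c for c in cells)         # second generation
    return {v: k for k, v in FIRST.items()}[comp]

def write(cells, data):
    if decode(cells) == data:                  # value already stored
        return cells
    target = FIRST[data]
    if all(c <= t for c, t in zip(cells, target)):
        return target                          # first write fits directly
    target = tuple(1 - t for t in FIRST[data])
    assert all(c <= t for c, t in zip(cells, target))  # only 0 -> 1 allowed
    return target

cells = (0, 0, 0)
cells = write(cells, (0, 1)); assert decode(cells) == (0, 1)
cells = write(cells, (1, 0)); assert decode(cells) == (1, 0)  # no erase needed
print(cells)   # (1, 0, 1): second-generation codeword for (1, 0)
```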
Speaker bio: Amit Solomon is a Ph.D. candidate in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, working with Prof. Muriel Médard in the Network Coding and Reliable Communication group at the Research Laboratory of Electronics. His research interests include coding theory, information theory, and communication. He received the B.Sc. degree (cum laude) and the M.Sc. degree in Electrical Engineering from the Technion - Israel Institute of Technology in 2015 and 2018, respectively. He is a recipient of the Irwin Mark Jacobs and Joan Klein Jacobs Presidential Fellowship.
Efficient Architectures for Generalized Integrated Interleaved Decoder
Xinmiao Zhang (The Ohio State University); Zhenshan Xie (The Ohio State University)
Speaker: Zhenshan Xie, The Ohio State University
Abstract: Generalized integrated interleaved (GII) codes nest short sub-codewords to generate parities shared by the sub-codewords. They allow hyper-speed decoding with excellent correction capability and are essential to next-generation data storage. On the other hand, the hardware implementation of GII decoders faces many challenges, including low achievable clock frequency and large silicon area. This abstract presents novel algorithmic reformulations and architectural transformations to address each bottleneck. For an example GII code that has the same rate and length as eight un-nested (255, 223) Reed-Solomon (RS) codes, our GII decoder has only 30% area overhead compared to the RS decoder while achieving 7 orders of magnitude lower frame error rate. Its critical path consists of only 7 XOR gates, and the decoder can easily achieve more than 40 GByte/s throughput.
Speaker bio: Zhenshan Xie received the B.S. degree in information engineering from East China University of Science and Technology, Shanghai, China, in 2014, and the M.S. degree in communications and information systems from the University of Chinese Academy of Sciences, Beijing, China, in 2017. He is currently working toward the Ph.D. degree in the Department of Electrical and Computer Engineering at The Ohio State University, Columbus, OH.
SOLQC: Synthetic Oligo Library Quality Control Tool
Omer Sabary (Technion – Israel Institute of Technology); Yoav Orlev (Interdisciplinary Center Herzliya); Roy Shafir (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya); Leon Anavy (Technion – Israel Institute of Technology); Eitan Yaakobi (Technion – Israel Institute of Technology); Zohar Yakhini (Technion – Israel Institute of Technology, Interdisciplinary Center Herzliya)
Speaker: Omer Sabary, Technion – Israel Institute of Technology
Abstract: DNA-based storage has attracted significant attention due to recent demonstrations of the viability of storing information in macromolecules using synthetic oligo libraries. As DNA storage experiments, as well as other experiments involving synthetic oligo libraries, grow in number and complexity, analysis tools can facilitate quality control and help in assessment and inference. We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on next-generation sequencing (NGS) analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates, and their dependence on sequence or library properties. SOLQC produces graphical descriptions of the analysis results, reported in a flexible report format. We demonstrate SOLQC by analyzing literature libraries. We also discuss the potential benefits and relevance of the different components of the analysis.
Speaker bio: Omer Sabary received his B.Sc. in Computer Science from the Technion in 2018. He is currently an M.Sc. student in the Department of Computer Science at the Technion under the supervision of Associate Professor Eitan Yaakobi. His dissertation covers reconstruction algorithms for DNA storage systems and error characterization of the DNA storage channel.
3:20 pm – 3:50 pm
Break
3:50 pm – 5:10 pm | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
Session 3: ECC
Chair: TBA
Low Cost and Power LDPC for Commodity NAND Products
Eran Sharon (WDC); Idan Alrod (WDC); Ran Zamir (WDC); Omer Fainzilber (WDC); Ishai Ilani (WDC); Alex Bazarsky (WDC); Idan Goldenberg (WDC)
Speaker: Eran Sharon, WDC
Abstract: Commodity NAND products such as USB drives, memory cards, and IoT devices are highly cost- and power-sensitive. As such, these devices usually utilize the densest memory available (e.g., 4-bit-per-cell memory based on the latest, most aggressively scaled technology node), which requires advanced ECC and DSP solutions in order to enable reliable storage over the “noisy” media. In order to minimize the ECC overprovisioning (and hence memory cost) needed to meet the product reliability specs, state-of-the-art Low-Density Parity-Check (LDPC) codes are used in such systems. These codes provide near-Shannon-limit performance with significantly higher correction capability compared to BCH codes, at the expense of larger silicon area and higher power consumption. In this work, we describe a low-cost and low-power LDPC solution, based on a special subcode-based structure coupled with a multi-gear decoding algorithm. It enables a >4X silicon area reduction with sub-BCH power consumption.
Speaker bio: Eran Sharon is an Engineering Fellow at Western Digital, heading an R&D team developing a broad range of coding, DSP, and memory management solutions for NVM. Eran has numerous publications in leading venues and holds over 200 issued patents in the fields of storage and communications. He received his PhD in EE (2009) and BSc in EE and CS (2001), both from Tel-Aviv University, Israel. He is the recipient of several awards, including the Weinstein signal processing excellence prize, the ACC Feder Prize for best graduate student research, and several SanDisk Innovation awards.
Optimization of Read Thresholds in MLC NAND Memory for LDPC Codes
Yishen Yeh (University of California, San Diego); Arman Fazeli (University of California, San Diego); Paul Siegel (University of California, San Diego)
Speaker: Yi-Shen Yeh, University of California, San Diego
Abstract: The decoders in multi-level-cell (MLC) NAND flash memories are usually capable of making up to hundreds of voltage reads to refine the soft information required for the decoding process. While using more reads usually results in a finer quantization of the likelihoods, it also incurs higher decoding latency. In this work, we explore the trade-off between the number of voltage reads and system performance. We show that, under reasonable assumptions, we can obtain near-optimal performance using only ~20 suitably chosen reads. We optimize the locations of these read thresholds according to the maximum mutual information (MMI) criterion by means of a gradient descent search. For thresholds whose positions are characterizable by a small number of parameters, we use density evolution to estimate the locations that minimize the probability of error of constructed LDPC codes with a specified degree distribution.
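A deliberately simplified sketch of MMI threshold placement follows, assuming equiprobable binary cell levels with Gaussian read voltages and a generic optimizer in place of the paper's gradient-descent and density-evolution machinery. It requires numpy/scipy, and all constants are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Place k read thresholds to maximize the mutual information of the
# quantized channel: binary levels with Gaussian read noise, k + 1 bins.

MU, SIGMA = np.array([0.0, 1.0]), 0.35        # assumed level means / noise

def bin_probs(thresholds):
    """P(bin | level) via CDF differences over (-inf, t1], (t1, t2], ..."""
    edges = np.concatenate([[-np.inf], np.sort(thresholds), [np.inf]])
    cdf = norm.cdf((edges[None, :] - MU[:, None]) / SIGMA)
    return np.diff(cdf, axis=1)               # shape: (levels, bins)

def mutual_information(thresholds):
    p_y_x = np.clip(bin_probs(thresholds), 1e-15, 1.0)
    p_y = p_y_x.mean(axis=0)                  # equiprobable inputs
    return np.sum(0.5 * p_y_x * np.log2(p_y_x / p_y))

res = minimize(lambda t: -mutual_information(t), x0=[0.3, 0.5, 0.7],
               method="Nelder-Mead")
print("thresholds:", np.sort(res.x), "MI (bits):", -res.fun)
```

By symmetry, the optimized thresholds cluster around the midpoint between the two level means, carving out the soft-information bins that an LDPC decoder benefits from.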
Speaker bio: Yishen Yeh is an M.S. student in the Department of Electrical and Computer Engineering (Communication Theory and Systems track) at the University of California, San Diego (UCSD). He is currently working in the Center for Memory and Recording Research (CMRR) at UCSD on the optimization of read thresholds in NAND flash memories for LDPC codes. His interests extend beyond coding theory to other areas of communications, including wireless sensing, 5G PHY, and Wi-Fi.
LDPC with no Error Floors for Errors and Erasures Recovery
David Declercq (Codelucida, Inc.); Shiva Planjery (Codelucida, Inc.); Bane Vasic (Codelucida, Inc. and University of Arizona); Vamsi Krishna Yella (Codelucida, Inc.); Bennedict Reynwar (Codelucida, Inc.)
Speaker: David Declercq, Codelucida, Inc.
Abstract: In this presentation, we propose an LDPC solution for emerging non-volatile memories, such as 3D XPoint or RRAMs. We optimize quasi-cyclic LDPC codes so that the erasures coming from die read failures can be easily recovered before decoding. We then introduce a new Erasure-FAID decoder, with the feature of using different update rules for the erased and non-erased bits. Together with the concept of decoder diversity, we show that our new decoder gives impressive gains compared to the legacy Reed-Solomon solution.
Speaker bio: David Declercq was born in June 1971. He received his PhD in Statistical Signal Processing in 1998 from the University of Cergy-Pontoise, France. He is a Senior Member of the IEEE and held a junior position at the "Institut Universitaire de France" from 2009 to 2014. His research topics lie in digital communications and error-correction coding theory. He worked for several years on the particular family of LDPC codes, from both the code and decoder design aspects. Since 2003, he has developed strong expertise in non-binary LDPC codes and decoders in high-order Galois fields GF(q). A large part of his research projects relate to non-binary LDPC codes. He has mainly investigated two aspects: (i) the design of GF(q) LDPC codes for short and moderate lengths, and (ii) the simplification of iterative decoders for GF(q) LDPC codes under complexity/performance tradeoff constraints. David Declercq has published more than 55 papers in major journals (IEEE Trans. Commun., IEEE Trans. Inf. Theory, IEEE Commun. Letters, IEEE Trans. Circuits and Systems, etc.) and more than 150 papers in major conferences in information theory and signal processing.
EM Algorithm for DMC Channel Estimation in NAND Flash Memory
Warangrat Wiriya (JAIST); Brian Kurkoski (JAIST)
Speaker: Warangrat Wiriya, Japan Advanced Institute of Science and Technology, Ishikawa, Japan
Abstract: The data read system for NAND flash memory can be modeled as a discrete memoryless channel (DMC) with unknown channel transition probabilities. However, LDPC decoders need a channel estimate, and incorrect channel estimation degrades the performance of the LDPC decoder. This abstract proposes using the EM algorithm to estimate the channel transition probabilities needed to compute LLRs for the LDPC decoder. At a word-error rate of 10^-5, the EM-based system loses only 0.02 dB compared to a system that knows the channel exactly.
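The E- and M-steps can be sketched in a few lines of numpy, treating the written symbol as the latent variable. In the actual system the estimate feeds the LLR computation for the LDPC decoder; here plain Bayes posteriors with a known input prior stand in, and all channel values are invented for illustration.

```python
import numpy as np

# EM estimation of DMC transition probabilities P(y|x) from read values,
# with the written symbol x latent. Illustrative parameters only.

rng = np.random.default_rng(0)
X, Y = 2, 4                                   # input levels, read bins
prior = np.array([0.5, 0.5])
true_P = np.array([[0.70, 0.20, 0.08, 0.02],
                   [0.02, 0.08, 0.20, 0.70]])
x = rng.choice(X, size=20000, p=prior)
y = np.array([rng.choice(Y, p=true_P[xi]) for xi in x])

# Start from a rough nominal read model to break the label symmetry.
P = np.array([[0.4, 0.3, 0.2, 0.1],
              [0.1, 0.2, 0.3, 0.4]])
for _ in range(50):
    r = prior[None, :] * P[:, y].T            # E-step: posterior r[i, x]
    r /= r.sum(axis=1, keepdims=True)
    for j in range(Y):                        # M-step: soft-count estimate
        P[:, j] = r[y == j].sum(axis=0)
    P /= P.sum(axis=1, keepdims=True)

print(np.round(P, 3))                         # recovers true_P closely
```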
Speaker bio: Warangrat Wiriya was born in Thailand in 1987. She received bachelor's and master's degrees in Telecommunication Engineering from King Mongkut's Institute of Technology Ladkrabang, Thailand, in 2010 and 2012, respectively. After graduation, she joined the data storage company Seagate Technology in the department of head and channel development. She is currently a doctoral student at the Japan Advanced Institute of Science and Technology, Japan. Her current research interests are LDPC codes and lattice codes.
3:50 pm – 5:10 pm | The Conrad Prebys Performing Arts Center – The JAI
Session 4: Persistency and Consistency
Chair: TBA
Unleashing the Full Potential of Persistent Memory with Logless Atomic Durability
Siddharth Gupta (EcoCloud, EPFL); Alexandros Daglis (Georgia Tech); Babak Falsafi (EcoCloud, EPFL)
Speaker: Siddharth Gupta, EcoCloud, EPFL
Abstract: Persistent Memory (PM) devices, like 3D XPoint, are now commercially available. Software can leverage the combination of fast access and persistence for significant performance gains. A key challenge for PM-aware software is to maintain high performance while meeting atomic durability requirements. The latter traditionally requires logging, which introduces considerable overhead in additional CPU cycles, write traffic, and ordering requirements. In this paper, we exploit the data multiversioning inherent in the memory hierarchy to achieve atomic durability without logging. Our design, LAD, relies on persistent buffering space at the memory controllers (MCs), already present in modern CPUs, to speculatively accumulate all of a transaction's updates before atomically committing them to PM. LAD employs an on-chip distributed commit protocol in hardware to manage the distributed speculative state each transaction accumulates across multiple MCs. LAD relies on modest hardware modifications to provide atomically durable transactions, delivering up to 80% of ideal (i.e., PM-oblivious software's) performance.
Speaker bio: Siddharth Gupta is a third-year Ph.D. student at EPFL, supervised by Prof. Babak Falsafi. He received his Bachelor of Engineering degree in Computer Science from BITS Pilani, India, in 2017. Currently, his research focuses on designing high-performance heterogeneous memory hierarchies for datacenter servers. He and his colleagues won the Best Paper Award at HPCA 2019.
Fine-Grain Checkpointing with In Cache Line Logging
Nachshon Cohen (Amazon); David T. Aksun (EPFL); Hillel Avni (Huawei); James Larus (EPFL)
Speaker: David T. Aksun, EPFL
Abstract: NVM can enable fast restart of a failed machine. On existing machines, restarting after a power failure requires a significant amount of time due to the need to load application data from durable media such as disk or SSD, parse it, and rebuild internal data structures. Storing a structure in NVM can avoid these costly steps. The main challenge in utilizing NVM is that processor caches are transient: at a power failure, all structure modifications that have not been propagated from cache to NVM are lost. An additional challenge is that cache lines are not written back to memory (NVM) in the order in which the application wrote them, but rather according to the low-level and undocumented cache replacement policy of the memory system. This creates a well-studied challenge: how to ensure that a durable copy of a data structure is well-formed (consistent) after a crash, even though NVM may contain a mixture of old and new cache lines? Most NVM systems solve this challenge with programmer-specified fine-grained transactions, which guarantee that all memory writes in a transaction reach NVM or none of them do. However, the system must ensure that the log is completely updated in NVM before the data structure is modified, by using a cache write-flush instruction. This overhead is significant since it occurs on the application's critical path, before a data structure modification completes. An alternative to fine-grained transactions and logging is checkpointing, another widely used technique for ensuring persistence. Since a checkpoint typically copies the entire state of an application to slow durable media, most systems take checkpoints at long intervals (minutes to hours) to reduce this cost. In this work, we use Fine-Grained Checkpointing, which uses frequent checkpoints to NVM to ensure persistence at low cost. Instead of ensuring that every memory write propagates to NVM as rapidly as possible, we partition an application's execution into epochs and ensure that after a crash, data structures can be restored to their state at the end of the most recently completed epoch. Our system flushes the processors' caches to NVM at the end of an epoch, thus ensuring that at this point NVM contains all modified data. Our approach also differs from a traditional checkpoint in that there is no distinct in-memory copy of a data structure or memory image; the in-NVM data structure essentially serves as its own durable checkpoint. To solve the problem of fine-grained modifications in NVM, we introduce the novel concept of an In Cache Line Log (InCLL). An InCLL is similar to an undo log, but instead of residing in an external log, the InCLL is placed in the same cache line as the modified data structure field. InCLL relies on the Persistent Cache Store Order (PCSO) memory-ordering model of NVM (two writes to the same cache line reach NVM in program order) to meet the ordering requirements of the log without introducing cache flushes and delays on the critical path. We incorporated our Fine-Grained Checkpointing algorithm into Masstree, a high-performance combination of a B+-tree and a trie.
Speaker bio: David Aksun is a Ph.D. candidate at EPFL in Lausanne. He has a B.S. (summa cum laude) in Computer Engineering from Istanbul Technical University. His research focuses on programming language challenges and performance optimization issues for byte-addressable non-volatile memory. His current research investigates the potential optimizations that can be used for building persistent data structures.
Hardware Implementation of Strand Persistency
Vaibhav Gogte (University of Michigan); William Wang (ARM); Stephan Diestelhorst (ARM); Peter M. Chen (University of Michigan); Satish Narayanasamy (University of Michigan); Thomas F. Wenisch (University of Michigan)
Speaker: Vaibhav Gogte, University of Michigan
Abstract: Emerging persistent memory (PM) technologies promise the performance of DRAM with the durability of Flash. Several language-level persistency models have emerged recently to aid programming of recoverable data structures in PM. Unfortunately, these persistency models are built upon hardware primitives that impose stricter ordering constraints on PM operations than the persistency models require. Alternative solutions use fixed and inflexible hardware logging techniques to relax ordering constraints on PM operations, but do not readily apply to the general synchronization primitives employed by language-level persistency models. Instead, we propose StrandWeaver, a hardware strand persistency model, to minimally constrain ordering on PM operations. StrandWeaver manages PM order within a strand, a logically independent sequence of operations within a thread. PM operations that lie on separate strands are unordered and may drain concurrently to PM. StrandWeaver implements primitives under strand persistency to allow programmers to improve concurrency and relax ordering constraints on updates as they drain to PM. Furthermore, we design mechanisms that map persistency semantics in high-level language persistency models to the primitives implemented by StrandWeaver. We demonstrate that StrandWeaver can enable greater concurrency of PM operations than existing ISA-level ordering mechanisms, improving performance by up to 1.97x (1.45x avg.).
Speaker bio: Vaibhav Gogte is a PhD student in the Computer Science and Engineering department at the University of Michigan, working with Prof. Thomas Wenisch. He is currently working on architecture and language support for integrating non-volatile memories into future computing systems. He received his bachelor's degree from BITS Pilani, India, and previously worked at Texas Instruments, Bangalore.
Durable Transactions Can Scale with TimeStone
R. Madhava Krishnan (Virginia Tech); Jaeho Kim (Huawei Dresden Research Center); Ajit Mathew (Virginia Tech); Xinwei Fu (Virginia Tech); Anthony Demeri (Virginia Tech); Changwoo Min (Virginia Tech); Sudarsun Kannan (Rutgers University)
Speaker: R. Madhava Krishnan, Virginia Tech
Abstract: Non-volatile main memory (NVMM) technologies promise byte addressability and near-DRAM access latency, allowing developers to build persistent applications with common load and store instructions. However, it is difficult to realize these promises because NVMM software must also provide crash consistency while delivering high performance and scalability. Durable Transactional Memory (DTM) systems address these challenges, but none of them scale beyond 16 cores. The poor scalability stems either from the underlying STM layer or from limited write parallelism (single writer or dual version). In addition, other fundamental issues with guaranteeing crash consistency are the high write amplification and large memory footprint of existing approaches. To address these challenges, we propose TimeStone: a highly scalable DTM system with low write amplification and minimal memory footprint. TimeStone uses a novel multi-layered hybrid logging technique, called TOC logging, to guarantee crash consistency. TimeStone further relies on a Multi-Version Concurrency Control (MVCC) mechanism to achieve high scalability and to support different isolation levels on the same data set. Our evaluation of TimeStone against state-of-the-art DTM systems shows that it significantly outperforms them for a wide range of workloads with varying data-set sizes and contention levels, up to 112 hardware threads. In addition, with TOC logging, TimeStone achieves a write amplification of less than 1, while existing DTM systems suffer from 2×-6× overhead.
Speaker bio: Madhava Krishnan is a second-year Ph.D. student at Virginia Tech working under Dr. Changwoo Min. Madhav primarily works on computer systems and is particularly interested in designing next-generation storage systems and operating systems. His current research focuses on building systems around Non-Volatile Main Memory (NVMM) for datacenter-scale applications. Madhav holds a Master's in Computer Architecture from the University of Texas and a Bachelor's in ECE from Anna University, India.
6:00 pm – 10:00 pm
Banquet
Mister A’s
2550 Fifth Avenue, Twelfth Floor
San Diego, CA 92103
TUESDAY, MARCH 10
8:00 am – 9:00 am | The Conrad Prebys Performing Arts Center – The Wu Tsai QRT.yrd
Continental Breakfast
9:00 am – 10:00 am | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
Keynote II
Rethinking the Operating System Stack for Byte-Addressable NVM
Ethan L. Miller (University of California, Santa Cruz / Pure Storage)
Speaker: Ethan L. Miller, University of California, Santa Cruz
Abstract: Byte-addressable non-volatile memory (NVM) promises applications the ability to persist small units of data, enabling new programming paradigms and system designs. However, such gains will require significant help from the operating system: it needs to "get out of the way" while still providing strong guarantees for security and resource management. This talk will describe our approach to designing an operating system and programming environment that leverages the advantages of NVM to provide a single-level store for application data. Under this approach, NVM can be accessed, transparently, by any thread at any time, with pointers retaining their meanings across multiple invocations. Equally important, the operating system is minimally involved in program operation, limiting itself to managing virtualized devices, scheduling threads, and managing page tables to enforce user-supplied access controls. Structuring the system in this way provides both a simpler programming model and, in many cases, higher performance, allowing NVM-based systems to fully leverage the new ability to persist data with a single write.
Speaker bio: Ethan L. Miller is a Professor in the Computer Science and Engineering Department at the University of California, Santa Cruz. He is a Fellow of the IEEE and an ACM Distinguished Scientist, and his publications have received multiple Best Paper awards. Prof. Miller received an Sc.B. from Brown University in 1987 and a Ph.D. from UC Berkeley in 1995, and has been on the UC Santa Cruz faculty since 2000. He has co-authored over 150 papers on a range of topics in file and storage systems, operating systems, parallel and distributed systems, information retrieval, and computer security; his research has received over 13,000 citations. He was a member of the team that developed Ceph, a scalable high-performance distributed file system for scientific computing that is now being adopted by several high-end computing organizations. His current research projects, which are funded by the National Science Foundation and industry support for the CRSS and SSRC, include system support for byte-addressable non-volatile memory, archival storage systems, reliable and secure storage systems, and issues in ultra-scale storage systems. Prof. Miller has worked with Pure Storage since its founding in 2009, helping to design and refine its storage architecture. He has also worked with other companies, including Samsung, Veritas, and Seagate, to help move research results into commercial use. Additional information is available at https://www.crss.ucsc.edu/person/elm.html.
10:50 am – 11:50 am
Session 5: Persistent RAMs: outlook and optimization
Chair: TBA
An Empirical Guide to the Behavior and Use of Scalable Persistent Memory
Jian Yang (UC San Diego); Juno Kim (UC San Diego); Morteza Hoseinzadeh (UC San Diego); Joseph Izraelevitz (University of Colorado, Boulder); Steven Swanson (UC San Diego)
Speaker: Juno Kim, UCSD
Abstract: We have characterized the performance and behavior of Optane DIMMs using a wide range of micro-benchmarks, benchmarks, and applications. The data we have collected demonstrate that many of the assumptions that researchers have made about how NVDIMMs would behave and perform are incorrect. We have found the actual behavior of Optane DIMMs to be more complicated and nuanced than the "slower, persistent DRAM" label would suggest. This paper presents a detailed evaluation of the behavior and performance of Optane DIMMs on microbenchmarks and applications and provides concrete, actionable guidelines for how programmers should tune their programs to make the best use of these new memories.
Speaker bio: Juno Kim is a third-year PhD student at UC San Diego. His research interests lie in the field of storage systems optimized for persistent memory. His advisor is Prof. Steven Swanson.
Understanding and Improving Persistent Transactions on Optane DC Memory
Pantea Zardoshti (Lehigh University); Michael Spear (Lehigh University); Aida Vosoughi (Oracle); Garret Swart (Oracle)
Speaker: Pantea Zardoshti, Lehigh University
Abstract: Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery, as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel Optane Direct Connect (Optane DC) Persistent Memory. Optane DC promises access times within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance and explore how Optane DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability model, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane DC memory.
Speaker bio: Pantea Zardoshti is a fifth-year PhD student in the CSE department at Lehigh University, working on hardware/software co-design for non-volatile memory. Her previous work focused on adaptive sparse matrix representations for efficient matrix-vector multiplication on GPUs. Pantea obtained her B.Sc. and M.Sc. in computer engineering with highest honors and joined the Institute for Research in Fundamental Sciences (IPM) as a research assistant. In Iran, she received a silver medal in the National Computer Science Olympiad and an Outstanding Student Award.
Janus: Optimizing Memory and Storage Support for Non-Volatile Memory System
Sihang Liu (University of Virginia); Korakit Seemakhupt (University of Virginia); Gennady Pekhimenko (University of Toronto); Aasheesh Kolli (Penn State University & VMware Research); Samira Khan (University of Virginia)
Speaker: Sihang Liu, University of Virginia
Abstract: Non-volatile memory (NVM) technologies can manipulate persistent data directly in memory. Ensuring crash consistency of persistent data requires that data updates reach all the way to NVM, which puts these write requests on the critical path. Recent literature has sought to reduce this performance impact. However, prior works have not fully accounted for all the backend memory operations (BMOs) performed at the memory controller that are necessary to maintain persistent data in NVM. These BMOs include support for encryption, integrity protection, compression, deduplication, etc., necessary to provide security, endurance, and lifetime guarantees. They significantly increase NVM write latency and exacerbate the performance degradation caused by the critical write requests. The goal of this work is to minimize the BMO overhead of write requests in an NVM system. The central challenge is to figure out how to optimize these seemingly dependent and monolithic BMOs. Our key insight is to decompose each BMO into a series of sub-operations and then reduce their overall latency through two mechanisms: (i) parallelizing sub-operations across BMOs and (ii) pre-executing sub-operations off the critical path as soon as their inputs are ready. We expose a generic software interface that can be used to issue pre-execution requests compatible with common crash-consistency programming models and various BMOs. Based on these ideas, we propose Janus, a hardware-software co-design that parallelizes and pre-executes BMOs in an NVM system. We evaluate Janus in an NVM system that integrates encryption, integrity verification, and deduplication and issues pre-execution requests through the proposed software interface, either manually or using an automated compiler pass. Compared to a system that performs these operations serially, Janus achieves 2.35× and 2.00× speedup using manual and automated instrumentation, respectively.
Speaker bio: Sihang Liu is a fourth-year Ph.D. student at the University of Virginia, advised by Professor Samira Khan. Before pursuing his doctoral degree, he obtained Bachelor's degrees from both the University of Michigan and Shanghai Jiao Tong University. His primary research interest lies in the software and hardware co-design of persistent memory systems. On the hardware side, his research aims to optimize performance and guarantee crash consistency for practical persistent memory systems that are integrated with both storage and memory support. On the software side, he works on testing the crash-consistency guarantees of PM-based programs. His work has produced several open-source tools and detected real-world bugs in well-known persistent memory software systems. He has published these works at top conferences in computer architecture, including ISCA, ASPLOS, and HPCA. He has also served as a reviewer for ToS and an artifact reviewer for ASPLOS.
11:50 am – 1:30 pm | The Conrad Prebys Performing Arts Center – The Wu Tsai QRT.yrd & The Belanich Terrace
Lunch Break + Poster Session
1:30 pm – 2:30 pm | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
1:30 pm – 2:30 pm | The Conrad Prebys Performing Arts Center – The JAI
Session 6: Flash and SSDs
Chair: TBA
Project Almanac: A Time-Traveling Solid-State Drive
Xiaohao Wang (University of Illinois at Urbana-Champaign); Chance C. Coats (University of Illinois at Urbana-Champaign); Jian Huang (University of Illinois at Urbana-Champaign)
Speaker: Jian Huang, University of Illinois at Urbana-Champaign
Abstract: Preserving the history of storage states is critical to ensuring system reliability and security. It facilitates system functions such as debugging, data recovery, and forensics. Existing software-based approaches like data journaling, logging, and backups not only introduce performance and storage costs but are also vulnerable to malware attacks, as adversaries can obtain kernel privileges to terminate or destroy them. In this paper, we present Project Almanac, which includes (1) a time-travel solid-state drive (SSD) named TimeSSD that retains a history of storage states in hardware for a window of time, and (2) a toolkit named TimeKits that provides storage-state query and rollback functions. TimeSSD tracks the history of storage states in the hardware device, without relying on explicit backups, by exploiting the property that flash retains old copies of data when they are updated or deleted. We implement TimeSSD with a programmable SSD and develop TimeKits for several typical system applications. Experiments with a variety of real-world case studies demonstrate that TimeSSD can retain all the storage states for eight weeks, with negligible performance overhead, while providing the device-level time-travel property.
Speaker bio: Jian Huang is an Assistant Professor in the ECE department at UIUC. He received his Ph.D. in Computer Science from Georgia Tech in 2017. His research interests lie in computer systems, including memory/storage systems, systems architecture, systems security, distributed systems, and especially their intersections. He enjoys building systems. His research contributions have been published at top-tier systems, architecture, and security conferences. His work received the Best Paper Award at USENIX ATC in 2015, an IEEE Micro Top Picks Honorable Mention in 2016, and a Micro Top Picks selection in 2020. He received the NetApp Faculty Fellowship Award and a Google Faculty Research Award in 2018.
PROFS: CPU Profile-based Tiering File System for All Flash Storage System
Yu Nakanishi (KIOXIA, UCSD); Steven Swanson (UCSD)
Speaker: Yu Nakanishi, KIOXIA, UCSD
Abstract: Solid-state drives (SSDs) play an essential role in modern computing systems. In recent years, new kinds of SSDs, such as low-latency high-cost SSDs and low-cost high-latency SSDs, have emerged. However, when integrating different types of SSDs in a storage system, it can be challenging to balance cost and performance due to their conflicting characteristics. For example, redundant data migration not only causes overhead but also accelerates the wear-out of flash SSDs. We propose the CPU Profile-based Data allocation Management (CPDM) scheme and the CPU PROfile-based tiering File System (PROFS), a dynamic tiering file system based on CPDM. CPDM and PROFS are designed to provide not only performance efficiency but also low and stable write amplification. We have performed an experimental evaluation and compared it with a popular caching system. Experimental results demonstrate that PROFS provides both a low average and a low variance of write amplification.
Speaker bio: Yu Nakanishi is an engineer at KIOXIA. He was a visiting scholar at UC San Diego, advised by Prof. Steven Swanson. His research focuses on designing cost-effective and performance-effective storage systems for new memory technologies.
Dynamic Multi-Resolution Data Storage
Yu-Ching Hu (University of California, Riverside); Murtuza Taher Lokhandwala (North Carolina State University); Te I (Google Inc.); Hung-Wei Tseng (University of California, Riverside)
Speaker: Yu-Ching Hu, University of California, Riverside
Abstract: Approximate computing that works on less precise data leads to significant performance gains and energy-cost reductions for compute kernels. However, without leveraging the full-stack design of computer systems, modern computer architectures undermine the potential of approximate computing. In this paper, we present Varifocal Storage, a dynamic multi-resolution storage system that tackles challenges in performance, quality, flexibility, and cost for computer systems supporting diverse application demands. Varifocal Storage dynamically adjusts the dataset resolution within a storage device, thereby mitigating the performance bottleneck of exchanging/preparing data for approximate compute kernels. Varifocal Storage introduces the Autofocus and iFilter mechanisms to provide quality control inside the storage device and make programs more adaptive to diverse datasets. Varifocal Storage also offers flexible, efficient support for approximate and exact computing without exceeding the costs of conventional storage systems by (1) saving the raw dataset in the storage device, and (2) targeting operators that complement the power of existing SSD controllers to dynamically generate lower-resolution datasets. We evaluate the performance of Varifocal Storage by running applications on a heterogeneous computer with our prototype SSD. The results show that Varifocal Storage can speed up data-resolution adjustments by 2.02× or 1.74× without programmer input. Compared to conventional approximate-computing architectures, Varifocal Storage speeds up the overall execution time by 1.52×.
Speaker bio: Yu-Ching is a third-year Ph.D. student at UC Riverside under the supervision of Professor Hung-Wei Tseng. His research interests focus on improving the performance of heterogeneous computers, storage systems, and machine learning applications.
Session 7: Machine Learning Applications
Chair: TBA
MaxNVM: Maximizing DNN Storage Density and Inference Efficiency with Sparse Encoding and Error Mitigation
Lillian Pentecost (Harvard University); Marco Donato (Harvard University); Brandon Reagen (Harvard University); Udit Gupta (Harvard University); Siming Ma (Harvard University); Gu-Yeon Wei (Harvard University); David Brooks (Harvard University);
Speaker: Lillian Pentecost, Harvard University
Abstract: Deeply embedded applications require low-power, low-cost hardware that fits within stringent area constraints. Deep learning has many potential uses in these domains, but introduces significant inefficiencies stemming from off-chip DRAM accesses of model weights. Ideally, models would fit entirely on-chip. However, even with compression, the memory requirements of state-of-the-art models make on-chip inference impractical. Thanks to their increased density, emerging eNVMs are one promising solution. MaxNVM is a principled co-design of sparse encodings, protective logic, and fault-prone MLC eNVM technologies (i.e., RRAM and CTT) that enables highly efficient DNN inference. We find that bit-reduction techniques (e.g., clustering and sparse compression) increase weight vulnerability to faults, which limits the capabilities of MLC eNVM. To circumvent this limitation, we improve storage density with minimal overhead using protective logic. Tradeoffs between density and reliability result in a rich design space. We show that by balancing these techniques, the weights of large DNNs can reasonably fit on-chip. Compared to a naive single-level-cell eNVM solution, our highly optimized MLC memories reduce weight area by up to 29×. We compare our technique against NVDLA, a state-of-the-art industry-grade CNN accelerator, and demonstrate up to 3.2× reduced power and up to 7.5× reduced energy per ResNet50 inference, depending on input frame rate.
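A toy calculation (ours, not the paper's) of why bit reduction increases fault vulnerability: with 2-bit clustered weights, a single flipped storage bit jumps the weight to a different centroid, whereas in plain 8-bit fixed point it perturbs only one bit position.

    centroids = [-0.5, -0.1, 0.1, 0.5]          # 2-bit clustered weights

    def clustered_error(index, bit):
        """Weight error when storage bit `bit` of a cluster index flips."""
        return abs(centroids[index ^ (1 << bit)] - centroids[index])

    def fixedpoint_error(bit):
        """Weight error when bit `bit` of an 8-bit weight in [-1, 1) flips."""
        return (2.0 / 256) * (1 << bit)

    print(clustered_error(index=2, bit=0))      # 0.4 -- a large jump
    print(fixedpoint_error(bit=0))              # 0.0078125 -- tiny perturbation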
Speaker bio: Lillian is a fourth-year PhD student investigating software-hardware co-design methodologies, specialized hardware for deep learning, and applications of eNVM technologies. She is advised by Professors David Brooks and Gu-Yeon Wei at Harvard University.
Coded Deep Neural Networks for Robust Neural Computation
Netanel Raviv (Washington University in St. Louis); Pulakesh Upadhyaya (Texas A&M University); Siddharth Jain (California Institute of Technology); Jehoshua Bruck (California Institute of Technology); Anxiao Jiang (Texas A&M University);
Speaker: Pulakesh Upadhyaya, Texas A&M University, College Station
Abstract: Deep Neural Networks (DNNs) are the revolutionary force supporting AI today. When DNNs are implemented in hardware, their weights are often stored in non-volatile memory devices. Hardware implementation can help DNNs be used ubiquitously. However, numerous types of errors appear in hardware, making reliability a critical challenge. To conquer this challenge, new robust DNN architectures are needed. This work presents a new scheme for robust DNNs called the Coded Deep Neural Network (CodNN). It transforms the internal structure of a DNN by adding redundant neurons and edges to increase its reliability. The added redundancy can be seen as a new type of error-correcting code customized for machine learning. A construction for CodNN based on analog error-correcting codes is presented and shown to substantially improve the robustness of the DNN over a wide range of weight-erasure probabilities. In addition, fundamental properties and further designs for CodNN are also studied.
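The paper's construction uses analog error-correcting codes; the toy below only conveys the flavor of weight-level redundancy with a simple repetition code (the function and parameters are ours, not the CodNN construction): each weight is stored on r devices and averaged on read, so an erasure (read as 0) is attenuated.

    import random

    def noisy_dot(weights, x, p_erase, r):
        """Dot product where each weight is stored on r devices and averaged;
        an erased copy reads as 0."""
        out = 0.0
        for w, xi in zip(weights, x):
            copies = [0.0 if random.random() < p_erase else w for _ in range(r)]
            out += (sum(copies) / r) * xi
        return out

    random.seed(0)
    w, x = [0.3, -0.7, 0.5], [1.0, 2.0, -1.0]
    exact = sum(wi * xi for wi, xi in zip(w, x))
    for r in (1, 3, 9):   # error typically shrinks as redundancy r grows
        print(r, round(abs(noisy_dot(w, x, p_erase=0.2, r=r) - exact), 3))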
Speaker bio: Pulakesh Upadhyaya is a Ph.D. candidate in the Department of Computer Science and Engineering at Texas A&M University. His research interests include error-correcting codes, machine learning, and information theory.
Performance Oriented Error Correction for Robust Neural Networks
Kunping Huang (Texas A&M U., USA); Paul H. Siegel (UCSD); Anxiao Jiang (Texas A&M U., USA);
Speaker: Kunping Huang, Texas A&M University
Abstract: When deep neural networks (DNNs) are implemented in hardware, their weights usually need to be stored in non-volatile memory devices. As noise accumulates in the stored weights, the DNN's performance degrades. This work studies how to use error-correcting codes (ECCs) to protect the weights. Unlike classic error correction in data storage, the objective is to optimize the DNN's performance after error correction, rather than to minimize the uncorrectable bit error rate in the protected bits; that is, the error correction scheme is performance-oriented. A main challenge is that a DNN often has millions to hundreds of millions of weights, implying a large redundancy overhead for ECCs, and the relationship between the weights and the DNN's performance is highly complex. To address this challenge, we propose a Selective Protection (SP) scheme, which chooses only a subset of important bits for ECC protection. To find such bits and achieve an optimized tradeoff between the ECC's redundancy and the DNN's performance, we present an algorithm based on deep reinforcement learning. Experimental results verify that, compared to the natural baseline scheme, the proposed algorithm achieves substantially better performance on the performance-oriented error correction task.
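A minimal sketch of the selective-protection idea. The paper selects bits with deep reinforcement learning; the heuristic below merely ranks bits by the weight change a flip would cause, a stand-in of our own for illustration.

    def rank_bits(weights, bits=8):
        """Rank (weight, bit) pairs by the magnitude change a flip causes."""
        impact = [(abs(w) * (1 << b), wi, b)
                  for wi, w in enumerate(weights) for b in range(bits)]
        return sorted(impact, reverse=True)

    def select_protected(weights, budget):
        """Protect only the `budget` most damaging bits with ECC."""
        return [(wi, b) for _, wi, b in rank_bits(weights)[:budget]]

    print(select_protected([0.9, -0.05, 0.4], budget=6))
    # high-order bits of the large weights are protected first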
Speaker bio: Kunping Huang is a master's student in the Computer Science and Engineering Department at Texas A&M University and a member of the Information Innovation Lab. Working with Dr. Anxiao (Andrew) Jiang, his research focuses on deep learning; he developed a deep learning algorithm that improves the reliability of neural networks by protecting their important bits.
2:30 pm – 3:00 pm
Break
3:00 pm – 4:00 pm | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
Session 8: Reliability
Chair: TBA
Variability-Aware Read and Write Channel Models for 1S1R Crossbar Resistive Memory with High Wordline/Bitline Resistance
Zehui Chen (UCLA); Lara Dolecek (UCLA);
Speaker: Zehui Chen, UCLA
Abstract: As technology scales down to the single-nm regime, the increasing resistivity of wordlines/bitlines becomes a limiting factor for device reliability. This paper presents write/read communication channel models that account for line resistance and device variability by statistically relating the degraded write/read margins to the channel parameters. These models provide quantitative tools for evaluating the trade-offs between memory reliability and design parameters, such as array size, technology node, and aspect ratio, and for designing coding-theoretic solutions for crossbar memory.
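Back-of-the-envelope arithmetic in the spirit of these models (the numbers and the lumped series-resistance approximation are illustrative, not the paper's): the farthest cell in an n×n array sees roughly 2·n·R_line of interconnect in series, which compresses the sensed LRS/HRS current margin as the array grows.

    def read_current(r_cell, n, r_line, v_read=0.3):
        """Current through the farthest cell, with ~2*n*r_line of wire in series."""
        return v_read / (r_cell + 2 * n * r_line)

    for n in (64, 256, 1024):   # bigger arrays -> smaller sensing margin
        margin = read_current(1e4, n, r_line=5.0) - read_current(1e6, n, r_line=5.0)
        print(n, f"{margin * 1e6:.1f} uA")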
Speaker bio: Zehui Chen is a Ph.D. candidate in the Electrical and Computer Engineering Department at the University of California, Los Angeles (UCLA). He received his M.S. degree in Electrical Engineering from UCLA in 2018 and his B.S. degree in Electrical Engineering from Purdue University, West Lafayette, in 2016. He is currently a graduate student researcher at the Laboratory for Robust Information Systems (LORIS) at UCLA. His research interests include coding theory, information theory, and their applications to emerging memory media.
Bad Page Detector for NAND Flash Memory
Yi Liu (UC San Diego); Si Wu (UC San Diego); Paul Siegel (UCSD);
Speaker: Yi Liu, Department of Electrical and Computer Engineering, CMRR, UCSD
Abstract: NAND flash memory cells gradually wear out during program/erase (P/E) cycling, resulting in an increasing bit error count (BEC) at the page level. BEC behavior varies significantly among pages. To increase the lifetime of a flash memory device, we propose a bad page detector, which predicts whether a page will become a "bad" page in the near future based on its current and previous BEC information. Two machine learning algorithms, based upon time-dependent neural network and long short-term memory architectures, were used to design the detector. Experimental results based on data collected from a TLC flash memory test platform quantify its effectiveness.
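A shape-level PyTorch sketch of the LSTM-style detector (hyperparameters, features, and training are omitted, and this is not the paper's model): a sequence of per-page BEC readings goes in, a bad-page probability comes out.

    import torch
    import torch.nn as nn

    class BadPageDetector(nn.Module):
        """Toy LSTM over a page's bit-error-count (BEC) history."""
        def __init__(self, hidden=16):
            super().__init__()
            self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, bec_history):                 # (batch, time, 1)
            _, (h, _) = self.lstm(bec_history)
            return torch.sigmoid(self.head(h[-1]))      # P(page soon "bad")

    model = BadPageDetector()
    rising_bec = torch.tensor([[[3.0], [5.0], [9.0], [17.0]]])
    print(model(rising_bec))   # untrained output, illustrative only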
Speaker bio: Yi Liu received the B.S. degree in physics from Peking University, Beijing, China, in 2014. He is currently pursuing the Ph.D. degree at the Center for Memory and Recording Research, Department of Electrical and Computer Engineering, University of California San Diego. His current research interests include coding for costly channels and non-volatile memories.
The Parallel Persistent Memory Model
Guy E. Blelloch (Carnegie Mellon University); Phillip B. Gibbons (Carnegie Mellon University); Yan Gu (University of California, Riverside); Charles McGuffey (Carnegie Mellon University); Julian Shun (Massachusetts Institute of Technology);
Speaker: Charles McGuffey, Carnegie Mellon University
Abstract: We consider a parallel computational model, the Parallel Persistent Memory model, comprising P processors, each with a fast local ephemeral memory of limited size, all sharing a large persistent memory. The model allows each processor to fault at any time (with bounded probability) and possibly restart. When a processor faults, all of its state and local ephemeral memory are lost, but the persistent memory remains. This model is motivated by upcoming non-volatile memories that are nearly as fast as existing random-access memory, are accessible at the granularity of cache lines, and can survive power outages. It is further motivated by the observation that in large parallel systems, failure of processors and their caches is not unusual. We present several results for the model, using an approach that breaks a computation into capsules, each of which can safely be run multiple times. For the single-processor version, we describe how to simulate any program in the RAM, the external-memory model, or the ideal-cache model with an expected constant-factor overhead. For the multiprocessor version, we describe how to efficiently implement a work-stealing scheduler within the model that handles both soft faults, with a processor restarting, and hard faults, with a processor permanently failing. For any multithreaded fork-join computation that is race free and write-after-read conflict free and has W work, D depth, and C maximum capsule work in the absence of faults, the scheduler guarantees a time bound on the model of O(W/P_A + (D·P/P_A)·⌈log_{1/(C·f)} W⌉) in expectation, where P is the maximum number of processors, P_A is the average number, and f ≤ 1/(2C) is the probability that a processor faults between successive persistent memory accesses. Within the model, and using the proposed methods, we develop efficient algorithms for parallel prefix sums, merging, sorting, and matrix multiplication.
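A toy of the capsule discipline (our simplification; the paper's capsules and commit protocol are more careful): a capsule reads only persistent inputs and commits idempotently, so re-running it after a fault is harmless.

    import random

    persistent = {"x": 10, "done": set()}      # survives faults

    def run_capsule(name, fn, p_fault=0.3):
        while True:                            # retry until the capsule commits
            if name in persistent["done"]:
                return                         # already committed earlier
            result = fn(persistent)            # reads persistent inputs only
            if random.random() < p_fault:
                continue                       # fault: ephemeral state lost
            persistent[name] = result          # commit (atomic in the model)
            persistent["done"].add(name)
            return

    random.seed(1)
    run_capsule("square", lambda pm: pm["x"] ** 2)
    print(persistent["square"])                # 100, however many faults occur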
Speaker bio: Charles McGuffey is a fifth-year PhD student at CMU, where he is advised by Phillip Gibbons. He works on NVRAM and caching, and is generally interested in the theory of practical systems.
3:00 pm – 4:00 pm | The Conrad Prebys Performing Arts Center – The JAI
Session 9: Memory Management
Chair: TBA
Panthera: Holistic Memory Management for Big Data Processing over Hybrid Memories
Chenxi Wang (UCLA); Harry Xu (UCLA);
Speaker: Chenxi Wang, University of California, Los Angeles
Abstract: To process real-world datasets, modern data-parallel systems often require extremely large amounts of memory, which are both costly and energy-inefficient. Emerging non-volatile memory (NVM) technologies offer high capacity compared to DRAM and low energy compared to SSDs. Hence, NVMs have the potential to fundamentally change the dichotomy between DRAM and durable storage in Big Data processing. However, most Big Data applications are written in managed languages and executed on top of a managed runtime that already performs various dimensions of memory management. Supporting hybrid physical memories adds a new dimension, creating unique challenges for data replacement. This paper proposes Panthera, a semantics-aware, fully automated memory management technique for Big Data processing over hybrid memories. Panthera analyzes user programs on a Big Data system to infer their coarse-grained access patterns, which are then passed to the Panthera runtime for efficient data placement and migration. For Big Data applications, this coarse-grained data-division information is accurate enough to guide the GC for data layout, incurring little overhead for data monitoring and movement. We implemented Panthera in OpenJDK and Apache Spark. Our extensive evaluation demonstrates that Panthera reduces energy by 32 – 52% at only a 1 – 9% time overhead.
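A sketch of coarse-grained placement under a DRAM budget (the access counts and the greedy policy are our stand-ins for Panthera's static inference and GC integration): whole datasets inferred to be hot go to DRAM, the rest to NVM.

    def place(datasets, dram_budget):
        """Greedily keep the hottest whole datasets in DRAM, the rest in NVM."""
        placement, used = {}, 0
        for d in sorted(datasets, key=lambda d: d["accesses"], reverse=True):
            if used + d["size"] <= dram_budget:
                placement[d["name"]], used = "DRAM", used + d["size"]
            else:
                placement[d["name"]] = "NVM"
        return placement

    rdds = [{"name": "edges", "size": 40, "accesses": 100},
            {"name": "checkpoint", "size": 80, "accesses": 2}]
    print(place(rdds, dram_budget=50))   # {'edges': 'DRAM', 'checkpoint': 'NVM'}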
Speaker bio: I'm a postdoc at UCLA working with Dr. Harry Xu. My research interest is building hardcore systems, managed runtimes, and big data systems for emerging hardware such as non-volatile memory and disaggregated clusters.
Metall: An Allocator for Persistent Memory
Keita Iwabuchi (Lawrence Livermore National Laboratory); Roger Pearce (Lawrence Livermore National Laboratory); Maya Gokhale (Lawrence Livermore National Laboratory);
Speaker: Keita Iwabuchi, Lawrence Livermore National Laboratory
Abstract: We present Metall, a persistent memory allocator designed to provide developers with an API to allocate custom C++ data structures in both block-storage and byte-addressable persistent memories (e.g., NVMe SSDs and Intel Optane DC Persistent Memory). Metall incorporates the state-of-the-art allocation algorithms of Supermalloc with the rich C++ interface developed by Boost.Interprocess, and provides persistent memory snapshotting (versioning) capabilities. We perform a performance evaluation using a graph construction benchmark, persistently storing data structures on NVMe SSDs and byte-addressable Optane. Our approach using Metall works without code modifications in both scenarios, block-addressable and byte-addressable. Metall provides a coarse-grained consistency model, allowing the application to determine when it is appropriate to create durable snapshots of the persistent heap, and provides improved performance over Boost.Interprocess and memkind on an NVMe device.
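Metall itself is a C++ library; the Python toy below only mimics the coarse-grained consistency model described above (mutate the persistent heap freely, then snapshot at application-chosen consistent points), and none of these names come from Metall's API.

    import json, os, shutil

    HEAP, SNAP_DIR = "heap.json", "snapshots"

    def save(heap):                         # working persistent heap
        with open(HEAP, "w") as f:
            json.dump(heap, f)

    def snapshot(version):                  # durable, versioned copy
        os.makedirs(SNAP_DIR, exist_ok=True)
        shutil.copy(HEAP, os.path.join(SNAP_DIR, f"v{version}.json"))

    heap = {"edges": [[0, 1], [1, 2]]}
    save(heap)                              # the application decides when the
    snapshot(1)                             # heap is consistent enough to snapshot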
Speaker bio: Dr. Keita Iwabuchi is a postdoctoral researcher in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research area is distributed systems and parallel computing, particularly high-performance computing (HPC) and big-data processing. His major focuses are large-scale graph processing on distributed-memory systems with non-volatile memory (NVM) and system software for big-data processing.
RECIPE: Converting Concurrent DRAM Indexes to Persistent-Memory Indexes
Se Kwon Lee (University of Texas at Austin); Jayashree Mohan (University of Texas at Austin); Sanidhya Kashyap (Georgia Institute of Technology); Taesoo Kim (Georgia Institute of Technology); Vijay Chidambaram (University of Texas at Austin and VMware Research);
Speaker: Se Kwon Lee, The University of Texas at Austin
Abstract: We present Recipe, a principled approach for converting concurrent DRAM indexes into crash-consistent indexes for persistent memory (PM). The main insight behind Recipe is that the isolation provided by a certain class of concurrent in-memory indexes can be translated, with small changes, into crash consistency when the same index is used in PM. We present a set of conditions that identify this class of DRAM indexes, along with the actions to be taken to convert each index to be persistent. Based on these conditions and conversion actions, we modify five different DRAM indexes based on B+ trees, tries, radix trees, and hash tables into their crash-consistent PM counterparts. The effort involved in this conversion is minimal, requiring 30–200 lines of code. We evaluated the converted PM indexes on Intel DC Persistent Memory and found that they outperform state-of-the-art, hand-crafted PM indexes in multi-threaded workloads by up to 5.2×. For example, we built P-CLHT, our PM implementation of the CLHT hash table, by modifying only 30 LOC. When running YCSB workloads, P-CLHT performs up to 2.4× better than Cacheline-Conscious Extendible Hashing (CCEH), the state-of-the-art PM hash table.
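The core conversion action, sketched with Python stand-ins for the cache-line write-back and fence instructions (the real conversions are small C/C++ changes): every store to index memory is followed by a flush and an ordering fence.

    log = []
    def clwb(addr): log.append(("flush", addr))    # stand-in for CLWB
    def sfence():   log.append(("fence",))         # stand-in for SFENCE

    class PersistentHashTable:
        def __init__(self):
            self.buckets = {}
        def insert(self, key, value):
            self.buckets[key] = value   # an 8-byte atomic store in a real index
            clwb(key)                   # push the line toward persistent media
            sfence()                    # order it before dependent stores

    t = PersistentHashTable()
    t.insert("k", 42)
    print(log)                          # [('flush', 'k'), ('fence',)]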
Speaker bio: Se Kwon Lee is a Ph.D. student at The University of Texas at Austin, advised by Professor Vijay Chidambaram. His research interests primarily lie in improving the performance and reliability of storage/file systems based on emerging storage and memory technologies.
4:00 pm – 4:20 pm | The Conrad Prebys Performing Arts Center
Break
4:20 pm – 5:00 pm | The Conrad Prebys Performing Arts Center – The Baker-Baum Concert Hall
Session 10: Processing in Memory
Chair: TBA
The Bitlet Model: A Deluge of PIM Frameworks and How Analytical Modeling Could Help!
Kunal Korgaonkar (Technion, formerly UC San Diego); Ronny Ronen (Technion); Anupam Chattopadhyay (NTU); Shahar Kvatinsky (Technion);
Speaker: Kunal Korgaonkar, Technion (formerly UC San Diego)
Abstract: Despite the recent resurgence of processing-in-memory (PIM), it is very challenging to analyze and quantify the advantages or disadvantages of new PIM solutions over existing computing paradigms (CPU/GPU). In this work, we motivate an analytical approach to address this issue. Our proposed analytical model, called Bitlet, is used to compare the affinity of algorithms to PIM as opposed to traditional CPU/GPU computing. The model abstracts PIM implementation details through the parameterization of architecture- and technology-specific factors. In spite of its simplicity, the Bitlet model has proved useful. Our talk will focus on the need for the Bitlet model, how to use it, and prospects for extending and refining it to further increase its utility.
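A hedged, first-order version of such an analytical comparison (the parameter names and formulas are ours, not Bitlet's): PIM wins when the data movement it avoids outweighs its slower per-operation compute.

    def cpu_time(bits_moved, bandwidth, ops, cpu_rate):
        return bits_moved / bandwidth + ops / cpu_rate   # move data, then compute

    def pim_time(ops, pim_rate):
        return ops / pim_rate                            # compute in place

    bits, ops = 1e12, 1e10
    t_cpu = cpu_time(bits, bandwidth=1e11, ops=ops, cpu_rate=1e11)
    t_pim = pim_time(ops, pim_rate=1e9)
    print(f"CPU {t_cpu:.1f} s vs PIM {t_pim:.1f} s")     # CPU 10.1 s vs PIM 10.0 s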
Speaker bio: Kunal is currently a Postdoctoral Fellow at the Technion (Israel) and holds a PhD from UC San Diego (US). His research focus is on building scalable and cost-effective architectures using emerging memory and storage technologies.
True In-memory Computing with the CRAM: From Technology to Applications
Zamshed I. Chowdhury (University of Minnesota); Salonik Resch (University of Minnesota); Masoud Zabihi (University of Minnesota); Zhengyang Zhao (University of Minnesota); Thomas Peterson (University of Minnesota); Mahendra DC (University of Minnesota); Ulya Karpuzcu (University of Minnesota); Jian-Ping Wang (University of Minnesota); Sachin Sapatnekar (University of Minnesota);
Speaker: Salonik Resch, University of Minnesota
Abstract: Traditional von Neumann computing is breaking down in the era of exploding data volumes, as the overhead of data transfer becomes prohibitive. Instead, it is more energy-efficient to fuse compute capability with the memory where the data reside. Emerging spintronic technologies show remarkable versatility for the tight integration of logic and memory. In this presentation, we introduce Computational RAM (CRAM), a novel high-density, reconfigurable spintronic in-memory substrate, and survey several years of progress in developing the CRAM concept across the system stack, from the device level all the way to applications.
Speaker bio: Salonik Resch is a PhD student at the University of Minnesota. His research includes processing-in-memory, intermittent computing, and quantum computing. He received a bachelor's degree in computer engineering from the University of Minnesota in 2016.
4:20 pm – 5:00 pm | The Conrad Prebys Performing Arts Center – The JAI
Session 11: Security
Chair: TBA
Ensuring Fast Crash Recovery for Secure NVMs
Kazi Abu Zubair (University of Central Florida); Amro Awad (University of Central Florida);
Speaker: Kazi Abu Zubair, University of Central Florida
Abstract: Implementing secure non-volatile memories (NVMs) is challenging, mainly due to the need to persist security metadata along with data. Unlike conventional secure memories, NVM-equipped systems are expected to recover data after crashes, and hence the security metadata must be recoverable as well. While prior work has explored recovery of encryption counters, fewer efforts have focused on recovering integrity-protected systems, in particular on how to recover the Merkle tree. We observe two major challenges here. First, recovering parallelizable integrity trees, e.g., Intel's SGX trees, requires very special handling due to inter-level dependencies. Second, recovery of practical NVM sizes (terabytes are expected) would take hours, while most data centers, cloud systems, intermittent-power devices, and even personal computers are anticipated to recover almost instantly after power restoration; in fact, this is one of the major promises of NVMs. In this paper, we propose Anubis, a novel hardware-only solution that speeds up recovery time by almost 10^7 times (from 8 hours to only 0.03 seconds). Moreover, we propose a novel and elegant way to recover inter-level-dependent trees, as in Intel's SGX. Most importantly, while ensuring recoverability of one of the most challenging integrity-protection schemes, Anubis incurs performance overhead that is only 2% higher than the state-of-the-art scheme, Osiris, which takes hours to recover systems with a general Merkle tree and fails to recover SGX-style trees.
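A toy of the shadow-tracking idea as described in the abstract (our simplification, not Anubis's hardware design): persist the addresses of security-metadata blocks that are still dirty in the cache, so recovery reconstructs only those blocks instead of scanning the whole tree.

    shadow = set()                       # small persistent shadow region
    metadata_cache = {}                  # volatile security-metadata cache

    def update_metadata(addr, value):
        metadata_cache[addr] = value     # fast volatile update
        shadow.add(addr)                 # persist only the address

    def recover():
        return sorted(shadow)            # only these blocks need rebuilding

    update_metadata(0x1000, 7)
    update_metadata(0x2040, 3)
    print([hex(a) for a in recover()])   # ['0x1000', '0x2040']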
Speaker bio: Kazi Abu Zubair is a second-year PhD student in Computer Engineering at the University of Central Florida. He is a member of the Secure and Advanced Computer Architecture (SACA) Research Group led by Dr. Amro Awad. His research focuses on NVM security, crash consistency, and reliability.
ARSENAL: Architecture for Secure Non-Volatile Memories
Shivam Swami (University of Pittsburgh); Kartik Mohanram (University of Pittsburgh);
Speaker: Shivam Swami, University of Pittsburgh, Micron Technology
Abstract: Whereas data persistence in non-volatile memories (NVMs) enables instant data recovery (IDR) in the face of power/system failures, it also exposes NVMs to data confidentiality and integrity attacks. Counter-mode encryption and Merkle tree authentication are established measures to thwart data confidentiality and integrity attacks, respectively, in NVM-based main memories. However, these security mechanisms require high-overhead atomic security-metadata updates on every write-back in order to support IDR in NVM systems; this increases memory traffic and negatively impacts system performance and memory lifetime. Architecture for Secure Non-Volatile Memories (ARSENAL) is an IDR-preserving, low-cost, high-performance security solution that protects NVM systems against data confidentiality and integrity attacks. ARSENAL synergistically integrates (i) Smart Writes for Faster Transactions (SWIFT), a novel technique to reduce the performance overhead of atomic security-metadata updates on every write-back, with (ii) Terminal BMT Updates (TBU), a novel BMT-consistency-preserving technique, to facilitate IDR in the face of power/system failures. Our evaluations show that, on average, ARSENAL improves system performance (measured in IPC) by 2.26× (4×), reduces memory traffic overhead by 1.47× (1.88×), and improves memory lifetime by 2× (3.5×) in comparison to conventional IDR-preserving 64-bit (128-bit) encryption+authentication.
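For background, a textbook counter-mode write path (with SHA-256 standing in for AES; this is the generic mechanism the abstract builds on, not ARSENAL's SWIFT/TBU machinery): the encryption pad depends on (address, counter), so the bumped counter must persist atomically with the data it encrypts.

    import hashlib

    def pad(addr, counter, key=b"secret-key"):
        """SHA-256 as a stand-in PRF for the AES counter-mode pad."""
        return hashlib.sha256(key + addr.to_bytes(8, "little")
                              + counter.to_bytes(8, "little")).digest()

    counters, memory = {}, {}

    def write_line(addr, plaintext32):
        counters[addr] = counters.get(addr, 0) + 1     # bump per-line counter
        memory[addr] = bytes(a ^ b for a, b in
                             zip(plaintext32, pad(addr, counters[addr])))

    def read_line(addr):
        return bytes(a ^ b for a, b in
                     zip(memory[addr], pad(addr, counters[addr])))

    write_line(0x40, b"secret data".ljust(32, b"\0"))
    print(read_line(0x40).rstrip(b"\0"))               # b'secret data'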
Speaker bio: Shivam Swami is an engineer in the Emerging-Memory Media Optimization team at Micron Technology. Prior to joining Micron, Shivam obtained a Ph.D. in the security and reliability of emerging non-volatile memory technologies from the University of Pittsburgh in 2018. His Ph.D. work received the best research poster award at ISVLSI 2019.