MONDAY, MARCH 11
11:50 am – 1:00 pm | Price Center Ballroom West
Lunch
1:00 pm – 1:10 pm | Price Center Ballroom East
Opening Remarks
Chair: Hung-Wei Tseng, UCR
1:10 pm – 2:20 pm | Price Center Ballroom East
Keynote I:
Near Data Compute
Mats Öberg (Marvell)
Speaker: Mats Öberg, Marvell
Abstract: Near data compute is an extension of computational storage that also includes near-memory compute. Recently, standards organizations have been busy completing standards for computational storage: SNIA created a first version of an architectural document in 2022 and an accompanying API in 2023, and NVMe ratified a computational programs command set in early 2024. In addition, there are efforts within the Open Compute community to create specifications around more general near data compute. In this presentation we will provide examples of computational storage implementations and discuss near-memory compute in the context of CXL.
Speaker bio: Mats Öberg is Associate Vice President, Storage DSP and ML Architecture at Marvell, where he leads research and development of signal processing and machine learning for storage products. Mats has led the development of read channels for perpendicular recording and TDMR, as well as optical recording for DVD and Blu-ray. He earned his PhD in Electrical Engineering from the University of California, San Diego.
Chair: Hung-Wei Tseng, UCR
2:40 pm – 4:00 pm | Price Center Ballroom East
Session 1: Memorable Paper Award Finalists
Chair: Heiner Litz
Persistent Processor Architecture
Jianping Zeng (Purdue University); Jungi Jeong (Purdue University); Changhee Jung (Purdue University)
Speaker: Jianping Zeng, Purdue University
Abstract: This paper presents PPA (Persistent Processor Architecture), simple microarchitectural support for lightweight yet performant whole-system persistence. PPA offers fully transparent crash consistency to all sorts of programs across the entire computing stack, including legacy applications, without any source code change or recompilation. As a basis for crash consistency, PPA leverages so-called store integrity, which preserves store operands during program execution, persists them on impending power failure, and replays the stores when power comes back. In particular, PPA realizes store integrity in hardware by keeping the operands in the physical register file (PRF) even after the stores are committed. This store integrity enforcement leads to region-level persistence: whenever the PRF runs out, PPA starts a new region after ensuring that all stores of the prior region have already been written to persistent memory. To minimize pipeline stalls across regions, PPA writes back the stores of each region asynchronously, overlapping their persistence latency with the execution of other instructions in the region. Experimental results with 41 applications from SPEC CPU2006/2017, SPLASH3, STAMP, WHISPER, and DOE Mini-apps show that PPA incurs only a 2% average run-time overhead and a 0.005% area cost, while the state-of-the-art suffers a 26% overhead along with prohibitively high hardware and energy costs.
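To make the region mechanism concrete, here is a minimal Python sketch of store integrity and region-level persistence; the PRF size, the synchronous flush (which PPA actually overlaps with execution), and all names are illustrative assumptions, not the paper's microarchitecture:

```python
# Toy model of PPA-style region-level persistence (illustrative only).

PRF_SLOTS = 8  # free physical registers available to hold store operands

class RegionPersistence:
    def __init__(self):
        self.nvm = {}             # persistent memory (survives "power failure")
        self.region_stores = []   # store operands held in the PRF this region

    def store(self, addr, value):
        if len(self.region_stores) == PRF_SLOTS:
            self._close_region()              # PRF exhausted: start a new region
        self.region_stores.append((addr, value))

    def _close_region(self):
        # In PPA this write-back overlaps with execution of the next region;
        # here we simply drain the buffered stores synchronously.
        for addr, value in self.region_stores:
            self.nvm[addr] = value
        self.region_stores.clear()

    def power_failure(self):
        # Residual energy persists the operands still held in the PRF, so the
        # stores of the in-flight region can be replayed on reboot.
        self._close_region()

mem = RegionPersistence()
for i in range(20):
    mem.store(i, i * i)
mem.power_failure()
assert mem.nvm[19] == 361   # every committed store survived the crash
```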
Speaker bio: Jianping Zeng is a final-year PhD candidate advised by Prof. Changhee Jung in the Department of Computer Science at Purdue University. His research interests lie in designing more reliable and performant computing systems, including server-class and energy-harvesting systems, that are resilient against soft errors and power failures. To that end, he co-designs compilers and architecture to maintain minimal hardware complexity while achieving high performance. His work has been published at top-tier systems venues, e.g., PLDI, MICRO, HPDC, ISCA, and RTSS.
RackBlox: A Software-Defined Rack-Scale Storage System with Network-Storage Co-Design
Benjamin Reidys (University of Illinois Urbana-Champaign); Yuqi Xue (University of Illinois Urbana-Champaign); Daixuan Li (University of Illinois Urbana-Champaign); Bharat Sukhwani (IBM Research); Wen-mei Hwu (University of Illinois Urbana-Champaign; NVIDIA Research); Deming Chen (University of Illinois Urbana-Champaign); Sameh Asaad (IBM Research); Jian Huang (University of Illinois Urbana-Champaign)
Speaker: Benjamin Reidys, University of Illinois Urbana-Champaign
Abstract: Software-defined networking (SDN) and software-defined flash (SDF) have been serving as the backbone of modern data centers. They are managed separately to handle I/O requests. At first glance, this is a reasonable design that follows rack-scale hierarchical design principles. However, it suffers from suboptimal end-to-end performance due to the lack of coordination between SDN and SDF. In this paper, we co-design the SDN and SDF stacks by redefining the functions of their control plane and data plane, and splitting them up within a new architecture named RackBlox. RackBlox has three major components: (1) coordinated I/O scheduling, which coordinates I/O scheduling across the network and storage stacks to achieve predictable end-to-end performance; (2) coordinated garbage collection (GC), which coordinates GC activities across the SSDs in a rack to minimize their impact on incoming I/O requests; and (3) rack-scale wear leveling, which enables global wear leveling among SSDs in a rack by periodically swapping data, improving device lifetime for the entire rack. We implement RackBlox using programmable SSDs and a programmable switch. Our experiments demonstrate that RackBlox can reduce the tail latency of I/O requests by up to 5.8× over state-of-the-art rack-scale storage systems.
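A toy illustration of the coordinated-GC idea, with a hypothetical switch-side scheduler and made-up latencies standing in for RackBlox's programmable-switch implementation:

```python
# Sketch: steer replicated reads away from SSDs in a GC window (assumptions).

import random

class SSD:
    def __init__(self, name):
        self.name = name
        self.in_gc = False
    def latency_us(self):
        return 900 if self.in_gc else 80   # GC inflates tail latency

replicas = [SSD("ssd0"), SSD("ssd1"), SSD("ssd2")]
replicas[0].in_gc = True    # the rack coordinator staggers GC windows

def schedule_read(replicas):
    # Prefer any replica not in a GC window; fall back if all are busy.
    idle = [s for s in replicas if not s.in_gc]
    return random.choice(idle or replicas)

target = schedule_read(replicas)
print(f"read served by {target.name} at ~{target.latency_us()}us")
```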
Speaker bio: Benjamin Reidys is a fourth-year Ph.D. student at the University of Illinois Urbana-Champaign. His research incorporates ideas from operating systems, networking, and machine learning to improve the performance and usability of memory and storage systems. Currently, he is focused on improving software-defined data centers through network/storage co-design.
Optimal Shaping Codes for a TLC Flash Memory
Simeng Zheng (University of California, San Diego); Andrew Tan (University of California, San Diego); Carolina Fernández (University of California, San Diego); Ismael González Valenzuela (University of California, San Diego); Paul Siegel (University of California, San Diego)
Speaker: Andrew Tan, University of California, San Diego
Abstract: Shaping codes are distribution-matching codes that are useful for coding over communication and storage channels with symbol costs and cost constraints. For example, the durability of a flash memory device can be quantified using wear costs associated with coded symbols, and in previous literature, shaping codes have been successfully applied to yield a notable lifetime gain for SLC (1 bit/cell) and MLC (2 bits/cell) flash devices. To account for the trend toward increased cell bit-density in flash, we formulate an optimal shaping scheme for TLC flash memories (3 bits/cell). The design procedure includes wear cost estimation for the 8 programmed levels, construction of an optimal shaping code using a concatenation of data compression and Varn coding to minimize the average wear cost with no overall expansion factor, and experimental performance evaluation on a 1x-nm TLC flash memory. Experimental results show a 10x improvement in chip endurance.
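The wear-cost arithmetic behind shaping is easy to illustrate. In the sketch below the per-level costs are invented numbers, and the skewed level distribution stands in for what the compression-plus-Varn-coding pipeline achieves without expansion:

```python
# Toy average-wear-cost calculation for TLC shaping (all costs are assumed).

from collections import Counter

wear_cost = [0.0, 0.2, 0.4, 0.7, 1.0, 1.4, 1.9, 2.5]  # cost of levels 0..7

def avg_cost(levels):
    counts = Counter(levels)
    n = len(levels)
    return sum(wear_cost[lvl] * c / n for lvl, c in counts.items())

uniform = [i % 8 for i in range(8000)]          # unshaped data: all levels equally likely
shaped  = [0] * 5000 + [1] * 2000 + [2] * 1000  # shaping skews mass toward cheap levels

print(f"uniform data : {avg_cost(uniform):.3f} cost/cell")
print(f"shaped data  : {avg_cost(shaped):.3f} cost/cell")
```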
Speaker bio: Andrew is a third-year Ph.D. student in the Electrical and Computer Engineering Department at UCSD. He received his B.S. in Electrical, Electronics, and Communications Engineering from UCLA in 2019. His research is in the area of coding theory for DNA-based storage.
Read-and-Run Constrained Coding for Modern Flash Memories
Ahmed Hareedy (Middle East Technical University); Simeng Zheng (University of California, San Diego); Paul Siegel (University of California, San Diego); Robert Calderbank (Duke University)
Speaker: Simeng Zheng, University of California, San Diego
Abstract: Constrained coding is used in Flash devices to increase reliability by mitigating inter-cell interference that stems from charge propagation among cells. In this project, we suggest new constrained coding schemes that have low complexity and preserve the desirable high access speed in modern Flash devices. The idea is to eliminate error-prone patterns by coding data either only on the left-most page (binary coding) or only on the two left-most pages (4-ary coding) while leaving data on all the remaining pages uncoded. Since the proposed schemes enable the separation of pages, except the two left-most pages in the case of 4-ary coding, we refer to them as read-and-run (RR) constrained coding schemes. We analyze the new RR coding schemes and discuss their impact on the probability of occurrence of different charge levels. We also demonstrate the performance improvement achieved via RR coding on a 1X-nm triple-level cell (TLC) Flash memory.
Speaker bio: Simeng is a fifth-year Ph.D. student from the UCSD ECE department. He currently works with Prof. Paul H. Siegel on machine learning techniques for data storage systems.
4:20 pm – 5:40 pm | Price Center Ballroom East
4:20 pm – 6:00 pm | Price Center Ballroom West
Systems Session I: Persistency & Consistency
Chair: Jian Huang
SpecPMT: Speculative Logging for Resolving Crash Consistency Overhead of Persistent Memory
Chencheng Ye (Huazhong University of Science and Technology); Yuanchao Xu (University of California, Santa Cruz); Xipeng Shen (North Carolina State University); Yan Sha (Huazhong University of Science and Technology); Xiaofei Liao (Huazhong University of Science and Technology); Hai Jin (Huazhong University of Science and Technology); Yan Solihin (University of Central Florida)
Speaker: Naveed Ul Mustafa, UCF
Abstract: Crash consistency overhead is a long-standing barrier to the adoption of byte-addressable persistent memory in practice. Despite continuous progress, persistent transactions for crash consistency still incur a 5.6X slowdown, making persistent memory prohibitively costly in practical settings. This paper introduces speculative logging, a new method that forgoes most memory fences and reduces data persistence overhead by logging data values early. This technique enables a novel persistent transaction model, speculatively persistent memory transactions (SpecPMT). Our evaluation shows that SpecPMT reduces the execution time overhead of persistent transactions substantially, to just 10%.
Highly-Efficient Persistent FIFO Queues
Panagiota Fatourou (Institute of Computer Science, Foundation for Research and Technology - Hellas, and Computer Science Department, University of Crete); Nikos Giachoudis (Institute of Computer Science, Foundation for Research and Technology - Hellas); George Mallis (Institute of Computer Science, Foundation for Research and Technology - Hellas, and Computer Science Department, University of Crete)
Abstract: We study whether the techniques that state-of-the-art concurrent algorithms employ, in a conventional system, to avoid contended hot spots are still efficient for recoverable computing in settings with NVMs. We focus on concurrent queues, which have two highly contended end-points, the head and the tail. We present a persistent queue implementation that performs a single pair of persistence instructions (e.g., pwb-psync) per operation (enqueue or dequeue). Moreover, the algorithm manages to perform these instructions on variables of low contention by employing Fetch&Increment and building on the state-of-the-art queue implementation by Afek and Morrison (PPoPP'13). The result is performance that is up to 2x faster than state-of-the-art persistent queue implementations.
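The single pwb-psync pair per operation can be sketched as follows; pwb/psync follow the abstract's naming, while the slot array and sequential driver drastically simplify the actual lock-free algorithm:

```python
# Sketch: each enqueue claims a private slot with Fetch&Increment, writes into
# it, and persists only that slot, so the persisted location is uncontended.

import itertools

SLOTS = [None] * 1024
_tail = itertools.count()          # atomic Fetch&Increment in real hardware

def pwb(slot):  pass               # write-back of one cache line (simulated)
def psync():    pass               # drain the persist buffer (simulated)

def enqueue(value):
    slot = next(_tail)             # F&I: each thread gets a distinct slot
    SLOTS[slot] = value
    pwb(slot); psync()             # the single persistence pair per operation

for v in ["a", "b", "c"]:
    enqueue(v)
print(SLOTS[:3])
```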
Detecting and Repairing Bugs in Persistent Concurrent Programs
Tooba Khan (University of Southern California); Srivatsan Ravi (University of Southern California); Chao Wang (University of Southern California)
Speaker: Tooba Khan, University of Southern California
Abstract: The availability of byte-addressable Persistent Memory (PM) has sparked increased efforts towards designing persistent concurrent data structures capable of recovering from system crashes due to power failures. For concurrent programs in particular, it can be challenging for programmers to ensure proper data persistence, which is crucial for maintaining consistency and guaranteeing the correct order of commits. Our work aims to find missing CLFLUSHOPT instructions (which efficiently flush cache lines) and SFENCE instructions (which enforce the correct ordering of stores to PM). While there are attempts to address this PM debugging challenge, they all rely on programmers to write and debug software; furthermore, there is little work on debugging automation. To overcome this challenge, we propose a trace-based analysis method for detecting and repairing PM bugs. Given an execution trace and a desired property, our method first generates a symbolic formula Φ_program ∧ ¬Φ_property to encode the persistency-related program behaviors and check them against the property. It guarantees that the symbolic formula is satisfiable if and only if there exists a PM bug. Our method also formulates and solves the repair problem similarly, using an off-the-shelf constraint solver.
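A minimal sketch of the satisfiability check using the Z3 Python bindings (pip install z3-solver); the two-store trace and single ordering property are invented for illustration, and the paper's encoding covers full traces and automated repair:

```python
from z3 import Bool, Bools, Solver, Implies, Not, sat

pA, pB = Bools("persisted_A persisted_B")  # did each store reach PM before the crash?
fenced = Bool("fence_present")             # is there an SFENCE between their flushes?

# Phi_program: with the fence, B persisting implies A persisted first;
# without it, the stores may persist out of order.
phi_program = Implies(fenced, Implies(pB, pA))

# Phi_property: crash consistency requires that B persists only after A.
phi_property = Implies(pB, pA)

s = Solver()
s.add(phi_program, Not(phi_property))      # satisfiable iff a PM bug exists
s.add(Not(fenced))                         # the trace under test omits the SFENCE
if s.check() == sat:
    print("PM bug witness:", s.model())    # repair: insert the missing SFENCE
```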
Speaker bio: Tooba Khan is a PhD student at the University of Southern California. She completed her Master's degree at the Indian Institute of Technology Delhi, India. Her work focuses on concurrent systems and non-volatile memory. She is particularly interested in applying formal methods to solve challenges in the fields of concurrency and non-volatile memories.
Hardware Support for Durable Atomic Instructions for Persistent Parallel Programming
Khan Shaikhul Hadi (University of Central Florida); Naveed Ul Mustafa (Department of Computer Science, University of Central Florida); Mark Heinrich (University of Central Florida College of Engineering and Computer Science); Yan Solihin (University of Central Florida)
Speaker: Khan Shaikhul Hadi, ARPERS, UCF
Abstract: Persistent memory is emerging as an attractive main memory fabric capable of hosting persistent data. However, its programmability is hampered by the lack of persistent synchronization primitives. Atomic instructions are immensely useful for higher-level synchronization (locks and barriers) and for supporting lock-free data structures, but they have no durable/persistent version. In this paper, we propose a new approach to solve the problem: durable atomic instructions (DAIs). We show that DAIs can be supported with minor hardware support (low-cost modifications to the cache coherence protocol), and simultaneously achieve high performance, scalability, and crash consistency.
Speaker bio: Khan Shaikhul Hadi is a 4th-year PhD student in Computer Science at the University of Central Florida. His research focuses on crash consistency and data recovery in persistent memory systems. As a graduate research assistant, he explores solutions to ensure data integrity and accessibility for persistent systems in the face of system crashes.
Potpourri Session
Chair:
Analysis and Designs of Analog ECC
Anxiao Jiang (Texas A&M University); Xiangwu Zuo (Texas A&M University)
Speaker: Anxiao Jiang, Texas A&M University
Abstract: Nonvolatile resistive memories (e.g., memristors and phase-change memories) have become an important part of neuromorphic computing. In particular, crossbars of such nanoscale memories have been essential for realizing vector-matrix multiplications in analog circuits, which are widely used operations in deep neural networks. Analog error-correcting codes (Analog ECCs) have been proposed to make the operations more reliable. This paper explores the analysis and constructions of Analog ECCs in multiple ways. It presents a linear-programming-based algorithm that computes the m-heights of Analog ECCs efficiently, which can be used to determine the error correction/detection capabilities of the codes. It presents a family of Analog ECCs based on permutations, and proves that the time complexity for determining the m-heights of such codes can be further reduced substantially. It then presents a number of newly discovered codes, which achieve state-of-the-art performance.
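For intuition, the m-height of a codeword is the ratio of its largest coordinate magnitude to its (m+1)-th largest, and the code's m-height is the supremum over nonzero codewords. The sketch below estimates it by brute-force sampling over a toy generator matrix; the paper's contribution is computing it exactly and efficiently via linear programming, which this does not reproduce:

```python
import numpy as np

G = np.array([[1.0, 0.0, 1.0, -2.0],
              [0.0, 1.0, 1.0,  1.0]])    # toy [4,2] analog code (assumption)

def m_height(c, m):
    mags = np.sort(np.abs(c))[::-1]      # coordinate magnitudes, descending
    return np.inf if mags[m] == 0 else mags[0] / mags[m]

rng = np.random.default_rng(0)
est = 0.0
for _ in range(20_000):                  # sample message vectors: a crude lower bound
    c = rng.standard_normal(2) @ G
    est = max(est, m_height(c, m=1))
print(f"estimated 1-height >= {est:.3f}")
```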
Speaker bio: Anxiao (Andrew) Jiang received the B.Sc. degree in electronic engineering from Tsinghua University, Beijing, China in 1999, and the M.Sc. and Ph.D. degrees in electrical engineering from the California Institute of Technology, Pasadena, California, USA in 2000 and 2004, respectively. He is currently a Professor in the Computer Science and Engineering Department at Texas A&M University in College Station, Texas, USA. He has been a visiting professor or associate at California Institute of Technology, University of California in San Diego and Ecole Polytechnique Federale de Lausanne (EPFL), and a consulting researcher at HP Labs, EMC and Microsoft Research. His research interests include information theory and coding theory, data storage and machine learning. Prof. Jiang is a recipient of the NSF CAREER Award in 2008 for his research on information theory for flash memories, the 2009 IEEE Communications Society Data Storage Technical Committee (DSTC) Best Paper Award in Signal Processing and Coding for Data Storage, and the 2020 Non-Volatile Memories Workshop (NVMW) Persistent Impact Prize in information theory and coding.
ECC-Map: A Resilient Wear-Leveled Memory-Device Architecture with Low Mapping Overhead
Natan Peled (Technion); Yuval Cassuto (Technion)
Speaker: Yuval Cassuto, Technion - Israel Institute of Technology
Abstract: New non-volatile memory technologies show great promise for extending the memory hierarchy, but have limited endurance that needs to be mitigated toward their reliable use closer to the processor. Wear leveling is a common technique for prolonging the life of endurance-limited memory, where existing wear-leveling approaches either employ costly full-indirection mapping between logical and physical addresses, or choose simple mappings that cannot cope with extremely unbalanced write workloads. In this work, we propose ECC-Map, a new wear-leveling device architecture that can level even the most unbalanced and adversarial workloads, while enjoying low mapping complexity compared to full indirection. ECC-Map is evaluated on common synthetic workloads, and is shown to significantly outperform existing wear-leveling architectures.
Speaker bio: Yuval Cassuto is a Professor at the Viterbi Department of Electrical and Computer Engineering, Technion – Israel Institute of Technology. His research interests lie at the intersection of the theoretical information sciences and the engineering of practical computing and storage systems. He has served on the technical program committees of leading conferences in both theory and systems. During 2010–2011 he was a Scientist at EPFL, the Swiss Federal Institute of Technology in Lausanne. From 2008 to 2010 he was a Research Staff Member at Hitachi Global Storage Technologies, San Jose Research Center. In 2018–2019 he held a Visiting Professor position at Western Digital Research, and a Visiting Scholar position at UC Berkeley. He received the B.Sc. degree in Electrical Engineering, summa cum laude, from the Technion in 2001, and the M.S. and Ph.D. degrees in Electrical Engineering from the California Institute of Technology in 2004 and 2008, respectively. From 2000 to 2002, he was with Qualcomm, Israel R&D Center, where he worked on modeling, design, and analysis in wireless communications. Dr. Cassuto won the Best Student Paper Award in data storage from the IEEE Communications Society in 2010 as a student, and in 2019 as an adviser. He has also won faculty awards from Qualcomm, Intel, and Western Digital. As an undergraduate student, he won the 2001 Texas Instruments DSP and Analog Challenge $100,000 prize.
Neural Decoder for Analog ECC
Xiangwu Zuo (Texas A&M University); Anxiao Jiang (Texas A&M University)
Speaker: Anxiao Jiang, Texas A&M University
Abstract: Non-volatile memories (NVMs) have been pivotal in implementing deep neural networks in analog circuits. Analog error-correcting codes (Analog ECCs) have been proposed to make their computation more reliable. Although a number of Analog ECCs have already been designed, how to develop decoders for them remains largely unknown. The only decoder known so far is for the analog [n, 1] repetition code. This work explores the design of neural networks as decoders for Analog ECCs. Principled approaches are used for decoding, including locating errors and using regression to find the values of errors. An ensemble method is used to improve the accuracy of error locating, and a transformer-based model is shown to achieve good regression performance (e.g., increasing the signal-to-noise ratio by over 10 or 20 dB).
Speaker bio: Anxiao (Andrew) Jiang received the B.Sc. degree in electronic engineering from Tsinghua University, Beijing, China in 1999, and the M.Sc. and Ph.D. degrees in electrical engineering from the California Institute of Technology, Pasadena, California, USA in 2000 and 2004, respectively. He is currently a Professor in the Computer Science and Engineering Department at Texas A&M University in College Station, Texas, USA. He has been a visiting professor or associate at California Institute of Technology, University of California in San Diego and Ecole Polytechnique Federale de Lausanne (EPFL), and a consulting researcher at HP Labs, EMC and Microsoft Research. His research interests include information theory and coding theory, data storage and machine learning. Prof. Jiang is a recipient of the NSF CAREER Award in 2008 for his research on information theory for flash memories, the 2009 IEEE Communications Society Data Storage Technical Committee (DSTC) Best Paper Award in Signal Processing and Coding for Data Storage, and the 2020 Non-Volatile Memories Workshop (NVMW) Persistent Impact Prize in information theory and coding.
Memory-Awareness: Boosting NVM Energy Efficiency and Endurance
Saeed Kargar (University of California, Santa Cruz); Faisal Nawab (University of California, Irvine)
Speaker: Saeed Kargar, University of California Santa Cruz
Abstract: Non-volatile memory (NVM) technologies find extensive use in data storage solutions and battery-powered mobile and IoT devices. Within this domain, wear-out and energy efficiency pose critical challenges for NVM utilization. This paper explores these challenges, detailing why previous methodologies have fallen short of attaining optimal efficiency when addressing NVM limitations. To overcome these hurdles, we introduce a software-level memory-aware solution, Hamming-Tree, designed to strategically select the memory segment where a write operation is applied, effectively mitigating these challenges [12]. Hamming-Tree significantly outperforms existing state-of-the-art solutions. The approach is adaptable to various indexing data structures, including trees and linked lists. Moreover, it not only complements existing indexing methods but can also synergize with preceding hardware-based solutions to further enhance efficiency. Real-world evaluations conducted on an Optane memory device demonstrate that our proposed solution achieves a reduction of up to 67.8% in energy consumption.
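The energy- and wear-saving intuition is that writing to a segment whose stale bits already resemble the new value flips fewer cells. A minimal sketch of such a memory-aware placement decision follows; the flat slot table and 16-bit words are assumptions, not Hamming-Tree's actual structure:

```python
def bit_flips(old, new):
    """Cells that must change state when overwriting old with new."""
    return bin(old ^ new).count("1")

slots = {0x00: 0b1010_1010_1010_1010,   # stale contents of candidate slots
         0x10: 0b1111_0000_1111_0000,
         0x20: 0b0000_0000_0000_0000}

new_value = 0b1111_0000_1110_0001

best = min(slots, key=lambda a: bit_flips(slots[a], new_value))
worst = max(bit_flips(v, new_value) for v in slots.values())
print(f"write 0x{new_value:04x} to slot 0x{best:02x}: "
      f"{bit_flips(slots[best], new_value)} flips instead of {worst}")
```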
Speaker bio: Saeed Kargar is a 6th-year Ph.D. student in the Computer Science Department at UC Santa Cruz, advised by Faisal Nawab. His research interests broadly cover machine learning for systems, big data management systems, Non-Volatile Memory (NVM) technology, and in-memory database systems.
Telepathic Datacenters: Efficient and High-Performance RPCs using Shared CXL Memory
Suyash Mahar (University of California, San Diego); Ehsan Hajyasini (University of California, San Diego); Seungjin Lee (University of California, San Diego); Zifeng Zhang (University of California, San Diego); Mingyao Shen (University of California, San Diego); Steven Swanson (University of California, San Diego)
Speaker: Suyash Mahar, UC San Diego
Abstract: Datacenters require fast, efficient, and secure communication. However, today's popular choice, remote procedure calls (RPCs), is inefficient and slow, as RPCs require expensive serialization and compression. To solve this problem, we present RPCool, an RPC framework that exploits the shared-memory nature of the upcoming CXL 3.0 standard. RPCool passes references to data rather than copying the data itself. Further, RPCool provides several security features and automatic fallback to RDMA. Finally, we compare RPCool against Unix domain sockets and ThriftRPC and find that RPCool outperforms both.
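The core reference-passing idea can be sketched with POSIX shared memory standing in for a shared CXL 3.0 region; the names and sizes are illustrative, and RPCool's channel management and security checks are omitted:

```python
# Payload is written once into a shared region; only a small (offset, length)
# descriptor crosses the channel, so no serialization or copy is needed.

from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=4096, name="rpc_pool")

# --- caller side: place the argument in the pool, send only a reference ----
payload = b"SELECT * FROM sensors WHERE ts > 1710115200"
shm.buf[0:len(payload)] = payload
descriptor = (0, len(payload))            # this is all the RPC carries

# --- callee side: dereference without deserializing ------------------------
off, length = descriptor
view = bytes(shm.buf[off:off + length])   # RPCool keeps this zero-copy
print("server received:", view.decode())

shm.close(); shm.unlink()
```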
Speaker bio: Suyash Mahar is a fourth-year Ph.D. student at UC San Diego interested in datacenter memory efficiency. He is advised by Prof. Steven Swanson. In addition to his PhD thesis, he has worked with Google, Meta, and Intel on datacenter efficiency, studying their memory hierarchies and acceleration opportunities. Before starting his Ph.D. program, he worked on the architecture and safety of persistent memories at the University of Virginia, CMU, and the Technion. His work on memory systems has appeared in EuroSys, ASPLOS, PACT, and ICCD.
6:00 pm | Shuttle will be provided to the Banquet
Banquet: Mister A’s
6:30 pm
TUESDAY, MARCH 12
8:00 am – 8:50 am | Price Center Ballroom East
Breakfast
9:00 am – 10:10 am | Price Center Ballroom East
Keynote II:
Coding Theory for Analog AI Computing
Anxiao Jiang (Texas A&M University)
Speaker: Anxiao (Andrew) Jiang, Texas A&M University
Abstract: AI has great potential for humanity's future. But to realize that potential, AI computing needs to become much more efficient. Analog in-memory computing can potentially improve the speed and energy efficiency of AI by multiple orders of magnitude and break the "memory wall" that is currently a major bottleneck for AI. In this talk we introduce a new coding theory for analog in-memory computing (pioneered by Ron M. Roth), which we call "quantized-analog error-correcting codes (QA-ECC)". The codes focus on correcting errors in vector-matrix multiplications, which are a dominant part of the computation in deep neural networks. The codes treat small but ubiquitous noise as tolerable, and focus on the correction of large errors. We introduce the theoretical foundations for the codes, as well as their practical constructions. We further discuss their integration with AI systems, along with extended topics that are closely tied to nonvolatile memories. The talk focuses on a clear explanation of the key concepts and aims at enhancing collaboration in this field.
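As a point of reference for the error model, a classical ABFT-style checksum for a vector-matrix multiply already separates tolerable small noise from large errors; the sketch below only illustrates that distinction and is not Roth's QA-ECC construction:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 6))            # weights stored in a crossbar
w_check = W.sum(axis=1)                    # extra checksum column, programmed with W
x = rng.standard_normal(4)

y = x @ W + rng.normal(0.0, 1e-3, size=6)  # analog VMM with ubiquitous small noise
y_check = x @ w_check                      # computed through the same crossbar
y[2] += 0.8                                # inject one large computation error

residual = abs(y.sum() - y_check)
THRESHOLD = 0.05                           # tolerate small noise, flag large errors
print("large error detected" if residual > THRESHOLD else "output accepted")
```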
Speaker bio: Anxiao (Andrew) Jiang received the B.Sc. degree in electronic engineering from Tsinghua University, Beijing, China in 1999, and the M.Sc. and Ph.D. degrees in electrical engineering from the California Institute of Technology, Pasadena, California, USA in 2000 and 2004, respectively. He is currently a Professor in the Computer Science and Engineering Department at Texas A&M University in College Station, Texas, USA. He has been a visiting professor or associate at California Institute of Technology, University of California in San Diego and Ecole Polytechnique Federale de Lausanne (EPFL), and a consulting researcher at HP Labs, EMC and Microsoft Research. His research interests include information theory and coding theory, data storage and machine learning. Prof. Jiang is a recipient of the NSF CAREER Award in 2008 for his research on information theory for flash memories, the 2009 IEEE Communications Society Data Storage Technical Committee (DSTC) Best Paper Award in Signal Processing and Coding for Data Storage, and the 2020 Non-Volatile Memories Workshop (NVMW) Persistent Impact Prize in information theory and coding.
Chair: Eitan Yaakobi
10:30 am – 11:50 am | Price Center Ballroom East
10:30 am – 11:50 am | Price Center Ballroom West
Systems Session II: ML
Chair: Matias Bjorling
Swapping-Centric Neural Recording Systems
Muhammed Ugur (Yale University); Raghavendra Pradyumna Pothukuchi (Yale University); Abhishek Bhattacharjee (Yale University)
Speaker: Muhammed Ugur, Yale University
Abstract: Neural interfaces read the activity of biological neurons to help advance the neurosciences and offer treatment options for severe neurological diseases. The total number of neurons that are now being recorded using multi-electrode interfaces is doubling roughly every 4-6 years. However, processing this exponentially-growing data in real time under strict power constraints puts an exorbitant amount of pressure on both compute and storage within traditional neural recording systems. Existing systems deploy various accelerators for better performance-per-watt while also integrating NVMs for data querying and better treatment decisions. These accelerators have direct access to a limited amount of fast SRAM-based memory that is unable to manage the growing data rates. Swapping to the NVM becomes inevitable; however, naive approaches are unable to complete during the refractory period of a neuron, i.e., a few milliseconds, which disrupts timely disease treatment. We propose co-designing accelerators and storage, with swapping as a primary design goal, using theoretical and practical models of compute and storage respectively to overcome these limitations.
Betty: Enabling Large-Scale GNN Training with Batch-Level Graph Partitioning and Tiered Memory
Shuangyan Yang (University of California, Merced); Minjia Zhang (Microsoft Research); Dong Li (University of California, Merced)
Speaker: Shuangyan Yang, University of California, Merced
Abstract: The Graph Neural Network (GNN) is showing outstanding results in improving the performance of graph-based applications. Recent studies demonstrate that GNN performance can be boosted by using more advanced aggregators, deeper aggregation depths, larger sampling rates, etc. While leading to promising results, the improvements come at the cost of a significantly increased memory footprint, easily exceeding GPU memory capacity. In this paper, we introduce a method, Betty, to make GNN training more scalable and accessible via batch-level partitioning. Unlike DNN training, a mini-batch in GNN has complex dependencies between input features and output labels, making batch-level partitioning difficult. Betty introduces two novel techniques, redundancy-embedded graph (REG) partitioning and memory-aware partitioning, to effectively mitigate redundancy and load-imbalance issues across the partitions. Our evaluation on large-scale real-world datasets shows that Betty can significantly mitigate the memory bottleneck, extending GNN training to much deeper aggregation depths, larger sampling rates, larger training batch sizes, and more advanced aggregators on a single GPU.
Speaker bio: Shuangyan Yang is pursuing a Ph.D. at the University of California, Merced, with research interests in systems, HPC, and deep learning. The current focus is on training graph neural networks on large memory systems.
Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations
Haoyang Zhang (University of Illinois Urbana-Champaign); Yirui Eric Zhou (University of Illinois Urbana-Champaign); Yuqi Xue (University of Illinois Urbana-Champaign); Yiqi Liu (University of Illinois Urbana-Champaign); Jian Huang (University of Illinois Urbana-Champaign)
Abstract: To break the GPU memory wall for scaling deep learning workloads, a variety of architecture and system techniques have been proposed recently. Their typical approaches include memory extension with flash memory and direct storage access. However, these techniques still suffer from suboptimal performance and introduce complexity to GPU memory management, making it hard to meet the scalability requirements of deep learning workloads today. In this work, we present a unified GPU memory and storage architecture named G10, driven by the fact that the tensor behaviors of deep learning workloads are highly predictable. G10 integrates the host memory, GPU memory, and flash memory into a unified memory space, scaling the GPU memory capacity while enabling transparent data migrations. Based on this unified GPU memory and storage architecture, G10 utilizes compiler techniques to characterize the tensor behaviors in deep learning workloads. This allows G10 to schedule data migrations in advance by considering the available bandwidth of flash memory and host memory. The cooperative mechanism between deep learning compilers and the unified memory architecture enables G10 to hide data transfer overheads transparently. We implement G10 based on an open-source GPU simulator. Our experiments demonstrate that G10 outperforms state-of-the-art GPU memory solutions by up to 1.75×, without code modifications to deep learning workloads. With the smart data migration mechanism, G10 can reach 90.3% of the performance of the ideal case assuming unlimited GPU memory.
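The migration planning that predictable tensor lifetimes enable can be sketched in a few lines; the bandwidth, tensor profiles, and 2x-gap heuristic are invented for illustration and are not G10's actual scheduling policy:

```python
FLASH_BW_GBPS = 4.0

tensors = [  # (name, size_GB, last_use_ms, next_use_ms) from compiler profiling
    ("act3",  2.0, 10.0, 1500.0),
    ("grad7", 1.0, 40.0,   55.0),
    ("wgt1",  4.0, 20.0, 2500.0),
]

for name, size_gb, last_use, next_use in tensors:
    xfer_ms = size_gb / FLASH_BW_GBPS * 1000.0   # time to bring the tensor back
    idle_ms = next_use - last_use
    if idle_ms > 2.0 * xfer_ms:                  # the idle gap can hide both transfers
        print(f"{name}: evict at {last_use:.0f}ms, "
              f"prefetch at {next_use - xfer_ms:.0f}ms")
    else:
        print(f"{name}: keep resident (idle gap too short)")
```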
Enabling Large Dynamic Neural Network Training with Learning-based Memory Management on Tiered Memory
Jie Ren (College of William and Mary); Dong Xu (University of California, Merced); Shuangyan Yang (University of California, Merced); Jiacheng Zhao (Institute of Computing Technology, Chinese Academy of Sciences); Zhicheng Li (Institute of Computing Technology, Chinese Academy of Sciences); Christian Navasca (UCLA); Chenxi Wang (Chinese Academy of Sciences); Harry Xu (UCLA); Dong Li (University of California, Merced)
Speaker: Shuangyan Yang, University of California, Merced
Abstract: Dynamic neural networks (DyNNs) enable high computational efficiency and strong representation capability. However, training a DyNN can face a memory capacity problem because of increasing model size or limited GPU memory capacity. Managing tensors to save GPU memory is challenging because of the dynamic structure of DyNNs. We present DyNN-offload, a memory management system to train DyNNs. DyNN-offload uses a learned approach (a neural network called the pilot model) to increase the predictability of tensor accesses and facilitate memory management. The key to DyNN-offload is enabling fast inference of the pilot model in order to reduce its performance overhead, while providing high inference (or prediction) accuracy. DyNN-offload reduces the input feature space and model complexity of the pilot model based on a new representation of DyNNs; it converts the hard problem of making predictions for individual operators into a simpler problem of making predictions for groups of operators. DyNN-offload enables 8× larger DyNN training on a single GPU compared with using PyTorch alone (unprecedented with any existing solution). Evaluating with AlphaFold (a production-level, large-scale DyNN), we show that DyNN-offload outperforms unified virtual memory (UVM) and dynamic tensor rematerialization (DTR), the most advanced solutions for saving GPU memory for DyNNs, by 3× and 2.1× respectively in terms of maximum batch size.
Speaker bio: Shuangyan Yang is pursuing a Ph.D. at the University of California, Merced, with research interests in systems, HPC, and deep learning. The current focus is on training graph neural networks on large memory systems.
Systems Session III: Compute in Storage
Chair: Joseph Izraelevitz
Rethinking Programming Frameworks for In-Storage Processing
Yu-Chia Liu (University of California, Riverside); Kuan-Chieh Hsu (University of California, Riverside); Hung-Wei Tseng (University of California, Riverside)
Abstract: In-storage processing (ISP) is the most commercialized implementation of the near-data processing (NDP) model that executes tasks near their storage locations. However, programming ISP devices is complicated, as it requires programmers to work closely with the underlying hardware, and even highly-optimized code can easily lead to suboptimal performance. This paper introduces ActivePy. ActivePy makes the programmer completely agnostic to ISP hardware. ActivePy automatically, transparently, and dynamically generates high-performance code to balance system-wide trade-offs and maximize the benefits of ISP. Our real-system implementation shows that ActivePy can use ISP as efficiently as conventional C-based frameworks.
CS-Assist: A Tool to Assist Computational Storage Device Offload
Lokesh N. Jaliminche (University of California, Santa Cruz); Yangwook Kang (Samsung Semiconductor, Inc., USA); Changho Choi (Samsung Semiconductor, Inc., USA); Pankaj Mehra (Elephance Memory, Inc.); Heiner Litz (University of California, Santa Cruz)
Speaker: Lokesh N. Jaliminche, University of California Santa Cruz
Abstract: The exponential growth of data has made data movement an obvious target of performance and power optimization for data processing applications. This has fueled a growing interest in Computational Storage Devices (CSDs), which can mitigate data-movement overhead between the host and storage devices in modern data-intensive applications. CSDs such as Samsung's SmartSSD enable this capability by integrating a hardware accelerator in every device. Even as we get faster ways of interfacing physically with CSDs (PCIe 5 and 6), better ways of interacting with them (via CXL), and better software mechanisms for programming offload, identifying the functions to offload remains the principal, ever-present technical problem that every current and future application of computational storage must address. Existing methodologies follow an iterative implementation-and-evaluation cycle, which is slow and cost-prohibitive. We propose a systematic and general methodology for automated application offload analysis to address this issue. In particular, we propose CS-Assist to determine candidate kernels that should be offloaded to CSDs. Recognizing the distinct nature of a CSD's hardware capabilities and its position in the system architecture, we first identify essential hardware and kernel characteristics contributing to performance, and then provide an overall workflow to identify candidate kernels for offloading. Our initial evaluation of an analytics workload running atop PostgreSQL demonstrates accurate estimations (less than 7% prediction error) from applying our methodology.
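The kind of first-order offload estimate CS-Assist automates can be sketched as follows, with all link speeds, slowdown factors, and kernel profiles as hypothetical inputs rather than the tool's actual model:

```python
PCIE_GBPS = 8.0          # host <-> SSD link
CSD_SLOWDOWN = 3.0       # the in-device accelerator is slower per element

def host_ms(input_gb, compute_ms):
    # Host must pull the full input over the link, then compute.
    return input_gb / PCIE_GBPS * 1000 + compute_ms

def csd_ms(output_gb, compute_ms):
    # CSD computes in place and ships back only the (small) result.
    return compute_ms * CSD_SLOWDOWN + output_gb / PCIE_GBPS * 1000

kernels = [  # (name, input_GB, output_GB, host_compute_ms)
    ("scan+filter", 10.0, 0.1, 200.0),   # selective: huge input, tiny output
    ("sort",         1.0, 1.0, 900.0),   # compute-bound, no data reduction
]

for name, inp, out, comp in kernels:
    h, c = host_ms(inp, comp), csd_ms(out, comp)
    verdict = "offload" if c < h else "keep on host"
    print(f"{name:12s} host={h:7.0f}ms csd={c:7.0f}ms -> {verdict}")
```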
Speaker bio: Lokesh is a PhD Candidate in Computer Science at the Baskin School of Engineering, University of California Santa Cruz, under the guidance of Prof. Heiner Litz. His research focuses on enabling new storage technologies for better performance and resource utilization and applying Machine Learning techniques to enhance Quality of Service (QoS). He currently works as a Graduate Student Researcher at the Center for Research in Systems and Storage. Prior to his doctoral studies, Lokesh worked as a software engineer at Seagate and DataDirect Networks. At Seagate, he contributed to the design and development of the Lustre File System. At DataDirect Networks, he worked on the Infinite Memory Engine, designed to improve the efficiency of I/O-intensive applications by providing a high-speed, low-latency storage layer between computing resources and the underlying storage infrastructure.
TCAM-SSD: A Framework for Search-Based Computing in Solid-State Drives
Ryan Wong (University of Illinois Urbana-Champaign); Nikita Kim (Carnegie Mellon University); Kevin Higgs (University of Illinois Urbana-Champaign); Engin Ipek (Qualcomm); Sapan Agarwal (Sandia National Laboratories); Saugata Ghose (University of Illinois Urbana-Champaign); Ben Feinberg (Sandia National Laboratories)
Abstract: Over the past decade, the amount of data generated and consumed by modern applications has grown exponentially, which induces a large amount of data movement between processing elements, main memory, and storage. This data movement has become a major bottleneck in modern systems, as it consumes large amounts of energy and results in significant performance penalties. Processing-in-memory (PIM) architectures provide a solution to alleviating data movement by performing data processing closer to (i.e., near) or directly inside (i.e., using) memory arrays. While a large amount of research has explored processing-near-memory (PNM) and processing-using-memory (PUM) for main memory subsystems, these approaches do not alleviate data movement to storage for applications with very large datasets. In this work, we propose TCAM-SSD, a new framework for efficient in-SSD search-based computing. TCAM-SSD makes use of the previously proposed IMS primitive, which treats a NAND flash cell array as a bulk ternary content-addressable memory (TCAM). We demonstrate that TCAM-SSD can mitigate the performance penalties of accessing the disk, with a 60.1% speedup over a conventional system executing transaction processing. For database analytics, TCAM-SSD provides an average speedup of 17.7× over a conventional system for a collection of analytical queries. For graph analytics, TCAM-SSD improves graph computing for larger-than-memory datasets by 12.6%.
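The primitive underneath is a ternary match, where some query bits are wildcards; this software sketch only illustrates the semantics that the NAND-based IMS primitive evaluates across an entire array at once:

```python
def tcam_match(word, pattern, mask):
    """Match if word agrees with pattern on every bit where mask is 1."""
    return (word ^ pattern) & mask == 0

stored = {0: 0b10110100, 1: 0b10010111, 2: 0b01110100}  # rows in the "array"

pattern = 0b10010000
mask    = 0b11010000   # care about bits 7, 6, and 4; all other bits are wildcards

hits = [row for row, word in stored.items() if tcam_match(word, pattern, mask)]
print("matching rows:", hits)   # -> [1]
```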
AutoSSD: CXL-based Autonomic and Collaborative SSD Scheduling to Reduce Tail Latency
Mingyao Shen (University of California, San Diego); Suyash Mahar (University of California, San Diego); Joseph Izraelevitz (University of Colorado, Boulder); Steven Swanson (University of California, San Diego)
Speaker: Mingyao Shen, UC San Diego
Abstract: We present AutoSSD, a CXL-based, autonomic, and collaborative SSD-centric scheduling system designed to reduce the tail latency in all-SSD RAID configurations. AutoSSD effectively addresses key challenges associated with redirecting requests to backup SSDs, such as replication overhead, block tracking, CPU-centric issues, and redirection performance. Overall, AutoSSD demonstrates superior performance compared to traditional Linux mdraid.
Speaker bio: Mingyao Shen is a Ph.D. student at UC San Diego. He is advised by Prof. Steven Swanson. He is interested in general system research and focuses on improving the reliability and efficiency of memory and storage systems. His work appeared in ICS, Eurosys and ICCD.
11:50 am – 1:00 pm | Price Center Ballroom West
Lunch
1:00 pm – 2:20 pm | Price Center Ballroom East
1:00 pm – 2:20 pm | Price Center Ballroom West
Session IV: Caches & Security
Chair: Steven Swanson
Rethinking Metadata Caches in Secure NVMs
Samuel Thomas (Brown University); Hammad Izhar (Brown University); Eliott Dinfotan (Boston University); Tali Moreshet (Boston University); Maurice Herlihy (Brown University); Iris Bahar (Colorado School of Mines)
Speaker: Samuel Thomas, Brown University
Abstract: Secure memory describes a hardware system in which the memory controller guarantees the integrity and privacy of data in memory. Unfortunately, secure memory protections have been removed from commodity hardware like Intel SGX due to their inhibiting performance overheads. Optimizations like the metadata cache can help, but they don't scale to the growing memory footprint of modern workloads. In our work, we propose the Huffmanized Merkle Tree, which takes a different approach to secure memory: it constructs the secure-memory integrity tree as a Huffman tree in order to minimize the metadata required per data authentication, based on the frequency of access. The Huffman tree is adaptable and maintains low runtime overheads. The Huffmanized Merkle Tree reduces runtime overhead in metadata-cache-free workloads by 46% and reduces metadata per data by 30% on average.
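The construction at the heart of the idea is the textbook Huffman algorithm applied to block access frequencies, so hot blocks sit near the root and touch fewer metadata nodes per authentication. A sketch with invented frequencies (the hash/MAC machinery of the real integrity tree is omitted):

```python
import heapq, itertools

freqs = {"blkA": 55, "blkB": 25, "blkC": 15, "blkD": 5}   # access counts (assumed)

tie = itertools.count()   # tiebreaker so heapq never compares dict nodes
heap = [(f, next(tie), {"leaf": b}) for b, f in freqs.items()]
heapq.heapify(heap)
while len(heap) > 1:
    f1, _, n1 = heapq.heappop(heap)
    f2, _, n2 = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(tie), {"left": n1, "right": n2}))
root = heap[0][2]

def depths(node, d=0):
    if "leaf" in node:
        yield node["leaf"], d      # depth = metadata nodes touched per access
    else:
        yield from depths(node["left"], d + 1)
        yield from depths(node["right"], d + 1)

for blk, d in sorted(depths(root), key=lambda x: -freqs[x[0]]):
    print(f"{blk}: depth {d}")     # the hottest block ends up shallowest
```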
Speaker bio: Samuel Thomas is a PhD candidate at Brown University. His research concerns secure memory, particularly as it relates to emerging technologies and use cases. His work has appeared in CAL, HPEC, and at a number of workshops (BARC, NOPE, YArch, NEHWS). His most recent work is set to appear at ASPLOS 2024.
SweepCache: Intermittence-Aware Cache on the Cheap
Yuchen Zhou (Purdue University); Jianping Zeng (Purdue University); Jungi Jeong (Google); Jongouk Choi (University of Central Florida); Changhee Jung (Purdue University)
Speaker: Yuchen Zhou, Purdue University
Abstract: This paper presents SweepCache, a new compiler/architecture co-design scheme that can equip energy harvesting systems with a volatile cache in a performant yet lightweight way. Unlike prior just-in-time checkpointing designs, which persist volatile data just before power failure and thus dedicate additional energy to checkpointing, SweepCache partitions the program into a series of recoverable regions and persists stores at region granularity to fully utilize harvested energy for computation. In particular, SweepCache introduces a persist buffer, a redo buffer resident in nonvolatile memory (NVM), to keep main memory consistent across power failures while persisting a region's stores in a failure-atomic manner. Specifically, for write-backs during region execution, SweepCache saves their cachelines to the persist buffer. At each region end, SweepCache first flushes dirty cachelines to the buffer, allowing the next region to start with a clean cache, and then moves all buffered cachelines to the corresponding NVM locations. In this way, no matter when power failure occurs, the buffer contents or their memory locations always remain intact, which serves as a basis for correct recovery. To hide the persistence delay, SweepCache speculatively starts a region right after the prior region finishes its execution, as if its stores were already persisted, with the two regions having their own persist buffers (dual-buffering). This region-level parallelism helps SweepCache achieve the full potential of a high-performance data cache. Experimental results show that, compared to the original cache-free nonvolatile processor, SweepCache delivers speedups of 14.60x and 14.86x for two representative energy-harvesting power traces, outperforming the state-of-the-art by 3.47x and 3.49x, respectively.
Speaker bio: Yuchen Zhou is a third-year Ph.D. student in the Department of Computer Science at Purdue University, under the guidance of Professor Changhee Jung. His research mainly focuses on energy harvesting systems and whole system persistence. To achieve this objective, he often engages in compiler and architecture co-design to reduce hardware complexity while achieving high performance.
Write-Light Cache for Energy Harvesting Systems
Jongouk Choi (University of Central Florida); Jianping Zeng (Purdue University); Dongyoon Lee (Stony Brook University); Changwoo Min (Igalia); Changhee Jung (Purdue University)
Speaker: Jongouk Choi, University of Central Florida
Abstract: Energy harvesting systems have huge potential to enable battery-less Internet of Things (IoT) services. However, they have been designed without a cache due to the difficulty of guaranteeing crash consistency, limiting their performance. This paper introduces Write-Light Cache (WL-Cache), a specialized cache architecture with a new write policy for energy harvesting systems. WL-Cache combines the benefits of a write-back cache and a write-through cache while avoiding their downsides. Unlike a write-through cache, WL-Cache does not access non-volatile main memory (NVM) at every store; it holds dirty cache lines in the cache to exploit locality, saving energy and improving performance. Unlike a write-back cache, WL-Cache limits the number of dirty lines in the cache. When power is about to be cut off, WL-Cache flushes the bounded set of dirty lines to NVM in a failure-atomic manner by leveraging a just-in-time (JIT) checkpointing mechanism to achieve crash consistency across power failures. As an optimization, WL-Cache interacts with a run-time system that estimates the quality of the energy source during each power-on period and adaptively reconfigures the allowed number of dirty cache lines at boot time. Our experiments demonstrate that WL-Cache reduces hardware complexity and provides a significant speedup over the state-of-the-art volatile cache design with non-volatile backup. For two representative power outage traces, WL-Cache achieves 1.35x and 1.44x average speedups, respectively, across 23 benchmarks used in prior work.
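A toy version of the write policy: stores stay cached as in write-back, but the dirty-line count is capped so the JIT checkpoint on power failure is always affordable. The constant cap and eviction choice below stand in for the boot-time reconfiguration and hardware policy the abstract describes:

```python
MAX_DIRTY = 4

class WLCache:
    def __init__(self):
        self.lines, self.dirty, self.nvm = {}, set(), {}

    def store(self, addr, value):
        if addr not in self.dirty and len(self.dirty) == MAX_DIRTY:
            victim = next(iter(self.dirty))        # evict one dirty line first
            self.nvm[victim] = self.lines[victim]
            self.dirty.discard(victim)
        self.lines[addr] = value
        self.dirty.add(addr)                       # unlike write-through: no NVM write

    def jit_checkpoint(self):                      # on imminent power failure
        for addr in self.dirty:                    # bounded work: <= MAX_DIRTY flushes
            self.nvm[addr] = self.lines[addr]
        self.dirty.clear()

c = WLCache()
for a in range(10):
    c.store(a, a)
c.jit_checkpoint()
assert all(c.nvm[a] == a for a in range(10))       # all stores survive the outage
```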
Speaker bio: Jongouk Choi is an Assistant Professor at University of Central Florida (UCF), Department of Computer Science. He generally develops architecture/compiler co-design solutions to improve performance, reduce hardware complexity, and address reliability/security problems.
Using a Fast Subtree for Efficient Secure NVMs
Samuel Thomas (Brown University); Kidus Workneh (University of Colorado, Boulder); Jac McCarty (Bryn Mawr College); Joseph Izraelevitz (University of Colorado, Boulder); Tamara Lehman (University of Colorado, Boulder); Iris Bahar (Colorado School of Mines)
Speaker: Samuel Thomas, Brown University
Abstract: Secure memory is a highly desirable property to prevent memory corruption-based attacks. The emergence of non-volatile, storage class memory (SCM) devices presents new challenges for secure memory. Metadata for integrity verification, organized in a Bonsai Merkle Tree (BMT), is cached on-chip in volatile caches, and may be lost on a power failure. As a consequence, care is required to ensure that metadata updates are always propagated into SCM. To optimize this, state-of-the-art approaches propose lazily updated crash consistent metadata update schemes. However, few consider the implications of their optimizations on on-chip area, which leads to inefficient utilization of scarce on-chip space. In this paper, we propose a hybrid persistence approach to provide crash consistent integrity with low run-time overhead while limiting on-chip area for security metadata. Our approach offloads the potential hardware complexity of our technique to software to keep area overheads low. Our proposed mechanism results in significant improvements (a 41% reduction in execution overhead on average versus the state-of-the-art) for in-memory storage applications while significantly reducing the required on-chip area to implement our protocol.
Speaker bio: Samuel Thomas is a PhD candidate at Brown University. His research concerns secure memory, particularly as it relates to emerging technologies and use cases. His work has appeared in CAL, HPEC, and at a number of workshops (BARC, NOPE, YArch, NEHWS). His most recent work is set to appear at ASPLOS 2024.
Short Presentations
Chair: Anxiao Jiang
Fully Private Grouped Matrix Multiplication
Lev Tauz (University of California, Los Angeles);Lara Dolecek (University of California, Los Angeles);
Speaker: Lev Tauz, UCLA
Abstract: In this paper, we consider the novel concept of batch-size privacy in distributed coded matrix multiplication, which adds the constraint that workers cannot learn the number of matrix products being calculated. Batch-size privacy helps hide the activity of the master and prevents workers from discriminating against users based on usage patterns. As a primary example, we focus on the model of fully private grouped matrix multiplication, where a master wants to compute a group of matrix products between two matrix libraries accessible to all workers, while ensuring that any set of colluding workers, up to a prescribed number, learns nothing about which matrix products the master desires, nor how many. We present an achievable scheme using a variant of Cross-Subspace Alignment (CSA) codes that offers flexibility in communication and computation costs with good straggler resilience.
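The Python sketch below illustrates the polynomial-coding building block that CSA-style schemes extend, under simplifying assumptions (a toy field, a single product, degree-1 masks): each worker sees only uniformly masked matrices, and the master interpolates the product from the workers' answers. This is not the paper's construction and does not hide the batch size.

```python
import numpy as np

P = 65537                              # small prime field for the demo
rng = np.random.default_rng(0)

# Master's private matrices (entries in GF(P)).
A = rng.integers(0, P, (2, 3))
B = rng.integers(0, P, (3, 2))

# Degree-1 masking: the worker at point x sees A + Ra*x and B + Rb*x,
# each of which is uniformly random to any single worker.
Ra = rng.integers(0, P, A.shape)
Rb = rng.integers(0, P, B.shape)
points = [1, 2, 3]                     # one evaluation point per worker
answers = [(((A + Ra * x) % P) @ ((B + Rb * x) % P)) % P for x in points]

# (A + Ra*x)(B + Rb*x) is a degree-2 matrix polynomial whose constant
# term is A @ B; Lagrange-interpolate it at x = 0 from three answers.
coeffs = [3, -3, 1]                    # Lagrange weights for points 1,2,3 at 0
AB = sum(c * f for c, f in zip(coeffs, answers)) % P
assert np.array_equal(AB, (A @ B) % P)
```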
Speaker bio: Lev Tauz is a Ph.D. candidate in the Electrical and Computer Engineering Department at the University of California, Los Angeles (UCLA). He received his B.S. (with honors) in Electrical Engineering and Computer Science from the University of California, Berkeley in 2016 and his M.S. in Electrical and Computer Engineering from UCLA in 2020. He currently works in the Laboratory for Robust Information Systems (LORIS), focusing on coding techniques for distributed storage and computation. His research interests include distributed systems, error-correcting codes, machine learning, and graph theory.
Side Information-Assisted Symbolic Regression for Data Storage
Xiangwu Zuo (Texas A&M University);Anxiao Jiang (Texas A&M University);Netanel Raviv (Washington University in St. Louis);Paul Siegel (UCSD);
Speaker: Xiangwu Zuo, Texas A&M University
Abstract: There are various ways to use machine learning to improve data storage techniques. In this paper, we study symbolic regression, a machine-learning method for recovering the symbolic form of a function from its samples. We present a new symbolic regression scheme that utilizes side information for higher accuracy and speed in function recovery. The scheme improves on the latest symbolic regression results, which were based on recurrent neural networks and genetic programming. The scheme is tested on a new benchmark of functions for data storage.
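A minimal Python sketch of how side information can assist function recovery: a known property of the hidden function (here, odd symmetry) prunes the candidate pool before error-based selection. The candidate set, property test, and selection rule are all illustrative stand-ins for the paper's RNN- and genetic-programming-based search.

```python
import numpy as np

# Tiny expression pool standing in for a learned search space.
CANDIDATES = {
    "x**2":       lambda x: x**2,
    "x**3":       lambda x: x**3,
    "sin(x)":     np.sin,
    "x + sin(x)": lambda x: x + np.sin(x),
    "x * sin(x)": lambda x: x * np.sin(x),
}

def recover(xs, ys, side_info=None):
    """Pick the candidate minimizing sample error; side information
    (here, a predicate on candidate functions) prunes the pool first."""
    pool = CANDIDATES.items()
    if side_info is not None:
        pool = [(name, f) for name, f in pool if side_info(f)]
    return min(pool, key=lambda nf: np.mean((nf[1](xs) - ys) ** 2))[0]

xs = np.linspace(-2, 2, 50)
ys = xs + np.sin(xs)                              # hidden ground truth
is_odd = lambda f: np.allclose(f(xs), -f(-xs))    # known symmetry
print(recover(xs, ys, side_info=is_odd))          # -> "x + sin(x)"
```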
Speaker bio: Xiangwu Zuo is a Ph.D. candidate in the Department of Computer Science and Engineering at Texas A&M University, advised by Dr. Anxiao (Andrew) Jiang. He earned his master’s degree in Computer Engineering from the same institution in 2013. His research spans several areas, including machine learning, symbolic regression, neuromorphic computing, game theory, financial modeling, and asset pricing.
Learning to Drive Software-Defined Solid-State Drives
Daixuan Li (UIUC);Jinghan Sun (UIUC);Jian Huang (UIUC);
Abstract: Thanks to mature manufacturing techniques, flash-based solid-state drives (SSDs) are highly customizable for applications today, which brings opportunities to further improve their storage performance and resource utilization. However, SSD efficiency is determined by many hardware parameters, making it hard for developers to tune them manually and find the optimal SSD hardware configuration. In this paper, we present an automated learning-based SSD hardware configuration framework, named AutoBlox, that utilizes both supervised and unsupervised machine learning (ML) techniques to drive the tuning of hardware configurations for SSDs. AutoBlox automatically extracts the unique access patterns of a new workload from its block I/O traces, maps the workload to previous workloads to exploit learned experience, and recommends an optimized SSD configuration based on validated storage performance. AutoBlox accelerates the development of new SSD devices by automating the hardware parameter configuration and reducing manual effort. We develop AutoBlox with simple yet effective learning algorithms that run efficiently on multi-core CPUs. Given a target storage workload, our evaluation shows that AutoBlox can deliver an optimized SSD configuration that improves the performance of the target workload by 1.30× on average over commodity SSDs, while satisfying specified constraints such as SSD capacity, device interface, and power budget, and while maximizing the performance improvement for both target and non-target workloads.
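The nearest-neighbor sketch below illustrates just one step of the pipeline the abstract describes: mapping a new workload's trace features to the closest previously tuned workload, subject to user constraints. The features, labels, and distance metric are invented for illustration and are not AutoBlox's actual models.

```python
import numpy as np

# Hypothetical feature vectors extracted from block I/O traces
# (e.g., read ratio, mean request size, queue depth) and the best
# SSD configuration previously validated for each known workload.
known_features = np.array([[0.9,  4.0,  8.0],
                           [0.2, 64.0, 32.0],
                           [0.5, 16.0, 16.0]])
known_configs = ["read-optimized", "write-optimized", "balanced"]

def recommend(trace_features, constraints=lambda cfg: True):
    """Reuse the configuration of the nearest known workload that
    satisfies capacity/interface/power-style constraints."""
    order = np.argsort(np.linalg.norm(known_features - trace_features, axis=1))
    for i in order:
        if constraints(known_configs[i]):
            return known_configs[i]

print(recommend(np.array([0.85, 6.0, 10.0])))    # -> "read-optimized"
```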
ReFloat: Low-Cost Floating-Point Processing in ReRAM for Accelerating Iterative Linear Solvers
Linghao Song (University of California, Los Angeles);Fan Chen (Indiana University Bloomington);Hai "Helen" Li (Duke University);Yiran Chen (Duke University);
Speaker: Linghao Song, UCLA
Abstract: Resistive random access memory (ReRAM) is a promising technology that can perform low-cost, in-situ matrix-vector multiplication (MVM) in the analog domain. Scientific computing requires high-precision floating-point (FP) processing. However, performing floating-point computation in ReRAM is challenging because of the high hardware cost and execution time caused by the large FP value range. In this work we present ReFloat, a data format and an accelerator architecture for low-cost, high-performance floating-point processing in ReRAM for iterative linear solvers. ReFloat matches the ReRAM crossbar hardware and represents a block of FP values with reduced bits and an optimized exponent base for a wide dynamic representation range. ReFloat thus consumes fewer ReRAM crossbars and processing cycles and overcomes the non-convergence issue in prior work. Our evaluation on the SuiteSparse matrices shows ReFloat achieves a 5.02× to 84.28× improvement in solver time over a state-of-the-art ReRAM-based accelerator.
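A toy Python sketch of the shared-exponent idea: one exponent base is chosen per block and each value keeps only a few mantissa bits, which is what makes a block cheap to map onto crossbar bit lines. ReFloat additionally optimizes the exponent base per block rather than simply deriving it from the block maximum, which this sketch does not attempt.

```python
import numpy as np

def block_quantize(block, man_bits=4):
    """Shared-exponent ("block floating point") quantization: derive one
    exponent for the whole block from its largest magnitude, then keep
    man_bits of mantissa per value."""
    base = int(np.ceil(np.log2(np.abs(block).max())))   # shared exponent
    scale = 2.0 ** (man_bits - base)
    return base, np.round(block * scale) / scale        # quantized block

block = np.array([3.75, 0.625, -1.5, 0.09375])
base, approx = block_quantize(block)
print(base, approx)   # large values survive; tiny values lose precision
```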
Speaker bio: Linghao Song is a postdoctoral researcher in the UCLA Computer Science Department. His research interests include domain-specific accelerators, computer architecture, memory-centric computing, machine learning acceleration, and FPGA prototyping and acceleration. He received his Ph.D. in Computer Engineering from Duke University in 2020, his M.S. from the University of Pittsburgh, and his B.S.E. from Shanghai Jiao Tong University. He received the 2020 EDAA Outstanding Dissertation Award and the 2021 Duke ECE Outstanding Dissertation Award. More information is available at https://linghaosong.github.io.
On Inter-PMO Security Attacks
Naveed Ul Mustafa (Department of Computer Science, University of Central Florida (UCF));Yan Solihin (University of Central Florida);
Speaker: Naveed Ul Mustafa, University of Central Florida
Abstract: A Persistent Memory Object (PMO) is a general system abstraction for holding persistent data in persistent main memory, managed by the operating system. The PMO programming model weakens inter-process isolation, since two processes share persistent data as they alternately access the same PMO. In this paper, we discuss the security implications of the PMO model. We demonstrate that the model enables one process to affect the execution of another even when the two never share a PMO over time: an adversary can launch inter-PMO security attacks if the two processes are linked via other, unshared PMOs.
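To illustrate the isolation break informally, the toy model below (our own dict-based stand-in, not the actual PMO system API) shows how persistent state planted by one process can steer a later process's control flow; the paper's attacks go further, linking processes through PMOs they never share.

```python
# Toy model of the PMO abstraction: named persistent objects outlive
# process lifetimes, so state written by one process can influence a
# later process. attach()/detach() are hypothetical stand-ins.
persistent_memory = {}                 # stands in for OS-managed NVM

def attach(name):
    return persistent_memory.setdefault(name, {})

def detach(name):
    pass                               # persistence is implicit here

def process_adversary():
    pmo = attach("config")
    pmo["mode"] = "unsafe"             # plant state for a later process
    detach("config")

def process_victim():
    pmo = attach("config")
    # Control flow depends on persistent state a prior process may
    # have planted -- the isolation break the abstract describes.
    mode = pmo.get("mode", "safe")
    detach("config")
    return mode

process_adversary()
print(process_victim())                # -> "unsafe"
```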
Speaker bio: Dr. Naveed Ul Mustafa is a postdoctoral research scholar at the University of Central Florida (UCF), where he mentors multiple graduate students in the ARPERS research group. Before joining UCF, he served as an assistant professor at TED University, Ankara, Turkey, and completed two industry internships during his Ph.D. He earned his Ph.D. in 2019 from Bilkent University, Ankara, Turkey. His research interests lie in computer architecture and computer security, with a focus on improving memory security and performance through computer architecture and system software. His research has produced several articles, published or currently under review: two in journals, eight in conferences, and five in workshops.