MONDAY, MARCH 9
8:00 am – 9:00 am |
Breakfast & Registration
9:00 am – 10:00 am |
Keynote I:
Oxide Semiconductor Gain Cell Memory
H.-S. Philip Wong (Stanford University);
Abstract: Domain-specific computing systems require cross-compute-stack optimization, with a particular focus on memory access, which often limits overall system performance (broadly defined). Understanding and optimizing memory based on specific applications is crucial. The 2T oxide semiconductor gain cell on-chip memory presents a promising solution to the memory wall problem by offering high density and minimizing off-chip DRAM accesses. Oxide semiconductor (OS) transistors with ultra-low leakage expand the design space by providing longer retention times, albeit at the cost of speed. Traditionally, retention has been managed through refresh operations. Yet, exposing retention to software and mapping it to data lifetime can become a powerful design parameter in balancing speed, power, and density. The wide range of retention times offered by various gain cell configurations (Si-Si, OS-OS, OS-Si) provides an extensive design space for different applications. This talk will cover gain cell device physics and design from the ground up, including: 1. Physics understanding and modeling of oxide semiconductor transistors. 2. Device design guidelines, integration with CMOS, and scaling to advanced nodes. 3. Scaling up in array size and scaling down to advanced technology nodes. 4. A memory macro compiler that accelerates memory circuit design.
Chair: Wenjuan Zhu
10:00 am – 10:10 am |
Break
10:10 am – 11:10 am |
Session 1: CXL memory
Chair: Dongyoon Lee
AIORE: Efficient Security Support for CXL Memory through Adaptive Incremental Offloaded (Re-)Encryption
Chuanhan Li (UC-Santa Cruz); Jishen Zhao (UCSD); Yuanchao Xu (University of California Santa Cruz);
Abstract: Compute eXpress Link (CXL) emerges as a promising technology to address critical memory scaling limitations; nonetheless, broad adoption of CXL memory introduces substantial security challenges. Trusted Execution Environments (TEEs) provide robust protection for data integrity and confidentiality in cloud environments. However, this approach incurs significant performance overhead due to the latency of XTS encryption on memory-intensive workloads. To mitigate this, we propose Adaptive Incremental Offloaded (Re-)Encryption (AIORE), an adaptive security framework combining Counter (CTR) and XTS encryption. Evaluation with gem5 across diverse benchmarks reveals that AIORE reduces security overhead by 62.8% on average.
PoolSwitch: Hiding Model Switching Latency using CXL Pooled Memory
Hane (Stella) Yie (SK hynix America); Younghoon Min (SK hynix America); Jongryool Kim (SK hynix America);
Abstract: In recent times, significant advancements in Artificial Intelligence (AI) and Machine Learning (ML) systems have triggered an exponential demand for memory resources. The performance of such systems is severely impacted by metrics such as available memory size and memory access latency. While adopting disaggregated memory systems can supplement additional memory capacity, traditional network-attached approaches incur high overhead due to serialization and swapping latencies. To this end, we present our approach to supplementing additional memory while drastically reducing access latency using Niagara 2.0, a multi-headed Compute Express Link (CXL) Pooled Memory system. We integrate Niagara 2.0 with the SLoRA serving framework to create a shared-memory architecture for distributed Low-Rank Adaptation (LoRA) serving, named PoolSwitch. We demonstrate that by loading adapters into a centralized CXL pool accessible by multiple host servers, we eliminate the TCP/Ethernet overheads of conventional distributed serving. Our evaluation shows that compared to a TCP-based baseline, PoolSwitch improved adapter loading time by 40% and Time-To-First-Token (TTFT) by 20%.
PIPM: Partial and Incremental Page Migration for Multi-host CXL Disaggregated Shared Memory
Gangqi Huang (University of California, Santa Cruz); Heiner Litz (University of California, Santa Cruz); Yuanchao Xu (University of California, Santa Cruz);
Abstract: The emerging Compute Express Link (CXL) interconnect supports multi-host cache-coherent disaggregated shared memory (CXL-DSM). However, existing page migration approaches, designed primarily for single-host systems, are inefficient in multi-host CXL-DSM scenarios. To address this, we propose Partial and Incremental Page Migration (PIPM), a hardware-based solution that transparently leverages host-side local memory. PIPM is co-designed with the CXL multi-host coherence protocol, enabling coherent access to data residing in local DRAM. To overcome limitations of existing migration methods, PIPM supports fine-grained data migration and integrates hardware-based monitoring and decision-making mechanisms to optimize data placement. Evaluation results demonstrate that PIPM delivers performance improvements of up to 2.54× (1.86× on average) over the default multi-host CXL-DSM configuration.
11:10 am – 11:20 am |
Short Break
11:20 am – 12:30 pm |
Session 2: Device
Chair: Wenjuan Zhu and Akul Malhotra
Enabling Analog In-Memory Computing: A Full-Stack Co-Design Approach
Akul Malhotra (IBM);
Abstract: Deep Neural Networks (DNNs) have demonstrated revolutionary capabilities in various AI applications, such as machine vision, natural language processing, and content generation. However, the relentless scaling of modern DNN models has exposed severe energy and latency bottlenecks. Traditional architectures are increasingly strained by the "memory wall," expending excessive energy and time simply shuttling massive amounts of weight and activation data between physically separated compute and memory units. Analog in-memory computing (AIMC) based on emerging non-volatile memory (NVM) devices, such as Spin-Transfer-Torque Magnetic Random Access Memory (STT-MRAM), Phase Change Memory (PCM), and Resistive Random Access Memory (RRAM), offers a revolutionary computing paradigm to overcome these limitations. By mapping neural network weights to the conductance states of dense resistive memory arrays, analog accelerators enable massively parallel multiply-accumulate (MAC) operations directly where the data resides. This approach utilizes Ohm's law and Kirchhoff's current law to execute MAC operations in constant time even for very large networks. This talk will explore the cross-stack innovations required to make NVM-based analog inference accelerators a practical reality. We will highlight recent state-of-the-art hardware demonstrations, including a 14-nm inference chip integrating multiple PCM tiles. AIMC requires innovative solutions to mitigate the accuracy impact of tile non-idealities, including conductance drift, asymmetry, IR drop, and weight programming noise. This presentation will detail the critical advancements across devices, circuits, and algorithms designed to tackle these issues. Ultimately, we demonstrate competitive iso-accuracy inference using AIMC on increasingly large DNN models.
Ferroelectric Four-Mode Reconfigurable Transistors and Transformable Logic Circuits
Junzhe Kang (University of Illinois at Urbana-Champaign); Hanwool Lee (University of Illinois at Urbana-Champaign); Ashwin Tunga (University of Illinois at Urbana-Champaign); Xiaotong Xu (University of Illinois Urbana-Champaign); Ye Lin (University of Illinois Urbana-Champaign); Takashi Taniguchi (National Institute for Materials Science); Kenji Watanabe (National Institute for Materials Science); Shaloo Rakheja (University of Illinois Urbana-Champaign); Wenjuan Zhu (UIUC);
Abstract: The rapid growth of AI-driven computing has exposed the energy and latency limitations of von Neumann architectures, motivating in-memory computing and reconfigurable devices for improved efficiency and security [1]. Two-dimensional (2D) materials with ambipolar transport provide an attractive platform for non-volatile reconfigurable transistors beyond CMOS scaling limits. Here, we demonstrate non-volatile reconfigurable four-mode field-effect transistors based on 2D MoTe2, integrating ferroelectric CuInP2S6 polarization with a multilayer graphene floating gate to enable dynamic control of transistor polarity and threshold voltage. The device supports four operating modes—logic nFET, logic pFET, memory On, and memory Off. Exploiting these modes, we developed a one-transistor-per-bit ternary content-addressable memory, reconfigurable logic gates, and highly compact look-up tables (LUTs), achieving substantial improvements in area and energy efficiency over conventional CMOS implementations.
Electrode-Engineered Switching Polarity in ScAlN Ferroelectric Tunnel Junctions for High-Temperature Non-volatile Memory
Xiaotong Xu (Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign); Jiangnan Liu (Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor); Ye Lin (Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign); Hanwool Lee (Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign); Zetian Mi (Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor); Wenjuan Zhu (Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign);
Abstract: Nonvolatile memory devices for extreme environments are essential in applications such as automotive electronics, aerospace, and power systems, where conventional memory technologies have limited thermal stability. Ferroelectric tunnel junctions (FTJs) offer a simple two-terminal, low-power, nonvolatile memory structure based on polarization-controlled conductance. Scandium aluminum nitride (ScAlN) has recently emerged as a promising ferroelectric for high-temperature memory due to its high Curie temperature and large remanent polarization. FTJs based on ScAlN have been demonstrated; however, how to control their polarity remains largely unexplored. In this work, we investigate electrode-engineered switching polarity in ScAlN FTJs and evaluate their thermal stability. We found that FTJs with Ni and Ti electrodes exhibit opposite switching polarities, with device characteristics remaining nearly unchanged after high-temperature operation up to 500 °C. These FTJs with tunable polarity offer advantages such as simplified circuitry and improved integration.
12:30 pm – 1:30 pm |
Lunch
1:30 pm – 3:00 pm |
Session 3: SSD
Chair: Hung-Wei Tseng
Towards Building a Learning-Based Storage Ecosystem and Beyond
Jinghan Sun (Meta);
Abstract: Storage systems have evolved over decades into a complex ecosystem spanning storage hardware, system software, and data applications. This growing complexity poses significant challenges for storage development and deployment. However, current human-driven, heuristic-based approaches to building storage systems cannot keep pace with the ever-increasing performance and efficiency demands of modern applications. With the rapid advancement of machine learning (ML), we are now in a golden age for inventing new approaches to developing and deploying storage systems with ML techniques. In this talk, I will present my research on building a learning-based storage ecosystem. Specifically, I will demonstrate how we can leverage ML to (1) automate and accelerate storage hardware development, (2) optimize system software for improved memory efficiency and storage performance, and (3) manage the complexity of cloud storage deployment at scale. I will also discuss how my work sheds light on the broader vision of "ML for Systems" research.
Bridging the Storage Gap in Space-Borne Data Centers: A Cost-Effective Soft Error Mitigation for SSD Controllers
Bob Teng (Silicon Motion); Shiuan-Hao Kuo (Silicon Motion);
Abstract: The emergence of space-borne data centers, exemplified by Google's Project Suncatcher, promises to revolutionize AI infrastructure by leveraging the high efficiency of extraterrestrial solar energy. While recent findings indicate that compute units like TPUs possess inherent radiation resilience, data storage remains a critical vulnerability. High-capacity Commercial Off-The-Shelf (COTS) SSD controllers are essential but susceptible to soft errors. Standard parity overlapping techniques fail for "non-transparent" modules (e.g., AES (Advanced Encryption Standard)) that fundamentally alter data patterns. Consequently, the industry relies on full hardware duplication with temporal delay, which incurs unacceptable penalties in area and power. This work proposes a cost-effective architecture that combines boundary parity overlapping with fine-grained internal ECC. Our design achieves a 19.5% reduction in area and a 26.7% reduction in power compared to traditional duplication, while maintaining a fault detection rate exceeding 99%, making it a viable solution for next-generation space AI infrastructure.
From Storage to Cyber-Storage: Reliability and Security for Healthcare-Critical Data Paths
Rino Micheloni (Avaneidi S.p.A.); Luca Crippa (Avaneidi S.p.A.); Alessia Marelli (Avaneidi S.p.A.); Sebastiano Fabio Schifano (Università degli Studi di Ferrara - Dipartimento di Ingegneria); Giada Minghini (Università degli Studi di Ferrara - Dipartimento di Ingegneria); Andrea Miola (Università degli Studi di Ferrara - Dipartimento di Ingegneria); Cristian Zambelli (Università degli Studi di Ferrara - Dipartimento di Ingegneria);
Abstract: Deep Learning (DL) has become essential for high-accuracy medical image analysis in healthcare. However, the reliance on traditional Picture Archiving and Communication Systems (PACS) introduces significant bottlenecks regarding data throughput, system uptime for continual learning, and vulnerability to data tampering. This industrial work in collaboration with an academic partner proposes a proprietary CyberStorage platform designed to meet the rigorous availability and security demands of DL-driven medical environments.
Using Deep Q-Network (DQN) to Optimize Data Center Solid State Drive (SSD) Quality of Service (QoS) and Performance
Dan McLeran (Solidigm); Ravi Motwani (Solidigm); Yogesh Wakchaure (Solidigm); Mark Golez (Solidigm); Holman Su (Solidigm); Praveen Janga (Solidigm); Sarvesh Gangadhar (Solidigm); Declan Bryne (Solidigm); Ryan Norton (Solidigm);
Abstract: We present a mechanism for optimizing Solidigm Data Center SSD QoS and performance using a DQN. The approach, used on Solidigm SSDs, works as follows: 1. Input: The current SSD parameter settings are fed into a DQN. 2. Adjustment: The network suggests changes to these parameters. 3. Update: The SSD applies the adjusted settings. 4. Execution: A workload runs on the SSD with the new settings. 5. Evaluation: Performance data is collected during execution. 6. Reward: A reward score is calculated based on the QoS and performance. 7. Learning: The neural network is updated through backpropagation using this reward.
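The seven-step loop in the abstract can be sketched in code. The sketch below is purely illustrative and is not the authors' implementation: a tabular Q-learning agent stands in for the deep Q-network, a toy reward function stands in for real QoS/performance measurement, and all names, parameter ranges, and hyperparameters are hypothetical.

```python
# Illustrative sketch of the seven-step DQN tuning loop described above.
# A tabular Q-learning agent replaces the deep network for brevity, and a
# toy simulator replaces the real SSD; every name here is hypothetical.
import random

ACTIONS = [-1, 0, +1]        # Step 2 (adjustment): decrease / keep / increase a parameter
PARAM_RANGE = range(0, 10)   # discretized SSD parameter settings

def run_workload(param):
    """Steps 4-6: run a workload and return a QoS/performance reward.
    The optimum is placed at param == 6 purely for illustration."""
    return 1.0 - abs(param - 6) / 10.0

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2):
    q = {(s, a): 0.0 for s in PARAM_RANGE for a in ACTIONS}
    param = 0
    for _ in range(episodes):
        # Step 1 (input): current setting; Step 2: the agent suggests a change
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(param, a)])
        # Step 3 (update): apply the adjusted setting, clamped to the valid range
        new_param = min(max(param + action, 0), 9)
        # Steps 4-6: execute the workload and compute the reward score
        reward = run_workload(new_param)
        # Step 7 (learning): Q-update (backpropagation through the network
        # in the real DQN formulation)
        best_next = max(q[(new_param, a)] for a in ACTIONS)
        q[(param, action)] += alpha * (reward + gamma * best_next - q[(param, action)])
        param = new_param
    return q, param

if __name__ == "__main__":
    random.seed(0)
    q_table, final_param = train()
    print("final parameter setting:", final_param)
```

In the real system the lookup table would be a neural network updated by backpropagation, and the reward would be computed from measured QoS and performance rather than a closed-form function.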
3:00 pm – 3:10 pm |
Short Break
3:10 pm – 4:30 pm |
Session 4: Intermittent computing
Chair: Sihang Liu
Rethinking Prefetching for Intermittent Computing
Gan Fang (Purdue University); Jianping Zeng (Arizona State University); Aditya Gupta (Purdue University); Changhee Jung (Purdue University);
Abstract: Prefetching improves performance by reducing cache misses. However, conventional prefetchers are too aggressive to serve batteryless energy harvesting systems (EHSs), where energy efficiency is the utmost design priority due to weak input energy and the resulting frequent outages. To this end, this paper proposes IPEX, an Intermittence-aware Prefetching EXtension that can be integrated into existing prefetchers on EHSs. IPEX aims to avert useless prefetches by suppressing the prefetching of cache blocks that receive no hit before being lost on power failure, which would otherwise waste harvested energy. At a proper moment before an upcoming outage, IPEX throttles the prefetch degree to target only those blocks that are likely to be used before the outage. That way, IPEX saves energy and spends it on making further execution progress. Experimental results show that on average, IPEX reduces energy consumption by 7.86% (up to 21.64%) and improves performance by 8.96% (up to 23.49%) compared to a conventional prefetcher.
WarmCache: Exploiting STT-RAM Cache for Low-Power Intermittent Systems
Noureldin Hassan (University of Central Florida); Byounguk Min (Purdue University); Changhee Jung (Purdue University); Yan Solihin (UCF); Jongouk Choi (University of Central Florida);
Abstract: This paper introduces WarmCache, an optimized STT-RAM cache design with relaxed non-volatility for energy harvesting systems (EHSs), which avoids compulsory misses across power failures. The key insight is that if the retention time of a cache is longer than a power outage period, the cache contents can be preserved, thereby preventing compulsory misses. Based on this insight, WarmCache leverages an STT-RAM cache with reduced thermal stability to preserve non-volatility during power outages without persisting any data. To mitigate retention failures that may occur in the relaxed STT-RAM cache, WarmCache lets its compiler partition the program into a series of regions and conducts region-level error correction. At each region boundary, WarmCache verifies the execution of the region by scrubbing updated cache lines and re-executes it if any multi-bit error is detected therein. As an optimization, the WarmCache compiler introduces a novel region formation technique that adjusts the size of each region to match the scrubbing interval, achieved through region stitching for combining shorter regions and region splitting for dividing longer regions. Our experiments demonstrate that WarmCache avoids compulsory cache misses and improves performance by 1.3–1.4x on average compared to the state-of-the-art cache design for EHSs.
INTOS: Persistent Embedded Operating System and Language Support for Multi-threaded Intermittent Computing
Yilun Wu (Stony Brook University); Byounguk Min (Purdue University); Mohannad Ismail (Virginia Tech); Wenjie Xiong (Virginia Tech); Changhee Jung (Purdue University); Dongyoon Lee (Stony Brook University);
Abstract: This paper introduces INTOS, an embedded operating system and language support for multi-threaded intermittent computing on a battery-less energy-harvesting platform. INTOS simplifies programming with a traditional “thread” and a “transaction” with automatic undo-logging of persistent objects in non-volatile memory. While INTOS allows the use of volatile memory for performance and energy efficiency, conventional transactions do not ensure crash consistency of volatile register and memory states. To address this challenge, INTOS proposes a novel replay-and-bypass approach, eliminating the need for users to checkpoint volatile states. Upon power restoration, INTOS recovers non-volatile states by undoing the updates of power-interrupted transactions. To reconstruct volatile states, INTOS restarts each thread bypassing committed transactions and system calls by returning recorded results without re-execution. INTOS seeks to build a persistent, full-fledged embedded OS, supporting priority-based preemptive multithreading while ensuring crash consistency even if power failure occurs during a system call or while some threads are blocked. Experiments on a commodity platform MSP430FR5994 show that when subjected to an extreme power failure frequency of 1 ms, INTOS demonstrated 1.24x lower latency and 1.29x less energy consumption than prior work leveraging idempotent processing. This trend turns out to be more pronounced on Apollo 4 Blue Plus.
Adaptive Computing In Memory Meets Conventional Batteryless Platforms
Khakim Akhunov (University of Trento & imec); Kasim Sinan Yildirim (University of Trento); Jongouk Choi (University of Central Florida); Changhee Jung (Purdue University);
Abstract: Computing In-Memory (CIM) with emerging nonvolatile memory (NVM) technologies is promising for batteryless systems since it removes the need for explicit backup and energy-hungry data transfer between the processor and memory. However, existing CIM solutions are not effective in accelerating memory-bound inference tasks efficiently on batteryless systems. They operate at relatively low frequencies, complicate application development, and do not consider energy harvesting dynamics to optimize their throughput. To address these issues, this paper presents a novel CIM-based batteryless computing platform, called Viadotto, that provides efficient and adaptive acceleration for memory-bound computing workloads. Viadotto brings adaptive CIM to conventional microcontroller-based (MCU-based) batteryless platforms for the first time. Viadotto exposes a programming model supported by its compiler and a pipelined memory controller, which hides low-level CIM operations from applications. Furthermore, its runtime issues CIM operations in an energy-efficient manner and optimizes throughput in a programmer-transparent way by adapting CIM parallelism to react to ambient power dynamics. Our evaluation shows that Viadotto outperforms existing CIM solutions for batteryless systems by 48%.
4:30 pm – 4:40 pm |
Short Break
4:40 pm – 5:40 pm |
Session 5: Caching
Chair: Jongouk Choi
µleak: Bypassing MPU Isolation on Cortex-M7 via Cache-Timing Attacks
Muhammad Hammad Bashir (Pennsylvania State University); Taegyu Kim (The Pennsylvania State University); Arslan Khan (Pennsylvania State University); Kyungtae Kim (Dartmouth College);
Abstract: High-performance microcontrollers (MCUs) rely on hardware L1 caches to mask the high latency of embedded Flash memory, inadvertently introducing data-dependent timing channels. We present μLEAK, an unprivileged Prime+Probe attack on ARM Cortex-M7 that recovers AES keys despite MPU-enforced privilege separation. We show that vendor-recommended mitigations eliminate leakage but can reduce AES throughput by up to 73%, highlighting a security–performance trade-off driven by NVM-constrained execution.
MASA: a Memory-Aware Self-Adjusting Cache Policy for Hybrid Memory Architecture
Ming-Hong Yang (University of Minnesota); David H.C. Du (University of Minnesota);
Abstract: MASA (Memory-Aware Self-Adjusting) is a novel cache policy designed to overcome the scalability and power consumption limitations of DRAM-only systems by integrating Non-Volatile Memory (NVM) into a hybrid architecture. Unlike traditional hierarchical systems that use DRAM merely as a buffer, MASA manages DRAM and NVM side-by-side to fully utilize the aggregate memory capacity. The policy categorizes data into eight distinct partitions based on state (clean/dirty), access pattern (recency/frequency), and nature (real/ghost), and utilizes a unique dual-factor adjustment mechanism. This mechanism combines traditional "ghost hits" with a hardware-aware "Beneficial Factor" that weights cache decisions based on actual memory latencies, ensuring that clean data stays in DRAM for speed while dirty data resides in NVM for persistence. To further enhance efficiency, MASA employs a conservative swapping strategy that triggers data migration only when necessary, minimizing unnecessary memory traffic. Experimental evaluations demonstrate that MASA achieves a 22% reduction in storage I/O operations compared to existing schemes like H-ARC, leading to significantly improved system lifespan (especially for SSDs), higher energy efficiency, and a flexible, tech-agnostic framework capable of adapting to future NVM developments.
Cache Your Prompt When It’s Green — Carbon-aware Caching for Large Language Model Serving
Yuyang Tian (University of Waterloo); Desen Sun (University of Waterloo); Yi Ding (Purdue University); Sihang Liu (University of Waterloo);
Abstract: As large language models (LLMs) become widely used, their environmental impact, especially carbon emissions, has attracted more attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present GreenCache, a carbon-aware cache management framework that dynamically derives resource allocation plans for LLM serving. GreenCache analyzes the correlation between carbon emission and SLO satisfaction, reconfiguring resources over time to balance SLO and carbon emission under dynamic workloads. Evaluations on real traces demonstrate that GreenCache achieves an average carbon reduction of 12.6% when serving Llama-3 70B, while staying within latency constraints for >90% of requests.
5:40 pm – |
Social & Refreshments
TUESDAY, MARCH 10
8:00 am – 9:40 am |
Breakfast
9:40 am – 10:40 am |
Keynote II:
Ensuring Crash Consistency for Persistent Memory and CXL Shared Memory
Brian Demsky (UCI);
Abstract: Persistent Memory and Compute Express Link (CXL) shared memory technologies allow data to survive crash events, enabling efficient communication and data sharing across clusters of machines. However, ensuring that data remains consistent after a crash is a major challenge. Developers must carefully manage flush and fence operations to force cached data to persistent storage, and mistakes are notoriously difficult to detect through testing alone. In this talk, I will present complementary approaches to tackling the crash consistency problem. First, I will introduce PMROBUST, a compiler that automatically inserts flush and fence operations, guaranteeing the absence of missing flush and fence bugs. Second, I will present CrashLang, a language and type system that guarantees well-typed data structure implementations are crash-consistent by leveraging the commit-store pattern, in which a single store logically commits an entire operation. Third, I will describe CXLMC, a model checker that systematically explores crashing executions on the x86-CXL shared memory platform. Together, these tools offer a comprehensive toolkit spanning compilation, language design, and verification for building reliable software on emerging CXL and persistent memory platforms.
Chair: Changhee Jung
10:40 am – 10:50 am |
Break
10:50 am – 11:50 am |
Session 6: Persistent memory
Chair: Yuanchao Xu
Rethinking Dead Block Prediction for Intermittent Computing
Gan Fang (Purdue University); Changhee Jung (Purdue University);
Abstract: Existing dead block predictors have proven effective in reducing cache leakage power in conventional systems. However, prior work is significantly less effective in energy harvesting systems because it does not take into account their unique characteristic, i.e., frequent power failures during program execution. Even if some cache blocks are predicted to be live, they may not be used due to their loss upon power failure. In response, this paper introduces EDBP, an extension to existing dead block predictors that enhances their performance in various energy harvesting environments. EDBP can identify and deactivate those cache blocks that are not reused before an upcoming power failure, even though they are considered live by the existing predictor, thereby lowering cache leakage and preserving more energy for forward execution progress. Experimental results show that for 20 applications from MediaBench and MiBench, EDBP alone reduces total energy consumption by 6.5% and improves performance by 6.9% compared to a baseline with no dead block predictor. When combined with a conventional dead block predictor (Cache Decay), EDBP achieves a 9.8% reduction in total energy consumption, approaching the theoretical minimum, and an 11.9% performance improvement.
Compiler-Directed Whole-System Persistence
Jianping Zeng (Arizona State University);Huatao Wu (Arizona State University);Tong Zhang (Samsung Electronics);Changhee Jung (Purdue University);
Abstract: Nonvolatile memory (NVM) technologies have gained increasing attention thanks to their density and durability benefits. However, leveraging NVM can cause a crash consistency issue. For example, if a younger store is evicted (persisted) to NVM from volatile caches before an older one and power failure occurs in between, it might be impossible to correctly resume the interrupted program in the wake of the failure. Traditionally, addressing this issue involves expensive persist barriers for enforcing the original store order, which not only incurs a high run-time overhead but also places a significant burden on users due to the difficulty of persistent programming. To this end, this paper presents cWSP, a compiler/architecture co-design for lightweight yet performant whole-system persistence (WSP). In particular, the cWSP compiler partitions not only user applications but also the OS and runtime libraries into a series of recoverable regions (epochs), thus enabling persistence and crash consistency for the entire software stack. To achieve high-performance crash consistency, cWSP leverages advanced compiler optimizations for checkpointing a minimal set of registers and proposes simple hardware support for expediting data persistence on the cheap. Experimental results with 37 applications from SPEC CPU2006/2017, DOE Mini-apps, SPLASH3, WHISPER, and STAMP show that cWSP incurs an average runtime overhead of 6%, outperforming the state-of-the-art work by a significant margin.
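The epoch-based recovery model in the abstract can be illustrated with a toy simulation (a hypothetical sketch, not cWSP's actual mechanism): each epoch runs on a volatile copy of the state and persists everything at its boundary, so a power failure mid-epoch rolls execution back to the last completed epoch's checkpoint.

```python
def run_epochs(epochs, crash_before=None):
    """Toy whole-system persistence model.

    epochs: list of functions mapping state dict -> state dict.
    crash_before: index of the epoch interrupted by power failure, or None.
    Returns the durable (persisted) state.
    """
    persisted = {"pc": 0, "acc": 0}      # durable checkpoint
    for i, epoch in enumerate(epochs):
        if crash_before is not None and i == crash_before:
            return persisted             # power failed mid-epoch:
                                         # volatile work is discarded
        working = dict(persisted)        # volatile copy for this epoch
        working = epoch(working)
        working["pc"] = i + 1
        persisted = working              # epoch boundary: persist all state
    return persisted

epochs = [lambda s: {**s, "acc": s["acc"] + 10},
          lambda s: {**s, "acc": s["acc"] * 3}]

# No failure: both epochs complete, acc = (0 + 10) * 3.
assert run_epochs(epochs) == {"pc": 2, "acc": 30}
# Failure during epoch 1: state rolls back to the epoch-0 checkpoint,
# and the recorded pc tells recovery which epoch to re-execute.
assert run_epochs(epochs, crash_before=1) == {"pc": 1, "acc": 10}
```

Because every epoch boundary is a consistent cut across the whole state, recovery never observes the younger-store-before-older-store reordering the abstract warns about.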
A brief primer on lightweight Persistent Memory Objects
Derrick Greenspan (University of Central Florida College of Engineering and Computer Science);Naveed Ul Mustafa (New Mexico State University);Mark Heinrich (University of Central Florida College of Engineering and Computer Science);Yan Solihin (UCF);Jongouk Choi (University of Central Florida);
Abstract: Persistent Memory Objects (PMOs) are the state-of-the-art OS-based approach for persistent memory (PM) management. Recent PMO designs have limited performance due to the properties of the PM substrate. To address this challenge, this paper introduces LPMO, or lightweight PMOs, which enables two key performance optimization techniques: software-based DRAM caching and prediction. First, through DRAM caching, LPMO moves reads/writes to a faster medium while retaining crash consistency. Second, LPMO introduces software-based predecryption to predict when pages might be used and decrypt them ahead of time. Our evaluation shows that software-based DRAM caching and software-based predecryption with LPMO can improve the performance of a PMO system by up to 1.25× compared to the prior state-of-the-art implementations when using LPMO locally. When bundled with a stream predictor, the improvement reaches 1.81×, depending on the workload. To further demonstrate the flexibility and performance benefits of our LPMO design, we evaluate our solution in a CXL memory system and introduce a CXL memory hierarchy that our LPMO system can configure. In such a CXL system, when integrated with CXL Enhanced Memory Functions (EMFs) that perform encryption in hardware, LPMO performs comparably to, or in some workloads faster than, the prior state-of-the-art design, despite the added latency of CXL memory.
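The pairing of predecryption with a stream predictor can be sketched as follows (a hypothetical simplification with illustrative names, not LPMO's implementation): when accesses look sequential, the next pages are decrypted into the DRAM cache ahead of time, taking demand decryption off the critical path.

```python
def decrypt(page):
    """Stand-in for real page decryption (illustrative only)."""
    return f"plain:{page}"

class PredecryptingCache:
    """DRAM cache that predecrypts ahead of a detected sequential stream."""
    def __init__(self, depth=2):
        self.cache = {}            # page number -> decrypted contents
        self.last = None           # last page accessed
        self.depth = depth         # how far ahead the stream prefetches
        self.demand_decrypts = 0   # decryptions on the critical path

    def read(self, page):
        if page not in self.cache:
            self.demand_decrypts += 1           # miss: decrypt on demand
            self.cache[page] = decrypt(page)
        if self.last is not None and page == self.last + 1:
            # Sequential stream detected: predecrypt the next pages.
            for ahead in range(1, self.depth + 1):
                nxt = page + ahead
                self.cache.setdefault(nxt, decrypt(nxt))
        self.last = page
        return self.cache[page]

pm = PredecryptingCache()
for p in range(6):                  # sequential scan of pages 0..5
    assert pm.read(p) == f"plain:{p}"
# Only the first two reads paid demand decryption; once the stream was
# detected, predecryption covered every later access.
assert pm.demand_decrypts == 2
```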
