Publications

FlashLLM: A Chiplet-Based In-Flash Computing Architecture to Enable On-Device Inference of 70B LLM

Published in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024

We introduce FlashLLM, a chiplet-based hybrid architecture designed for on-device inference of 70B LLMs. FlashLLM features a dedicated flash chip connected directly to the NPU through chiplet technology. The flash not only stores model weight matrices but also leverages on-die processing capabilities to reduce data transfers to the NPU, thereby mitigating both the footprint and bandwidth limitations. The NPU, in addition to collaborating with flash for matrix operations…

BIZA: Design of Self-Governing Block-Interface ZNS AFA for Endurance and Performance

Published in ACM Symposium on Operating Systems Principles (SOSP), 2024

In this work, we propose BIZA, a self-governing block-interface ZNS AFA to proactively schedule I/O requests and SSD internal tasks via the ZNS interface while exposing the user-friendly block interface to upper-layer software. BIZA achieves both long endurance and high performance by exploiting the zone random write area (ZRWA) and internal parallelisms of ZNS SSDs…

ScalaCache: Scalable User-Space Page Cache Management with Software-Hardware Coordination

Published in USENIX Annual Technical Conference (ATC), 2024

We propose ScalaCache, a scalable user-space page cache with software-hardware coordination. Specifically, to reduce the host CPU overhead, we offload the cache management into computational SSDs (CSDs) and further merge the indirection layers in both the cache and flash firmware, which facilitates lightweight cache management…

ScalaAFA: Constructing User-Space All-Flash Array Engine with Holistic Designs

Published in USENIX Annual Technical Conference (ATC), 2024

We propose ScalaAFA, a unique holistic design of AFA engine that can extend the throughput of next-generation SSD arrays in scale with low CPU costs. We incorporate ScalaAFA into user space to avoid user-kernel context switches while harnessing SSD built-in resources for handling AFA internal tasks…

Flagger: Cooperative Acceleration for Large-Scale Cross-Silo Federated Learning Aggregation

Published in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2024

We propose Flagger, an efficient and high-performance FL aggregator. Flagger meticulously integrates the data processing unit (DPU) with computational storage drives (CSD), employing these two distinct near-data processing (NDP) accelerators as a holistic architecture to collaboratively enhance FL aggregation…

Achieving Near-Zero Read Retry for 3D NAND Flash Memory

Published in ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024

We characterize different types of real flash chips, based on which we further develop models for the correlation among the optimal read offsets of read voltages required for reading each page. By leveraging characterization observations and the models, we propose a methodology to generate a tailored RRT for each flash model…

StreamPIM: Streaming Matrix Computation in Racetrack Memory

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

We propose StreamPIM, a new processing-in-RM architecture, which tightly couples the memory core and the computation units. Specifically, StreamPIM directly constructs a matrix processor from domain-wall nanowires without the usage of CMOS-based computation units. It also designs a domain-wall nanowire-based bus, which can eliminate electromagnetic conversion…

Midas Touch: Invalid-Data Assisted Reliability and Performance Boost for 3D High-Density Flash

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

This work proposes invalid-data assisted strategies for performance and reliability boosting of valid data in 3D QLC-based flash storage systems. We first propose a high-efficiency re-programming (RP) scheme to reprogram the valid data and a high-reliability not-programming (NP) scheme to program data on the partially-invalid WLs…

LearnedFTL: A Learning-based Page-level FTL for Reducing Double Reads in Flash-based SSDs

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

We present LearnedFTL, a new on-demand page-level flash translation layer (FTL) design, which employs learned indexes to improve the address translation efficiency of flash-based SSDs. The first of its kind, it reduces the number of double reads induced by address translation in random read accesses. LearnedFTL proposes three key techniques…
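
The idea of using a learned index for address translation can be illustrated with a minimal sketch. This is not LearnedFTL's actual design: the names `LinearSegment`, `fit_segment`, and `translate`, the single linear model, and the "accurate" set standing in for a validity bitmap are all hypothetical choices made for illustration. A correct prediction avoids the extra mapping-table read; a miss falls back to the table, which models the double read.

```python
from dataclasses import dataclass, field


@dataclass
class LinearSegment:
    """Hypothetical learned model: PPN ~ slope * (LPN - start_lpn) + intercept."""
    start_lpn: int
    slope: float
    intercept: float
    accurate: set = field(default_factory=set)  # LPNs the model predicts exactly

    def predict(self, lpn: int) -> int:
        return round(self.slope * (lpn - self.start_lpn) + self.intercept)


def fit_segment(mapping: dict) -> LinearSegment:
    """Fit a line through the first and last entries of an LPN -> PPN mapping."""
    lpns = sorted(mapping)
    first, last = lpns[0], lpns[-1]
    span = last - first
    slope = (mapping[last] - mapping[first]) / span if span else 0.0
    seg = LinearSegment(first, slope, float(mapping[first]))
    # Record which LPNs the model gets right (stand-in for a validity bitmap).
    seg.accurate = {lpn for lpn in lpns if seg.predict(lpn) == mapping[lpn]}
    return seg


def translate(lpn: int, seg: LinearSegment, mapping: dict):
    """Return (ppn, needed_extra_read)."""
    if lpn in seg.accurate:
        return seg.predict(lpn), False   # model hit: no mapping-table read
    return mapping[lpn], True            # model miss: read the mapping table


# Mostly sequential mapping with one out-of-place page (assumed example data).
mapping = {100 + i: 5000 + i for i in range(8)}
mapping[104] = 9000
segment = fit_segment(mapping)
print(translate(103, segment, mapping))
print(translate(104, segment, mapping))
```

In this toy setup only the out-of-place page pays for a table lookup; how well a real workload fits such a model is exactly what the paper's techniques address.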

BeaconGNN: Large-Scale GNN Acceleration with Out-of-Order Streaming In-Storage Computing

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

We propose BeaconGNN, an in-storage computing (ISC) design for GNN that supports both large-scale graph structures and feature tables. First, it utilizes a novel graph format to enable out-of-order GNN neighbor sampling, improving flash resource utilization. Second, it deploys near-data processing engines across multiple levels of the flash hierarchy (i.e., controller, channel, and die)…

Ohm-GPU: Integrating New Optical Network and Heterogeneous Memory into GPU Multi-Processors

Published in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021

We propose Ohm-GPU, a new optical-network-based heterogeneous memory design for GPUs. Specifically, Ohm-GPU can expand the memory capacity by combining a set of high-density 3D XPoint and DRAM modules as heterogeneous memory. To prevent memory channels from throttling the throughput of the GPU memory system, Ohm-GPU replaces the electrical lanes in the traditional memory channel with a high-performance optical network…

Revamping Storage Class Memory With Hardware Automated Memory-Over-Storage Solution

Published in International Symposium on Computer Architecture (ISCA), 2021

HAMS aggregates the capacity of NVDIMM and ultra-low latency flash archives (ULL-Flash) into a single large memory space, which can be used as a working memory expansion or persistent memory expansion, in an OS-transparent manner. To make HAMS more energy-efficient and reliable, we propose an “advanced HAMS” which removes unnecessary data transfers between NVDIMM and ULL-Flash after optimizing the datapath and hardware modules of HAMS…

ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis

Published in International Symposium on Computer Architecture (ISCA), 2020

We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address performance penalties imposed by an SSD. Specifically, ZnG replaces all GPU internal DRAMs with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating SSD firmware in the GPU's MMU to reap the benefits of hardware accelerations…

FastDrain: Removing Page Victimization Overheads in NVMe Storage Stack

Published in Computer Architecture Letters, 2020

Host-side page victimizations can easily overflow the SSD internal buffer, which interferes with the I/O services of diverse user applications, thereby degrading user-level experiences. To address this, we propose FastDrain, a co-design of OS kernel and flash firmware to avoid the buffer overflow caused by page victimizations. Specifically, FastDrain can detect a triggering point where a near-future page victimization introduces an overflow of the SSD internal buffer…

Scalable Parallel Flash Firmware for Many-core Architectures

Published in USENIX Conference on File and Storage Technologies (FAST), 2020

We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests in a second (1MIOPS) while hiding long latencies imposed by its internal flash media. Inspired by a parallel data analysis system, we design the firmware based on a many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device…

DRAM-less: Hardware Acceleration of Data Processing with New Memory

Published in International Symposium on High Performance Computer Architecture (HPCA), 2020

In this work, we propose DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications. We implement a new memory controller that plugs a real 3x nm multi-partition PRAM to 28nm technology FPGA logic cells and integrate its design into a real PCIe accelerator emulation platform…

Faster than Flash: An In-Depth Study of System Challenges for Emerging Ultra-Low Latency SSDs

Published in IEEE International Symposium on Workload Characterization (IISWC), 2019

In this work, we comprehensively perform empirical evaluations with 800GB ULL SSD prototypes and characterize ULL behaviors by considering a wide range of I/O path parameters, such as different queues and access patterns. We then analyze the efficiencies and challenges of the polled-mode and hybrid polling I/O completion methods (added into Linux kernels 4.4 and 4.10, respectively) and compare them with the efficiencies of a conventional interrupt-based I/O path…

Exploring Fault-Tolerant Erasure Codes for Scalable All-Flash Array Clusters

Published in IEEE Transactions on Parallel and Distributed Systems (TPDS), 2019

To understand the impact of using erasure coding on the system performance and other system aspects such as CPU utilization and network traffic, we build a storage cluster that consists of approximately 100 processor cores with more than 50 high-performance solid-state drives (SSDs), and evaluate the cluster with a popular open-source distributed parallel file system, called Ceph…

Maximizing GPU Cache Utilization with Adjustable Cache Line Management

Published in Korea Computer Congress (KCC), 2019

Executing irregular applications on general-purpose graphics processing units (GPGPUs) poses serious challenges to their cache system. This paper proposes JUSTIT, an adjustable cache line management design that maximizes the GPU L1D cache utilization by being aware of the memory request access granularity…

FlashGPU: Placing New Flash Next to GPU Cores

Published in The 56th Design Automation Conference (DAC), 2019

We propose FlashGPU, a new GPU architecture that tightly blends new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND that exhibits ultra-low latency. We also architect a flash core to manage request dispatches and address translations underneath L2 cache banks of GPU cores…

FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads

Published in International Symposium on High Performance Computer Architecture (HPCA), 2019

In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs…

Amber: Enabling Precise Full-System Simulation with Detailed Modeling of All SSD Resources

Published in The 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

SSDs have become a major storage component in modern memory hierarchies, and SSD research demands exploring future simulation-based studies by integrating SSD subsystems into a full-system environment. However, several challenges exist in modeling SSDs under full-system simulation; SSDs are built upon their own complete system and architecture, which employ all necessary hardware, such as CPUs, DRAM and interconnect network. Employing the hardware components, SSDs also require to have…

FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

Published in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018

In this paper, we propose FlashShare to assist ULL SSDs to satisfy different levels of I/O service latency requirements for different co-running applications. Specifically, FlashShare is a holistic cross-stack approach, which can significantly reduce I/O interference among co-running applications at a server without any change in applications. At the kernel level, we extend the data structures of the storage stack to pass attributes of (co-running) applications through all the layers of the underlying storage stack spanning from the OS kernel to the SSD firmware…

FlashAbacus: A Self-governing Flash-based Accelerator for Low-power Systems

Published in The European Conference on Computer Systems (EuroSys), 2018

Energy efficiency and computing flexibility are some of the primary design constraints of heterogeneous computing. In this paper, we present FlashAbacus, a data-processing accelerator that self-governs heterogeneous kernel executions and data storage accesses by integrating many flash modules in lightweight multiprocessors. The proposed accelerator can simultaneously process data from different applications with diverse types of operational functions, and it allows multiple kernels to directly access flash without the assistance of a host-level file system or an I/O runtime library…

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

Published in 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2018

A modern GPU aims to simultaneously execute more warps for higher Thread-Level Parallelism (TLP) and performance. When generating many memory requests, however, warps contend for limited cache space and thrash the cache, which in turn severely degrades performance. To reduce such cache thrashing, we may adopt cache locality-aware warp scheduling, which gives higher execution priority to warps with higher potential of data locality. However, we observe that warps with high potential of data locality often incur far more cache thrashing or interference than warps with low potential of data locality…

ReveNAND: A Fast-Drift Aware Resilient 3D NAND Flash Design

Published in ACM Transactions on Architecture and Code Optimization (TACO), 2018

In this work, we first present an elastic read reference (VRef) scheme (ERR) for reducing such errors in ReveNAND, our fast-drift aware 3D NAND design. To address the inherent limitation of the adaptive VRef, we introduce a new intra-block page organization (hitch-hike) that can enable stronger error correction for the error-prone pages. In addition, we propose a novel reinforcement-learning-based smart data refill scheme (iRefill) to counter the impact of fast-drift with minimum performance and hardware overhead. Finally, we present the first analytic model to characterize fast-drift and evaluate its system-level impact…

TraceTracker: Hardware/Software Co-Evaluation for Large-Scale I/O Workload Reconstruction

Published in IEEE International Symposium on Workload Characterization (IISWC), 2017

Block traces are widely used for system studies, model verifications, and design analyses in both industry and academia. While such traces include detailed block access patterns, existing trace-driven research unfortunately often fails to find true-north due to a lack of runtime contexts such as user idle periods and system delays, which are fundamentally linked to the characteristics of target storage hardware. In this work, we propose TraceTracker, a novel hardware/software co-evaluation method that allows users to reuse a broad range of the existing block traces by keeping most of their execution contexts and user scenarios while adjusting them with new system information…

Understanding System Characteristics of Online Erasure Coding on Scalable, Distributed and Large-Scale SSD Array Systems

Published in IEEE International Symposium on Workload Characterization (IISWC), 2017

Large-scale systems with arrays of solid state disks (SSDs) have become increasingly common in many computing segments. To make such systems resilient, we can adopt erasure coding such as Reed-Solomon (RS) code as an alternative to replication because erasure coding can offer a significantly lower storage cost than replication. To understand the impact of using erasure coding on system performance and other system aspects such as CPU utilization and network traffic, we build a storage cluster consisting of approximately one hundred processor cores with more than fifty high-performance SSDs…
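
The storage-cost argument for erasure coding over replication can be made concrete with a small calculation. The sketch below is illustrative only: the RS(6, 3) layout and 3-way replication are assumed example configurations, not the settings evaluated in the paper. An RS(k, m) code stores k data chunks plus m parity chunks, so its raw-to-usable storage ratio is (k + m) / k.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per usable byte with n-way replication."""
    return float(copies)


def rs_overhead(k: int, m: int) -> float:
    """Raw bytes stored per usable byte with an RS(k, m) code
    (k data chunks + m parity chunks per stripe)."""
    return (k + m) / k


# Assumed example configurations (not taken from the paper).
rep = replication_overhead(3)   # tolerates the loss of 2 copies
rs = rs_overhead(6, 3)          # tolerates the loss of any 3 chunks
print(f"3-way replication: {rep:.1f}x raw storage")
print(f"RS(6, 3):          {rs:.1f}x raw storage")
```

Under these assumed parameters the erasure-coded layout needs half the raw capacity of triple replication; the cost is paid in encoding/decoding CPU work and network traffic, which are among the system aspects the study measures.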

An In-depth Performance Analysis of Many-Integrated Core for Communication Efficient Heterogeneous Computing

Published in IFIP International Conference on Network and Parallel Computing (NPC), 2017

Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip to offer high degrees of parallelism. The parallel user applications executed across many cores that exist in one or more MICs require a series of work related to data sharing and synchronization with the host. In this work, we build a real CPU+MIC heterogeneous cluster and analyze its performance behaviors by examining different communication methods such as message passing method and remote direct memory accesses…

Enabling Realistic Logical Device Interface and Driver for NVM Express Enabled Full System Simulations

Published in IFIP International Conference on Network and Parallel Computing (NPC) and Invited for International Journal of Parallel Programming (IJPP), 2017

In this work, we implement an NVMe disk and controller to enable a realistic storage stack of next generation interfaces and integrate them into gem5 and a high-fidelity solid state disk simulation model. We verify the functionalities of NVMe that we implemented, using a standard user-level tool, called NVMe command line interface…

SimpleSSD: Modeling Solid State Drive for Holistic System Simulation

Published in IEEE Computer Architecture Letters (CAL), 2017

Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with traditional SSD models often requires unreasonably long runtimes and excessive computational resources. In this work, we propose SimpleSSD, a high-fidelity simulator that models all detailed characteristics of hardware and software…

Couture: Tailoring STT-MRAM for Persistent Main Memory

Published in USENIX Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2016

In this work, we present Couture – a main memory design using tailored STT-MRAM that can offer a storage density comparable to DRAM and high performance with low-power consumption. In addition, we propose an intelligent data scrubbing method (iScrub) to ensure data integrity with minimum overhead…

ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture

Published in USENIX Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2016

In this paper, we propose a hybrid non-uniform cache architecture (NUCA) by employing STT-MRAM as a read-oriented on-chip storage. The key observation here is that many cache lines in LLC are only touched by read operations without any further write updates. These cache lines, referred to as singular-writes, can be internally migrated from SRAM to STT-MRAM in our hybrid NUCA. Our approach can significantly improve the system performance by avoiding many cache read misses with the larger STT-MRAM cache blocks, while it maintains the cache lines requiring write updates in the SRAM cache…

An In-Depth Study of Next Generation Interface for Emerging Non-Volatile Memories

Published in IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2016

Non-Volatile Memory Express (NVMe) is designed with the goal of unlocking the potential of low-latency, random-access, memory-based storage devices. Specifically, NVMe employs various rich communication and queuing mechanisms that can ideally schedule four billion I/O instructions for a single storage device. To explore NVMe with assorted user scenarios, we model diverse interface-level design parameters such as PCI Express, NVMe protocol, and different rich queuing mechanisms by considering a wide spectrum of host-level system configurations. In this work, we also assemble a comprehensive memory stack with different types of emerging NVM technologies, which can give us detailed NVMe related statistics like I/O request lifespans and I/O thread-related parallelism…
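
The "four billion I/O instructions" figure follows from NVMe's queue limits. As a quick check, the sketch below assumes the commonly cited upper bounds of roughly 64K queues with up to 64K entries each; these round numbers are an assumption for illustration and ignore details such as the admin queue and the one slot a circular queue leaves unused.

```python
# Assumed round-number limits (commonly cited for NVMe, not from the abstract).
MAX_QUEUES = 2 ** 16             # about 64K queues per device
MAX_ENTRIES_PER_QUEUE = 2 ** 16  # about 64K commands per queue


def max_outstanding_commands(queues: int = MAX_QUEUES,
                             entries: int = MAX_ENTRIES_PER_QUEUE) -> int:
    """Upper bound on commands that could be queued at once."""
    return queues * entries


total = max_outstanding_commands()
print(f"{total:,} commands (~{total / 1e9:.1f} billion)")
# prints "4,294,967,296 commands (~4.3 billion)"
```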

DUANG: Fast and Lightweight Page Migration in Asymmetric Memory Systems

Published in IEEE Symposium on High Performance Computer Architecture (HPCA), 2016

In this paper, we propose a novel resistive memory architecture sharing a set of row buffers between a pair of neighboring banks. It enables two attractive techniques: (1) migrating memory pages between slow and fast banks with little performance overhead and (2) adaptively allocating more row buffers to busier banks based on memory access patterns…

Integrating 3D Resistive Memory Cache into GPGPU for Energy-Efficient Data Processing

Published in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015

In this work, we redesign the shared last-level cache (LLC) of GPU devices by introducing non-volatile memory (NVM), which can address the cache thrashing issues with low energy consumption. Specifically, we investigate two architectural approaches: one employs a 2D planar resistive random-access memory (RRAM) as our baseline NVM-cache, and the other employs a 3D-stacked RRAM technology. Our baseline NVM-cache replaces the SRAM-based L2 cache with RRAM of similar area size…

NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures

Published in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015

In this work, NVMMU unifies two discrete software stacks (one for the SSD and the other for the GPU) in two major ways. While a new interface provided by our NVMMU directly forwards file data between the GPU runtime library and the I/O runtime library, it supports non-volatile direct memory access (NDMA) that pairs those GPU and SSD devices via physically shared system memory blocks. This unification in turn can eliminate unnecessary user/kernel-mode switching, improve memory management, and remove data copy overheads…

OpenNVM: An Open-Sourced FPGA-based NVM Controller for Low Level Memory Characterization

Published in IEEE International Conference on Computer Design (ICCD), 2015

In this paper, we present OpenNVM, an open-sourced, highly configurable FPGA-based evaluation/characterization platform for various NVM technologies. Through our OpenNVM, this work reveals important low-level NVM characteristics, including i) static and dynamic latency disparity, ii) error rate variation, iii) power consumption behavior, and iv) interrelationship between frequency and NVM operational current. In addition, we also examine state-of-the-art write-once-memory (WOM) codes on a real NVM device and study diverse system-level performance impacts based on our findings…

CoDEN: A Hardware/Software CoDesign Emulation Platform for SSD-Accelerated Near Data Processing

Published in IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2015

For the past few decades, solid state disks (SSDs) have significantly revamped their internal system architecture by employing more compute resources, multiple data channels, and tens or hundreds of non-volatile memory (NVM) packages. These ample internal resources in turn enable modern SSDs to accelerate near data processing. While the prior simulation-based work uncovered potential benefits of offloading the computation from a host to the SSDs, their analytical models make several assumptions that ignore not only detailed…

Power, Energy and Thermal Considerations in SSD-Based I/O Acceleration

Published in USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2014

Solid State Disks (SSDs) have risen to prominence as an I/O accelerator with low power consumption and high energy efficiency. In this paper, we question some common assumptions regarding SSDs’ operating temperature, dynamic power, and energy consumption through extensive empirical analysis. We examine three different real high-end SSDs that respectively employ multiple channels, cores, and flash chips. Our evaluations reveal that the dynamic power consumption of a many-resource SSD is, on average, 5x and 4x worse than that of an enterprise-scale SSD and HDD, respectively…