PhD Defense

Daniel Lin-Kit Wong

Topic: Machine Learning for Flash Caching in Bulk Storage Systems

Slides | Thesis

Location

Panther Hollow Conference Room, 4th floor (CIC 4105) (directions)
Collaborative Innovation Center (map) (garage)

Zoom Meeting Details

Meeting ID: 980 3128 2389

Passcode: 620147

Link: https://cmu.zoom.us/j/98031282389?pwd=bXW6LQM2XPTLg196GzZ9Dai1xiPFH3.1

Abstract

Flash caches are used to reduce peak backend load for throughput-constrained data center services, reducing the total number of backend servers required. Bulk storage systems are a prominent large-scale example: backed by high-capacity but low-throughput hard disks, they use flash caches to provide a cost-effective storage layer underlying everything from blobstores to data warehouses.

However, flash has limited write endurance, so flash caches must limit their write rate to avoid premature wear-out. They do so via admission policies that filter cache insertions so as to maximize the workload-reduction value of each write.

I evaluate and demonstrate potential uses of ML in place of traditional heuristic cache management policies for flash caches in bulk storage systems. The most successful elements of my research are embodied in a flash cache system called Baleen, which uses coordinated ML admission and prefetching to reduce peak backend load. After learning painful lessons with early ML policy attempts, I exploit a new cache residency model (episodes) to guide model training. I focus on optimizing an end-to-end metric (Disk-head Time) that measures backend load more accurately than IO or byte miss rate. Evaluation using 7-day Meta traces from 7 storage clusters shows Baleen reducing Peak Disk-head Time (and backend hard disks required) by 12% over state-of-the-art policies for a fixed flash write rate.
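To illustrate the episode idea described above, here is a minimal sketch of grouping accesses to a block into episodes. All names (`build_episodes`, `eviction_age`) and the fixed-threshold gap rule are assumptions for illustration; Baleen's actual episode construction is defined in the thesis and code release.

```python
# Hedged sketch: split each block's access stream into "episodes" --
# runs of accesses close enough together that the block would plausibly
# have stayed resident in cache. Gaps longer than an assumed eviction
# age start a new episode. Illustrative only, not Baleen's exact logic.
from collections import defaultdict

def build_episodes(accesses, eviction_age):
    """accesses: iterable of (timestamp, block_id).
    Returns block_id -> list of episodes, each a list of timestamps."""
    by_block = defaultdict(list)
    for ts, blk in sorted(accesses):
        by_block[blk].append(ts)

    episodes = defaultdict(list)
    for blk, times in by_block.items():
        current = [times[0]]
        for ts in times[1:]:
            if ts - current[-1] > eviction_age:
                # Gap too long: the block would have been evicted,
                # so this access begins a new episode.
                episodes[blk].append(current)
                current = [ts]
            else:
                current.append(ts)
        episodes[blk].append(current)
    return episodes
```

Training on whole episodes (rather than individual hits) lets an admission model weigh the full benefit of caching a block for one residency against the flash write it costs.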

I present a TCO (total cost of ownership) formula quantifying the costs of additional flash writes against reductions in Peak Disk-head Time in terms of flash drives and hard disks needed. Baleen-TCO chooses optimal flash write rates and reduces estimated TCO by 17%.
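The shape of such a trade-off can be sketched as follows. This is a toy model, not the thesis's actual formula: the cost constants, the diminishing-returns curve in `peak_dt`, and all function names are assumptions made here for illustration.

```python
# Hypothetical sketch of a flash-caching TCO trade-off: more flash writes
# lower peak disk-head time (fewer hard disks needed) but wear out flash
# faster (more flash cost). Constants and curves are illustrative only.

def estimated_tco(peak_disk_head_time, flash_write_rate,
                  disk_head_time_per_hdd=1.0,       # assumed serving capacity of one HDD
                  hdd_cost=250.0,                   # assumed cost per hard disk
                  flash_cost_per_write_rate=40.0):  # assumed cost per unit write rate
    """Total cost = hard disks to serve peak load + flash consumed by writes."""
    hdds_needed = peak_disk_head_time / disk_head_time_per_hdd
    return hdds_needed * hdd_cost + flash_write_rate * flash_cost_per_write_rate

def peak_dt(write_rate):
    # Toy diminishing-returns model of how admitting more data
    # (a higher flash write rate) reduces peak disk-head time.
    return 100.0 / (1.0 + 0.5 * write_rate)

# Sweep candidate write rates and pick the TCO-minimizing one,
# analogous in spirit to how Baleen-TCO chooses a flash write rate.
best = min(range(1, 61), key=lambda w: estimated_tco(peak_dt(w), w))
```

The point of the sketch is that neither extreme is optimal: writing nothing to flash leaves peak disk-head time (and HDD count) high, while writing everything burns flash endurance for diminishing load reduction.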

Workloads change over time, requiring that caches adapt to maintain performance. I present a strategy for peak load reduction that adapts selectivity to load levels, and I evaluate workload drift and its impact on ML policy performance using 30-day Meta traces.

Baleen is the result of substantial exploration and experimentation with ML for caching. I present lessons learned from additional strategies considered and explain why they saw limited success on our workloads. These include improvements for ML eviction and more advanced ML models.

Code and traces are available via https://www.pdl.cmu.edu/CILES/.

Thesis Committee

Gregory R. Ganger (Chair)

Nathan Beckmann

David G. Andersen

Daniel S. Berger (Microsoft Research / University of Washington)