Undergraduate researcher wins first place at the ACM Student Research Competition
CS undergraduate Shixin Song competed in the Student Research Competition at the 2021 International Symposium on Microarchitecture (MICRO), taking first place among undergraduates. Shixin has been working on this project since summer 2021 in Prof. Baris Kasikci’s group. Her project provides a new mechanism to speed up data centers by reducing stalls from missed instruction predictions.
Data center processors crunch through massive sets of instructions from the large-scale modern applications they support. To make this high-volume operation possible, the processors make use of a technique called fetch-directed instruction prefetching to gather certain instructions ahead of when they’re needed. This instruction prefetching mechanism falls significantly short of ideal performance, however, largely as a consequence of the prefetcher needing to make correct predictions about what to grab next.
When a program branches, the prefetcher is put in the position of needing a next instruction without knowing which one is correct yet. A cache called the Branch Target Buffer (BTB) holds these predicted results, and if it provides the wrong branch then the system has had a BTB miss. These misses, beyond being a lost opportunity for predictive performance boost, actually incur a time loss while the processor collects the correct instruction from memory.
“Data center applications spend a large percentage of pipeline slots in waiting for the processor front-end to fetch and return instructions,” Song said in her presentation. “Even improving 1% of these front-end stalls would save millions of dollars.”
A perfect BTB with no misses could speed up data center operations by 65%, Song says, but existing techniques fell far short of the mark. Other strategies to mitigate these misses have produced less than 1.4% speedup with BTB prefetching, or even under 0.7% speedup with state-of-the-art BTB replacement strategies. BTB replacement is a set of methods that analyze the contents of the BTB and replace, or “evict,” stored instruction targets that are the least likely to be used again.
Song devised a new type of BTB replacement mechanism to tackle this gap, which she calls Thermometer.
“I show that existing BTB replacement mechanisms can not identify the wide variance in the branch access pattern of modern data center applications,” Song writes. To address this limitation, her replacement mechanism makes use of hardware-software codesign to distinguish different branch access behavior. The mechanism has a profiling phase that identifies which branches are taken most often, and bases the eviction policy on different categories.
The end result achieves near-ideal BTB performance in nine commonly used data center applications, for an average speedup of 9% and maximum of 65%. The strategy also incurs no additional processing overhead.