U-M researchers present three papers at ISCA 2021

Fourteen researchers presented work on accelerating genome sequence alignment, fast multi-GPU systems, and more reliable data center caches.


Fourteen researchers from U-M were co-authors of papers presented at the 2021 International Symposium on Computer Architecture, which is among the top-tier conferences in the field. The event is sponsored by the Association for Computing Machinery’s Special Interest Group on Computer Architecture (ACM SIGARCH) and the Institute of Electrical and Electronics Engineers Computer Society. In addition, Prof. Reetuparna Das served as session chair for the conference’s session on reliability and security. Read more about the three papers below:

Accelerated Seeding for Genome Sequence Alignment with Enumerated Radix Trees
Arun Subramaniyan, Jack Wadden, Kush Goliya, Nathan Ozog, Xiao Wu, Satish Narayanasamy, David Blaauw, Reetuparna Das (all of Michigan)

Genomics can transform precision health over the next decade. A genome is essentially a long string of DNA base pairs. During primary analysis, a sequencing instrument splits a DNA strand into billions of short strings called reads. Secondary analysis aligns the reads to a reference genome and determines genetic variants in the analyzed genome compared to the reference. Read alignment is a time-consuming step in this analysis. The most widely used read alignment software follows a seed-and-extend paradigm whose seeding step is a major bottleneck, contributing around 40% of overall execution time when aligning whole human genome reads.

The researchers propose a novel indexing data structure named the Enumerated Radix Tree (ERT) and a design for a custom seeding accelerator based on it. ERT improves the bandwidth efficiency of the sequencing software by 4.5× while guaranteeing output 100% identical to the original software and still operating within 64GB of DRAM. Overall, the proposed seeding accelerator improves seeding throughput by 3.3×. When combined with seed-extension accelerators, the researchers demonstrated a 2.1× improvement in overall read alignment throughput. The software implementation of ERT is open source and part of the Broad Institute and Intel's official BWA-MEM2 software (https://github.com/bwa-mem2/bwa-mem2/tree/ert). BWA-MEM is the de facto genomics read alignment tool used by researchers and practitioners worldwide.
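As a rough intuition for index-based seeding, the idea can be pictured as walking a radix tree (here, a simple character trie) built over the reference's k-mers to find where substrings of a read occur exactly in the reference. This is a toy sketch only: the actual ERT uses an enumerated radix-tree layout engineered for DRAM bandwidth efficiency, which this code does not model.

```python
# Toy character-trie index over reference k-mers. Illustration only: the
# ERT described above uses an enumerated radix-tree layout designed for
# DRAM bandwidth efficiency, which this sketch does not model.

def build_trie(reference, k):
    """Index every length-k substring (k-mer) of the reference in a trie,
    recording reference positions at the final node of each k-mer."""
    root = {}
    for i in range(len(reference) - k + 1):
        node = root
        for ch in reference[i:i + k]:
            node = node.setdefault(ch, {})
        node.setdefault("pos", []).append(i)
    return root

def longest_match(trie, read):
    """Walk the trie along the read and return the reference positions of
    the deepest k-mer node reached (the longest exact prefix match)."""
    node, positions = trie, []
    for ch in read:
        if ch not in node:
            break
        node = node[ch]
        positions = node.get("pos", positions)
    return positions

trie = build_trie("ACGTACGTGACCA", 4)
print(longest_match(trie, "ACGTG"))  # → [0, 4]: "ACGT" occurs at offsets 0 and 4
```

Each matching position then becomes a candidate "seed" that a downstream seed-extension step would verify and extend into a full alignment.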

Efficient Multi-GPU Shared Memory via Automatic Optimization of Fine-Grained Transfers
Harini Muthukrishnan (Michigan); David Nellans, Daniel Lustig (NVIDIA); Jeffrey A. Fessler (Michigan), Thomas Wenisch (Michigan)

Despite continuing research into inter-GPU communication mechanisms, extracting performance from multi-GPU systems remains a significant challenge. To address these challenges, the researchers propose PROACT, a system that enables remote memory transfers with the programmability and pipelining advantages of peer-to-peer stores while achieving interconnect efficiency that rivals bulk direct memory access (DMA) transfers. PROACT enables interconnect-friendly data transfers while hiding transfer latency by pipelining transfers with kernel execution. The paper describes both hardware and software implementations of PROACT and demonstrates the effectiveness of a PROACT software prototype on three generations of GPU hardware and interconnects.
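The benefit of overlapping fine-grained transfers with kernel execution can be sketched with a toy timeline model. The chunk counts and unit times below are hypothetical illustrations, not measurements of PROACT:

```python
# Toy timeline model of why pipelining fine-grained transfers with kernel
# execution hides latency. Hypothetical illustration only: the chunk counts
# and unit times below are made up, not measurements of PROACT.

def bulk_time(n_chunks, compute, transfer):
    """Bulk DMA: the transfer can only begin once the whole kernel
    has finished producing its output."""
    return n_chunks * compute + n_chunks * transfer

def pipelined_time(n_chunks, compute, transfer):
    """Fine-grained pipelining: each chunk is sent as soon as it is
    produced, overlapping the next chunk's computation."""
    return compute + (n_chunks - 1) * max(compute, transfer) + transfer

print(bulk_time(8, 1.0, 1.0))       # → 16.0
print(pipelined_time(8, 1.0, 1.0))  # → 9.0 (transfers hidden under compute)
```

When per-chunk compute and transfer times are balanced, nearly all transfer time disappears behind computation, which is the latency-hiding effect the paragraph above describes.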

PROACT achieves a mean speedup of 3.0× over single GPU performance for 4-GPU systems, capturing 83% of available performance opportunity. On a 16-GPU NVIDIA DGX-2 system, the team demonstrated an 11.0× average strong-scaling speedup over single-GPU performance, 5.3× better than a bulk DMA-based approach.

Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications
Tanvir Ahmed Khan (Michigan); Dexin Zhang (USTC); Akshitha Sriraman (Michigan); Joseph Devietti (UPenn); Gilles A. Pokam (Intel); Heiner Litz (UCSC); Baris Kasikci (Michigan)

The deep software stacks in modern data center applications result in enormous instruction footprints, frequently causing instruction cache (I-cache) misses and degrading performance. Although many mechanisms have been proposed to mitigate these misses, they still fall short of ideal cache behavior and introduce significant hardware overheads. The researchers first examine why these mitigation mechanisms hurt performance for data center applications and find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced cache line evictions that existing replacement policies do not handle. This is a result of those policies' lack of knowledge about a data center application's complex program behavior.

To make existing replacement policies aware of these eviction-inducing program behaviors, the research team proposes Ripple, a novel software technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. They evaluate Ripple with nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve a speedup closer to that of an ideal I-cache.
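As a toy illustration of the profile-guided idea (a hypothetical sketch only; Ripple's actual profiling and its integration with replacement policies are more sophisticated), a simple LRU cache can be steered by a profile that marks lines unlikely to be reused:

```python
# Toy profile-guided cache replacement. Hypothetical sketch only: Ripple's
# actual profiling and policy integration are more sophisticated than this.

def simulate(accesses, capacity, profiled_dead):
    """LRU cache (front of list = least recently used) that preferentially
    evicts lines the profile marked as unlikely to be reused."""
    cache, hits = [], 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.remove(line)
            cache.append(line)
            continue
        if len(cache) == capacity:
            # Prefer evicting a profiled-dead line over the true LRU victim.
            dead = [l for l in cache if l in profiled_dead]
            cache.remove(dead[0] if dead else cache[0])
        cache.append(line)
    return hits

trace = ["A", "B", "C", "A", "B"]
print(simulate(trace, 2, set()))    # → 0 hits under plain LRU
print(simulate(trace, 2, {"C"}))    # → 1 hit when the profile marks C dead
```

With program context, the policy sacrifices the line the profile says is dead ("C") instead of a hot line, turning a miss into a hit — the same intuition, at toy scale, behind informing hardware replacement with profiled program behavior.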