Slot 4

Memory Systems Resilience
Rakesh Kumar, University of Illinois at Urbana Champaign, USA

Abstract

Memory is fast becoming the power bottleneck, and therefore the performance bottleneck, of modern and emerging computer systems. For example, memory consumes 30% of total power in HPC systems, on average, and up to 40% of total power in data centers. If conventional DDR3 memory technology were deployed in future exascale systems, memory power consumption would amount to 2.5X the system's total power budget. At the other extreme, the Vmin of on-chip memory often dictates the power of low-power systems. It has also been identified as one of the primary bottlenecks in enabling near-threshold computing.

In this tutorial, we will discuss how memory power in a computer system is strongly tied to the technique used to keep memory reliable. For example, operating voltage is kept relatively high for on-chip memories to keep them reliable, but this leads to high on-chip memory power consumption. Similarly, each memory line is striped over 18 to 36 (sometimes even 72!) DRAM chips in server systems so that DRAM chip failure(s) can be tolerated. This increases memory system power.

We observe that the primary reason why today's memory systems are expensive in terms of power is that memory resilience architectures for such systems are designed and optimized for the worst case. For example, today's HPC memories are protected using ECC that can correct the maximum number of errors that can accumulate over the lifetime of the system.

Since most memory will never accumulate the worst-case number of faults, this worst-case design is unnecessarily expensive. Similarly, today's HPC memories protect each memory word (or memory channel) using dedicated ECC resources even though only a fraction of memory words (or memory channels) will see a fault during the lifetime of a computer system. For on-chip memories, all lines are protected using a strong ECC even though only a small fraction of memory cells will see faults, even at low voltages.

We will discuss how these inefficiencies can be targeted through memory resilience solutions that are optimized for the common case, with a fallback mechanism provided for the worst case. We will discuss several common-case optimized techniques targeting both cache memory and main memory systems.

I will further argue that the philosophy of common case optimization can be extended beyond memory resilience.

Bio

Rakesh Kumar is an Associate Professor in the Electrical and Computer Engineering Department at the University of Illinois at Urbana Champaign and a Co-Founder and Chief Architect at Hyperion Core, Inc. His current research interests are in computer architecture, low power and error resilient computer systems, and approximate computing. 

His most significant research contributions are in the areas of multi-core architecture and design (his past research on heterogeneous multi-core architectures and conjoined-core architectures has directly influenced processor products and roadmaps from several companies), peak power management (he co-developed the first techniques for peak power management for single-core and multi-core processors), stochastic and approximate computing (he co-developed the first set of techniques for graceful voltage-reliability tradeoffs in hardware and functional units; he also co-developed the concept of recovery-driven design), algorithm-based fault tolerance (he co-developed the first ABFT techniques targeting sparse algebra; he also led the development of several algorithmic techniques to build error-tolerant versions of applications), low power computing (he introduced techniques such as software canaries, power-balanced pipelines, and correction prediction), and memory systems for high performance computing (he developed several novel techniques for reducing the power cost of building reliable memory systems). His research contributions have been covered by news outlets as varied as the BBC, HPCWire, IEEE Spectrum, and Slashdot.

His research recognitions include several best paper awards and best paper award nominations, an ARO Young Investigator Award, an Arnold O. Beckman Research Award, an FAA Creative Research Award, a UCSD CSE Best Dissertation Award, and an IBM PhD Fellowship. Teaching recognitions include the Ronald W. Pratt Faculty Outstanding Teaching Award and multiple appearances on UIUC's List of Teachers Ranked as Excellent. Advising recognitions include the Engineering Council Outstanding Advisor Award.

Rakesh has a BS from IIT Kharagpur and a PhD from the University of California at San Diego.

