Domain-Specific Architectures for Deep Neural Networks Keynote Speaker: David Patterson - Google Brain
Turing Award Lecture (Symphony Hall)
Domain-Specific Architectures for Deep Neural Networks
David Patterson - Google Brain
With the ending of Moore's Law, many computer architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. The Tensor Processing Unit (TPU), deployed in Google datacenters since 2015, is a custom chip that accelerates deep neural networks (DNNs). We compare the TPU to contemporary server-class CPUs and GPUs deployed in the same datacenters. Our benchmark workload, written using the high-level TensorFlow framework, uses production DNN applications that represent 95% of our datacenters’ DNN demand. The TPU is an order of magnitude faster than contemporary CPUs and GPUs and its relative performance per Watt is even larger. We also describe the next two generations of TPUs, which are designed for training.
Don’t Use a Single Large Systolic Array, Use Many Small Ones Instead
H. T. Kung - Harvard University
It is known that systolic arrays can efficiently implement matrix multiplications in deep learning computations. For high array utilization, we describe how we can (1) pack sparse filter matrices for small systolic arrays [ASPLOS 2019], and (2) employ small arrays to accommodate heterogeneous workloads [ASAP 2019], such as the Transformer for NLP. Scheduling-wise, we will tile matrices according to the dimensions of the given systolic arrays and then pipe the resulting tiles into the arrays for processing. We discuss an ongoing effort on the development of a parallel computing system composed of multiple memory and logic blocks as well as their switching structure in support of the “tile and pipe” computation model.
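The "tile and pipe" idea can be illustrated with a minimal software sketch. It is not the speaker's system: the function names (`tiles`, `tile_and_pipe`), the zero-padding of edge tiles, and the use of a plain matrix multiply as a stand-in for one pass through a small systolic array are all assumptions made for illustration.

```python
def matmul(a, b):
    """Reference dense matrix multiply on lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def tiles(weights, tile_rows, tile_cols):
    """Yield (row_offset, col_offset, tile) blocks sized for one small array.

    Edge tiles are zero-padded so every tile has exactly the array's shape.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            tile = [[weights[r][c] if r < n_rows and c < n_cols else 0
                     for c in range(c0, c0 + tile_cols)]
                    for r in range(r0, r0 + tile_rows)]
            yield r0, c0, tile


def tile_and_pipe(x, weights, tile_rows, tile_cols):
    """Compute x @ weights by streaming fixed-size tiles through an array model."""
    n_rows, n_cols = len(weights), len(weights[0])
    out = [[0] * n_cols for _ in x]
    for r0, c0, tile in tiles(weights, tile_rows, tile_cols):
        # Take the columns of x that match this tile's rows, zero-padded.
        x_slice = [[row[k] if k < n_rows else 0
                    for k in range(r0, r0 + tile_rows)]
                   for row in x]
        # Hypothetical stand-in for one pass through a small systolic array.
        partial = matmul(x_slice, tile)
        # Accumulate the partial sums into the matching output columns.
        for i, partial_row in enumerate(partial):
            for j, value in enumerate(partial_row):
                if c0 + j < n_cols:
                    out[i][c0 + j] += value
    return out
```

Because each tile is independent apart from the final accumulation, the tiles could in principle be dispatched to many small arrays at once; the sketch processes them sequentially only to keep the example short.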
Automated Building of Safe and Robust Intelligent Systems
Farinaz Koushanfar - UCSD
The fourth industrial revolution, shaped by machine learning (ML) algorithms, is underway. However, the wide-scale adoption of emerging intelligent learning methodologies is hindered by security, privacy, and safety considerations in sensitive scenarios such as smart transportation, healthcare, warfare, and financial systems. In this talk, I advocate automated end-to-end co-design of algorithms, hardware, software, and data for building safe and assured machine learning systems. The presentation is centered on model explainability and internal characterization of hierarchical learning models such as deep neural networks. I discuss important applications of the extracted characteristics in IP protection, Trojan detection, and thwarting adversarial attacks. These applications can be systematically and automatically customized for various platforms using our holistic end-to-end co-design methodology. I conclude by outlining the challenges and opportunities ahead.
Teaching an Old Cache New Tricks: Learning Better Caching Policies Online
Nathan Beckmann - CMU
Caches are pervasive in computer systems and often determine overall system performance. Unfortunately, finding a caching policy that performs well is hard because applications vary so much in how they access data. Since traditional heuristics like recency and frequency leave too much performance on the table, caching policies are often hand-tuned in practice. But hand-tuning is unattractive because it takes significant effort and also makes caches fragile to changes in application behavior.
This talk will cover recent work on caching policies that learn and improve themselves online, without any hand-tuning or heuristics. Unsupervised reinforcement learning is not very effective in caching. For machine learning to succeed, caching systems must combine learning with the structure of caching problems. We will discuss two approaches: our recent work based on Bayesian inference, and policies that learn to imitate optimal cache replacement. We will show that these policies outperform the state-of-the-art by a large margin.
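As background for "optimal cache replacement", the sketch below compares LRU with Belady's MIN policy, the classic offline optimum that evicts the object whose next use lies furthest in the future. It is not the learned policy from the talk; it only shows the target that an imitating policy would try to approximate. The function names and the example trace are assumptions for illustration.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses for a least-recently-used cache of the given capacity."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used key
            cache[key] = True
    return misses


def belady_misses(trace, capacity):
    """Count misses for Belady's MIN, which needs the whole trace in advance."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of its next use
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the key that will be reused furthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[key] = next_use[i]
    return misses
```

On a cyclic trace such as `"abcabcabc"` with capacity 2, LRU misses on every access, while MIN keeps one object resident across each cycle. MIN cannot be deployed directly because it requires future knowledge, which is why a practical policy can only learn to imitate it.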
Deep Learning Acceleration via Low Precision Computing
Zhaoxia (Summer) Deng - Facebook
Machine learning models are becoming increasingly complicated and resource-demanding as the deep learning area grows rapidly. These models promise higher accuracy on various learning tasks, but real-time inference still needs to meet application-level latency or throughput requirements despite the practical constraints of available compute resources. Low-precision computing can effectively boost inference efficiency while maintaining similar accuracy. It can mitigate memory bandwidth requirements and exploit the high performance of low-precision arithmetic on existing CPU platforms as well as future customized accelerators. In this talk, I'll present the low-precision computing techniques we have explored to optimize deep learning workloads in Facebook datacenters. Furthermore, I'll talk about opportunities for model co-design and guidance on accelerator design exposed by numerical optimizations.
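A minimal sketch of one common low-precision technique, symmetric linear quantization to 8-bit integers, is given here for orientation. The abstract does not say which schemes the speaker uses, so the scheme, the function names, and the per-vector scale are assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers with one shared scale (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and their scale."""
    return [q * scale for q in quantized]


def int8_dot(a, b):
    """Dot product computed on integers, rescaled to a float at the end."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    # The accumulation uses only integer arithmetic, as low-precision hardware would.
    accumulator = sum(x * y for x, y in zip(qa, qb))
    return accumulator * scale_a * scale_b
```

Each value is stored in one byte instead of four, which is where the reduction in memory bandwidth comes from; the cost is a rounding error of at most half a quantization step per value.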
As the performance of computer systems stagnates due to the end of Moore’s Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification).
Search-Based Approaches to Accelerate Deep Learning
Zhihao Jia - Stanford University
Current deep learning (DL) frameworks accelerate DL computation by applying a sequence of heuristic optimizations designed for common DNN models and hardware architectures. These frameworks generally miss subtle optimization opportunities that are specific to particular models and hardware. To address this limitation, I will present our recent work on designing search-based approaches to accelerate deep learning. To automatically optimize DNN computation on a specific hardware platform, we first design a comprehensive search space of possible deployment strategies, and then use efficient search algorithms to discover optimized strategies in that space. I will present two search-based DL systems: (1) FlexFlow automatically discovers fast strategies to parallelize DNN training, and outperforms existing data/model parallelism by up to 3.3x; and (2) XFlow is a DNN computation graph optimizer with automatically generated graph substitutions, which outperforms existing rule-based graph optimizers by up to 2.9x. I will conclude the talk by discussing the challenges and research opportunities in building end-to-end automated deep learning frameworks.
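The general pattern of "define a search space, then search it with a cost model" can be shown with a toy sketch. This is not FlexFlow's or XFlow's algorithm: the three-layer network, the cost numbers, the cost model, and the exhaustive search are all hypothetical and chosen only to keep the example small.

```python
from itertools import product

# Hypothetical per-layer costs in arbitrary units (not measured data).
LAYERS = [
    {"name": "conv", "compute": 8.0, "params": 1.0},
    {"name": "fc", "compute": 2.0, "params": 16.0},
    {"name": "softmax", "compute": 1.0, "params": 4.0},
]
STRATEGIES = ("data", "model")  # per-layer parallelization choices
NUM_DEVICES = 4


def layer_cost(layer, strategy):
    """Toy cost model: compute time split across devices plus communication."""
    compute = layer["compute"] / NUM_DEVICES
    if strategy == "data":
        comm = layer["params"]          # assumed cost of synchronizing parameters
    else:
        comm = layer["compute"] * 0.5   # assumed cost of exchanging activations
    return compute + comm


def plan_cost(plan):
    """Total cost of one strategy per layer, with a penalty for switching."""
    cost = sum(layer_cost(layer, s) for layer, s in zip(LAYERS, plan))
    cost += sum(0.5 for a, b in zip(plan, plan[1:]) if a != b)
    return cost


def search():
    """Exhaustively enumerate the search space and return the cheapest plan."""
    return min(product(STRATEGIES, repeat=len(LAYERS)), key=plan_cost)
```

Under these made-up costs the search picks data parallelism for the compute-heavy layer and model parallelism for the parameter-heavy ones, which a single uniform heuristic would miss. Real search spaces are far too large to enumerate, which is why the abstract stresses efficient search algorithms.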