Sat 8:30 a.m. - 8:40 a.m.
Opening Remarks
|
Sat 8:40 a.m. - 9:30 a.m.
Jeff Dean - Invited Talk
(Talk)
Jeff Dean
|
Sat 9:30 a.m. - 10:15 a.m.
Dawn Song - Invited Talk
(Talk)
Dawn Song
|
Sat 10:20 a.m. - 10:50 a.m.
Poster Session | Coffee Break
(Break)
|
Sat 10:55 a.m. - 11:05 a.m.
A code superoptimizer through neural Monte-Carlo tree search
(Spotlight)
There are many ways to turn a high-level program into a sequence of instructions consistent with that computation. Selecting the most performant such instruction sequence for a given piece of hardware (optimized compilation) is a central challenge of computer science. Optimizing compilers perform this task through a series of reductions and local transformations (e.g. register allocation, instruction scheduling, peephole optimization) driven by heuristics. A natural and well-explored avenue of research is to replace current hand-written heuristics with data-driven, automatically designed heuristics which may be obtained from machine learning. We propose a radically different approach, in which we view compilation as a combinatorial optimization problem that consists of finding the optimal (e.g. fastest executing or shortest) sequence of instructions subject to the constraint that it has the semantics of the specified program. We show how this problem can be practically framed as a finite Markov decision process, unlocking a rich space of potential algorithms from reinforcement learning. We implement one such algorithm in particular, an AlphaGo-like distributed neural Monte-Carlo tree search procedure, and demonstrate that it is able to directly generate optimized assembly. Unlike a traditional optimizing compiler, this approach does not rely on an existing library of optimizations to transform the code, but rather directly attempts to generate an optimal program instruction by instruction, taking into account effects including register allocation, instruction scheduling and operation fusion.
Wenda Zhou · Olga Solodova · Ryan Adams
|
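A minimal sketch of the finite-MDP framing described above: states are partial instruction sequences, actions append one instruction, and terminal reward combines semantic equivalence with a specification and program length. The toy instruction set, test-based equivalence check, and reward are stand-in assumptions, not the authors' implementation, and exhaustive enumeration stands in for the neural MCTS search.

import random
from dataclasses import dataclass, field

# Toy ISA: each instruction is a (mnemonic, operands) pair over two registers.
INSTRUCTIONS = [("mov", ("r0", "r1")), ("add", ("r0", "r1")),
                ("shl", ("r0", 1)), ("neg", ("r0",))]

def execute(program, r0, r1):
    """Interpret the toy program on concrete inputs (stand-in semantics)."""
    regs = {"r0": r0, "r1": r1}
    for op, args in program:
        if op == "mov":
            regs[args[0]] = regs[args[1]]
        elif op == "add":
            regs[args[0]] = regs[args[0]] + regs[args[1]]
        elif op == "shl":
            regs[args[0]] = regs[args[0]] << args[1]
        elif op == "neg":
            regs[args[0]] = -regs[args[0]]
    return regs["r0"]

SPEC = lambda r0, r1: 2 * (r0 + r1)          # specification to synthesize
TESTS = [(random.randint(-99, 99), random.randint(-99, 99)) for _ in range(32)]

@dataclass
class State:
    program: tuple = field(default_factory=tuple)   # partial instruction sequence

    def actions(self):
        return INSTRUCTIONS if len(self.program) < 6 else []

    def step(self, instr):
        return State(self.program + (instr,))

    def reward(self):
        # Terminal reward: correctness on the test set, minus a length penalty.
        correct = all(execute(self.program, a, b) == SPEC(a, b) for a, b in TESTS)
        return (10.0 - len(self.program)) if correct else 0.0

# Exhaustive enumeration of two-instruction programs stands in for neural MCTS.
best = max((State().step(i1).step(i2)
            for i1 in INSTRUCTIONS for i2 in INSTRUCTIONS),
           key=lambda s: s.reward())
print(best.program, best.reward())
|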
Sat 11:10 a.m. - 11:20 a.m.
Predicting Network Buffer Capacity for BBR Fairness
(Spotlight)
BBR is a newer TCP congestion control algorithm with promising features, but it can often be unfair to existing loss-based congestion control algorithms. This is because BBR's sending rate is dictated by static parameters that do not adapt well to dynamic and diverse network conditions. In this work, we introduce BBR-ML, an in-kernel ML-based tuning system for BBR, designed to improve fairness when in competition with loss-based congestion control. To build BBR-ML, we discretized the network condition search space and trained a model on 2,500 different network conditions. We then modified BBR to run an in-kernel model that predicts network buffer sizes and uses this prediction to choose parameter settings. Our preliminary evaluation results show that BBR-ML can improve fairness when in competition with Cubic by up to 30% in some cases.
Ibrahim Umit Akgun · Santiago Vargas · Andrew Burford · Michael McNeill · Michael Arkhangelskiy · Aruna Balasubramanian · Anshul Gandhi · Erez Zadok
|
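A rough sketch of the prediction step described above: a small model trained over discretized network conditions predicts a buffer-size class, which is then mapped to parameter settings. The features, class buckets, labels, and parameter table below are illustrative assumptions; the paper's in-kernel model and the actual BBR tuning knobs are not reproduced here.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic training set over a discretized grid of network conditions:
# features = (measured RTT ms, delivery rate Mbps, observed loss rate),
# label    = buffer-size bucket (0: shallow, 1: medium, 2: deep).
rtt = rng.uniform(5, 200, 2500)
rate = rng.uniform(1, 1000, 2500)
loss = rng.uniform(0, 0.05, 2500)
X = np.column_stack([rtt, rate, loss])
bdp = rate * 1e6 / 8 * rtt / 1e3                 # bandwidth-delay product, bytes
y = np.digitize(bdp, [5e4, 5e5])                 # stand-in bucketed label

clf = DecisionTreeClassifier(max_depth=6).fit(X, y)   # small enough for in-kernel use

# Hypothetical mapping from predicted buffer bucket to congestion-control parameters.
PARAMS = {0: {"cwnd_gain": 1.5, "probe_rtt_ms": 100},
          1: {"cwnd_gain": 2.0, "probe_rtt_ms": 200},
          2: {"cwnd_gain": 2.5, "probe_rtt_ms": 400}}

sample = np.array([[40.0, 100.0, 0.001]])        # current connection measurements
print(PARAMS[int(clf.predict(sample)[0])])
|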
Sat 11:25 a.m. - 11:35 a.m.
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression
(Spotlight)
Self-attention and feedforward layers in large-scale Transformer models are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by expressing weight matrices in an efficiently factorized form. Prior efforts used manual or heuristic decomposition settings without hardware-aware customization, resulting in poor hardware efficiency and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of tensor decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. We find that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
Jiaqi Gu · Ben Keller · Jean Kossaifi · Anima Anandkumar · Brucek Khailany · David Pan
|
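A small numerical illustration of the factorization idea above: a dense feedforward weight matrix is replaced by two low-rank tensor cores contracted with a single einsum, trading parameters for a chosen rank. The tensorization shape (32x24, 64x48) and rank 64 are arbitrary choices for illustration; HEAT's contribution is choosing them, and the contraction path, under a hardware cost model.

import numpy as np

d_in, d_out, rank = 768, 3072, 64
m1, m2 = 32, 24            # 32 * 24 == 768, one tensorization of d_in
n1, n2 = 64, 48            # 64 * 48 == 3072, one tensorization of d_out

# Two cores replace the dense (768 x 3072) weight matrix.
core_a = np.random.randn(m1, n1, rank) * 0.02
core_b = np.random.randn(m2, n2, rank) * 0.02

def factorized_linear(x):
    """Apply W[(i,j),(k,l)] = sum_r A[i,k,r] * B[j,l,r] without forming W."""
    xt = x.reshape(x.shape[0], m1, m2)
    y = np.einsum("bij,ikr,jlr->bkl", xt, core_a, core_b)
    return y.reshape(x.shape[0], n1 * n2)

x = np.random.randn(4, d_in)
print(factorized_linear(x).shape)                 # (4, 3072)
dense_params = d_in * d_out
factored_params = core_a.size + core_b.size
print(f"params: {dense_params} -> {factored_params} "
      f"({dense_params / factored_params:.1f}x fewer)")
|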
Sat 11:45 a.m. - 11:55 a.m.
Learning to Drive Software-Defined Storage
(Spotlight)
Thanks to the development of manufacturing technology, storage devices such as solid-state drives (SSDs) are becoming highly customizable to meet the ever-increasing demands on storage performance and capacity for different applications (i.e., software-defined storage). However, it is challenging to develop optimized storage devices with current human-driven systems-building approaches, due to the complicated storage stack. In this paper, we present learning-based approaches to facilitate the development of software-defined storage. To accelerate the manufacturing of efficient storage devices, we enable the automated learning of optimized hardware specifications for developing customized storage devices for specific application types. Our preliminary study shows that utilizing learning-based techniques to drive the development of software-defined storage is promising.
Jian Huang · Daixuan Li · Jinghan Sun
|
Sat 1:05 p.m. - 1:20 p.m.
Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration
(Spotlight)
Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high performance and energy efficiency. As systems grow in complexity, fine-tuning architectural parameters across multiple sub-systems (e.g., datapath, memory blocks in different hierarchies, interconnects, compiler optimization, etc.) quickly results in a combinatorial explosion of the design space. This makes domain-specific customization an extremely challenging task. Prior work explores using reinforcement learning (RL) and other optimization methods to automatically explore the large design space. However, these methods have traditionally relied on single-agent RL/ML formulations. It is unclear how scalable single-agent formulations are as we increase the complexity of the design space (e.g., full-stack System-on-Chip design). Therefore, we propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem. The key idea behind using MARL is the observation that parameters across different sub-systems are more or less independent, thus allowing a decentralized role to be assigned to each agent. We test this hypothesis by designing a domain-specific DRAM memory controller for several workload traces. Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines such as Proximal Policy Optimization and Soft Actor-Critic over different target objectives such as low power and latency. To this end, this work opens the pathway for new and promising research in MARL solutions for hardware architecture search.
Srivatsan Krishnan · Natasha Jaques · Shayegan Omidshafiei · Dan Zhang · Izzeddin Gur · Vijay Janapa Reddi · Aleksandra Faust
|
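A toy sketch of the decentralized formulation described above: one agent per sub-system, each independently choosing its own parameter with a simple epsilon-greedy rule, all sharing a single reward from a (here, synthetic) simulator. The parameter spaces, reward model, and learning rule are placeholders; the paper uses full MARL algorithms and real DRAM controller workload traces.

import random
from collections import defaultdict

# Each agent owns one sub-system parameter and its own action space.
SPACES = {"page_policy": ["open", "closed"],
          "queue_depth": [8, 16, 32, 64],
          "refresh_mode": ["per_bank", "all_bank"]}

def simulate(cfg):
    """Stand-in for the architecture simulator: returns negative latency."""
    latency = {"open": 1.0, "closed": 1.3}[cfg["page_policy"]]
    latency *= 1.0 + abs(cfg["queue_depth"] - 32) / 64
    latency *= {"per_bank": 0.95, "all_bank": 1.1}[cfg["refresh_mode"]]
    return -latency

q = {name: defaultdict(float) for name in SPACES}       # per-agent value tables
counts = {name: defaultdict(int) for name in SPACES}

for step in range(2000):
    eps = max(0.05, 1.0 - step / 1000)
    # Each agent acts independently on its own sub-system.
    cfg = {name: (random.choice(space) if random.random() < eps
                  else max(space, key=lambda a: q[name][a]))
           for name, space in SPACES.items()}
    reward = simulate(cfg)                               # shared team reward
    for name, action in cfg.items():                     # independent value updates
        counts[name][action] += 1
        q[name][action] += (reward - q[name][action]) / counts[name][action]

print({name: max(space, key=lambda a: q[name][a]) for name, space in SPACES.items()})
|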
Sat 1:30 p.m. - 1:45 p.m.
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
(Spotlight)
Cloud datacenters are growing exponentially in both number and size. This increase results in a surge of network activity that warrants better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter, while at the same time (ii) running on low-level hardware with the low latency required for effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the microsecond decision latency required for real-time inference. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and drops.
Benjamin Fuhrer · Yuval Shpigelman · Chen Tessler · Shie Mannor · Gal Chechik · Eitan Zahavi · Gal Dalal
|
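A compact sketch of the distillation step described above: the trained policy is queried on observed congestion-control states, and a small decision tree is fit to imitate its actions so inference reduces to a handful of branches. The feature set, the stand-in "teacher" policy, and the tree depth are assumptions for illustration only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# States observed by the CC agent: (RTT inflation, rate ratio, queueing delay us).
states = np.column_stack([rng.uniform(1.0, 3.0, 10000),
                          rng.uniform(0.2, 2.0, 10000),
                          rng.uniform(0.0, 500.0, 10000)])

def teacher_policy(s):
    """Stand-in for the trained RL network: 0 = decrease rate, 1 = hold, 2 = increase."""
    score = 1.2 * (s[:, 0] - 1.5) + 0.002 * s[:, 2] - 0.5 * (s[:, 1] - 1.0)
    return np.digitize(score, [-0.25, 0.25])

actions = teacher_policy(states)

# Distill into a binary decision tree: cheap, branch-only inference on the NIC.
student = DecisionTreeClassifier(max_depth=8).fit(states, actions)
agreement = (student.predict(states) == actions).mean()
print(f"student/teacher agreement: {agreement:.3f}, nodes: {student.tree_.node_count}")
|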
Sat 1:55 p.m. - 2:40 p.m.
Steve Keckler - Invited Talk
(Talk)
Stephen Keckler
|
Sat 1:55 p.m. - 2:25 p.m.
Poster Session | Coffee Break
(Break)
|
Sat 2:50 p.m. - 3:35 p.m.
Newsha Ardalani - Invited Talk
(Talk)
Newsha Ardalani
|
Sat 3:45 p.m. - 4:30 p.m.
Riyadh Baghdadi - Invited Talk
(Talk)
Riyadh Baghdadi
|
-
Towards Continually Learning Application Performance Models
(Poster - Recorded Presentation)
Machine learning-based performance models are increasingly being used to drive critical job scheduling and application optimization decisions. Traditionally, these models assume that the data distribution does not change as more samples are collected over time. However, owing to the complexity and heterogeneity of production HPC systems, they are susceptible to hardware degradation, replacement, and/or software patches, which can lead to drift in the data distribution that can adversely affect the performance models. To this end, we develop continually learning performance models that account for the distribution drift, alleviate catastrophic forgetting, and improve generalizability. Our best model was able to retain accuracy while learning the new data distribution induced by system changes, and demonstrated a 2X improvement in prediction accuracy over the whole data sequence in comparison to the naive approach.
Ray Sinurat · Sandeep Madireddy · Anurag Daram · Haryadi Gunawi · Robert Ross
|
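A simplified sketch of the continual-learning setting above: a performance model is updated incrementally as new samples stream in, so it can track a distribution shift (e.g., after a hardware change) rather than being trained once and frozen. The synthetic drift and the plain incremental regressor below are assumptions, not the authors' model or forgetting-mitigation method.

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()
model = SGDRegressor(learning_rate="constant", eta0=0.01)

def make_batch(step, n=64):
    """Job features -> runtime; the mapping shifts halfway through the stream."""
    X = rng.uniform(0, 1, (n, 4))
    w = np.array([3.0, 1.0, 0.5, 2.0]) if step < 50 else np.array([1.0, 3.0, 2.0, 0.5])
    y = X @ w + rng.normal(0, 0.05, n)
    return X, y

for step in range(100):
    X, y = make_batch(step)
    Xs = scaler.partial_fit(X).transform(X)
    if step > 0:
        err = np.mean(np.abs(model.predict(Xs) - y))   # online error reveals the drift
        if step in (49, 50, 51):
            print(f"step {step}: mae {err:.2f}")
    model.partial_fit(Xs, y)                           # keep adapting to the new regime
|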
-
Learning to Drive Software-Defined Storage
(Poster)
Thanks to the development of manufacturing technology, storage devices such as solid-state drives (SSDs) are becoming highly customizable to meet the ever-increasing demands on storage performance and capacity for different applications (i.e., software-defined storage). However, it is challenging to develop optimized storage devices with current human-driven systems-building approaches, due to the complicated storage stack. In this paper, we present learning-based approaches to facilitate the development of software-defined storage. To accelerate the manufacturing of efficient storage devices, we enable the automated learning of optimized hardware specifications for developing customized storage devices for specific application types. Our preliminary study shows that utilizing learning-based techniques to drive the development of software-defined storage is promising.
Jian Huang · Daixuan Li · Jinghan Sun
|
-
Predicting Network Buffer Capacity for BBR Fairness
(Poster)
BBR is a newer TCP congestion control algorithm with promising features, but it can often be unfair to existing loss-based congestion control algorithms. This is because BBR's sending rate is dictated by static parameters that do not adapt well to dynamic and diverse network conditions. In this work, we introduce BBR-ML, an in-kernel ML-based tuning system for BBR, designed to improve fairness when in competition with loss-based congestion control. To build BBR-ML, we discretized the network condition search space and trained a model on 2,500 different network conditions. We then modified BBR to run an in-kernel model that predicts network buffer sizes and uses this prediction to choose parameter settings. Our preliminary evaluation results show that BBR-ML can improve fairness when in competition with Cubic by up to 30% in some cases.
Ibrahim Umit Akgun · Santiago Vargas · Andrew Burford · Michael McNeill · Michael Arkhangelskiy · Aruna Balasubramanian · Anshul Gandhi · Erez Zadok
|
-
LoopStack: ML-friendly ML Compiler Stack
(Poster - Recorded Presentation)
We present LoopStack, a domain-specific compiler stack for tensor operations, composed of a front-end, LoopTool, and an efficient optimizing code generator, LoopNest. LoopStack is designed to produce highly efficient but also predictable code. Such a design allows both experts and, more importantly, ML-based approaches to find good schedules (algorithms). LoopStack is extensible and supports various processors and accelerators while incorporating HPC optimizations often missing from other machine learning compiler back-ends. To show the quality of the generated code, we designed a rudimentary AI to search for schedules and compare the speed of the generated code with the most optimized, hand-tuned libraries. Further, we show that for a large collection of schedules LoopNest's compilation is orders of magnitude faster than LLVM, while resulting in equal or improved run time performance.
Bram Wasti · Dejan Grubisic · Benoit Steiner · Aleksandar Zlateski
|
-
Automatic Discovery of Composite SPMD Partitioning Strategies in PartIR
(Poster - Recorded Presentation)
Large neural network models are commonly trained through a combination of advanced parallelism strategies in a single program, multiple data (SPMD) paradigm. For example, training large transformer models requires combining data, model, and pipeline partitioning with optimizer sharding techniques. However, identifying efficient combinations for many model architectures and accelerator systems requires significant manual analysis. In this work, we present an automatic partitioner that identifies these combinations through a goal-oriented search. Our key finding is that a Monte Carlo Tree Search-based partitioner that incorporates partition-specific compiler analysis directly into the search, guided by these goals, matches expert-level strategies for various models.
Sami Alabed · Dominik Grewe · Juliana Franco · Bart Chrzaszcz · Tom Natan · Tamara Norman · Norman Rink · Dimitrios Vytiniotis · Michael Schaarschmidt
|
-
Multi-objective Reinforcement Learning with Adaptive Pareto Reset for Prefix Adder Design
(Poster - Recorded Presentation)
Many hardware design problems require navigating a combinatorial search space to find solutions that balance multiple conflicting objectives, e.g., area and delay. While traditional approaches rely on hand-tuned heuristics to combat the large search space, reinforcement learning (RL) has recently achieved promising results, effectively reducing the need for human expertise. However, the existing RL method has prohibitively high sample complexity requirements. In this paper, we present a novel multi-objective reinforcement learning algorithm for combinatorial optimization and apply it to automating designs for prefix adder circuits, which are fundamental to high-performance digital components. We propose to track the evolving Pareto frontier to adaptively select reset states for an episodic RL agent. Our proposed reset algorithm balances exploiting the best-discovered states so far and exploring nearby states to escape local optima. Through empirical evaluations with a real-world physical synthesis workflow on two different design tasks, we demonstrate that our new algorithm trains agents to expand the Pareto frontier faster compared to other baselines. In particular, our algorithm achieves comparable quality results with only 20% of the samples compared to the scalarized baseline. Additional ablation studies confirm that both exploration and exploitation components work together to accelerate the Pareto frontier expansion.
Jialin Song · Rajarshi Roy · Jonathan Raiman · Robert Kirby · Neel Kant · Saad Godil · Bryan Catanzaro
|
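A small sketch of the Pareto-reset idea described above: the search keeps the set of non-dominated (area, delay) designs found so far and picks episode reset states from that frontier, mixing exploitation of the best points with exploration of perturbed neighbors. The objective function, state encoding, and perturbations are synthetic stand-ins; the paper couples this reset rule with an RL agent and a physical synthesis flow.

import random

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_frontier(frontier, candidate):
    if any(dominates(p["obj"], candidate["obj"]) for p in frontier):
        return frontier
    return [p for p in frontier if not dominates(candidate["obj"], p["obj"])] + [candidate]

def pick_reset(frontier, explore_prob=0.3):
    """Adaptive reset: usually restart from a frontier point, sometimes from a perturbed one."""
    base = random.choice(frontier)
    if random.random() < explore_prob:
        return {"state": [s + random.choice([-1, 0, 1]) for s in base["state"]],
                "obj": None}   # objectives unknown until re-evaluated
    return base

def evaluate(state):
    """Toy (area, delay) objectives over integer design knobs."""
    area = sum(abs(s) for s in state) + 1
    delay = 10.0 / area + 0.1 * max(state)
    return (area, delay)

frontier = [{"state": [2, 2, 2], "obj": evaluate([2, 2, 2])}]
for _ in range(500):
    start = pick_reset(frontier)
    state = [s + random.choice([-1, 0, 1]) for s in start["state"]]   # one "episode" step
    frontier = update_frontier(frontier, {"state": state, "obj": evaluate(state)})

print(sorted(p["obj"] for p in frontier))
|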
-
Preference-Aware Constrained Multi-Objective Bayesian Optimization For Analog Circuit Design
(Poster)
Many analog circuit design optimization problems involve performing expensive simulations to evaluate circuit configurations in terms of multiple objectives and constraints; oftentimes, practitioners have preferences over the objectives. We aim to approximate the optimal Pareto set over feasible circuit configurations while minimizing the number of simulations. We propose a novel and efficient preference-aware constrained multi-objective Bayesian optimization (PAC-MOO) approach that learns surrogate models for objectives and constraints and sequentially selects candidate circuits for simulation that maximize the information gained about the optimal constrained Pareto front while factoring in the objective preferences. Our experiments on real-world problems demonstrate PAC-MOO's efficacy over prior methods.
Alaleh Ahmadianshalchi · Syrine Belakaria · Jana Doppa
|
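A heavily simplified sketch in the spirit of the setup above: Gaussian-process surrogates model each objective and constraint from the simulations run so far, and the next circuit configuration is chosen among predicted-feasible candidates by a preference-weighted scalarization. PAC-MOO's actual acquisition is information-theoretic; the scalarized rule and the synthetic "simulator" below are stand-ins, not the paper's method.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def simulate(x):                          # stand-in circuit simulator
    gain = -np.sum((x - 0.6) ** 2)        # objective 1 (maximize)
    bandwidth = -np.sum((x - 0.3) ** 2)   # objective 2 (maximize)
    power = np.sum(x)                     # constraint: power <= 1.5
    return np.array([gain, bandwidth]), power

prefs = np.array([0.7, 0.3])              # practitioner cares more about gain
X = rng.uniform(0, 1, (5, 3))             # initial random circuit configurations
Y = np.array([simulate(x)[0] for x in X])
C = np.array([simulate(x)[1] for x in X])

for it in range(20):
    gps = [GaussianProcessRegressor().fit(X, Y[:, j]) for j in range(2)]
    gp_c = GaussianProcessRegressor().fit(X, C)
    cand = rng.uniform(0, 1, (512, 3))
    mu = np.column_stack([gp.predict(cand) for gp in gps])
    feasible = gp_c.predict(cand) <= 1.5
    score = np.where(feasible, mu @ prefs, -np.inf)   # preference-weighted, feasibility-gated
    x_next = cand[np.argmax(score)]
    y_next, c_next = simulate(x_next)
    X = np.vstack([X, x_next]); Y = np.vstack([Y, y_next]); C = np.append(C, c_next)

best = np.argmax(np.where(C <= 1.5, Y @ prefs, -np.inf))
print(X[best], Y[best], C[best])
|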
-
The Case for Learning Machine Language
(Poster - Recorded Presentation)
This paper focuses on enabling modern processors to better predict upcoming instructions that will be executed, in order to improve instruction-related speculations at runtime. Using branch prediction as a case study, we take the first step to motivate the potential of learning semantic correlations in machine language (i.e., CPU instructions), and we demonstrate how to apply language modeling to machine language. Although various approaches have been proposed for instruction-related runtime speculations, they remain general-purpose and rely on language-agnostic features. Furthermore, we present a branch predictor design that takes advantage of our Transformer-based language model. Empirical results from SPEC-CPU-2017 benchmarks (on RISC-V) show that language modeling can improve the branch prediction accuracy by up to 11.03%, and the processor IPC by up to 21.16%.
Guangda Liu · Chieh-Jan Mike Liang · Shijie Cao · Shuai Lu · Leendert van Doorn
|
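A minimal sketch of treating instruction history as a token sequence for branch prediction, as motivated above: recent instructions are embedded, passed through a small Transformer encoder, and the representation at the current branch predicts taken vs. not-taken. Vocabulary size, window length, and model size are arbitrary choices here, not the paper's configuration.

import torch
import torch.nn as nn

VOCAB, SEQ_LEN = 1024, 32            # hashed instruction tokens, history window

class BranchLM(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)    # taken / not-taken

    def forward(self, tokens):               # tokens: (batch, SEQ_LEN) instruction ids
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.head(h[:, -1])           # predict from the current (last) position

model = BranchLM()
history = torch.randint(0, VOCAB, (8, SEQ_LEN))   # 8 synthetic instruction histories
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(history), labels)
loss.backward()
print(loss.item())
|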
-
HloEnv: A Graph Rewrite Environment for Deep Learning Compiler Optimization Research
(Poster - Recorded Presentation)
We introduce HloEnv, an environment based on Accelerated Linear Algebra (XLA) for deep learning (DL) compiler optimization research. HloEnv transforms all graph rewrites into a common representation, providing a flexible interface to control and modify existing graph optimization passes. In this representation, an XLA pass is converted into a set of sequential rewrite decisions, which control when and if the rewrites are applied. Along with HloEnv, we present a dataset with broad coverage of computation graphs drawn from modern real-world machine learning models. We select two XLA passes with the largest impact on the runtime of the compiled program, and explore the potential for further improvement over XLA in this decision space. We show that using simple heuristics for decision-making can achieve on-par or better performance than XLA. Using search algorithms further boosts performance. We intend for HloEnv and our dataset to be an open-source, community-driven effort that helps spur advances in DL compiler optimization research.
Chin Yang Oh · Kunhao Zheng · Bingyi Kang · Xinyi Wan · Zhongwen Xu · Shuicheng Yan · Min Lin · Yangzihao Wang
|
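An illustrative sketch of the decision-loop shape described above, where an optimization pass becomes a sequence of accept/reject decisions over proposed graph rewrites. The class and function names below are hypothetical stand-ins, not HloEnv's actual API, and the "graph" is a toy dictionary rather than an HLO module.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rewrite:
    name: str
    benefit: float                      # e.g., estimated runtime delta if applied
    apply: Callable[[dict], dict]       # graph -> rewritten graph

def propose_rewrites(graph: dict) -> List[Rewrite]:
    """Stand-in for a pass that enumerates all matching rewrites on the graph."""
    out = []
    if graph["fusible_pairs"] > 0:
        out.append(Rewrite("fuse", 1.0,
                           lambda g: {**g, "fusible_pairs": g["fusible_pairs"] - 1,
                                      "ops": g["ops"] - 1}))
    if graph["ops"] > 1:
        out.append(Rewrite("cse", 0.3, lambda g: {**g, "ops": g["ops"] - 1}))
    return out

def run_pass(graph: dict, decide: Callable[[dict, Rewrite], bool]) -> dict:
    """The pass as sequential decisions: the agent chooses, per rewrite, apply or skip."""
    pending = propose_rewrites(graph)
    while pending:
        rewrite = pending.pop(0)
        if decide(graph, rewrite):
            graph = rewrite.apply(graph)
            pending = propose_rewrites(graph)   # applied rewrites may enable new ones
    return graph

graph = {"ops": 100, "fusible_pairs": 3}
greedy = lambda g, r: r.benefit > 0.5           # a trivial decision policy
print(run_pass(graph, greedy))
|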
-
Robust Scheduling with GFlowNets
(Poster - Recorded Presentation)
Finding the best way to schedule operations in a computation graph is a classical NP-hard problem which is central to compiler optimization. However, evaluating the goodness of a schedule on the target hardware can be very time-consuming. Traditional approaches as well as previous machine learning ones typically optimize proxy metrics, which are fast to evaluate but can lead to bad schedules when tested on the target hardware. In this work, we propose a new approach to scheduling by sampling proportionally to the proxy metric using a novel GFlowNet method. We introduce a technique to control the trade-off between diversity and goodness of the proposed schedules at inference time and demonstrate empirically that the pure optimization baselines can lead to subpar performance with respect to our approach when tested on a target model. Furthermore, we show that conditioning the GFlowNet on the computation graph enables generalization to unseen scheduling problems for both synthetic and real-world compiler datasets.
David Zhang · Corrado Rainone · Markus Peschl · Roberto Bondesan
|
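A minimal sketch of the sampling objective behind the approach above: a GFlowNet trained with the standard trajectory-balance loss learns to sample schedules with probability proportional to a reward (here, the proxy metric), rather than only maximizing it. Only the loss is shown; the schedule-construction policy, the diversity/goodness trade-off control, and the conditioning on the computation graph are omitted, and the tensors below are random placeholders.

import torch

def trajectory_balance_loss(log_z, log_pf, log_pb, log_reward):
    """(log Z + sum log P_F) should match (log R + sum log P_B) for each trajectory."""
    lhs = log_z + log_pf.sum(dim=-1)
    rhs = log_reward + log_pb.sum(dim=-1)
    return ((lhs - rhs) ** 2).mean()

# Shapes: a batch of 16 sampled scheduling trajectories, each with 10 decisions.
log_z = torch.zeros(1, requires_grad=True)            # learned log partition function
log_pf = torch.randn(16, 10, requires_grad=True)      # log-probs of forward actions taken
log_pb = torch.zeros(16, 10)                          # uniform/fixed backward policy here
log_reward = torch.randn(16)                          # log of the proxy-metric reward

loss = trajectory_balance_loss(log_z, log_pf, log_pb, log_reward)
loss.backward()
print(loss.item(), log_z.grad)
|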
-
A code superoptimizer through neural Monte-Carlo tree search
(Poster - Recorded Presentation)
There are many ways to turn a high-level program into a sequence of instructions consistent with that computation. Selecting the most performant such instruction sequence for a given piece of hardware (optimized compilation) is a central challenge of computer science. Optimizing compilers perform this task through a series of reductions and local transformations (e.g. register allocation, instruction scheduling, peephole optimization) driven by heuristics. A natural and well-explored avenue of research is to replace current hand-written heuristics with data-driven, automatically designed heuristics which may be obtained from machine learning. We propose a radically different approach, in which we view compilation as a combinatorial optimization problem that consists of finding the optimal (e.g. fastest executing or shortest) sequence of instructions subject to the constraint that it has the semantics of the specified program. We show how this problem can be practically framed as a finite Markov decision process, unlocking a rich space of potential algorithms from reinforcement learning. We implement one such algorithm in particular, an AlphaGo-like distributed neural Monte-Carlo tree search procedure, and demonstrate that it is able to directly generate optimized assembly. Unlike a traditional optimizing compiler, this approach does not rely on an existing library of optimizations to transform the code, but rather directly attempts to generate an optimal program instruction by instruction, taking into account effects including register allocation, instruction scheduling and operation fusion.
Wenda Zhou · Olga Solodova · Ryan Adams
|
-
Target-independent XLA optimization using Reinforcement Learning
(Poster - Recorded Presentation)
An important challenge in accelerated linear algebra compilers such as XLA is multi-pass optimization and analysis. Recent interest has chiefly been in XLA target-dependent optimization at the graph-level, subgraph-level, and kernel-level phases. We specifically focus on target-independent optimization pass ordering for XLA HLO, which is the problem of finding the optimal sequence of compiler optimization passes. However, there is little domain-specific study of pass ordering for XLA HLO. To this end, we propose introducing deep Reinforcement Learning (RL) based search for optimal XLA HLO pass ordering. We also propose enhancements to the deep RL algorithms to further improve optimal search performance and open the research direction for domain-specific guidance for RL. We create an XLA Gym experimentation framework as a tool to enable RL algorithms to interact with the compiler for pass optimizations and thereby train agents. Overall, in our experiments we observe an average of 13.3% improvement in operation reduction on a benchmark of GPT-2 training graphs and 10.4% improvement on a diverse benchmark including GPT-2, BERT, and ResNet graphs using the proposed approach over the compiler's default phase ordering.
Milan Ganai · Haichen Li · Theodore Enns · Yida Wang · Randy Huang
|
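A toy sketch of the pass-ordering formulation above: the controller picks which optimization pass to run next, and the objective is the total reduction in operation count. The "passes" and their effects are synthetic stand-ins (not XLA HLO passes or the XLA Gym interface), and random search stands in for the RL agent.

import random

# Synthetic passes: each maps an op count to a (usually) smaller op count.
PASSES = {"dce": lambda n: int(n * 0.97),
          "cse": lambda n: int(n * 0.95) if n > 500 else n,
          "fusion": lambda n: int(n * 0.90) if n > 800 else int(n * 0.99)}

def run(order, start_ops=1000):
    ops = start_ops
    for name in order:
        ops = PASSES[name](ops)
    return start_ops - ops                     # total operation reduction

fixed = ["dce", "cse", "fusion"] * 3           # a default phase ordering
candidates = [[random.choice(list(PASSES)) for _ in range(9)] for _ in range(200)]
best = max(candidates, key=run)                # search stands in for the RL agent
print("fixed ordering reduction:", run(fixed))
print("searched ordering reduction:", run(best), best)
|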
-
Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration
(Poster)
Microprocessor architects are increasingly resorting to domain-specific customization in the quest for high performance and energy efficiency. As systems grow in complexity, fine-tuning architectural parameters across multiple sub-systems (e.g., datapath, memory blocks in different hierarchies, interconnects, compiler optimization, etc.) quickly results in a combinatorial explosion of the design space. This makes domain-specific customization an extremely challenging task. Prior work explores using reinforcement learning (RL) and other optimization methods to automatically explore the large design space. However, these methods have traditionally relied on single-agent RL/ML formulations. It is unclear how scalable single-agent formulations are as we increase the complexity of the design space (e.g., full-stack System-on-Chip design). Therefore, we propose an alternative formulation that leverages Multi-Agent RL (MARL) to tackle this problem. The key idea behind using MARL is the observation that parameters across different sub-systems are more or less independent, thus allowing a decentralized role to be assigned to each agent. We test this hypothesis by designing a domain-specific DRAM memory controller for several workload traces. Our evaluation shows that the MARL formulation consistently outperforms single-agent RL baselines such as Proximal Policy Optimization and Soft Actor-Critic over different target objectives such as low power and latency. To this end, this work opens the pathway for new and promising research in MARL solutions for hardware architecture search.
Srivatsan Krishnan · Natasha Jaques · Shayegan Omidshafiei · Dan Zhang · Izzeddin Gur · Vijay Janapa Reddi · Aleksandra Faust
|
-
An Efficient One-Class SVM for Novelty Detection in IoT
(Poster - Recorded Presentation)
One-Class Support Vector Machines (OCSVM) are a state-of-the-art approach for novelty detection, due to their flexibility in fitting complex nonlinear boundaries between normal and novel data. However, conventional OCSVMs can introduce prohibitive memory and computational overhead at detection time. This work designs, implements and evaluates an efficient OCSVM for such practical settings. We extend Nyström and (Gaussian) Sketching approaches to OCSVM, combining these methods with clustering and Gaussian mixture models to achieve 15-30x speedup in prediction time and 30-40x reduction in memory requirements, without sacrificing detection accuracy.
Kun Yang · Samory Kpotufe · Nicholas Feamster
|
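A concrete sketch of the speed/memory trade-off discussed above, using scikit-learn's Nyström kernel approximation in front of a linear one-class SVM so that prediction touches only a small set of landmark features instead of all support vectors. The clustering and Gaussian-sketching refinements from the paper are not shown, and the data here is synthetic.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (5000, 20))            # normal IoT traffic features
novel = rng.normal(4, 1, (100, 20))              # novel/anomalous points

# Exact kernel OCSVM: accurate but stores many support vectors.
exact = OneClassSVM(nu=0.05, gamma=0.1).fit(normal)

# Nystroem (200 landmarks) + linear OCSVM: constant-size model, fast prediction.
approx = make_pipeline(Nystroem(gamma=0.1, n_components=200, random_state=0),
                       SGDOneClassSVM(nu=0.05, random_state=0)).fit(normal)

for name, model in [("exact", exact), ("nystroem", approx)]:
    flagged = (model.predict(novel) == -1).mean()
    print(f"{name}: fraction of novel points flagged = {flagged:.2f}")
|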
-
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs
(Poster)
Cloud datacenters are growing exponentially in both number and size. This increase results in a surge of network activity that warrants better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter, while at the same time (ii) running on low-level hardware with the low latency required for effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the microsecond decision latency required for real-time inference. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and drops.
Benjamin Fuhrer · Yuval Shpigelman · Chen Tessler · Shie Mannor · Gal Chechik · Eitan Zahavi · Gal Dalal
|
-
NeuralFuse: Improving the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes
(Poster - Recorded Presentation)
Deep neural networks (DNNs) are state-of-the-art models adopted in many machine learning based systems and algorithms. However, a notable issue of DNNs is their considerable energy consumption for training and inference. At the hardware level, one current energy-saving solution at the inference phase is to reduce the voltage supplied to the DNN hardware accelerator. However, operating in the low-voltage regime induces random bit errors in the memory holding the model and thereby degrades model performance. To address this challenge, we propose NeuralFuse, a novel input transformation technique used as an add-on module to protect the model from severe accuracy drops in low-voltage regimes. With NeuralFuse, we can mitigate the tradeoff between energy and accuracy without retraining the model, and it can be readily applied to DNNs with limited access, such as DNNs on non-configurable hardware or remote access to cloud-based APIs. Compared with unprotected DNNs, our experimental results show that NeuralFuse can reduce memory access energy by up to 24% and simultaneously improve accuracy in low-voltage regimes by up to 57%. To the best of our knowledge, this is the first model-agnostic approach (i.e., no model retraining) to mitigate the accuracy-energy tradeoff in low-voltage regimes.
Hao-Lun Sun · Lei Hsiung · Nandhini Chandramoorthy · Pin-Yu Chen · Tsung-Yi Ho
|
-
Lattice Quantization
(Poster - Recorded Presentation)
Post-training quantization of neural networks consists in quantizing a model without retraining, which is user-friendly, fast and data-frugal. In this paper, we propose LatticeQ, a novel post-training weight quantization method designed for deep convolutional neural networks. Contrary to the scalar rounding widely used in state-of-the-art quantization methods, LatticeQ uses a quantizer based on lattices (discrete algebraic structures). LatticeQ exploits the inner correlations between the model parameters to the benefit of minimizing quantization error. This allows LatticeQ to achieve state-of-the-art results in post-training quantization. In particular, we demonstrate ImageNet classification results close to full precision on the popular ResNet-18/50, with little to no accuracy drop for 4-bit models.
Clément Metz · Thibault Allenet · Johannes Thiele · Antoine DUPRET · Olivier BICHLER
|
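A small illustration of lattice (rather than scalar) rounding as described above, using the D4 lattice: weights are scaled, grouped into 4-vectors, and each group is snapped to the nearest lattice point, which can exploit correlations that per-coordinate rounding ignores. LatticeQ's actual lattice construction, scaling, and bit-width handling are not reproduced; only the nearest-point step is shown.

import numpy as np

def nearest_point_d4(y):
    """Closest point of D4 = {x in Z^4 : sum(x) even} to y (Conway-Sloane decoder)."""
    f = np.rint(y)
    if int(f.sum()) % 2 == 0:
        return f
    # Parity is odd: re-round the coordinate with the largest rounding error the other way.
    err = y - f
    k = int(np.argmax(np.abs(err)))
    f[k] += 1.0 if err[k] >= 0 else -1.0
    return f

def lattice_quantize(weights, scale):
    """Quantize a weight tensor by snapping scaled groups of 4 values to D4 points."""
    flat = weights.reshape(-1, 4) / scale
    q = np.stack([nearest_point_d4(v) for v in flat])
    return (q * scale).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, (64, 16))
scale = 0.02
wq = lattice_quantize(w, scale)
on_lattice = np.rint(wq / scale).reshape(-1, 4).sum(axis=1) % 2 == 0
print("all groups on the D4 lattice:", bool(np.all(on_lattice)))
print("mean abs quantization error:", float(np.mean(np.abs(w - wq))))
|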
-
An Adversarial Active Sampling-based Data Augmentation Framework for Manufacturable Chip Design
(Poster - Recorded Presentation)
Lithography modeling is a crucial problem in chip design to ensure a chip design mask is manufacturable. It requires rigorous simulations of optical and chemical models that are computationally expensive. Recent developments in machine learning have provided alternative solutions that replace the time-consuming lithography simulations with deep neural networks. However, the considerable accuracy drop still impedes their industrial adoption. Most importantly, the quality and quantity of the training dataset directly affect the model performance. To tackle this problem, we propose a litho-aware data augmentation (LADA) framework to resolve the dilemma of limited data and improve the machine learning model performance. First, we pretrain the neural networks for lithography modeling and a gradient-friendly StyleGAN2 generator. We then perform adversarial active sampling to generate informative and synthetic in-distribution mask designs. These synthetic mask images augment the original limited training dataset used to finetune the lithography model for improved performance. Experimental results demonstrate that LADA can successfully exploit the neural network capacity by narrowing the performance gap between the training and testing data instances.
Mingjie Liu · Haoyu Yang · David Pan · Brucek Khailany · Mark Ren
|
-
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression
(Poster)
Self-attention and feedforward layers in large-scale Transformer models are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by expressing weight matrices in an efficiently factorized form. Prior efforts used manual or heuristic decomposition settings without hardware-aware customization, resulting in poor hardware efficiency and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of tensor decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. We find that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
Jiaqi Gu · Ben Keller · Jean Kossaifi · Anima Anandkumar · Brucek Khailany · David Pan
|