Accepted Papers

πŸ”½ VMR2L: Virtual Machines Rescheduling Using Reinforcement Learning in Data Centers. Xianzhong Ding, Yunkai Zhang, Binbin Chen, Donghao Ying, Tieying Zhang, Jianjun Chen, Lei Zhang, Alberto Cerpa, Wan Du.

Modern industry-scale data centers receive thousands of virtual machine (VM) requests per minute. Due to the continual creation and release of VMs, many small resource fragments are scattered across physical machines (PMs). To handle these fragments, data centers periodically reschedule some VMs to alternative PMs. Despite the increasing importance of VM rescheduling as data centers grow in size, the problem remains understudied. We first show that, unlike most combinatorial optimization tasks, the inference time of VM rescheduling algorithms significantly influences their performance, causing many existing methods to scale poorly. Therefore, we develop a reinforcement learning system for VM rescheduling, VMR2L, which incorporates a set of customized techniques, such as a two-stage framework that accommodates diverse constraints and workload conditions as well as an effective feature extraction module. Our experiments on an industry-scale data center show that VMR2L can achieve a performance comparable to the optimal solution, but with a running time of seconds. (paper)

πŸ”½ ZeRO++: Extremely Efficient Collective Communication for Large Model Training. Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He.

While the Zero Redundancy Optimizer (ZeRO) excels in training large-scale models, it struggles to achieve good throughput in environments with limited bandwidth or small batches, where communication becomes a major bottleneck. Inspired by the principles of fine-grained quantization in machine learning algorithms, we designed ZeRO++, an optimizer robust to quantization effects that allows for significant communication volume reduction using low-precision quantization techniques. ZeRO++ combines three communication volume reduction techniques (low-precision all-gather, data remapping, and low-precision gradient averaging) to reduce the communication volume by up to 4x, which enables up to 2.16x better throughput at 384-GPU scale. Our results also show that ZeRO++ can speed up RLHF training by 3.3x compared to vanilla ZeRO. To verify the convergence of ZeRO++, we test models of up to 13B parameters for pretraining with 8/6-bit all-gather and up to 30B parameters for finetuning with 4-bit or 2-bit all-gather, and demonstrate accuracy on par with the original ZeRO (i.e., standard training). As a byproduct, the model trained with ZeRO++ is naturally weight-quantized, which can be directly used for inference without post-training quantization or quantization-aware training. (paper)
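
For readers unfamiliar with quantized collectives, the sketch below shows the basic idea of blockwise low-precision communication: quantize a gradient shard to int8 with per-block scales before it is gathered, then dequantize on receipt. It is a minimal numpy illustration of the principle, not ZeRO++'s actual kernels or wire format.

```python
import numpy as np

def quantize_int8_blockwise(x, block=256):
    # Pad to a multiple of the block size and compute one scale per block.
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad)).reshape(-1, block)
    scales = np.abs(xp).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(xp / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32), len(x)

def dequantize_int8_blockwise(q, scales, n):
    return (q.astype(np.float32) * scales).reshape(-1)[:n]

grad_shard = np.random.randn(10_000).astype(np.float32)
q, s, n = quantize_int8_blockwise(grad_shard)      # communicate q + s (~4x smaller than fp32)
approx = dequantize_int8_blockwise(q, s, n)        # reconstructed after the collective
print("max abs error:", np.abs(grad_shard - approx).max())
```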

πŸ”½ On the Promise and Challenges of Foundation Models for Learning-based Cloud Systems Management. Haoran Qiu, Weichao Mao, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Tamer Basar, Ravi Iyer.

Foundation models (FMs) are machine learning models that are trained broadly on large-scale data and can be adapted to a set of downstream tasks via fine-tuning, few-shot learning, or even zero-shot learning. Despite the successes of FMs in the language and vision domains, we have yet to see an attempt to develop FMs for cloud systems management (also known as cloud intelligence or AIOps). In this work, we explore the opportunities of developing FMs for cloud systems management. We propose an initial FM design (i.e., the FLASH framework) based on meta-learning and demonstrate its usage in the tasks of resource configuration search and workload autoscaling. Preliminary results show that FLASH achieves 52.3-90.5% less performance degradation with no adaptation and provides 5.5x faster adaptation. We conclude this paper by discussing the unique risks and challenges of developing FMs for cloud systems management. (paper)

πŸ”½ Predicting User Experience on Laptops from Hardware Specifications. Saswat Padhi, Sunil Kumar Bhasin, Udaya Kiran Ammu, Alex Bergman, Allan Knies.

Estimating the overall user experience (UX) on a device is a common challenge faced by manufacturers. Today, device makers primarily rely on microbenchmark scores, such as Geekbench, that stress-test specific hardware components, such as CPU or RAM, but do not satisfactorily capture real-life consumer workloads. System designers often rely on domain-specific heuristics and extensive testing of prototypes to reach a desired UX goal, and yet there is often a mismatch between the manufacturers’ performance claims and the consumers’ experience. We present our initial results on predicting real-life experience on laptops from their hardware specifications. We target web applications that run on Chromebooks (ChromeOS laptops) for a simple and fair aggregation of experience across applications and workloads. On 54 laptops, we track 9 UX metrics on common end-user workloads: web browsing, video playback, and audio/video calls. We focus on a subset of high-level metrics exposed by the Chrome browser that are part of the Web Vitals initiative for measuring user experience on web applications. With a dataset of 100K UX data points, we train gradient boosted regression trees that predict the metric values from device specifications. Across our 9 metrics, we observe a mean score (goodness-of-fit on our dataset) of 97.8% and a mean MAAPE (percentage error in prediction on unseen data) of 10.1%. (paper)
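
As a rough illustration of the modeling setup described above (gradient boosted regression trees mapping hardware specifications to a UX metric), the sketch below uses scikit-learn; the CSV file, column names, and target metric are hypothetical stand-ins, not the paper's dataset.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("laptop_ux.csv")                     # hypothetical dataset
features = ["cpu_cores", "cpu_base_ghz", "ram_gb", "storage_type", "gpu_score"]
X = pd.get_dummies(df[features])                      # one-hot encode categorical specs
y = df["largest_contentful_paint_ms"]                 # one Web Vitals-style metric

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(n_estimators=300, max_depth=4)
model.fit(X_tr, y_tr)
print("R^2 on held-out laptops:", model.score(X_te, y_te))
```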

πŸ”½ Ad-Rec: Advanced Feature Interactions to Address Covariate-Shifts in Recommendation Networks. Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair.

Recommendation models enhance user experiences by utilizing input feature correlations. However, deep learning-based models encounter challenges from changing user behavior and item features, leading to data distribution shifts. Effective cross-feature learning is crucial in addressing this. We introduce Ad-Rec, an advanced network that leverages feature interaction techniques to tackle these issues. It utilizes masked transformers to learn higher-order cross-features while mitigating data distribution drift. Our approach improves model quality, accelerates convergence, and reduces training time. We demonstrate the scalability of Ad-Rec and its superior model quality through extensive ablation studies. (paper)

πŸ”½ Renamer: A Transformer Architecture Invariant to Variable Renaming. Zachary Ankner, Alex Renda, Michael Carbin.

Many modeling tasks involve learning functions which are invariant to certain types of input transformations. We study a specific class of invariance: semantics-preserving variable renaming for models of code. We show that vanilla Transformers trained on renaming-invariant tasks do not exhibit renaming invariance. We propose Renamer, a Transformer architecture which is itself invariant to semantics-preserving variable renaming. On a CPU simulation task, Renamer reduces error by between 24.79% and 52.8% compared to a vanilla Transformer. (paper)
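
A simple way to see what renaming invariance demands is to alpha-rename variables to canonical names before comparison, as in the sketch below; Renamer instead builds the invariance into the Transformer architecture itself, so this preprocessing is only an illustration of the property.

```python
import re

def canonicalize(tokens, is_variable):
    # Rename variables to v0, v1, ... in order of first use, so two programs
    # that differ only by variable renaming map to the same token sequence.
    mapping, out = {}, []
    for tok in tokens:
        if is_variable(tok):
            if tok not in mapping:
                mapping[tok] = f"v{len(mapping)}"
            out.append(mapping[tok])
        else:
            out.append(tok)
    return out

prog_a = ["x", "=", "x", "+", "y"]
prog_b = ["foo", "=", "foo", "+", "bar"]
is_var = lambda t: re.fullmatch(r"[A-Za-z_]\w*", t) is not None
assert canonicalize(prog_a, is_var) == canonicalize(prog_b, is_var)
```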

πŸ”½ Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs. Bowen Tan, Yun Zhu, Lijuan Liu, Hongyi Wang, Yonghao Zhuang, Jindong Chen, Eric P. Xing, Zhiting Hu.

The recent progress of AI can be largely attributed to large language models (LLMs). However, their escalating memory requirements introduce challenges for machine learning (ML) researchers and engineers. Addressing this requires developers to partition a large model and distribute it across multiple GPUs or TPUs. This necessitates considerable coding and intricate configuration efforts with existing model parallel tools, such as Megatron-LM, DeepSpeed, and Alpa. These tools require users' expertise in machine learning systems (MLSys), creating a bottleneck in LLM development, particularly for developers without an MLSys background. In this work, we present Red Coast (Redco), a lightweight and user-friendly tool crafted to automate distributed training and inference for LLMs, as well as to simplify ML pipeline development. The design of Redco emphasizes two key aspects. First, to automate model parallelism, our study identifies two straightforward rules to generate tensor parallel strategies for any given LLM. Integrating these rules into Redco facilitates effortless distributed LLM training and inference, eliminating the need for additional coding or complex configurations. We demonstrate its effectiveness by applying Redco to a set of LLM architectures, such as GPT-J, LLaMA, T5, and OPT, up to 66B parameters, in multi-host settings. Second, we propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions, eliminating redundant and formulaic code such as multi-host-related processing. This mechanism proves adaptable across a spectrum of ML algorithms, from foundational language modeling to complex algorithms like meta-learning and reinforcement learning. Consequently, Redco implementations exhibit far fewer lines of code than their official counterparts. Redco is released under the Apache License 2.0 at https://github.com/tanyuqian/redco. (paper)

πŸ”½ On a Foundation Model for Operating Systems. DIVYANSHU SAXENA, Nihal Sharma, Donghyun Kim, Rohit Dwivedula, Jiayi Chen, Chenxi Yang, Sriram Ravula, Zichao Hu, Aditya Akella, Joydeep Biswas, Swarat Chaudhuri, Isil Dillig, Alex Dimakis, Daehyeok Kim, Christopher Rossbach.

This paper lays down the research agenda for a domain-specific foundation model for operating systems (OSes). Our case for a foundation model revolves around the observations that several OS components (such as the CPU, memory, and network subsystems) are interrelated and that OS traces offer the ideal dataset for a foundation model to grasp the intricacies of diverse OS components and their behavior in varying environments and workloads. We discuss a wide range of possibilities that then arise, from employing foundation models as policy agents to utilizing them as generators and predictors to assist traditional OS control algorithms. Our hope is that this paper spurs further research into OS foundation models and the creation of the next generation of operating systems for the evolving computing landscape. (paper)

πŸ”½ Choice-Based Learning in JAX. Dan Zheng, Shangyin Tan, Gordon Plotkin, Ningning Xie.

Choice-based learning is a programming paradigm for expressing learning systems in terms of choices and losses. We explore a practical implementation of choice-based learning in JAX by combining two techniques in a novel way: algebraic effects and the selection monad. We describe the design and implementation of our library, explore its usefulness for real-world applications like hyperparameter tuning and deep reinforcement learning, and compare it with existing approaches. (paper)
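
The core idea can be sketched in a few lines: a choice point is resolved by a selection function that picks whichever candidate minimizes a downstream loss. The names below are illustrative, not the library's API; the paper realizes this pattern with algebraic effects and the selection monad on top of JAX.

```python
def choose(candidates, loss):
    # Resolve a choice by selecting the candidate with the lowest loss.
    return min(candidates, key=loss)

def validation_loss(lr):
    # Stand-in for "train a model with this hyperparameter and evaluate it".
    return (lr - 1e-2) ** 2

best_lr = choose([1e-1, 1e-2, 1e-3], validation_loss)
print(best_lr)   # 0.01
```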

πŸ”½ Silhouette: Toward Performance-Conscious and Transferable CPU Embeddings. Tarikul Islam Papon, Abdul Wasay.

Learned embeddings are widely used to obtain concise data representations and enable transfer learning between different data sets and tasks. In this paper, we present our approach, Silhouette, which leverages publicly available CPU performance data sets to learn CPU performance embeddings. We show how Silhouette enables transfer learning across different types of CPUs and leads to a significant improvement in performance prediction accuracy for the target CPUs. (paper)

πŸ”½ Performance Roulette: How Cloud Weather Affects ML-Based System Optimization. Johannes Freischuetz, Konstantinos Kanellis, Brian Kroth, Shivaram Venkataraman.

As system complexity, workload diversity, and cloud computing adoption continue to grow, both operators and developers are turning to machine learning (ML) based approaches for optimizing systems. ML-based approaches typically perform measurements to evaluate candidate system configurations and discover the most optimal configuration. However, it is widely recognized that cloud systems can be affected by "cloud weather", i.e., shifts in performance due to hardware heterogeneity, interference from co-located workloads, virtualization overheads, etc. Given these two trends, in this work we ask: how much can performance variability during training affect ML approaches applied to systems? Using DBMS knob configuration tuning as a case study, we present two measurement studies that show how ML-based optimizers can be affected by noise. This leads to four main observable problems: (1) there exist very sensitive configurations whose performance does not transfer across machines of the same type, (2) unstable configurations during training significantly impact configuration transferability, (3) tuning in an environment with non-representative noise degrades final performance in the deployment environment, and (4) sampling noise causes a convergence slowdown. Finally, we propose a set of methods to mitigate the challenges in measurements for training ML-based system components. (paper)
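
A minimal way to expose the noise the authors study is to repeat each measurement and look at its spread, as in the sketch below; `run_benchmark` is an assumed callable returning one latency sample, not part of the paper's tooling.

```python
import statistics

def measure_config(run_benchmark, config, repeats=10):
    # Repeat the benchmark and summarize both the mean and the relative spread.
    samples = [run_benchmark(config) for _ in range(repeats)]
    mean = statistics.mean(samples)
    cv = statistics.stdev(samples) / mean if mean else float("inf")
    return mean, cv

# Configurations with a large coefficient of variation (cv) are the "sensitive"
# ones whose measured ranking may not transfer to another machine.
```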

πŸ”½ [Retracted].

πŸ”½ Reinforcement Learning for FPGA Placement. Shang Wang, Deepak Ranganatha Sastry Mamillapalli, Qianxi Li, Tianpei Yang, Matthew E. Taylor.

This paper introduces the problem of learning to place blocks in Field-Programmable Gate Arrays (FPGAs) and a preliminary learning-based method. In contrast to previous FPGA placement algorithms, we depart from simulated annealing techniques and instead employ deep reinforcement learning (deep RL) for the placement task with the objective of minimizing wirelength. To facilitate the agent's decision making, we design unique state representations including the chipboard observations and interconnections between different blocks. Additionally, we ground representation learning in the supervised task of predicting placement quality to enhance the RL policy's generalization capabilities. To the best of our knowledge, we are the first to introduce a deep RL agent for FPGA placement, with preliminary results suggesting the feasibility of our approach. We hope that this paper will attract more attention from electronic design automation engineers to using RL for FPGAs. (paper)
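
For concreteness, the half-perimeter wirelength (HPWL) below is a standard proxy for wirelength whose negation can serve as an RL reward; it is an illustrative sketch rather than the paper's exact reward definition.

```python
def hpwl(placement, nets):
    # Half-perimeter wirelength: for each net, width + height of the bounding
    # box of its blocks' (x, y) positions, summed over all nets.
    total = 0
    for net in nets:                               # each net is a list of block ids
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

placement = {"b0": (0, 0), "b1": (3, 1), "b2": (1, 4)}
nets = [["b0", "b1"], ["b0", "b1", "b2"]]
reward = -hpwl(placement, nets)                    # shorter wirelength => higher reward
```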

πŸ”½ MASE: An Efficient Representation for Software-Defined ML Hardware System Exploration. Cheng Zhang, Jianyi Cheng, Zhewen Yu, Yiren Zhao.

Machine learning (ML) accelerators have been studied and used extensively to compute ML models with high performance and low power. However, designing such accelerators normally takes a long time and requires significant effort. Unfortunately, the pace of development of ML software models is much faster than the accelerator design cycle, leading to frequent and drastic modifications in model architectures, thus rendering many accelerators obsolete. Existing design tools and frameworks can provide quick accelerator prototyping, but only for a limited range of models that can fit into a single hardware device, such as an FPGA. Furthermore, with the emergence of large language models, such as GPT-3, there is an increased need for hardware prototyping of these large models within a many-accelerator system to ensure the hardware can scale with the ever-growing model sizes. The design space is often huge, involving both software and hardware optimization. To address this, we propose a novel representation named MASE IR (Machine-learning Accelerator System Exploration Intermediate Representation) that describes data types, software algorithms, and hardware design constraints. MASE IR opens up opportunities for exploring software and hardware co-optimization at scale. As an application of MASE IR, we implemented a PyTorch-based framework named MASE that automatically optimizes and maps an ML model onto an efficient hardware accelerator system. We believe MASE IR will open new research opportunities for ML system design. (paper)

πŸ”½ Secrecy and Sensitivity: Privacy-Performance Trade-Offs in Encrypted Traffic Classification. Spencer Giddens, Raphael Labaca-Castro, Dan Zhao, Sandra Guasch, Parth Mishra, Nicolas Gama.

As datasets and models grow in size and complexity to increase performance, the risks associated with sensitive data also grow. Differential privacy (DP) offers a framework for designing mechanisms that provide a degree of privacy that can help conceal sensitive features or information. However, different domains and applications can naturally exhibit different rates of trade-off between privacy and performance depending on their characteristics. In contrast to well-studied areas (e.g., healthcare), one relatively unexplored domain is network traffic analysis, where the data contains sensitive information on users' communications. In this paper, we apply DP to various machine learning models trained to classify between encrypted and non-encrypted packets from network traffic; we emphasize that our goal is to examine a relatively unexplored area to analyze the trade-offs between privacy and performance when the data contains both encrypted and unencrypted observations. We show how varying the model architecture and feature set can be a relatively simple way to achieve better performance-privacy trade-offs; we also compare and contextualize reasonable privacy budgets from our analysis in the network traffic domain against those in other, more well-studied domains. (paper)

πŸ”½ Can Semi-Supervised Learning Improve Prediction of Deep Learning Model Resource Consumption?. Karthick Panner Selvam, Mats Brorsson.

With the increasing computational demands of Deep Learning (DL), predicting training characteristics like training time and memory usage is crucial for efficient hardware allocation. Traditional methods rely solely on supervised learning for such predictions. Our work integrates a semi-supervised approach for improved accuracy. We present TraPPM, which utilizes a graph autoencoder to learn representations of unlabeled DL graphs, combined with supervised graph neural network training to predict the metrics. Our model significantly surpasses standard methods in prediction accuracy, with MAPE values of 9.51% for training step time and 4.92% for memory usage. The code and dataset are available at https://github.com/karthickai/trappm. (paper)

πŸ”½ PLPilot: Benchmark an Automated Programming Language Design Framework Enabled by Large Language Models. Kaiyan Chang, Kun Wang, Mengdi Wang, shengwen Liang, Yinhe Han, Huawei Li, Xiaowei Li, ying wang.

The design of new programming languages traditionally requires expertise across syntax and semantics. Recently, large language models (LLMs) have provided unprecedented power in the code generation field, which has the potential to revolutionize the current programming language design stack, including automating the writing of passes and formally defining a programming language's semantics and syntax. However, there is as yet no framework that leverages LLMs to support programming language design. We propose a programming language design framework enabled by large language models, which decouples every part of the programming language design process into a form acceptable to LLMs. We then propose a set of benchmarks for LLM-based programming language design tasks. We evaluate this framework on eight decoupled programming language design stages, showing great productivity improvements over manually designed languages. (paper)

πŸ”½ Improving Large Language Model Hardware Generating Quality through Post-LLM Search. Kaiyan Chang, Haimeng Ren, Mengdi Wang, shengwen Liang, Yinhe Han, Huawei Li, Xiaowei Li, ying wang.

As large language models (LLMs) like ChatGPT have exhibited unprecedented machine intelligence, they have also shown great performance in assisting hardware engineers to realize higher-efficiency logic designs via natural language interaction. However, due to the limitations of LLMs, existing LLM-based hardware generation frameworks produce Verilog register-transfer level (RTL) code without considering its performance, power, and area (PPA). To overcome this challenge, we design a post-LLM search approach that merges a design space exploration (DSE) process into the current LLM hardware generation workflow, which enables PPA optimization. First, our framework generates prompts for the LLM, which then produces initial Verilog programs. Second, an output manager corrects and optimizes these programs before collecting them into the final design space, which is constructed as an HDL search tree. Finally, in the post-search stage, our framework searches through this space to select the optimal design under the target metrics. The evaluation shows that our approach improves the quality of generated Verilog and explores a broader design optimization space compared to prior work and native LLMs alone. (paper)

πŸ”½ SmartChoices: Augmenting Software with Learned Implementations. Daniel Golovin, Gabor Bartok, Eric Chen, Emily A Donahue, Tzu-Kuo Huang, Efi Kokiopoulou, Ruoyan Qin, Nikhil Sarda, Justin Sybrandt, Vincent Tjeng.

We are living in a golden age of machine learning. Powerful models perform many tasks far better than is possible using traditional software engineering approaches alone. However, developing and deploying these models in existing software systems remains challenging. In this paper, we present SmartChoices, a novel approach to incorporating machine learning into mature software stacks easily, safely, and effectively. We highlight key design decisions and present case studies applying SmartChoices within a range of large-scale industrial systems. (paper)

πŸ”½ Learning Distributed Protocols with Zero Knowledge. Yujie Hui, Drew Ripberger, Xiaoyi Lu, Yang Wang.

The success of AlphaGo Zero shows that a computer can learn to play a complicated board game without relying on knowledge from human players. We observe that designing a distributed protocol is similar to playing board games to some extent: when determining the next action to take, both want to ensure they can win even when a smart opponent tries to drive the game/protocol to the worst case. In this work, we explore whether we can apply similar techniques to learn a distributed protocol with zero knowledge. Towards this goal, we model each process in a distributed protocol as a state machine and further rely on model checking to validate the correctness of the learned state machine. With this approach, we successfully learned a correct atomic commit protocol with three processes, and building on that, we discuss future work. (paper)
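
The sketch below conveys how a candidate protocol, modeled as per-process state machines, can be checked by exhaustively exploring its reachable states and asserting a safety property (here: no state mixes committed and aborted processes). The transition rules are a hand-written toy, not the learned protocol or the paper's model checker.

```python
def reachable(initial, step):
    # Exhaustively explore all states reachable from `initial` via `step`.
    seen, frontier = {initial}, [initial]
    while frontier:
        state = frontier.pop()
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def step(state):
    # Toy atomic-commit transitions: a process may prepare or abort from
    # "init", and may commit only once every process has prepared.
    for i, phase in enumerate(state):
        before, after = state[:i], state[i + 1:]
        if phase == "init":
            yield before + ("prepared",) + after
            yield before + ("aborted",) + after
        elif phase == "prepared" and all(p in ("prepared", "committed") for p in state):
            yield before + ("committed",) + after

states = reachable(("init",) * 3, step)
assert not any("committed" in s and "aborted" in s for s in states)  # safety holds
```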

πŸ”½ Mitigating Tail Catastrophe in Steered Database Query Optimization with Risk-Averse Contextual Bandits. MΓ³nika Farsang, Paul Mineiro, Wangda Zhang.

Contextual bandits with average-case statistical guarantees are inadequate in risk-averse situations because they might trade off degraded worst-case behaviour for better average performance. Designing a risk-averse contextual bandit is challenging because exploration is necessary, but risk-aversion is sensitive to the entire distribution of rewards; nonetheless, we exhibit the first risk-averse contextual bandit algorithm with an online regret guarantee. We apply the technique to a self-tuning software scenario in a production exascale data processing system, where worst-case outcomes should be avoided. (paper)
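
One common way to make an objective risk-averse is to score rewards by their conditional value-at-risk (CVaR), i.e., the average of the worst alpha fraction of outcomes, rather than their mean. The snippet below computes the empirical version and is only a pointer to that style of objective, not the paper's online algorithm.

```python
import numpy as np

def empirical_cvar(rewards, alpha=0.1):
    # Mean of the worst `alpha` fraction of rewards (higher is better).
    r = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * len(r))))
    return r[:k].mean()

rewards = np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=1000)
print(empirical_cvar(rewards, alpha=0.05))   # penalizes heavy lower tails
```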

πŸ”½ Learning Bit Allocations for Z-Order Layouts in Analytic Data Systems. Jenny Gao, Jialin Ding, SIVAPRASAD SUDHIR, Samuel Madden.

To improve the performance of scanning and filtering, modern analytic data systems such as Amazon Redshift and Databricks Delta Lake give users the ability to sort a table using a Z-order, which maps each row to a "Z-value" by interleaving the binary representations of the row's attributes, then sorts rows by their Z-values. These Z-order layouts essentially sort the table by multiple columns simultaneously and can achieve superior performance to single-column sort orders when the user's queries filter over multiple columns. However, the Z-orders currently used by modern systems treat all columns as equally important, which often does not result in the best performance due to the unequal impact that different columns have on query performance. In this work, we investigate the performance impact of using Z-orders that place unequal importance on columns: instead of using an equal number of bits from each column in the Z-value interleaving, we allow unequal bit allocation. We introduce a technique that automatically learns the best bit allocation for a Z-order layout on a given dataset and query workload. Z-order layouts using our learned bit allocations outperform traditional Z-order layouts by up to 1.6X in query runtime and up to 2X in rows scanned. (paper)
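
To make the layout concrete, the sketch below computes a Z-value by interleaving the most significant bits of each column, where the per-column bit budget is the "bit allocation": equal budgets recover the classic Z-order, while unequal budgets weight some columns more heavily. It illustrates the encoding only; the paper's contribution is learning the allocation from the dataset and query workload.

```python
def z_value(values, bit_counts, width=32):
    # Take the top `bit_counts[i]` bits (MSB first) of each fixed-width column value.
    bits = [
        [(v >> (width - 1 - i)) & 1 for i in range(n)]
        for v, n in zip(values, bit_counts)
    ]
    z = 0
    # Interleave round-robin, skipping columns that have run out of bits.
    for level in range(max(bit_counts)):
        for col_bits in bits:
            if level < len(col_bits):
                z = (z << 1) | col_bits[level]
    return z

# Give column 0 twice as many bits as column 1 in the interleaving.
rows = [(17, 3), (120, 250), (64, 64)]
rows.sort(key=lambda r: z_value(r, bit_counts=[8, 4], width=8))
```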

πŸ”½ ACLTuner: A Profiling-Driven Fast Tuning to Optimized Deep Learning Inference. Yongin Kwon, Joo Hyoung Cha, Jubin Lee, Misun Yu, Jeman Park, Jemin Lee.

Deep learning has expanded its footprint across diverse domains. The performance of these computations hinges on the interplay between deep learning compilers and inference libraries. While compilers adapt efficiently to new deep learning operations or models, their tuning processes are too time-consuming. In contrast, inference libraries offer quick execution but with adaptability limitations. To address these challenges, we propose ACLTuner, which optimizes execution configurations using existing inference library kernels. ACLTuner identifies and assigns the optimal kernel through targeted device profiling. Compared to ArmNN, AutoTVM, Ansor, ONNXRuntime, and TFLite, ACLTuner not only achieves up to 2.0x faster execution time across seven deep learning models, but also reduces the average tuning time by 95%. (paper)
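
The core selection step is easy to picture: time each candidate library kernel on the target device and keep the fastest, as in the sketch below. The profiling loop is a generic illustration; ACLTuner's actual tuning pipeline and kernel interfaces are more involved.

```python
import time

def pick_fastest_kernel(kernels, run, warmup=3, iters=20):
    # `kernels` is a list of candidate kernel configurations and `run(k)`
    # executes one inference with kernel k (assumed callables, for illustration).
    best, best_time = None, float("inf")
    for k in kernels:
        for _ in range(warmup):
            run(k)                                  # warm caches / clocks
        start = time.perf_counter()
        for _ in range(iters):
            run(k)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best, best_time = k, elapsed
    return best
```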

πŸ”½ Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference. Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin CUI.

Due to the recent success of diffusion models, text-to-image generation is becoming increasingly popular and achieves a wide range of applications. Among them, text-to-image editing, or continuous text-to-image generation, attracts lots of attention and can potentially improve the quality of generated images. Users commonly want to slightly edit a generated image by making minor modifications to their input textual descriptions over several rounds of diffusion inference. However, such an image editing process suffers from long-standing heuristics and low inference efficiency: the extent of image editing is uncontrollable, and unnecessary editing invariably leads to extra computation. To solve this problem, we introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference method for efficient text-to-image editing. Extensive empirical results show that FISEdit can be 3.4x and 4.4x faster than existing methods on NVIDIA TITAN RTX and A100 GPUs, respectively, and even generates more satisfactory images. (paper)

πŸ”½ PARM: Adaptive Resource Allocation for Datacenter Power Capping. Haoran Qiu, Linghao Zhang, Chen Wang, Hubertus Franke, Zbigniew Kalbarczyk, Ravi Iyer.

Energy efficiency is a pressing concern in today's cloud datacenters. Various power management strategies, such as oversubscription, power capping, and dynamic voltage and frequency scaling, have been proposed and are in use by datacenter operators to better control power consumption at any management unit (e.g., node-level or rack-level) without breaking power budgets. In addition, by gaining more control over different management units within a datacenter (or across datacenters), operators are able to shift energy consumption either spatially or temporally to optimize carbon footprint based on the spatio-temporal patterns of carbon intensity. The drive for automation has resulted in the exploration of learning-based resource management approaches. In this work, we first systematically investigate the impact of power capping on both latency-critical datacenter workloads and learning-based resource management solutions (i.e., reinforcement learning or RL). We show that even a 20% reduction in power limit (power capping) leads to an 18% degradation in resource management effectiveness (as defined by an RL reward function), which causes 50% higher application latency. We then propose PARM, an adaptive resource allocation framework that provides a graceful, performance-preserving transition under power capping for latency-critical workloads. Evaluation results show that PARM achieves a 10.2-99.3% improvement in service-level objective (SLO) preservation under power capping while improving utilization by 3.1-5.8%. (paper)

πŸ”½ Efficient Prompt Caching for Large Language Model Inference via Embedding Similarity. Hanlin Zhu, Banghua Zhu, Jiantao Jiao.

Large language models (LLMs) have achieved huge success in numerous natural language processing (NLP) tasks. However, they face the challenge of significant resource consumption during inference. In this paper, we aim to improve the inference efficiency of LLMs by prompt caching, i.e., if the current prompt can be answered by the response to a previous prompt, one can directly reuse that response without calling the LLM. Specifically, we focus on the prediction accuracy of prompt caching for single-round question-answering tasks via embedding similarity. Existing embeddings of prompts mostly focus on whether two prompts are semantically similar, which is not necessarily equivalent to whether the same response can answer them. Therefore, we propose a distillation-based method to fine-tune existing embeddings for better caching prediction. Theoretically, we provide finite-sample guarantees for the convergence of our method under different types of loss functions. Empirically, we construct a dataset based on Kwiatkowski et al. [2019] and fine-tune the embedding from Wang et al. [2022], which improves the AUC of caching prediction from 0.85 to 0.92 within 10 minutes of training. The resulting embedding model improves the throughput over the initial embedding model. (paper)
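
The serving-side mechanism is straightforward to sketch: embed the incoming prompt, compare it against cached prompt embeddings, and reuse the stored response when the similarity clears a threshold. In the sketch below, `embed` and `call_llm` are assumed stand-ins and the threshold is illustrative; the paper's contribution is fine-tuning the embedding so that this decision is accurate.

```python
import numpy as np

class PromptCache:
    def __init__(self, embed, call_llm, threshold=0.92):
        self.embed, self.call_llm, self.threshold = embed, call_llm, threshold
        self.keys, self.responses = [], []

    def query(self, prompt):
        e = self.embed(prompt)
        e = e / np.linalg.norm(e)
        if self.keys:
            sims = np.stack(self.keys) @ e              # cosine similarities
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:            # cache hit: reuse response
                return self.responses[best]
        response = self.call_llm(prompt)                # cache miss: call the LLM
        self.keys.append(e)
        self.responses.append(response)
        return response
```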

πŸ”½ CloudEval-YAML: A Realistic and Scalable Benchmark for Cloud Configuration Generation. Yifei Xu, Yuning Chen, Xumiao Zhang, Xianshang Lin, Pan Hu, Yunfei Ma, Songwu Lu, Wan Du, Zhuoqing Mao, Ennan Zhai, Dennis Cai.

Amid the thriving ecosystem of cloud computing and the proliferation of Large Language Model (LLM)-based code generation tools, there is a lack of benchmarking for code generation in cloud-native applications. In response to this need, we present CloudEval-YAML, a practical benchmark for cloud configuration generation. CloudEval-YAML tackles the diversity challenge by focusing on YAML, the de facto standard of numerous cloud-native tools. We develop the CloudEval-YAML benchmark with practicality in mind: the dataset consists of hand-written problems with unit tests targeting practical scenarios. To improve practicality during evaluation, we build a scalable evaluation platform for CloudEval-YAML that achieves a 20x speedup over a single machine. To the best of our knowledge, the CloudEval-YAML dataset is the first hand-written dataset targeting cloud-native applications. We present an in-depth evaluation of 13 LLMs, leading to a deeper understanding of the problems and LLMs, as well as effective methods to improve task performance and reduce cost. The codebase is released at https://github.com/alibaba/CloudEval-YAML. (paper)

πŸ”½ Learning Collaborative Information Dissemination with Graph-based Multi-Agent Reinforcement Learning. Raffaele Galliera, K. Brent Venable, Matteo Bassani, Niranjan Suri.

In modern communication systems, efficient and reliable information dissemination is crucial for supporting critical operations across domains like disaster response, autonomous vehicles, and sensor networks. This paper introduces a Multi-Agent Reinforcement Learning (MARL) approach as a significant step forward in achieving more decentralized, efficient, and collaborative solutions. We propose a Partially Observable Stochastic Game (POSG) formulation for information dissemination, empowering each agent to decide on message forwarding independently, based on its one-hop neighborhood and the degree of connectivity of each neighbor. This constitutes a significant paradigm shift from traditional heuristics based on Multi-Point Relay (MPR) selection. Our approach harnesses Graph Convolutional Reinforcement Learning, employing Graph Attention Networks (GAT) with dynamic attention to capture essential network features. We propose two approaches, L-DGN and HL-DGN, which differ in the information that is exchanged among agents. We evaluate the performance of our decentralized approaches by comparing them with a widely used MPR heuristic, and we show that our trained policies are able to efficiently cover the network while bypassing the MPR set selection process. Our approach promises a first step toward supporting the resilience of real-world broadcast communication infrastructures via learned, collaborative information dissemination. (paper)

πŸ”½ ComPile: A Large IR Dataset from Production Sources. Aiden Grossman, Ludger Paehler, Konstantinos Parasyris, Tal Ben-Nun, Jacob Hegna, William S. Moses, Mircea Trofin, Johannes Doerfert.

Code is increasingly becoming a core data modality of modern machine learning research, impacting not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude, and the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code does not take advantage of 70 years of programming language and compiler development, as it fails to utilize the structure inherent to programs in the data collection. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we fully utilize the LLVM compiler infrastructure, shared by a number of languages, to generate a 182B-token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++, by hooking into LLVM code generation either through the language's package manager or the compiler directly, to extract the dataset of intermediate representations from production-grade programs. Our dataset shows great promise for large language model training and machine-learned compiler components. (paper)

πŸ”½ Enhancing ML model accuracy for Digital VLSI circuits using diffusion models: A study on synthetic data generation. Prasha Srivastava, Pawan Kumar, Zia Abbas.

Generative AI has seen remarkable growth over the past few years, with diffusion models being state-of-the-art for image generation. This study investigates the use of diffusion models for generating artificial data for electronic circuits to enhance the accuracy of subsequent machine learning models in tasks such as performance assessment, design, and testing, where training data is usually very limited. We utilize simulations in the HSPICE design environment with 22nm CMOS technology nodes to obtain representative real training data for our proposed diffusion model. Our results demonstrate the close resemblance of the synthetic data produced by the diffusion model to real data. We validate the quality of the generated data and demonstrate that data augmentation is effective in predictive analysis of VLSI design for digital circuits. (paper)

πŸ”½ LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation. Zixi Zhang, Greg Chadwick, Hugo McNally, Yiren Zhao, Robert D. Mullins.

Test stimuli generation has been a crucial but labour-intensive task in hardware design verification. In this paper, we revolutionize this process by harnessing the power of large language models (LLMs) and present a novel benchmarking framework, LLM4DV. This framework introduces a prompt template for interactively eliciting test stimuli from the LLM, along with four innovative prompting improvements to support the pipeline execution and further enhance its performance. We compare LLM4DV to traditional constrained-random testing (CRT), using three self-designed design-under-test (DUT) modules. Experiments demonstrate that LLM4DV excels in efficiently handling straightforward DUT scenarios, leveraging its ability to employ basic mathematical reasoning and pre-trained knowledge. While it exhibits reduced efficiency in complex task settings, it still outperforms CRT in relative terms. The proposed framework and the DUT modules used in our experiments are open-sourced. (paper)

πŸ”½ Multi-Agent Join. Arash Termehchy, Bakhtiyar Doskenov, Bharghav Srikhakollu, Summit Haque, Huazheng Wang.

Real-time performance is crucial for interactive and exploratory data analysis, where users require quick access to subsets or progressive presentations of query results. Delivering real-time results over large data for common relational binary operators like join is challenging, as join algorithms often spend considerable time scanning and attempting to join parts of relations that may not produce any results. Existing solutions often involve repetitive preprocessing, which is costly and may not be feasible for interactive workloads or evolving datasets. Additionally, these solutions may support only restricted types of joins. This paper presents a novel approach for achieving efficient progressive join processing. The scan operator of the join learns online during query execution, identifying portions of its underlying relation that satisfy the join condition. Additionally, an algorithm is introduced where both scan operators collaboratively learn to optimize join execution. (paper)
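
One way to picture a learning scan operator is as a bandit over partitions of its relation: it keeps a running match rate per partition and scans the most productive ones first, occasionally exploring others. The epsilon-greedy sketch below is only an illustration of that idea, not the collaborative algorithm proposed in the paper.

```python
import random

class LearnedScan:
    def __init__(self, num_partitions, epsilon=0.1):
        self.hits = [0] * num_partitions     # join matches seen per partition
        self.tries = [1] * num_partitions    # scans per partition (starts at 1)
        self.epsilon = epsilon

    def next_partition(self):
        if random.random() < self.epsilon:   # explore a random partition
            return random.randrange(len(self.hits))
        rates = [h / t for h, t in zip(self.hits, self.tries)]
        return max(range(len(rates)), key=rates.__getitem__)   # exploit best so far

    def feedback(self, partition, matches):
        self.tries[partition] += 1
        self.hits[partition] += matches
```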

πŸ”½ Early notice: GenAI-based Datarace Fix for Real-World Golang Programs. Feiyang Jin, Zhizhou Zhang, Rajkishore Barik, Gautam Korlam, Milind Chabbi.

Data race detection has been a subject of extensive research for decades; the practical deployment of race detectors has also become increasingly commonplace in industrial settings. However, the focus has mainly been on the detection aspect, with relatively little attention directed toward the challenging task of autonomously repairing programs with data races. This discrepancy is understandable given the inherent complexities of fixing the data race and the substantial engineering efforts required to integrate fixes into existing workflows. In this paper, we introduce a novel closed-loop application that harnesses the power of Generative AI to fix data races automatically. Our early experiments involving this application within Uranus's internal codebase have yielded promising results. The evaluation results suggest a bright future for integrating this application into Uranus's infrastructure, potentially revolutionizing how data races are handled in large-scale software development environments. (paper)