Accepted Papers
Large language model (LLM) agents have demonstrated strong potential for improving the performance of complex computer systems, such as cluster scheduling, network congestion control, and adaptive video streaming. However, in the absence of a standard, safe, and extensible benchmarking platform, it is difficult to evaluate whether these LLM agents improve real-world system performance, and by how much.
We present InfraGym, an open, extensible platform where researchers can study computer system optimization with LLM agents.
Our current release includes three real-world cases and supports interaction with both simulated and real environments. We benchmark multiple LLM agents on these tasks using both open-source and closed-source LLMs, and outline future directions. The code is available at https://github.com/MLSysOps/InfraGym. (paper)🔽 InfraGym: Empowering LLM Agents for Real-World Computer System Optimization. Huaizheng Zhang, Lei Zhang, Yuanming Li, Yizheng Huang, Xiaotong Yang, Kuntai Du, Yihua Cheng, Junchen Jiang, Wencong Xiao.
Classical machine-learning auto-tuners for OS control struggle with semantic gaps, brittle rewards, and unsafe exploration.
We introduce an online, LLM-driven agent that emulates expert reasoning for continuous OS optimization.
When tuning the Linux Completely Fair Scheduler’s hyperparameters, the agent outperforms Bayesian optimization by 5\% in single-parameter tuning, 7.1\% in two-parameter co-tuning, and a human expert by 2.98\% overall, while converging faster and adapting more quickly to workload changes.
When application counters are unavailable, system-level proxies (e.g., Instructions Per Cycle (IPC)) preserved tail latency in our setup.
Putting this together, we propose adopting the Model Context Protocol (MCP) for tool/resource discovery, invocation, and logging; on top of that, we propose adding transactional apply-commit-revert, host-mediated approval gates, and policy controls in the OS-tuning server and host to ensure safe, auditable operation. Our results and reference design suggest a practical path toward safe, self-adapting OS control. (paper)🔽 An Expert in Residence: LLM Agents for Always-On Operating System Tuning. Georgios Liargkovas, Vahab Jabrayilov, Hubertus Franke, Kostis Kaffes.
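For illustration, a minimal sketch of the transactional apply-commit-revert step described above, assuming a root shell, a caller-supplied workload probe, and an example sysctl knob name; it is not the authors' implementation.

```python
import subprocess
from typing import Callable

# Example knob name only; the real agent would pick knobs from its tool/resource discovery.
KNOB = "kernel.sched_migration_cost_ns"

def read_knob(knob: str) -> str:
    return subprocess.check_output(["sysctl", "-n", knob], text=True).strip()

def write_knob(knob: str, value: str) -> None:
    subprocess.check_call(["sysctl", "-w", f"{knob}={value}"])

def apply_commit_revert(knob: str, candidate: str,
                        probe: Callable[[], float], tolerance: float = 1.05) -> bool:
    """Apply a candidate value; keep it only if the workload probe does not regress."""
    baseline_value = read_knob(knob)
    baseline = probe()                      # e.g., measured p99 latency under the workload
    write_knob(knob, candidate)             # apply
    if probe() <= tolerance * baseline:
        return True                         # commit: leave the new value in place
    write_knob(knob, baseline_value)        # revert
    return False
```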
Traditional auto-parallelizing compilers, which depend on rigid heuristics, face challenges with the complexity of modern heterogeneous systems. This paper introduces a detailed evaluation of auto-parallelization driven by small (1B-parameter) Language Models (LLMs) for compilers. We assess three models (gemma3, llama3.2, and qwen2.5) employing six reasoning strategies on 11 real-world kernels from scientific computing, graph algorithms, and machine learning. Our system is compared against strong compiler baselines such as LLVM Polly, TVM, and Triton. Across 376 evaluations, our LLM-driven method achieves an average speedup of 6.81x and a maximum speedup of 43.25x on convolution operations. We examine scalability, confirm correctness using multiple sanitizers, and validate robustness across various compilers and hardware. Our results show that small, efficient LLMs can act as effective reasoning engines for intricate compiler optimization tasks. (paper)🔽 Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems. Prathamesh Devadiga.
The sustainable control of geo-distributed datacenters is a critical systems challenge, defined by large-scale, dynamic, and uncertain operating conditions. While specialized numerical experts, such as those from Reinforcement Learning (RL) or Model Predictive Control (MPC), can be trained to find optimal control policies, their practical deployment is blocked by fundamental systems-level flaws: they are brittle, failing to scale with the system; opaque, preventing operator trust; and rigid, unable to adapt to new runtime objectives.
This paper introduces a novel framework that directly addresses these issues by distilling the policy of a numerical expert into an adaptive LLM agent. Our method transforms the expert's opaque logic into a transparent, interactive, and agentic workflow. To validate this approach, we distill a state-of-the-art RL policy for carbon-aware workload orchestration.
Evaluated in a high-fidelity simulation, our resulting LLM agent demonstrates the capabilities essential for real-world systems deployment. It solves the scalability problem, successfully managing topologies more than three times larger than the expert's training environment. It enables true runtime adaptability, altering its strategy in minutes in response to complex operator commands that would require days of costly retraining for the original expert. By making powerful optimizers manageable and resilient, our work offers a practical pathway to the sustainable control of large-scale computer systems. (paper)🔽 Sustainable Control of Geo-Distributed Datacenters by Distilling Numerical Experts into Adaptive LLM Agents. Antonio Guillen-Perez, Ashwin Ramesh Babu, Sahand Ghorbanpour, Avisek Naug, Vineet Gundecha, Sifat Muhammad Abdullah, Ricardo Luna Gutierrez, Soumyendu Sarkar.
Efficient thermal management is a major bottleneck in scaling high-performance computing (HPC) systems, where cooling accounts for a substantial share of total energy use. Liquid-cooled cold plates are increasingly adopted in data centers and power electronics, yet their design optimization remains costly due to computationally burdensome computational fluid dynamics (CFD) simulations and high-dimensional geometric spaces. We introduce a physics-informed neural network (PINN) framework for rapid thermal analysis and design exploration of parameterized cold plates. Our approach jointly solves the incompressible Navier–Stokes and conjugate heat transfer equations, leveraging a two-stage curriculum that first stabilizes liquid flow field learning before introducing thermal coupling. Once trained, the model produces physically consistent predictions and orders-of-magnitude faster inference than conventional CFD solvers. We demonstrate the framework across multiple cold plate topologies, capturing design-dependent flow patterns and thermal gradients that inform geometry–performance trade-offs. These results establish PINNs as a promising surrogate modeling tool for accelerating liquid-cooling design workflows, with implications for reducing the energy and carbon footprint of HPC infrastructure. (paper)🔽 ML-Guided Cold Plate Design and Thermal Analysis for Liquid-Cooled HPC Servers. Refik Mert Cam, Avisek Naug, Andrew E. Shao, Soumyendu Sarkar.
While agentic AI systems perform impressively on emerging capability benchmarks, existing performance evaluation suites focus on non-agentic workloads, leaving a critical gap in understanding system efficiency for multi-step, tool-using agents. We present the Agentic Bridge Framework for extracting actionable performance insights from capability evaluations through trace-level telemetry. Applying this framework to a multi-agent system on GAIA validation, we reveal that: (1) pass@N strategies provide diminishing accuracy returns; (2) search agents dominate token usage and latency, identifying web data gathering as the primary bottleneck; (3) reasoning models spend more tokens on context preservation than actual reasoning, highlighting costly inter-agent communication overhead. These findings inform critical design choices—context engineering, tool-use optimization, and phase-aware resource allocation—and illustrate how agent traces can inform reproducible performance workloads, bridging capability achievements with systems optimization for efficient agentic AI. (paper)🔽 Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks. Yun Du, Rubens Lacouture, Qizheng Zhang, Genghan Zhang, Tian Zhao, Kunle Olukotun.
Distributed LLM inference requires careful coordination of parallelization strategies across hundreds to thousands of NPUs to meet production SLOs. Current systems like Megatron-LM rely on static heuristics that separately configure parallelism degrees and per-operator sharding dimensions, leaving significant performance on the table as models scale and hardware topologies diversify. We introduce Learn to Shard, to our knowledge, the first RL-based approach to co-optimize both coarse-grained parallelism degrees and fine-grained per-operator sharding dimensions for distributed LLM inference. Our method employs an attention-based policy over an elite history that learns from high-performing strategies to efficiently navigate the vast combinatorial search space. Evaluated on H100 clusters with MoE models up to 1.6T parameters, Learn to Shard achieves up to 3.5$\times$ throughput improvement over metaheuristic baselines and 1.06$\times$ over Megatron heuristics. (paper)🔽 Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference. Ruokai Yin, Sattwik Deb Mishra, Xuan Zuo, Hokchhay Tann, Preyas Shah, Apala Guha.
We describe the development of a specialized code-completion solution for hardware designers in a large enterprise. It handles their specific flavor of System Verilog, and uses a low-latency on-prem fine-tuned model. We outline the process of developing this solution, from data curation, through several stages of model fine-tuning with different contexts, to evaluation and real-time confidence assessment. We then present our results for fine-tuning a 1B-parameter model on ∼1B tokens of in-domain System Verilog code, achieving high semantic fidelity and low latency for both end-of-line and multi-line completions. Our results demonstrate that small, specialized models can satisfy the latency and privacy requirements of enterprise deployment, offering a viable alternative to general-purpose LLMs in constrained settings. (paper)🔽 Small, Fast, and Certain: Developing a Specialized Verilog Code Completion Solution for the Enterprise. Eran Avidan, Lior Tondovsky, Amitai Armon.
In multi-GPU Mixture-of-Experts (MoE) networks, distributing experts across GPUs leads to load imbalance as token assignments vary. Recent methods address this by duplicating popular experts on additional GPUs, requiring accurate prediction of token distributions before routing. This paper examines the tradeoffs between prediction strategy, accuracy, overhead, and system performance. We introduce MoE-GPS, a framework that quantifies these impacts and identifies optimal predictor designs for various system settings. Our results highlight Distribution-Only Prediction, which predicts coarse token distribution with much lower overhead than Token-to-Expert Prediction, achieving 23\% faster inference on the Mixtral 8×7B MMLU dataset. (paper)🔽 MoE-GPS: Guidelines for Prediction Strategy with Expert Duplication in MoE Load Balancing. Haiyue Ma, Zhixu Du, Yiran Chen.
Count-Min Sketch (CMS) is a memory-efficient data structure for estimating the frequency of elements in a multiset.
Learned Count-Min Sketch (LCMS) enhances CMS with a machine learning model to reduce estimation error under the same memory usage, but suffers from slow construction due to empirical parameter tuning and lacks theoretical guarantees on intolerable error probability.
We propose Optimized Learned Count-Min Sketch (OptLCMS), which partitions the input domain and assigns each partition to its own CMS instance, with CMS parameters $(\epsilon, \delta)$ analytically derived for fixed thresholds, and thresholds optimized via dynamic programming with approximate feasibility checks. This reduces the need for empirical validation, enabling faster construction while providing theoretical guarantees under these assumptions.
OptLCMS also allows explicit control of the allowable error threshold, improving flexibility in practice.
Experiments show that OptLCMS builds faster, achieves lower intolerable error probability, and matches the estimation accuracy of LCMS. (paper)🔽 Optimized Learned Count-Min Sketch. Kyosuke Nishishita, Atsuki Sato, Yusuke Matsui.
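For readers unfamiliar with the underlying data structure, a minimal sketch of a Count-Min Sketch whose width and depth are derived analytically from $(\epsilon, \delta)$, followed by a schematic of the per-partition layout; the partition thresholds and parameter values below are hypothetical, not those produced by OptLCMS.

```python
import math
import random

class CountMinSketch:
    """Plain Count-Min Sketch with width/depth derived analytically from (epsilon, delta)."""
    def __init__(self, epsilon: float, delta: float, seed: int = 0):
        self.width = math.ceil(math.e / epsilon)        # overestimate <= epsilon * total count
        self.depth = math.ceil(math.log(1.0 / delta))   # ... with probability >= 1 - delta
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _col(self, row: int, item) -> int:
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item) -> int:
        return min(self.table[row][self._col(row, item)] for row in range(self.depth))

# Schematic of the partitioned layout: each partition of the input domain gets its own CMS
# with its own analytically chosen (epsilon, delta). Threshold and parameters are hypothetical.
def make_partitioned(threshold: int):
    sketches = {"heavy": CountMinSketch(0.01, 0.001), "light": CountMinSketch(0.001, 0.001)}
    def pick(predicted_frequency: int) -> CountMinSketch:
        return sketches["heavy"] if predicted_frequency >= threshold else sketches["light"]
    return pick
```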
The matrix multiplications that comprise the bulk of computation in deep learning are being performed in increasingly narrow-precision formats. For example, next-generation AI accelerators support dot products in MXFP4, a format requiring only 4.25 bits per element. However, accelerators' performance for low-precision matrix multiplication far outstrips their performance on reductions and elementwise computations that are still performed in higher precision. In this work, we reduce the cost of normalising tensors by approximating the RMSNorm of an MXFP tensor using only the MX block scales, thereby enabling a 32x decrease in the size of reductions needed for normalisation. We validate our approximation on pre-training of Llama-3 models of 250M and 1B parameters, finding minimal loss of training accuracy compared to a baseline using RMSNorm with MXFP8 matmuls. (paper)🔽 MXNorm: Reusing block scales for efficient tensor normalisation. Callum McLean, Luke Yuri Prince, Alexandre Payot, Paul Balanca, Carlo Luschi.
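One plausible form of such a block-scale approximation, shown below as an assumption on our part rather than the paper's exact formula: since every element in an MX block is its shared scale times a small mantissa, the tensor RMS can be estimated from the block scales alone, up to a constant absorbing the average mantissa magnitude.

```python
import numpy as np

BLOCK = 32  # MX block size

def mx_quantize(x: np.ndarray):
    """Toy MX-style quantization: one power-of-two scale per block of 32 elements."""
    blocks = x.reshape(-1, BLOCK)
    scales = 2.0 ** np.ceil(np.log2(np.max(np.abs(blocks), axis=1) + 1e-12))
    q = blocks / scales[:, None]          # quantized mantissas, |q| <= 1
    return q, scales

def rms_exact(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x ** 2)))

def rms_from_scales(scales: np.ndarray, c: float) -> float:
    # Estimate the tensor RMS from block scales alone: per-block RMS is roughly
    # proportional to the block scale, with c absorbing the average mantissa RMS.
    return float(c * np.sqrt(np.mean(scales ** 2)))

x = np.random.randn(4096).astype(np.float32)
q, scales = mx_quantize(x)
c = rms_exact(q)  # in practice a fixed constant; estimated here only for illustration
print(rms_exact(x), rms_from_scales(scales, c))
```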
Optimizing large-language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges due to its complex optimization space. Existing optimization methods, however, rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce $\textbf{ASAP}$, an $\textbf{A}$gentic $\textbf{S}$olution to $\textbf{A}$uto-optimize $\textbf{P}$erformance of Large-Scale LLM Training. It is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, which integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design can automate the diagnosis of performance bottlenecks and recommend optimized sharding configurations with reasoning, thus effectively improving the efficiency of distributed LLM training. Experiments have shown that the ASAP-generated sharding configurations can contribute up to 28\% training step time reduction and 1.43$\times$ throughput improvement. When combined with additional optimization from human experts, throughput can be further increased to 2.58$\times$. The proposed ASAP promises to provide a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training. (paper)🔽 ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. Yuran Ding, Xinwei Chen, Xiaofan Zhang, Zongwei Zhou.
Large language models (LLMs) have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Existing approaches lack a key characteristic that human performance engineers rely on for near-optimal utilization: hardware-awareness. By leveraging the workload's specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, we can make software-level optimizations that are tailored to the underlying hardware. SwizzlePerf automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness.
For a GEMM kernel, SwizzlePerf takes less than 5 minutes to generate the same hardware-specific optimal swizzling pattern that took expert performance engineers 2 weeks to find. On a suite of 10 diverse ML and Science kernels, SwizzlePerf can generate swizzling patterns for 9 of the kernels that achieve up to a 2.06x speedup and 70% improvement in L2 hit rate. This work is the first of many steps toward systematically creating hardware-aware LLM performance engineering agents. (paper)🔽 SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization. Arya Tschand, Kesavan Ramakrishnan, Muhammad A. Awad, Ryan Swann, Jeffrey Jian Ma, Keith Lowery, Vijay Janapa Reddi.
High-performance computing (HPC) systems rely on job schedulers like Slurm to allocate compute resources to submitted workloads. Recently, machine learning models have been used to predict job runtimes, which schedulers can use to optimize utilization. However, many of these models struggle to effectively encode string-type job features, typically relying on integer-based Label or One-hot encoding methods. In this paper, we use Transformer-based large language models, particularly Sentence-BERT (SBERT), to semantically encode job features for regression-based job runtime prediction. Using a 90,000-record, 169-feature Slurm dataset, we evaluate four SBERT variants and compare them against traditional encodings using four regression models. Our results show that SBERT-based encodings, especially using the all-MiniLM-L6-v2 model, substantially outperform conventional methods, achieving an R² score of up to 0.88, 2.3× higher than the traditionally used Label encoding. Moreover, we highlight practical trade-offs, such as model memory size versus accuracy, to guide the selection of efficient encoders for production HPC systems. (paper)🔽 Leveraging Large Language Models to Enhance Machine-Learning-Driven HPC Job Scheduling. Kshitij Bhardwaj, Torrey Wagner, Edgar A. Leon.
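A minimal sketch of the encoding pipeline described above, using the sentence-transformers and scikit-learn APIs but hypothetical job records and feature choices; the paper's full 169-feature setup and model comparison are not reproduced here.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestRegressor

# Hypothetical job records; real Slurm logs have many more fields.
jobs = [
    {"job_name": "lammps_md_run", "partition": "gpu", "account": "matsci", "runtime_s": 7200},
    {"job_name": "vasp_relax", "partition": "cpu", "account": "chem", "runtime_s": 3600},
    {"job_name": "pytorch_train_resnet", "partition": "gpu", "account": "mlgroup", "runtime_s": 10800},
]

# Concatenate string-typed features into one sentence per job and embed with SBERT.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
texts = [f'{j["job_name"]} {j["partition"]} {j["account"]}' for j in jobs]
X = encoder.encode(texts)                      # (n_jobs, 384) dense embeddings
y = np.array([j["runtime_s"] for j in jobs])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(encoder.encode(["gromacs_md gpu matsci"])))  # predicted runtime (s)
```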
Large Language Model (LLM)-enabled agents are rapidly emerging across a wide range of applications, but their deployment introduces vulnerabilities with security implications. While prior work has examined prompt-based attacks (e.g., prompt injection) and data-oriented threats (e.g., data exfiltration), time-of-check to time-of-use (TOCTOU) vulnerabilities remain largely unexplored in this context. TOCTOU arises when an agent validates external state (e.g., a file or API response) that is later modified before use, enabling practical attacks such as malicious configuration swaps or payload injection. In this work, we present the first study of TOCTOU vulnerabilities in LLM-enabled agents. We introduce TOCTOU-Bench, a benchmark with 66 realistic user tasks designed to evaluate this class of vulnerabilities. As countermeasures, we adapt detection and mitigation techniques from systems security to this setting and propose prompt rewriting, state integrity monitoring, and tool-fusing. Our study highlights challenges unique to agentic workflows, where we achieve up to 25% detection accuracy using automated detection methods, a 3% decrease in vulnerable plan generation, and a 95% reduction in the attack window. When combining all three approaches, we reduce the TOCTOU vulnerabilities in executed trajectories from 12% to 8%. Our findings open a new research direction at the intersection of AI safety and systems security. (paper)🔽 Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents. Derek Lilienthal, Sanghyun Hong.
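As a flavor of the state-integrity-monitoring countermeasure, a minimal sketch that pins a content hash at time-of-check and re-verifies it at time-of-use; the file name, config content, and surrounding agent flow are assumptions, not the paper's benchmark tasks.

```python
import hashlib
import tempfile
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Content hash recorded at time-of-check."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def checked_use(path: Path, expected_digest: str) -> bytes:
    """Re-read immediately before use and abort if the content changed since the check."""
    data = path.read_bytes()
    if hashlib.sha256(data).hexdigest() != expected_digest:
        raise RuntimeError(f"TOCTOU violation: {path} changed between check and use")
    return data  # act only on the bytes that were just verified

# Demo with a temporary stand-in for an agent's config file (hypothetical content).
with tempfile.TemporaryDirectory() as d:
    cfg = Path(d) / "agent_config.yaml"
    cfg.write_text("tool_allowlist: [search, calculator]\n")
    digest = fingerprint(cfg)                 # time-of-check, at planning time
    # ... other tool calls run here; an attacker could try to swap the file ...
    config_bytes = checked_use(cfg, digest)   # time-of-use, at execution time
```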
We present a retrieval system for answering questions about Verilog / System Verilog code bases. Standard vector RAG (retrieval-augmented generation) often fails on hardware description languages due to identifier renaming, coding-style variation, hierarchy, and concurrency. We instead construct knowledge graphs over the code and its LLM-generated explanations and retrieve based on the entities and relations. We achieve this by adapting the GraphRAG package, originally intended for natural language, to our specific code use case. We compare (i) standard semantic retrieval on the explanations, (ii) GraphRAG over code, and (iii) GraphRAG over the explanations. On a corpus of ∼3.5K files and a benchmark of 29 questions, using top-1 file-level recall, the first baseline reaches 31%. GraphRAG consistently outperforms it, achieving 55–59% when utilizing the explanations, and up to 79% when considering retrieved equivalent files. Constructing the graph with GPT-4o-mini worked well without requiring the larger GPT-4o, but GPT-4o was required to answer the queries better. Our results indicate that the suggested graph-based approach can be useful for answering hardware designers' questions about the code base. (paper)🔽 Retrieval on Verilog Repositories: A Knowledge-Graph Based Solution. Adi Szeskin, Itay Lieder, Divyasree Tummalapalli, Amitai Armon.
The rapid growth of large language models (LLMs) in high-performance computing (HPC) data centers necessitates a shift from purely energy-efficient to carbon-aware control for liquid cooling systems. We introduce a novel multi-agent framework that leverages LLM-powered agents to achieve autonomous, carbon-aware thermal management. Our architecture features eight specialized agents coordinated via a hybrid Redis and Model Control Protocol (MCP) backbone for real-time operation. We validate our approach on a high-fidelity digital twin of the Frontier supercomputer's cooling system, focusing on a core contribution: a hybrid Reinforcement Learning (RL) and LLM control strategy. Experimental results show that our `RL $\rightarrow$ LLM` hybrid model significantly outperforms traditional baselines and other LLM configurations, achieving the lowest average blade temperatures (28.29°C) and the lowest carbon emissions (11.1 kg/hr), while maintaining operational stability. This work presents a practical blueprint for deploying agentic AI to create sustainable, efficient, and explainable control systems for complex cyber-physical infrastructure. (paper)🔽 Carbon-Aware RL-LLM Control for Energy-Efficient Liquid-Cooled HPC Data Centers. Avisek Naug, Sahand Ghorbanpour, Ashwin Ramesh Babu, Antonio Guillen-Perez, Vineet Gundecha, Ricardo Luna Gutierrez, Soumyendu Sarkar.
Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application-specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI's role of semantic reasoning ("what to optimize") from the system's role of execution ("how to observe and act"). Implemented as a Model Context Protocol (MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configurations with static and dynamic analysis before deployment. We demonstrate this architecture's power with sched-agent, a multi-agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched_ext infrastructure. Our evaluation shows that SchedCP achieves up to 1.79x performance improvement and 13x cost reduction compared to naive agentic approaches, all while maintaining a high success rate. The code will be open-sourced. (paper)🔽 Towards Agentic OS: An LLM Agent Framework for Linux Schedulers. YUSHENG ZHENG, YanPeng Hu, Wei Zhang, Andi Quinn.
We introduce OptRot, a data-free preprocessing method to learn fusible rotations for post-training quantization of language models. OptRot reduces weight outliers by finding rotations which minimize the element-wise fourth power of the rotated weights. We show how reducing weight outliers can provably improve weight quantization performance and how OptRot rotations can outperform both Hadamard rotations and rotations learned by the data-dependent method SpinQuant. (paper)🔽 OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization. Advait Gadhikar, Riccardo Grazzi, James Hensman.
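A minimal sketch of the data-free objective described above, assuming a simple orthogonal parameterization (matrix exponential of a skew-symmetric matrix) and an Adam optimizer, which may differ from OptRot's actual procedure; the learned rotation is then fused into the weights offline.

```python
import torch

# Find an orthogonal R that minimizes the element-wise fourth power (an outlier proxy)
# of the rotated weights W @ R. W here is a random stand-in for a real weight matrix.
torch.manual_seed(0)
W = torch.randn(512, 256)
A = torch.zeros(256, 256, requires_grad=True)
opt = torch.optim.Adam([A], lr=1e-2)

for step in range(200):
    R = torch.matrix_exp(A - A.T)          # exp of a skew-symmetric matrix is a rotation
    loss = ((W @ R) ** 4).mean()           # data-free fourth-power objective
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    R = torch.matrix_exp(A - A.T)
    print("orthogonality error:", (R.T @ R - torch.eye(256)).abs().max().item())
    print("mean fourth power before/after:", (W ** 4).mean().item(), ((W @ R) ** 4).mean().item())
```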
Approximate Nearest Neighbor Search (ANNS) has recently gained significant attention due to its many applications, such as Retrieval-Augmented Generation.
Such applications require ANNS algorithms that support dynamic data, so the ANNS problem on dynamic data has attracted considerable interest.
However, a comprehensive evaluation methodology for data deletion in ANNS has yet to be established.
This study proposes an experimental framework and comprehensive evaluation metrics to assess the efficiency of data deletion for ANNS indexes under practical use cases.
Specifically, we categorize data deletion methods in graph-based ANNS into three approaches and formalize them mathematically.
The performance is assessed in terms of accuracy, query speed, and other relevant metrics.
Finally, we apply the proposed evaluation framework to Hierarchical Navigable Small World, one of the state-of-the-art ANNS methods, to analyze the effects of data deletion, and propose Deletion Control, a method which dynamically selects the appropriate deletion method under a required search accuracy. (paper)🔽 How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?. Tomohiro Yamashita, Daichi Amagata, Yusuke Matsui.
With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads. (paper)🔽 A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving. Ferran Agullo, Joan Oliveras Torra, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Lluis Berral.
Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems. (paper)🔽 When to Reason: Semantic Router for vLLM. Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen.
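A toy stand-in for the router, using embedding similarity to labeled exemplars rather than the paper's trained classifier; the exemplars, the threshold rule, and the enable_thinking request field are all assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
reasoning_exemplars = encoder.encode([
    "Prove that the sum of two even numbers is even.",
    "A train leaves at 3pm traveling 60 mph; when does it overtake the earlier train?",
], convert_to_tensor=True)
simple_exemplars = encoder.encode([
    "What is the capital of France?",
    "Define the term 'operating system'.",
], convert_to_tensor=True)

def needs_reasoning(query: str) -> bool:
    # Route to reasoning mode if the query is closer to reasoning-style exemplars.
    q = encoder.encode(query, convert_to_tensor=True)
    return bool(util.cos_sim(q, reasoning_exemplars).max() > util.cos_sim(q, simple_exemplars).max())

def build_request(query: str) -> dict:
    # Only enable the (more expensive) reasoning mode when the router predicts a benefit.
    return {"prompt": query, "enable_thinking": needs_reasoning(query)}

print(build_request("What is the capital of France?"))
print(build_request("If x + 2y = 7 and 3x - y = 0, solve for x and y."))
```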
Optimizing sparse machine learning (ML) workloads requires navigating a vast schedule space. Two of the most critical aspects of that design space include which operators to fuse and which loop/dataflow order to use within each fused region.
We present AutoSparse, an LLM-guided autoscheduler atop FuseFlow, a sparse ML compiler, that focuses on fusion grouping and legal dataflow order selection. FuseFlow enumerates legal orders per fused region and exposes a lightweight FLOPs/byte signal; the LLM proposes structured candidates (fusion sets and orders) that we validate and rank before codegen. With backend defaults for blocking and parallelism held fixed, case studies on GCN and GraphSAGE show consistent gains over unfused baselines and parity with hand-tuned/heuristic schedules. Coupling LLM reasoning with FuseFlow's legality guards and roofline-style signals efficiently explores sparse scheduling spaces with minimal human effort. (paper)🔽 LLM-Guided Autoscheduling for Large-Scale Sparse Machine Learning. Rubens Lacouture, Genghan Zhang, Konstantin Hossfeld, Tian Zhao, Kunle Olukotun.
The adoption of machine learning-based techniques for analog integrated circuit layout, unlike its digital counterpart, has been limited by the stringent requirements imposed by electric and problem-specific constraints, along with the interdependence of floorplanning and routing steps.
In this work, we address a prevalent concern among layout engineers regarding the need for readily available routing-aware floorplanning solutions. To this end, we develop an automatic floorplanning engine based on reinforcement learning and a relational graph convolutional neural network, specifically tailored to condition floorplan generation towards more routable outcomes.
A combination of increased grid resolution and precise pin information integration, along with a dynamic routing resource estimation technique, allows balancing routing and area efficiency, eventually meeting industrial standards.
When analyzing the place and route effectiveness in a simulated environment, the proposed approach achieves a 13.8% reduction in dead space, a 40.6% reduction in wirelength and a 73.4% increase in routing success when compared to past learning-based state-of-the-art techniques. (paper)🔽 Advancing Routing-Awareness in Analog ICs Floorplanning. Davide Basso, Luca Bortolussi, Mirjana Videnovic-Misic, Husni Habal.
Large-scale training jobs, especially those utilizing GPU clusters, are vulnerable to various failure modes, including individual hardware faults, network issues, and software-level problems. These failures can lead to significant downtime, wasted computational resources, and delays in research or production workflows. We propose an ML-based forecasting algorithm for predicting the health status of GPU clusters. Through extensive ablation studies, we found that cascading 1D CNNs achieve the best performance. The model leverages time-series data representing various cluster metrics, such as temperature, power consumption, and resource utilization, to predict cluster failures, enabling proactive maintenance and resource optimization. By tuning differently per use case, the model achieves an overall PRAUC of 0.90, with precision and recall of 0.99 and 0.90, respectively. This work is motivated by the need to improve the reliability and efficiency of large-scale training jobs that are susceptible to hardware and software failures. (paper)🔽 Forecasting machine degradation of GPU Clusters. Shengnan Cai, Shuxin Nie, Zhehui Chen, Nupur Gulalkari, George Vanica, Chetna Jain, Sethuraman Sankaran.
We present NetGent, an AI-agent framework for automating complex application workflows to generate realistic network traffic datasets. Developing generalizable ML models for networking requires data collection from network environments with traffic that results from a diverse set of real-world web applications. However, making such data collection diverse, repeatable, realistic, and efficient with existing browser automation tools remains fragile and costly. NetGent addresses this challenge by allowing users to specify workflows as natural-language rules that define state-dependent actions. These abstract specifications are compiled into nondeterministic finite automata (NFAs), which a state synthesis component translates into reusable, executable code. This design enables deterministic replay, reduces redundant LLM calls through state caching, and adapts quickly when application interfaces change. In experiments, NetGent automated more than 50 workflows spanning video-on-demand streaming, live video streaming, video conferencing, social media, and web scraping, producing realistic traffic traces while remaining robust to UI variability. By combining the flexibility of language-based agents with the reliability of compiled execution, NetGent provides a scalable foundation for generating the diverse, repeatable datasets needed to advance ML in networking. (paper)🔽 NetGent: Agent-Based Automation of Network Application Workflows. Jaber Daneshamooz, Eugene Vuong, Laasya Koduru, Sanjay Chandrasekaran, Arpit Gupta.
Provenance-based intrusion detection is an increasingly popular application of graphical machine learning in cybersecurity, where system activities are modeled as provenance graphs to capture causality and correlations among potentially malicious actions. Graph Neural Networks (GNNs) have demonstrated strong performance in this setting. However, traditional statically provisioned GNN inference architectures fall short in meeting two crucial demands of intrusion detection: (1) maintaining consistently low detection latency, and (2) handling highly irregular and bursty workloads. To holistically address these challenges, we present GraphFaaS, a serverless architecture tailored for GNN-based intrusion detection. GraphFaaS leverages the elasticity and agility of serverless computing to dynamically scale the GNN inference pipeline. We parallelize and adapt GNN workflows to a serverless environment, ensuring that the system can respond in real time to fluctuating workloads. By decoupling compute resources from static provisioning, GraphFaaS delivers stable inference latency, which is critical for dependable intrusion detection and timely incident response in cybersecurity operations. Preliminary evaluation shows GraphFaaS reduces average detection latency by 85% and coefficient of variation (CV) by 64% compared to the baseline. (paper)🔽 GraphFaaS: Serverless GNN Inference for Burst-Resilient, Real-Time Intrusion Detection. Lingzhi Wang, Vinod Yegneswaran, Xinyi Shi, Ziyu Li, Ashish Gehani, Yan Chen.
Far-memory systems, where applications store less-active data in more energy-efficient memory media, are increasingly adopted by datacenters.
However, applications are bottlenecked by on-demand data fetching from far- to local-memory.
We present $\textbf{\textit{Memix}}$,
a far-memory system that embodies a deep learning–system co-design for efficient and accurate prefetching, minimizing on-demand far-memory accesses.
One key observation is that memory accesses are shaped by both application semantics and runtime context, providing an opportunity to optimize each independently.
Preliminary evaluation of Memix on data-intensive workloads shows that it outperforms the state-of-the-art far-memory system by up to 42%. (paper)🔽 An Early Exploration of Deep-Learning-Driven Prefetching for Far Memory. Yutong Huang, Zhiyuan Guo, Yiying Zhang.
Increasing demand for Large Language Models (LLMs) querying services imposes substantial deployment and computation costs.
LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features.
However, existing works focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets.
In this work, we introduce PORT, the first training-free algorithm designed for online routing scenarios.
Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing.
We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of **3.55$\times$** in overall performance, **1.85$\times$** in cost efficiency, and nearly **4.25$\times$** in throughput.
Our code is available at https://github.com/fzwark/PORT. (paper)🔽 PORT: Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving. Fangzhou Wu, Sandeep Silwal.
Identifying efficient execution strategies for Large Language Models (LLMs) on specialized hardware accelerators requires exploring a vast design space where exhaustive search is computationally prohibitive. Traditional black-box optimization (BBO) methods offer a principled alternative, but their efficiency degrades in high-dimensional, sparse spaces with many infeasible points. We propose LLM-Box, a framework that integrates an LLM agent to guide multi-objective BBO toward the Pareto frontier while significantly reducing sampling of infeasible points. By leveraging the LLM agent to retrieve and structure prior exploration data through retrieval-augmented generation (RAG), and by warm-starting and filtering BBO suggestions, our approach guides the search towards feasible and promising regions of the design space. As a result, LLM-Box identifies Pareto-optimal configurations with a hypervolume difference of less than 3\% using $40{-}150\times$ fewer simulations than an exhaustive search, and compared to a well-known BBO tool, achieves 2\% better accuracy with $20\times$ fewer trials. Moreover, the framework demonstrates zero-shot generalization, transferring knowledge from prior models and hardware to unseen targets. (paper)🔽 LLM-Box : An Agentic Framework for Guided Black-Box Optimization in Mapping LLMs onto Specialized Hardware Accelerators. Sujay Pandit, Akanksha Jain, Rami Cohen, Zhijie Deng, Sagar Karandikar, Sagi Perel, Anand Raghunathan, Parthasarathy Ranganathan.
Domain-specific hardware accelerators for deep neural network (DNN) inference have been widely adopted. Traditional DNN compression techniques such as pruning and quantization help but can fall short when aggressive hardware efficiency is required. We present \textit{NeuSym-HLS}, a partial symbolic distillation and high-level hardware synthesis flow to compress and accelerate DNN inference for edge computing. NeuSym-HLS replaces a portion of the layers of a trained DNN model with compact analytic expressions obtained via symbolic regression, and generates efficient hardware accelerators. The resulting hardware accelerator for the hybrid DNN-symbolic model provides well-balanced performance across algorithmic accuracy, hardware resources, and inference latency. Our evaluation on vision tasks shows that NeuSym-HLS reduces hardware resource usage and latency while maintaining model inference accuracy. (paper)🔽 NeuSym-HLS: Learning-Driven Symbolic Distillation in High-Level Synthesis of Hardware Accelerators. Chung-Mou Pan, Salma Elmalaki, Yasser Shoukry, Sitao Huang.
Deploying useful Long-Context Transformer Models (LCTMs) requires addressing two key challenges: (1) a growing memory footprint due to quadratic self-attention and linear KV-cache scaling as sequence length increases; (2) the ContextRot phenomenon, where empirical evidence suggests that transformer performance degrades with increasing context length. Given the shared dependency on the input, a natural question arises: "Can we surgically select the most important input chunks for processing to synergistically (a) reduce the memory footprint, and (b) mitigate the ContextRot effects?" In this paper, we answer this question in the affirmative for long-context summarization tasks. We propose APCE as a context-aware solution to select the most important input chunks through low-dimensional semantic similarity matching with the current query. By directly operating on the input, APCE decouples from strict dependency on underlying hardware or CUDA environments, promising a compatible solution scalable to different deployment systems. Our empirical evaluations have demonstrated superior or on-par summarization performance for APCE compared to the full dense baseline using a fraction (50%-70%) of the input sequence, resulting in KV-cache and self-attention memory efficiency improvements. We hope our findings inspire further research on context-aware efficiency solutions for LCTMs geared towards other relevant long-context tasks. (paper)🔽 APCE: Adaptive Progressive Context Expansion for Long Context Processing. Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Jacob Song, Woo Seong Chung.
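A minimal sketch of query-conditioned chunk selection in the spirit of APCE; the word-based chunking, the encoder choice, and the keep ratio are our assumptions, not the paper's exact procedure.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_chunks(document: str, query: str, chunk_tokens: int = 256, keep_ratio: float = 0.6) -> str:
    """Keep only the chunks most semantically similar to the query, in original order."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_tokens]) for i in range(0, len(words), chunk_tokens)]
    q_emb = encoder.encode([query])                      # (1, 384)
    c_embs = encoder.encode(chunks)                      # (n_chunks, 384)
    sims = (c_embs @ q_emb.T).squeeze(-1) / (
        np.linalg.norm(c_embs, axis=1) * np.linalg.norm(q_emb) + 1e-12)
    k = max(1, int(keep_ratio * len(chunks)))
    keep = sorted(np.argsort(-sims)[:k])                 # top-k chunks, original order preserved
    return " ".join(chunks[i] for i in keep)             # shorter input => smaller KV cache
```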
Several learned policies have been proposed to replace heuristics for scheduling, caching, and other system components in modern systems.
By leveraging diverse features, learning from historical trends, and predicting future behaviors, such models promise to keep pace with ever-increasing workload dynamism and continuous hardware evolution.
However, policies trained in isolation may still achieve suboptimal performance when placed together.
In this paper, we inspect one such instance in the domain of hardware caching -- for the policies of cache replacement and prefetching.
We argue that these two policies are bidirectionally interdependent and make the case for training the two \textit{jointly}.
We propose a joint learning approach based on developing shared representations for the features used by the two policies.
We present two approaches to develop these shared representations, one based on a joint encoder and another based on contrastive learning of the embeddings, and demonstrate promising preliminary results for both of these.
Finally, we lay down an agenda for future research in this direction. (paper)🔽 A Joint Learning Approach to Hardware Caching and Prefetching. Samuel Yuan, Divyanshu Saxena, Jiayi Chen, Nihal Sharma, Aditya Akella.
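A minimal sketch of the joint-encoder variant of the shared representation, assuming placeholder feature and output dimensions and a stand-in loss; it illustrates only the shared-encoder idea, not the paper's training setup or the contrastive alternative.

```python
import torch
import torch.nn as nn

class SharedCachePolicy(nn.Module):
    """One shared encoding of access-history features feeds both policy heads."""
    def __init__(self, feat_dim=32, hidden=128, num_deltas=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        self.replacement_head = nn.Linear(hidden, 1)        # eviction / reuse-distance score
        self.prefetch_head = nn.Linear(hidden, num_deltas)  # distribution over address deltas

    def forward(self, x):
        z = self.encoder(x)
        return self.replacement_head(z).squeeze(-1), self.prefetch_head(z)

model = SharedCachePolicy()
feats = torch.randn(8, 32)                        # a batch of per-access feature vectors
evict_score, delta_logits = model(feats)
loss = evict_score.pow(2).mean() + nn.functional.cross_entropy(
    delta_logits, torch.randint(0, 64, (8,)))     # stand-in joint objective
loss.backward()
```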
Image compression and reconstruction are crucial for various digital applications. While contemporary neural compression methods achieve impressive compression rates, the adoption of such technology has been largely hindered by the complexity and large computational costs of the convolution-based decoders during data reconstruction. To address the decoder bottleneck in neural compression, we develop a new compression-reconstruction framework based on incorporating low-rank representation in an autoencoder with vector quantization. We demonstrated that performing a series of computationally efficient low-rank operations on the learned latent representation of images can efficiently reconstruct the data with high quality. Our approach dramatically reduces the computational overhead in the decoding phase of neural compression/reconstruction, essentially eliminating the decoder compute bottleneck while maintaining high fidelity of image outputs. (paper)🔽 Ultra-Efficient Decoding for End-to-End Neural Compression and Reconstruction. Ethan G. Rogers, Cheng Wang.
Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to automatically search for difficult benchmark queries, significantly reducing the amount of manual effort usually required. In preliminary experiments, we show that our approach can generate queries with more than double the optimization headroom compared to existing benchmarks. (paper)🔽 Adversarial Query Synthesis via Bayesian Optimization. Yimeng Zeng, Jeffrey Tao, Haydn Thomas Jones, Natalie Maus, Osbert Bastani, Jacob R. Gardner, Ryan Marcus.
High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task boils down to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads. (paper)🔽 Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC. Ashna Nawar Ahmed, Banooqa Banday, Terry Jones, Tanzima Z. Islam.
The rise of agentic AI workflows unlocks novel opportunities for computer systems design and optimization. However, for specialized domains such as program synthesis, the relative scarcity of HDL and proprietary EDA resources online compared to more common programming tasks introduces challenges, often necessitating task-specific fine-tuning, high inference costs, and manually-crafted agent orchestration. In this work, we present VeriMaAS, a multi-agent framework designed to automatically compose agentic workflows for RTL code generation. Our key insight is to integrate formal verification feedback from HDL tools directly into workflow generation, reducing the cost of gradient-based updates or prolonged reasoning traces. Our method improves synthesis performance by 5–7% for pass@k over fine-tuned baselines, while requiring only a few hundred "training" examples, representing an order-of-magnitude reduction in supervision cost. (paper)🔽 Automated Multi-Agent Workflows for RTL Design. Amulya Bhattaram, Janani Ramamoorthy, Ranit Gupta, Diana Marculescu, Dimitrios Stamoulis.
Large Language Models (LLMs) are increasingly deployed in real-world systems, with Retrieval-Augmented Generation (RAG) a dominant production workload. Yet LLM deployments are energy-intensive, as inference accounts for over 90% of the model lifecycle in cloud workloads. We show that RAG workflows with near-identical accuracy can differ drastically in energy consumption—a property we call “workflow fungibility.” For example, pairing Llama3-8B with stronger retrievers matches the accuracy of Llama3-70B while using over 5× less energy. To study this effect, we profile retrieval and generation configurations across FinanceBench and FRAMES, mapping the joint accuracy–energy landscape. Our results reveal configurations within ≤3% accuracy that differ by up to 20.2× in energy, exposing large hidden opportunities for efficiency. We further demonstrate that lightweight regressors can predict accuracy from a small set of configuration knobs, enabling prediction-guided pruning of the design space. These findings establish workflow fungibility as a key lever for sustainable RAG, and point toward systematic, energy-aware configuration as a critical direction for retrieval-based LLM systems. (paper)🔽 Towards Automatically Optimizing Retrieval Augmented AI Systems. Melissa Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia.
As LLM inference shifts to multi-tenant GPU clusters, co-batching improves throughput but obscures per-tenant usage and limits control. Enabling fractional sharing of the inference engine requires a real-time, per-request attribution primitive that is accurate and light enough to run inside the scheduling loop. We present LLMVisor, a roofline-guided latency attribution model that captures the memory-bound and compute-bound phases via a concise piecewise-linear form over features proportional to FLOPs and memory I/O traffic. LLMVisor decomposes batch latency into additive, per-request shares and runs efficiently at microsecond (µs) scale. We evaluate LLMVisor across Llama3.1-8B and Qwen2.5-14B/32B on A100/H100 under varying tensor parallelism and workload mixes. Compared to a token-count baseline, LLMVisor attains near-perfect R² and reduces relative error by up to 2.5×/3.3× (p90/p99) for prefill and 3.5×/4.4× for decode, despite batching variability and sequence divergence. (paper)🔽 LLMVisor: A Real-Time Latency Attribution Model for Multi-Tenant LLM Serving. Shuowei Jin, Xueshen Liu, Jiaxin Shan, Le Xu, Tieying Zhang, Liguang Xie, Zhuoqing Mao.
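A minimal sketch of a roofline-style, piecewise-linear latency fit with additive per-request attribution, on synthetic data for a single regime; the feature definitions and the handling of the prefill/decode split are assumptions, not LLMVisor's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(flops, bytes_, latency):
    """Fit latency ~= a*FLOPs + b*bytes + c for one regime (e.g., decode batches)."""
    X = np.column_stack([flops, bytes_, np.ones_like(flops)])
    coef, *_ = np.linalg.lstsq(X, latency, rcond=None)
    return coef

# Synthetic profiling data standing in for per-batch features and measured latency.
flops = rng.uniform(1e9, 1e11, 200)
bytes_ = rng.uniform(1e8, 1e10, 200)
latency = 2e-12 * flops + 5e-11 * bytes_ + 1e-3 + rng.normal(0, 1e-4, 200)
a, b, c = fit_linear(flops, bytes_, latency)

def attribute(requests, a, b, c):
    """Split a batch's predicted latency into additive per-request shares."""
    contrib = np.array([a * f + b * m for f, m in requests])
    total = contrib.sum() + c
    shares = contrib + c * contrib / contrib.sum()   # spread the constant proportionally
    return shares, total

shares, total = attribute([(2e10, 3e9), (5e9, 8e8)], a, b, c)
print(total, shares)
```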
Large language models (LLMs) achieve strong performance, yet inference is still bounded by trade-offs between efficiency and accuracy.
While quantization cuts memory and latency, it fails to flexibly accommodate heterogeneous inputs.
We introduce Query-Aware Quantization (QAQ), a dynamic-precision scheme that decomposes model weights into bit-planes, employs a trainable router for query-conditioned precision selection, and supports on-demand CPU$\leftrightarrow$GPU loading.
On Qwen3 and LLaMA-3.1, QAQ matches the accuracy of 8-bit baselines while reducing GPU memory footprint, with an associated latency overhead.
These results suggest that QAQ offers a practical operating point on the efficiency–accuracy frontier for LLM inference. (paper)🔽 QAQ: Query-adaptive Mixed-precision Quantization for Large Language Models. Shuxing Li, Huanrong Liu, Zelin Wang, Ruoyang Du, S H Lee, Chunlin Tian, Qingbiao Li.
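A minimal sketch of bit-plane decomposition for int8-quantized weights, the storage layout that makes query-conditioned precision selection possible; the MSB-first plane ordering and reconstruction rule are assumptions, and the trainable router itself is omitted.

```python
import numpy as np

def to_bitplanes(w_int8: np.ndarray):
    """Split a signed int8 tensor into 8 binary planes, most significant first."""
    u = w_int8.astype(np.int32) + 128                  # shift to unsigned range [0, 255]
    return [(u >> b) & 1 for b in range(7, -1, -1)]

def from_bitplanes(planes, keep: int) -> np.ndarray:
    """Reconstruct using only the top `keep` planes (lower planes loaded on demand)."""
    u = np.zeros(planes[0].shape, dtype=np.int32)
    for i in range(keep):
        u += planes[i] << (7 - i)
    return (u - 128).astype(np.int8)

w = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
planes = to_bitplanes(w)
recon = from_bitplanes(planes, keep=4)                 # e.g., router chose 4-bit precision
print(np.abs(w.astype(np.int32) - recon.astype(np.int32)).max())  # error from dropped planes
```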
Learned query optimizers struggle to generalize, causing performance regressions for a subset of queries. To address this, DataSwift is introduced, a hint-recommendation framework that integrates LLM-derived SQL embeddings, GNN-encoded plan representations, a similarity-threshold memory cache, and Thompson-sampling bandit exploration. Incoming queries are embedded to recall proven hints; a low-rank inductive matrix completion model predicts expected latency. Validated hints are cached and the bandit down-weights any hints inducing slowdowns. On the combined JOB benchmarks, DataSwift incurs only a 0.7% regression rate with zero catastrophic regressions and delivers a 1.4x improvement on the 5% slowest queries. Thus, DataSwift provides performance gains without sacrificing safety. (paper)🔽 DataSwift: Smart Choices for Safe Query Optimization. Raahim Lone.
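A minimal sketch of the Thompson-sampling layer alone, with Beta-Bernoulli posteriors over hypothetical hint sets and simulated regression feedback; the embedding recall, matrix-completion latency predictor, and similarity cache described above are omitted.

```python
import random

class HintBandit:
    """Beta-Bernoulli Thompson sampling over candidate hint sets."""
    def __init__(self, hints):
        self.stats = {h: [1.0, 1.0] for h in hints}   # Beta(alpha, beta) prior per hint set

    def choose(self) -> str:
        samples = {h: random.betavariate(a, b) for h, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, hint: str, regressed: bool) -> None:
        a, b = self.stats[hint]
        # Success = the hinted plan did not regress versus the default plan.
        self.stats[hint] = [a + (0 if regressed else 1), b + (1 if regressed else 0)]

bandit = HintBandit(["no_hint", "disable_nestloop", "enable_hashjoin_only"])  # illustrative names
for _ in range(100):
    h = bandit.choose()
    regressed = random.random() < {"no_hint": 0.2, "disable_nestloop": 0.1,
                                   "enable_hashjoin_only": 0.4}[h]            # simulated feedback
    bandit.update(h, regressed)
print(bandit.stats)   # posterior counts; low-regression hints accumulate more successes
```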