Accepted Papers
Large language model (LLM) agents have demonstrated strong potential for improving the performance of complex computer systems, such as cluster scheduling, network congestion control, and adaptive video streaming. However, in the absence of a standard, safe, and extensible benchmarking platform, it is difficult to evaluate whether these LLM agents improve real-world system performance, and by how much.
We present InfraGym, an open, extensible platform where researchers can study computer system optimization with LLM agents.
Our current release includes three real-world cases and supports interaction with both simulated and real environments. We benchmark multiple LLM agents on these tasks using both open-source and closed-source LLMs, and outline future directions. The code is available at https://github.com/MLSysOps/InfraGym. (paper)🔽 InfraGym: Empowering LLM Agents for Real-World Computer System Optimization. Huaizheng Zhang, Lei Zhang, Yuanming Li, Yizheng Huang, Xiaotong Yang, Kuntai Du, Yihua Cheng, Junchen Jiang, Wencong Xiao.
Classical machine-learning auto-tuners for OS control struggle with semantic gaps, brittle rewards, and unsafe exploration.
We introduce an online, LLM-driven agent that emulates expert reasoning for continuous OS optimization.
When tuning the Linux Completely Fair Scheduler’s hyperparameters, the agent outperforms Bayesian optimization by 5\% in single-parameter tuning, 7.1\% in two-parameter co-tuning, and a human expert by 2.98\% overall, while converging faster and adapting more quickly to workload changes.
When application counters are unavailable, system-level proxies (e.g., Instructions Per Cycle (IPC)) preserved tail latency in our setup.
Putting this together, we propose adopting the Model Context Protocol (MCP) for tool/resource discovery, invocation, and logging; on top of that, we propose adding transactional apply-commit-revert, host-mediated approval gates, and policy controls in the OS-tuning server and host to ensure safe, auditable operation. Our results and reference design suggest a practical path toward safe, self-adapting OS control. (paper)🔽 An Expert in Residence: LLM Agents for Always-On Operating System Tuning. Georgios Liargkovas, Vahab Jabrayilov, Hubertus Franke, Kostis Kaffes.
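For illustration, a minimal sketch of the transactional apply-commit-revert step described above, assuming a root shell, a caller-supplied workload probe, and an example sysctl knob name; it is not the authors' implementation.

```python
import subprocess
from typing import Callable

# Example knob name only; the real agent would pick knobs from its tool/resource discovery.
KNOB = "kernel.sched_migration_cost_ns"

def read_knob(knob: str) -> str:
    return subprocess.check_output(["sysctl", "-n", knob], text=True).strip()

def write_knob(knob: str, value: str) -> None:
    subprocess.check_call(["sysctl", "-w", f"{knob}={value}"])

def apply_commit_revert(knob: str, candidate: str,
                        probe: Callable[[], float], tolerance: float = 1.05) -> bool:
    """Apply a candidate value; keep it only if the workload probe does not regress."""
    baseline_value = read_knob(knob)
    baseline = probe()                      # e.g., measured p99 latency under the workload
    write_knob(knob, candidate)             # apply
    if probe() <= tolerance * baseline:
        return True                         # commit: leave the new value in place
    write_knob(knob, baseline_value)        # revert
    return False
```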
Traditional auto-parallelizing compilers, which depend on rigid heuristics, face challenges with the complexity of modern heterogeneous systems. This paper introduces a detailed evaluation of auto-parallelization driven by small (1B-parameter) Language Models (LLMs) for compilers. We assess three models (gemma3, llama3.2, and qwen2.5) employing six reasoning strategies on 11 real-world kernels from scientific computing, graph algorithms, and machine learning. Our system is compared against strong compiler baselines such as LLVM Polly, TVM, and Triton. Across 376 evaluations, our LLM-driven method achieves an average speedup of 6.81x and a maximum speedup of 43.25x on convolution operations. We examine scalability, confirm correctness using multiple sanitizers, and validate robustness across various compilers and hardware. Our results show that small, efficient LLMs can act as effective reasoning engines for intricate compiler optimization tasks. (paper)🔽 Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems. Prathamesh Devadiga.
The sustainable control of geo-distributed datacenters is a critical systems challenge, defined by large-scale, dynamic, and uncertain operating conditions. While specialized numerical experts, such as those from Reinforcement Learning (RL) or Model Predictive Control (MPC), can be trained to find optimal control policies, their practical deployment is blocked by fundamental systems-level flaws: they are brittle, failing to scale with the system; opaque, preventing operator trust; and rigid, unable to adapt to new runtime objectives.
This paper introduces a novel framework that directly addresses these issues by distilling the policy of a numerical expert into an adaptive LLM agent. Our method transforms the expert's opaque logic into a transparent, interactive, and agentic workflow. To validate this approach, we distill a state-of-the-art RL policy for carbon-aware workload orchestration.
Evaluated in a high-fidelity simulation, our resulting LLM agent demonstrates the capabilities essential for real-world systems deployment. It solves the scalability problem, successfully managing topologies more than three times larger than the expert's training environment. It enables true runtime adaptability, altering its strategy in minutes in response to complex operator commands that would require days of costly retraining for the original expert. By making powerful optimizers manageable and resilient, our work offers a practical pathway to the sustainable control of large-scale computer systems. (paper)🔽 Sustainable Control of Geo-Distributed Datacenters by Distilling Numerical Experts into Adaptive LLM Agents. Antonio Guillen-Perez, Ashwin Ramesh Babu, Sahand Ghorbanpour, Avisek Naug, Vineet Gundecha, Sifat Muhammad Abdullah, Ricardo Luna Gutierrez, Soumyendu Sarkar.
Efficient thermal management is a major bottleneck in scaling high-performance computing (HPC) systems, where cooling accounts for a substantial share of total energy use. Liquid-cooled cold plates are increasingly adopted in data centers and power electronics, yet their design optimization remains costly due to computationally burdensome computational fluid dynamics (CFD) simulations and high-dimensional geometric spaces. We introduce a physics-informed neural network (PINN) framework for rapid thermal analysis and design exploration of parameterized cold plates. Our approach jointly solves the incompressible Navier–Stokes and conjugate heat transfer equations, leveraging a two-stage curriculum that first stabilizes liquid flow field learning before introducing thermal coupling. Once trained, the model produces physically consistent predictions and orders-of-magnitude faster inference than conventional CFD solvers. We demonstrate the framework across multiple cold plate topologies, capturing design-dependent flow patterns and thermal gradients that inform geometry–performance trade-offs. These results establish PINNs as a promising surrogate modeling tool for accelerating liquid-cooling design workflows, with implications for reducing the energy and carbon footprint of HPC infrastructure. (paper)🔽 ML-Guided Cold Plate Design and Thermal Analysis for Liquid-Cooled HPC Servers. Refik Mert Cam, Avisek Naug, Andrew E. Shao, Soumyendu Sarkar.
While agentic AI systems perform impressively on emerging capability benchmarks, existing performance evaluation suites focus on non-agentic workloads, leaving a critical gap in understanding system efficiency for multi-step, tool-using agents. We present the Agentic Bridge Framework for extracting actionable performance insights from capability evaluations through trace-level telemetry. Applying this framework to a multi-agent system on GAIA validation, we reveal that: (1) pass@N strategies provide diminishing accuracy returns; (2) search agents dominate token usage and latency, identifying web data gathering as the primary bottleneck; (3) reasoning models spend more tokens on context preservation than actual reasoning, highlighting costly inter-agent communication overhead. These findings inform critical design choices—context engineering, tool-use optimization, and phase-aware resource allocation—and illustrate how agent traces can inform reproducible performance workloads, bridging capability achievements with systems optimization for efficient agentic AI. (paper)🔽 Agentic Bridge Framework: Closing the Gap Between Agentic Capability and Performance Benchmarks. Yun Du, Rubens Lacouture, Qizheng Zhang, Genghan Zhang, Tian Zhao, Kunle Olukotun.
Distributed LLM inference requires careful coordination of parallelization strategies across hundreds to thousands of NPUs to meet production SLOs. Current systems like Megatron-LM rely on static heuristics that separately configure parallelism degrees and per-operator sharding dimensions, leaving significant performance on the table as models scale and hardware topologies diversify. We introduce Learn to Shard, to our knowledge, the first RL-based approach to co-optimize both coarse-grained parallelism degrees and fine-grained per-operator sharding dimensions for distributed LLM inference. Our method employs an attention-based policy over an elite history that learns from high-performing strategies to efficiently navigate the vast combinatorial search space. Evaluated on H100 clusters with MoE models up to 1.6T parameters, Learn to Shard achieves up to 3.5$\times$ throughput improvement over metaheuristic baselines and 1.06$\times$ over Megatron heuristics. (paper)🔽 Learning to Shard: RL for Co-optimizing the Parallelism Degrees and Per-operator Sharding Dimensions in Distributed LLM Inference. Ruokai Yin, Sattwik Deb Mishra, Xuan Zuo, Hokchhay Tann, Preyas Shah, Apala Guha.
We describe the development of a specialized code-completion solution for hardware designers in a large enterprise. It handles their specific flavor of System Verilog, and uses a low-latency on-prem fine-tuned model. We outline the process of developing this solution, from data curation, through several stages of model fine-tuning with different contexts, to evaluation and real-time confidence assessment. We then present our results for fine-tuning a 1B-parameter model on ∼1B tokens of in-domain System Verilog code, achieving high semantic fidelity and low latency for both end-of-line and multi-line completions. Our results demonstrate that small, specialized models can satisfy the latency and privacy requirements of enterprise deployment, offering a viable alternative to general-purpose LLMs in constrained settings. (paper)🔽 Small, Fast, and Certain: Developing a Specialized Verilog Code Completion Solution for the Enterprise. Eran Avidan, Lior Tondovsky, Amitai Armon.
In multi-GPU Mixture-of-Experts (MoE) networks, distributing experts across GPUs leads to load imbalance as token assignments vary. Recent methods address this by duplicating popular experts on additional GPUs, requiring accurate prediction of token distributions before routing. This paper examines the tradeoffs between prediction strategy, accuracy, overhead, and system performance. We introduce MoE-GPS, a framework that quantifies these impacts and identifies optimal predictor designs for various system settings. Our results highlight Distribution-Only Prediction, which predicts coarse token distribution with much lower overhead than Token-to-Expert Prediction, achieving 23\% faster inference on the Mixtral 8×7B MMLU dataset. (paper)🔽 MoE-GPS: Guidelines for Prediction Strategy with Expert Duplication in MoE Load Balancing. Haiyue Ma, Zhixu Du, Yiran Chen.
Count-Min Sketch (CMS) is a memory-efficient data structure for estimating the frequency of elements in a multiset.
Learned Count-Min Sketch (LCMS) enhances CMS with a machine learning model to reduce estimation error under the same memory usage, but suffers from slow construction due to empirical parameter tuning and lacks theoretical guarantees on intolerable error probability.
We propose Optimized Learned Count-Min Sketch (OptLCMS), which partitions the input domain and assigns each partition to its own CMS instance, with CMS parameters $(\epsilon, \delta)$ analytically derived for fixed thresholds, and thresholds optimized via dynamic programming with approximate feasibility checks. This reduces the need for empirical validation, enabling faster construction while providing theoretical guarantees under these assumptions.
OptLCMS also allows explicit control of the allowable error threshold, improving flexibility in practice.
Experiments show that OptLCMS builds faster, achieves lower intolerable error probability, and matches the estimation accuracy of LCMS. (paper)🔽 Optimized Learned Count-Min Sketch. Kyosuke Nishishita, Atsuki Sato, Yusuke Matsui.
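For readers unfamiliar with the underlying data structure, a minimal sketch of a Count-Min Sketch whose width and depth are derived analytically from $(\epsilon, \delta)$, followed by a schematic of the per-partition layout; the partition thresholds and parameter values below are hypothetical, not those produced by OptLCMS.

```python
import math
import random

class CountMinSketch:
    """Plain Count-Min Sketch with width/depth derived analytically from (epsilon, delta)."""
    def __init__(self, epsilon: float, delta: float, seed: int = 0):
        self.width = math.ceil(math.e / epsilon)        # overestimate <= epsilon * total count
        self.depth = math.ceil(math.log(1.0 / delta))   # ... with probability >= 1 - delta
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _col(self, row: int, item) -> int:
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item) -> int:
        return min(self.table[row][self._col(row, item)] for row in range(self.depth))

# Schematic of the partitioned layout: each partition of the input domain gets its own CMS
# with its own analytically chosen (epsilon, delta). Threshold and parameters are hypothetical.
def make_partitioned(threshold: int):
    sketches = {"heavy": CountMinSketch(0.01, 0.001), "light": CountMinSketch(0.001, 0.001)}
    def pick(predicted_frequency: int) -> CountMinSketch:
        return sketches["heavy"] if predicted_frequency >= threshold else sketches["light"]
    return pick
```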
The matrix multiplications that comprise the bulk of computation in deep learning are being performed in increasingly narrow-precision formats. For example, next-generation AI accelerators support dot products in MXFP4, a format requiring only 4.25 bits per element. However, accelerators' performance for low-precision matrix multiplication far outstrips their performance on reductions and elementwise computations that are still performed in higher precision. In this work, we reduce the cost of normalising tensors by approximating the RMSNorm of an MXFP tensor using only the MX block scales, thereby enabling a 32x decrease in the size of reductions needed for normalisation. We validate our approximation on pre-training of Llama-3 models of 250M and 1B parameters, finding minimal loss of training accuracy compared to a baseline using RMSNorm with MXFP8 matmuls. (paper)🔽 MXNorm: Reusing block scales for efficient tensor normalisation. Callum McLean, Luke Yuri Prince, Alexandre Payot, Paul Balanca, Carlo Luschi.
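One plausible form of such a block-scale approximation, shown below as an assumption on our part rather than the paper's exact formula: since every element in an MX block is its shared scale times a small mantissa, the tensor RMS can be estimated from the block scales alone, up to a constant absorbing the average mantissa magnitude.

```python
import numpy as np

BLOCK = 32  # MX block size

def mx_quantize(x: np.ndarray):
    """Toy MX-style quantization: one power-of-two scale per block of 32 elements."""
    blocks = x.reshape(-1, BLOCK)
    scales = 2.0 ** np.ceil(np.log2(np.max(np.abs(blocks), axis=1) + 1e-12))
    q = blocks / scales[:, None]          # quantized mantissas, |q| <= 1
    return q, scales

def rms_exact(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x ** 2)))

def rms_from_scales(scales: np.ndarray, c: float) -> float:
    # Estimate the tensor RMS from block scales alone: per-block RMS is roughly
    # proportional to the block scale, with c absorbing the average mantissa RMS.
    return float(c * np.sqrt(np.mean(scales ** 2)))

x = np.random.randn(4096).astype(np.float32)
q, scales = mx_quantize(x)
c = rms_exact(q)  # in practice a fixed constant; estimated here only for illustration
print(rms_exact(x), rms_from_scales(scales, c))
```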
Optimizing large-language model (LLM) training on distributed domain-specific accelerator systems presents significant challenges due to its complex optimization space. Existing optimization methods, however, rely on time-consuming manual tuning or resource-intensive black-box searches, which struggle to keep pace with the rapidly evolving LLM domain, leading to slow development and underutilized resources. To address this, we introduce $\textbf{ASAP}$, an $\textbf{A}$gentic $\textbf{S}$olution to $\textbf{A}$uto-optimize $\textbf{P}$erformance of Large-Scale LLM Training. It is a multi-agent system, featuring Coordinator, Analyzer, and Proposal agents, which integrates LLM reasoning with insights from performance profiling tools, roofline analysis, and a knowledge base of best practices and successful past optimizations from human experts. Our proposed design can automate the diagnosis of performance bottlenecks and recommend optimized sharding configurations with reasoning, thus effectively improving the efficiency of distributed LLM training. Experiments have shown that the ASAP-generated sharding configurations can contribute up to 28\% training step time reduction and 1.43$\times$ throughput improvement. When combined with additional optimization from human experts, throughput can be further increased to 2.58$\times$. The proposed ASAP promises to provide a scalable and explainable methodology for AI-assisted performance engineering in large-scale LLM training. (paper)🔽 ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training. Yuran Ding, Xinwei Chen, Xiaofan Zhang, Zongwei Zhou.
Large language models (LLMs) have shown progress in GPU kernel performance engineering using inefficient search-based methods that optimize around runtime. Existing approaches lack a key characteristic that human performance engineers rely on for near-optimal utilization: hardware-awareness. By leveraging the workload's specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, we can make software-level optimizations that are tailored to the underlying hardware. SwizzlePerf automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness.
For a GEMM kernel, SwizzlePerf takes less than 5 minutes to generate the same hardware-specific optimal swizzling pattern that took expert performance engineers 2 weeks to find. On a suite of 10 diverse ML and Science kernels, SwizzlePerf can generate swizzling patterns for 9 of the kernels that achieve up to a 2.06x speedup and 70% improvement in L2 hit rate. This work is the first of many steps toward systematically creating hardware-aware LLM performance engineering agents. (paper)🔽 SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization. Arya Tschand, Kesavan Ramakrishnan, Muhammad A. Awad, Ryan Swann, Jeffrey Jian Ma, Keith Lowery, Vijay Janapa Reddi.
High-performance computing (HPC) systems rely on job schedulers like Slurm to allocate compute resources to submitted workloads. Recently, machine learning models have been used to predict job runtimes, which schedulers can use to optimize utilization. However, many of these models struggle to effectively encode string-type job features, typically relying on integer-based Label or One-hot encoding methods. In this paper, we use Transformer-based large language models, particularly Sentence-BERT (SBERT), to semantically encode job features for regression-based job runtime prediction. Using a 90,000-record, 169-feature Slurm dataset, we evaluate four SBERT variants and compare them against traditional encodings using four regression models. Our results show that SBERT-based encodings, especially using the all-MiniLM-L6-v2 model, substantially outperform conventional methods, achieving an R² score of up to 0.88, 2.3× higher than the traditionally used Label encoding. Moreover, we highlight practical trade-offs, such as model memory size versus accuracy, to guide the selection of efficient encoders for production HPC systems. (paper)🔽 Leveraging Large Language Models to Enhance Machine-Learning-Driven HPC Job Scheduling. Kshitij Bhardwaj, Torrey Wagner, Edgar A. Leon.
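A minimal sketch of the encoding pipeline described above, using the sentence-transformers and scikit-learn APIs but hypothetical job records and feature choices; the paper's full 169-feature setup and model comparison are not reproduced here.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestRegressor

# Hypothetical job records; real Slurm logs have many more fields.
jobs = [
    {"job_name": "lammps_md_run", "partition": "gpu", "account": "matsci", "runtime_s": 7200},
    {"job_name": "vasp_relax", "partition": "cpu", "account": "chem", "runtime_s": 3600},
    {"job_name": "pytorch_train_resnet", "partition": "gpu", "account": "mlgroup", "runtime_s": 10800},
]

# Concatenate string-typed features into one sentence per job and embed with SBERT.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
texts = [f'{j["job_name"]} {j["partition"]} {j["account"]}' for j in jobs]
X = encoder.encode(texts)                      # (n_jobs, 384) dense embeddings
y = np.array([j["runtime_s"] for j in jobs])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(encoder.encode(["gromacs_md gpu matsci"])))  # predicted runtime (s)
```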
Large Language Model (LLM)-enabled agents are rapidly emerging across a wide range of applications, but their deployment introduces vulnerabilities with security implications. While prior work has examined prompt-based attacks (e.g., prompt injection) and data-oriented threats (e.g., data exfiltration), time-of-check to time-of-use (TOCTOU) vulnerabilities remain largely unexplored in this context. TOCTOU arises when an agent validates external state (e.g., a file or API response) that is later modified before use, enabling practical attacks such as malicious configuration swaps or payload injection. In this work, we present the first study of TOCTOU vulnerabilities in LLM-enabled agents. We introduce TOCTOU-Bench, a benchmark with 66 realistic user tasks designed to evaluate this class of vulnerabilities. As countermeasures, we adapt detection and mitigation techniques from systems security to this setting and propose prompt rewriting, state integrity monitoring, and tool-fusing. Our study highlights challenges unique to agentic workflows, where we achieve up to 25% detection accuracy using automated detection methods, a 3% decrease in vulnerable plan generation, and a 95% reduction in the attack window. When combining all three approaches, we reduce the TOCTOU vulnerabilities in executed trajectories from 12% to 8%. Our findings open a new research direction at the intersection of AI safety and systems security. (paper)🔽 Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents. Derek Lilienthal, Sanghyun Hong.
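As a flavor of the state-integrity-monitoring countermeasure, a minimal sketch that pins a content hash at time-of-check and re-verifies it at time-of-use; the file name, config content, and surrounding agent flow are assumptions, not the paper's benchmark tasks.

```python
import hashlib
import tempfile
from pathlib import Path

def fingerprint(path: Path) -> str:
    """Content hash recorded at time-of-check."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def checked_use(path: Path, expected_digest: str) -> bytes:
    """Re-read immediately before use and abort if the content changed since the check."""
    data = path.read_bytes()
    if hashlib.sha256(data).hexdigest() != expected_digest:
        raise RuntimeError(f"TOCTOU violation: {path} changed between check and use")
    return data  # act only on the bytes that were just verified

# Demo with a temporary stand-in for an agent's config file (hypothetical content).
with tempfile.TemporaryDirectory() as d:
    cfg = Path(d) / "agent_config.yaml"
    cfg.write_text("tool_allowlist: [search, calculator]\n")
    digest = fingerprint(cfg)                 # time-of-check, at planning time
    # ... other tool calls run here; an attacker could try to swap the file ...
    config_bytes = checked_use(cfg, digest)   # time-of-use, at execution time
```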
We present a retrieval system for answering questions about Verilog / System Verilog code bases. Standard vector RAG (retrieval-augmented generation) often fails on hardware description languages due to identifier renaming, coding-style variation, hierarchy, and concurrency. We instead construct knowledge graphs over the code and its LLM-generated explanations and retrieve based on the entities and relations. We achieve this by adapting the GraphRAG package, originally intended for natural language, to our specific code use case. We compare (i) standard semantic retrieval on the explanations, (ii) GraphRAG over code, and (iii) GraphRAG over the explanations. On a corpus of ∼3.5K files and a benchmark of 29 questions, using top-1 file-level recall, the first baseline reaches 31%. GraphRAG consistently outperforms it, achieving 55–59% when utilizing the explanations, and up to 79% when considering retrieved equivalent files. Constructing the graph with GPT-4o-mini worked well without requiring the larger GPT-4o, but GPT-4o was required to answer the queries better. Our results indicate that the suggested graph-based approach can be useful for answering hardware designers' questions about the code base. (paper)🔽 Retrieval on Verilog Repositories: A Knowledge-Graph Based Solution. Adi Szeskin, Itay Lieder, Divyasree Tummalapalli, Amitai Armon.
The rapid growth of large language models (LLMs) in high-performance computing (HPC) data centers necessitates a shift from purely energy-efficient to carbon-aware control for liquid cooling systems. We introduce a novel multi-agent framework that leverages LLM-powered agents to achieve autonomous, carbon-aware thermal management. Our architecture features eight specialized agents coordinated via a hybrid Redis and Model Control Protocol (MCP) backbone for real-time operation. We validate our approach on a high-fidelity digital twin of the Frontier supercomputer's cooling system, focusing on a core contribution: a hybrid Reinforcement Learning (RL) and LLM control strategy. Experimental results show that our `RL $\rightarrow$ LLM` hybrid model significantly outperforms traditional baselines and other LLM configurations, achieving the lowest average blade temperatures (28.29°C) and the lowest carbon emissions (11.1 kg/hr), while maintaining operational stability. This work presents a practical blueprint for deploying agentic AI to create sustainable, efficient, and explainable control systems for complex cyber-physical infrastructure. (paper)🔽 Carbon-Aware RL-LLM Control for Energy-Efficient Liquid-Cooled HPC Data Centers. Avisek Naug, Sahand Ghorbanpour, Ashwin Ramesh Babu, Antonio Guillen-Perez, Vineet Gundecha, Ricardo Luna Gutierrez, Soumyendu Sarkar.
Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application-specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI's role of semantic reasoning ("what to optimize") from the system's role of execution ("how to observe and act"). Implemented as a Model Context Protocol (MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configurations with static and dynamic analysis before deployment. We demonstrate this architecture's power with sched-agent, a multi-agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched_ext infrastructure. Our evaluation shows that SchedCP achieves up to 1.79x performance improvement and 13x cost reduction compared to naive agentic approaches, all while maintaining a high success rate. The code will be open-sourced. (paper)🔽 Towards Agentic OS: An LLM Agent Framework for Linux Schedulers. YUSHENG ZHENG, YanPeng Hu, Wei Zhang, Andi Quinn.
We introduce OptRot, a data-free preprocessing method to learn fusible rotations for post-training quantization of language models. OptRot reduces weight outliers by finding rotations which minimize the element-wise fourth power of the rotated weights. We show how reducing weight outliers can provably improve weight quantization performance and how OptRot rotations can outperform both Hadamard rotations and rotations learned by the data-dependent method SpinQuant. (paper)🔽 OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization. Advait Gadhikar, Riccardo Grazzi, James Hensman.
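A minimal sketch of the data-free objective described above, assuming a simple orthogonal parameterization (matrix exponential of a skew-symmetric matrix) and an Adam optimizer, which may differ from OptRot's actual procedure; the learned rotation is then fused into the weights offline.

```python
import torch

# Find an orthogonal R that minimizes the element-wise fourth power (an outlier proxy)
# of the rotated weights W @ R. W here is a random stand-in for a real weight matrix.
torch.manual_seed(0)
W = torch.randn(512, 256)
A = torch.zeros(256, 256, requires_grad=True)
opt = torch.optim.Adam([A], lr=1e-2)

for step in range(200):
    R = torch.matrix_exp(A - A.T)          # exp of a skew-symmetric matrix is a rotation
    loss = ((W @ R) ** 4).mean()           # data-free fourth-power objective
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    R = torch.matrix_exp(A - A.T)
    print("orthogonality error:", (R.T @ R - torch.eye(256)).abs().max().item())
    print("mean fourth power before/after:", (W ** 4).mean().item(), ((W @ R) ** 4).mean().item())
```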
Approximate Nearest Neighbor Search (ANNS) has recently gained significant attention due to its many applications, such as Retrieval-Augmented Generation.
Such applications require ANNS algorithms that support dynamic data, so the ANNS problem on dynamic data has attracted considerable interest.
However, a comprehensive evaluation methodology for data deletion in ANNS has yet to be established.
This study proposes an experimental framework and comprehensive evaluation metrics to assess the efficiency of data deletion for ANNS indexes under practical use cases.
Specifically, we categorize data deletion methods in graph-based ANNS into three approaches and formalize them mathematically.
The performance is assessed in terms of accuracy, query speed, and other relevant metrics.
Finally, we apply the proposed evaluation framework to Hierarchical Navigable Small World, one of the state-of-the-art ANNS methods, to analyze the effects of data deletion, and propose Deletion Control, a method which dynamically selects the appropriate deletion method under a required search accuracy. (paper)🔽 How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?. Tomohiro Yamashita, Daichi Amagata, Yusuke Matsui.
With the rapid adoption of Large Language Models (LLMs), LLM-adapters have become increasingly common, providing lightweight specialization of large-scale models. Serving hundreds or thousands of these adapters on a single GPU allows request aggregation, increasing throughput, but may also cause request starvation if GPU memory limits are exceeded. To address this issue, this study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation, given heterogeneous adapter and traffic properties. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem and introduce the first Digital Twin capable of reproducing an LLM-adapter serving system, enabling efficient training data generation. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results, while the ML approach predicts optimal numbers of concurrent and parallel adapters with an error of at most 7.2% under heterogeneous, real-world workloads. (paper)🔽 A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving. Ferran Agullo, Joan Oliveras Torra, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Lluis Berral.
Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant costs in inference latency and token usage, with environmental and financial impacts, which are unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems. (paper)🔽 When to Reason: Semantic Router for vLLM. Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen.
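A toy stand-in for the router, using embedding similarity to labeled exemplars rather than the paper's trained classifier; the exemplars, the threshold rule, and the enable_thinking request field are all assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
reasoning_exemplars = encoder.encode([
    "Prove that the sum of two even numbers is even.",
    "A train leaves at 3pm traveling 60 mph; when does it overtake the earlier train?",
], convert_to_tensor=True)
simple_exemplars = encoder.encode([
    "What is the capital of France?",
    "Define the term 'operating system'.",
], convert_to_tensor=True)

def needs_reasoning(query: str) -> bool:
    # Route to reasoning mode if the query is closer to reasoning-style exemplars.
    q = encoder.encode(query, convert_to_tensor=True)
    return bool(util.cos_sim(q, reasoning_exemplars).max() > util.cos_sim(q, simple_exemplars).max())

def build_request(query: str) -> dict:
    # Only enable the (more expensive) reasoning mode when the router predicts a benefit.
    return {"prompt": query, "enable_thinking": needs_reasoning(query)}

print(build_request("What is the capital of France?"))
print(build_request("If x + 2y = 7 and 3x - y = 0, solve for x and y."))
```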
Optimizing sparse machine learning (ML) workloads requires navigating a vast schedule space. Two of the most critical aspects of that design space include which operators to fuse and which loop/dataflow order to use within each fused region.
We present AutoSparse, an LLM-guided autoscheduler atop FuseFlow, a sparse ML compiler, that focuses on fusion grouping and legal dataflow order selection. FuseFlow enumerates legal orders per fused region and exposes a lightweight FLOPs/byte signal; the LLM proposes structured candidates (fusion sets and orders) that we validate and rank before codegen. With backend defaults for blocking and parallelism held fixed, case studies on GCN and GraphSAGE show consistent gains over unfused baselines and parity with hand-tuned/heuristic schedules. Coupling LLM reasoning with FuseFlow's legality guards and roofline-style signals efficiently explores sparse scheduling spaces with minimal human effort. (paper)🔽 LLM-Guided Autoscheduling for Large-Scale Sparse Machine Learning. Rubens Lacouture, Genghan Zhang, Konstantin Hossfeld, Tian Zhao, Kunle Olukotun.
The adoption of machine learning-based techniques for analog integrated circuit layout, unlike its digital counterpart, has been limited by the stringent requirements imposed by electric and problem-specific constraints, along with the interdependence of floorplanning and routing steps.
In this work, we address a prevalent concern among layout engineers regarding the need for readily available routing-aware floorplanning solutions. To this end, we develop an automatic floorplanning engine based on reinforcement learning and a relational graph convolutional neural network, specifically tailored to condition floorplan generation towards more routable outcomes.
A combination of increased grid resolution and precise pin information integration, along with a dynamic routing resource estimation technique, allows balancing routing and area efficiency, eventually meeting industrial standards.
When analyzing the place and route effectiveness in a simulated environment, the proposed approach achieves a 13.8% reduction in dead space, a 40.6% reduction in wirelength and a 73.4% increase in routing success when compared to past learning-based state-of-the-art techniques. (paper)🔽 Advancing Routing-Awareness in Analog ICs Floorplanning. Davide Basso, Luca Bortolussi, Mirjana Videnovic-Misic, Husni Habal.
Large-scale training jobs, especially those utilizing GPU clusters, are vulnerable to various failure modes, including individual hardware faults, network issues, and software-level problems. These failures can lead to significant downtime, wasted computational resources, and delays in research or production workflows. We propose an ML-based forecasting algorithm for predicting the health status of GPU clusters. Through extensive ablation studies, we found that cascading 1D CNNs achieve the best performance. The model leverages time-series data representing various cluster metrics, such as temperature, power consumption, and resource utilization, to predict cluster failures, enabling proactive maintenance and resource optimization. By tuning differently per use case, the model achieves an overall PRAUC of 0.90, with precision and recall of 0.99 and 0.90, respectively. This work is motivated by the need to improve the reliability and efficiency of large-scale training jobs that are susceptible to hardware and software failures. (paper)🔽 Forecasting machine degradation of GPU Clusters. Shengnan Cai, Shuxin Nie, Zhehui Chen, Nupur Gulalkari, George Vanica, Chetna Jain, Sethuraman Sankaran.
We present NetGent, an AI-agent framework for automating complex application workflows to generate realistic network traffic datasets. Developing generalizable ML models for networking requires data collection from network environments with traffic that results from a diverse set of real-world web applications. However, making such data collection diverse, repeatable, realistic, and efficient with existing browser automation tools remains fragile and costly. NetGent addresses this challenge by allowing users to specify workflows as natural-language rules that define state-dependent actions. These abstract specifications are compiled into nondeterministic finite automata (NFAs), which a state synthesis component translates into reusable, executable code. This design enables deterministic replay, reduces redundant LLM calls through state caching, and adapts quickly when application interfaces change. In experiments, NetGent automated more than 50 workflows spanning video-on-demand streaming, live video streaming, video conferencing, social media, and web scraping, producing realistic traffic traces while remaining robust to UI variability. By combining the flexibility of language-based agents with the reliability of compiled execution, NetGent provides a scalable foundation for generating the diverse, repeatable datasets needed to advance ML in networking. (paper)🔽 NetGent: Agent-Based Automation of Network Application Workflows. Jaber Daneshamooz, Eugene Vuong, Laasya Koduru, Sanjay Chandrasekaran, Arpit Gupta.
Provenance-based intrusion detection is an increasingly popular application of graphical machine learning in cybersecurity, where system activities are modeled as provenance graphs to capture causality and correlations among potentially malicious actions. Graph Neural Networks (GNNs) have demonstrated strong performance in this setting. However, traditional statically provisioned GNN inference architectures fall short in meeting two crucial demands of intrusion detection: (1) maintaining consistently low detection latency, and (2) handling highly irregular and bursty workloads. To holistically address these challenges, we present GraphFaaS, a serverless architecture tailored for GNN-based intrusion detection. GraphFaaS leverages the elasticity and agility of serverless computing to dynamically scale the GNN inference pipeline. We parallelize and adapt GNN workflows to a serverless environment, ensuring that the system can respond in real time to fluctuating workloads. By decoupling compute resources from static provisioning, GraphFaaS delivers stable inference latency, which is critical for dependable intrusion detection and timely incident response in cybersecurity operations. Preliminary evaluation shows GraphFaaS reduces average detection latency by 85% and coefficient of variation (CV) by 64% compared to the baseline. (paper)🔽 GraphFaaS: Serverless GNN Inference for Burst-Resilient, Real-Time Intrusion Detection. Lingzhi Wang, Vinod Yegneswaran, Xinyi Shi, Ziyu Li, Ashish Gehani, Yan Chen.
Far-memory systems, where applications store less-active data in more energy-efficient memory media, are increasingly adopted by datacenters.
However, applications are bottlenecked by on-demand data fetching from far- to local-memory.
We present $\textbf{\textit{Memix}}$,
a far-memory system that embodies a deep learning–system co-design for efficient and accurate prefetching, minimizing on-demand far-memory accesses.
One key observation is that memory accesses are shaped by both application semantics and runtime context, providing an opportunity to optimize each independently.
Preliminary evaluation of Memix on data-intensive workloads shows that it outperforms the state-of-the-art far-memory system by up to 42%. (paper)🔽 An Early Exploration of Deep-Learning-Driven Prefetching for Far Memory. Yutong Huang, Zhiyuan Guo, Yiying Zhang.
Increasing demand for Large Language Models (LLMs) querying services imposes substantial deployment and computation costs.
LLM routing offers a cost-efficient solution by directing queries to the optimal LLM based on model and query features.
However, existing works focus on offline scenarios and struggle to adapt to online settings with high query volume and constrained token budgets.
In this work, we introduce PORT, the first training-free algorithm designed for online routing scenarios.
Our algorithm leverages approximate nearest neighbor search to efficiently estimate query features and performs a one-time optimization over a small set of initial queries to learn a routing strategy that guides future routing.
We provide theoretical guarantees demonstrating that our algorithm achieves a competitive ratio of $1 - o(1)$ under natural assumptions, which is further validated by extensive experiments across 3 benchmark datasets and 8 baselines, showing an average improvement of **3.55$\times$** in overall performance, **1.85$\times$** in cost efficiency, and nearly **4.25$\times$** in throughput.
Our code is available at https://github.com/fzwark/PORT. (paper)🔽 PORT: Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving. Fangzhou Wu, Sandeep Silwal.
Identifying efficient execution strategies for Large Language Models (LLMs) on specialized hardware accelerators requires exploring a vast design space where exhaustive search is computationally prohibitive. Traditional black-box optimization (BBO) methods offer a principled alternative, but their efficiency degrades in high-dimensional, sparse spaces with many infeasible points. We propose LLM-Box, a framework that integrates an LLM agent to guide multi-objective BBO toward the Pareto frontier while significantly reducing sampling of infeasible points. By leveraging the LLM agent to retrieve and structure prior exploration data through retrieval-augmented generation (RAG), and by warm-starting and filtering BBO suggestions, our approach guides the search towards feasible and promising regions of the design space. As a result, LLM-Box identifies Pareto-optimal configurations with a hypervolume difference of less than 3\% using $40{-}150\times$ fewer simulations than an exhaustive search, and compared to a well-known BBO tool, achieves 2\% better accuracy with $20\times$ fewer trials. Moreover, the framework demonstrates zero-shot generalization, transferring knowledge from prior models and hardware to unseen targets. (paper)🔽 LLM-Box : An Agentic Framework for Guided Black-Box Optimization in Mapping LLMs onto Specialized Hardware Accelerators. Sujay Pandit, Akanksha Jain, Rami Cohen, Zhijie Deng, Sagar Karandikar, Sagi Perel, Anand Raghunathan, Parthasarathy Ranganathan.
Domain-specific hardware accelerators for deep neural network (DNN) inference have been widely adopted. Traditional DNN compression techniques such as pruning and quantization help but can fall short when aggressive hardware efficiency is required. We present \textit{NeuSym-HLS}, a partial symbolic distillation and high-level hardware synthesis flow to compress and accelerate DNN inference for edge computing. NeuSym-HLS replaces a portion of the layers of a trained DNN model with compact analytic expressions obtained via symbolic regression, and generates efficient hardware accelerators. The resulting hardware accelerator for the hybrid DNN-symbolic model provides well-balanced performance across algorithmic accuracy, hardware resources, and inference latency. Our evaluation on vision tasks shows that NeuSym-HLS reduces hardware resource usage and latency while maintaining model inference accuracy. (paper)🔽 NeuSym-HLS: Learning-Driven Symbolic Distillation in High-Level Synthesis of Hardware Accelerators. Chung-Mou Pan, Salma Elmalaki, Yasser Shoukry, Sitao Huang.
Deploying useful Long-Context Transformer Models (LCTMs) requires addressing two key challenges: (1) a growing memory footprint due to quadratic self-attention and linear KV-cache scaling as sequence length increases; (2) the ContextRot phenomenon, where empirical evidence suggests that transformer performance degrades with increasing context length. Given the shared dependency on the input, a natural question arises: "Can we surgically select the most important input chunks for processing to synergistically (a) reduce the memory footprint, and (b) mitigate the ContextRot effects?" In this paper, we answer this question in the affirmative for long-context summarization tasks. We propose APCE as a context-aware solution to select the most important input chunks through low-dimensional semantic similarity matching with the current query. By directly operating on the input, APCE decouples from strict dependency on underlying hardware or CUDA environments, promising a compatible solution scalable to different deployment systems. Our empirical evaluations have demonstrated superior or on-par summarization performance for APCE compared to the full dense baseline using a fraction (50%-70%) of the input sequence, resulting in KV-cache and self-attention memory efficiency improvements. We hope our findings inspire further research on context-aware efficiency solutions for LCTMs geared towards other relevant long-context tasks. (paper)🔽 APCE: Adaptive Progressive Context Expansion for Long Context Processing. Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Jacob Song, Woo Seong Chung.
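A minimal sketch of query-conditioned chunk selection in the spirit of APCE; the word-based chunking, the encoder choice, and the keep ratio are our assumptions, not the paper's exact procedure.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_chunks(document: str, query: str, chunk_tokens: int = 256, keep_ratio: float = 0.6) -> str:
    """Keep only the chunks most semantically similar to the query, in original order."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_tokens]) for i in range(0, len(words), chunk_tokens)]
    q_emb = encoder.encode([query])                      # (1, 384)
    c_embs = encoder.encode(chunks)                      # (n_chunks, 384)
    sims = (c_embs @ q_emb.T).squeeze(-1) / (
        np.linalg.norm(c_embs, axis=1) * np.linalg.norm(q_emb) + 1e-12)
    k = max(1, int(keep_ratio * len(chunks)))
    keep = sorted(np.argsort(-sims)[:k])                 # top-k chunks, original order preserved
    return " ".join(chunks[i] for i in keep)             # shorter input => smaller KV cache
```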
Several learned policies have been proposed to replace heuristics for scheduling, caching, and other system components in modern systems.
By leveraging diverse features, learning from historical trends, and predicting future behaviors, such models promise to keep pace with ever-increasing workload dynamism and continuous hardware evolution.
However, policies trained in isolation may still achieve suboptimal performance when placed together.
In this paper, we inspect one such instance in the domain of hardware caching -- for the policies of cache replacement and prefetching.
We argue that these two policies are bidirectionally interdependent and make the case for training the two \textit{jointly}.
We propose a joint learning approach based on developing shared representations for the features used by the two policies.
We present two approaches to develop these shared representations, one based on a joint encoder and another based on contrastive learning of the embeddings, and demonstrate promising preliminary results for both of these.
Finally, we lay down an agenda for future research in this direction. (paper)🔽 A Joint Learning Approach to Hardware Caching and Prefetching. Samuel Yuan, Divyanshu Saxena, Jiayi Chen, Nihal Sharma, Aditya Akella.
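A minimal sketch of the joint-encoder variant of the shared representation, assuming placeholder feature and output dimensions and a stand-in loss; it illustrates only the shared-encoder idea, not the paper's training setup or the contrastive alternative.

```python
import torch
import torch.nn as nn

class SharedCachePolicy(nn.Module):
    """One shared encoding of access-history features feeds both policy heads."""
    def __init__(self, feat_dim=32, hidden=128, num_deltas=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden), nn.ReLU())
        self.replacement_head = nn.Linear(hidden, 1)        # eviction / reuse-distance score
        self.prefetch_head = nn.Linear(hidden, num_deltas)  # distribution over address deltas

    def forward(self, x):
        z = self.encoder(x)
        return self.replacement_head(z).squeeze(-1), self.prefetch_head(z)

model = SharedCachePolicy()
feats = torch.randn(8, 32)                        # a batch of per-access feature vectors
evict_score, delta_logits = model(feats)
loss = evict_score.pow(2).mean() + nn.functional.cross_entropy(
    delta_logits, torch.randint(0, 64, (8,)))     # stand-in joint objective
loss.backward()
```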
Image compression and reconstruction are crucial for various digital applications. While contemporary neural compression methods achieve impressive compression rates, the adoption of such technology has been largely hindered by the complexity and large computational costs of the convolution-based decoders during data reconstruction. To address the decoder bottleneck in neural compression, we develop a new compression-reconstruction framework based on incorporating low-rank representation in an autoencoder with vector quantization. We demonstrated that performing a series of computationally efficient low-rank operations on the learned latent representation of images can efficiently reconstruct the data with high quality. Our approach dramatically reduces the computational overhead in the decoding phase of neural compression/reconstruction, essentially eliminating the decoder compute bottleneck while maintaining high fidelity of image outputs. (paper)🔽 Ultra-Efficient Decoding for End-to-End Neural Compression and Reconstruction. Ethan G. Rogers, Cheng Wang.
Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to automatically search for difficult benchmark queries, significantly reducing the amount of manual effort usually required. In preliminary experiments, we show that our approach can generate queries with more than double the optimization headroom compared to existing benchmarks. (paper)🔽 Adversarial Query Synthesis via Bayesian Optimization. Yimeng Zeng, Jeffrey Tao, Haydn Thomas Jones, Natalie Maus, Osbert Bastani, Jacob R. Gardner, Ryan Marcus.
High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task boils down to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads. (paper)🔽 Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC. Ashna Nawar Ahmed, Banooqa Banday, Terry Jones, Tanzima Z. Islam.
The rise of agentic AI workflows unlocks novel opportunities for computer systems design and optimization. However, for specialized domains such as program synthesis, the relative scarcity of HDL and proprietary EDA resources online compared to more common programming tasks introduces challenges, often necessitating task-specific fine-tuning, high inference costs, and manually-crafted agent orchestration. In this work, we present VeriMaAS, a multi-agent framework designed to automatically compose agentic workflows for RTL code generation. Our key insight is to integrate formal verification feedback from HDL tools directly into workflow generation, reducing the cost of gradient-based updates or prolonged reasoning traces. Our method improves synthesis performance by 5–7% for pass@k over fine-tuned baselines, while requiring only a few hundred "training" examples, representing an order-of-magnitude reduction in supervision cost. (paper)🔽 Automated Multi-Agent Workflows for RTL Design. Amulya Bhattaram, Janani Ramamoorthy, Ranit Gupta, Diana Marculescu, Dimitrios Stamoulis.
Large Language Models (LLMs) are increasingly deployed in real-world systems, with Retrieval-Augmented Generation (RAG) a dominant production workload. Yet LLM deployments are energy-intensive, as inference accounts for over 90% of the model lifecycle in cloud workloads. We show that RAG workflows with near-identical accuracy can differ drastically in energy consumption—a property we call “workflow fungibility.” For example, pairing Llama3-8B with stronger retrievers matches the accuracy of Llama3-70B while using over 5× less energy. To study this effect, we profile retrieval and generation configurations across FinanceBench and FRAMES, mapping the joint accuracy–energy landscape. Our results reveal configurations within ≤3% accuracy that differ by up to 20.2× in energy, exposing large hidden opportunities for efficiency. We further demonstrate that lightweight regressors can predict accuracy from a small set of configuration knobs, enabling prediction-guided pruning of the design space. These findings establish workflow fungibility as a key lever for sustainable RAG, and point toward systematic, energy-aware configuration as a critical direction for retrieval-based LLM systems. (paper)🔽 Towards Automatically Optimizing Retrieval Augmented AI Systems. Melissa Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia.
As LLM inference shifts to multi-tenant GPU clusters, co-batching improves throughput but obscures per-tenant usage and limits control. Enabling fractional sharing of the inference engine requires a real-time, per-request attribution primitive that is accurate and light enough to run inside the scheduling loop. We present LLMVisor, a roofline-guided latency attribution model that captures the memory-bound and compute-bound phases via a concise piecewise-linear form over features proportional to FLOPs and memory I/O traffic. LLMVisor decomposes batch latency into additive, per-request shares and runs efficiently at microsecond (µs) scale. We evaluate LLMVisor across Llama3.1-8B and Qwen2.5-14B/32B on A100/H100 under varying tensor parallelism and workload mixes. Compared to a token-count baseline, LLMVisor attains near-perfect R² and reduces relative error by up to 2.5×/3.3× (p90/p99) for prefill and 3.5×/4.4× for decode, despite batching variability and sequence divergence. (paper)🔽 LLMVisor: A Real-Time Latency Attribution Model for Multi-Tenant LLM Serving. Shuowei Jin, Xueshen Liu, Jiaxin Shan, Le Xu, Tieying Zhang, Liguang Xie, Zhuoqing Mao.
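A minimal sketch of a roofline-style, piecewise-linear latency fit with additive per-request attribution, on synthetic data for a single regime; the feature definitions and the handling of the prefill/decode split are assumptions, not LLMVisor's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(flops, bytes_, latency):
    """Fit latency ~= a*FLOPs + b*bytes + c for one regime (e.g., decode batches)."""
    X = np.column_stack([flops, bytes_, np.ones_like(flops)])
    coef, *_ = np.linalg.lstsq(X, latency, rcond=None)
    return coef

# Synthetic profiling data standing in for per-batch features and measured latency.
flops = rng.uniform(1e9, 1e11, 200)
bytes_ = rng.uniform(1e8, 1e10, 200)
latency = 2e-12 * flops + 5e-11 * bytes_ + 1e-3 + rng.normal(0, 1e-4, 200)
a, b, c = fit_linear(flops, bytes_, latency)

def attribute(requests, a, b, c):
    """Split a batch's predicted latency into additive per-request shares."""
    contrib = np.array([a * f + b * m for f, m in requests])
    total = contrib.sum() + c
    shares = contrib + c * contrib / contrib.sum()   # spread the constant proportionally
    return shares, total

shares, total = attribute([(2e10, 3e9), (5e9, 8e8)], a, b, c)
print(total, shares)
```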
Large language models (LLMs) achieve strong performance, yet inference is still bounded by trade-offs between efficiency and accuracy.
While quantization cuts memory and latency, it fails to flexibly accommodate heterogeneous inputs.
We introduce Query-Aware Quantization (QAQ), a dynamic-precision scheme that decomposes model weights into bit-planes, employs a trainable router for query-conditioned precision selection, and supports on-demand CPU$\leftrightarrow$GPU loading.
On Qwen3 and LLaMA-3.1, QAQ matches the accuracy of 8-bit baselines while reducing GPU memory footprint, with an associated latency overhead.
These results suggest that QAQ offers a practical operating point on the efficiency–accuracy frontier for LLM inference. (paper)🔽 QAQ: Query-adaptive Mixed-precision Quantization for Large Language Models. Shuxing Li, Huanrong Liu, Zelin Wang, Ruoyang Du, S H Lee, Chunlin Tian, Qingbiao Li.
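A minimal sketch of bit-plane decomposition for int8-quantized weights, the storage layout that makes query-conditioned precision selection possible; the MSB-first plane ordering and reconstruction rule are assumptions, and the trainable router itself is omitted.

```python
import numpy as np

def to_bitplanes(w_int8: np.ndarray):
    """Split a signed int8 tensor into 8 binary planes, most significant first."""
    u = w_int8.astype(np.int32) + 128                  # shift to unsigned range [0, 255]
    return [(u >> b) & 1 for b in range(7, -1, -1)]

def from_bitplanes(planes, keep: int) -> np.ndarray:
    """Reconstruct using only the top `keep` planes (lower planes loaded on demand)."""
    u = np.zeros(planes[0].shape, dtype=np.int32)
    for i in range(keep):
        u += planes[i] << (7 - i)
    return (u - 128).astype(np.int8)

w = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
planes = to_bitplanes(w)
recon = from_bitplanes(planes, keep=4)                 # e.g., router chose 4-bit precision
print(np.abs(w.astype(np.int32) - recon.astype(np.int32)).max())  # error from dropped planes
```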
Learned query optimizers struggle to generalize, causing performance regressions for a subset of queries. To address this, DataSwift is introduced, a hint-recommendation framework that integrates LLM-derived SQL embeddings, GNN-encoded plan representations, a similarity-threshold memory cache, and Thompson-sampling bandit exploration. Incoming queries are embedded to recall proven hints; a low-rank inductive matrix completion model predicts expected latency. Validated hints are cached and the bandit down-weights any hints inducing slowdowns. On the combined JOB benchmarks, DataSwift incurs only a 0.7% regression rate with zero catastrophic regressions and delivers a 1.4x improvement on the 5% slowest queries. Thus, DataSwift provides performance gains without sacrificing safety. (paper)🔽 DataSwift: Smart Choices for Safe Query Optimization. Raahim Lone.
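A minimal sketch of the Thompson-sampling layer alone, with Beta-Bernoulli posteriors over hypothetical hint sets and simulated regression feedback; the embedding recall, matrix-completion latency predictor, and similarity cache described above are omitted.

```python
import random

class HintBandit:
    """Beta-Bernoulli Thompson sampling over candidate hint sets."""
    def __init__(self, hints):
        self.stats = {h: [1.0, 1.0] for h in hints}   # Beta(alpha, beta) prior per hint set

    def choose(self) -> str:
        samples = {h: random.betavariate(a, b) for h, (a, b) in self.stats.items()}
        return max(samples, key=samples.get)

    def update(self, hint: str, regressed: bool) -> None:
        a, b = self.stats[hint]
        # Success = the hinted plan did not regress versus the default plan.
        self.stats[hint] = [a + (0 if regressed else 1), b + (1 if regressed else 0)]

bandit = HintBandit(["no_hint", "disable_nestloop", "enable_hashjoin_only"])  # illustrative names
for _ in range(100):
    h = bandit.choose()
    regressed = random.random() < {"no_hint": 0.2, "disable_nestloop": 0.1,
                                   "enable_hashjoin_only": 0.4}[h]            # simulated feedback
    bandit.update(h, regressed)
print(bandit.stats)   # posterior counts; low-regression hints accumulate more successes
```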