Reliable-AI4Sys

General Scheduling

SOSP'25 COpter: Efficient Large-Scale Resource-Allocation via Continual Optimization

Suhas Jayaram Subramanya, Don Kurian Dennis, Gregory R. Ganger, Virginia Smith

Keywords: Resource-Allocation, MILP.

Motivation: 1) Existing solvers do not scale: they become extremely slow on large problem instances. 2) Existing acceleration techniques cannot deliver both efficiency and optimality. 3) Goal: accelerate solving while preserving optimality.

Design: 1) Solve a standard LP: use a standard LP formulation to unify different problems. 2) Manipulate problem parameters at the system level for acceleration. 3) Replace the slow post-processing that turns relaxed LP solutions into MILP solutions with simple rounding.
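
Sketch for design point 3: a minimal illustration of rounding a fractional (LP-relaxed) assignment into an integral one. This is not COpter's actual procedure. The greedy rule, the function name `round_assignment`, and the job/machine framing are assumptions made for illustration only.

```python
from typing import List


def round_assignment(frac: List[List[float]],
                     demand: List[int],
                     capacity: List[int]) -> List[int]:
    """Greedily round a fractional job-to-machine assignment.

    frac[j][m] is the share of job j that the relaxed LP put on machine m.
    Returns placement[j] = chosen machine index, or -1 if nothing fits.
    """
    remaining = list(capacity)
    placement = [-1] * len(frac)
    # Visit the most "decided" jobs first so near-integral LP choices survive.
    order = sorted(range(len(frac)), key=lambda j: max(frac[j]), reverse=True)
    for j in order:
        # Try machines from largest to smallest fractional share.
        ranked = sorted(range(len(remaining)),
                        key=lambda m: frac[j][m], reverse=True)
        for m in ranked:
            if remaining[m] >= demand[j]:
                placement[j] = m
                remaining[m] -= demand[j]
                break
    return placement


if __name__ == "__main__":
    frac = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    print(round_assignment(frac, demand=[2, 2, 1], capacity=[3, 3]))
```

In the example, job 1 prefers machine 0 but no longer fits there, so it falls back to machine 1. A rounding pass like this is a single sweep over the LP output, which is why it is cheap compared with a full MILP repair step.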

Storage Systems

SOSP'24 Tiered Memory Management: Access Latency is the Key!

Midhul Vuppalapati, Rachit Agarwal

Keywords: Measurement, Page placement

Motivation: 1) Hotness-based placement may not perform as expected: demonstrated by experiments. 2) Placement should minimize expected latency.

Design: 1) Measure expected queueing at low overhead, on the CPU-to-memory datapath. 2) Algorithm (or principle): change the type (hot or alternative) of pages according to the measured latencies.
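
Sketch for the design above: one way to turn queue measurements into a per-tier latency estimate and a placement decision. It assumes two tiers (called "default" and "alternate" here), uses Little's law to derive latency from occupancy and request rate, and adds a hypothetical `tolerance` band to avoid oscillation. The names and the threshold are illustrative assumptions, not the paper's interface.

```python
def little_latency(avg_occupancy: float, arrival_rate: float) -> float:
    """Little's law: mean time in the queue = mean occupancy / arrival rate."""
    if arrival_rate <= 0:
        return 0.0  # no traffic observed, so no queueing delay to report
    return avg_occupancy / arrival_rate


def migration_direction(lat_default_ns: float,
                        lat_alternate_ns: float,
                        tolerance: float = 0.05) -> str:
    """Decide which way pages should move to pull the two latencies together."""
    if lat_default_ns > lat_alternate_ns * (1 + tolerance):
        return "to_alternate"  # default tier is the slower one right now
    if lat_alternate_ns > lat_default_ns * (1 + tolerance):
        return "to_default"    # alternate tier is the slower one right now
    return "hold"              # latencies are close enough; do nothing


if __name__ == "__main__":
    # 40 requests in flight on average, 0.5 requests per ns arriving
    lat = little_latency(avg_occupancy=40.0, arrival_rate=0.5)
    print(lat, migration_direction(lat, 120.0))
```

The point of the sketch is that the decision depends on measured access latency of each tier, not on page hotness alone.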

Networks

SIGCOMM'25 SaTE: Low-Latency Traffic Engineering for Satellite Networks

Hao Wu, Yizhan Han, Mohit Rajpal, Qizhen Zhang, Jingxian Wang

Keywords: Satellite Network, Traffic Engineering, ML Generalization

Motivation: 1) Dynamic: satellite network topologies change frequently. 2) Generalization: a common weakness of NN-based approaches. 3) Large scale: numerous nodes and links.

Design: 1) GNN-only: eliminate the DNN to improve generalizability. 2) Graph pruning based on topology similarity (to an existing baseline topology): improves generalizability. 3) Supervised learning from Gurobi's solutions.
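
Sketch for design point 2: one plausible reading of similarity-based pruning, in which the current topology is compared with a baseline by edge overlap and only the shared links are kept. The Jaccard metric, the function names, and the "keep shared links" rule are assumptions for illustration. SaTE's actual pruning criterion may differ.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


def normalize(edges: Iterable[Edge]) -> Set[Edge]:
    """Treat links as undirected: (a, b) and (b, a) become the same edge."""
    return {tuple(sorted(e)) for e in edges}


def jaccard_similarity(current: Iterable[Edge], baseline: Iterable[Edge]) -> float:
    """Share of links common to both topologies, in [0, 1]."""
    cur, base = normalize(current), normalize(baseline)
    union = cur | base
    return len(cur & base) / len(union) if union else 1.0


def prune_to_baseline(current: Iterable[Edge], baseline: Iterable[Edge]) -> Set[Edge]:
    """Keep only the links of the current topology that the baseline also has."""
    return normalize(current) & normalize(baseline)


if __name__ == "__main__":
    baseline = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]
    current = [("s2", "s1"), ("s2", "s3"), ("s1", "s4")]
    print(jaccard_similarity(current, baseline))
    print(sorted(prune_to_baseline(current, baseline)))
```
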

SIGCOMM'25 Centralium: A Hybrid Route-Planning Framework for Large-Scale Data Center Network Migrations

Yikai Lin, Mohab Gawish (Meta)

Keywords: Centralized Routing, Distributed Routing, Network Migration, Route Planning, BGP

Motivation: 1) Data centers frequently undergo large-scale network migrations. 2) BGP cannot encode the sequential and conditional routing behaviors required during transitional migration phases.

Design: 1) Route Planning Abstraction (RPA). 2) Centralium architecture. 3) Two protection mechanisms.

SIGCOMM'25 From ATOP to ZCube: Automated Topology Optimization Pipeline and a Highly Cost-Effective Network Topology for Large Model Training

Zihan Yan, Dan Li (Tsinghua University)

Keywords: Data center networks, Network topology, AI infrastructure

Motivation: 1) The explosive growth in LLM training scale requires new large-scale network topology designs. 2) Expert-designed topologies overlook potential asymmetric structures and struggle to balance multi-objective performance; existing automated approaches are not mature enough.

Design: 1) Insight-driven hyperparameterization. 2) Multi-objective optimization engine. 3) High-performance evaluation pipeline.

EDA

DAC'21 NVCell: Standard Cell Layout in Advanced Technology Nodes with Reinforcement Learning

Haoxing Ren, Matthew Fojtik

Keywords: Standard Cell Layout, RL, DRC, Placement and Routing

Motivation: 1) Advanced technology nodes face a DRC explosion (2000+ rules) with conditional and multi-pattern correlations that are hard to model analytically. 2) Traditional methods (simulated annealing) suffer from long runtime, variable explosion, and poor scalability. 3) Need: automated layout generation with competitive area and DRC compliance.

Design: 1) Two-stage framework: Placement (simulated annealing + RL + ML routability predictor) + Routing (genetic algorithm + RL DRC fixer). 2) RL for placement: Pre-trained with simulated annealing samples, learns device pairing/ordering to speed up runtime. 3) RL for DRC fixing: Trained on one cell, transferable to all cells by identifying local DRC patterns. 4) ML routability predictor: Two-step (simple + precise) to optimize placement for routing.
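
Sketch for the simulated-annealing part of placement: a toy annealer that orders devices in a single row to shorten nets. It leaves out the RL agent and the routability predictor entirely. The 1-D span cost, the linear cooling schedule, and all names are assumptions for illustration, not NVCell's formulation.

```python
import math
import random
from typing import List, Sequence, Tuple


def wirelength(order: Sequence[str], nets: Sequence[Sequence[str]]) -> int:
    """Sum over nets of the distance between leftmost and rightmost device."""
    pos = {dev: i for i, dev in enumerate(order)}
    return sum(max(pos[d] for d in net) - min(pos[d] for d in net) for net in nets)


def anneal_order(devices: Sequence[str], nets: Sequence[Sequence[str]],
                 steps: int = 2000, t0: float = 2.0,
                 seed: int = 0) -> Tuple[List[str], int]:
    """Simulated annealing over device orderings using pairwise swaps."""
    rng = random.Random(seed)
    cur = list(devices)
    cur_cost = wirelength(cur, nets)
    best, best_cost = list(cur), cur_cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling, never zero
        i, j = rng.sample(range(len(cur)), 2)
        cur[i], cur[j] = cur[j], cur[i]
        new_cost = wirelength(cur, nets)
        delta = new_cost - cur_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(cur), cur_cost
        else:
            cur[i], cur[j] = cur[j], cur[i]  # reject: undo the swap
    return best, best_cost


if __name__ == "__main__":
    devices = ["A", "B", "C", "D", "E"]
    nets = [["A", "E"], ["B", "D"], ["A", "C"]]
    print(anneal_order(devices, nets))
```

Pre-training an RL agent on samples from an annealer like this, as the entry above describes, is what lets the learned policy skip most of the random search at inference time.
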

DAC'20 GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

Hanrui Wang, Kuan Wang, Jiacheng Yang, et al.

Keywords: Transistor Sizing, Graph Convolutional Network (GCN), RL, Transfer Learning

Motivation: 1) Analog circuit sizing relies on human experts, time-consuming with complex performance tradeoffs. 2) Traditional black-box methods (BO, ES) ignore circuit topology and cannot transfer knowledge across technology nodes/topologies. 3) Need transferable automated sizing with superior performance.

Design: 1) Graph representation: Circuit modeled as graph (nodes=components, edges=wires) to capture topology. 2) GCN-RL agent: 7-layer GCN aggregates neighbor features, Actor-Critic architecture with DDPG for continuous action space. 3) Action space: Component-specific continuous parameters to avoid discrete space explosion.
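
Sketch for design points 1 and 2: a circuit as a graph (nodes are components, edges are wires) and a single neighbor-aggregation layer over it. The mean-with-self-loop aggregation, the ReLU, the feature values, and the component names are assumptions for illustration. The paper's agent stacks 7 GCN layers and feeds them into a DDPG actor-critic, none of which is shown here.

```python
from typing import Dict, List, Sequence, Tuple


def gcn_layer(features: Dict[str, List[float]],
              wires: Sequence[Tuple[str, str]],
              weight: List[List[float]]) -> Dict[str, List[float]]:
    """One graph-convolution step: average over neighbors, project, ReLU."""
    # Each component counts as its own neighbor (self-loop).
    neighbors = {node: {node} for node in features}
    for u, v in wires:
        neighbors[u].add(v)
        neighbors[v].add(u)

    out_dim = len(weight[0])
    result = {}
    for node, nbrs in neighbors.items():
        in_dim = len(features[node])
        mean = [sum(features[m][i] for m in nbrs) / len(nbrs) for i in range(in_dim)]
        projected = [sum(mean[i] * weight[i][k] for i in range(in_dim))
                     for k in range(out_dim)]
        result[node] = [max(0.0, x) for x in projected]
    return result


if __name__ == "__main__":
    # Hypothetical 3-component circuit: two transistors and a resistor.
    features = {"M1": [1.0, 0.0], "M2": [0.0, 1.0], "R1": [2.0, 2.0]}
    wires = [("M1", "M2"), ("M2", "R1")]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(gcn_layer(features, wires, identity))
```

Because the layer only looks at each node's local neighborhood, the same weights apply to circuits with different numbers of components, which is what makes transfer across topologies possible.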
