simmonssong/Reliable-AI4Sys

Reliable-AI4Sys

General Scheduling

  1. SOSP'25 COpter: Efficient Large-Scale Resource-Allocation via Continual Optimization

    Suhas Jayaram Subramanya, Don Kurian Dennis, Gregory R. Ganger, Virginia Smith

    Keywords: Resource-Allocation, MILP.

    Motivation: 1) Existing solvers are not scalable: They are extremely slow at large scale. 2) Existing acceleration techniques cannot maintain both efficiency and optimality. 3) Goal: Accelerate solving while preserving optimality.

    Design: 1) Solving a standard LP: Use a standard LP formulation to unify different problems. 2) System-level manipulation of problem parameters for acceleration. 3) Eliminate the slow post-processing that turns relaxed LP solutions into MILP solutions by using simple rounding instead.
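
    Sketch: The "simple rounding" idea in point 3 can be illustrated as follows. This assumes a feasible fractional LP solution for a single resource is already available as a dict. The largest-remainder rule, the helper name `round_allocation`, and the sample numbers are illustrative assumptions, not COpter's actual procedure.

```python
import math

def round_allocation(fractional, capacity):
    """Round a fractional allocation to integers (hypothetical helper).

    Assumes the fractional solution is feasible (its sum is <= capacity).
    Floors every value, then hands leftover units to the jobs with the
    largest fractional parts (largest-remainder rule).
    """
    rounded = {job: math.floor(x) for job, x in fractional.items()}
    # Total units to hand out: the LP total, capped by the capacity.
    target = min(capacity, round(sum(fractional.values())))
    leftover = max(target - sum(rounded.values()), 0)
    # Jobs sorted by how much they lost to flooring, largest loss first.
    by_remainder = sorted(
        fractional, key=lambda job: fractional[job] - rounded[job], reverse=True
    )
    for job in by_remainder[:leftover]:
        rounded[job] += 1
    return rounded

# Made-up fractional LP output for three jobs sharing 4 units.
lp_solution = {"job_a": 2.6, "job_b": 1.3, "job_c": 0.1}
allocation = round_allocation(lp_solution, capacity=4)
```

    A single pass like this costs one sort, which is the kind of cheap step that could replace a slower integer post-processing phase; whether it preserves solution quality depends on the problem structure.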

Storage Systems

  1. SOSP'24 Tiered Memory Management: Access Latency is the Key!

    Midhul Vuppalapati, Rachit Agarwal

    Keywords: Measurement, Page placement

    Motivation: 1) Placing the hottest pages in the fast tier may not perform as expected: Demonstrated by experiments. 2) Placement should minimize expected latency.

    Design: 1) Measure the expected queue with little overhead: CPU-to-memory datapath. 2) Algorithm (or principle): Change the type (hot or alternative) of pages based on measured latencies.
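
    Sketch: One way to read the two design points together is to estimate per-tier latency from queue measurements with Little's law and then pick a migration direction. The function names, the tolerance band, the tier labels, and the counter readings below are assumptions for illustration, not the paper's implementation.

```python
def estimate_latency(queue_occupancy, request_rate):
    """Little's law: average time in system = average occupancy / arrival rate."""
    if request_rate <= 0:
        return 0.0
    return queue_occupancy / request_rate

def placement_action(default_latency, alternate_latency, tolerance=0.05):
    """Pick a migration direction toward the lower-latency tier.

    Hypothetical policy: the tolerance band avoids oscillating when the
    two tiers are already close to balanced.
    """
    if default_latency > alternate_latency * (1 + tolerance):
        return "demote"   # shift hot pages out of the default tier
    if alternate_latency > default_latency * (1 + tolerance):
        return "promote"  # shift hot pages into the default tier
    return "hold"

# Made-up readings: (average queue occupancy, requests per nanosecond).
default_lat = estimate_latency(queue_occupancy=48.0, request_rate=0.2)    # about 240 ns
alternate_lat = estimate_latency(queue_occupancy=18.0, request_rate=0.1)  # about 180 ns
action = placement_action(default_lat, alternate_lat)
```

    With these numbers the default tier is the slower one, so the sketch chooses to move load away from it even though it would normally hold the hottest pages.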

Networks

  1. SIGCOMM'25 SaTE: Low-Latency Traffic Engineering for Satellite Networks

    Hao Wu, Yizhan Han, Mohit Rajpal, Qizhen Zhang, Jingxian Wang

    Keywords: Satellite Network, Traffic Engineering, ML Generalization

    Motivation: 1) Dynamic: Satellite network topologies change frequently. 2) Generalization: A general problem of NN-based approaches. 3) Large scale: Numerous nodes and links.

    Design: 1) GNN-only: Eliminate the DNN to improve generalizability. 2) Graph pruning based on topology similarity (to an existing baseline topology): Improves generalizability. 3) Supervised learning from Gurobi's solutions.
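
    Sketch: To make "topology similarity to a baseline" in point 2 concrete, the snippet below scores two link sets with Jaccard similarity and keeps only the links shared with the baseline. Both the similarity measure and the pruning rule are assumptions chosen for illustration; the summary above does not say which ones SaTE uses.

```python
def edge_set(links):
    """Normalize undirected links so (a, b) and (b, a) compare equal."""
    return {tuple(sorted(link)) for link in links}

def jaccard_similarity(links_a, links_b):
    """Shared links divided by all links seen in either topology."""
    a, b = edge_set(links_a), edge_set(links_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def prune_to_baseline(current_links, baseline_links):
    """Keep only links that also exist in the baseline topology."""
    baseline = edge_set(baseline_links)
    return sorted(e for e in edge_set(current_links) if e in baseline)

# Hypothetical satellite names and inter-satellite links.
baseline = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]
current = [("s2", "s1"), ("s2", "s3"), ("s1", "s4")]
similarity = jaccard_similarity(current, baseline)
pruned = prune_to_baseline(current, baseline)
```

    Feeding a model graphs that stay close to a known baseline is one plausible reason pruning could help generalization: the inputs at inference time look more like the ones seen in training.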

  2. SIGCOMM'25 Centralium: A Hybrid Route-Planning Framework for Large-Scale Data Center Network Migrations

    Yikai Lin, Mohab Gawish (Meta)

    Keywords: Centralized Routing, Distributed Routing, Network Migration, Route Planning, BGP

    Motivation: 1) Data centers frequently undergo large-scale network migrations. 2) BGP cannot encode the sequential and conditional routing behaviors required during transitional migration phases.

    Design: 1) Route Planning Abstraction (RPA). 2) Centralium architecture. 3) Two protection mechanisms.

  3. SIGCOMM'25 From ATOP to ZCube: Automated Topology Optimization Pipeline and a Highly Cost-Effective Network Topology for Large Model Training

    Zihan Yan, Dan Li (Tsinghua University)

    Keywords: Data center networks, Network topology, AI infrastructure

    Motivation: 1) The explosive growth in LLM training scales requires new large-scale network topology designs. 2) Expert-designed topologies overlook potential asymmetric structures and struggle to balance multi-objective performance; existing automated approaches are not mature enough.

    Design: 1) Insight-driven hyperparameterization. 2) Multi-objective optimization engine. 3) High-performance evaluation pipeline.

EDA

  1. DAC'21 NVCell: Standard Cell Layout in Advanced Technology Nodes with Reinforcement Learning

    Haoxing Ren, Matthew Fojtik

    Keywords: Standard Cell Layout, RL, DRC, Placement and Routing

    Motivation: 1) Advanced technology nodes face DRC explosion (2000+ rules) with conditional and multi-pattern correlation, hard to model analytically. 2) Traditional methods (simulated annealing) suffer from long runtime, variable explosion, and poor scalability. 3) Need automated layout generation with competitive area and DRC compliance.

    Design: 1) Two-stage framework: Placement (simulated annealing + RL + ML routability predictor) + Routing (genetic algorithm + RL DRC fixer). 2) RL for placement: Pre-trained with simulated annealing samples, learns device pairing/ordering to speed up runtime. 3) RL for DRC fixing: Trained on one cell, transferable to all cells by identifying local DRC patterns. 4) ML routability predictor: Two-step (simple + precise) to optimize placement for routing.
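
    Sketch: The simulated-annealing half of the placement stage can be pictured with a toy single-row device ordering problem. The cost function (net span in a row), the linear cooling schedule, and the device and net names are assumptions for illustration; NVCell's real placer also involves RL and a routability predictor, which are not modeled here.

```python
import math
import random

def wire_length(order, nets):
    """Total span of each net when devices sit in a single row in `order`."""
    position = {device: i for i, device in enumerate(order)}
    return sum(
        max(position[d] for d in net) - min(position[d] for d in net)
        for net in nets
    )

def anneal_placement(devices, nets, steps=2000, start_temp=2.0, seed=0):
    """Swap-based simulated annealing; returns the best ordering seen."""
    rng = random.Random(seed)
    order = list(devices)
    cost = wire_length(order, nets)
    best_order, best_cost = order[:], cost
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9  # linear cooling
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]        # propose a swap
        new_cost = wire_length(order, nets)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best_order, best_cost = order[:], cost
        else:
            order[i], order[j] = order[j], order[i]    # undo the swap
    return best_order, best_cost

# Hypothetical devices and two-pin nets.
devices = ["M1", "M2", "M3", "M4", "M5"]
nets = [("M1", "M5"), ("M2", "M4"), ("M1", "M3")]
initial_cost = wire_length(devices, nets)
best_order, best_cost = anneal_placement(devices, nets)
```

    Every step re-evaluates the full cost, which hints at why plain annealing gets slow on larger cells and why pre-training an RL agent on its samples is attractive.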

  2. DAC'20 GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning

    Hanrui Wang, Kuan Wang, Jiacheng Yang, et al.

    Keywords: Transistor Sizing, Graph Convolutional Network (GCN), RL, Transfer Learning

    Motivation: 1) Analog circuit sizing relies on human experts, time-consuming with complex performance tradeoffs. 2) Traditional black-box methods (BO, ES) ignore circuit topology and cannot transfer knowledge across technology nodes/topologies. 3) Need transferable automated sizing with superior performance.

    Design: 1) Graph representation: Circuit modeled as graph (nodes=components, edges=wires) to capture topology. 2) GCN-RL agent: 7-layer GCN aggregates neighbor features, Actor-Critic architecture with DDPG for continuous action space. 3) Action space: Component-specific continuous parameters to avoid discrete space explosion.
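
    Sketch: The graph representation and neighbor aggregation in points 1 and 2 can be shown with one mean-aggregation step over a toy circuit. This omits learned weights and nonlinearities, and the component names and two-element type features are made up; it only illustrates how topology enters the node features, not the paper's actual network.

```python
def gcn_layer(features, edges):
    """One mean-aggregation step: each node averages itself and its neighbors."""
    neighbors = {node: {node} for node in features}  # include a self-loop
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    updated = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[n][k] for n in group) / len(group) for k in range(dim)
        ]
    return updated

# Toy circuit: two transistors joined through a resistor.
# Features are hypothetical [is_transistor, is_resistor] indicators.
features = {"T1": [1.0, 0.0], "T2": [1.0, 0.0], "R1": [0.0, 1.0]}
edges = [("T1", "R1"), ("R1", "T2")]
hidden = gcn_layer(features, edges)
```

    Applying the function repeatedly lets information travel one hop further each time, which is the intuition behind stacking several GCN layers.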
