This document explains how data flows between CPU and GPU in SAGESim's distributed simulation architecture.
SAGESim runs agent-based simulations across multiple MPI workers, each with its own GPU. The simulation loop follows this pattern for each tick:
CPU: Prepare Data → GPU: Execute Kernel → CPU: Collect Results → MPI: Synchronize
The fundamental challenge is that GPUs require rectangular arrays for efficient parallel processing, but agent-based simulations produce ragged/irregular data (agents have variable numbers of neighbors).
┌─────────────────────────────────────────────────────────────────────────────┐
│ SIMULATION TICK │
└─────────────────────────────────────────────────────────────────────────────┘
Worker 0 Worker 1
┌─────────────────┐ ┌─────────────────┐
│ Local Agents │ │ Local Agents │
│ (Python lists) │ │ (Python lists) │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────┐ MPI isend/irecv ┌─────────────────┐
│ Contextualize │◄────────────────────────────►│ Contextualize │
│ (exchange │ │ (exchange │
│ neighbor data) │ │ neighbor data) │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Data Prep │ │ Data Prep │
│ - ID → Index │ │ - ID → Index │
│ - Pad arrays │ │ - Pad arrays │
│ - Upload to GPU │ │ - Upload to GPU │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ GPU Kernel │ │ GPU Kernel │
│ (step function) │ │ (step function) │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Collect Results │ │ Collect Results │
│ - Download │ │ - Download │
│ - Index → ID │ │ - Index → ID │
└────────┬────────┘ └────────┬────────┘
│ │
▼ ▼
┌─────────────────┐ MPI barrier ┌─────────────────┐
│ Synchronize │◄────────────────────────────►│ Synchronize │
│ (global reduce) │ │ (global reduce) │
└─────────────────┘ └─────────────────┘
Location: sagesim/agent.py:493-767
Each worker exchanges neighbor agent data with other workers via MPI.
# Simplified flow
for each local agent:
for each neighbor of this agent:
if neighbor is on another worker:
send this agent's data to that worker
receive neighbor data from all other workersAgents need to read their neighbors' properties during the step function. If a neighbor lives on another worker, that data must be transferred before GPU execution.
Example: Agent 5 on Worker 0 has neighbor Agent 100 on Worker 1. Worker 0 must receive Agent 100's properties before the GPU kernel can execute Agent 5's step function.
Only properties marked neighbor_visible=True are sent via MPI:
breed.register_property("internal_counter", default=0, neighbor_visible=False) # Not sent
breed.register_property("health", default=100, neighbor_visible=True) # SentThis reduces MPI bandwidth when agents have properties that neighbors never read.
Location: sagesim/model.py:595-696
This phase transforms Python data structures into GPU-compatible arrays.
all_agent_ids = local_agent_ids + received_neighbor_ids
combined_data = local_data + received_neighbor_dataWhy: The GPU needs all relevant data in contiguous arrays. Local agents are at the beginning (indices 0 to N-1), neighbors follow (indices N to M).
GPU Array Layout:
┌───────────────────────────────────────┐
│ Index 0-99: Local agents │ ← Owned by this worker, will be modified
│ Index 100-150: Neighbor agents │ ← From other workers, read-only
└───────────────────────────────────────┘
Location: sagesim/model.py:29-108
Before: Agent 5's neighbors = [12, 87, 103] # Global agent IDs
After: Agent 5's neighbors = [3, 45, 112] # Local array indices
Why: GPU kernels access data via array indexing:
# GPU kernel code
neighbor_index = locations[my_index][i]
neighbor_health = health[neighbor_index] # Direct array access - fast!Using global agent IDs would require hash map lookups on the GPU, which is extremely slow. Array indices allow direct memory access with simple arithmetic: address = base + index * element_size.
Location: sagesim/internal_utils.py:8-96
Before (ragged - variable length rows):
Agent 0: [5, 12, 8] # 3 neighbors
Agent 1: [2] # 1 neighbor
Agent 2: [1, 4, 7, 9, 11] # 5 neighbors
After (rectangular - all rows same length):
Agent 0: [5, 12, 8, NaN, NaN] # Padded to max length (5)
Agent 1: [2, NaN, NaN, NaN, NaN]
Agent 2: [1, 4, 7, 9, 11]
Why: GPUs require rectangular arrays because:
-
Memory Layout: GPU calculates addresses as
base + row * stride + col. Variable row lengths make this impossible. -
Parallel Execution: All GPU threads execute the same instructions. With rectangular arrays, all threads can use the same loop bounds and memory access patterns.
-
Coalesced Access: Adjacent threads reading adjacent memory locations is fast. Ragged arrays break this pattern.
The NaN values serve as "no neighbor here" markers. The GPU kernel checks for NaN/-1 and skips those positions.
gpu_array = cp.array(padded_numpy_array) # CPU → GPU transferWhy: The data must physically reside in GPU memory (VRAM) for the GPU kernel to access it. This transfer happens over PCIe and has significant latency, which is why minimizing CPU↔GPU transfers is important.
write_buffers = []
for prop_idx in write_property_indices:
write_buffer = cp.copy(read_tensor[prop_idx])
write_buffers.append(write_buffer)Why: Prevents race conditions during parallel execution.
The Problem: If Agent A reads Agent B's health while Agent B simultaneously writes to its own health, the result is undefined (race condition).
The Solution: Double buffering separates read and write:
- All threads read from
read_tensors(original state) - All threads write to
write_buffers(new state) - After all threads complete, copy
write_buffers→read_tensors
This ensures all agents see a consistent snapshot of the simulation state.
Location: sagesim/model.py:706-734
The auto-generated CUDA kernel executes the user-defined step function for each agent in parallel.
# Kernel launch
self._step_func[blocks_per_grid, threads_per_block](
current_tick,
global_data,
*read_tensors, # Read from these
*write_buffers, # Write to these
num_local_agents,
agent_ids,
priority_idx,
)GPU Grid:
┌─────────────────────────────────────────────────────┐
│ Thread 0 → Agent 0 Thread 1 → Agent 1 ... │
│ Thread 100 → Agent 100 (neighbor, no step function) │
└─────────────────────────────────────────────────────┘
Each thread:
- Checks if it's a local agent (index < num_local_agents)
- Reads its breed to determine which step function to run
- Executes the step function, reading from neighbors as needed
- Writes results to write buffer
If multiple breeds exist with different priorities, they execute sequentially:
for priority_idx in range(num_priority_groups):
# Launch kernel for this priority group only
kernel(priority_idx)
# Wait for all threads to complete
cp.cuda.Stream.null.synchronize()
# After all priorities complete, copy writes to reads
for prop_idx in write_indices:
read_tensors[prop_idx] = write_buffers[prop_idx]Why: Ensures deterministic execution order when breeds interact. Priority 0 agents complete before priority 1 agents begin.
Location: sagesim/model.py:742-765
Download modified agent data from GPU back to CPU.
for i in range(num_properties):
# Only get local agents (first N rows), not neighbors
gpu_data = gpu_tensors[i][:num_local_agents]
# GPU → CPU transfer
cpu_array = gpu_data.get() # CuPy → NumPy
# Convert to Python list
python_list = cpu_array.tolist()Location: sagesim/model.py:111-217
Before: Agent 5's neighbors = [3, 45, 112] # Local array indices
After: Agent 5's neighbors = [12, 87, 103] # Global agent IDs
Why: The next tick's MPI exchange uses global agent IDs to route data to the correct workers. The index→ID conversion restores the global namespace.
Neighbor data (rows N to M) is read-only and came from other workers. Those workers have the authoritative copy and will send updated data next tick. Downloading neighbor rows would be wasted bandwidth.
Location: sagesim/model.py:767-779
comm.barrier() # All workers wait hereWhy: Ensures all workers finished GPU execution before any worker starts the next tick. Without this, a fast worker might begin MPI send while a slow worker is still computing, leading to stale data.
global_data = comm.allreduce(global_data, op=max)Why: Synchronizes global simulation state (e.g., total infection count, simulation-wide flags). Each worker contributes its local view; allreduce combines them.
| Transformation | Location | Why Needed |
|---|---|---|
| MPI exchange | agent.py | Agents need neighbor data that lives on other workers |
| Combine local + neighbor | model.py:622-653 | GPU needs all data in contiguous arrays |
| Agent ID → Index | model.py:29-108 | GPU uses array indexing, not hash lookups |
| Ragged → Rectangular padding | internal_utils.py:8-96 | GPU requires fixed-dimension arrays for parallel execution |
| CPU → GPU upload | model.py:662 | Data must be in GPU memory for kernel access |
| Double buffering | model.py:678-684 | Prevents race conditions during parallel writes |
| GPU → CPU download | model.py:742-765 | Results must return to CPU for MPI exchange |
| Index → Agent ID | model.py:111-217 | MPI uses global agent IDs for routing |
| MPI barrier | model.py:767 | Ensures consistent state before next tick |
The CPU↔GPU data transfer happens every tick (by default). This creates overhead because:
- PCIe Bandwidth: CPU↔GPU transfers are limited by PCIe speed (~32 GB/s for PCIe 4.0 x16)
- Padding Overhead: Ragged→rectangular conversion is CPU-bound Python code
- Synchronization: MPI barriers and GPU synchronization add latency
-
sync_workers_every_n_ticks: Run multiple ticks on GPU before synchronizing. Reduces MPI frequency but uses stale neighbor data. -
Selective Property Sync: Mark internal properties as
neighbor_visible=Falseto reduce MPI message size. -
Good Partitioning: Use METIS or similar to minimize cross-worker neighbors, reducing MPI traffic.
-
Batch Processing: Larger agent counts amortize the fixed overhead of kernel launches and transfers.