
Triton Kernels Playground

A playground for implementing and benchmarking Triton kernels against PyTorch equivalents.

Overview

This repository contains optimized Triton kernels and comprehensive benchmarks comparing them to equivalent PyTorch implementations, with a focus on performance analysis: throughput, latency, and memory usage.

Features

  • 🚀 High-performance kernels: Optimized Triton implementations
  • 📊 Comprehensive benchmarking: Throughput, latency, and memory analysis
  • 📈 Visualization: Automatic generation of comparison plots
  • 📝 Detailed reports: Markdown reports with speedup analysis
  • ✅ Tested: Unit tests for correctness verification

Kernels

  • MatMul (Tile-based): Optimized matrix multiplication using tiling strategies
  • LayerNorm: Efficient layer normalization implementation
  • Fused Softmax: Fused softmax operation with numerical stability
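The numerical-stability trick used by fused softmax kernels is subtracting the per-row maximum before exponentiating, which keeps every `exp()` argument at or below zero. A minimal NumPy reference sketch of that computation (illustrative only, not the Triton kernel in `kernels/`):

```python
import numpy as np

def stable_softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax using the max-subtraction trick for numerical stability."""
    # Subtracting the row max keeps exp() arguments <= 0, so exp() cannot overflow.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# A naive exp(x) would overflow to inf here; the stable version does not.
x = np.array([[1000.0, 1001.0, 1002.0]])
print(stable_softmax(x))
```

A fused Triton kernel performs the max, exponentiation, and normalization in a single pass over each row, avoiding the extra global-memory round trips a composed PyTorch expression would incur.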

Structure

```
triton-kernels-playground/
├── kernels/          # Triton kernel implementations
├── bench/            # Benchmarking infrastructure
│   ├── benchmark.py  # Core benchmarking utilities
│   ├── run_all.py    # Run all benchmarks
│   └── report_generator.py  # Generate markdown reports
├── utils/            # Utility functions
│   ├── logger.py     # JSONL logging
│   └── visualizer.py # Matplotlib visualization
├── tests/            # Unit tests
├── examples/         # Usage examples
└── reports/          # Generated benchmark reports
```

Installation

```bash
pip install -r requirements.txt
```

Quick Start

Run Examples

```bash
# Matrix multiplication example
python examples/matmul_example.py

# Layer normalization example
python examples/layernorm_example.py

# Softmax example
python examples/softmax_example.py
```

Run Benchmarks

```bash
# Run all benchmarks
python -m bench.run_all

# Generate report from latest benchmark
python -m bench.report_generator
```

Run Tests

```bash
pytest tests/
```

Benchmarks

Benchmarks measure:

  • Throughput: Operations per second
  • Latency: Time per operation (milliseconds)
  • Memory: Peak memory usage (MB)
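Latency and throughput both fall out of the same measurement: time many repeated runs after a warmup phase, then divide. A CPU-only sketch of that pattern (the function name and signature here are illustrative; the repository's `bench/benchmark.py` presumably uses CUDA event timing and synchronization for GPU kernels):

```python
import time

def benchmark(fn, *args, warmup=5, iters=50):
    """Time a callable, returning latency in ms and throughput in ops/s."""
    for _ in range(warmup):
        fn(*args)  # warmup runs: populate caches, trigger any JIT compilation
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    elapsed = time.perf_counter() - start
    return {
        "latency_ms": elapsed / iters * 1e3,   # average time per operation
        "throughput_ops_s": iters / elapsed,   # operations per second
    }

print(benchmark(sum, range(10_000)))
```

Note that timing GPU kernels this way requires a device synchronization before reading the clock, since CUDA launches are asynchronous; peak memory on GPU is typically read from the allocator's high-water mark rather than measured externally.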

Results are:

  • Logged to JSONL format (reports/benchmark_*.jsonl)
  • Visualized with matplotlib (reports/*.png)
  • Summarized in markdown reports (reports/benchmark_report.md)
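JSONL stores one JSON object per line, which makes benchmark logs appendable and easy to stream-parse. A minimal sketch of the read/write pattern (function names here are illustrative, not necessarily those in `utils/logger.py`):

```python
import json
import os
import tempfile

def log_jsonl(path, record):
    """Append one benchmark record as a single JSON line."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read back all records, one per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

path = os.path.join(tempfile.gettempdir(), "benchmark_demo.jsonl")
open(path, "w").close()  # start with an empty log for the demo
log_jsonl(path, {"kernel": "matmul", "latency_ms": 0.42})
log_jsonl(path, {"kernel": "softmax", "latency_ms": 0.11})
print(read_jsonl(path))
```

Because each record is a self-contained line, a crashed benchmark run leaves all previously logged results intact.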

Requirements

  • Python >= 3.8
  • PyTorch >= 2.0.0
  • Triton >= 3.0.0
  • CUDA-capable GPU (required for Triton kernels)
  • NumPy >= 1.24.0
  • Matplotlib >= 3.7.0

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Changelog

See CHANGELOG.md for a list of changes and version history.
