TileFusion, derived from the research presented in this paper, is an efficient C++ macro kernel library designed to elevate the level of abstraction in CUDA C for tile processing. The library offers:
- Higher-Level Programming Constructs: TileFusion supports tiles across the three-level GPU memory hierarchy, providing device kernels for transferring tiles between CUDA memory hierarchies and for tile computation.
- Modularity: TileFusion enables applications to process larger tiles built out of BaseTiles in both time and space, abstracting away low-level hardware details.
- Efficiency: The library's BaseTiles are designed to match TensorCore instruction shapes and encapsulate hardware-specific performance parameters, ensuring optimal utilization of TensorCore capabilities.
A core design goal of TileFusion is to let users understand and use the provided primitives in terms of logical concepts, without delving into low-level hardware complexities. The library rigorously separates data flow across the memory hierarchy from the configuration of individual macro kernels. This design choice enables performance tuning in three ways (sketched briefly after the list):
- Structural Tuning: Designs various data flows while keeping kernel configurations unchanged.
- Parameterized Tuning: Adjusts kernel configurations while maintaining the same data flow.
- Combined Tuning: Integrates both structural and parameterized tuning approaches simultaneously.
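As a concrete contrast, parameterized tuning keeps the data flow intact and only swaps a tile primitive's configuration. Below is a minimal, illustrative sketch using the `GlobalTile`, `TileIterator`, and `TileShape` constructs introduced in the GEMM example later in this document; the alias names are hypothetical:

```cpp
// Parameterized tuning (illustrative sketch): the data flow is unchanged;
// only the kernel configuration, here the k-dimension chunk width of the
// iterator, is adjusted. Structural tuning would instead reroute the tiles
// (e.g., staging them in shared memory) while keeping these shapes fixed.
using GlobalA = GlobalTile<__half, RowMajor<128, 256>>;
using IteratorA_k32 = TileIterator<GlobalA, TileShape<128, 32>>;  // 8 k-chunks
using IteratorA_k64 = TileIterator<GlobalA, TileShape<128, 64>>;  // 4 k-chunks
```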
In summary, TileFusion encourages algorithm developers to focus on designing the data flow of their algorithms using efficient tile primitives. It can be utilized as:
- A lightweight C++ library with header-only usage, offering superior readability, modifiability, and debuggability.
- A Python library with pre-existing kernels bound to PyTorch.
TileFusion approaches the efficient implementation of a kernel by:
- Managing dataflow over memory hierarchies.
- Configuring tile primitives, such as tile shapes, layouts, and other parameters.
This is an example of a simple GEMM (General Matrix Multiplication) kernel written using TileFusion. For the complete example, please refer to this directory.
The core programming constructs in TileFusion are `Tile`, `TileLayout`, `TileIterator`, `Loader`, and `Storer`.
- Declare the `Tile`: `GlobalTile` and `RegTile` are used to customize the shape and layout of 1D (vector) or 2D (matrix) arrays within the GPU's three memory hierarchies, known as a `Tile`.
- Declare the `TileIterator`: partitions the `GlobalTile` into smaller, manageable sub-tiles for efficient processing.
- Declare the `Loader` and `Storer`: loaders and storers use cooperative threads to transfer a tile from the source to the target location. They operate at the CTA level and accept the following inputs:
  - Warp Layout
  - Target Tile
  - Source Tile
Based on these parameters, they automatically infer a copy plan that partitions the data transfer work among the threads.
1 using WarpLayout = RowMajor<2, 2>;
2
3 // operand A
4 using GlobalA = GlobalTile<InType, RowMajor<128, 256>>;
5 using IteratorA = TileIterator<GlobalA, TileShape<128, 32>>;
6 using RegA = RegTile<BaseTileRowMajor<__half>, RowMajor<8, 8>>;
7 using ALoader = GlobalToRegLoader<RegA, WarpLayout, kRowReuseCont>;
8
9 // operand B
10 using GlobalB = GlobalTile<InType, ColMajor<256, 64>>;
11 using IteratorB = TileIterator<GlobalB, TileShape<32, 64>>;
12 using RegB = RegTile<BaseTileColMajor<__half>, ColMajor<8, 4>>;
13 using BLoader = GlobalToRegLoader<RegB, WarpLayout, kColReuseCont>;
14
15 // output C
16 using GlobalC = GlobalTile<AccType, RowMajor<128, 64>>;
17 using RegC = RegTile<BaseTileRowMajor<float>, RowMajor<8, 8>>;
18 using CStorer = RegToGlobalStorer<GlobalC, RegC, WarpLayout>;
Note: To simplify the demonstration, this example involves only two memory levels: global memory and registers. TileFusion also applies similar concepts to `SharedTile`.
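For completeness, here is a hypothetical sketch of how the same pattern extends to the shared-memory level. The loader names below mirror the naming convention of `GlobalToRegLoader` but are assumptions for illustration, not part of this example:

```cpp
// Hypothetical shared-memory staging for operand A (names assumed):
// global -> shared via a cooperative CTA-level loader, then
// shared -> register for consumption by TensorCore instructions.
using SharedA = SharedTile<InType, RowMajor<128, 32>>;
using G2SLoaderA = GlobalToSharedLoader<SharedA, WarpLayout>;
using S2RLoaderA = SharedToRegLoader<RegA, WarpLayout, kRowReuseCont>;
```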
The kernel is then defined as implementing the following dataflow over the memory hierarchies:
1 template <typename InType, typename AccType,
2 typename IteratorA, typename RegA, typename LoaderA,
3 typename IteratorB, typename RegB, typename LoaderB,
4 typename GlobalC, typename RegC, typename CStorer>
5 __global__ void simple_gemm(const InType* dA, const InType* dB, AccType* dC) {
6 IteratorA gAs(dA);
7 RegA rA;
8 LoaderA loader_a;
9
10 IteratorB gBs(dB);
11 RegB rB;
12 LoaderB loader_b;
13
14 RegC acc;
15
16 for (int k = 0; k < IteratorA::sc1; ++k) {
17 loader_a(gAs(k), rA);
18 loader_b(gBs(k), rB);
19 __syncthreads();
20
21 gemm(rA, rB, acc);
22 }
23 __syncthreads();
24
25 GlobalC gC(dC);
26 CStorer storer_c;
27 storer_c(acc, gC);
28 }
The `TileIterator` (`IteratorA`, `IteratorB` in lines 6 and 10) serves as a syntactic interface for defining tile partitions. It is used to divide the `GlobalTile` into smaller sub-tiles and iterate over them.
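As a quick sanity check on the partitioning, under the assumption (suggested by the loop bound in line 16, not by documented API guarantees) that `sc1` counts the sub-tiles along the second dimension:

```cpp
// GlobalA is 128x256 and each chunk is 128x32, so IteratorA should yield
// 256 / 32 = 8 sub-tiles along the k dimension.
static_assert(IteratorA::sc1 == 8, "expected eight k-dimension chunks");
```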
`Loader` and `Storer` (declared in lines 8, 12, and 26) are efficient methods for loading and storing data, transferring it between memory hierarchies using specialized hardware-accelerated instructions (lines 17, 18, and 27). Tiles of data are cooperatively loaded into the `RegTile`, which is stored in each thread's local register file.
Once the data is loaded into a thread's local register file, `gemm` (line 21) performs matrix multiplication using TensorCore's warp-level matrix multiply-and-accumulate (WMMA) instruction on the `BaseTile`s. The specialized data distribution required by TensorCore is automatically maintained by TileFusion's `RegTile` layout.
After the `gemm` operation completes, the data in the `RegTile` is cooperatively stored back from registers to global memory using the `RegToGlobalStorer`.
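To complete the picture, here is a minimal host-side sketch of how the kernel above might be instantiated and launched. It is not taken from the example directory; the launch configuration (one CTA of 128 threads, matching the `RowMajor<2, 2>` warp layout) and the `InType`/`AccType` definitions are assumptions based on the declarations above:

```cpp
#include <cuda_fp16.h>

// Hypothetical launcher; assumes the type aliases (IteratorA, RegA,
// ALoader, ..., CStorer) and the simple_gemm kernel from the listings
// above are in scope.
using InType = __half;
using AccType = float;

void launch_simple_gemm(const InType* dA, const InType* dB, AccType* dC) {
    // RowMajor<2, 2> warp layout -> 4 warps -> 128 threads per CTA.
    // The GlobalTiles cover the full 128x256 * 256x64 problem, so a
    // single CTA suffices in this demo-sized configuration.
    dim3 grid(1, 1, 1);
    dim3 block(128, 1, 1);
    simple_gemm<InType, AccType,
                IteratorA, RegA, ALoader,
                IteratorB, RegB, BLoader,
                GlobalC, RegC, CStorer><<<grid, block>>>(dA, dB, dC);
    cudaDeviceSynchronize();  // wait for completion (error checking omitted)
}
```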
TileFusion can be used as a lightweight C++ library with header-only usage, or it can be built as a Python library. You can choose to build either one.
TileFusion requires:
- C++20 host compiler
- CUDA 12.0 or later
- GCC version 10.0 or higher to support C++20 features
Download the repository:
git clone [email protected]:microsoft/TileFusion.git
cd TileFusion && git submodule update --init --recursive
To build the project using the provided `Makefile`, simply run:
make
To run a single C++ unit test:
make unit_test_cpp CPP_UT=test_gemm
- Build the wheel:
  python setup.py build bdist_wheel
- Clean the build:
  python setup.py clean
- Install the Python package in editable mode (recommended for development):
  python setup.py develop
  This allows you to edit the source code directly without needing to reinstall it repeatedly.
Before running the Python unit tests, you need to build and install the Python package (see the Building the Python Package section).
- Run a single Python unit test:
  pytest tests/python/test_scatter_nd.py
- Run all Python unit tests:
  python setup.py pytests
- Run all C++ unit tests:
  python setup.py ctests
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.