From 636f51995928f0bf65a91fbc7174ce870d6a8a2d Mon Sep 17 00:00:00 2001
From: SangHyeon Park <39648636+shyeonn@users.noreply.github.com>
Date: Fri, 13 Dec 2024 16:39:15 +0900
Subject: [PATCH 1/2] Update automatic optimization code in part1.md

---
 chapter_gpu_acceleration/part1.md | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/chapter_gpu_acceleration/part1.md b/chapter_gpu_acceleration/part1.md
index 45acaee..2fb3527 100644
--- a/chapter_gpu_acceleration/part1.md
+++ b/chapter_gpu_acceleration/part1.md
@@ -310,16 +310,22 @@ print("GEMM-Blocking: %f GFLOPS" % (num_flop / evaluator(A_nd, B_nd, C_nd).mean
 So far, we have been manually writing transformations to optimize the TensorIR program on GPU. We can leverage the automatic program optimization framework to tune the same program. The following code does that, we only set a small number here, and it can take a few min to finish.
 
 ```python
+
 from tvm import meta_schedule as ms
 
-database = ms.tune_tir(
-    mod=MyModuleMatmul,
-    target="nvidia/tesla-p100",
-    max_trials_global=64,
-    num_trials_per_iter=64,
-    work_dir="./tune_tmp",
-)
-sch = ms.tir_integration.compile_tir(database, MyModuleMatmul, "nvidia/tesla-p100")
+database = ms.tune.tune_tasks(
+    tasks=[ms.tune_context.TuneContext(
+        mod=MyModuleMatmul,
+        target="nvidia/geforce-rtx-4090",
+        space_generator=ms.space_generator.PostOrderApply(),
+        search_strategy=ms.search_strategy.ReplayTrace(),
+    )],
+    task_weights=[1.0],
+    max_trials_global=64,
+    num_trials_per_iter=64,
+    work_dir="./tune_tmp",
+    )
+sch = ms.tir_integration.compile_tir(database, MyModuleMatmul, "nvidia/geforce-rtx-4090")
 sch.mod.show()
 ```
 
@@ -338,4 +344,4 @@ This chapter studies another axis of MLC -- how we can transform our program for
 - A typical GPU contains two-level hierarchy. Each thread is indexed by(in cuda terminology) `threadIdx.x` and `blockIdx.x`(there can be multiple dimension indices as well, but they can be fused to one.
 - Shared memory helps cache data commonly used across the threads within the same block.
 - Encourage memory reuse during GPU optimization.
-
\ No newline at end of file
+

From 6dabbea7e90c21a0bfdc06bc604f8383bcf40235 Mon Sep 17 00:00:00 2001
From: SangHyeon Park <39648636+shyeonn@users.noreply.github.com>
Date: Fri, 13 Dec 2024 16:45:36 +0900
Subject: [PATCH 2/2] Remove line

---
 chapter_gpu_acceleration/part1.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/chapter_gpu_acceleration/part1.md b/chapter_gpu_acceleration/part1.md
index 2fb3527..44b8db6 100644
--- a/chapter_gpu_acceleration/part1.md
+++ b/chapter_gpu_acceleration/part1.md
@@ -310,7 +310,6 @@ print("GEMM-Blocking: %f GFLOPS" % (num_flop / evaluator(A_nd, B_nd, C_nd).mean
 So far, we have been manually writing transformations to optimize the TensorIR program on GPU. We can leverage the automatic program optimization framework to tune the same program. The following code does that, we only set a small number here, and it can take a few min to finish.
 
 ```python
-
 from tvm import meta_schedule as ms
 
 database = ms.tune.tune_tasks(
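
Reviewer note (not part of the patches): the sketch below is how the tuning snippet in part1.md reads once both patches apply, with editorial comments on what each `meta_schedule` piece does. The comments, the normalized indentation of the final closing parenthesis, and the remark that the target tag is swappable are assumptions of this note, not content from the diff.

```python
# Net state of the tuning example after PATCH 1/2 and 2/2 apply
# (a sketch; the explanatory comments are editorial, not patch content).
from tvm import meta_schedule as ms

database = ms.tune.tune_tasks(
    # A single tuning task over the matmul module. PostOrderApply builds the
    # schedule design space; ReplayTrace samples traces from it without a
    # cost model, which keeps this demo run short.
    tasks=[ms.tune_context.TuneContext(
        mod=MyModuleMatmul,
        target="nvidia/geforce-rtx-4090",  # any registered target tag works here
        space_generator=ms.space_generator.PostOrderApply(),
        search_strategy=ms.search_strategy.ReplayTrace(),
    )],
    task_weights=[1.0],
    max_trials_global=64,   # deliberately small budget; raise it for better schedules
    num_trials_per_iter=64,
    work_dir="./tune_tmp",  # measurement records are stored here as a database
)
# Query the database for the best record and rebuild the schedule it encodes.
sch = ms.tir_integration.compile_tir(database, MyModuleMatmul, "nvidia/geforce-rtx-4090")
sch.mod.show()
```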