CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning¶

🎯 Pitch¶

This paper introduces CUDA-L2, an LLM-guided reinforcement learning system that automatically generates and optimizes half-precision HGEMM CUDA kernels by using measured CUDA execution speed as the reward, exploring 1,000 (M,N,K) configurations. By systematically discovering tiling, pipelining, prefetching, and other low-level optimizations, CUDA-L2 delivers consistent speedups over torch.matmul and NVIDIA’s cuBLAS/cuBLASLt (e.g., +11.4% over cuBLASLt-AutoTuning in offline mode), showing that automated RL-driven kernel search can materially improve performance-critical matmul at scale.

1. Executive Summary (2-3 sentences)¶

CUDA-L2 is an automated system that uses an LLM trained with reinforcement learning to generate and optimize half-precision matrix multiplication (HGEMM) CUDA kernels, using measured CUDA execution speed as the reward. Across 1,000 (M, N, K) configurations on an A100 GPU, it reports systematic speedups not only over torch.matmul but also over NVIDIA’s heavily optimized cuBLAS/cuBLASLt baselines, including an average +11.4% over cuBLASLt-AutoTuning-max in offline mode and +15.9% in server mode (Figure 1, Table 1). The significance is that it demonstrates RL-guided code generation can still improve performance-critical GEMM kernels at scale, where manual tuning is difficult across many shapes and settings (Abstract, Section 1, Section 4).

2. Context and Motivation¶

Problem / gap addressed
Matrix multiplication (matmul) dominates compute in LLM training and inference, so its kernel performance strongly impacts end-to-end speed (Section 1).
Manual optimization across many matrix sizes is hard because:
- Optimal strategies depend on (M, N, K) (Section 1).
- Optimizations often do not transfer across GPU architectures (Section 1).
- Even for the same dimensions, setup details (example given: FP16 vs FP32 accumulators) can change register pressure and thus the best kernel design (Section 1).
Why it matters
Even “deeply optimized” matmul is not fully solved; the introduction cites a targeted 13% throughput gain from model-specific GEMM optimizations as evidence that more gains exist when effort is focused (Section 1).
Scaling that kind of tuning to many shapes and production settings calls for automation (Section 1).
Prior approaches and shortcomings (as positioned here)
Recent work uses LLMs (often with RL) to generate CUDA kernels (Section 1), including AI CUDA Engineer and the authors’ prior CUDA-L1 (Section 1, Section 3.1).
However, the paper argues prior benchmark-focused efforts (e.g., KernelBench) typically optimize a single fixed configuration per task, so it is unclear how well they translate to production-like variability in dimensions (Section 1).
It also states that, to their knowledge, no prior work achieves performance comparable to hand-optimized matmul kernels when compared against cuBLAS (Section 1).
How this work positions itself
CUDA-L2 specifically targets HGEMM and scales optimization across 1,000 shape configurations, covering all 10^3 combinations where each of M, N, K is chosen from {64, 128, 256, 512, 1024, 2048, 4096, 8192, 12288, 16384} (Abstract, Section 1).
It reports comparisons against practical baselines (torch.matmul) and NVIDIA libraries (cuBLAS, cuBLASLt with heuristic and autotuning) under two workload scenarios: offline and server (Abstract, Section 4, Figure 1, Table 1).

3. Technical Approach¶

3.1 Reader orientation (approachable technical breakdown)¶

CUDA-L2 is an LLM-driven kernel optimizer that generates CUDA .cu code for half-precision GEMM and improves it via reinforcement learning.
It solves “given a specific (M, N, K) HGEMM shape on a GPU, produce a fast correct CUDA kernel,” by iteratively generating code variants, compiling/running them, and using runtime (plus correctness/size penalties) as the training signal (Section 3.2.3, Section 2.3–2.4).

3.2 Big-picture architecture (diagram in words)¶

(A) Continued pretraining data pipeline → builds a stronger CUDA-capable base model (Section 3.2.1).
(B) General-kernel RL stage → trains the model to optimize diverse CUDA kernels using speedup rewards (Section 3.2.2).
(C) HGEMM-specific RL stage → specializes the model to matmul kernels across many (M, N, K) settings, using runtime + deviation + code-length penalties and adding profiling metrics into the prompt/context (Section 3.2.3).
(D) Evaluation harness → compiles generated .cu with nvcc, checks executability and correctness, measures time robustly, and computes speedups (Section 2.3–2.4, Listing 1).

3.3 Roadmap for the deep dive¶

Explain (1) what HGEMM kernels look like and what is being optimized (Section 2.1).
Explain (2) how kernels are judged: executability, correctness, and timing/speedup measurement (Section 2.3–2.4).
Explain (3) how CUDA-L2 trains: pretraining expansion, then two-stage RL (general → HGEMM) (Section 3.2.1–3.2.3).
Explain (4) what signals guide optimization beyond raw runtime (profiling metrics; reward shaping) (Section 3.2.3).
Explain (5) what kinds of kernel-level optimizations it discovers (Section 5) and how it analyzes learned hyperparameter patterns (Section 6).

3.4 Detailed, sentence-based technical breakdown¶

This is an empirical systems + algorithmic training paper: it builds an LLM+RL system that generates CUDA HGEMM kernels and uses measured GPU performance (and correctness constraints) to iteratively improve them across a large configuration space (Abstract; Section 3.2; Section 4).

3.4.1 What is HGEMM and what knobs exist?¶

The target computation is half-precision GEMM, focused on the common case C = A B (i.e., α = 1, β = 0) where A ∈ R^{M×K} and B ∈ R^{K×N} (Section 2.1).
Modern high-performance HGEMM uses a tiled structure (Section 2.1):
Tiling/partitioning: Split A and B into tiles; each thread block computes an output tile of size B_M × B_N.
Main loop over K: Load B_M × B_K from A and B_K × B_N from B from global memory → shared memory → registers; use tensor cores for the multiply-accumulate; accumulate partial sums in registers; repeat across K tiles.
Epilogue: Write accumulated register results back (registers → shared memory → global memory).
The paper later refers to key kernel “hyperparameters” such as B_M, B_N, B_K tile sizes (Section 6.1), number of pipeline stages n_stage (Section 5.3, Section 6.2), and block swizzle parameters such as swizzle_stride (Section 5.3, Section 6.3).
These are not LLM training hyperparameters; they are kernel configuration parameters that affect memory reuse, parallelism, and resource usage.

3.4.2 Baselines and what “NN/TN/max” mean (so the comparisons are interpretable)¶

torch.matmul is treated as a practical baseline that dispatches to optimized kernels (for FP16 it uses cuBLAS internally) but includes framework overhead (Section 2.2.1).
cuBLAS baseline uses cublasGemmEx with CUBLAS_GEMM_DEFAULT_TENSOR_OP to enable Ampere FP16 tensor cores and rely on cuBLAS’s internal algorithm selection (Section 2.2.2; Listing 6).
Two data layouts are benchmarked:
- NN: “normal-normal” (row-major A, row-major B).
- TN: “transposed-normal” (mixed layout; described in Listing 6 as A row major and B col major, and cublasGemmEx uses CUBLAS_OP_T for one operand).
cuBLAS-max means “take the faster of NN and TN for each configuration,” i.e., max(cuBLAS-NN, cuBLAS-TN) (Section 2.2.2).
cuBLASLt baselines (Section 2.2.3) include:
cuBLASLt-heuristic: query cublasLtMatmulAlgoGetHeuristic, pick the top-ranked suggestion (results[0]), cache it, and evaluate (Section 2.2.3; Listing 7).
cuBLASLt-AutoTuning: query up to 100 candidates from the heuristic API, benchmark them with randomized ordering and random inputs, and choose the algorithm with best median time; cache the chosen algorithm (Section 2.2.3; Listing 8).
Both are tested in NN and TN; *-max takes the best layout per configuration.

3.4.3 What happens first, second, third: the CUDA-L2 training pipeline¶

Continued pretraining expands the model’s CUDA competence beyond KernelBench (Section 3.2.1).
CUDA code is collected from:
- Web sources (cleaned/segmented with rule-based filtering + LLM-based cleaning).
- Established libraries/tutorials (PyTorch, ATen, CUTLASS, NVIDIA tutorials/examples, etc.) (Section 3.2.1).
Because raw CUDA code usually lacks an “instruction prompt,” the pipeline uses Claude Sonnet 4 to generate instruction descriptions for each snippet (Section 3.2.1).
For retrieval augmentation, each instruction is used as a search query; retrieved documentation/examples are concatenated as extra context (Section 3.2.1).
The resulting instruction+context+code triplets are used for continued pretraining on DeepSeek 671B (Section 3.2.1).
- The excerpt does not provide optimizer settings, learning-rate schedule, batch size, context length, or token counts for this training.
A “general kernel RL” stage trains optimization behavior across diverse kernels (Section 3.2.2).
The authors collect roughly ~1K CUDA kernels from established libraries spanning many operation types (linear algebra, convolutions, reductions, elementwise ops, attention, etc.) and pair each with an “official/successful implementation” (Section 3.2.2).
RL reward is the average speedup score across test iterations (Section 3.2.2), and training uses a contrastive RL strategy from CUDA-L1 where the model compares previously generated variants and their measured performance (Section 3.2.2).
Parameter updates use GRPO (Section 3.2.2).
Rewards are “smoothed and clipped” to reduce reward hacking effects (Section 3.2.2), but the excerpt does not specify the smoothing/clipping functions or thresholds.
An “HGEMM RL” stage specializes the model to matmul kernels across many (M, N, K) (Section 3.2.3).
The same contrastive RL strategy is used, but now restricted to HGEMM with varying shapes (Section 3.2.3).
The reward explicitly combines performance, a numerical-deviation penalty, and a code-length penalty (Eq. (3), Section 3.2.3):
- In plain language: the model gets higher reward if the kernel runs faster than a reference, but it is penalized if output deviates from a FP32 CPU reference and if the generated code is long.
- The paper defines:
- r(custom) = (1/N) * Σ_i [ (t_ref^i / t_custom^i) − α * diff_i ] − β * L(custom) (Eq. (3), Section 3.2.3),
- diff_i = max_j |out_FP32^i[j] − out_custom^i[j]| (Section 3.2.3),
- L(custom) is code length (Section 3.2.3),
- α > 0 and β > 0 are penalty coefficients (values not provided in the excerpt).
The HGEMM RL context includes detailed NCU (Nsight Compute) profiling metrics (memory throughput, SM occupancy, cache efficiency, etc.) so the model can condition on performance characteristics rather than only end-to-end time (Section 3.2.3).
Kernel generation constraints / allowed implementation tools
Generated kernels are emitted as CUDA code in .cu files and compiled with nvcc (Section 3.2.3, Section 2.4.1).
The system allows CUDA C/C++, CuTe, inline PTX, CUDA intrinsics, and CUTLASS templates (Section 3.2.3).
It explicitly disallows Python-based DSLs like Triton (Section 3.2.3).

3.4.4 Executability, correctness, and robust timing (to prevent “reward hacking”)¶

Executability criteria
A candidate kernel must compile, launch, and finish within a time limit; memory violations are checked with compute-sanitizer --tool memcheck (Section 2.3.1).
Correctness criteria (two-pronged)
Reference output is computed by an FP32 CPU matmul:
> ref = MatMul(float(a.cpu()), float(b.cpu())) (Eq. (1), Section 2.3.2).
Because floating-point arithmetic is non-associative, exact equality is not required in general; instead the paper uses:
1. Exact match under binary inputs {0,1} below an FP16-exactness threshold (Section 2.3.2).
2. For entries where c_ref[i,j] < 2048, require c_custom[i,j] == c_ref[i,j] exactly; ignore positions with c_ref[i,j] > 2048 (Section 2.3.2).
3. The reasoning given is FP16 has 11 bits of significand precision, so integers in [0, 2048) are exactly representable (Section 2.3.2).
4. Baseline-bounded deviation (Section 2.3.2).
5. Compute outputs from multiple NVIDIA baselines (cuBLAS, cuBLASLt-heuristic, cuBLASLt-AutoTuning, both NN/TN) and take the maximum elementwise difference among them as an “upper bound of variability.”
6. A custom kernel is marked incorrect if its max elementwise deviation from FP32 CPU exceeds that bound (Section 2.3.2).
Timing and speedup computation
Speedup for one run is defined as s(custom) = t_ref / t_custom − 1 (Eq. (2), Section 2.4).
Each evaluation includes:
- 10-second warmup, then at least 30 seconds of timing (Section 2.4).
- Randomized execution order each iteration to reduce ordering/caching artifacts (Section 2.4).
- Final score is the mean speed score over runs (Section 2.4).
Timing is done with CUDA events and torch.cuda.synchronize() around the kernel call (Listing 1, Section 2.4.1).
Avoiding timing-measurement hacking
The authors state that prior work identified “hacks” such as launching asynchronous work in other streams or exploiting Python lazy execution (Section 2.4.1).
CUDA-L2 mitigates these by:
- Disallowing additional CUDA stream creation.
- Generating only .cu code, bypassing Python’s lazy evaluation (Section 2.4.1).

3.4.5 Offline vs server evaluation scenarios (why two modes)¶

offline: kernels run back-to-back with no pauses, representing peak throughput with warm caches (Section 2.4.2; Figure 1a).
server: kernels run with random idle intervals to simulate real-time inference request arrivals; the interval time itself is not included in the measured kernel execution time (Section 2.4.2; Figure 1b).
The paper attributes differing results and higher variance in server mode to GPU thermal/caching dynamics under idling vs sustained load (Section 4.1).

3.4.6 Missing “core configurations and hyperparameters” (explicitly not in the excerpt)¶

The prompt requests model/training hyperparameters like optimizer settings, learning-rate schedule, batch size, context window, tokenizer, number of layers/heads, total training tokens, compute budget, and hardware details. These are not specified in the provided excerpt, except for: - Target GPU architecture focus: A100 (Abstract; Conclusion). - Pretraining base model name: DeepSeek 671B (Section 3.2.1). - RL update algorithm name: GRPO (Section 3.2.2). - Use of nvcc compilation and NCU profiling (Section 3.2.3).

4. Key Insights and Innovations¶

(1) RL directly optimizes CUDA HGEMM kernels using real execution-time rewards across a large shape grid
Novelty: Unlike single-configuration benchmarks, the system optimizes across 1,000 (M,N,K) combinations (Abstract; Section 1).
Significance: The reported improvements are “systematic rather than driven by outliers,” supported by win rates across configurations (Section 4.1; Table 1).
(2) A multi-stage training recipe: continued CUDA pretraining → general-kernel RL → HGEMM RL
Novelty: The paper emphasizes progressive specialization—first broaden general CUDA capabilities, then train on diverse kernels, then specialize on matmul (Section 3.2.1–3.2.3).
Significance: This is positioned as necessary because matmul is substantially harder than typical KernelBench-style kernels (Section 3.1).
(3) Incorporating low-level profiling metrics (NCU) into the RL context
Novelty: The model conditions on metrics like memory throughput, SM occupancy, and cache efficiency, not just wall-clock time (Section 3.2.3).
Significance: This is meant to make optimization decisions more informed (e.g., whether changes improve occupancy vs memory behavior), though the excerpt does not include an ablation isolating this factor.
(4) Retrieval-augmented context and instruction generation for CUDA code pretraining
Novelty: For web-sourced CUDA snippets, it generates instruction prompts (via Claude Sonnet 4) and retrieves relevant docs/examples to add context (Section 3.2.1).
Significance: This is presented as a way to incorporate knowledge that may be missing from the foundation model (e.g., newer libraries/architectures mentioned in Section 3.1).
(5) Discovery of non-obvious kernel strategies (e.g., padding to enable “better” tile sizes)
Novelty: CUDA-L2 sometimes pads input matrices with zeros so it can choose tile sizes where M is not divisible by B_M (Section 5.2).
Significance: The paper gives a concrete example where padding overhead (~1.6%) is outweighed by better tiling choices and yields large speedups for a specific shape (Section 5.2).

5. Experimental Analysis¶

5.1 Evaluation methodology¶

Task suite / dataset
1,000 HGEMM configurations covering all combinations of M,N,K ∈ {64, 128, 256, 512, 1024, 2048, 4096, 8192, 12288, 16384} (Abstract; Section 1).
Baselines
torch.matmul (Section 2.2.1).
cuBLAS with NN, TN, and max(NN,TN) (Section 2.2.2; Listing 6).
cuBLASLt-heuristic (NN/TN/max) using cached top heuristic algorithm (Section 2.2.3; Listing 7).
cuBLASLt-AutoTuning (NN/TN/max) benchmarking up to 100 candidates and selecting best median-time algorithm (Section 2.2.3; Listing 8).
Metrics
Primary metric is speedup ratio t_ref/t_custom − 1 (Eq. (2), Section 2.4).
The tables report mean/median/std of speedup over 1,000 configs and “>1” win counts (Table 1 screenshot included in the provided content).
Two scenarios
offline and server execution modes (Section 2.4.2; Figure 1; Table 1).

5.2 Main quantitative results (with specific numbers)¶

From Figure 1 / Table 1 (speedups are CUDA-L2 over baseline):

Offline mode (mean over 1,000 configs)
Over torch.matmul: +22.0% (Figure 1a; Table 1).
Over cuBLAS-max: +19.2% (Figure 1a; Table 1).
Over cuBLASLt-heuristic-max: +16.8% (Figure 1a; Table 1).
Over cuBLASLt-AutoTuning-max: +11.4% (Figure 1a; Table 1).
Server mode (mean over 1,000 configs)
Over torch.matmul: +28.7% (Figure 1b; Table 1).
Over cuBLAS-max: +26.0% (Figure 1b; Table 1).
Over cuBLASLt-heuristic-max: +22.4% (Figure 1b; Table 1).
Over cuBLASLt-AutoTuning-max: +15.9% (Figure 1b; Table 1).
Win rates / consistency (Table 1)
Example win counts (“>1” column) show CUDA-L2 wins in a large majority of configurations, including against the strongest baseline:
- cuBLASLt-AutoTuning-max: 793/1000 wins offline, 798/1000 wins server (Table 1).
The text summarizes win rates as spanning 79.3% to 95.7% across baselines (Section 4.1).

5.3 “Best-of-both-worlds” selection: max(CUDA-L2, baseline)¶

Table 2 reports a practical deployment scenario where you can choose the faster of CUDA-L2 and a baseline for each configuration:
Offline:
- vs torch.matmul: speedup increases from 22.0% → 23.1% (Table 2).
- vs cuBLAS-max: 19.2% → 20.2% (Table 2).
- vs cuBLASLt-heuristic-max: 16.8% → 17.0% (Table 2).
- vs cuBLASLt-AutoTuning-max: 11.4% → 13.2% (Table 2).
Server:
- vs torch.matmul: 28.7% → 29.8% (Table 2).
- vs cuBLAS-max: 26.0% → 27.2% (Table 2).
- vs cuBLASLt-heuristic-max: 22.4% → 22.7% (Table 2).
- vs cuBLASLt-AutoTuning-max: 15.9% → 18.1% (Table 2).

5.4 Results by problem size¶

Figure 3 analyzes speedup over cuBLASLt-AutoTuning-max vs size:
Speedup is highest on small problems: about ~1.4× when log2(M*N*K) ≈ 18–20 (Section 4.3).
Speedup trends toward ~1.0× on large problems: approximately baseline-level when log2(M*N*K) > 38 (Section 4.3).
The paper interprets this as expected:
Small GEMMs underutilize the GPU, leaving room for better tiling/memory patterns.
Large GEMMs saturate floating-point throughput, leaving less headroom (Section 4.3).

5.5 Do the experiments support the claims?¶

The evidence supporting “systematic improvement” is relatively strong within the provided excerpt because:
It evaluates across a large grid (1,000 shapes) rather than a small curated set (Abstract; Section 4.1).
It reports win counts across configurations and evaluates multiple baselines including cuBLASLt-AutoTuning (Table 1; Section 4.1).
It includes both offline and server-like timing modes (Section 2.4.2; Figure 1).
However, the excerpt does not include:
Hardware/software stack details (CUDA version, driver, clock settings) that can affect reproducibility.
Ablations that quantify the impact of each system enhancement (continued pretraining vs NCU metrics vs retrieval augmentation vs multi-stage RL), even though these are central claims in Section 3.2.

6. Limitations and Trade-offs¶

Hardware scope is limited in the current version
The paper states the current version focuses on A100 (Abstract; Conclusion). It is “designed for broad applicability,” but results shown are on A100 only.
Operation scope is limited
The system targets HGEMM with α = 1, β = 0 (Section 2.1) and does not discuss other GEMM variants (e.g., different epilogues, bias, activation fusion) in the provided excerpt.
Correctness criteria are pragmatic rather than formal
The binary {0,1} exact-match test ignores outputs ≥ 2048 (Section 2.3.2).
The baseline-bounded deviation test depends on variability across selected NVIDIA kernels and treats that as an acceptable bound (Section 2.3.2). This is reasonable engineering practice, but it is not a proof of numerical correctness.
Potential training/evaluation coupling
Reward includes a numerical deviation penalty and code-length penalty (Eq. (3), Section 3.2.3). This can bias solutions toward shorter code even when longer code might be more maintainable or robust, and can bias numerical behavior depending on the chosen α (not provided).
Implementation constraints may reduce portability
The approach compiles .cu and can use CuTe, inline PTX, intrinsics, and CUTLASS (Section 3.2.3), which can produce very architecture-specific kernels—consistent with the motivation that optimizations don’t transfer easily (Section 1). That also implies kernels may need retraining/retuning per architecture.
Missing transparency on training hyperparameters
The excerpt does not provide the RL training hyperparameters or compute budget (optimizer settings, batch sizes, tokens, number of RL steps), making it hard to assess cost/efficiency of achieving the reported gains.

7. Implications and Future Directions¶

Field-level implication
The results suggest LLM-guided RL can produce improvements even over mature vendor libraries for specific workloads/shapes, at least on A100 for HGEMM, by systematically searching a configuration space too large for manual tuning (Abstract; Section 4; Conclusion).
Practical applications
Deployment settings where many GEMM shapes occur (e.g., attention/FFN layers) may benefit from having a per-shape kernel cache, especially if the shapes match the evaluated grid (Abstract; Section 1).
The paper’s Table 2 frames a realistic integration pattern: keep baseline libraries available and select max(CUDA-L2, baseline) per configuration to improve average performance while avoiding regressions (Section 4.2; Table 2).
Research directions suggested by the paper
Architecture expansion: extending beyond A100 to other GPU families is explicitly described as ongoing work (Abstract; Conclusion).
Better generalization and maintainability: since CUDA-L2 uses retrieval-augmented context to incorporate new knowledge (Section 3.2.1, Section 3.2), a natural direction is to study how much retrieval vs retraining is needed when CUDA/CUTLASS/CuTe versions change (the excerpt does not provide such experiments, but it motivates retrieval for this reason).
Repro/Integration Guidance (based on what is described here)
If you can afford maintaining a kernel catalog keyed by (M,N,K) and selecting per-shape at runtime (or ahead of time), the max(CUDA-L2, baseline) approach in Table 2 indicates additional gains beyond always using CUDA-L2 or always using cuBLAS/cuBLASLt (Section 4.2).
If your workload is dominated by very large GEMMs, Figure 3 indicates headroom shrinks and speedups approach ~1.0× (Section 4.3), so the benefit may be smaller than for small/medium GEMMs.
The system’s methodology depends on robust timing (Listing 1), executability checks (compute-sanitizer) (Section 2.3.1), and correctness screening (Section 2.3.2); any reproduction should replicate these checks to avoid selecting “fast but wrong” kernels.