CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning¶
ArXiv: 2512.02551
đŻ Pitch¶
This paper introduces CUDA-L2, an LLM-guided reinforcement learning system that automatically generates and optimizes half-precision HGEMM CUDA kernels by using measured CUDA execution speed as the reward, exploring 1,000 (M,N,K) configurations. By systematically discovering tiling, pipelining, prefetching, and other low-level optimizations, CUDA-L2 delivers consistent speedups over torch.matmul and NVIDIAâs cuBLAS/cuBLASLt (e.g., +11.4% over cuBLASLt-AutoTuning in offline mode), showing that automated RL-driven kernel search can materially improve performance-critical matmul at scale.
1. Executive Summary (2-3 sentences)¶
CUDA-L2 is an automated system that uses an LLM trained with reinforcement learning to generate and optimize half-precision matrix multiplication (HGEMM) CUDA kernels, using measured CUDA execution speed as the reward. Across 1,000 (M, N, K) configurations on an A100 GPU, it reports systematic speedups not only over torch.matmul but also over NVIDIAâs heavily optimized cuBLAS/cuBLASLt baselines, including an average +11.4% over cuBLASLt-AutoTuning-max in offline mode and +15.9% in server mode (Figure 1, Table 1). The significance is that it demonstrates RL-guided code generation can still improve performance-critical GEMM kernels at scale, where manual tuning is difficult across many shapes and settings (Abstract, Section 1, Section 4).
2. Context and Motivation¶
- Problem / gap addressed
- Matrix multiplication (
matmul) dominates compute in LLM training and inference, so its kernel performance strongly impacts end-to-end speed (Section 1). -
Manual optimization across many matrix sizes is hard because:
- Optimal strategies depend on
(M, N, K)(Section 1). - Optimizations often do not transfer across GPU architectures (Section 1).
- Even for the same dimensions, setup details (example given: FP16 vs FP32 accumulators) can change register pressure and thus the best kernel design (Section 1).
- Optimal strategies depend on
-
Why it matters
- Even âdeeply optimizedâ matmul is not fully solved; the introduction cites a targeted
13%throughput gain from model-specific GEMM optimizations as evidence that more gains exist when effort is focused (Section 1). -
Scaling that kind of tuning to many shapes and production settings calls for automation (Section 1).
-
Prior approaches and shortcomings (as positioned here)
- Recent work uses LLMs (often with RL) to generate CUDA kernels (Section 1), including
AI CUDA Engineerand the authorsâ priorCUDA-L1(Section 1, Section 3.1). - However, the paper argues prior benchmark-focused efforts (e.g.,
KernelBench) typically optimize a single fixed configuration per task, so it is unclear how well they translate to production-like variability in dimensions (Section 1). -
It also states that, to their knowledge, no prior work achieves performance comparable to hand-optimized matmul kernels when compared against
cuBLAS(Section 1). -
How this work positions itself
CUDA-L2specifically targetsHGEMMand scales optimization across 1,000 shape configurations, covering all10^3combinations where each ofM, N, Kis chosen from{64, 128, 256, 512, 1024, 2048, 4096, 8192, 12288, 16384}(Abstract, Section 1).- It reports comparisons against practical baselines (
torch.matmul) and NVIDIA libraries (cuBLAS,cuBLASLtwith heuristic and autotuning) under two workload scenarios:offlineandserver(Abstract, Section 4, Figure 1, Table 1).
3. Technical Approach¶
3.1 Reader orientation (approachable technical breakdown)¶
CUDA-L2is an LLM-driven kernel optimizer that generates CUDA.cucode for half-precision GEMM and improves it via reinforcement learning.- It solves âgiven a specific
(M, N, K)HGEMM shape on a GPU, produce a fast correct CUDA kernel,â by iteratively generating code variants, compiling/running them, and using runtime (plus correctness/size penalties) as the training signal (Section 3.2.3, Section 2.3â2.4).
3.2 Big-picture architecture (diagram in words)¶
- (A) Continued pretraining data pipeline â builds a stronger CUDA-capable base model (Section 3.2.1).
- (B) General-kernel RL stage â trains the model to optimize diverse CUDA kernels using speedup rewards (Section 3.2.2).
- (C) HGEMM-specific RL stage â specializes the model to matmul kernels across many
(M, N, K)settings, using runtime + deviation + code-length penalties and adding profiling metrics into the prompt/context (Section 3.2.3). - (D) Evaluation harness â compiles generated
.cuwithnvcc, checks executability and correctness, measures time robustly, and computes speedups (Section 2.3â2.4, Listing 1).
3.3 Roadmap for the deep dive¶
- Explain (1) what
HGEMMkernels look like and what is being optimized (Section 2.1). - Explain (2) how kernels are judged: executability, correctness, and timing/speedup measurement (Section 2.3â2.4).
- Explain (3) how
CUDA-L2trains: pretraining expansion, then two-stage RL (general â HGEMM) (Section 3.2.1â3.2.3). - Explain (4) what signals guide optimization beyond raw runtime (profiling metrics; reward shaping) (Section 3.2.3).
- Explain (5) what kinds of kernel-level optimizations it discovers (Section 5) and how it analyzes learned hyperparameter patterns (Section 6).
3.4 Detailed, sentence-based technical breakdown¶
This is an empirical systems + algorithmic training paper: it builds an LLM+RL system that generates CUDA HGEMM kernels and uses measured GPU performance (and correctness constraints) to iteratively improve them across a large configuration space (Abstract; Section 3.2; Section 4).
3.4.1 What is HGEMM and what knobs exist?¶
- The target computation is half-precision GEMM, focused on the common case
C = A B(i.e.,α = 1,ÎČ = 0) whereA â R^{MĂK}andB â R^{KĂN}(Section 2.1). - Modern high-performance HGEMM uses a tiled structure (Section 2.1):
- Tiling/partitioning: Split
AandBinto tiles; each thread block computes an output tile of sizeB_M Ă B_N. - Main loop over K: Load
B_M Ă B_KfromAandB_K Ă B_NfromBfrom global memory â shared memory â registers; use tensor cores for the multiply-accumulate; accumulate partial sums in registers; repeat across K tiles. - Epilogue: Write accumulated register results back (registers â shared memory â global memory).
- The paper later refers to key kernel âhyperparametersâ such as
B_M,B_N,B_Ktile sizes (Section 6.1), number of pipeline stagesn_stage(Section 5.3, Section 6.2), and block swizzle parameters such asswizzle_stride(Section 5.3, Section 6.3). - These are not LLM training hyperparameters; they are kernel configuration parameters that affect memory reuse, parallelism, and resource usage.
3.4.2 Baselines and what âNN/TN/maxâ mean (so the comparisons are interpretable)¶
torch.matmulis treated as a practical baseline that dispatches to optimized kernels (for FP16 it usescuBLASinternally) but includes framework overhead (Section 2.2.1).cuBLASbaseline usescublasGemmExwithCUBLAS_GEMM_DEFAULT_TENSOR_OPto enable Ampere FP16 tensor cores and rely on cuBLASâs internal algorithm selection (Section 2.2.2; Listing 6).- Two data layouts are benchmarked:
NN: ânormal-normalâ (row-major A, row-major B).TN: âtransposed-normalâ (mixed layout; described in Listing 6 asArow major andBcol major, andcublasGemmExusesCUBLAS_OP_Tfor one operand).
cuBLAS-maxmeans âtake the faster ofNNandTNfor each configuration,â i.e.,max(cuBLAS-NN, cuBLAS-TN)(Section 2.2.2).cuBLASLtbaselines (Section 2.2.3) include:cuBLASLt-heuristic: querycublasLtMatmulAlgoGetHeuristic, pick the top-ranked suggestion (results[0]), cache it, and evaluate (Section 2.2.3; Listing 7).cuBLASLt-AutoTuning: query up to 100 candidates from the heuristic API, benchmark them with randomized ordering and random inputs, and choose the algorithm with best median time; cache the chosen algorithm (Section 2.2.3; Listing 8).- Both are tested in
NNandTN;*-maxtakes the best layout per configuration.
3.4.3 What happens first, second, third: the CUDA-L2 training pipeline¶
- Continued pretraining expands the modelâs CUDA competence beyond KernelBench (Section 3.2.1).
- CUDA code is collected from:
- Web sources (cleaned/segmented with rule-based filtering + LLM-based cleaning).
- Established libraries/tutorials (PyTorch, ATen, CUTLASS, NVIDIA tutorials/examples, etc.) (Section 3.2.1).
- Because raw CUDA code usually lacks an âinstruction prompt,â the pipeline uses
Claude Sonnet 4to generate instruction descriptions for each snippet (Section 3.2.1). - For retrieval augmentation, each instruction is used as a search query; retrieved documentation/examples are concatenated as extra context (Section 3.2.1).
-
The resulting instruction+context+code triplets are used for continued pretraining on
DeepSeek 671B(Section 3.2.1).- The excerpt does not provide optimizer settings, learning-rate schedule, batch size, context length, or token counts for this training.
-
A âgeneral kernel RLâ stage trains optimization behavior across diverse kernels (Section 3.2.2).
- The authors collect roughly
~1KCUDA kernels from established libraries spanning many operation types (linear algebra, convolutions, reductions, elementwise ops, attention, etc.) and pair each with an âofficial/successful implementationâ (Section 3.2.2). - RL reward is the average speedup score across test iterations (Section 3.2.2), and training uses a contrastive RL strategy from
CUDA-L1where the model compares previously generated variants and their measured performance (Section 3.2.2). - Parameter updates use
GRPO(Section 3.2.2). -
Rewards are âsmoothed and clippedâ to reduce reward hacking effects (Section 3.2.2), but the excerpt does not specify the smoothing/clipping functions or thresholds.
-
An âHGEMM RLâ stage specializes the model to matmul kernels across many
(M, N, K)(Section 3.2.3). - The same contrastive RL strategy is used, but now restricted to HGEMM with varying shapes (Section 3.2.3).
- The reward explicitly combines performance, a numerical-deviation penalty, and a code-length penalty (Eq. (3), Section 3.2.3):
- In plain language: the model gets higher reward if the kernel runs faster than a reference, but it is penalized if output deviates from a FP32 CPU reference and if the generated code is long.
- The paper defines:
r(custom) = (1/N) * ÎŁ_i [ (t_ref^i / t_custom^i) â α * diff_i ] â ÎČ * L(custom)(Eq. (3), Section 3.2.3),diff_i = max_j |out_FP32^i[j] â out_custom^i[j]|(Section 3.2.3),L(custom)is code length (Section 3.2.3),α > 0andÎČ > 0are penalty coefficients (values not provided in the excerpt).
-
The HGEMM RL context includes detailed
NCU(Nsight Compute) profiling metrics (memory throughput, SM occupancy, cache efficiency, etc.) so the model can condition on performance characteristics rather than only end-to-end time (Section 3.2.3). -
Kernel generation constraints / allowed implementation tools
- Generated kernels are emitted as CUDA code in
.cufiles and compiled withnvcc(Section 3.2.3, Section 2.4.1). - The system allows CUDA C/C++,
CuTe, inlinePTX, CUDA intrinsics, andCUTLASStemplates (Section 3.2.3). - It explicitly disallows Python-based DSLs like
Triton(Section 3.2.3).
3.4.4 Executability, correctness, and robust timing (to prevent âreward hackingâ)¶
- Executability criteria
-
A candidate kernel must compile, launch, and finish within a time limit; memory violations are checked with
compute-sanitizer --tool memcheck(Section 2.3.1). -
Correctness criteria (two-pronged)
- Reference output is computed by an FP32 CPU matmul:
>ref = MatMul(float(a.cpu()), float(b.cpu()))(Eq. (1), Section 2.3.2). -
Because floating-point arithmetic is non-associative, exact equality is not required in general; instead the paper uses:
- Exact match under binary inputs
{0,1}below an FP16-exactness threshold (Section 2.3.2). - For entries where
c_ref[i,j] < 2048, requirec_custom[i,j] == c_ref[i,j]exactly; ignore positions withc_ref[i,j] > 2048(Section 2.3.2). - The reasoning given is FP16 has 11 bits of significand precision, so integers in
[0, 2048)are exactly representable (Section 2.3.2). - Baseline-bounded deviation (Section 2.3.2).
- Compute outputs from multiple NVIDIA baselines (
cuBLAS,cuBLASLt-heuristic,cuBLASLt-AutoTuning, bothNN/TN) and take the maximum elementwise difference among them as an âupper bound of variability.â - A custom kernel is marked incorrect if its max elementwise deviation from FP32 CPU exceeds that bound (Section 2.3.2).
- Exact match under binary inputs
-
Timing and speedup computation
- Speedup for one run is defined as
s(custom) = t_ref / t_custom â 1(Eq. (2), Section 2.4). - Each evaluation includes:
- 10-second warmup, then at least 30 seconds of timing (Section 2.4).
- Randomized execution order each iteration to reduce ordering/caching artifacts (Section 2.4).
- Final score is the mean speed score over runs (Section 2.4).
-
Timing is done with CUDA events and
torch.cuda.synchronize()around the kernel call (Listing 1, Section 2.4.1). -
Avoiding timing-measurement hacking
- The authors state that prior work identified âhacksâ such as launching asynchronous work in other streams or exploiting Python lazy execution (Section 2.4.1).
- CUDA-L2 mitigates these by:
- Disallowing additional CUDA stream creation.
- Generating only
.cucode, bypassing Pythonâs lazy evaluation (Section 2.4.1).
3.4.5 Offline vs server evaluation scenarios (why two modes)¶
offline: kernels run back-to-back with no pauses, representing peak throughput with warm caches (Section 2.4.2; Figure 1a).server: kernels run with random idle intervals to simulate real-time inference request arrivals; the interval time itself is not included in the measured kernel execution time (Section 2.4.2; Figure 1b).- The paper attributes differing results and higher variance in server mode to GPU thermal/caching dynamics under idling vs sustained load (Section 4.1).
3.4.6 Missing âcore configurations and hyperparametersâ (explicitly not in the excerpt)¶
The prompt requests model/training hyperparameters like optimizer settings, learning-rate schedule, batch size, context window, tokenizer, number of layers/heads, total training tokens, compute budget, and hardware details. These are not specified in the provided excerpt, except for:
- Target GPU architecture focus: A100 (Abstract; Conclusion).
- Pretraining base model name: DeepSeek 671B (Section 3.2.1).
- RL update algorithm name: GRPO (Section 3.2.2).
- Use of nvcc compilation and NCU profiling (Section 3.2.3).
4. Key Insights and Innovations¶
- (1) RL directly optimizes CUDA HGEMM kernels using real execution-time rewards across a large shape grid
- Novelty: Unlike single-configuration benchmarks, the system optimizes across
1,000(M,N,K)combinations (Abstract; Section 1). -
Significance: The reported improvements are âsystematic rather than driven by outliers,â supported by win rates across configurations (Section 4.1; Table 1).
-
(2) A multi-stage training recipe: continued CUDA pretraining â general-kernel RL â HGEMM RL
- Novelty: The paper emphasizes progressive specializationâfirst broaden general CUDA capabilities, then train on diverse kernels, then specialize on matmul (Section 3.2.1â3.2.3).
-
Significance: This is positioned as necessary because matmul is substantially harder than typical KernelBench-style kernels (Section 3.1).
-
(3) Incorporating low-level profiling metrics (NCU) into the RL context
- Novelty: The model conditions on metrics like memory throughput, SM occupancy, and cache efficiency, not just wall-clock time (Section 3.2.3).
-
Significance: This is meant to make optimization decisions more informed (e.g., whether changes improve occupancy vs memory behavior), though the excerpt does not include an ablation isolating this factor.
-
(4) Retrieval-augmented context and instruction generation for CUDA code pretraining
- Novelty: For web-sourced CUDA snippets, it generates instruction prompts (via Claude Sonnet 4) and retrieves relevant docs/examples to add context (Section 3.2.1).
-
Significance: This is presented as a way to incorporate knowledge that may be missing from the foundation model (e.g., newer libraries/architectures mentioned in Section 3.1).
-
(5) Discovery of non-obvious kernel strategies (e.g., padding to enable âbetterâ tile sizes)
- Novelty: CUDA-L2 sometimes pads input matrices with zeros so it can choose tile sizes where
Mis not divisible byB_M(Section 5.2). - Significance: The paper gives a concrete example where padding overhead (~
1.6%) is outweighed by better tiling choices and yields large speedups for a specific shape (Section 5.2).
5. Experimental Analysis¶
5.1 Evaluation methodology¶
- Task suite / dataset
1,000HGEMM configurations covering all combinations ofM,N,K â {64, 128, 256, 512, 1024, 2048, 4096, 8192, 12288, 16384}(Abstract; Section 1).- Baselines
torch.matmul(Section 2.2.1).cuBLASwithNN,TN, andmax(NN,TN)(Section 2.2.2; Listing 6).cuBLASLt-heuristic(NN/TN/max) using cached top heuristic algorithm (Section 2.2.3; Listing 7).cuBLASLt-AutoTuning(NN/TN/max) benchmarking up to 100 candidates and selecting best median-time algorithm (Section 2.2.3; Listing 8).- Metrics
- Primary metric is speedup ratio
t_ref/t_custom â 1(Eq. (2), Section 2.4). - The tables report mean/median/std of speedup over 1,000 configs and â>1â win counts (Table 1 screenshot included in the provided content).
- Two scenarios
offlineandserverexecution modes (Section 2.4.2; Figure 1; Table 1).
5.2 Main quantitative results (with specific numbers)¶
From Figure 1 / Table 1 (speedups are CUDA-L2 over baseline):
- Offline mode (mean over 1,000 configs)
- Over
torch.matmul:+22.0%(Figure 1a; Table 1). - Over
cuBLAS-max:+19.2%(Figure 1a; Table 1). - Over
cuBLASLt-heuristic-max:+16.8%(Figure 1a; Table 1). -
Over
cuBLASLt-AutoTuning-max:+11.4%(Figure 1a; Table 1). -
Server mode (mean over 1,000 configs)
- Over
torch.matmul:+28.7%(Figure 1b; Table 1). - Over
cuBLAS-max:+26.0%(Figure 1b; Table 1). - Over
cuBLASLt-heuristic-max:+22.4%(Figure 1b; Table 1). -
Over
cuBLASLt-AutoTuning-max:+15.9%(Figure 1b; Table 1). -
Win rates / consistency (Table 1)
- Example win counts (â>1â column) show CUDA-L2 wins in a large majority of configurations, including against the strongest baseline:
cuBLASLt-AutoTuning-max:793/1000wins offline,798/1000wins server (Table 1).
- The text summarizes win rates as spanning
79.3%to95.7%across baselines (Section 4.1).
5.3 âBest-of-both-worldsâ selection: max(CUDA-L2, baseline)¶
- Table 2 reports a practical deployment scenario where you can choose the faster of CUDA-L2 and a baseline for each configuration:
- Offline:
- vs
torch.matmul: speedup increases from22.0%â23.1%(Table 2). - vs
cuBLAS-max:19.2%â20.2%(Table 2). - vs
cuBLASLt-heuristic-max:16.8%â17.0%(Table 2). - vs
cuBLASLt-AutoTuning-max:11.4%â13.2%(Table 2).
- vs
- Server:
- vs
torch.matmul:28.7%â29.8%(Table 2). - vs
cuBLAS-max:26.0%â27.2%(Table 2). - vs
cuBLASLt-heuristic-max:22.4%â22.7%(Table 2). - vs
cuBLASLt-AutoTuning-max:15.9%â18.1%(Table 2).
- vs
5.4 Results by problem size¶
- Figure 3 analyzes speedup over
cuBLASLt-AutoTuning-maxvs size: - Speedup is highest on small problems: about
~1.4Ăwhenlog2(M*N*K) â 18â20(Section 4.3). - Speedup trends toward
~1.0Ăon large problems: approximately baseline-level whenlog2(M*N*K) > 38(Section 4.3). - The paper interprets this as expected:
- Small GEMMs underutilize the GPU, leaving room for better tiling/memory patterns.
- Large GEMMs saturate floating-point throughput, leaving less headroom (Section 4.3).
5.5 Do the experiments support the claims?¶
- The evidence supporting âsystematic improvementâ is relatively strong within the provided excerpt because:
- It evaluates across a large grid (1,000 shapes) rather than a small curated set (Abstract; Section 4.1).
- It reports win counts across configurations and evaluates multiple baselines including
cuBLASLt-AutoTuning(Table 1; Section 4.1). - It includes both offline and server-like timing modes (Section 2.4.2; Figure 1).
- However, the excerpt does not include:
- Hardware/software stack details (CUDA version, driver, clock settings) that can affect reproducibility.
- Ablations that quantify the impact of each system enhancement (continued pretraining vs NCU metrics vs retrieval augmentation vs multi-stage RL), even though these are central claims in Section 3.2.
6. Limitations and Trade-offs¶
- Hardware scope is limited in the current version
-
The paper states the current version focuses on
A100(Abstract; Conclusion). It is âdesigned for broad applicability,â but results shown are on A100 only. -
Operation scope is limited
-
The system targets
HGEMMwithα = 1,ÎČ = 0(Section 2.1) and does not discuss other GEMM variants (e.g., different epilogues, bias, activation fusion) in the provided excerpt. -
Correctness criteria are pragmatic rather than formal
- The binary
{0,1}exact-match test ignores outputsâ„ 2048(Section 2.3.2). -
The baseline-bounded deviation test depends on variability across selected NVIDIA kernels and treats that as an acceptable bound (Section 2.3.2). This is reasonable engineering practice, but it is not a proof of numerical correctness.
-
Potential training/evaluation coupling
-
Reward includes a numerical deviation penalty and code-length penalty (Eq. (3), Section 3.2.3). This can bias solutions toward shorter code even when longer code might be more maintainable or robust, and can bias numerical behavior depending on the chosen
α(not provided). -
Implementation constraints may reduce portability
-
The approach compiles
.cuand can useCuTe, inline PTX, intrinsics, and CUTLASS (Section 3.2.3), which can produce very architecture-specific kernelsâconsistent with the motivation that optimizations donât transfer easily (Section 1). That also implies kernels may need retraining/retuning per architecture. -
Missing transparency on training hyperparameters
- The excerpt does not provide the RL training hyperparameters or compute budget (optimizer settings, batch sizes, tokens, number of RL steps), making it hard to assess cost/efficiency of achieving the reported gains.
7. Implications and Future Directions¶
- Field-level implication
-
The results suggest LLM-guided RL can produce improvements even over mature vendor libraries for specific workloads/shapes, at least on A100 for HGEMM, by systematically searching a configuration space too large for manual tuning (Abstract; Section 4; Conclusion).
-
Practical applications
- Deployment settings where many GEMM shapes occur (e.g., attention/FFN layers) may benefit from having a per-shape kernel cache, especially if the shapes match the evaluated grid (Abstract; Section 1).
-
The paperâs Table 2 frames a realistic integration pattern: keep baseline libraries available and select
max(CUDA-L2, baseline)per configuration to improve average performance while avoiding regressions (Section 4.2; Table 2). -
Research directions suggested by the paper
- Architecture expansion: extending beyond A100 to other GPU families is explicitly described as ongoing work (Abstract; Conclusion).
-
Better generalization and maintainability: since CUDA-L2 uses retrieval-augmented context to incorporate new knowledge (Section 3.2.1, Section 3.2), a natural direction is to study how much retrieval vs retraining is needed when CUDA/CUTLASS/CuTe versions change (the excerpt does not provide such experiments, but it motivates retrieval for this reason).
-
Repro/Integration Guidance (based on what is described here)
- If you can afford maintaining a kernel catalog keyed by
(M,N,K)and selecting per-shape at runtime (or ahead of time), themax(CUDA-L2, baseline)approach in Table 2 indicates additional gains beyond always using CUDA-L2 or always using cuBLAS/cuBLASLt (Section 4.2). - If your workload is dominated by very large GEMMs, Figure 3 indicates headroom shrinks and speedups approach
~1.0Ă(Section 4.3), so the benefit may be smaller than for small/medium GEMMs. - The systemâs methodology depends on robust timing (Listing 1), executability checks (
compute-sanitizer) (Section 2.3.1), and correctness screening (Section 2.3.2); any reproduction should replicate these checks to avoid selecting âfast but wrongâ kernels.