Inference Optimization

Papers on speeding up inference, decoding strategies, and serving efficiency.

Overview

This section covers four papers:

  • Classifier-Free Guidance - Inference-time method for better prompt adherence
  • Staged Speculative Decoding - Tree-based speculation for on-device inference
  • FlashFFTConv - Optimized FFT for long-sequence convolutions
  • FlashAttention - Hardware-aware attention optimization