Inference Optimization
Papers on faster inference, decoding strategies, and serving efficiency.
Overview
This section contains four papers:
- Classifier-Free Guidance - Inference-time method for better prompt adherence
- Staged Speculative Decoding - Tree-based speculation for on-device inference
- FlashFFTConv - FFT-based convolutions optimized for long sequences
- FlashAttention - Hardware-aware (IO-aware) exact attention