This repository explores quantization methods for Large Language Models (LLMs).
Quantization is a key technique for reducing model size and inference cost, enabling LLMs to run efficiently on consumer hardware or with limited GPU memory.
We provide examples and experiments for:
- bitsandbytes (bnb): NF4 quantization using the Hugging Face integration.
- AWQ (Activation-aware Weight Quantization): a method that preserves accuracy by considering activation statistics.
- GPTQ (post-training quantization for generative pre-trained transformers): a one-shot post-training method designed for autoregressive transformers.
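As a rough illustration of the bitsandbytes route, the sketch below loads a model in NF4 through the `transformers` integration. The helper name `load_nf4_model` and the chosen settings (float16 compute dtype, double quantization) are assumptions made for this example, not the repository's own scripts. It also assumes `torch`, `transformers`, `accelerate`, and `bitsandbytes` are installed and a CUDA GPU is available.

```python
def load_nf4_model(model_id: str = "facebook/opt-125m"):
    """Load `model_id` with 4-bit NF4 weights (hypothetical helper)."""
    # Imported lazily so this file can be imported without the GPU stack installed.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4 bits
        bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls (assumed choice)
        bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # let accelerate place the layers
    )
    return model, tokenizer
```

Calling `load_nf4_model()` with the default argument downloads the small OPT checkpoint, which keeps the first run quick.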
Implementation Examples
- Scripts for loading, quantizing, and saving models.
- Examples use small models (for example, facebook/opt-125m) so you can try things quickly, with notes on scaling to larger models.
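A load, quantize, and save script could look roughly like the sketch below, which uses the GPTQ support in `transformers`. The function name `quantize_and_save_gptq`, the output directory name, and the choice of `"c4"` as calibration dataset are assumptions for illustration. The sketch also assumes `optimum` and a GPTQ backend are installed alongside `transformers`, and that a GPU is available for calibration.

```python
def quantize_and_save_gptq(
    model_id: str = "facebook/opt-125m",
    out_dir: str = "opt-125m-gptq-4bit",  # hypothetical output directory
    bits: int = 4,
) -> str:
    """Quantize `model_id` with GPTQ and save it to `out_dir` (hypothetical helper)."""
    # Imported lazily so this file can be imported without the GPU stack installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # GPTQ needs calibration data; "c4" is one of the built-in dataset names.
    gptq_config = GPTQConfig(bits=bits, dataset="c4", tokenizer=tokenizer)

    # Passing the config to from_pretrained triggers quantization while loading.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=gptq_config,
        device_map="auto",
    )

    # Save the quantized weights and the tokenizer side by side.
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir
```

The saved directory can then be reloaded with `AutoModelForCausalLM.from_pretrained(out_dir, device_map="auto")`, without repeating the calibration step.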
Benchmarks
- Inference time comparisons before and after quantization.
- Model size reduction (disk footprint in MB).
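The two measurements above can be sketched with the standard library alone. The helper names `dir_size_mb` and `mean_latency_s` are hypothetical, and 1 MB is taken as 10^6 bytes. When timing on a GPU, the timed callable should itself wait for the device to finish (for example by calling `torch.cuda.synchronize()`), otherwise the numbers only reflect kernel launch time.

```python
import os
import time
from statistics import mean


def dir_size_mb(path: str) -> float:
    """Disk footprint of a file or directory tree in MB (1 MB = 1e6 bytes)."""
    if os.path.isfile(path):
        return os.path.getsize(path) / 1e6
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6


def mean_latency_s(fn, warmup: int = 1, repeats: int = 5) -> float:
    """Mean wall-clock time of `fn()` in seconds, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-off costs such as lazy initialization
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

Running `dir_size_mb` on the saved baseline and quantized directories, and `mean_latency_s` on a closure that calls `model.generate` for each model, gives the before and after figures.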
Guides & Utilities
- Helper functions for measuring model size, timing inference, and testing the quantized model.
- Notes on how to run the examples on Puhti, Mahti and LUMI.
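A minimal sketch of two such helpers follows: one checks that a quantized model still generates text, the other reports its in-memory size. The names `smoke_test` and `memory_footprint_mb` and the default prompt are assumptions for illustration. The code expects a Hugging Face `transformers` model and tokenizer, and relies on the model's `generate` and `get_memory_footprint` methods.

```python
def smoke_test(model, tokenizer, prompt: str = "Quantization is",
               max_new_tokens: int = 20) -> str:
    """Generate a short continuation and fail loudly if it is empty."""
    # Tokenize and move the inputs to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    if not text.strip():
        raise ValueError("model produced empty output")
    return text


def memory_footprint_mb(model) -> float:
    """In-memory size of the model's parameters and buffers in MB (1 MB = 1e6 bytes)."""
    return model.get_memory_footprint() / 1e6
```

A non-empty string only shows that the model runs end to end; it says nothing about accuracy, so it complements a proper evaluation rather than replacing one.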