Quantization

This repository explores quantization methods for Large Language Models (LLMs).
Quantization is a key technique for reducing model size and inference cost, enabling LLMs to run efficiently on consumer hardware or with limited GPU memory.
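As a rough illustration of the size reduction, weight storage scales with bits per weight. The sketch below is back-of-the-envelope arithmetic only: the function name `estimate_weight_size_mb` and the 125M parameter count are illustrative assumptions, not part of this repository, and the estimate ignores quantization metadata (scales, zero-points) and any layers left unquantized.

```python
def estimate_weight_size_mb(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage in MB (1 MB = 10**6 bytes).

    Ignores per-block scales/zero-points and unquantized layers, so real
    quantized checkpoints will be somewhat larger than this estimate.
    """
    return num_params * bits_per_weight / 8 / 1e6


# Assumed round figure for a model in the ~125M-parameter range.
num_params = 125_000_000

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    size_mb = estimate_weight_size_mb(num_params, bits)
    print(f"{label:>5}: {size_mb:8.1f} MB")
```

Going from 16-bit to 4-bit weights is therefore at most a 4x reduction in weight storage; measured disk footprints are usually a little less favourable because of the extra metadata.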

We provide examples and experiments for:

- bitsandbytes (bnb) – NF4 (4-bit NormalFloat) quantization using the Hugging Face integration.
- AWQ (Activation-aware Weight Quantization) – a method that preserves accuracy by considering activation statistics.
- GPTQ (post-training quantization for generative pre-trained transformers) – one-shot post-training weight quantization optimized for autoregressive transformers.
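All three methods store weights as low-bit codes plus some scaling information. The sketch below shows the simplest version of that idea, symmetric absmax round-to-nearest quantization of a single block of weights, in pure Python. It is a deliberately simplified stand-in and not the algorithm used by any of the libraries above: NF4 uses a non-uniform codebook, AWQ rescales weights based on activation statistics, and GPTQ compensates for rounding error using approximate second-order information. The function names and sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_block(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]  # made-up values for illustration
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("scale:", scale)
print("max abs error:", max_err)
```

With round-to-nearest, the reconstruction error of each weight is bounded by half the scale, which is why smaller blocks (each with its own scale) tend to preserve accuracy better at the cost of more metadata.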

📖 What to Expect in This Repo

Implementation Examples
- Scripts for loading, quantizing, and saving models.
- Examples include small models (for example, facebook/opt-125m) so you can try things quickly, and notes for scaling to larger models.
Benchmarks
- Inference time comparisons before and after quantization.
- Model size reduction (disk footprint in MB).
Guides & Utilities
- Helper functions for measuring model size, timing inference, and testing the quantized model.
- Notes on how to run the examples on Puhti, Mahti and LUMI.
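The size and timing measurements mentioned above can be sketched with the standard library alone. This is a minimal illustration under assumed names (`dir_size_mb`, `time_call`); the helpers shipped in this repository may differ in naming and behaviour.

```python
import os
import time
from typing import Callable


def dir_size_mb(path: str) -> float:
    """Total size of all regular files under `path`, in MB (10**6 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # skip symlinks to avoid double counting
                total += os.path.getsize(file_path)
    return total / 1e6


def time_call(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> float:
    """Mean wall-clock seconds per call of `fn`, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats
```

A typical use would be to call `dir_size_mb` on the directory a model was saved to before and after quantization, and to pass `time_call` a zero-argument function that runs one generation. When timing on a GPU, the work is asynchronous, so the timed function should synchronize the device before returning; otherwise the measurement only captures kernel launch time.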
