Quantization for LLM batch inference
Quantization reduces the numerical precision of model weights, converting them from higher-precision formats such as BF16 to lower-precision formats such as FP8, INT8, or INT4. This technique significantly reduces GPU memory requirements, letting you process more concurrent requests and improve overall throughput.
For additional optimization strategies beyond quantization, see Optimize throughput for Ray Data LLM batch inference.
Benefits for batch inference
Quantization provides several advantages for batch workloads:
- Reduced memory usage: Enables larger models to run on fewer or smaller GPUs. FP8 quantization typically halves memory requirements compared to BF16 (see the back-of-the-envelope sketch after this list).
- Increased throughput: Allows processing more concurrent sequences by freeing up GPU memory for KV cache.
- Increased scalability: Enables you to scale up the number of inference workers running on the same cluster.
- Lower costs: Enables use of smaller, cheaper GPUs for the same workload.
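To make the memory savings concrete, the following back-of-the-envelope sketch estimates weight memory for a hypothetical 8B-parameter model at different precisions. It counts weights only and ignores KV cache, activations, and runtime overhead, so treat the numbers as rough estimates.

```python
# Rough estimate of weight memory at different precisions for a
# hypothetical 8B-parameter model. Weights only: KV cache, activations,
# and runtime overhead come on top of this.
NUM_PARAMS = 8e9

bytes_per_param = {"BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gib = NUM_PARAMS * nbytes / 1024**3
    print(f"{precision}: ~{gib:.0f} GiB of weights")

# The memory freed by quantization goes to the KV cache, which is what
# lets the engine batch more concurrent sequences.
```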
Configure quantization
vLLM supports several quantization methods through Ray Data LLM:
| Method | Precision | Memory reduction | Use case |
|---|---|---|---|
| FP8 | 8-bit floating point | ~50% vs BF16 | Best balance of quality and efficiency |
| INT8 | 8-bit integer | ~50% vs BF16 | Good for memory-constrained scenarios |
| INT4 | 4-bit integer | ~75% vs BF16 | Maximum memory savings, quality trade-off |
| GPTQ | Variable (2-8 bit) | Up to 75% vs BF16 | Pre-quantized models from Hugging Face |
| AWQ | 4-bit | ~75% vs BF16 | Optimized for generation quality |
For the complete list of quantization methods and configuration options, see the vLLM quantization documentation.
Ray Data LLM forwards vLLM engine parameters through the engine_kwargs argument in vLLMEngineProcessorConfig. You can configure quantization methods, data types, and other vLLM-specific parameters this way.
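For example, vLLM can apply FP8 quantization on the fly to an unquantized checkpoint when you pass quantization="fp8" through engine_kwargs. The following is a minimal sketch; the model name, concurrency, and batch size are illustrative placeholders to adapt to your workload:

```python
from ray.data.llm import vLLMEngineProcessorConfig

# Minimal sketch: dynamic FP8 quantization of an unquantized checkpoint.
# The model name, concurrency, and batch size are placeholders.
config = vLLMEngineProcessorConfig(
    model_source="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "quantization": "fp8",  # vLLM quantizes the weights at load time
        "max_model_len": 8192,
    },
    concurrency=1,
    batch_size=64,
)
```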
The following example uses BitsAndBytes to perform in-flight quantization, loading a Llama model in 4-bit precision:
```python
# Install BitsAndBytes first:
# pip install bitsandbytes>=0.46.1
from ray.data.llm import vLLMEngineProcessorConfig
import torch

config = vLLMEngineProcessorConfig(
    model_source="huggyllama/llama-7b",
    ...,
    engine_kwargs={
        "quantization": "bitsandbytes",  # quantize the weights in-flight with BitsAndBytes
        "trust_remote_code": True,
        "dtype": torch.bfloat16,
        ...
    },
)
```
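To run batch inference with this configuration, build a processor from the config and apply it to a dataset. The preprocess and postprocess functions below are illustrative sketches; adapt the prompt construction and output columns to your own schema.

```python
import ray
from ray.data.llm import build_llm_processor

# Illustrative pipeline; adapt the prompt construction and output columns
# to your own dataset schema.
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(temperature=0.0, max_tokens=256),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"prompt": "What is quantization?"}])
ds = processor(ds)
ds.show(limit=1)
```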
GPU compatibility
Not all GPUs support all quantization methods. Verify your GPU supports the quantization method you plan to use:
| Quantization method | GPU requirements | Recommended GPUs |
|---|---|---|
| FP8 | Ada Lovelace or Hopper architecture with FP8 Tensor Cores | H100, H200, L4, L40S |
| INT8 | Turing architecture and later generations (Ampere recommended) | A100, A10G, L4, H100, T4 |
| INT4/GPTQ/AWQ | Most GPUs with CUDA compute capability 7.0+ | A10G, L4, A100, H100 |
Always verify GPU compatibility before deploying quantized models. Incompatible configurations can cause runtime errors or performance degradation. See the vLLM compatibility matrix.
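As a quick sanity check before launching a job, you can query the GPU's CUDA compute capability with PyTorch. The thresholds in this sketch mirror the table above (FP8 requires Ada Lovelace, compute capability 8.9, or Hopper, 9.0); treat them as a heuristic and defer to the vLLM compatibility matrix.

```python
import torch

# Heuristic check based on CUDA compute capability; defer to the vLLM
# compatibility matrix for the authoritative answer for your method.
cc = torch.cuda.get_device_capability()  # (major, minor), e.g. (8, 9) for L4
print(f"{torch.cuda.get_device_name()}: compute capability {cc[0]}.{cc[1]}")

if cc >= (8, 9):    # Ada Lovelace / Hopper and newer
    print("FP8 tensor cores available; FP8, INT8, and INT4 methods supported")
elif cc >= (7, 5):  # Turing and newer
    print("No FP8 hardware support; INT8 and INT4 methods generally work")
elif cc >= (7, 0):
    print("INT4/GPTQ/AWQ generally supported")
else:
    print("Most quantization kernels are unsupported on this GPU")
```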
Evaluate quality trade-offs
Quantization reduces precision, which can impact model quality. Before deploying to production, test your batch inference on a sample dataset and compare outputs with the unquantized model. Evaluate task-specific quality metrics such as accuracy, BLEU, or perplexity, and pay attention to performance on rare or complex inputs where quantization effects may be more pronounced.
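One lightweight way to compare runs is to score agreement between outputs from the quantized and unquantized configurations over the same prompts. The sketch below uses exact match as a crude proxy; the helper function and sample answers are illustrative, and you should substitute a task-appropriate metric for real evaluations.

```python
def exact_match_agreement(baseline_answers, quantized_answers):
    """Fraction of prompts where the quantized model's output exactly
    matches the unquantized baseline (a crude proxy for quality)."""
    matches = sum(
        b.strip() == q.strip()
        for b, q in zip(baseline_answers, quantized_answers)
    )
    return matches / len(baseline_answers)

# Illustrative placeholder outputs; in practice, collect these from two
# batch inference runs over the same sample dataset.
baseline = ["Paris", "42", "The Pacific Ocean"]
quantized = ["Paris", "42", "The Pacific"]
print(f"Exact-match agreement: {exact_match_agreement(baseline, quantized):.1%}")
```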
Most modern quantization methods such as FP8 maintain 99%+ of original model quality, but always validate for your specific use case.