Troubleshoot LLM serving
This guide helps you diagnose and resolve common issues when deploying large language models with Ray Serve on Anyscale. Each section provides quick fixes and explains the underlying causes to help you troubleshoot efficiently.
Hugging Face authentication errors
Some models, such as the LLaMA family, are gated on Hugging Face. To use them, navigate to the model's card and follow the prompt to request access. Approval can take anywhere from a few hours to several weeks.
Once access is granted, share your Hugging Face token with your Ray cluster. You can set it in your runtime environment with HF_TOKEN or pass it to your vLLM engine with hf-token.
applications:
- ...
  args:
    llm_configs:
      ...
      runtime_env:
        env_vars:
          # Share your Hugging Face token with the vLLM engine.
          HF_TOKEN: <YOUR-HUGGINGFACE-TOKEN>
Model loading issues and optimizations
For troubleshooting slow downloads, loading errors, and other related issues, see Model loading: Troubleshooting.
GPU out of memory (OOM) errors
OOM errors are common when deploying large models. Symptoms include CUDA out of memory messages, allocation failures, or Ray Serve replicas stuck in STARTING or UNHEALTHY states. Consider upgrading to GPUs with more memory or GPUs better suited to your workload. See Choose a GPU for LLM serving and Tune parameters for LLMs on Anyscale services for more details.
Several strategies can help resolve OOM errors, though they often involve trade-offs in latency or cost:
GPU selection and VRAM adjustment
- Select a GPU with more memory, but also consider other GPU factors such as memory bandwidth and GPU-to-GPU interconnect.
- Adjust GPU memory utilization: By default, vLLM reserves 90% of VRAM. You can increase this by setting gpu_memory_utilization (for example, to 0.95). Higher values reduce headroom for other GPU operations and can cause system-level OOM errors, so avoid exceeding 0.95. See the example after this list.
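For example, the following sketch requests a higher-memory GPU for each replica and raises the fraction of VRAM that vLLM can use. The accelerator_type value is an assumed example; substitute the GPU type you actually deploy.
applications:
- ...
  args:
    llm_configs:
      ...
      # Request a higher-memory GPU per replica (example value).
      accelerator_type: A100-80G
      engine_kwargs:
        # Let vLLM use up to 95% of VRAM; avoid going higher.
        gpu_memory_utilization: 0.95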
System reserved GPU memory
vLLM can't detect memory that the system reserves for features such as ECC (error-correcting code), driver/firmware overhead, display output, and MIG/vGPU configurations. This can cause vLLM to attempt to allocate more memory than is available, resulting in OOM errors.
To determine how much GPU memory the system reserves, run the following command:
nvidia-smi -i 0 -q -d MEMORY
By default, vLLM reserves 90% of available GPU memory. If system-level reserved memory exceeds 10%, reduce gpu_memory_utilization accordingly. Anyscale recommends leaving at least 10% headroom to accommodate dynamic memory allocation, such as CUDA kernel overhead during inference.
For example, if ECC takes 12% of total GPU memory, set gpu_memory_utilization to 0.78 (100% - 12% ECC - 10% headroom = 78%):
applications:
- ...
  args:
    llm_configs:
      ...
      engine_kwargs:
        gpu_memory_utilization: 0.78
If your model still doesn't fit after adjusting gpu_memory_utilization, consider alternative strategies such as Model parallelism.
Model parallelism
Distribute a large model across multiple GPUs when it doesn't fit on a single device.
- Tensor parallelism: Splits model layers across multiple GPUs on the same node. Set tensor_parallel_size to the number of GPUs per replica.
- Pipeline parallelism: Splits the entire model across GPUs on different nodes. Set pipeline_parallel_size to the number of nodes per replica. See the example after this list.
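For example, the following sketch spreads each replica across multiple GPUs. The parallelism degrees are illustrative; choose values that match your model size and node shape.
applications:
- ...
  args:
    llm_configs:
      ...
      engine_kwargs:
        # Split each layer across 4 GPUs on the same node.
        tensor_parallel_size: 4
        # Split the model into 2 pipeline stages across nodes.
        pipeline_parallel_size: 2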
Reduce model memory footprint
Decrease the model's size before vLLM loads it onto the GPU.
- Apply quantization: Reduce model weight precision (for example, from FP16 to FP8 or AWQ) to significantly cut memory usage. Set quantization in your vLLM engine. See the example after this list.
- Use a smaller model: Consider smaller model variants (for example, 7B instead of 70B) if performance is acceptable.
- Fine-tune LLMs with knowledge distillation: Train a smaller or more efficient model to mimic a larger model's behavior, reducing memory requirements while retaining much of the original performance.
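For example, a minimal sketch that serves an AWQ-quantized checkpoint. The placeholder model ID and the choice of AWQ are assumptions; vLLM supports several quantization methods, and the checkpoint must already be quantized.
applications:
- ...
  args:
    llm_configs:
      ...
      model_loading_config:
        model_id: <YOUR-AWQ-QUANTIZED-MODEL>
      engine_kwargs:
        # Tell vLLM to load AWQ-quantized weights.
        quantization: awq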
Reduce vLLM pre-allocated memory
The KV cache stores attention keys and values for generated tokens and is a primary consumer of GPU memory.
- Reduce max_model_len: Reduce memory usage by limiting the model's context length.
- Set enforce_eager: true: Setting this in engine_kwargs turns CUDA Graphs off. You regain some GPU memory, but you typically get higher per-token latency and lower throughput, especially at small batch sizes. You can also adjust compilation_config to achieve a better balance between inference speed and memory usage. See the example after this list.
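For example, a sketch that caps the context length and turns off CUDA Graphs. The 8192-token limit is an assumed value, not a recommendation.
applications:
- ...
  args:
    llm_configs:
      ...
      engine_kwargs:
        # Cap context length to shrink the KV cache (assumed value).
        max_model_len: 8192
        # Turn off CUDA Graphs to reclaim GPU memory at the cost of latency.
        enforce_eager: true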
Configure multi-modal models
For multi-modal models, inputs such as images and videos consume significant memory. Control this behavior with the following:
- limit_mm_per_prompt: Set limits per modality (for example, {"image": 2, "video": 0}) to cap memory allocation. Setting a modality to 0 disables it.
- disable_mm_preprocessor_cache: Set to True to avoid caching preprocessed multi-modal inputs if they aren't frequently reused. See the example after this list.
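For example, a sketch for a vision-language deployment. It assumes the model accepts image and video inputs; check which modalities your model supports.
applications:
- ...
  args:
    llm_configs:
      ...
      engine_kwargs:
        # Allow at most 2 images per prompt and disable video inputs.
        limit_mm_per_prompt:
          image: 2
          video: 0
        # Don't cache preprocessed multi-modal inputs.
        disable_mm_preprocessor_cache: true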
GPU compatibility errors
Some GPUs might not support all features required by modern LLMs, which can result in errors during model loading or inference. For example, T4 GPUs support neither the bfloat16 data type, which many models use by default, nor the mxfp4 quantization method that gpt-oss models use.
If you encounter compatibility errors, try the following:
- Switch to a different GPU: Use GPUs that support the required features, such as A10G, L4, or A100 for bfloat16 data types.
- Use supported data types: If you must use a GPU with limited feature support, configure your model accordingly. For example, use float16 instead of bfloat16 by setting dtype in your vLLM engine configuration. See the example after this list.
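For example, a minimal sketch that overrides the data type so a model that defaults to bfloat16 can run on a T4. The accelerator_type value is an assumed example.
applications:
- ...
  args:
    llm_configs:
      ...
      # Request T4 GPUs per replica (example value).
      accelerator_type: T4
      engine_kwargs:
        # Use float16 instead of the model's default bfloat16.
        dtype: float16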
For guidance on GPU selection and capabilities, see Choose a GPU for LLM serving.
Disk out of space errors
If you need more storage capacity for model weights, logs, or datasets, you can increase the default disk size of your instance. See Change the default disk size for GCP or Change the default disk size for AWS.
For example, you can set the default disk size of your worker nodes in the Anyscale service config file:
# service.yaml
...
compute_config:
...
# Change default disk size to 1000GB.
advanced_instance_config:
## AWS example ##
BlockDeviceMappings:
- Ebs:
VolumeSize: 1000
VolumeType: gp3
DeleteOnTermination: true
DeviceName: "/dev/sda1"
#########
## GCP example ##
#instanceProperties:
# disks:
# - boot: true
# auto_delete: true
# initialize_params:
# disk_size_gb: 1000
#########