Troubleshoot LLM serving
This guide helps you diagnose and resolve common issues when deploying large language models with Ray Serve on Anyscale. Each section provides quick fixes and explains the underlying causes to help you troubleshoot efficiently.
Hugging Face authentication errors
Some models, such as the LLaMA family, are gated on Hugging Face. To use them, navigate to the model's card and follow the prompt to request access. Approval can take anywhere from a few hours to several weeks.
Once access is granted, share your Hugging Face token with your Ray cluster. You can set it in your runtime environment with HF_TOKEN or pass it to your vLLM engine with hf-token.
applications:
- ...
  args:
    llm_configs:
      - ...
        runtime_env:
          env_vars:
            # Share your Hugging Face token to the vLLM engine.
            HF_TOKEN: <YOUR-HUGGINGFACE-TOKEN>
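Alternatively, you can pass the token directly to the vLLM engine. The following is a minimal sketch, assuming your vLLM version accepts hf_token (the underscore form of hf-token) as an engine argument that you can set through engine_kwargs:
llm_configs:
  - ...
    engine_kwargs:
      # Pass the token to the vLLM engine instead of the runtime environment.
      hf_token: <YOUR-HUGGINGFACE-TOKEN>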
GPU out of memory (OOM) errors
OOM errors are common when deploying large models. Symptoms include CUDA out of memory messages, allocation failures, or Ray Serve replicas getting stuck in STARTING or UNHEALTHY states. Consider upgrading to GPUs with more memory or GPUs better suited to your workload. See Choose a GPU for LLM serving and Tune parameters for LLMs on Anyscale services for more details.
Several strategies can help resolve OOM errors, though they often involve trade-offs in latency or cost:
GPU selection and VRAM adjustment
- Select a GPU with more memory, but also consider other GPU factors such as memory bandwidth and the interconnect between GPUs.
- Adjust GPU memory utilization: By default, vLLM reserves 90% of VRAM. You can increase this by setting gpu_memory_utilization (for example, to 0.95), as shown in the sketch after this list. Higher values reduce headroom for other GPU operations and can cause system-level OOM errors. Avoid exceeding 0.95.
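For example, the following is a minimal sketch of an llm_configs entry that raises the memory fraction through engine_kwargs (the nesting follows the config pattern shown earlier, and the value is illustrative):
llm_configs:
  - ...
    engine_kwargs:
      # Let vLLM reserve up to 95% of VRAM; keep some headroom to avoid system-level OOM.
      gpu_memory_utilization: 0.95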
Model parallelism
Distribute a large model across multiple GPUs when it doesn't fit on a single device.
- Tensor parallelism: Splits model layers across multiple GPUs on the same node. Set tensor_parallel_size to the number of GPUs per replica.
- Pipeline parallelism: Splits the entire model across GPUs on different nodes. Set pipeline_parallel_size to the number of nodes per replica. See the sketch after this list.
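The following sketch shows both settings in engine_kwargs, assuming a replica that spans 2 nodes with 4 GPUs each (adjust the numbers to your hardware):
llm_configs:
  - ...
    engine_kwargs:
      # Split model layers across 4 GPUs on the same node.
      tensor_parallel_size: 4
      # Split the model across 2 nodes, for 8 GPUs per replica in total.
      pipeline_parallel_size: 2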
Reduce model memory footprint
Decrease the model's size before vLLM loads it onto the GPU.
- Apply quantization: Reduce model weight precision (for example, from FP16 to FP8 or AWQ) to significantly cut memory usage. Set quantization in your vLLM engine, as shown in the example after this list.
- Use a smaller model: Consider smaller model variants (for example, 7B instead of 70B) if their output quality is acceptable.
- Fine-tune LLMs with knowledge distillation: Train a smaller or more efficient model to mimic a larger model's behavior, reducing memory requirements while retaining much of the original performance.
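For example, the following sketch enables quantization through engine_kwargs; the fp8 value is illustrative and assumes a GPU and model checkpoint that support it:
llm_configs:
  - ...
    engine_kwargs:
      # Load weights in FP8, roughly halving weight memory compared to FP16.
      quantization: fp8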
Reduce vLLM pre-allocated memory
The KV cache stores attention keys and values for generated tokens and is a primary consumer of GPU memory.
- Reduce max_model_len: Limit the model's context length to shrink the KV cache and reduce memory usage.
- Set enforce_eager: true: Setting this in engine_kwargs turns off CUDA Graphs. You regain some GPU memory, but you typically get higher per-token latency and lower throughput, especially at small batch sizes. You can also adjust compilation_config to strike a better balance between inference speed and memory usage. See the sketch after this list.
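The following is a minimal sketch of both settings in engine_kwargs (the context length is illustrative):
llm_configs:
  - ...
    engine_kwargs:
      # Cap the context length to shrink the KV cache.
      max_model_len: 8192
      # Turn off CUDA Graphs to reclaim GPU memory at some cost in latency and throughput.
      enforce_eager: true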
Configure multi-modal models
For multi-modal models, inputs such as images and videos consume significant memory. Control this behavior with the following:
- limit_mm_per_prompt: Set limits per modality (for example, {images: 2, videos: 0}) to cap memory allocation. Setting a modality to 0 disables it.
- disable_mm_preprocessor_cache: Set to true to avoid caching preprocessed multi-modal inputs if they aren't frequently reused.
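For example, the following sketch sets both options in engine_kwargs, reusing the modality limits from the example above (check that the modality key names match what your model and vLLM version expect):
llm_configs:
  - ...
    engine_kwargs:
      # Allow at most 2 images per prompt and disable video inputs.
      limit_mm_per_prompt: {images: 2, videos: 0}
      # Don't cache preprocessed multi-modal inputs.
      disable_mm_preprocessor_cache: true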
Disk out of space errors
If you need more storage capacity for model weights, logs, or datasets, you can increase the default disk size of your instance. See Change the default disk size.
For example, you can set the default disk size of your worker nodes in the Anyscale service config file:
# service.yaml
...
compute_config:
  ...
  # Change default disk size to 1000GB.
  advanced_instance_config:
    ## AWS example ##
    BlockDeviceMappings:
      - Ebs:
          VolumeSize: 1000
          VolumeType: gp3
          DeleteOnTermination: true
        DeviceName: "/dev/sda1"
    #########
    ## GCP example ##
    #instanceProperties:
    #  disks:
    #    - boot: true
    #      auto_delete: true
    #      initialize_params:
    #        disk_size_gb: 1000
    #########