
Choose a GPU for LLM serving

This page helps you select and configure GPUs for LLM serving on Anyscale. It includes instructions for calculating memory requirements, avoiding out-of-memory (OOM) errors, and implementing parallelism strategies for optimal performance.

GPU memory allocation components

GPU memory in LLM deployments has three categories:

| Component | Type | Description | Scaling factors |
|---|---|---|---|
| Model weights | Static | Model parameters, present at all times | Model size, precision |
| KV cache and activations | Dynamic | Token representations and temporary buffers | Batch size, context length |
| Framework overhead | Static | Runtime requirements before tensor allocation | Framework, model architecture |

How does vLLM allocate GPU resources?

vLLM follows a specific sequence when it allocates GPU resources. When you understand this process, you can better estimate memory requirements, prevent OOM errors, and optimize performance:

  1. Measure usable VRAM: vLLM detects total GPU memory and applies a safety factor (gpu_memory_utilization, typically 0.80 to 0.95)
  2. Load model weights: vLLM copies model parameters to VRAM (this is the largest allocation and scales with model size and precision)
  3. Account for framework overhead: vLLM reserves a buffer for CUDA drivers, PyTorch kernels, and logging
  4. Estimate peak activations: vLLM predicts maximum temporary tensor usage during the forward pass
  5. Reserve KV cache: vLLM allocates remaining memory for key-value tensors that store past tokens
  6. Derive concurrency limits: vLLM calculates maximum simultaneous requests based on KV cache budget and sequence length
note

The max_num_seqs parameter sets an upper limit for scheduled sequences but doesn't pre-allocate KV cache blocks. vLLM dynamically determines the actual concurrency based on available memory.
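
The following sketch mirrors this sequence as plain arithmetic. The function and argument names are illustrative, not vLLM internals; it only shows how the pieces combine into a KV cache budget and a concurrency estimate.

```python
# Illustrative sketch of the allocation sequence above; names are not vLLM internals.
def estimate_kv_budget_gib(
    total_vram_gib: float,
    gpu_memory_utilization: float,   # step 1: safety factor, typically 0.80-0.95
    weights_gib: float,              # step 2: model parameters at the chosen precision
    overhead_gib: float,             # step 3: CUDA context, kernels, logging buffers
    peak_activations_gib: float,     # step 4: temporary tensors in the forward pass
) -> float:
    """Return the VRAM left over for the KV cache (step 5)."""
    usable = total_vram_gib * gpu_memory_utilization
    return usable - weights_gib - overhead_gib - peak_activations_gib

def max_concurrency(kv_budget_gib: float, kv_gib_per_request: float) -> float:
    """Step 6: how many full-length requests fit in the KV cache budget."""
    return kv_budget_gib / kv_gib_per_request
```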

Example: Llama 3 8B memory allocation

The following table shows GPU memory allocation for meta-llama/Meta-Llama-3-8B-Instruct with max_model_len=8192:

| Allocation step (in order executed) | Memory used (GiB) | Details |
|---|---|---|
| Measure usable VRAM (safety factor) | 19.78 | 90% of 21.98 GiB |
| Load model weights | 14.96 | Model weights take 14.96 GiB |
| Framework/runtime overhead | 0.06 | non_torch_memory takes 0.06 GiB (CUDA driver, kernels, logs) |
| Peak activations buffer | 1.23 | PyTorch activation peak memory takes 1.23 GiB |
| KV/attention cache | 3.53 | The rest of the memory reserved for the KV cache is 3.53 GiB |
| Derive concurrency limits | — | Maximum concurrency for 8192 tokens per request: 3.53x |
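
The arithmetic below re-derives these numbers, including the 3.53x concurrency figure. The per-token KV size assumes Llama 3 8B's published architecture (32 layers, 8 KV heads, head dimension 128) with a BF16 KV cache; treat it as a back-of-the-envelope check rather than vLLM output.

```python
# Back-of-the-envelope check of the table above (Llama 3 8B, BF16 KV cache).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2

usable = 21.98 * 0.90                        # 19.78 GiB after the 90% safety factor
kv_budget = usable - 14.96 - 0.06 - 1.23     # 3.53 GiB left for the KV cache

kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V: 128 KiB per token
kv_per_request = kv_per_token * 8192 / 2**30                    # 1.0 GiB at max_model_len=8192

print(round(kv_budget / kv_per_request, 2))  # ~3.53 full-length requests
```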


tip

Memory optimization strategies:

  • Reduce context length: When you halve max_model_len from 8192 to 4096, you double potential concurrency
  • Add GPU capacity: You can use higher-memory GPUs or implement tensor parallelism
  • Unit conversion: 1 GiB is approximately 1.073 GB (A10G's 24 GB appears as 21.98 GiB after ECC overhead)

Supported GPU types

Anyscale supports GPUs from multiple vendors:

| Vendor | GPU models |
|---|---|
| Nvidia | V100, P100, T4, P4, K80, A10G, L4, L40S, A100 (40GB/80GB), H100, H200, H20 |
| Intel | GPU Max 1550, GPU Max 1100, Gaudi |
| AMD | Instinct MI100, MI210, MI250X, MI300X-OAM, Radeon R9/HD series |
| Specialized | AWS Neuron Core, Google TPU (v2-v6e), Huawei Ascend 910B |
note

Not all GPUs are available in all cloud provider regions. Some GPU types require custom Anyscale cloud deployments.

For the complete list of supported accelerators, see the Ray Serve LLM Config documentation.

GPU specifications comparison

The following table compares commonly used GPUs for LLM serving. The specifications represent typical values and might vary by system configuration.

note

Cloud instance offerings change frequently. Check your cloud provider's documentation for current availability.

| GPU | Architecture | Memory (GB) | CUDA cores | Bandwidth (TB/s) | Interconnect | AWS instance (example) | GCP instance (example) |
|---|---|---|---|---|---|---|---|
| Nvidia T4 | Turing | 16 | 2,560 | 0.32 | PCIe 3 | g4dn.xlarge (1 × T4) | n1-standard-4 + 1 × T4 |
| Nvidia L4 | Ada | 24 | 7,424 | 0.30 | PCIe 4 | g6.xlarge (1 × L4) | g2-standard-4 (1 × L4) |
| Nvidia L40S | Ada | 48 | 18,176 | 0.86 | PCIe 4 | g6e.2xlarge (1 × L40S) | — |
| Nvidia A10G | Ampere | 24 | 9,216 | 0.60 | PCIe 4 | g5.xlarge (1 × A10G) | — |
| Nvidia A100-40G | Ampere | 40 | 6,912 | 1.60 | NVLink 3 | p4d.24xlarge (8 × A100-40G) | a2-highgpu-1g (1 × A100-40G) |
| Nvidia A100-80G | Ampere | 80 | 6,912 | 2.0 | NVLink 3 | p4de.24xlarge (8 × A100-80G) | a2-ultragpu-1g (1 × A100-80G) |
| Nvidia H100 | Hopper | 80 | 14,592 | 3.35 | NVLink 4 | p5.48xlarge (8 × H100) | a3-highgpu-8g (8 × H100) |
| Nvidia H200 | Hopper | 141 (HBM3e) | 16,896 | 4.8 | NVLink 4 | p5e.24xlarge (8 × H200) | a3-ultragpu-8g (8 × H200) |
| Nvidia B200 | Blackwell | 180 | 16,896 | 8.0 | NVLink 5 | p6-b200.48xlarge (8 × B200) | a4-highgpu-8g (8 × B200) |

NVLink provides high-speed GPU interconnection with significantly better performance than PCIe:

| Version | Bandwidth | Available on |
|---|---|---|
| NVLink 3.0 | Up to 600 GB/s | A100 series |
| NVLink 4.0 | Up to 900 GB/s | H100 series |
| NVLink 5.0 | Up to 1.8 TB/s | B200 series |

Selecting GPUs by model size

| Model size | Recommended GPUs | Configuration |
|---|---|---|
| Small (≤10B) | One to two L4 or A10G | Single GPU or TP=2 |
| Medium (10B-70B) | Two to four A10G/L40S or one to two A100 | Tensor parallelism required |
| Large (70B-500B) | Multiple A100/H100/H200 | Multi-GPU tensor parallelism |
| Extreme (500B+) | Multi-node H100/H200/B200 | TP + pipeline parallelism |

Selection criteria

| Criterion | Why it matters | Recommendation |
|---|---|---|
| Memory capacity | Must fit model weights, KV cache, and overhead | Calculate requirements before selection |
| Memory bandwidth | Determines token generation speed | Higher bandwidth for latency-sensitive apps |
| Interconnect | Affects multi-GPU scaling | NVLink for best performance |
| Quantization support | Reduces memory requirements | Check GPU compatibility with target precision |
tip

Quantization can reduce memory requirements by 50% or more. See the vLLM quantization hardware support guide for GPU compatibility.
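
As a rough illustration of those savings, weight memory scales linearly with bytes per parameter (the KV cache and activations are additional and can also be quantized in some setups):

```python
# Approximate weight-only memory for a 70B-parameter model at different precisions.
params = 70e9
for precision, bytes_per_param in [("BF16", 2), ("FP8/INT8", 1), ("INT4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB")
# BF16: ~140 GB, FP8/INT8: ~70 GB, INT4: ~35 GB
```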

Parallelism strategies for multi-GPU deployments

You can use parallelism to serve models that exceed single-GPU capacity. Understanding these strategies helps you balance performance and cost when you select GPU configurations.

Tensor parallelism (TP)

Tensor parallelism splits model layers horizontally across GPUs within a single node.

| Aspect | Details |
|---|---|
| Configuration | Set tensor_parallel_size to the number of GPUs (use powers of 2) |
| Best for | Models that fit within single-node memory when split |
| Requirements | High-speed interconnect (NVLink preferred) |
| Typical values | 2, 4, 8 GPUs |
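
For example, with vLLM's Python API a tensor-parallel engine looks roughly like the following. The engine arguments shown (tensor_parallel_size, max_model_len, gpu_memory_utilization) are standard vLLM options, but confirm them against your vLLM version; when you serve through Ray Serve LLM, the same keys go into the config's engine arguments.

```python
from vllm import LLM

# Split an ~16 GB BF16 model across two 24 GB GPUs on one node.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,        # power of 2; all GPUs on the same node
    max_model_len=8192,
    gpu_memory_utilization=0.90,
)
```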

Pipeline parallelism (PP)

Pipeline parallelism splits model layers vertically across multiple nodes.

| Aspect | Details |
|---|---|
| Configuration | Set pipeline_parallel_size to the number of nodes |
| Best for | Models exceeding single-node capacity |
| Trade-offs | Higher latency due to inter-node communication |
| Use when | Model requires more than 8 GPUs |

Multi-node configuration

# Configuration for multi-node deployments
tensor_parallel_size = 8 # GPUs per node
pipeline_parallel_size = 2 # Number of nodes
total_gpus = tensor_parallel_size * pipeline_parallel_size # 16 GPUs total

Context window and memory requirements

The context window (maximum model length) determines how many tokens a model can process in a single pass. This parameter directly impacts memory usage through the KV cache.

warning

The model truncates tokens beyond the context limit, which causes it to "forget" earlier content. Choose your context length based on your actual use case requirements.
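
Per-token KV cache size is fixed by the model architecture, so total KV memory grows linearly with both context length and the number of concurrent sequences. The following sketch states that relationship; the parameter values in the example are for a Llama 3.1 8B-class model and are assumptions for illustration.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 context_len: int, num_seqs: int, dtype_bytes: int = 2) -> float:
    """Approximate KV cache size: 2 (K and V) x layers x KV heads x head_dim per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len * num_seqs / 2**30

# One sequence at 8k vs. 128k context (32 layers, 8 KV heads, head_dim 128, BF16 cache):
print(kv_cache_gib(32, 8, 128, 8_192, 1))     # ~1.0 GiB
print(kv_cache_gib(32, 8, 128, 131_072, 1))   # ~16.0 GiB
```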

Context window (max model length) by model family

| Model series | Max context window |
|---|---|
| Llama 2 (7B, 13B, 70B) | 4k tokens |
| Llama 3 (8B, 70B) | 8k tokens |
| Llama 3.1 (8B, 70B, 405B) | 128k tokens |
| Llama 3.2 (1B, 3B, 11B, 90B) | 128k tokens |
| Llama 4 Scout (109B) | 10M tokens |
| Llama 4 Maverick (400B) | 1M tokens |
| Mistral 7B v0.1 | 8k tokens |
| Mistral 7B v0.2/v0.3 | 32k tokens |
| Mixtral 8x7B | 32k tokens |
| Mixtral 8x22B | 64k tokens |
| Mistral Small 3 | 32k tokens |
| Qwen 3 (0.6B/1.7B/4B) | 32k tokens |
| Qwen 3 (8B/14B/32B/235B) | 128k tokens |
| Qwen 2.5 | 32k tokens |
| Qwen 1M | 1M tokens |
| Gemma/Gemma 2/CodeGemma | 8k tokens |
| Gemma 3 | 128k tokens |

Context length selection guide

| Use case | Recommended length | Example applications |
|---|---|---|
| Short tasks | 4k-8k tokens | Q&A, simple chat, code completion |
| Document processing | 32k-128k tokens | Analysis, summarization, reports |
| Multi-step agents | 128k+ tokens | Complex reasoning, tool use |
| Ultra-long context | 1M+ tokens | Book analysis, codebase understanding |
tip

Configure context length based on your actual usage patterns, not the maximum model capacity. This approach optimizes memory usage and reduces costs.

Estimating GPU resources

Use the following calculation method to estimate GPU requirements and help prevent OOM errors:

Step 1: Calculate minimum GPUs

min_gpus = model_size_gb / gpu_memory_gb

Step 2: Apply safety factor

tensor_parallel_size = 2 * min_gpus  # Round to nearest power of 2
note

Why use a 2x safety factor?

  • It provides a margin for memory overhead
  • It prevents OOM errors under peak load
  • It provides headroom for batch processing
  • It accommodates KV cache growth
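
The following sketch reproduces steps 1 and 2 and the arithmetic in the examples later on this page; the rounding choice (round the doubled count to a power of two) mirrors those examples and is an assumption, not a hard rule.

```python
import math

def estimate_gpu_count(model_size_gb: float, gpu_memory_gb: float) -> int:
    """Steps 1-2: minimum GPU count, then a 2x safety factor, rounded to a power of two."""
    min_gpus = math.ceil(model_size_gb / gpu_memory_gb)   # step 1
    target = 2 * min_gpus                                 # step 2: 2x safety factor
    return 2 ** round(math.log2(target))                  # nearest power of two

print(estimate_gpu_count(16, 24))    # Llama 3.1 8B (BF16) on 24 GB A10G  -> 2
print(estimate_gpu_count(140, 80))   # Llama 3.1 70B (BF16) on 80 GB H100 -> 4
print(estimate_gpu_count(720, 80))   # DeepSeek R1 (FP8) on 80 GB H100    -> 16
```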

Step 3: Adjust for workload requirements

The 2x safety factor provides a baseline. Consider these additional factors:

| Requirement | Action | Reference |
|---|---|---|
| Long context windows | Increase GPU memory or count | Tune parameters for LLMs on Anyscale services |
| High concurrency | Add more replicas | Ray Serve Autoscaling |
| Dynamic scaling | Configure autoscaling parameters | Tune parameters for LLMs on Anyscale services |

Examples

Example 1: Llama-3.1-8B-Instruct (BF16)

  • Model size: 16 GB
  • GPU: A10G (24 GB), g5.12xlarge instance with 4× A10G
  • Calculation: 16 GB ÷ 24 GB ≈ 0.67 → round up to 1 → × 2 = 2 A10G GPUs
  • Configuration:
    • Set tensor_parallel_size = 2
    • On a g5.12xlarge instance (4× A10G), use TP=2 and configure Ray Serve with 2 replicas
  • Notes:
    • TP=2 with 2 replicas outperforms TP=4 on A10G because these GPUs lack NVLink. When you set TP=2 instead of TP=4, you reduce communication overhead per token and scale more efficiently for many independent requests.
    • To minimize latency, consider using TP=1 with an L40S (48 GB memory).
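
A minimal Ray Serve LLM sketch of this layout (TP=2, two fixed replicas) follows. The field names match the ray.serve.llm API in recent Ray releases but may change; treat them as assumptions and check the Ray Serve LLM config documentation for your version.

```python
# Sketch only: Example 1 as a Ray Serve LLM config (TP=2, two replicas).
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config={"model_id": "meta-llama/Llama-3.1-8B-Instruct"},
    accelerator_type="A10G",
    engine_kwargs={"tensor_parallel_size": 2, "max_model_len": 8192},
    deployment_config={"autoscaling_config": {"min_replicas": 2, "max_replicas": 2}},
)
serve.run(build_openai_app({"llm_configs": [llm_config]}))
```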

Example 2: Llama-3.1-70B-Instruct (BF16)

  • Model size: 140 GB
  • GPU: H100 (80 GB), p5.48xlarge instance with 8× H100
  • Calculation: 140 GB ÷ 80 GB = 1.75 → × 2 ≈ 4 H100 GPUs
  • Configuration:
    • Set tensor_parallel_size = 4 (when max_model_len is between 2k and 32k tokens).
    • Small-batch, low-latency use cases: Use TP=4 with two replicas in Ray Serve for minimal latency.
    • Large-batch, long-context, throughput-optimized: Use TP=8 with a single replica to support the full 128k context length. H100 NVLink helps mitigate the slight extra inter-GPU communication latency.
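
The rough per-GPU arithmetic behind this trade-off, assuming the 90% utilization factor and ignoring activation and overhead buffers for simplicity:

```python
# Per-GPU weight share and total KV-cache headroom on 80 GB H100s (illustrative).
weights_gb, gpu_gb, util = 140, 80, 0.90
for tp in (4, 8):
    kv_headroom = tp * (gpu_gb * util - weights_gb / tp)
    print(f"TP={tp}: {weights_gb / tp:.1f} GB weights per GPU, "
          f"~{kv_headroom:.0f} GB total KV headroom")
# TP=4: 35.0 GB weights per GPU, ~148 GB total KV headroom
# TP=8: 17.5 GB weights per GPU, ~436 GB total KV headroom
```

The roughly three times larger KV budget at TP=8 is what makes the full 128k context and large batches practical.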

Example 3: DeepSeek R1-670B (FP8)

  • Model size: 720 GB
  • GPU: H100 (80 GB), p5.48xlarge instance with 8× H100
  • Calculation: 720 GB ÷ 80 GB = 9 → × 2 = 18 GPUs required
  • Configuration:
    • Set tensor_parallel_size = 8
    • Set pipeline_parallel_size = 2
    • Total GPUs: 16 (close to the theoretical 18)
    • Deploy with TP=8 and PP=2
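
Expressed as vLLM engine arguments (the same keys you would pass through a Ray Serve LLM config's engine arguments), the deployment looks roughly like the following sketch; confirm the multi-node settings for your environment.

```python
# Sketch: DeepSeek R1 (FP8) across 2 nodes x 8 H100s = 16 GPUs.
engine_kwargs = {
    "tensor_parallel_size": 8,              # shard each layer across the 8 GPUs in a node
    "pipeline_parallel_size": 2,            # split the layer stack across 2 nodes
    "distributed_executor_backend": "ray",  # multi-node execution uses the Ray backend
}
# Weights per GPU: 720 GB / 16 GPUs = 45 GB, which leaves roughly 27 GB per H100
# (at 90% utilization) for KV cache, activations, and framework overhead.
```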